
[1/2] blk-mq: Export iterating all tagged requests

Message ID 20181130202635.11145-1-keith.busch@intel.com (mailing list archive)
State New, archived
Series [1/2] blk-mq: Export iterating all tagged requests

Commit Message

Keith Busch Nov. 30, 2018, 8:26 p.m. UTC
A driver may wish to iterate every tagged request, not just ones that
satisfy blk_mq_request_started(). The intended use is so a driver may
terminate entered requests on quiesced queues.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 block/blk-mq-tag.c     | 41 +++++++++++++++++++++++++++++++++++------
 block/blk-mq-tag.h     |  2 --
 include/linux/blk-mq.h |  4 ++++
 3 files changed, 39 insertions(+), 8 deletions(-)
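
For illustration, a minimal sketch of the intended use, assuming a
hypothetical driver context "struct my_ctrl" that owns its request queue
and tag set (all names below are illustrative, not part of the series;
a real driver would likely terminate requests through its own
cancel/complete path rather than a bare blk_mq_end_request()):

	/* Hypothetical callback: terminate one tagged request. */
	static bool my_terminate_rq(struct request *rq, void *data, bool reserved)
	{
		blk_mq_end_request(rq, BLK_STS_IOERR);
		return true;	/* keep iterating */
	}

	static void my_terminate_all(struct my_ctrl *ctrl)
	{
		/* Stop ->queue_rq() first, then reap everything that has entered. */
		blk_mq_quiesce_queue(ctrl->q);
		blk_mq_tagset_all_iter(&ctrl->tag_set, my_terminate_rq, NULL);
	}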

Comments

Jens Axboe Nov. 30, 2018, 8:36 p.m. UTC | #1
On 11/30/18 1:26 PM, Keith Busch wrote:
> A driver may wish to iterate every tagged request, not just ones that
> satisfy blk_mq_request_started(). The intended use is so a driver may
> terminate entered requests on quiesced queues.

How about we just move the started check into the handler passed in for
those that care about it? Much saner to make the interface iterate
everything, and leave whatever state check to the callback.
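
What is being suggested here, roughly - a sketch with a hypothetical
callback name, where the iterator visits every tagged request and the
callback filters on request state itself:

	static bool my_iter_cb(struct request *rq, void *data, bool reserved)
	{
		if (!blk_mq_request_started(rq))
			return true;	/* not started: skip, keep iterating */
		/* ... existing per-started-request handling ... */
		return true;
	}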
Keith Busch Nov. 30, 2018, 8:39 p.m. UTC | #2
On Fri, Nov 30, 2018 at 01:36:09PM -0700, Jens Axboe wrote:
> On 11/30/18 1:26 PM, Keith Busch wrote:
> > A driver may wish to iterate every tagged request, not just ones that
> > satisfy blk_mq_request_started(). The intended use is so a driver may
> > terminate entered requests on quiesced queues.
> 
> How about we just move the started check into the handler passed in for
> those that care about it? Much saner to make the interface iterate
> everything, and leave whatever state check to the callback.

I agree that's better. I initially didn't want to examine all the users,
but it looks like only 4 drivers use the current tagset iterator, so
that's not too daunting. I'll give it a look.
Christoph Hellwig Dec. 1, 2018, 4:48 p.m. UTC | #3
On Fri, Nov 30, 2018 at 01:36:09PM -0700, Jens Axboe wrote:
> On 11/30/18 1:26 PM, Keith Busch wrote:
> > A driver may wish to iterate every tagged request, not just ones that
> > satisfy blk_mq_request_started(). The intended use is so a driver may
> > terminate entered requests on quiesced queues.
> 
> How about we just move the started check into the handler passed in for
> those that care about it? Much saner to make the interface iterate
> everything, and leave whatever state check to the callback.

So we used to do that, and I changed it back in May to test for
MQ_RQ_IN_FLIGHT, and then Ming changed it to check
blk_mq_request_started.  So this is clearly a minefield of sorts..

Note that at least mtip32xx, nbd, skd and the various nvme transports
want to use the function to terminate all requests in the error
path, and it would be great to have one single understood, documented
and debugged helper for that in the core, so this is a vote for moving
more of the logic in your second helper into the core code.  skd
will actually need to use ->complete to release resources for that, though,
and mtip plays some odd abort bits.  If it weren't for the interesting
abort behavior in nvme-fc, we could even unexport the
low-level interface.
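
The shape such a core helper might take - purely hypothetical, neither
the name nor the default use of blk_mq_end_request() is part of this
series:

	/* Hypothetical: fail every started request in a tag set with @status. */
	static bool blk_mq_fail_rq(struct request *rq, void *data, bool reserved)
	{
		blk_mq_end_request(rq, *(blk_status_t *)data);
		return true;
	}

	void blk_mq_terminate_all(struct blk_mq_tag_set *set, blk_status_t status)
	{
		blk_mq_tagset_busy_iter(set, blk_mq_fail_rq, &status);
	}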
Hannes Reinecke Dec. 1, 2018, 5:11 p.m. UTC | #4
On 12/1/18 5:48 PM, Christoph Hellwig wrote:
> On Fri, Nov 30, 2018 at 01:36:09PM -0700, Jens Axboe wrote:
>> On 11/30/18 1:26 PM, Keith Busch wrote:
>>> A driver may wish to iterate every tagged request, not just ones that
>>> satisfy blk_mq_request_started(). The intended use is so a driver may
>>> terminate entered requests on quiesced queues.
>>
>> How about we just move the started check into the handler passed in for
>> those that care about it? Much saner to make the interface iterate
>> everything, and leave whatever state check to the callback.
> 
> So we used to do that, and I changed it back in May to test for
> MQ_RQ_IN_FLIGHT, and then Ming changed it to check
> blk_mq_request_started.  So this is clearly a minefield of sorts..
> 
> Note that at least mtip32xx, nbd, skd and the various nvme transports
> want to use the function to terminate all requests in the error
> path, and it would be great to have one single understood, documented
> and debugged helper for that in the core, so this is a vote for moving
> more of the logic in your second helper into the core code.  skd
> will need actually use ->complete to release resources for that, though
> and mtip plays some odd abort bits.  If it weren't for the interesting
> abort behavior in nvme-fc that means we could even unexport the
> low-level interface.
> 
Yes, I'm very much in favour of this, too.
We always have this IMO slightly weird notion of stopping the queue,
setting some error flags in the driver, then _restarting_ the queue, just
so that the driver then sees the error flag and terminates the requests,
which I always found quite counter-intuitive.
So having a common helper for terminating requests on queue errors
would be very welcome here.

But when we have that we really should audit all drivers to ensure they
do the right thing (tm).

Cheers,

Hannes
Bart Van Assche Dec. 1, 2018, 6:32 p.m. UTC | #5
On 12/1/18 9:11 AM, Hannes Reinecke wrote:
> On 12/1/18 5:48 PM, Christoph Hellwig wrote:
>> On Fri, Nov 30, 2018 at 01:36:09PM -0700, Jens Axboe wrote:
>>> On 11/30/18 1:26 PM, Keith Busch wrote:
>>>> A driver may wish to iterate every tagged request, not just ones that
>>>> satisfy blk_mq_request_started(). The intended use is so a driver may
>>>> terminate entered requests on quiesced queues.
>>>
>>> How about we just move the started check into the handler passed in for
>>> those that care about it? Much saner to make the interface iterate
>>> everything, and leave whatever state check to the callback.
>>
>> So we used to do that, and I changed it back in May to test for
>> MQ_RQ_IN_FLIGHT, and then Ming changed it to check
>> blk_mq_request_started.  So this is clearly a minefield of sorts..
>>
>> Note that at least mtip32xx, nbd, skd and the various nvme transports
>> want to use the function to terminate all requests in the error
>> path, and it would be great to have one single understood, documented
>> and debugged helper for that in the core, so this is a vote for moving
>> more of the logic in your second helper into the core code.  skd
>> will need actually use ->complete to release resources for that, though
>> and mtip plays some odd abort bits.  If it weren't for the interesting
>> abort behavior in nvme-fc that means we could even unexport the
>> low-level interface.
>>
> Yes, I'm very much in favour of this, too.
> We always have this IMO slightly weird notion of stopping the queue, set 
> some error flags in the driver, then _restarting_ the queue, just so 
> that the driver then sees the error flag and terminates the requests.
> Which I always found quite counter-intuitive.
> So having a common helper for terminating requests for queue errors 
> would be very welcomed here.
> 
> But when we have that we really should audit all drivers to ensure they 
> do the right thin (tm).

Would calling blk_abort_request() for all outstanding requests be
sufficient to avoid having to stop and restart the queue in the
nvme-fc driver?

Bart.
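
If it were sufficient, one way to drive it would be through the existing
busy iterator - a sketch with a hypothetical wrapper callback
(blk_abort_request() is the existing block-layer call):

	static bool my_abort_rq(struct request *rq, void *data, bool reserved)
	{
		blk_abort_request(rq);	/* make the request time out immediately */
		return true;
	}

	/* blk_mq_tagset_busy_iter(&ctrl->tag_set, my_abort_rq, NULL); */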
Ming Lei Dec. 3, 2018, 7:44 a.m. UTC | #6
On Sun, Dec 2, 2018 at 12:48 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Fri, Nov 30, 2018 at 01:36:09PM -0700, Jens Axboe wrote:
> > On 11/30/18 1:26 PM, Keith Busch wrote:
> > > A driver may wish to iterate every tagged request, not just ones that
> > > satisfy blk_mq_request_started(). The intended use is so a driver may
> > > terminate entered requests on quiesced queues.
> >
> > How about we just move the started check into the handler passed in for
> > those that care about it? Much saner to make the interface iterate
> > everything, and leave whatever state check to the callback.
>
> So we used to do that, and I changed it back in May to test for
> MQ_RQ_IN_FLIGHT, and then Ming changed it to check
> blk_mq_request_started.  So this is clearly a minefield of sorts..

The change to blk_mq_request_started() was just for fixing SCSI's boot
hang issue.

>
> Note that at least mtip32xx, nbd, skd and the various nvme transports
> want to use the function to terminate all requests in the error
> path, and it would be great to have one single understood, documented
> and debugged helper for that in the core, so this is a vote for moving
> more of the logic in your second helper into the core code.  skd
> will need actually use ->complete to release resources for that, though
> and mtip plays some odd abort bits.  If it weren't for the interesting
> abort behavior in nvme-fc that means we could even unexport the
> low-level interface.

Agreed, and SCSI's use especially should be understood, given that
SCSI is so widely deployed in production systems.

thanks,
Ming Lei
James Smart Dec. 3, 2018, 6:57 p.m. UTC | #7
On 12/1/2018 10:32 AM, Bart Van Assche wrote:
> On 12/1/18 9:11 AM, Hannes Reinecke wrote:
>>
>> Yes, I'm very much in favour of this, too.
>> We always have this IMO slightly weird notion of stopping the queue, 
>> set some error flags in the driver, then _restarting_ the queue, just 
>> so that the driver then sees the error flag and terminates the requests.
>> Which I always found quite counter-intuitive.
>> So having a common helper for terminating requests for queue errors 
>> would be very welcomed here.
>>
>> But when we have that we really should audit all drivers to ensure 
>> they do the right thin (tm).
>
> Would calling blk_abort_request() for all outstanding requests be 
> sufficient to avoid that the queue has to be stopped and restarted in 
> the nvme-fc driver?

What nvme-fc does is the same as what is done in all the other
transports - for the same reasons.  If we're eliminating those
synchronization reasons, and now that we've plugged the request_queue
path into the transports to check state appropriately, I don't think
there are reasons to block the queue.  In some respects it is nice to
stop new io while the work to terminate everything else happens, but I
don't know that it's required.  I would hope that the bounced work due
to the controller state (returning BLK_STS_RESOURCE) actually pauses
for a short while; I've seen some circumstances where it didn't and was
polling indefinitely, which would be a change in behavior vs. the queue stops.

-- james
Sagi Grimberg Dec. 4, 2018, 1:33 a.m. UTC | #8
>>>> A driver may wish to iterate every tagged request, not just ones that
>>>> satisfy blk_mq_request_started(). The intended use is so a driver may
>>>> terminate entered requests on quiesced queues.
>>>
>>> How about we just move the started check into the handler passed in for
>>> those that care about it? Much saner to make the interface iterate
>>> everything, and leave whatever state check to the callback.
>>
>> So we used to do that, and I changed it back in May to test for
>> MQ_RQ_IN_FLIGHT, and then Ming changed it to check
>> blk_mq_request_started.  So this is clearly a minefield of sorts..
>>
>> Note that at least mtip32xx, nbd, skd and the various nvme transports
>> want to use the function to terminate all requests in the error
>> path, and it would be great to have one single understood, documented
>> and debugged helper for that in the core, so this is a vote for moving
>> more of the logic in your second helper into the core code.  skd
>> will need actually use ->complete to release resources for that, though
>> and mtip plays some odd abort bits.  If it weren't for the interesting
>> abort behavior in nvme-fc that means we could even unexport the
>> low-level interface.
>>
> Yes, I'm very much in favour of this, too.
> We always have this IMO slightly weird notion of stopping the queue, set 
> some error flags in the driver, then _restarting_ the queue, just so 
> that the driver then sees the error flag and terminates the requests.
> Which I always found quite counter-intuitive.

What about requests that come in after the iteration runs? how are those
terminated?
Keith Busch Dec. 4, 2018, 3:46 p.m. UTC | #9
On Mon, Dec 03, 2018 at 05:33:06PM -0800, Sagi Grimberg wrote:
> >>
> > Yes, I'm very much in favour of this, too.
> > We always have this IMO slightly weird notion of stopping the queue, set 
> > some error flags in the driver, then _restarting_ the queue, just so 
> > that the driver then sees the error flag and terminates the requests.
> > Which I always found quite counter-intuitive.
> 
> What about requests that come in after the iteration runs? how are those
> terminated?

If we've reached a dead state, I think you'd want to start a queue freeze
before running the terminating iterator.
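
Roughly this ordering, as a sketch (queue and tag set names are
hypothetical, and my_terminate_rq stands in for a driver-specific
terminating callback):

	blk_freeze_queue_start(ctrl->q);	/* new submitters block in blk_queue_enter() */
	blk_mq_quiesce_queue(ctrl->q);		/* no ->queue_rq() running after this */
	blk_mq_tagset_all_iter(&ctrl->tag_set, my_terminate_rq, NULL);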
James Smart Dec. 4, 2018, 4:26 p.m. UTC | #10
On 12/4/2018 7:46 AM, Keith Busch wrote:
> On Mon, Dec 03, 2018 at 05:33:06PM -0800, Sagi Grimberg wrote:
>>> Yes, I'm very much in favour of this, too.
>>> We always have this IMO slightly weird notion of stopping the queue, set
>>> some error flags in the driver, then _restarting_ the queue, just so
>>> that the driver then sees the error flag and terminates the requests.
>>> Which I always found quite counter-intuitive.
>> What about requests that come in after the iteration runs? how are those
>> terminated?
> If we've reached a dead state, I think you'd want to start a queue freeze
> before running the terminating iterator.

For the requests that come in after the iterator, the nvmf_check_ready() 
routine, which validates controller state, will catch and bounce them.

Keith states why we froze it in the past.  Whatever the iterator is,
I'd prefer we not get abort calls on io that has yet to be successfully 
started.

-- james
Sagi Grimberg Dec. 4, 2018, 5:23 p.m. UTC | #11
>>>> Yes, I'm very much in favour of this, too.
>>>> We always have this IMO slightly weird notion of stopping the queue, 
>>>> set
>>>> some error flags in the driver, then _restarting_ the queue, just so
>>>> that the driver then sees the error flag and terminates the requests.
>>>> Which I always found quite counter-intuitive.
>>> What about requests that come in after the iteration runs? how are those
>>> terminated?
>> If we've reached a dead state, I think you'd want to start a queue freeze
>> before running the terminating iterator.
> 
> For the requests that come in after the iterator, the nvmf_check_ready() 
> routine, which validates controller state, will catch and bounce them.

The point of this patch was to omit the check in queue_rq like Keith
did in patch #2.
Sagi Grimberg Dec. 4, 2018, 5:38 p.m. UTC | #12
>>> Yes, I'm very much in favour of this, too.
>>> We always have this IMO slightly weird notion of stopping the queue, set
>>> some error flags in the driver, then _restarting_ the queue, just so
>>> that the driver then sees the error flag and terminates the requests.
>>> Which I always found quite counter-intuitive.
>>
>> What about requests that come in after the iteration runs? how are those
>> terminated?
> 
> If we've reached a dead state, I think you'd want to start a queue freeze
> before running the terminating iterator.

It's not necessarily dead; in fabrics we need to handle disconnections
that last for a while before we are able to reconnect (for a variety of
reasons), and we need a way to fail I/O for failover (or requeue, or
block - it's up to the upper layer). It's less of a "last resort" action
than in the pci case.

Does this guarantee that after freeze+iter we won't get queued with any
other request? If not, then we still need to unfreeze and fail at
queue_rq.
Keith Busch Dec. 4, 2018, 5:48 p.m. UTC | #13
On Tue, Dec 04, 2018 at 09:38:29AM -0800, Sagi Grimberg wrote:
> 
> > > > Yes, I'm very much in favour of this, too.
> > > > We always have this IMO slightly weird notion of stopping the queue, set
> > > > some error flags in the driver, then _restarting_ the queue, just so
> > > > that the driver then sees the error flag and terminates the requests.
> > > > Which I always found quite counter-intuitive.
> > > 
> > > What about requests that come in after the iteration runs? how are those
> > > terminated?
> > 
> > If we've reached a dead state, I think you'd want to start a queue freeze
> > before running the terminating iterator.
> 
> Its not necessarily dead, in fabrics we need to handle disconnections
> that last for a while before we are able to reconnect (for a variety of
> reasons) and we need a way to fail I/O for failover (or requeue, or
> block its up to the upper layer). Its less of a "last resort" action
> like in the pci case.
> 
> Does this guarantee that after freeze+iter we won't get queued with any
> other request? If not then we still need to unfreeze and fail at
> queue_rq.

It sounds like there are different scenarios to consider.

For the dead controller, we call blk_cleanup_queue() at the end which
ends callers who blocked on entering.

If you're doing a failover, you'd replace the freeze with a current path
update in order to prevent new requests from entering.

In either case, you don't need checks in queue_rq. The queue_rq check
is redundant with the quiesce state that blk-mq already provides.

Once quiesced, the proposed iterator can handle the final termination
of the request, perform failover, or some other lld specific action
depending on your situation.
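
A sketch of the dead-controller ordering described above (names are
hypothetical, error handling omitted; whether the terminating callback
can safely end requests that never reached ->queue_rq() is part of what
this series is trying to settle):

	blk_freeze_queue_start(ns->queue);	/* block new submitters */
	blk_mq_quiesce_queue(ns->queue);	/* stop ->queue_rq() */
	blk_mq_tagset_all_iter(&ctrl->tag_set, my_terminate_rq, NULL);
	blk_cleanup_queue(ns->queue);		/* ends callers blocked on entering */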
James Smart Dec. 4, 2018, 7:13 p.m. UTC | #14
On 12/4/2018 9:23 AM, Sagi Grimberg wrote:
>
>>>>> Yes, I'm very much in favour of this, too.
>>>>> We always have this IMO slightly weird notion of stopping the 
>>>>> queue, set
>>>>> some error flags in the driver, then _restarting_ the queue, just so
>>>>> that the driver then sees the error flag and terminates the requests.
>>>>> Which I always found quite counter-intuitive.
>>>> What about requests that come in after the iteration runs? how are 
>>>> those
>>>> terminated?
>>> If we've reached a dead state, I think you'd want to start a queue 
>>> freeze
>>> before running the terminating iterator.
>>
>> For the requests that come in after the iterator, the 
>> nvmf_check_ready() routine, which validates controller state, will 
>> catch and bounce them.
>
> The point of this patch was to omit the check in queue_rq like Keith
> did in patch #2.

well - I'm not sure that's possible. The fabrics will have different 
time constraints vs pci.

-- james
James Smart Dec. 4, 2018, 7:33 p.m. UTC | #15
On 12/4/2018 9:48 AM, Keith Busch wrote:
> On Tue, Dec 04, 2018 at 09:38:29AM -0800, Sagi Grimberg wrote:
>>>>> Yes, I'm very much in favour of this, too.
>>>>> We always have this IMO slightly weird notion of stopping the queue, set
>>>>> some error flags in the driver, then _restarting_ the queue, just so
>>>>> that the driver then sees the error flag and terminates the requests.
>>>>> Which I always found quite counter-intuitive.
>>>> What about requests that come in after the iteration runs? how are those
>>>> terminated?
>>> If we've reached a dead state, I think you'd want to start a queue freeze
>>> before running the terminating iterator.
>> Its not necessarily dead, in fabrics we need to handle disconnections
>> that last for a while before we are able to reconnect (for a variety of
>> reasons) and we need a way to fail I/O for failover (or requeue, or
>> block its up to the upper layer). Its less of a "last resort" action
>> like in the pci case.
>>
>> Does this guarantee that after freeze+iter we won't get queued with any
>> other request? If not then we still need to unfreeze and fail at
>> queue_rq.
> It sounds like there are different scenarios to consider.
>
> For the dead controller, we call blk_cleanup_queue() at the end which
> ends callers who blocked on entering.
>
> If you're doing a failover, you'd replace the freeze with a current path
> update in order to prevent new requests from entering.
And if you're not multipath?  I assume you want the io queues to be
frozen so requests queue there - which can block threads such as ns
verification. It's good to have them live, as today's checks bounce the
io, letting the thread terminate as it's in a reset/reconnect state,
which allows those threads to exit out or finish before a new reconnect
kicks them off again. We've already been fighting deadlocks with the
reset/delete/rescan paths and these io paths; suspending the queues
completely over the reconnect will likely create more issues in this area.


> In either case, you don't need checks in queue_rq. The queue_rq check
> is redundant with the quiesce state that blk-mq already provides.

I disagree.  The cases I've run into are on the admin queue - where we
are sending io to initialize the controller when another error/reset
occurs, and the checks are required to identify/reject the "old"
initialization commands, with another state check allowing them to
proceed on the "new" initialization commands.  And there are also cases
for ioctls and other things that occur during the middle of those
initialization steps that need to be weeded out.  The admin queue has
to be kept live to allow the initialization commands on the new controller.

State checks are also needed for those namespace validation cases...

>
> Once quiesced, the proposed iterator can handle the final termination
> of the request, perform failover, or some other lld specific action
> depending on your situation.

I don't believe they can remain frozen, definitely not for the admin queue.
-- james
Keith Busch Dec. 4, 2018, 9:21 p.m. UTC | #16
On Tue, Dec 04, 2018 at 11:33:33AM -0800, James Smart wrote:
> 
> 
> On 12/4/2018 9:48 AM, Keith Busch wrote:
> > On Tue, Dec 04, 2018 at 09:38:29AM -0800, Sagi Grimberg wrote:
> > > > > > Yes, I'm very much in favour of this, too.
> > > > > > We always have this IMO slightly weird notion of stopping the queue, set
> > > > > > some error flags in the driver, then _restarting_ the queue, just so
> > > > > > that the driver then sees the error flag and terminates the requests.
> > > > > > Which I always found quite counter-intuitive.
> > > > > What about requests that come in after the iteration runs? how are those
> > > > > terminated?
> > > > If we've reached a dead state, I think you'd want to start a queue freeze
> > > > before running the terminating iterator.
> > > Its not necessarily dead, in fabrics we need to handle disconnections
> > > that last for a while before we are able to reconnect (for a variety of
> > > reasons) and we need a way to fail I/O for failover (or requeue, or
> > > block its up to the upper layer). Its less of a "last resort" action
> > > like in the pci case.
> > > 
> > > Does this guarantee that after freeze+iter we won't get queued with any
> > > other request? If not then we still need to unfreeze and fail at
> > > queue_rq.
> > It sounds like there are different scenarios to consider.
> > 
> > For the dead controller, we call blk_cleanup_queue() at the end which
> > ends callers who blocked on entering.
> > 
> > If you're doing a failover, you'd replace the freeze with a current path
> > update in order to prevent new requests from entering.
> and if you're not multipath ?    I assume you want the io queues to be
> frozen so they queue there - which can block threads such as ns
> verification. It's good to have them live, as todays checks bounce the io,
> letting the thread terminate as its in a reset/reconnect state, which allows
> those threads to exit out or finish before a new reconnect kicks them off
> again. We've already been fighting deadlocks with the reset/delete/rescan
> paths and these io paths. suspending the queues completely over the
> reconnect will likely create more issues in this area.
> 
> 
> > In either case, you don't need checks in queue_rq. The queue_rq check
> > is redundant with the quiesce state that blk-mq already provides.
> 
> I disagree.  The cases I've run into are on the admin queue - where we are
> sending io to initialize the controller when another error/reset occurs, and
> the checks are required to identify/reject the "old" initialization
> commands, with another state check allowing them to proceed on the "new"
> initialization commands.  And there are also cases for ioctls and other
> things that occur during the middle of those initialization steps that need
> to be weeded out.   The Admin queue has to be kept live to allow the
> initialization commands on the new controller.
> 
> state checks are also needed for those namespace validation cases....
> 
> > 
> > Once quiesced, the proposed iterator can handle the final termination
> > of the request, perform failover, or some other lld specific action
> > depending on your situation.
> 
> I don't believe they can remain frozen, definitely not for the admin queue.
> -- james

Quiesced and frozen carry different semantics.

My understanding of the nvme-fc implementation is that it returns
BLK_STS_RESOURCE in the scenario you've described where the admin
command can't be executed at the moment. That just has the block layer
requeue it for later resubmission 3 milliseconds later, which will
continue to return the same status code until you're really ready for
it.

What I'm proposing is that instead of using that return code, you may
have nvme-fc control when to dispatch those queued requests by utilizing
the blk-mq quiesce on/off states. Is there a reason that wouldn't work?
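
In other words, something like this on the transport side (a sketch;
ctrl->admin_q stands in for whichever queues are affected):

	/* Association lost: stop dispatching, requests wait in blk-mq. */
	blk_mq_quiesce_queue(ctrl->admin_q);

	/* ... tear down and re-establish the association ... */

	/* New association ready: let the queued requests dispatch. */
	blk_mq_unquiesce_queue(ctrl->admin_q);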
Keith Busch Dec. 4, 2018, 9:43 p.m. UTC | #17
On Tue, Dec 04, 2018 at 02:21:17PM -0700, Keith Busch wrote:
> On Tue, Dec 04, 2018 at 11:33:33AM -0800, James Smart wrote:
> > On 12/4/2018 9:48 AM, Keith Busch wrote:
> > > Once quiesced, the proposed iterator can handle the final termination
> > > of the request, perform failover, or some other lld specific action
> > > depending on your situation.
> > 
> > I don't believe they can remain frozen, definitely not for the admin queue.
> > -- james
> 
> Quiesced and frozen carry different semantics.
> 
> My understanding of the nvme-fc implementation is that it returns
> BLK_STS_RESOURCE in the scenario you've described where the admin
> command can't be executed at the moment. That just has the block layer
> requeue it for later resubmission 3 milliseconds later, which will
> continue to return the same status code until you're really ready for
> it.
> 
> What I'm proposing is that instead of using that return code, you may
> have nvme-fc control when to dispatch those queued requests by utilizing
> the blk-mq quiesce on/off states. Is there a reason that wouldn't work?

BTW, this is digressing from this patch's use case. The proposed iteration
doesn't come into play for the above scenario, which can be handled with
existing interfaces.
James Smart Dec. 4, 2018, 10:09 p.m. UTC | #18
On 12/4/2018 1:21 PM, Keith Busch wrote:
> On Tue, Dec 04, 2018 at 11:33:33AM -0800, James Smart wrote:
>> I disagree.  The cases I've run into are on the admin queue - where we are
>> sending io to initialize the controller when another error/reset occurs, and
>> the checks are required to identify/reject the "old" initialization
>> commands, with another state check allowing them to proceed on the "new"
>> initialization commands.  And there are also cases for ioctls and other
>> things that occur during the middle of those initialization steps that need
>> to be weeded out.   The Admin queue has to be kept live to allow the
>> initialization commands on the new controller.
>>
>> state checks are also needed for those namespace validation cases....
>>
>>> Once quiesced, the proposed iterator can handle the final termination
>>> of the request, perform failover, or some other lld specific action
>>> depending on your situation.
>> I don't believe they can remain frozen, definitely not for the admin queue.
>> -- james
> Quiesced and frozen carry different semantics.
>
> My understanding of the nvme-fc implementation is that it returns
> BLK_STS_RESOURCE in the scenario you've described where the admin
> command can't be executed at the moment. That just has the block layer
> requeue it for later resubmission 3 milliseconds later, which will
> continue to return the same status code until you're really ready for
> it.

BLK_STS_RESOURCE is correct - but for "normal" io, which comes from the
filesystem, etc., and is mostly on the io queues.

But if the io originated from other sources, like the core layer via 
nvme_alloc_request() - used by lots of service routines for the 
transport to initialize the controller, core routines to talk to the 
controller, and ioctls from user space - then they are failed with a 
PATHING ERROR status.  The status doesn't mean much to these other 
places which mainly care if they succeed or not, and if not, they fail 
and unwind.  It's pretty critical for these paths to get that error 
status as many of those threads do synchronous io.  And this is not just 
for nvme-fc. Any transport initializing the controller and getting halfway
through it when an error occurs that kills the association will
depend on this behavior.  PCI is a large exception as interaction with a 
pci function is very different from sending packets over a network and 
detecting network errors.

Io requests on the io queues that are flagged as multipath are also
failed this way rather than requeued.  We would need some iterator here
to classify the type of io (one valid to go down another path) and move 
it to another path (but the transport doesn't want to know about other 
paths).


>
> What I'm proposing is that instead of using that return code, you may
> have nvme-fc control when to dispatch those queued requests by utilizing
> the blk-mq quiesce on/off states. Is there a reason that wouldn't work?

and quiesce on/off isn't sufficient to do this.

-- james

Patch

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 87bc5df72d48..c5fd2e0a4074 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -272,6 +272,7 @@  struct bt_tags_iter_data {
 	busy_tag_iter_fn *fn;
 	void *data;
 	bool reserved;
+	bool all;
 };
 
 static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
@@ -289,7 +290,7 @@  static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	 * test and set the bit before assining ->rqs[].
 	 */
 	rq = tags->rqs[bitnr];
-	if (rq && blk_mq_request_started(rq))
+	if (rq && (iter_data->all || blk_mq_request_started(rq)))
 		return iter_data->fn(rq, iter_data->data, reserved);
 
 	return true;
@@ -309,13 +310,15 @@  static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
  *		bitmap_tags member of struct blk_mq_tags.
  */
 static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
-			     busy_tag_iter_fn *fn, void *data, bool reserved)
+			     busy_tag_iter_fn *fn, void *data, bool reserved,
+			     bool all)
 {
 	struct bt_tags_iter_data iter_data = {
 		.tags = tags,
 		.fn = fn,
 		.data = data,
 		.reserved = reserved,
+		.all = all,
 	};
 
 	if (tags->rqs)
@@ -333,11 +336,12 @@  static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
  * @priv:	Will be passed as second argument to @fn.
  */
 static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
-		busy_tag_iter_fn *fn, void *priv)
+		busy_tag_iter_fn *fn, void *priv, bool all)
 {
 	if (tags->nr_reserved_tags)
-		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true);
-	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false);
+		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true,
+				 all);
+	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false, all);
 }
 
 /**
@@ -357,11 +361,35 @@  void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 
 	for (i = 0; i < tagset->nr_hw_queues; i++) {
 		if (tagset->tags && tagset->tags[i])
-			blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);
+			blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv,
+						 false);
 	}
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
 
+/**
+ * blk_mq_tagset_all_iter - iterate over all requests in a tag set
+ * @tagset:	Tag set to iterate over.
+ * @fn:		Pointer to the function that will be called for each started
+ *		request. @fn will be called as follows: @fn(rq, @priv,
+ *		reserved) where rq is a pointer to a request. 'reserved'
+ *		indicates whether or not @rq is a reserved request. Return
+ *		true to continue iterating tags, false to stop.
+ * @priv:	Will be passed as second argument to @fn.
+ */
+void blk_mq_tagset_all_iter(struct blk_mq_tag_set *tagset,
+		busy_tag_iter_fn *fn, void *priv)
+{
+	int i;
+
+	for (i = 0; i < tagset->nr_hw_queues; i++) {
+		if (tagset->tags && tagset->tags[i])
+			blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv,
+						 true);
+	}
+}
+EXPORT_SYMBOL(blk_mq_tagset_all_iter);
+
 /**
  * blk_mq_queue_tag_busy_iter - iterate over all requests with a driver tag
  * @q:		Request queue to examine.
@@ -408,6 +436,7 @@  void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 	}
 	blk_queue_exit(q);
 }
+EXPORT_SYMBOL(blk_mq_queue_tag_busy_iter);
 
 static int bt_alloc(struct sbitmap_queue *bt, unsigned int depth,
 		    bool round_robin, int node)
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 61deab0b5a5a..5af7ff94b400 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -33,8 +33,6 @@  extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 					struct blk_mq_tags **tags,
 					unsigned int depth, bool can_grow);
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
-void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
-		void *priv);
 
 static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
 						 struct blk_mq_hw_ctx *hctx)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 57eda7b20243..246aa128fd1f 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -322,6 +322,10 @@  bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_run_hw_queues(struct request_queue *q, bool async);
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv);
+void blk_mq_tagset_all_iter(struct blk_mq_tag_set *tagset,
+		busy_tag_iter_fn *fn, void *priv);
+void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
+		void *priv);
 void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_unfreeze_queue(struct request_queue *q);
 void blk_freeze_queue_start(struct request_queue *q);