
[1/2] blk-mq: introduce blk_mq_complete_request_sync()

Message ID 20190318032950.17770-2-ming.lei@redhat.com (mailing list archive)
State New, archived
Series blk-mq/nvme: cancel request synchronously

Commit Message

Ming Lei March 18, 2019, 3:29 a.m. UTC
NVMe's error handler follows the typical steps below for tearing down
hardware:

1) stop blk_mq hw queues
2) stop the real hw queues
3) cancel in-flight requests via
	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
cancel_request():
	mark the request as abort
	blk_mq_complete_request(req);
4) destroy real hw queues

However, there may be a race between #3 and #4, because blk_mq_complete_request()
actually completes the request asynchronously.

This patch introduces blk_mq_complete_request_sync() to fix the
above race.

Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-nvme@lists.infradead.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         | 20 ++++++++++++++++----
 include/linux/blk-mq.h |  1 +
 2 files changed, 17 insertions(+), 4 deletions(-)
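
As a rough illustration of the cancel_request() callback described above,
patch 2/2 of this series presumably switches the NVMe cancel path to the new
helper along these lines (a sketch modeled on nvme_cancel_request(); the exact
busy-iter callback signature is assumed here, not quoted from the series):

	static bool cancel_request(struct request *req, void *data, bool reserved)
	{
		/* mark the request as aborted */
		nvme_req(req)->status = NVME_SC_ABORT_REQ;

		/*
		 * Complete locally: ->complete() runs on this CPU before the
		 * call returns, instead of being redirected via IPI/softirq.
		 */
		blk_mq_complete_request_sync(req);
		return true;
	}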

Comments

Bart Van Assche March 18, 2019, 4:09 a.m. UTC | #1
On 3/17/19 8:29 PM, Ming Lei wrote:
> In NVMe's error handler, follows the typical steps for tearing down
> hardware:
> 
> 1) stop blk_mq hw queues
> 2) stop the real hw queues
> 3) cancel in-flight requests via
> 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
> cancel_request():
> 	mark the request as abort
> 	blk_mq_complete_request(req);
> 4) destroy real hw queues
> 
> However, there may be race between #3 and #4, because blk_mq_complete_request()
> actually completes the request asynchronously.
> 
> This patch introduces blk_mq_complete_request_sync() for fixing the
> above race.

Other block drivers wait until outstanding requests have completed by 
calling blk_cleanup_queue() before hardware queues are destroyed. Why 
can't the NVMe driver follow that approach?

Thanks,

Bart.
Ming Lei March 18, 2019, 7:38 a.m. UTC | #2
On Sun, Mar 17, 2019 at 09:09:09PM -0700, Bart Van Assche wrote:
> On 3/17/19 8:29 PM, Ming Lei wrote:
> > In NVMe's error handler, follows the typical steps for tearing down
> > hardware:
> > 
> > 1) stop blk_mq hw queues
> > 2) stop the real hw queues
> > 3) cancel in-flight requests via
> > 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
> > cancel_request():
> > 	mark the request as abort
> > 	blk_mq_complete_request(req);
> > 4) destroy real hw queues
> > 
> > However, there may be race between #3 and #4, because blk_mq_complete_request()
> > actually completes the request asynchronously.
> > 
> > This patch introduces blk_mq_complete_request_sync() for fixing the
> > above race.
> 
> Other block drivers wait until outstanding requests have completed by
> calling blk_cleanup_queue() before hardware queues are destroyed. Why can't
> the NVMe driver follow that approach?

The teardown of the controller can be done in the error handler, where
the request queues may not have been cleaned up. Almost every kind of NVMe
controller's error handling follows the above steps, for example:

nvme_rdma_error_recovery_work()
	->nvme_rdma_teardown_io_queues()

nvme_timeout()
	->nvme_dev_disable

Thanks,
Ming
Keith Busch March 18, 2019, 2:40 p.m. UTC | #3
On Sun, Mar 17, 2019 at 09:09:09PM -0700, Bart Van Assche wrote:
> On 3/17/19 8:29 PM, Ming Lei wrote:
> > In NVMe's error handler, follows the typical steps for tearing down
> > hardware:
> > 
> > 1) stop blk_mq hw queues
> > 2) stop the real hw queues
> > 3) cancel in-flight requests via
> > 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
> > cancel_request():
> > 	mark the request as abort
> > 	blk_mq_complete_request(req);
> > 4) destroy real hw queues
> > 
> > However, there may be race between #3 and #4, because blk_mq_complete_request()
> > actually completes the request asynchronously.
> > 
> > This patch introduces blk_mq_complete_request_sync() for fixing the
> > above race.
> 
> Other block drivers wait until outstanding requests have completed by
> calling blk_cleanup_queue() before hardware queues are destroyed. Why can't
> the NVMe driver follow that approach?

You can't just wait for an outstanding request indefinitely. We have to
safely make forward progress when we've determined it's not going to be
completed.
Bart Van Assche March 18, 2019, 3:04 p.m. UTC | #4
On Mon, 2019-03-18 at 15:38 +0800, Ming Lei wrote:
> On Sun, Mar 17, 2019 at 09:09:09PM -0700, Bart Van Assche wrote:
> > On 3/17/19 8:29 PM, Ming Lei wrote:
> > > In NVMe's error handler, follows the typical steps for tearing down
> > > hardware:
> > > 
> > > 1) stop blk_mq hw queues
> > > 2) stop the real hw queues
> > > 3) cancel in-flight requests via
> > > 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
> > > cancel_request():
> > > 	mark the request as abort
> > > 	blk_mq_complete_request(req);
> > > 4) destroy real hw queues
> > > 
> > > However, there may be race between #3 and #4, because blk_mq_complete_request()
> > > actually completes the request asynchronously.
> > > 
> > > This patch introduces blk_mq_complete_request_sync() for fixing the
> > > above race.
> > 
> > Other block drivers wait until outstanding requests have completed by
> > calling blk_cleanup_queue() before hardware queues are destroyed. Why can't
> > the NVMe driver follow that approach?
> 
> The tearing down of controller can be done in error handler, in which
> the request queues may not be cleaned up, almost all kinds of NVMe
> controller's error handling follows the above steps, such as:
> 
> nvme_rdma_error_recovery_work()
> 	->nvme_rdma_teardown_io_queues()
> 
> nvme_timeout()
> 	->nvme_dev_disable

Hi Ming,

This makes me wonder whether the current design of the NVMe core is the best
design we can come up with. The structure of e.g. the SRP initiator and target
drivers is similar to that of the NVMeOF drivers. However, there is no need in
the SRP initiator driver to terminate requests synchronously. Is this due to
differences in the error handling approaches of the SCSI and NVMe core drivers?

Thanks,

Bart.
Ming Lei March 18, 2019, 3:16 p.m. UTC | #5
On Mon, Mar 18, 2019 at 08:04:55AM -0700, Bart Van Assche wrote:
> On Mon, 2019-03-18 at 15:38 +0800, Ming Lei wrote:
> > On Sun, Mar 17, 2019 at 09:09:09PM -0700, Bart Van Assche wrote:
> > > On 3/17/19 8:29 PM, Ming Lei wrote:
> > > > In NVMe's error handler, follows the typical steps for tearing down
> > > > hardware:
> > > > 
> > > > 1) stop blk_mq hw queues
> > > > 2) stop the real hw queues
> > > > 3) cancel in-flight requests via
> > > > 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
> > > > cancel_request():
> > > > 	mark the request as abort
> > > > 	blk_mq_complete_request(req);
> > > > 4) destroy real hw queues
> > > > 
> > > > However, there may be race between #3 and #4, because blk_mq_complete_request()
> > > > actually completes the request asynchronously.
> > > > 
> > > > This patch introduces blk_mq_complete_request_sync() for fixing the
> > > > above race.
> > > 
> > > Other block drivers wait until outstanding requests have completed by
> > > calling blk_cleanup_queue() before hardware queues are destroyed. Why can't
> > > the NVMe driver follow that approach?
> > 
> > The tearing down of controller can be done in error handler, in which
> > the request queues may not be cleaned up, almost all kinds of NVMe
> > controller's error handling follows the above steps, such as:
> > 
> > nvme_rdma_error_recovery_work()
> > 	->nvme_rdma_teardown_io_queues()
> > 
> > nvme_timeout()
> > 	->nvme_dev_disable
> 
> Hi Ming,
> 
> This makes me wonder whether the current design of the NVMe core is the best
> design we can come up with? The structure of e.g. the SRP initiator and target
> drivers is similar to the NVMeOF drivers. However, there is no need in the SRP
> initiator driver to terminate requests synchronously. Is this due to

I am not familiar with SRP. Could you explain what the SRP initiator driver
does when the controller is in a bad state, especially how it deals with
in-flight IO requests in that situation?

> differences in the error handling approaches in the SCSI and NVMe core drivers?

As far as I can tell, there is no obvious design issue in the NVMe host drivers,
which try their best to recover the controller and retry all in-flight IO.

Thanks,
Ming
Bart Van Assche March 18, 2019, 3:49 p.m. UTC | #6
On Mon, 2019-03-18 at 23:16 +0800, Ming Lei wrote:
> I am not familiar with SRP, could you explain what SRP initiator driver
> will do when the controller is in bad state? Especially about dealing with
> in-flight IO requests under this situation.

Hi Ming,

Just like the NVMeOF initiator driver, the SRP initiator driver uses an
RDMA RC connection for all of its communication over the network. If
communication between initiator and target fails, the target driver will
close the connection, or one of the work requests that was posted by the
initiator driver will complete with an error status (wc->status !=
IB_WC_SUCCESS). In the latter case the function srp_handle_qp_err() will
try to reestablish the connection between initiator and target after a
certain delay:

	if (delay > 0)
		queue_delayed_work(system_long_wq, &rport->reconnect_work,
				   1UL * delay * HZ);

SCSI timeouts may kick the SCSI error handler. That results in calls to
the srp_reset_device() and/or srp_reset_host() functions. srp_reset_host()
terminates all outstanding requests after having disconnected the RDMA RC
connection. Disconnecting the RC connection first guarantees that there
are no concurrent request completion calls from the regular completion
path and from the error handler.

Bart.
Ming Lei March 18, 2019, 4:06 p.m. UTC | #7
On Mon, Mar 18, 2019 at 08:49:24AM -0700, Bart Van Assche wrote:
> On Mon, 2019-03-18 at 23:16 +0800, Ming Lei wrote:
> > I am not familiar with SRP, could you explain what SRP initiator driver
> > will do when the controller is in bad state? Especially about dealing with
> > in-flight IO requests under this situation.
> 
> Hi Ming,
> 
> Just like the NVMeOF initiator driver, the SRP initiator driver uses an
> RDMA RC connection for all of its communication over the network. If
> communication between initiator and target fails the target driver will
> close the connection or one of the work requests that was posted by the
> initiator driver will complete with an error status (wc->status !=
> IB_WC_SUCCESS). In the latter case the function srp_handle_qp_err() will
> try to reestablish the connection between initiator and target after a
> certain delay:
> 
> 	if (delay > 0)
> 		queue_delayed_work(system_long_wq, &rport->reconnect_work,
> 				   1UL * delay * HZ);
> 
> SCSI timeouts may kick the SCSI error handler. That results in calls of
> the srp_reset_device() and/or srp_reset_host() functions. srp_reset_host()
> terminates all outstanding requests after having disconnected the RDMA RC
> connection.

It looks like the approach of NVMe's error handler is basically similar
to the above.

NVMe just uses blk_mq_tagset_busy_iter(nvme_cancel_request) to abort
in-flight requests, and I guess SCSI FC may use a driver-specific approach to do that.

> Disconnecting the RC connection first guarantees that there
> are no concurrent request completion calls from the regular completion
> path and from the error handler.

It looks like guaranteeing that there are no concurrent request completions
requires a driver-specific implementation.

However, this patch provides one simple approach for NVMe, so no
driver-specific sync mechanism is needed.

Thanks,
Ming
James Smart March 18, 2019, 5:30 p.m. UTC | #8
On 3/17/2019 9:09 PM, Bart Van Assche wrote:
> On 3/17/19 8:29 PM, Ming Lei wrote:
>> In NVMe's error handler, follows the typical steps for tearing down
>> hardware:
>>
>> 1) stop blk_mq hw queues
>> 2) stop the real hw queues
>> 3) cancel in-flight requests via
>>     blk_mq_tagset_busy_iter(tags, cancel_request, ...)
>> cancel_request():
>>     mark the request as abort
>>     blk_mq_complete_request(req);
>> 4) destroy real hw queues
>>
>> However, there may be race between #3 and #4, because 
>> blk_mq_complete_request()
>> actually completes the request asynchronously.
>>
>> This patch introduces blk_mq_complete_request_sync() for fixing the
>> above race.
>
> Other block drivers wait until outstanding requests have completed by 
> calling blk_cleanup_queue() before hardware queues are destroyed. Why 
> can't the NVMe driver follow that approach?
>

Speaking for the fabrics, not necessarily PCI:

The intent of this looping, which happens immediately after an error is
detected, is to cause the termination of the outstanding requests.
Otherwise, the only recourse is to wait for the IOs to finish, which they
may never do, or to have their upper-level timeouts expire to cause their
termination - thus a very long delay. And one of the commands on the admin
queue - a different tag set, but handled the same way - doesn't have a
timeout (the Async Event Reporting command), so it wouldn't necessarily
clear without this looping.

-- james
James Smart March 18, 2019, 5:37 p.m. UTC | #9
On 3/17/2019 8:29 PM, Ming Lei wrote:
> In NVMe's error handler, follows the typical steps for tearing down
> hardware:
>
> 1) stop blk_mq hw queues
> 2) stop the real hw queues
> 3) cancel in-flight requests via
> 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
> cancel_request():
> 	mark the request as abort
> 	blk_mq_complete_request(req);
> 4) destroy real hw queues
>
> However, there may be race between #3 and #4, because blk_mq_complete_request()
> actually completes the request asynchronously.
>
> This patch introduces blk_mq_complete_request_sync() for fixing the
> above race.
>

This won't help FC at all. Inherently, the "completion" has to be 
asynchronous as line traffic may be required.

e.g. FC doesn't use nvme_complete_request() in the iterator routine.

-- james
Ming Lei March 19, 2019, 1:06 a.m. UTC | #10
On Mon, Mar 18, 2019 at 10:37:08AM -0700, James Smart wrote:
> 
> 
> On 3/17/2019 8:29 PM, Ming Lei wrote:
> > In NVMe's error handler, follows the typical steps for tearing down
> > hardware:
> > 
> > 1) stop blk_mq hw queues
> > 2) stop the real hw queues
> > 3) cancel in-flight requests via
> > 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
> > cancel_request():
> > 	mark the request as abort
> > 	blk_mq_complete_request(req);
> > 4) destroy real hw queues
> > 
> > However, there may be race between #3 and #4, because blk_mq_complete_request()
> > actually completes the request asynchronously.
> > 
> > This patch introduces blk_mq_complete_request_sync() for fixing the
> > above race.
> > 
> 
> This won't help FC at all. Inherently, the "completion" has to be
> asynchronous as line traffic may be required.
> 
> e.g. FC doesn't use nvme_complete_request() in the iterator routine.

Yeah, I saw the FC code; this patch is supposed to address the asynchronous
completion of blk_mq_complete_request() in the error handler.

Also, I think it is always correct to abort requests synchronously in the
error handler, isn't it?


thanks,
Ming
Ming Lei March 19, 2019, 1:31 a.m. UTC | #11
On Mon, Mar 18, 2019 at 10:37:08AM -0700, James Smart wrote:
> 
> 
> On 3/17/2019 8:29 PM, Ming Lei wrote:
> > In NVMe's error handler, follows the typical steps for tearing down
> > hardware:
> > 
> > 1) stop blk_mq hw queues
> > 2) stop the real hw queues
> > 3) cancel in-flight requests via
> > 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
> > cancel_request():
> > 	mark the request as abort
> > 	blk_mq_complete_request(req);
> > 4) destroy real hw queues
> > 
> > However, there may be race between #3 and #4, because blk_mq_complete_request()
> > actually completes the request asynchronously.
> > 
> > This patch introduces blk_mq_complete_request_sync() for fixing the
> > above race.
> > 
> 
> This won't help FC at all. Inherently, the "completion" has to be
> asynchronous as line traffic may be required.
> 
> e.g. FC doesn't use nvme_complete_request() in the iterator routine.
> 

It looks like FC already does this kind of synchronization; see nvme_fc_delete_association():

		...
        /* wait for all io that had to be aborted */
        spin_lock_irq(&ctrl->lock);
        wait_event_lock_irq(ctrl->ioabort_wait, ctrl->iocnt == 0, ctrl->lock);
        ctrl->flags &= ~FCCTRL_TERMIO;
        spin_unlock_irq(&ctrl->lock);

Thanks,
Ming
James Smart March 19, 2019, 3:37 a.m. UTC | #12
On 3/18/2019 6:06 PM, Ming Lei wrote:
> On Mon, Mar 18, 2019 at 10:37:08AM -0700, James Smart wrote:
>>
>> On 3/17/2019 8:29 PM, Ming Lei wrote:
>>> In NVMe's error handler, follows the typical steps for tearing down
>>> hardware:
>>>
>>> 1) stop blk_mq hw queues
>>> 2) stop the real hw queues
>>> 3) cancel in-flight requests via
>>> 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
>>> cancel_request():
>>> 	mark the request as abort
>>> 	blk_mq_complete_request(req);
>>> 4) destroy real hw queues
>>>
>>> However, there may be race between #3 and #4, because blk_mq_complete_request()
>>> actually completes the request asynchronously.
>>>
>>> This patch introduces blk_mq_complete_request_sync() for fixing the
>>> above race.
>>>
>> This won't help FC at all. Inherently, the "completion" has to be
>> asynchronous as line traffic may be required.
>>
>> e.g. FC doesn't use nvme_complete_request() in the iterator routine.
> Yeah, I saw the FC code, it is supposed to address the asynchronous
> completion of blk_mq_complete_request() in error handler.
>
> Also I think it is always the correct thing to abort requests
> synchronously in error handler, isn't it?
>

I'm not sure I fully follow you, but if you're asking whether it should
always be synchronous - why would that be the case? I really don't want a
thread that could block for several seconds waiting for a single IO to
complete. The controller has changed state and the queues are frozen, which
should have been sufficient - but bottom-end IO can still complete at
any time.

-- james
Ming Lei March 19, 2019, 3:50 a.m. UTC | #13
On Mon, Mar 18, 2019 at 08:37:35PM -0700, James Smart wrote:
> 
> 
> On 3/18/2019 6:06 PM, Ming Lei wrote:
> > On Mon, Mar 18, 2019 at 10:37:08AM -0700, James Smart wrote:
> > > 
> > > On 3/17/2019 8:29 PM, Ming Lei wrote:
> > > > In NVMe's error handler, follows the typical steps for tearing down
> > > > hardware:
> > > > 
> > > > 1) stop blk_mq hw queues
> > > > 2) stop the real hw queues
> > > > 3) cancel in-flight requests via
> > > > 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
> > > > cancel_request():
> > > > 	mark the request as abort
> > > > 	blk_mq_complete_request(req);
> > > > 4) destroy real hw queues
> > > > 
> > > > However, there may be race between #3 and #4, because blk_mq_complete_request()
> > > > actually completes the request asynchronously.
> > > > 
> > > > This patch introduces blk_mq_complete_request_sync() for fixing the
> > > > above race.
> > > > 
> > > This won't help FC at all. Inherently, the "completion" has to be
> > > asynchronous as line traffic may be required.
> > > 
> > > e.g. FC doesn't use nvme_complete_request() in the iterator routine.
> > Yeah, I saw the FC code, it is supposed to address the asynchronous
> > completion of blk_mq_complete_request() in error handler.
> > 
> > Also I think it is always the correct thing to abort requests
> > synchronously in error handler, isn't it?
> > 
> 
> not sure I fully follow you, but if you're asking shouldn't it always be
> synchronous - why would that be the case ?  I really don't want a blocking
> thread that could block for several seconds on a single io to complete.  The

We are talking about the error handler, in which all in-flight requests are
simply aborted via blk_mq_tagset_busy_iter(nvme_cancel_request, ...), and there
isn't any waiting for a single IO to complete. nvme_cancel_request() basically
re-queues the in-flight request to blk-mq's queues, which takes very little
time, and I guess blk_mq_complete_request_sync() should be quicker
than blk_mq_complete_request() in this situation.

> controller has changed state and the queues frozen which should have been
> sufficient - but bottom-end io can still complete at any time.

The queues have been quiesced or stopped for recovery, and queue freezing
requires waiting for the completion of all in-flight requests, which would
create a new IO deadlock...
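
For illustration, the contrast between the two block-layer primitives being
discussed here (a hedged sketch, not verbatim kernel code):

	/*
	 * blk_mq_quiesce_queue(q) only stops dispatch of new requests; the
	 * in-flight ones remain outstanding, which is what the error handler
	 * relies on while it aborts them.
	 *
	 * blk_mq_freeze_queue(q) additionally waits until every in-flight
	 * request has completed - waiting on the very requests the error
	 * handler is supposed to abort, hence the deadlock mentioned above.
	 */
	blk_mq_quiesce_queue(q);	/* recovery path: stop dispatch only */
	/* ... abort in-flight requests via blk_mq_tagset_busy_iter() ... */
	blk_mq_unquiesce_queue(q);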

Thanks,
Ming
James Smart March 19, 2019, 4:04 a.m. UTC | #14
On 3/18/2019 6:31 PM, Ming Lei wrote:
> On Mon, Mar 18, 2019 at 10:37:08AM -0700, James Smart wrote:
>>
>> On 3/17/2019 8:29 PM, Ming Lei wrote:
>>> In NVMe's error handler, follows the typical steps for tearing down
>>> hardware:
>>>
>>> 1) stop blk_mq hw queues
>>> 2) stop the real hw queues
>>> 3) cancel in-flight requests via
>>> 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
>>> cancel_request():
>>> 	mark the request as abort
>>> 	blk_mq_complete_request(req);
>>> 4) destroy real hw queues
>>>
>>> However, there may be race between #3 and #4, because blk_mq_complete_request()
>>> actually completes the request asynchronously.
>>>
>>> This patch introduces blk_mq_complete_request_sync() for fixing the
>>> above race.
>>>
>> This won't help FC at all. Inherently, the "completion" has to be
>> asynchronous as line traffic may be required.
>>
>> e.g. FC doesn't use nvme_complete_request() in the iterator routine.
>>
> Looks FC has done the sync already, see nvme_fc_delete_association():
>
> 		...
>          /* wait for all io that had to be aborted */
>          spin_lock_irq(&ctrl->lock);
>          wait_event_lock_irq(ctrl->ioabort_wait, ctrl->iocnt == 0, ctrl->lock);
>          ctrl->flags &= ~FCCTRL_TERMIO;
>          spin_unlock_irq(&ctrl->lock);

Yes - but the iterator started a lot of the back-end IO terminating in
parallel, so waiting on many terminations happening in parallel is better than
waiting one at a time. Even so, I've always disliked this wait and would
have preferred to exit the thread and have something monitor the
completions, re-queuing a work thread to finish.

-- james
Ming Lei March 19, 2019, 4:28 a.m. UTC | #15
On Mon, Mar 18, 2019 at 09:04:37PM -0700, James Smart wrote:
> 
> 
> On 3/18/2019 6:31 PM, Ming Lei wrote:
> > On Mon, Mar 18, 2019 at 10:37:08AM -0700, James Smart wrote:
> > > 
> > > On 3/17/2019 8:29 PM, Ming Lei wrote:
> > > > In NVMe's error handler, follows the typical steps for tearing down
> > > > hardware:
> > > > 
> > > > 1) stop blk_mq hw queues
> > > > 2) stop the real hw queues
> > > > 3) cancel in-flight requests via
> > > > 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
> > > > cancel_request():
> > > > 	mark the request as abort
> > > > 	blk_mq_complete_request(req);
> > > > 4) destroy real hw queues
> > > > 
> > > > However, there may be race between #3 and #4, because blk_mq_complete_request()
> > > > actually completes the request asynchronously.
> > > > 
> > > > This patch introduces blk_mq_complete_request_sync() for fixing the
> > > > above race.
> > > > 
> > > This won't help FC at all. Inherently, the "completion" has to be
> > > asynchronous as line traffic may be required.
> > > 
> > > e.g. FC doesn't use nvme_complete_request() in the iterator routine.
> > > 
> > Looks FC has done the sync already, see nvme_fc_delete_association():
> > 
> > 		...
> >          /* wait for all io that had to be aborted */
> >          spin_lock_irq(&ctrl->lock);
> >          wait_event_lock_irq(ctrl->ioabort_wait, ctrl->iocnt == 0, ctrl->lock);
> >          ctrl->flags &= ~FCCTRL_TERMIO;
> >          spin_unlock_irq(&ctrl->lock);
> 
> yes - but the iterator started a lot of the back end io terminating in
> parallel. So waiting on many happening in parallel is better than waiting 1
> at a time.

OK, that is FC's own synchronization, not related to this patch.

> Even so, I've always disliked this wait and would have
> preferred to exit the thread with something monitoring the completions
> re-queuing a work thread to finish.

Then I guess you may like this patch, given that it actually avoids the
potential wait :-)

What the patch does is convert the remote completion (#1) into a local
completion (#2):

1) Previously, a request may be completed remotely by blk_mq_complete_request():

         rq->csd.func = __blk_mq_complete_request_remote;
         rq->csd.info = rq;
         rq->csd.flags = 0;
         smp_call_function_single_async(ctx->cpu, &rq->csd);

2) This patch changes the remote completion into a local completion via
blk_mq_complete_request_sync(), so all in-flight requests can be aborted
before the queues are destroyed.

		q->mq_ops->complete(rq);

As I mentioned in another email, there isn't any waiting when aborting a
request; nvme_cancel_request() simply requeues the request to blk-mq
in this situation.
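
Put together with the RDMA teardown sequence quoted later in this thread, the
resulting ordering looks roughly like this (a simplified sketch, not verbatim
driver code):

	nvme_stop_queues(&ctrl->ctrl);		/* quiesce blk-mq hw queues */
	nvme_rdma_stop_io_queues(ctrl);		/* stop the real hw queues */
	blk_mq_tagset_busy_iter(&ctrl->tag_set,
			nvme_cancel_request, &ctrl->ctrl);
	/*
	 * With blk_mq_complete_request_sync(), every ->complete() has already
	 * run locally when the iterator returns, so the step below cannot race
	 * with a pending remote (IPI/softirq) completion.
	 */
	nvme_rdma_destroy_io_queues(ctrl, remove);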

Thanks,
Ming
Sagi Grimberg March 21, 2019, 12:47 a.m. UTC | #16
> Hi Ming,
> 
> Just like the NVMeOF initiator driver, the SRP initiator driver uses an
> RDMA RC connection for all of its communication over the network. If
> communication between initiator and target fails the target driver will
> close the connection or one of the work requests that was posted by the
> initiator driver will complete with an error status (wc->status !=
> IB_WC_SUCCESS). In the latter case the function srp_handle_qp_err() will
> try to reestablish the connection between initiator and target after a
> certain delay:
> 
> 	if (delay > 0)
> 		queue_delayed_work(system_long_wq, &rport->reconnect_work,
> 				   1UL * delay * HZ);
> 
> SCSI timeouts may kick the SCSI error handler. That results in calls of
> the srp_reset_device() and/or srp_reset_host() functions. srp_reset_host()
> terminates all outstanding requests after having disconnected the RDMA RC
> connection. Disconnecting the RC connection first guarantees that there
> are no concurrent request completion calls from the regular completion
> path and from the error handler.

Hi Bart,

If I understand the race correctly, it's not necessarily between request
completion and queue pair removal, nor the timeout handler, but rather
between the asynchronous request completion and the tagset deallocation.

Think of a surprise removal (or disconnect) during I/O: drivers
usually stop/quiesce/freeze the queues, terminate/abort inflight
I/Os and then tear down the hw queues and the tagset.

IIRC, the same race holds for srp if this happens during I/O:
1. srp_rport_delete() -> srp_remove_target() -> srp_stop_rport_timers() 
-> __rport_fail_io_fast()

2. complete all I/Os (async remotely via smp)

Then continue..

3. scsi_host_put() -> scsi_host_dev_release() -> scsi_mq_destroy_tags()

What is preventing (3) from happening before (2) if it's async? I would
think that SCSI drivers need the exact same thing...
Ming Lei March 21, 2019, 1:39 a.m. UTC | #17
On Wed, Mar 20, 2019 at 05:47:01PM -0700, Sagi Grimberg wrote:
> 
> > Hi Ming,
> > 
> > Just like the NVMeOF initiator driver, the SRP initiator driver uses an
> > RDMA RC connection for all of its communication over the network. If
> > communication between initiator and target fails the target driver will
> > close the connection or one of the work requests that was posted by the
> > initiator driver will complete with an error status (wc->status !=
> > IB_WC_SUCCESS). In the latter case the function srp_handle_qp_err() will
> > try to reestablish the connection between initiator and target after a
> > certain delay:
> > 
> > 	if (delay > 0)
> > 		queue_delayed_work(system_long_wq, &rport->reconnect_work,
> > 				   1UL * delay * HZ);
> > 
> > SCSI timeouts may kick the SCSI error handler. That results in calls of
> > the srp_reset_device() and/or srp_reset_host() functions. srp_reset_host()
> > terminates all outstanding requests after having disconnected the RDMA RC
> > connection. Disconnecting the RC connection first guarantees that there
> > are no concurrent request completion calls from the regular completion
> > path and from the error handler.
> 
> Hi Bart,
> 
> If I understand the race correctly, its not between the requests
> completion and the queue pairs removal nor the timeout handler
> necessarily, but rather it is between the async requests completion and
> the tagset deallocation.
> 
> Think of surprise removal (or disconnect) during I/O, drivers
> usually stop/quiesce/freeze the queues, terminate/abort inflight
> I/Os and then teardown the hw queues and the tagset.
> 
> IIRC, the same race holds for srp if this happens during I/O:
> 1. srp_rport_delete() -> srp_remove_target() -> srp_stop_rport_timers() ->
> __rport_fail_io_fast()
> 
> 2. complete all I/Os (async remotely via smp)
> 
> Then continue..
> 
> 3. scsi_host_put() -> scsi_host_dev_release() -> scsi_mq_destroy_tags()
> 
> What is preventing (3) from happening before (2) if its async? I would
> think that scsi drivers need the exact same thing...

blk_cleanup_queue() will do that, but it can't be used in device recovery
obviously.

BTW, blk_mq_complete_request_sync() is a bit misleading, maybe
blk_mq_complete_request_locally() is better.

Thanks,
Ming
Sagi Grimberg March 21, 2019, 2:04 a.m. UTC | #18
>> Hi Bart,
>>
>> If I understand the race correctly, its not between the requests
>> completion and the queue pairs removal nor the timeout handler
>> necessarily, but rather it is between the async requests completion and
>> the tagset deallocation.
>>
>> Think of surprise removal (or disconnect) during I/O, drivers
>> usually stop/quiesce/freeze the queues, terminate/abort inflight
>> I/Os and then teardown the hw queues and the tagset.
>>
>> IIRC, the same race holds for srp if this happens during I/O:
>> 1. srp_rport_delete() -> srp_remove_target() -> srp_stop_rport_timers() ->
>> __rport_fail_io_fast()
>>
>> 2. complete all I/Os (async remotely via smp)
>>
>> Then continue..
>>
>> 3. scsi_host_put() -> scsi_host_dev_release() -> scsi_mq_destroy_tags()
>>
>> What is preventing (3) from happening before (2) if its async? I would
>> think that scsi drivers need the exact same thing...
> 
> blk_cleanup_queue() will do that, but it can't be used in device recovery
> obviously.

But in device recovery we never free the tagset... I might be missing
the race here then...

> BTW, blk_mq_complete_request_sync() is a bit misleading, maybe
> blk_mq_complete_request_locally() is better.

Not really...
Sagi Grimberg March 21, 2019, 2:13 a.m. UTC | #19
>>> In NVMe's error handler, follows the typical steps for tearing down
>>> hardware:
>>>
>>> 1) stop blk_mq hw queues
>>> 2) stop the real hw queues
>>> 3) cancel in-flight requests via
>>> 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
>>> cancel_request():
>>> 	mark the request as abort
>>> 	blk_mq_complete_request(req);
>>> 4) destroy real hw queues
>>>
>>> However, there may be race between #3 and #4, because blk_mq_complete_request()
>>> actually completes the request asynchronously.
>>>
>>> This patch introduces blk_mq_complete_request_sync() for fixing the
>>> above race.
>>
>> Other block drivers wait until outstanding requests have completed by
>> calling blk_cleanup_queue() before hardware queues are destroyed. Why can't
>> the NVMe driver follow that approach?
> 
> The tearing down of controller can be done in error handler, in which
> the request queues may not be cleaned up, almost all kinds of NVMe
> controller's error handling follows the above steps, such as:
> 
> nvme_rdma_error_recovery_work()
> 	->nvme_rdma_teardown_io_queues()

Clarification: this happens in its own dedicated context, not in the timeout or
error handler.

But I still don't understand the issue here: what is the exact race you
are referring to? That we abort/cancel a request and then complete
it again when we destroy the HW queue?

If so, that is not the case (at least for rdma/tcp) because
nvme_rdma_teardown_io_queues() first flushes the hw queue and then
aborts inflight I/O.
Bart Van Assche March 21, 2019, 2:15 a.m. UTC | #20
On 3/20/19 5:47 PM, Sagi Grimberg wrote:
> If I understand the race correctly, its not between the requests
> completion and the queue pairs removal nor the timeout handler
> necessarily, but rather it is between the async requests completion and
> the tagset deallocation.
> 
> Think of surprise removal (or disconnect) during I/O, drivers
> usually stop/quiesce/freeze the queues, terminate/abort inflight
> I/Os and then teardown the hw queues and the tagset.
> 
> IIRC, the same race holds for srp if this happens during I/O:
> 1. srp_rport_delete() -> srp_remove_target() -> srp_stop_rport_timers() 
> -> __rport_fail_io_fast()
> 
> 2. complete all I/Os (async remotely via smp)
> 
> Then continue..
> 
> 3. scsi_host_put() -> scsi_host_dev_release() -> scsi_mq_destroy_tags()
> 
> What is preventing (3) from happening before (2) if its async? I would
> think that scsi drivers need the exact same thing...

Hi Sagi,

As Ming already replied, I don't think that (3) can happen before (2) in
the case of the SRP driver. If you have a look at srp_remove_target() you
will see that it calls scsi_remove_host(). That function only returns
after blk_cleanup_queue() has been called for all associated request
queues. As you know, that function waits until all outstanding requests
have completed.
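
In other words, the ordering being relied on looks roughly like this (call
chain condensed from the function names mentioned in this thread; a sketch,
not verbatim SCSI code):

	scsi_remove_host(shost);	/* ends up in blk_cleanup_queue() for each
					 * request queue, which waits for all
					 * outstanding requests to complete */
	scsi_host_put(shost);		/* only now can scsi_host_dev_release() ->
					 * scsi_mq_destroy_tags() free the tagset */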

Bart.
Ming Lei March 21, 2019, 2:32 a.m. UTC | #21
On Wed, Mar 20, 2019 at 07:04:09PM -0700, Sagi Grimberg wrote:
> 
> > > Hi Bart,
> > > 
> > > If I understand the race correctly, its not between the requests
> > > completion and the queue pairs removal nor the timeout handler
> > > necessarily, but rather it is between the async requests completion and
> > > the tagset deallocation.
> > > 
> > > Think of surprise removal (or disconnect) during I/O, drivers
> > > usually stop/quiesce/freeze the queues, terminate/abort inflight
> > > I/Os and then teardown the hw queues and the tagset.
> > > 
> > > IIRC, the same race holds for srp if this happens during I/O:
> > > 1. srp_rport_delete() -> srp_remove_target() -> srp_stop_rport_timers() ->
> > > __rport_fail_io_fast()
> > > 
> > > 2. complete all I/Os (async remotely via smp)
> > > 
> > > Then continue..
> > > 
> > > 3. scsi_host_put() -> scsi_host_dev_release() -> scsi_mq_destroy_tags()
> > > 
> > > What is preventing (3) from happening before (2) if its async? I would
> > > think that scsi drivers need the exact same thing...
> > 
> > blk_cleanup_queue() will do that, but it can't be used in device recovery
> > obviously.
> 
> But in device recovery we never free the tagset... I might be missing
> the race here then...

For example,

nvme_rdma_complete_rq
	->nvme_rdma_unmap_data
		->ib_mr_pool_put

But the ib queue pair may have been destroyed by nvme_rdma_destroy_io_queues()
before the request's remote completion.

nvme_rdma_teardown_io_queues:
        nvme_stop_queues(&ctrl->ctrl);
        nvme_rdma_stop_io_queues(ctrl);
        blk_mq_tagset_busy_iter(&ctrl->tag_set, nvme_cancel_request,
                        &ctrl->ctrl);
        if (remove)
                nvme_start_queues(&ctrl->ctrl);
        nvme_rdma_destroy_io_queues(ctrl, remove);


> 
> > BTW, blk_mq_complete_request_sync() is a bit misleading, maybe
> > blk_mq_complete_request_locally() is better.
> 
> Not really...

Naming is always the hard part...

Thanks,
Ming
Sagi Grimberg March 21, 2019, 9:40 p.m. UTC | #22
> For example,
> 
> nvme_rdma_complete_rq
> 	->nvme_rdma_unmap_data
> 		->ib_mr_pool_put
> 
> But the ib queue pair may has been destroyed by nvme_rdma_destroy_io_queues()
> before request's remote completion.
> 
> nvme_rdma_teardown_io_queues:
>          nvme_stop_queues(&ctrl->ctrl);
>          nvme_rdma_stop_io_queues(ctrl);
>          blk_mq_tagset_busy_iter(&ctrl->tag_set, nvme_cancel_request,
>                          &ctrl->ctrl);
>          if (remove)
>                  nvme_start_queues(&ctrl->ctrl);
>          nvme_rdma_destroy_io_queues(ctrl, remove);
> 

Yea, you're right. I'm fine with this patchset.
Christoph Hellwig March 27, 2019, 8:27 a.m. UTC | #23
On Thu, Mar 21, 2019 at 02:40:51PM -0700, Sagi Grimberg wrote:
>> nvme_rdma_teardown_io_queues:
>>          nvme_stop_queues(&ctrl->ctrl);
>>          nvme_rdma_stop_io_queues(ctrl);
>>          blk_mq_tagset_busy_iter(&ctrl->tag_set, nvme_cancel_request,
>>                          &ctrl->ctrl);
>>          if (remove)
>>                  nvme_start_queues(&ctrl->ctrl);
>>          nvme_rdma_destroy_io_queues(ctrl, remove);
>>
>
> Yea, you're right. I'm fine with this patchset.

Is this an official Reviewed-by?
Christoph Hellwig March 27, 2019, 8:30 a.m. UTC | #24
Please make this an EXPORT_SYMBOL_GPL, as for all new low-level blk-mq
exports.  Otherwise this looks fine modulo any minor documentation nitpicks
that I might have seen in the thread:

Reviewed-by: Christoph Hellwig <hch@lst.de>
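
For instance, a kernel-doc comment along these lines could address the
documentation nitpick (a sketch only, not part of the posted patch):

	/**
	 * blk_mq_complete_request_sync - complete a request locally
	 * @rq: the request being processed
	 *
	 * Like blk_mq_complete_request(), but ->complete() is always invoked on
	 * the calling CPU before this function returns, instead of possibly being
	 * redirected to another CPU via IPI or softirq.  Intended for error
	 * handlers that must know all cancelled requests have been completed
	 * before tearing down hardware queues.
	 *
	 * Returns false if a fake timeout was injected and the request was not
	 * completed, true otherwise.
	 */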

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a9c181603cbd..8f925ac0b26d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -569,7 +569,7 @@  static void __blk_mq_complete_request_remote(void *data)
 	q->mq_ops->complete(rq);
 }
 
-static void __blk_mq_complete_request(struct request *rq)
+static void __blk_mq_complete_request(struct request *rq, bool sync)
 {
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct request_queue *q = rq->q;
@@ -586,7 +586,7 @@  static void __blk_mq_complete_request(struct request *rq)
 	 * So complete IO reqeust in softirq context in case of single queue
 	 * for not degrading IO performance by irqsoff latency.
 	 */
-	if (q->nr_hw_queues == 1) {
+	if (q->nr_hw_queues == 1 && !sync) {
 		__blk_complete_request(rq);
 		return;
 	}
@@ -594,8 +594,11 @@  static void __blk_mq_complete_request(struct request *rq)
 	/*
 	 * For a polled request, always complete locallly, it's pointless
 	 * to redirect the completion.
+	 *
+	 * If driver requires to complete the request synchronously,
+	 * complete it locally, and it is usually done in error handler.
 	 */
-	if ((rq->cmd_flags & REQ_HIPRI) ||
+	if ((rq->cmd_flags & REQ_HIPRI) || sync ||
 	    !test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags)) {
 		q->mq_ops->complete(rq);
 		return;
@@ -648,11 +651,20 @@  bool blk_mq_complete_request(struct request *rq)
 {
 	if (unlikely(blk_should_fake_timeout(rq->q)))
 		return false;
-	__blk_mq_complete_request(rq);
+	__blk_mq_complete_request(rq, false);
 	return true;
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
+bool blk_mq_complete_request_sync(struct request *rq)
+{
+	if (unlikely(blk_should_fake_timeout(rq->q)))
+		return false;
+	__blk_mq_complete_request(rq, true);
+	return true;
+}
+EXPORT_SYMBOL(blk_mq_complete_request_sync);
+
 int blk_mq_request_started(struct request *rq)
 {
 	return blk_mq_rq_state(rq) != MQ_RQ_IDLE;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b0c814bcc7e3..6a514e5136f4 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -305,6 +305,7 @@  void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
 void blk_mq_kick_requeue_list(struct request_queue *q);
 void blk_mq_delay_kick_requeue_list(struct request_queue *q, unsigned long msecs);
 bool blk_mq_complete_request(struct request *rq);
+bool blk_mq_complete_request_sync(struct request *rq);
 bool blk_mq_bio_list_merge(struct request_queue *q, struct list_head *list,
 			   struct bio *bio);
 bool blk_mq_queue_stopped(struct request_queue *q);