
[1/6] nvme: Sync request queues on reset

Message ID 20180521140413.GA5528@localhost.localdomain (mailing list archive)
State New, archived

Commit Message

Keith Busch May 21, 2018, 2:04 p.m. UTC
On Sat, May 19, 2018 at 08:01:42AM +0800, Ming Lei wrote:
> > You keep saying that, but the controller state is global to the
> > controller. It doesn't matter which namespace's request_queue started
> > the reset: every namespace's request_queue sees the RESETTING
> > controller state.
> 
> When timeouts occur, the global RESETTING state may not have been
> updated yet, so some of the timeout handlers may not observe that state.

Even prior to the RESETTING state, every single command, no matter
which namespace or request_queue it came in on, is reclaimed by the
driver. There *should* be no requests left to time out after
nvme_dev_disable is called, because the nvme driver returned control of
all requests in the tagset to blk-mq.
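
The reclaim step at the end of nvme_dev_disable() is roughly this
(paraphrased, not the exact source):

	/*
	 * Return every started request in both tagsets to blk-mq via
	 * nvme_cancel_request().
	 */
	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);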

In any case, if blk-mq decides it won't complete those requests, we
can just swap the order in the reset_work: sync first, then
unconditionally disable. Does the following snippet look better?

---
--

Comments

Ming Lei May 21, 2018, 3:25 p.m. UTC | #1
On Mon, May 21, 2018 at 08:04:13AM -0600, Keith Busch wrote:
> On Sat, May 19, 2018 at 08:01:42AM +0800, Ming Lei wrote:
> > > You keep saying that, but the controller state is global to the
> > > controller. It doesn't matter which namespace's request_queue started
> > > the reset: every namespace's request_queue sees the RESETTING
> > > controller state.
> > 
> > When timeouts occur, the global RESETTING state may not have been
> > updated yet, so some of the timeout handlers may not observe that state.
> 
> Even prior to the RESETTING state, every single command, no matter
> which namespace or request_queue it came in on, is reclaimed by the
> driver. There *should* be no requests left to time out after
> nvme_dev_disable is called, because the nvme driver returned control of
> all requests in the tagset to blk-mq.

The timed-out requests won't be canceled by nvme_dev_disable().

If a timed-out request is handled as RESET_TIMER, a new timeout
event may be triggered again.
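
That is, when the driver's ->timeout() returns BLK_EH_RESET_TIMER,
blk-mq simply re-arms the timer for that request; roughly (a simplified
sketch of the block layer dispatch of this era, not the exact code):

	switch (ops->timeout(req, reserved)) {
	case BLK_EH_HANDLED:
		__blk_mq_complete_request(req);	/* blk-mq completes it */
		break;
	case BLK_EH_RESET_TIMER:
		blk_add_timer(req);	/* re-armed: may time out again */
		break;
	case BLK_EH_NOT_HANDLED:
		break;
	}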

> 
> In any case, if blk-mq decides it won't complete those requests, we
> can just swap the order in the reset_work: sync first, then
> unconditionally disable. Does the following snippet look better?
> 
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 17a0190bd88f..42af077ee07a 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2307,11 +2307,14 @@ static void nvme_reset_work(struct work_struct *work)
>  		goto out;
>  
>  	/*
> -	 * If we're called to reset a live controller first shut it down before
> -	 * moving on.
> +	 * Ensure there is no timeout work in progress prior to forcefully
> +	 * disabling the queue. There is no harm in disabling the device even
> +	 * when it was already disabled, as this will forcefully reclaim any
> +	 * IOs that are stuck due to blk-mq's timeout handling that prevents
> +	 * timed-out requests from completing.
>  	 */
> -	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
> -		nvme_dev_disable(dev, false);
> +	nvme_sync_queues(&dev->ctrl);
> +	nvme_dev_disable(dev, false);

That may not work reliably either.

For example, request A from NS_0 times out and is handled as
RESET_TIMER; meanwhile request B from NS_1 times out and is handled as
EH_HANDLED.

When the above reset work runs to handle the timeout of request B, a
new timeout event for request A may arrive just between
nvme_sync_queues() and nvme_dev_disable(). Then nvme_dev_disable()
can't cover request A, and the timeout handler for request A will
eventually call nvme_dev_disable() while the current reset is still in
progress: the reset can't move on, and an IO hang results.
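
As a hypothetical timeline:

	nvme_reset_work (for req B)     blk-mq timeout path (for req A)
	---------------------------     -------------------------------
	nvme_sync_queues()
	                                req A's re-armed timer fires,
	                                nvme_timeout(req A) starts
	nvme_dev_disable()
	  (can't complete req A: the
	   timeout path already owns it)
	reset proceeds                  nvme_timeout() calls
	                                nvme_dev_disable() while the
	                                reset is still in progress
	=> the reset stalls, IO hang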


Thanks,
Ming
Keith Busch May 21, 2018, 3:59 p.m. UTC | #2
On Mon, May 21, 2018 at 11:25:43PM +0800, Ming Lei wrote:
> On Mon, May 21, 2018 at 08:04:13AM -0600, Keith Busch wrote:
> > On Sat, May 19, 2018 at 08:01:42AM +0800, Ming Lei wrote:
> > > > You keep saying that, but the controller state is global to the
> > > > controller. It doesn't matter which namespace's request_queue started
> > > > the reset: every namespace's request_queue sees the RESETTING
> > > > controller state.
> > > 
> > > When timeouts occur, the global RESETTING state may not have been
> > > updated yet, so some of the timeout handlers may not observe that state.
> > 
> > Even prior to the RESETTING state, every single command, no matter
> > which namespace or request_queue it came in on, is reclaimed by the
> > driver. There *should* be no requests left to time out after
> > nvme_dev_disable is called, because the nvme driver returned control of
> > all requests in the tagset to blk-mq.
> 
> The timed-out requests won't be canceled by nvme_dev_disable().

??? nvme_dev_disable cancels all started requests. There are no exceptions.

> If a timed-out request is handled as RESET_TIMER, a new timeout
> event may be triggered again.
> 
> > 
> > In any case, if blk-mq decides it won't complete those requests, we
> > can just swap the order in the reset_work: sync first, then
> > unconditionally disable. Does the following snippet look better?
> > 
> > ---
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 17a0190bd88f..42af077ee07a 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -2307,11 +2307,14 @@ static void nvme_reset_work(struct work_struct *work)
> >  		goto out;
> >  
> >  	/*
> > -	 * If we're called to reset a live controller first shut it down before
> > -	 * moving on.
> > +	 * Ensure there is no timeout work in progress prior to forcefully
> > +	 * disabling the queue. There is no harm in disabling the device even
> > +	 * when it was already disabled, as this will forcefully reclaim any
> > +	 * IOs that are stuck due to blk-mq's timeout handling that prevents
> > +	 * timed-out requests from completing.
> >  	 */
> > -	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
> > -		nvme_dev_disable(dev, false);
> > +	nvme_sync_queues(&dev->ctrl);
> > +	nvme_dev_disable(dev, false);
> 
> That may not work reliably either.
> 
> For example, request A from NS_0 times out and is handled as
> RESET_TIMER; meanwhile request B from NS_1 times out and is handled as
> EH_HANDLED.

Meanwhile, request B's nvme_dev_disable prior to returning EH_HANDLED
cancels all requests, including request A.

> When the above reset work runs to handle the timeout of request B, a
> new timeout event for request A may arrive just between
> nvme_sync_queues() and nvme_dev_disable().

Syncing the queues either stops the timer from running or waits for it
to complete. We are in the RESETTING state: if request A's timeout
happens to be running, we're not restarting the timer; we're returning
EH_HANDLED.
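
Assuming nvme_sync_queues() (added elsewhere in this series; the core.c
hunk isn't quoted here) is essentially a blk_sync_queue() loop over the
namespaces, its shape would be something like:

	/* hypothetical sketch only; take the details as assumptions */
	void nvme_sync_queues(struct nvme_ctrl *ctrl)
	{
		struct nvme_ns *ns;

		list_for_each_entry(ns, &ctrl->namespaces, list)
			/* del_timer_sync() on the queue timer plus
			 * cancel_work_sync() on the timeout work */
			blk_sync_queue(ns->queue);
	}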

> Then nvme_dev_disable()
> can't cover request A, and the timeout handler for request A will
> eventually call nvme_dev_disable() while the current reset is still in
> progress: the reset can't move on, and an IO hang results.

At no point in the nvme reset are we waiting for any IO to complete;
the reset continues to make forward progress.
Ming Lei May 21, 2018, 4:08 p.m. UTC | #3
On Mon, May 21, 2018 at 09:59:09AM -0600, Keith Busch wrote:
> On Mon, May 21, 2018 at 11:25:43PM +0800, Ming Lei wrote:
> > On Mon, May 21, 2018 at 08:04:13AM -0600, Keith Busch wrote:
> > > On Sat, May 19, 2018 at 08:01:42AM +0800, Ming Lei wrote:
> > > > > You keep saying that, but the controller state is global to the
> > > > > controller. It doesn't matter which namespace's request_queue started
> > > > > the reset: every namespace's request_queue sees the RESETTING
> > > > > controller state.
> > > > 
> > > > When timeouts occur, the global RESETTING state may not have been
> > > > updated yet, so some of the timeout handlers may not observe that state.
> > > 
> > > Even prior to the RESETTING state, every single command, no matter
> > > which namespace or request_queue it came in on, is reclaimed by the
> > > driver. There *should* be no requests left to time out after
> > > nvme_dev_disable is called, because the nvme driver returned control of
> > > all requests in the tagset to blk-mq.
> > 
> > The timed-out requests won't be canceled by nvme_dev_disable().
> 
> ??? nvme_dev_disable cancels all started requests. There are no exceptions.

Please take a look at blk_mq_complete_request(). Even with Bart's
change, the request still won't be completed by the driver. A request
can only be completed by either the driver or blk-mq, not both.
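
The relevant behavior, heavily simplified (timeout_path_owns() is a
stand-in for the generation-number check, not a real function):

	void blk_mq_complete_request(struct request *rq)
	{
		/*
		 * If the timeout code has already claimed this request,
		 * the driver's completion is silently dropped: only one
		 * side may ever complete a request.
		 */
		if (timeout_path_owns(rq))
			return;
		__blk_mq_complete_request(rq);
	}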

> 
> > If a timed-out request is handled as RESET_TIMER, a new timeout
> > event may be triggered again.
> > 
> > > 
> > > In any case, if blk-mq decides it won't complete those requests, we
> > > can just swap the order in the reset_work: sync first, then
> > > unconditionally disable. Does the following snippet look better?
> > > 
> > > ---
> > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > > index 17a0190bd88f..42af077ee07a 100644
> > > --- a/drivers/nvme/host/pci.c
> > > +++ b/drivers/nvme/host/pci.c
> > > @@ -2307,11 +2307,14 @@ static void nvme_reset_work(struct work_struct *work)
> > >  		goto out;
> > >  
> > >  	/*
> > > -	 * If we're called to reset a live controller first shut it down before
> > > -	 * moving on.
> > > +	 * Ensure there is no timeout work in progress prior to forcefully
> > > +	 * disabling the queue. There is no harm in disabling the device even
> > > +	 * when it was already disabled, as this will forcefully reclaim any
> > > +	 * IOs that are stuck due to blk-mq's timeout handling that prevents
> > > +	 * timed-out requests from completing.
> > >  	 */
> > > -	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
> > > -		nvme_dev_disable(dev, false);
> > > +	nvme_sync_queues(&dev->ctrl);
> > > +	nvme_dev_disable(dev, false);
> > 
> > That may not work reliably either.
> > 
> > For example, request A from NS_0 times out and is handled as
> > RESET_TIMER; meanwhile request B from NS_1 times out and is handled as
> > EH_HANDLED.
> 
> Meanwhile, request B's nvme_dev_disable prior to returning EH_HANDLED
> cancels all requests, including request A.

No, please see my above comment.


Thanks,
Ming
Keith Busch May 21, 2018, 4:25 p.m. UTC | #4
On Tue, May 22, 2018 at 12:08:37AM +0800, Ming Lei wrote:
> Please take a look at blk_mq_complete_request(). Even with Bart's
> change, the request still won't be completed by the driver. A request
> can only be completed by either the driver or blk-mq, not both.

So you're saying blk-mq can't complete a request the driver returned to
blk-mq to complete. And that's the nvme driver's problem to fix?
Ming Lei May 22, 2018, 1:56 a.m. UTC | #5
On Mon, May 21, 2018 at 10:25:17AM -0600, Keith Busch wrote:
> On Tue, May 22, 2018 at 12:08:37AM +0800, Ming Lei wrote:
> > Please take a look at blk_mq_complete_request(). Even with Bart's
> > change, the request still won't be completed by the driver. A request
> > can only be completed by either the driver or blk-mq, not both.
> 
> So you're saying blk-mq can't complete a request the driver returned to
> blk-mq to complete. And that's the nvme driver's problem to fix?

To avoid use-after-free, a request can only be completed by one path:
either the timeout path, or normal completion (IRQ or cancellation)
from the driver.

So before its timeout is handled, the request has already been marked
as completed by the blk-mq timeout code, which means
nvme_cancel_request() can't cover this timed-out request.
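
For reference, nvme_cancel_request() is roughly (paraphrased from
drivers/nvme/host/core.c of this era):

	static void nvme_cancel_request(struct request *req, void *data,
					bool reserved)
	{
		nvme_req(req)->status = NVME_SC_ABORT_REQ;
		/* a no-op for a request the timeout path already owns */
		blk_mq_complete_request(req);
	}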

Thanks,
Ming

Patch

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 17a0190bd88f..42af077ee07a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2307,11 +2307,14 @@ static void nvme_reset_work(struct work_struct *work)
 		goto out;
 
 	/*
-	 * If we're called to reset a live controller first shut it down before
-	 * moving on.
+	 * Ensure there is no timeout work in progress prior to forcefully
+	 * disabling the queue. There is no harm in disabling the device even
+	 * when it was already disabled, as this will forcefully reclaim any
+	 * IOs that are stuck due to blk-mq's timeout handling that prevents
+	 * timed-out requests from completing.
 	 */
-	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
-		nvme_dev_disable(dev, false);
+	nvme_sync_queues(&dev->ctrl);
+	nvme_dev_disable(dev, false);
 
 	/*
 	 * Introduce CONNECTING state from nvme-fc/rdma transports to mark the