[V3,7/8] nvme: pci: recover controller reliably

Message ID 20180503031716.31446-8-ming.lei@redhat.com (mailing list archive)
State New, archived

Commit Message

Ming Lei May 3, 2018, 3:17 a.m. UTC
While IO is being drained during controller reset, errors may still happen
and timeouts can still be triggered; in that situation the current
implementation cannot recover the controller any more. This patch fixes the
issue with the following approach:

- introduce eh_reset_work, and move draining IO and updating the controller
state into this work function.

- reset the controller just after nvme_dev_disable(), then schedule
eh_reset_work for draining IO & updating the controller state

- introduce a reset lock to sync between the above two parts

- move nvme_start_queues() into the part for resetting the controller,
so that a timeout event won't quiesce queues any more while IO is
being drained

Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 58 insertions(+), 3 deletions(-)

Comments

jianchao.wang May 3, 2018, 9:14 a.m. UTC | #1
Hi Ming

On 05/03/2018 11:17 AM, Ming Lei wrote:
>  static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
> @@ -1199,7 +1204,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>  	if (nvme_should_reset(dev, csts)) {
>  		nvme_warn_reset(dev, csts);
>  		nvme_dev_disable(dev, false, true);
> -		nvme_reset_ctrl(&dev->ctrl);
> +		nvme_eh_reset(dev);
>  		return BLK_EH_HANDLED;
>  	}
>  
> @@ -1242,7 +1247,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>  			 "I/O %d QID %d timeout, reset controller\n",
>  			 req->tag, nvmeq->qid);
>  		nvme_dev_disable(dev, false, true);
> -		nvme_reset_ctrl(&dev->ctrl);
> +		nvme_eh_reset(dev);

Without the 8th patch, invoking nvme_eh_reset in nvme_timeout is dangerous.
nvme_pre_reset_dev will send a lot of admin IO when initializing the controller.
If these admin IOs time out, nvme_timeout cannot handle them, because the timeout
work is sleeping, waiting for the admin IOs.

In addition, even if we take the nvme_wait_freeze out of nvme_eh_reset and put it into another context,
the ctrl state is still CONNECTING, so nvme_eh_reset cannot move forward.

Actually, I reported this issue to Keith before. I hit an IO hang when the controller died in
nvme_reset_work -> nvme_wait_freeze. As you know, nvme_reset_work cannot be scheduled again because it is waiting.
Here is Keith's commit for this:
http://lists.infradead.org/pipermail/linux-nvme/2018-February/015603.html

Thanks
Jianchao
Ming Lei May 3, 2018, 10:08 a.m. UTC | #2
On Thu, May 03, 2018 at 05:14:30PM +0800, jianchao.wang wrote:
> Hi ming
> 
> On 05/03/2018 11:17 AM, Ming Lei wrote:
> >  static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
> > @@ -1199,7 +1204,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
> >  	if (nvme_should_reset(dev, csts)) {
> >  		nvme_warn_reset(dev, csts);
> >  		nvme_dev_disable(dev, false, true);
> > -		nvme_reset_ctrl(&dev->ctrl);
> > +		nvme_eh_reset(dev);
> >  		return BLK_EH_HANDLED;
> >  	}
> >  
> > @@ -1242,7 +1247,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
> >  			 "I/O %d QID %d timeout, reset controller\n",
> >  			 req->tag, nvmeq->qid);
> >  		nvme_dev_disable(dev, false, true);
> > -		nvme_reset_ctrl(&dev->ctrl);
> > +		nvme_eh_reset(dev);
> 
> w/o the 8th patch, invoke nvme_eh_reset in nvme_timeout is dangerous.
> nvme_pre_reset_dev will send a lot of admin io when initialize the controller.
> if this admin ios timeout, the nvme_timeout cannot handle this because the timeout work is sleeping
> to wait admin ios.

I just tried not to make the 8th patch too big, but it looks like we have to
merge the two.

Once the EH kthread is introduced in the 8th patch, there isn't such an issue any more.

> 
> In addition, even if we take the nvme_wait_freeze out of nvme_eh_reset and put it into another context,
> but the ctrl state is still CONNECTING, the nvme_eh_reset cannot move forward.

nvme_eh_reset() can move on: if the controller state is either CONNECTING or
RESETTING, nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING) won't
be called in nvme_eh_reset(), and nvme_pre_reset_dev() will be called
directly, followed by scheduling of the eh_reset_work.

> 
> Actually, I used to report this issue to Keith. I met io hung when the controller die in
> nvme_reset_work -> nvme_wait_freeze. As you know, the nvme_reset_work cannot be scheduled because it is waiting.
> Here is Keith's commit for this:
> http://lists.infradead.org/pipermail/linux-nvme/2018-February/015603.html

There are actually two issues here, and both are covered in this patchset:

1) while nvme_wait_freeze() is draining IO, a controller error may
happen again; this patch can shut down & reset the controller again to
recover it.

2) queues may not be put into the freeze state in nvme_dev_disable(),
and then nvme_wait_freeze() will wait forever. This issue is addressed
by the 4th patch.

Both can be observed with blktests block/011 and are verified by this
patchset.

Thanks for your great review; please let me know if there are other
issues.

Thanks,
Ming
jianchao.wang May 3, 2018, 3:46 p.m. UTC | #3
Hi Ming

Thanks for your kindly response.

On 05/03/2018 06:08 PM, Ming Lei wrote:
> nvme_eh_reset() can move on, if controller state is either CONNECTING or
> RESETTING, nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING) won't
> be called in nvme_eh_reset(), and nvme_pre_reset_dev() will be called
> directly, then follows scheduling of the eh_reset_work.

We have to consider another context: nvme_reset_work.
There are two contexts that will invoke nvme_pre_reset_dev, and
nvme_eh_reset could invoke nvme_pre_reset_dev even if the state is RESETTING or CONNECTING.

nvme_error_handler                nvme_reset_ctrl
                                    -> change state to RESETTING
                                    -> queue reset work
                                  nvme_reset_work          
  -> nvme_dev_disable                                     
  -> nvme_eh_reset
    -> nvme_pre_reset_dev          -> nvme_pre_reset_dev
      -> change state to CONNECTING   -> change state fails
                                   -> nvme_remove_dead_ctrl
There seem to be other race scenarios between them as well.

Just invoking nvme_dev_disable in the nvme_error_handler context and handing over the other things
to nvme_reset_work, as in the v2 patch series, seems clearer.
The primary target of nvme_error_handler is to drain IOs and disable the controller;
reset_work could take over the rest of the recovery work.

IMO, changing the nvme_reset_ctrl_sync in nvme_error_handler to nvme_reset_ctrl should be OK.
If the admin IOs in reset_work time out, the nvme_error_handler could also help to complete them.

> There are actually two issues here, both are covered in this patchset:
> 
> 1) when nvme_wait_freeze() is for draining IO, controller error may
> happen again, this patch can shutdown & reset controller again to
> recover it.
>

We could split nvme_reset_work into two contexts instead of introducing another nvme_eh_reset.
Otherwise, the code looks really complicated.
In addition, the fix for this issue could be deferred until after your great EH kthread solution is submitted.

Thanks
Jianchao

> 2) there is also another issue: queues may not be put into freeze state
> in nvme_dev_disable(), then nvme_wait_freeze() will wait forever. This
> issue is addressed by 4th patch.
> 
> Both two can be observed in blktests block/011 and verified by this patches.
> 
> Thanks for your great review, please let me know if there are other
> issues.
Ming Lei May 4, 2018, 4:24 a.m. UTC | #4
On Thu, May 03, 2018 at 11:46:56PM +0800, jianchao.wang wrote:
> Hi Ming
> 
> Thanks for your kindly response.
> 
> On 05/03/2018 06:08 PM, Ming Lei wrote:
> > nvme_eh_reset() can move on, if controller state is either CONNECTING or
> > RESETTING, nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING) won't
> > be called in nvme_eh_reset(), and nvme_pre_reset_dev() will be called
> > directly, then follows scheduling of the eh_reset_work.
> 
> We have to consider another context, nvme_reset_work.
> There are two contexts that will invoke nvme_pre_reset_dev.
> nvme_eh_reset could invoke nvme_pre_reset_dev even if the state is RESETTING or CONNECTING.
> 
> nvme_error_handler                nvme_reset_ctrl
>                                     -> change state to RESETTING
>                                     -> queue reset work
>                                   nvme_reset_work          
>   -> nvme_dev_disable                                     
>   -> nvme_eh_reset
>     -> nvme_pre_reset_dev          -> nvme_pre_reset_dev
>       -> change state to CONNECTING   -> change state fails
>                                    -> nvme_remove_dead_ctrl
> There seems to be other race scenarios between them.

IMO it is fine to change the state change in nvme_pre_reset_dev() in the
following way:

if (dev->ctrl.state != NVME_CTRL_CONNECTING)
	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_CONNECTING))
		goto out;

Also, dev->reset_lock has been introduced, and we may hold it around
nvme_pre_reset_dev() in the nvme_reset_ctrl() path too; then the above
race doesn't exist any more.
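
For instance, something along these lines could work (a sketch only; the
nvme_pre_reset_dev_locked() wrapper is a made-up name and not part of the
posted patches, but both nvme_eh_reset() and the reset_work path could
call it):

/*
 * Illustrative sketch: serialize the two callers of nvme_pre_reset_dev()
 * with dev->reset_lock so that only one of them performs the transition
 * to CONNECTING.
 */
static int nvme_pre_reset_dev_locked(struct nvme_dev *dev)
{
	int ret = -EBUSY;

	mutex_lock(&dev->reset_lock);

	/* skip the state change if another path already moved to CONNECTING */
	if (dev->ctrl.state != NVME_CTRL_CONNECTING &&
	    !nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_CONNECTING))
		goto out;

	ret = nvme_pre_reset_dev(dev);
 out:
	mutex_unlock(&dev->reset_lock);
	return ret;
}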

> 
> Just invoke nvme_dev_disable in nvme_error_handler context and hand over the other things
> to nvme_reset_work as the v2 patch series seems clearer.

That way may not fix the current race: nvme_dev_disable() will
quiesce/freeze the queues again while resetting is in progress, and then
nothing can move on.

One big change in V3 is to run nvme_dev_disable() & nvme_pre_reset_dev()
in the EH thread, and to make sure queues are started once
nvme_pre_reset_dev() returns. But as you pointed out, it may not work well
enough if admin requests time out in nvme_pre_reset_dev().

> The primary target of nvme_error_handler is to drain IOs and disable controller.
> reset_work could take over the left things of the recovery work

Draining IOs isn't necessary; we could remove the queue freezing & wait_freeze()
from nvme_dev_disable() & nvme_reset_work() for the non-shutdown case. That
would make it easy to fix all these races, since the .eh_reset_work wouldn't
be needed any more: quiescing is enough to prevent IOs from entering the
driver/device.

The only issue is that more requests may be failed when the reset fails,
so I don't want to try that, and just follow the current behaviour.

IMO the primary goal of nvme EH is to recover the controller; that is
what nvme_dev_disable() and resetting should do. If IO isn't drained,
timeouts may still be triggered.

The big trouble is with percpu_ref: once the q_usage_counter is killed
to start freezing, so that IO is prevented from entering the block layer,
we have to run blk_mq_freeze_queue_wait() before unfreezing the usage
counter. If blk_mq_freeze_queue_wait() weren't needed, things would be
much easier for us.
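
To make the ordering constraint concrete, here is a minimal sketch of the
freeze life-cycle being discussed, using the existing helpers from
drivers/nvme/host/core.c (the example function itself is only an
illustration, not proposed code):

/* illustration only: the ordering constraint on the freeze helpers */
static void example_freeze_cycle(struct nvme_ctrl *ctrl)
{
	/* kill q_usage_counter: new IO now blocks in blk_queue_enter() */
	nvme_start_freeze(ctrl);

	/* ... shut down / reset the controller ... */

	/*
	 * The killed percpu_ref must drop to zero (all entered requests
	 * completed or requeued) before it may be re-initialized ...
	 */
	nvme_wait_freeze(ctrl);

	/* ... so only now can the queues be unfrozen for new IO */
	nvme_unfreeze(ctrl);
}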

> 
> IMO, modify the nvme_reset_ctrl_sync in nvme_error_handler to nvme_reset_ctrl should be ok.
> If the admin ios in reset_work timeout, the nvme_error_handler could also provide help to complete them.
> 

Right, that is one goal of this patchset.

> > There are actually two issues here, both are covered in this patchset:
> > 
> > 1) when nvme_wait_freeze() is for draining IO, controller error may
> > happen again, this patch can shutdown & reset controller again to
> > recover it.
> >
> 
> We could split the nvme_reset_work into two contexts instead of introduce another nvme_eh_reset.
> Otherwise, the code looks really complicated.

Yeah, that is what I plan to do in V4: introduce .eh_pre_reset_work and
.eh_post_reset_work, and run the following in .eh_pre_reset_work:

	- nvme_pre_reset_dev()
	- schedule .eh_post_reset_work

then drain IO & update the controller state in .eh_post_reset_work.
.eh_pre_reset_work will be scheduled from nvme_error_handler().

This way may address the issue of admin request timeouts in
nvme_pre_reset_dev(). What do you think of this approach?

Thanks,
Ming
jianchao.wang May 4, 2018, 6:10 a.m. UTC | #5
Hi Ming

On 05/04/2018 12:24 PM, Ming Lei wrote:
>> Just invoke nvme_dev_disable in nvme_error_handler context and hand over the other things
>> to nvme_reset_work as the v2 patch series seems clearer.
> That way may not fix the current race: nvme_dev_disable() will
> quiesce/freeze queue again when resetting is in-progress, then
> nothing can move on.

With timeout quiescing and the nvme EH kthread, nvme_dev_disable can ensure that all the
in-flight requests are handled. When we queue the reset_work in nvme_error_handler, there
will be no timeouts anymore.
If there is a timeout due to admin IO in reset_work, this is expected; reset_work will be
interrupted and fail.
If IO times out while reset_work waits for the freeze, things cannot move on because
reset_work is waiting. We have multiple methods to handle this:
1. as in Keith's solution, change the ctrl state to DELETING and fail the reset_work.
2. as in the current patch set, try to recover it, but then we have to split reset_work into two contexts.

Actually, there are many other places where nvme_dev_disable will be invoked.
> 
> One big change in V3 is to run nvme_dev_disable() & nvme_pre_reset_dev()
> in the EH thread, and make sure queues are started once
> nvme_pre_reset_dev() returns. But as you pointed, it may not work well
> enough if admin requests are timed-out in nvme_pre_reset_dev().
> 
>> The primary target of nvme_error_handler is to drain IOs and disable controller.
>> reset_work could take over the left things of the recovery work
> Draining IOs isn't necessary, we can remove freezing queues & wait_freeze()
> from nvme_dev_disable() & nvme_reset_work() for non-shutdown, that can be
> easy to fix all these races since the .eh_reset_work isn't needed any more,
> because quiescing is enough to prevent IOs from entering driver/device.
> 
> The only issue is that more requests may be failed when reset failed,
> so I don't want to try that, and just follows the current behaviour.
> 
> IMO the primary goal of nvme EH is to recover controller, that is
> what nvme_dev_disable() and resetting should do, if IO isn't drained,
> timeout still may be triggered.
> 
> The big trouble is about percpu_ref, once the q_usage_counter is killed
> to start freezing for preventing IO from entering block layer, we have
> to run blk_mq_freeze_queue_wait() before unfreezing the usage couner. If
> blk_mq_freeze_queue_wait() isn't needed, things can be much easier for us.


Sorry for my bad description. I mean nvme_dev_disable will drain all the
in-flight requests and requeue or fail them, so we will not get any timeouts again.

In fact, I talked with Keith before about removing the queue freezing & wait_freeze
for the non-shutdown case. There is an obstacle: nvme_dev_add -> blk_mq_update_nr_hw_queues.
When the ctrl comes back, the number of hw queues may have changed.
Even if we move the wait_freeze, blk_mq_update_nr_hw_queues will do the freezing itself,
as sketched below.
On the other hand, if we don't freeze the queues, the requests that are submitted to a
dying hw queue during the reset will be sacrificed.
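
Roughly (a paraphrased sketch of the block layer code of that era, not an
exact quote), blk_mq_update_nr_hw_queues() does the freeze on its own:

/*
 * Paraphrased sketch: blk_mq_update_nr_hw_queues() freezes (and thus
 * drains) every queue in the tag set before remapping the hw queues,
 * so moving wait_freeze() out of reset_work would not avoid the wait.
 */
void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
{
	struct request_queue *q;

	list_for_each_entry(q, &set->tag_list, tag_set_list)
		blk_mq_freeze_queue(q);	/* kill + wait on q_usage_counter */

	/* ... update set->nr_hw_queues and remap the sw/hw contexts ... */

	list_for_each_entry(q, &set->tag_list, tag_set_list)
		blk_mq_unfreeze_queue(q);
}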


> 
>> IMO, modify the nvme_reset_ctrl_sync in nvme_error_handler to nvme_reset_ctrl should be ok.
>> If the admin ios in reset_work timeout, the nvme_error_handler could also provide help to complete them.
>>
> Right, that is one goal of this patchset.
> 
>>> There are actually two issues here, both are covered in this patchset:
>>>
>>> 1) when nvme_wait_freeze() is for draining IO, controller error may
>>> happen again, this patch can shutdown & reset controller again to
>>> recover it.
>>>
>> We could split the nvme_reset_work into two contexts instead of introduce another nvme_eh_reset.
>> Otherwise, the code looks really complicated.
> Yeah, that is what I plan to do in V4, introduce .eh_pre_reset_work and
> .eh_post_reset_work, and run the following things in .eh_pre_reset_work:
> 
> 	- nvme_pre_reset_dev()
> 	- schedule .eh_post_reset_work
> 
> and run draining IO & updating controller state in .eh_post_reset_work.
> .eh_pre_reset_work will be scheduled in nvme_error_handler().
> 
> This way may address the issue of admin req timeout in nvme_pre_reset_dev(),
> what do you think of this approach?

nvme_error_handler should invoke nvme_reset_ctrl instead of introducing another interface.
Then it is more convenient to ensure that there will be only one resetting instance running.

Thanks
Jianchao
jianchao.wang May 4, 2018, 6:21 a.m. UTC | #6
Oh sorry.

On 05/04/2018 02:10 PM, jianchao.wang wrote:
> nvme_error_handler should invoke nvme_reset_ctrl instead of introducing another interface.
> Then it is more convenient to ensure that there will be only one resetting instance running.

The ctrl state is still RESETTING, so nvme_reset_ctrl cannot work.

Thanks
Jianchao
Ming Lei May 4, 2018, 8:02 a.m. UTC | #7
On Fri, May 04, 2018 at 02:10:19PM +0800, jianchao.wang wrote:
> Hi ming
> 
> On 05/04/2018 12:24 PM, Ming Lei wrote:
> >> Just invoke nvme_dev_disable in nvme_error_handler context and hand over the other things
> >> to nvme_reset_work as the v2 patch series seems clearer.
> > That way may not fix the current race: nvme_dev_disable() will
> > quiesce/freeze queue again when resetting is in-progress, then
> > nothing can move on.
> 
> With timeout quiesce and nvme EH kthread, the nvme_dev_disable could ensure all the in-flight
> requests to be handled. When we queue the reset_work in nvme_error_handler, there will be no
> timeout anymore.

Right.

> If there is timeout due to the admin io in reset_work, this is expected, reset_work will be
> interrupted and fail.

At least with V3, admin IO submitted from reset_work won't be handled
well. When the EH thread hangs in sync admin IO, nvme_eh_schedule()
can't wake up the EH thread any more, so the admin IO may hang forever
in reset_work. Then we have to run nvme_pre_reset_dev() in another
context.

> If the io timeout when reset_work wait freeze, because the reset_work is waiting, things cannot
> move on. we have multiple method to handle this:
> 1. as Keith's solution, change ctrl state to DELETING and fail the reset_work.
> 2. as current patch set, try to recovery it. but we have to split reset_work into two contexts.
>

Right, nvme_dev_disable() can guarantee forward progress, and we have to
make sure that nvme_dev_disable() is called when a timeout happens.

> Actually, there are many other places where the nvme_dev_disable will be invoked.   
> > 
> > One big change in V3 is to run nvme_dev_disable() & nvme_pre_reset_dev()
> > in the EH thread, and make sure queues are started once
> > nvme_pre_reset_dev() returns. But as you pointed, it may not work well
> > enough if admin requests are timed-out in nvme_pre_reset_dev().
> > 
> >> The primary target of nvme_error_handler is to drain IOs and disable controller.
> >> reset_work could take over the left things of the recovery work
> > Draining IOs isn't necessary, we can remove freezing queues & wait_freeze()
> > from nvme_dev_disable() & nvme_reset_work() for non-shutdown, that can be
> > easy to fix all these races since the .eh_reset_work isn't needed any more,
> > because quiescing is enough to prevent IOs from entering driver/device.
> > 
> > The only issue is that more requests may be failed when reset failed,
> > so I don't want to try that, and just follows the current behaviour.
> > 
> > IMO the primary goal of nvme EH is to recover controller, that is
> > what nvme_dev_disable() and resetting should do, if IO isn't drained,
> > timeout still may be triggered.
> > 
> > The big trouble is about percpu_ref, once the q_usage_counter is killed
> > to start freezing for preventing IO from entering block layer, we have
> > to run blk_mq_freeze_queue_wait() before unfreezing the usage couner. If
> > blk_mq_freeze_queue_wait() isn't needed, things can be much easier for us.
> 
> 
> Sorry for my bad description. I mean nvme_dev_disable will drain all the 
> in-flight requests and requeue or fail them, then we will not get any timeout again.

OK, got it, sorry for misunderstanding your point.

> 
> In fact, I used to talk with Keith about remove freezing queues & wait_freeze for non-shutdown.
> There is an obstacle, nvme_dev_add -> blk_mq_update_nr_hw_queues.
> When the ctrl comes back, the number of hw queues maybe changed.
> Even if we move the wait_freeze, blk_mq_update_nr_hw_queues will do that.
> On the other hand, if not freeze the queue, the requests that are submitted to the dying hw queue 
> during the resetting, will be sacrificed.  

Right.

> 
> 
> > 
> >> IMO, modify the nvme_reset_ctrl_sync in nvme_error_handler to nvme_reset_ctrl should be ok.
> >> If the admin ios in reset_work timeout, the nvme_error_handler could also provide help to complete them.
> >>
> > Right, that is one goal of this patchset.
> > 
> >>> There are actually two issues here, both are covered in this patchset:
> >>>
> >>> 1) when nvme_wait_freeze() is for draining IO, controller error may
> >>> happen again, this patch can shutdown & reset controller again to
> >>> recover it.
> >>>
> >> We could split the nvme_reset_work into two contexts instead of introduce another nvme_eh_reset.
> >> Otherwise, the code looks really complicated.
> > Yeah, that is what I plan to do in V4, introduce .eh_pre_reset_work and
> > .eh_post_reset_work, and run the following things in .eh_pre_reset_work:
> > 
> > 	- nvme_pre_reset_dev()
> > 	- schedule .eh_post_reset_work
> > 
> > and run draining IO & updating controller state in .eh_post_reset_work.
> > .eh_pre_reset_work will be scheduled in nvme_error_handler().
> > 
> > This way may address the issue of admin req timeout in nvme_pre_reset_dev(),
> > what do you think of this approach?
> 
> nvme_error_handler should invoke nvme_reset_ctrl instead of introducing another interface.
> Then it is more convenient to ensure that there will be only one resetting instance running.
> 

But as you mentioned above, reset_work has to be split into two
contexts to handle IO timeouts during wait_freeze in reset_work,
so a single instance of nvme_reset_ctrl() may not work well.


Thanks,
Ming
jianchao.wang May 4, 2018, 8:28 a.m. UTC | #8
Hi Ming

On 05/04/2018 04:02 PM, Ming Lei wrote:
>> nvme_error_handler should invoke nvme_reset_ctrl instead of introducing another interface.
>> Then it is more convenient to ensure that there will be only one resetting instance running.
>>
> But as you mentioned above, reset_work has to be splitted into two
> contexts for handling IO timeout during wait_freeze in reset_work,
> so single instance of nvme_reset_ctrl() may not work well.

I mean the EH kthread and the reset_work, both of which could reset the ctrl, instead of
the pre- and post-reset contexts.

Honestly, I doubt a bit whether it is worth trying to recover from [1].
The EH kthread solution could make things easier, but the code for recovering from [1] has
made things really complicated. It also makes it more difficult to unify nvme-pci, rdma and fc.

How about just failing the reset, as in Keith's solution?

[1] IO timeout when nvme_reset_work or the new nvme_post_reset_dev invokes nvme_wait_freeze.

Thanks
Jianchao
Ming Lei May 4, 2018, 9:16 a.m. UTC | #9
On Fri, May 04, 2018 at 04:28:23PM +0800, jianchao.wang wrote:
> Hi ming
> 
> On 05/04/2018 04:02 PM, Ming Lei wrote:
> >> nvme_error_handler should invoke nvme_reset_ctrl instead of introducing another interface.
> >> Then it is more convenient to ensure that there will be only one resetting instance running.
> >>
> > But as you mentioned above, reset_work has to be splitted into two
> > contexts for handling IO timeout during wait_freeze in reset_work,
> > so single instance of nvme_reset_ctrl() may not work well.
> 
> I mean the EH kthread and the reset_work which both could reset the ctrl instead of
> the pre and post rest context.

That may run nvme_pre_reset_dev() twice, from both the EH thread and
reset_work, which doesn't look good.

> 
> Honestly, I suspect a bit that whether it is worthy to try to recover from [1].
> The Eh kthread solution could make things easier, but the codes for recovery from [1] has
> made code really complicated. It is more difficult to unify the nvme-pci, rdma and fc. 

IMO the model is not complicated (one EH thread with two-stage resetting), and
it may be cloned to rdma, fc, ... without much difficulty.

The model is as follows (see the sketch below):

1) a single EH thread, in which the controller is guaranteed to be
shut down, and which is woken up whenever the controller needs to be
recovered.

2) 1st-stage resetting: recover the controller, which has to run in its
own work context, since admin commands issued for recovery may time out.

3) 2nd-stage resetting: drain IO & update the controller state to live,
which has to run in another context, since IOs during this stage may
time out too.

The reset lock is used to sync between the 1st-stage and 2nd-stage
resetting. The '1st stage resetting' is scheduled from the EH thread,
and the '2nd stage resetting' is scheduled after the '1st stage
resetting' is done.

The implementation might not be that complicated either, since V3 has
already partitioned reset_work into two parts; what we need to do in V4
is just to run each part in its own work item from the EH thread.
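
A hypothetical skeleton of that model (the names eh_wq, nvme_eh_required,
eh_pre_reset_work and eh_post_reset_work are illustrative only, not taken
from the posted patches):

static int nvme_eh_thread(void *data)
{
	struct nvme_dev *dev = data;

	while (!kthread_should_stop()) {
		wait_event_interruptible(dev->eh_wq,
			nvme_eh_required(dev) || kthread_should_stop());
		if (kthread_should_stop())
			break;

		/* the controller is guaranteed to be shut down here */
		nvme_dev_disable(dev, false, true);

		/* 1st stage: recover the controller; admin IO may time out */
		queue_work(nvme_reset_wq, &dev->eh_pre_reset_work);
	}
	return 0;
}

static void nvme_eh_pre_reset_work(struct work_struct *work)
{
	struct nvme_dev *dev =
		container_of(work, struct nvme_dev, eh_pre_reset_work);

	mutex_lock(&dev->reset_lock);
	if (!nvme_pre_reset_dev(dev)) {
		/* 2nd stage: drain IO & mark the controller live */
		queue_work(nvme_reset_wq, &dev->eh_post_reset_work);
	}
	mutex_unlock(&dev->reset_lock);
}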

> How about just fail the resetting as the Keith's solution ?
> 
> [1] io timeout when nvme_reset_work or the new nvme_post_reset_dev invoke nvme_wait_freeze.
> 

I'd suggest not changing the controller state to DELETING in this case;
otherwise it may be reported as another bug, in which the controller becomes
completely unusable. This issue can be easily triggered by blktests
block/011, and in that test case the controller is always recoverable.

I will post V4 and see whether all the current issues can be covered, and
how complicated the implementation is.

Thanks
Ming
Ming Lei May 5, 2018, 12:16 a.m. UTC | #10
On Fri, May 4, 2018 at 4:28 PM, jianchao.wang
<jianchao.w.wang@oracle.com> wrote:
> Hi ming
>
> On 05/04/2018 04:02 PM, Ming Lei wrote:
>>> nvme_error_handler should invoke nvme_reset_ctrl instead of introducing another interface.
>>> Then it is more convenient to ensure that there will be only one resetting instance running.
>>>
>> But as you mentioned above, reset_work has to be splitted into two
>> contexts for handling IO timeout during wait_freeze in reset_work,
>> so single instance of nvme_reset_ctrl() may not work well.
>
> I mean the EH kthread and the reset_work which both could reset the ctrl instead of
> the pre and post rest context.
>
> Honestly, I suspect a bit that whether it is worthy to try to recover from [1].
> The Eh kthread solution could make things easier, but the codes for recovery from [1] has
> made code really complicated. It is more difficult to unify the nvme-pci, rdma and fc.

Another choice may be nested EH, which should be easier to implement
(see the sketch below):

- run the whole recovery procedure (shutdown & reset) in one single context
- start a new context to handle a new timeout during the last recovery, in
the same way

The two approaches are just like sync IO vs AIO.
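
A rough sketch of that idea (all of the names below are made up for
illustration and do not come from any posted patch):

/*
 * Nested EH: if a timeout fires while one recovery is already running,
 * simply start another recovery context rather than waiting for the
 * stuck one, each level running the whole shutdown & reset procedure.
 */
static void nvme_nested_eh_schedule(struct nvme_dev *dev)
{
	struct task_struct *t;

	t = kthread_run(nvme_eh_recover, dev, "nvme-eh-%d",
			atomic_inc_return(&dev->nested_eh));
	if (IS_ERR(t))
		dev_warn(dev->ctrl.device,
			 "failed to start nested EH context\n");
}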

Thanks,
Ming Lei

Patch

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 16d7507bfd79..64bbccdb5091 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -71,6 +71,7 @@  struct nvme_queue;
 static void nvme_process_cq(struct nvme_queue *nvmeq);
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
 		freeze_queue);
+static int nvme_eh_reset(struct nvme_dev *dev);
 
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
@@ -113,6 +114,10 @@  struct nvme_dev {
 	dma_addr_t host_mem_descs_dma;
 	struct nvme_host_mem_buf_desc *host_mem_descs;
 	void **host_mem_desc_bufs;
+
+	/* reset for timeout */
+	struct mutex reset_lock;
+	struct work_struct eh_reset_work;
 };
 
 static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
@@ -1199,7 +1204,7 @@  static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	if (nvme_should_reset(dev, csts)) {
 		nvme_warn_reset(dev, csts);
 		nvme_dev_disable(dev, false, true);
-		nvme_reset_ctrl(&dev->ctrl);
+		nvme_eh_reset(dev);
 		return BLK_EH_HANDLED;
 	}
 
@@ -1242,7 +1247,7 @@  static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 			 "I/O %d QID %d timeout, reset controller\n",
 			 req->tag, nvmeq->qid);
 		nvme_dev_disable(dev, false, true);
-		nvme_reset_ctrl(&dev->ctrl);
+		nvme_eh_reset(dev);
 
 		/*
 		 * Mark the request as handled, since the inline shutdown
@@ -2424,6 +2429,10 @@  static int nvme_pre_reset_dev(struct nvme_dev *dev)
 	}
 
 	result = nvme_setup_io_queues(dev);
+	if (result)
+		goto out;
+
+	nvme_start_queues(&dev->ctrl);
  out:
 	return result;
 }
@@ -2432,6 +2441,7 @@  static void nvme_post_reset_dev(struct nvme_dev *dev)
 {
 	enum nvme_ctrl_state new_state = NVME_CTRL_LIVE;
 
+	mutex_lock(&dev->reset_lock);
 	/*
 	 * Keep the controller around but remove all namespaces if we don't have
 	 * any working I/O queue.
@@ -2442,8 +2452,12 @@  static void nvme_post_reset_dev(struct nvme_dev *dev)
 		nvme_remove_namespaces(&dev->ctrl);
 		new_state = NVME_CTRL_ADMIN_ONLY;
 	} else {
-		nvme_start_queues(&dev->ctrl);
+		mutex_unlock(&dev->reset_lock);
+
+		/* error may happen during draining IO again */
 		nvme_wait_freeze(&dev->ctrl);
+
+		mutex_lock(&dev->reset_lock);
 		/* hit this only when allocate tagset fails */
 		if (nvme_dev_add(dev))
 			new_state = NVME_CTRL_ADMIN_ONLY;
@@ -2461,10 +2475,12 @@  static void nvme_post_reset_dev(struct nvme_dev *dev)
 	}
 
 	nvme_start_ctrl(&dev->ctrl);
+	mutex_unlock(&dev->reset_lock);
 	return;
 
  out:
 	nvme_remove_dead_ctrl(dev, 0);
+	mutex_unlock(&dev->reset_lock);
 }
 
 static void nvme_reset_work(struct work_struct *work)
@@ -2485,6 +2501,43 @@  static void nvme_reset_work(struct work_struct *work)
 	nvme_post_reset_dev(dev);
 }
 
+static void nvme_eh_reset_work(struct work_struct *work)
+{
+	struct nvme_dev *dev =
+		container_of(work, struct nvme_dev, eh_reset_work);
+
+	nvme_post_reset_dev(dev);
+}
+
+static int nvme_eh_reset(struct nvme_dev *dev)
+{
+	int ret = -EBUSY;
+
+	mutex_lock(&dev->reset_lock);
+	if (dev->ctrl.state != NVME_CTRL_RESETTING &&
+	    dev->ctrl.state != NVME_CTRL_CONNECTING) {
+	    if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING))
+		goto exit;
+	}
+
+	ret = nvme_pre_reset_dev(dev);
+	if (ret) {
+		dev_warn(dev->ctrl.device,
+			"failed to pre-reset controller: %d\n", ret);
+		nvme_remove_dead_ctrl(dev, ret);
+		goto exit;
+	}
+
+	mutex_unlock(&dev->reset_lock);
+
+	queue_work(nvme_reset_wq, &dev->eh_reset_work);
+
+	return 0;
+ exit:
+	mutex_unlock(&dev->reset_lock);
+	return ret;
+}
+
 static void nvme_remove_dead_ctrl_work(struct work_struct *work)
 {
 	struct nvme_dev *dev = container_of(work, struct nvme_dev, remove_work);
@@ -2606,6 +2659,8 @@  static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (result)
 		goto put_pci;
 
+	mutex_init(&dev->reset_lock);
+	INIT_WORK(&dev->eh_reset_work, nvme_eh_reset_work);
 	INIT_WORK(&dev->ctrl.reset_work, nvme_reset_work);
 	INIT_WORK(&dev->remove_work, nvme_remove_dead_ctrl_work);
 	mutex_init(&dev->shutdown_lock);