Message ID: 20180503031716.31446-8-ming.lei@redhat.com (mailing list archive)
State: New, archived
Hi Ming,

On 05/03/2018 11:17 AM, Ming Lei wrote:
>  static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
> @@ -1199,7 +1204,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>  	if (nvme_should_reset(dev, csts)) {
>  		nvme_warn_reset(dev, csts);
>  		nvme_dev_disable(dev, false, true);
> -		nvme_reset_ctrl(&dev->ctrl);
> +		nvme_eh_reset(dev);
>  		return BLK_EH_HANDLED;
>  	}
>
> @@ -1242,7 +1247,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>  			 "I/O %d QID %d timeout, reset controller\n",
>  			 req->tag, nvmeq->qid);
>  		nvme_dev_disable(dev, false, true);
> -		nvme_reset_ctrl(&dev->ctrl);
> +		nvme_eh_reset(dev);

Without the 8th patch, invoking nvme_eh_reset() in nvme_timeout() is
dangerous. nvme_pre_reset_dev() sends a lot of admin I/O when initializing the
controller. If those admin I/Os time out, nvme_timeout() cannot handle them,
because the timeout work is sleeping while waiting for the admin I/Os.

In addition, even if we take nvme_wait_freeze() out of nvme_eh_reset() and put
it into another context, the ctrl state is still CONNECTING, so
nvme_eh_reset() cannot move forward.

Actually, I reported this issue to Keith before. I hit an I/O hang when the
controller died in nvme_reset_work -> nvme_wait_freeze. As you know,
nvme_reset_work cannot be scheduled further because it is waiting.

Here is Keith's commit for this:
http://lists.infradead.org/pipermail/linux-nvme/2018-February/015603.html

Thanks,
Jianchao
On Thu, May 03, 2018 at 05:14:30PM +0800, jianchao.wang wrote:
> Hi Ming
>
> On 05/03/2018 11:17 AM, Ming Lei wrote:
> > [...]
> > -	nvme_reset_ctrl(&dev->ctrl);
> > +	nvme_eh_reset(dev);
>
> Without the 8th patch, invoking nvme_eh_reset() in nvme_timeout() is
> dangerous. nvme_pre_reset_dev() sends a lot of admin I/O when initializing
> the controller. If those admin I/Os time out, nvme_timeout() cannot handle
> them, because the timeout work is sleeping while waiting for the admin I/Os.

I just tried to keep the 8th patch from getting too big, but it looks like we
have to merge the two. Once the EH kthread is introduced in the 8th patch,
that issue goes away.

> In addition, even if we take nvme_wait_freeze() out of nvme_eh_reset() and
> put it into another context, the ctrl state is still CONNECTING, so
> nvme_eh_reset() cannot move forward.

nvme_eh_reset() can move on: if the controller state is either CONNECTING or
RESETTING, nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING) won't be
called in nvme_eh_reset(), and nvme_pre_reset_dev() will be called directly,
followed by scheduling of the eh_reset_work.

> Actually, I reported this issue to Keith before. I hit an I/O hang when the
> controller died in nvme_reset_work -> nvme_wait_freeze. As you know,
> nvme_reset_work cannot be scheduled further because it is waiting.
> Here is Keith's commit for this:
> http://lists.infradead.org/pipermail/linux-nvme/2018-February/015603.html

There are actually two issues here, and both are covered in this patchset:

1) While nvme_wait_freeze() is draining I/O, a controller error may happen
   again; this patch can shut down & reset the controller again to recover it.

2) Queues may not be put into the freeze state in nvme_dev_disable(), and then
   nvme_wait_freeze() will wait forever. This issue is addressed by the 4th
   patch.

Both can be observed in blktests block/011 and are verified by these patches.

Thanks for your great review; please let me know if there are other issues.

Thanks,
Ming
Hi Ming,

Thanks for your kind response.

On 05/03/2018 06:08 PM, Ming Lei wrote:
> nvme_eh_reset() can move on: if the controller state is either CONNECTING or
> RESETTING, nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING) won't be
> called in nvme_eh_reset(), and nvme_pre_reset_dev() will be called directly,
> followed by scheduling of the eh_reset_work.

We have to consider another context: nvme_reset_work. There are two contexts
that will invoke nvme_pre_reset_dev(), and nvme_eh_reset() can invoke
nvme_pre_reset_dev() even if the state is RESETTING or CONNECTING:

nvme_error_handler                  nvme_reset_ctrl
                                      -> change state to RESETTING
                                      -> queue reset work
                                    nvme_reset_work
 -> nvme_dev_disable
 -> nvme_eh_reset
   -> nvme_pre_reset_dev              -> nvme_pre_reset_dev
   -> change state to CONNECTING      -> change state fails
                                      -> nvme_remove_dead_ctrl

There seem to be other race scenarios between them as well.

Just invoking nvme_dev_disable() in the nvme_error_handler context and handing
the other things over to nvme_reset_work, as the v2 patch series did, seems
clearer. The primary target of nvme_error_handler is to drain I/Os and disable
the controller; reset_work can take over the rest of the recovery work. IMO,
changing the nvme_reset_ctrl_sync() in nvme_error_handler to nvme_reset_ctrl()
should be OK. If the admin I/Os in reset_work time out, nvme_error_handler can
also help to complete them.

> There are actually two issues here, and both are covered in this patchset:
>
> 1) While nvme_wait_freeze() is draining I/O, a controller error may happen
>    again; this patch can shut down & reset the controller again to recover
>    it.

We could split nvme_reset_work into two contexts instead of introducing
another nvme_eh_reset(); otherwise the code looks really complicated. In
addition, the fix for this issue could be deferred until after your great EH
kthread solution is submitted.

Thanks,
Jianchao

> 2) Queues may not be put into the freeze state in nvme_dev_disable(), and
>    then nvme_wait_freeze() will wait forever. This issue is addressed by the
>    4th patch.
>
> Both can be observed in blktests block/011 and are verified by these patches.
>
> Thanks for your great review; please let me know if there are other issues.
On Thu, May 03, 2018 at 11:46:56PM +0800, jianchao.wang wrote:
> Hi Ming
>
> Thanks for your kind response.
>
> We have to consider another context: nvme_reset_work. There are two contexts
> that will invoke nvme_pre_reset_dev(), and nvme_eh_reset() can invoke
> nvme_pre_reset_dev() even if the state is RESETTING or CONNECTING.
> [race diagram snipped]
> There seem to be other race scenarios between them as well.

IMO it is fine to adjust the state change in nvme_pre_reset_dev() the
following way:

	if (dev->ctrl.state != NVME_CTRL_CONNECTING)
		if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_CONNECTING))
			goto out;

Also, dev->reset_lock has been introduced, and we may hold it for
nvme_pre_reset_dev() in the nvme_reset_ctrl() path too; then the above race
goes away.

> Just invoking nvme_dev_disable() in the nvme_error_handler context and
> handing the other things over to nvme_reset_work, as the v2 patch series
> did, seems clearer.

That way may not fix the current race: nvme_dev_disable() will quiesce/freeze
the queues again while resetting is in progress, and then nothing can move on.

One big change in V3 is to run nvme_dev_disable() & nvme_pre_reset_dev() in
the EH thread, and to make sure queues are started once nvme_pre_reset_dev()
returns. But as you pointed out, it may not work well enough if admin requests
time out in nvme_pre_reset_dev().

> The primary target of nvme_error_handler is to drain I/Os and disable the
> controller; reset_work can take over the rest of the recovery work.

Draining I/Os isn't strictly necessary: we could remove freezing queues &
wait_freeze() from nvme_dev_disable() & nvme_reset_work() for the non-shutdown
case. That would make all these races easy to fix, since .eh_reset_work
wouldn't be needed any more, because quiescing alone is enough to prevent I/Os
from entering the driver/device. The only downside is that more requests may
be failed when the reset fails, so I don't want to try that, and just follow
the current behaviour.

IMO the primary goal of nvme EH is to recover the controller, which is what
nvme_dev_disable() and resetting should do; if I/O isn't drained, timeouts may
still be triggered.

The big trouble is percpu_ref: once q_usage_counter is killed to start
freezing (preventing I/O from entering the block layer), we have to run
blk_mq_freeze_queue_wait() before unfreezing the usage counter. If
blk_mq_freeze_queue_wait() weren't needed, things would be much easier for us.

> IMO, changing the nvme_reset_ctrl_sync() in nvme_error_handler to
> nvme_reset_ctrl() should be OK. If the admin I/Os in reset_work time out,
> nvme_error_handler can also help to complete them.

Right, that is one goal of this patchset.

> We could split nvme_reset_work into two contexts instead of introducing
> another nvme_eh_reset(); otherwise the code looks really complicated.

Yeah, that is what I plan to do in V4: introduce .eh_pre_reset_work and
.eh_post_reset_work, and run the following in .eh_pre_reset_work:

- nvme_pre_reset_dev()
- schedule .eh_post_reset_work

and run draining I/O & updating the controller state in .eh_post_reset_work.
.eh_pre_reset_work will be scheduled from nvme_error_handler().

This way may address the issue of admin request timeouts in
nvme_pre_reset_dev(). What do you think of this approach?

Thanks,
Ming
Hi Ming,

On 05/04/2018 12:24 PM, Ming Lei wrote:
>> Just invoking nvme_dev_disable() in the nvme_error_handler context and
>> handing the other things over to nvme_reset_work, as the v2 patch series
>> did, seems clearer.
> That way may not fix the current race: nvme_dev_disable() will
> quiesce/freeze the queues again while resetting is in progress, and then
> nothing can move on.

With timeout quiescing and the nvme EH kthread, nvme_dev_disable() can ensure
that all in-flight requests are handled. When we queue the reset_work in
nvme_error_handler, there will be no timeouts any more.

If admin I/O in reset_work times out, that is expected: reset_work will be
interrupted and fail. If I/O times out while reset_work waits for the freeze,
things cannot move on, because reset_work itself is waiting. We have multiple
ways to handle this:
1. as in Keith's solution, change the ctrl state to DELETING and fail the
   reset_work;
2. as in the current patch set, try to recover, but then we have to split
   reset_work into two contexts.

Actually, there are many other places where nvme_dev_disable() is invoked.

> One big change in V3 is to run nvme_dev_disable() & nvme_pre_reset_dev() in
> the EH thread, and make sure queues are started once nvme_pre_reset_dev()
> returns. But as you pointed out, it may not work well enough if admin
> requests time out in nvme_pre_reset_dev().
>
> Draining I/Os isn't strictly necessary: we could remove freezing queues &
> wait_freeze() from nvme_dev_disable() & nvme_reset_work() for the
> non-shutdown case [...]
>
> The big trouble is percpu_ref: once q_usage_counter is killed to start
> freezing, we have to run blk_mq_freeze_queue_wait() before unfreezing the
> usage counter. If blk_mq_freeze_queue_wait() weren't needed, things would be
> much easier for us.

Sorry for my bad description. I meant that nvme_dev_disable() will drain all
in-flight requests and requeue or fail them, so we will not get any timeouts
again.

In fact, I talked with Keith before about removing freezing queues &
wait_freeze for the non-shutdown case. There is an obstacle:
nvme_dev_add -> blk_mq_update_nr_hw_queues. When the controller comes back,
the number of hw queues may have changed, so even if we remove the
wait_freeze, blk_mq_update_nr_hw_queues will do it anyway. On the other hand,
if we don't freeze the queues, requests submitted to a dying hw queue during
the resetting will be sacrificed.

Thanks,
Jianchao
Oh, sorry.

On 05/04/2018 02:10 PM, jianchao.wang wrote:
> nvme_error_handler should invoke nvme_reset_ctrl instead of introducing
> another interface. Then it is more convenient to ensure that there will be
> only one resetting instance running.

The ctrl state is still in the RESETTING state there, so nvme_reset_ctrl
cannot work.

Thanks,
Jianchao
On Fri, May 04, 2018 at 02:10:19PM +0800, jianchao.wang wrote:
> Hi Ming
>
> With timeout quiescing and the nvme EH kthread, nvme_dev_disable() can
> ensure that all in-flight requests are handled. When we queue the reset_work
> in nvme_error_handler, there will be no timeouts any more.

Right.

> If admin I/O in reset_work times out, that is expected: reset_work will be
> interrupted and fail.

At least with V3, admin I/O submitted in reset_work won't be handled well.
When the EH thread hangs in the synchronous admin I/O, nvme_eh_schedule()
can't wake up the EH thread any more, so the admin I/O may hang forever in
reset_work. We then have to run nvme_pre_reset_dev() in another context.

> If I/O times out while reset_work waits for the freeze, things cannot move
> on, because reset_work itself is waiting. We have multiple ways to handle
> this:
> 1. as in Keith's solution, change the ctrl state to DELETING and fail the
>    reset_work;
> 2. as in the current patch set, try to recover, but then we have to split
>    reset_work into two contexts.

Right, nvme_dev_disable() can guarantee forward progress, and we have to make
sure that nvme_dev_disable() is called when a timeout happens.

> Sorry for my bad description. I meant that nvme_dev_disable() will drain all
> in-flight requests and requeue or fail them, so we will not get any timeouts
> again.

OK, got it; sorry for misunderstanding your point.

> In fact, I talked with Keith before about removing freezing queues &
> wait_freeze for the non-shutdown case. There is an obstacle:
> nvme_dev_add -> blk_mq_update_nr_hw_queues. When the controller comes back,
> the number of hw queues may have changed, so even if we remove the
> wait_freeze, blk_mq_update_nr_hw_queues will do it anyway. On the other
> hand, if we don't freeze the queues, requests submitted to a dying hw queue
> during the resetting will be sacrificed.

Right.

> nvme_error_handler should invoke nvme_reset_ctrl instead of introducing
> another interface. Then it is more convenient to ensure that there will be
> only one resetting instance running.

But as you mentioned above, reset_work has to be split into two contexts to
handle I/O timeouts during wait_freeze in reset_work, so a single instance of
nvme_reset_ctrl() may not work well.

Thanks,
Ming
Hi Ming,

On 05/04/2018 04:02 PM, Ming Lei wrote:
>> nvme_error_handler should invoke nvme_reset_ctrl instead of introducing
>> another interface. Then it is more convenient to ensure that there will be
>> only one resetting instance running.
>
> But as you mentioned above, reset_work has to be split into two contexts to
> handle I/O timeouts during wait_freeze in reset_work, so a single instance
> of nvme_reset_ctrl() may not work well.

I meant that the EH kthread and the reset_work would both reset the ctrl,
instead of the pre- and post-reset contexts.

Honestly, I have some doubt about whether it is worthwhile to try to recover
from [1]. The EH kthread solution could make things easier, but the code for
recovering from [1] has made things really complicated, and it is more
difficult to unify nvme-pci, rdma, and fc. How about just failing the reset,
as in Keith's solution?

[1] I/O timeout while nvme_reset_work or the new nvme_post_reset_dev invokes
    nvme_wait_freeze.

Thanks,
Jianchao
On Fri, May 04, 2018 at 04:28:23PM +0800, jianchao.wang wrote:
> Hi Ming
>
> I meant that the EH kthread and the reset_work would both reset the ctrl,
> instead of the pre- and post-reset contexts.

That may run nvme_pre_reset_dev() twice, from both the EH thread and
reset_work, which doesn't look good.

> Honestly, I have some doubt about whether it is worthwhile to try to recover
> from [1]. The EH kthread solution could make things easier, but the code for
> recovering from [1] has made things really complicated, and it is more
> difficult to unify nvme-pci, rdma, and fc.

IMO the model is not complicated (one EH thread with two-stage resetting), and
it may be cloned to rdma, fc, ... without much difficulty. The model is:

1) A single EH thread, in which the controller is guaranteed to be shut down;
   the EH thread is woken up when the controller needs to be recovered.

2) 1st-stage resetting: recover the controller. This has to run in its own
   work context, since admin commands used for recovery may time out.

3) 2nd-stage resetting: drain I/O & update the controller state to LIVE. This
   has to run in another context, since I/Os during this stage may time out
   too.

The reset lock is used to synchronize the 1st-stage and 2nd-stage resetting.
The 1st-stage resetting is scheduled from the EH thread, and the 2nd-stage
resetting is scheduled after the 1st-stage resetting is done.

The implementation might not be so complicated either, since V3 has already
partitioned reset_work into two parts; what we need to do in V4 is just to run
each part in its own work item from the EH thread.

> How about just failing the reset, as in Keith's solution?
>
> [1] I/O timeout while nvme_reset_work or the new nvme_post_reset_dev invokes
>     nvme_wait_freeze.

I'd suggest not changing the controller state to DELETING in this case;
otherwise it may be reported as another bug, in which the controller becomes
completely unusable. This issue can easily be triggered by blktests block/011,
and in that test case the controller is always recoverable.

I will post V4, and we can see whether all the current issues are covered and
how complicated the implementation is.

Thanks,
Ming
On Fri, May 4, 2018 at 4:28 PM, jianchao.wang <jianchao.w.wang@oracle.com> wrote:
> Hi Ming
>
> Honestly, I have some doubt about whether it is worthwhile to try to recover
> from [1]. The EH kthread solution could make things easier, but the code for
> recovering from [1] has made things really complicated, and it is more
> difficult to unify nvme-pci, rdma, and fc.

Another choice may be nested EH, which should be easier to implement:

- run the whole recovery procedure (shutdown & reset) in one single context
- start a new context to handle any new timeout during the last recovery, in
  the same way

The two approaches are just like sync I/O vs. AIO.

Thanks,
Ming Lei
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 16d7507bfd79..64bbccdb5091 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -71,6 +71,7 @@ struct nvme_queue;
 static void nvme_process_cq(struct nvme_queue *nvmeq);
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
 		freeze_queue);
+static int nvme_eh_reset(struct nvme_dev *dev);
 
 /*
  * Represents an NVM Express device. Each nvme_dev is a PCI function.
@@ -113,6 +114,10 @@ struct nvme_dev {
 	dma_addr_t host_mem_descs_dma;
 	struct nvme_host_mem_buf_desc *host_mem_descs;
 	void **host_mem_desc_bufs;
+
+	/* reset for timeout */
+	struct mutex reset_lock;
+	struct work_struct eh_reset_work;
 };
 
 static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
@@ -1199,7 +1204,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	if (nvme_should_reset(dev, csts)) {
 		nvme_warn_reset(dev, csts);
 		nvme_dev_disable(dev, false, true);
-		nvme_reset_ctrl(&dev->ctrl);
+		nvme_eh_reset(dev);
 		return BLK_EH_HANDLED;
 	}
 
@@ -1242,7 +1247,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 			 "I/O %d QID %d timeout, reset controller\n",
 			 req->tag, nvmeq->qid);
 		nvme_dev_disable(dev, false, true);
-		nvme_reset_ctrl(&dev->ctrl);
+		nvme_eh_reset(dev);
 
 	/*
 	 * Mark the request as handled, since the inline shutdown
@@ -2424,6 +2429,10 @@ static int nvme_pre_reset_dev(struct nvme_dev *dev)
 	}
 
 	result = nvme_setup_io_queues(dev);
+	if (result)
+		goto out;
+
+	nvme_start_queues(&dev->ctrl);
 out:
 	return result;
 }
@@ -2432,6 +2441,7 @@ static void nvme_post_reset_dev(struct nvme_dev *dev)
 {
 	enum nvme_ctrl_state new_state = NVME_CTRL_LIVE;
 
+	mutex_lock(&dev->reset_lock);
 	/*
 	 * Keep the controller around but remove all namespaces if we don't have
 	 * any working I/O queue.
@@ -2442,8 +2452,12 @@ static void nvme_post_reset_dev(struct nvme_dev *dev)
 		nvme_remove_namespaces(&dev->ctrl);
 		new_state = NVME_CTRL_ADMIN_ONLY;
 	} else {
-		nvme_start_queues(&dev->ctrl);
+		mutex_unlock(&dev->reset_lock);
+
+		/* error may happen during draining IO again */
 		nvme_wait_freeze(&dev->ctrl);
+
+		mutex_lock(&dev->reset_lock);
 		/* hit this only when allocate tagset fails */
 		if (nvme_dev_add(dev))
 			new_state = NVME_CTRL_ADMIN_ONLY;
@@ -2461,10 +2475,12 @@ static void nvme_post_reset_dev(struct nvme_dev *dev)
 	}
 
 	nvme_start_ctrl(&dev->ctrl);
+	mutex_unlock(&dev->reset_lock);
 	return;
 
  out:
 	nvme_remove_dead_ctrl(dev, 0);
+	mutex_unlock(&dev->reset_lock);
 }
 
 static void nvme_reset_work(struct work_struct *work)
@@ -2485,6 +2501,43 @@ static void nvme_reset_work(struct work_struct *work)
 	nvme_post_reset_dev(dev);
 }
 
+static void nvme_eh_reset_work(struct work_struct *work)
+{
+	struct nvme_dev *dev =
+		container_of(work, struct nvme_dev, eh_reset_work);
+
+	nvme_post_reset_dev(dev);
+}
+
+static int nvme_eh_reset(struct nvme_dev *dev)
+{
+	int ret = -EBUSY;
+
+	mutex_lock(&dev->reset_lock);
+	if (dev->ctrl.state != NVME_CTRL_RESETTING &&
+	    dev->ctrl.state != NVME_CTRL_CONNECTING) {
+		if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING))
+			goto exit;
+	}
+
+	ret = nvme_pre_reset_dev(dev);
+	if (ret) {
+		dev_warn(dev->ctrl.device,
+			 "failed to pre-reset controller: %d\n", ret);
+		nvme_remove_dead_ctrl(dev, ret);
+		goto exit;
+	}
+
+	mutex_unlock(&dev->reset_lock);
+
+	queue_work(nvme_reset_wq, &dev->eh_reset_work);
+
+	return 0;
+ exit:
+	mutex_unlock(&dev->reset_lock);
+	return ret;
+}
+
 static void nvme_remove_dead_ctrl_work(struct work_struct *work)
 {
 	struct nvme_dev *dev = container_of(work, struct nvme_dev, remove_work);
@@ -2606,6 +2659,8 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (result)
 		goto put_pci;
 
+	mutex_init(&dev->reset_lock);
+	INIT_WORK(&dev->eh_reset_work, nvme_eh_reset_work);
 	INIT_WORK(&dev->ctrl.reset_work, nvme_reset_work);
 	INIT_WORK(&dev->remove_work, nvme_remove_dead_ctrl_work);
 	mutex_init(&dev->shutdown_lock);
During draining of I/O while resetting the controller, errors may still happen
and timeouts can be triggered; the current implementation can't recover the
controller in this situation. This patch fixes the issue with the following
approach:

- introduce eh_reset_work, and move draining I/O and updating the controller
  state into this work function
- reset the controller just after nvme_dev_disable(), then schedule
  eh_reset_work for draining I/O & updating the controller state
- introduce a reset lock to synchronize the above two parts
- move nvme_start_queues() into the part that resets the controller, so that
  a timeout event won't quiesce the queues any more while I/O is draining

Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 58 insertions(+), 3 deletions(-)