Message ID: 20240209050342.406184-1-nilay@linux.ibm.com (mailing list archive)
State: New, archived
Series: [RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset
On Fri, 2024-02-09 at 10:32 +0530, Nilay Shroff wrote:
> If the nvme subsystem reset causes the loss of communication to the nvme
> adapter then EEH could potentially recover the adapter. The detection of
> communication loss to the adapter only happens when the nvme driver
> attempts to read an MMIO register.
>
> The nvme subsystem reset command writes 0x4E564D65 to the NSSR register
> and schedules an adapter reset. If the nvme subsystem reset caused the
> loss of communication to the nvme adapter then either the IO timeout
> event or the adapter reset handler could detect it. If the IO timeout
> event detects the loss of communication then the EEH handler is able to
> recover the communication to the adapter. This change was implemented in
> 651438bb0af5 (nvme-pci: Fix EEH failure on ppc). However, if the adapter
> communication loss is detected in the nvme reset work handler then EEH is
> unable to successfully finish the adapter recovery.
>
> This patch ensures that:
> - the nvme driver reset handler observes that the pci channel is offline
>   after a failed MMIO read and avoids marking the controller state DEAD,
>   thus giving the EEH handler a fair chance to recover the nvme adapter.
>
> - if the nvme controller is already in RESETTING state and a pci channel
>   frozen error is detected then the nvme driver pci-error-handler code
>   sends the correct error code (PCI_ERS_RESULT_NEED_RESET) back to the
>   EEH handler so that the EEH handler can proceed with the pci slot reset.
>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>

Note that while in this case the issue was discovered via the Power EEH
handler, this is not a Power specific issue. The problem and fix are
applicable to all architectures.

> [  131.415601] EEH: Recovering PHB#40-PE#10000
> [  131.415619] EEH: PE location: N/A, PHB location: N/A
> [  131.415623] EEH: Frozen PHB#40-PE#10000 detected
> [  131.415627] EEH: Call Trace:
> [  131.415629] EEH: [c000000000051078] __eeh_send_failure_event+0x7c/0x15c
> [  131.415782] EEH: [c000000000049bdc] eeh_dev_check_failure.part.0+0x27c/0x6b0
> [  131.415789] EEH: [c000000000cb665c] nvme_pci_reg_read32+0x78/0x9c
> [  131.415802] EEH: [c000000000ca07f8] nvme_wait_ready+0xa8/0x18c
> [  131.415814] EEH: [c000000000cb7070] nvme_dev_disable+0x368/0x40c
> [  131.415823] EEH: [c000000000cb9970] nvme_reset_work+0x198/0x348
> [  131.415830] EEH: [c00000000017b76c] process_one_work+0x1f0/0x4f4
> [  131.415841] EEH: [c00000000017be2c] worker_thread+0x3bc/0x590
> [  131.415846] EEH: [c00000000018a46c] kthread+0x138/0x140
> [  131.415854] EEH: [c00000000000dd58] start_kernel_thread+0x14/0x18
> [  131.415864] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
> [  131.415874] EEH: Notify device drivers to shutdown
> [  131.415882] EEH: Beginning: 'error_detected(IO frozen)'
> [  131.415888] PCI 0040:01:00.0#10000: EEH: Invoking nvme->error_detected(IO frozen)
> [  131.415891] nvme nvme1: frozen state error detected, reset controller
> [  131.515358] nvme 0040:01:00.0: enabling device (0000 -> 0002)
> [  131.515778] nvme nvme1: Disabling device after reset failure: -19
> [  131.555336] PCI 0040:01:00.0#10000: EEH: nvme driver reports: 'disconnect'
> [  131.555343] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'disconnect'
> [  131.555371] EEH: Unable to recover from failure from PHB#40-PE#10000.
> [  131.555371] Please try reseating or replacing it
> [  131.556296] EEH: of node=0040:01:00.0
> [  131.556351] EEH: PCI device/vendor: 00251e0f
> [  131.556421] EEH: PCI cmd/status register: 00100142
> [  131.556428] EEH: PCI-E capabilities and status follow:
> [  131.556678] EEH: PCI-E 00: 0002b010 10008fe3 00002910 00436044
> [  131.556859] EEH: PCI-E 10: 10440000 00000000 00000000 00000000
> [  131.556869] EEH: PCI-E 20: 00000000
> [  131.556875] EEH: PCI-E AER capability register set follows:
> [  131.557115] EEH: PCI-E AER 00: 14820001 00000000 00400000 00462030
> [  131.557294] EEH: PCI-E AER 10: 00000000 0000e000 000002a0 00000000
> [  131.557469] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
> [  131.557523] EEH: PCI-E AER 30: 00000000 00000000
> [  131.558807] EEH: Beginning: 'error_detected(permanent failure)'
> [  131.558815] PCI 0040:01:00.0#10000: EEH: Invoking nvme->error_detected(permanent failure)
> [  131.558818] nvme nvme1: failure state error detected, request disconnect
> [  131.558839] PCI 0040:01:00.0#10000: EEH: nvme driver reports: 'disconnect'
> ---
>  drivers/nvme/host/pci.c | 16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index c1d6357ec98a..a6ba46e727ba 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2776,6 +2776,14 @@ static void nvme_reset_work(struct work_struct *work)
>   out_unlock:
>  	mutex_unlock(&dev->shutdown_lock);
>   out:
> +	/*
> +	 * If PCI recovery is ongoing then let it finish first
> +	 */
> +	if (pci_channel_offline(to_pci_dev(dev->dev))) {
> +		dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n");
> +		return;
> +	}
> +
>  	/*
>  	 * Set state to deleting now to avoid blocking nvme_wait_reset(), which
>  	 * may be holding this pci_dev's device lock.
> @@ -3295,9 +3303,11 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
>  	case pci_channel_io_frozen:
>  		dev_warn(dev->ctrl.device,
>  			"frozen state error detected, reset controller\n");
> -		if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
> -			nvme_dev_disable(dev, true);
> -			return PCI_ERS_RESULT_DISCONNECT;
> +		if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) {
> +			if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
> +				nvme_dev_disable(dev, true);
> +				return PCI_ERS_RESULT_DISCONNECT;
> +			}
>  		}
>  		nvme_dev_disable(dev, false);
>  		return PCI_ERS_RESULT_NEED_RESET;
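For reference, the "nvme subsystem-reset" command used throughout this thread reduces to a single ioctl on the controller character device; the kernel then performs the NSSR write described above. A minimal user-space sketch (the /dev/nvme0 path is an assumption and error handling is kept minimal):

/*
 * Minimal sketch of what "nvme subsystem-reset /dev/nvme0" boils down to:
 * open the controller character device and issue the subsystem-reset ioctl,
 * after which the driver writes 0x4E564D65 ("NVMe") to the NSSR register.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
	int fd = open("/dev/nvme0", O_RDWR);	/* controller char device (assumed path) */

	if (fd < 0) {
		perror("open /dev/nvme0");
		return 1;
	}
	if (ioctl(fd, NVME_IOCTL_SUBSYS_RESET) < 0) {
		perror("NVME_IOCTL_SUBSYS_RESET");
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}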
Hi, A gentle ping... Thanks, --Nilay On 2/16/24 18:07, Srimannarayana Murthy Maram wrote: > Hi all, > > Tested patch with upstream kernel "6.8.0-rc4" > > Issue verified on IBM power systems with manual test case has following steps > 1. Note nvme controllers for a nvme subsystem using > "nvme list-subsys" > 2. Perform nvme subsystem reset on each listed controller > under nvme subsystem one after the other, only after successful recovery. > > Verified it on power system with NVME device on normal and multipath (2 paths) configuration. > > Provided patch successfully recovered single controller(normal) and both controller(multipath) listed under nvme subsystem. > > Tested-by: Maram Srimannarayana Murthy<msmurthy@linux.vnet.ibm.com> > > Thank you, > Maram Srimannarayana Murthy > Sr. Test Engineer | IBM > > On 2/9/24 10:32, Nilay Shroff wrote: >> If the nvme subsyetm reset causes the loss of communication to the nvme >> adapter then EEH could potnetially recover the adapter. The detection of >> comminication loss to the adapter only happens when the nvme driver >> attempts to read an MMIO register. >> >> The nvme subsystem reset command writes 0x4E564D65 to NSSR register and >> schedule adapter reset.In the case nvme subsystem reset caused the loss >> of communication to the nvme adapter then either IO timeout event or >> adapter reset handler could detect it. If IO timeout even could detect >> loss of communication then EEH handler is able to recover the >> communication to the adapter. This change was implemented in 651438bb0af5 >> (nvme-pci: Fix EEH failure on ppc). However if the adapter communication >> loss is detected in nvme reset work handler then EEH is unable to >> successfully finish the adapter recovery. >> >> This patch ensures that, >> - nvme driver reset handler would observer pci channel was offline after >> a failed MMIO read and avoids marking the controller state to DEAD and >> thus gives a fair chance to EEH handler to recover the nvme adapter. >> >> - if nvme controller is already in RESETTNG state and pci channel frozen >> error is detected then nvme driver pci-error-handler code sends the >> correct error code (PCI_ERS_RESULT_NEED_RESET) back to the EEH handler >> so that EEH handler could proceed with the pci slot reset. >> >> Signed-off-by: Nilay Shroff<nilay@linux.ibm.com> >> >> [ 131.415601] EEH: Recovering PHB#40-PE#10000 >> [ 131.415619] EEH: PE location: N/A, PHB location: N/A >> [ 131.415623] EEH: Frozen PHB#40-PE#10000 detected >> [ 131.415627] EEH: Call Trace: >> [ 131.415629] EEH: [c000000000051078] __eeh_send_failure_event+0x7c/0x15c >> [ 131.415782] EEH: [c000000000049bdc] eeh_dev_check_failure.part.0+0x27c/0x6b0 >> [ 131.415789] EEH: [c000000000cb665c] nvme_pci_reg_read32+0x78/0x9c >> [ 131.415802] EEH: [c000000000ca07f8] nvme_wait_ready+0xa8/0x18c >> [ 131.415814] EEH: [c000000000cb7070] nvme_dev_disable+0x368/0x40c >> [ 131.415823] EEH: [c000000000cb9970] nvme_reset_work+0x198/0x348 >> [ 131.415830] EEH: [c00000000017b76c] process_one_work+0x1f0/0x4f4 >> [ 131.415841] EEH: [c00000000017be2c] worker_thread+0x3bc/0x590 >> [ 131.415846] EEH: [c00000000018a46c] kthread+0x138/0x140 >> [ 131.415854] EEH: [c00000000000dd58] start_kernel_thread+0x14/0x18 >> [ 131.415864] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures. 
>> [ 131.415874] EEH: Notify device drivers to shutdown >> [ 131.415882] EEH: Beginning: 'error_detected(IO frozen)' >> [ 131.415888] PCI 0040:01:00.0#10000: EEH: Invoking nvme->error_detected(IO frozen) >> [ 131.415891] nvme nvme1: frozen state error detected, reset controller >> [ 131.515358] nvme 0040:01:00.0: enabling device (0000 -> 0002) >> [ 131.515778] nvme nvme1: Disabling device after reset failure: -19 >> [ 131.555336] PCI 0040:01:00.0#10000: EEH: nvme driver reports: 'disconnect' >> [ 131.555343] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'disconnect' >> [ 131.555371] EEH: Unable to recover from failure from PHB#40-PE#10000. >> [ 131.555371] Please try reseating or replacing it >> [ 131.556296] EEH: of node=0040:01:00.0 >> [ 131.556351] EEH: PCI device/vendor: 00251e0f >> [ 131.556421] EEH: PCI cmd/status register: 00100142 >> [ 131.556428] EEH: PCI-E capabilities and status follow: >> [ 131.556678] EEH: PCI-E 00: 0002b010 10008fe3 00002910 00436044 >> [ 131.556859] EEH: PCI-E 10: 10440000 00000000 00000000 00000000 >> [ 131.556869] EEH: PCI-E 20: 00000000 >> [ 131.556875] EEH: PCI-E AER capability register set follows: >> [ 131.557115] EEH: PCI-E AER 00: 14820001 00000000 00400000 00462030 >> [ 131.557294] EEH: PCI-E AER 10: 00000000 0000e000 000002a0 00000000 >> [ 131.557469] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000 >> [ 131.557523] EEH: PCI-E AER 30: 00000000 00000000 >> [ 131.558807] EEH: Beginning: 'error_detected(permanent failure)' >> [ 131.558815] PCI 0040:01:00.0#10000: EEH: Invoking nvme->error_detected(permanent failure) >> [ 131.558818] nvme nvme1: failure state error detected, request disconnect >> [ 131.558839] PCI 0040:01:00.0#10000: EEH: nvme driver reports: 'disconnect' >> --- >> drivers/nvme/host/pci.c | 16 +++++++++++++--- >> 1 file changed, 13 insertions(+), 3 deletions(-) >> >> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c >> index c1d6357ec98a..a6ba46e727ba 100644 >> --- a/drivers/nvme/host/pci.c >> +++ b/drivers/nvme/host/pci.c >> @@ -2776,6 +2776,14 @@ static void nvme_reset_work(struct work_struct *work) >> out_unlock: >> mutex_unlock(&dev->shutdown_lock); >> out: >> + /* >> + * If PCI recovery is ongoing then let it finish first >> + */ >> + if (pci_channel_offline(to_pci_dev(dev->dev))) { >> + dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n"); >> + return; >> + } >> + >> /* >> * Set state to deleting now to avoid blocking nvme_wait_reset(), which >> * may be holding this pci_dev's device lock. >> @@ -3295,9 +3303,11 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev, >> case pci_channel_io_frozen: >> dev_warn(dev->ctrl.device, >> "frozen state error detected, reset controller\n"); >> - if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) { >> - nvme_dev_disable(dev, true); >> - return PCI_ERS_RESULT_DISCONNECT; >> + if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) { >> + if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) { >> + nvme_dev_disable(dev, true); >> + return PCI_ERS_RESULT_DISCONNECT; >> + } >> } >> nvme_dev_disable(dev, false); >> return PCI_ERS_RESULT_NEED_RESET;
On Fri, Feb 09, 2024 at 10:32:16AM +0530, Nilay Shroff wrote: > If the nvme subsyetm reset causes the loss of communication to the nvme > adapter then EEH could potnetially recover the adapter. The detection of > comminication loss to the adapter only happens when the nvme driver > attempts to read an MMIO register. > > The nvme subsystem reset command writes 0x4E564D65 to NSSR register and > schedule adapter reset.In the case nvme subsystem reset caused the loss > of communication to the nvme adapter then either IO timeout event or > adapter reset handler could detect it. If IO timeout even could detect > loss of communication then EEH handler is able to recover the > communication to the adapter. This change was implemented in 651438bb0af5 > (nvme-pci: Fix EEH failure on ppc). However if the adapter communication > loss is detected in nvme reset work handler then EEH is unable to > successfully finish the adapter recovery. > > This patch ensures that, > - nvme driver reset handler would observer pci channel was offline after > a failed MMIO read and avoids marking the controller state to DEAD and > thus gives a fair chance to EEH handler to recover the nvme adapter. > > - if nvme controller is already in RESETTNG state and pci channel frozen > error is detected then nvme driver pci-error-handler code sends the > correct error code (PCI_ERS_RESULT_NEED_RESET) back to the EEH handler > so that EEH handler could proceed with the pci slot reset. A subsystem reset takes the link down. I'm pretty sure the proper way to recover from it requires pcie hotplug support.
On 2/27/24 23:59, Keith Busch wrote: > On Fri, Feb 09, 2024 at 10:32:16AM +0530, Nilay Shroff wrote: >> If the nvme subsyetm reset causes the loss of communication to the nvme >> adapter then EEH could potnetially recover the adapter. The detection of >> comminication loss to the adapter only happens when the nvme driver >> attempts to read an MMIO register. >> >> The nvme subsystem reset command writes 0x4E564D65 to NSSR register and >> schedule adapter reset.In the case nvme subsystem reset caused the loss >> of communication to the nvme adapter then either IO timeout event or >> adapter reset handler could detect it. If IO timeout even could detect >> loss of communication then EEH handler is able to recover the >> communication to the adapter. This change was implemented in 651438bb0af5 >> (nvme-pci: Fix EEH failure on ppc). However if the adapter communication >> loss is detected in nvme reset work handler then EEH is unable to >> successfully finish the adapter recovery. >> >> This patch ensures that, >> - nvme driver reset handler would observer pci channel was offline after >> a failed MMIO read and avoids marking the controller state to DEAD and >> thus gives a fair chance to EEH handler to recover the nvme adapter. >> >> - if nvme controller is already in RESETTNG state and pci channel frozen >> error is detected then nvme driver pci-error-handler code sends the >> correct error code (PCI_ERS_RESULT_NEED_RESET) back to the EEH handler >> so that EEH handler could proceed with the pci slot reset. > > A subsystem reset takes the link down. I'm pretty sure the proper way to > recover from it requires pcie hotplug support. > Yes you're correct. We require pcie hotplugging to recover. However powerpc EEH handler could able to recover the pcie adapter without physically removing and re-inserting the adapter or in another words, it could reset adapter without hotplug activity. In fact, powerpc EEH could isolate pcie slot and resets it (i.e. resetting the PCI device holding the PCI #RST line high for two seconds), followed by setting up the device config space (the base address registers (BAR's), latency timer, cache line size, interrupt line, and so on). You may find more information about EEH recovery here: https://www.kernel.org/doc/Documentation/powerpc/eeh-pci-error-recovery.txt Typically when pcie error is detected and the EEH is able to recover the device, the EEH handler code goes through below sequence (assuming driver is EEH aware): eeh_handle_normal_event() eeh_set_channel_state()-> set state to pci_channel_io_frozen eeh_report_error() nvme_error_detected() -> channel state "pci_channel_io_frozen"; returns PCI_ERS_RESULT_NEED_RESET eeh_slot_reset() -> recovery successful nvme_slot_reset() -> returns PCI_ERS_RESULT_RECOVERED eeh_set_channel_state()-> set state to pci_channel_io_normal nvme_error_resume() In case pcie erorr is detected and the EEH is unable to recover the device, the EEH handler code goes through the below sequence: eeh_handle_normal_event() eeh_set_channel_state()-> set state to pci_channel_io_frozen eeh_report_error() nvme_error_detected() -> channel state pci_channel_io_frozen; returns PCI_ERS_RESULT_NEED_RESET eeh_slot_reset() -> recovery failed eeh_report_failure() nvme_error_detected()-> channel state pci_channel_io_perm_failure; returns PCI_ERS_RESULT_DISCONNECT eeh_set_channel_state()-> set state to pci_channel_io_perm_failure nvme_remove() If we execute the command "nvme subsystem-reset ..." 
and adapter communication is lost then in the current code (under nvme_reset_work()) we simply disable the device and mark the controller DEAD. However we may have a chance to recover the controller if driver is EEH aware and EEH recovery is underway. We already handle one such case in nvme_timeout(). So this patch ensures that if we fall through nvme_reset_work() post subsystem-reset and the EEH recovery is in progress then we give a chance to the EEH mechanism to recover the adapter. If in case the EEH recovery is unsuccessful then we'd anyway fall through code path I mentioned above where we invoke nvme_remove() at the end and delete the erring controller. With the proposed patch, we find that EEH recovery is successful post subsystem-reset. Please find below the relevant output: # lspci 0524:28:00.0 Non-Volatile memory controller: KIOXIA Corporation NVMe SSD Controller CM7 2.5" (rev 01) # nvme list-subsys nvme-subsys0 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:7DQ0A01206N3 hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988 iopolicy=numa \ +- nvme0 pcie 0524:28:00.0 live # nvme subsystem-reset /dev/nvme0 # nvme list-subsys nvme-subsys0 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:7DQ0A01206N3 hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988 iopolicy=numa \ +- nvme0 pcie 0524:28:00.0 resetting [10556.034082] EEH: Recovering PHB#524-PE#280000 [10556.034108] EEH: PE location: N/A, PHB location: N/A [10556.034112] EEH: Frozen PHB#524-PE#280000 detected [10556.034115] EEH: Call Trace: [10556.034117] EEH: [c000000000051068] __eeh_send_failure_event+0x7c/0x15c [10556.034304] EEH: [c000000000049bcc] eeh_dev_check_failure.part.0+0x27c/0x6b0 [10556.034310] EEH: [c008000004753d3c] nvme_pci_reg_read32+0x80/0xac [nvme] [10556.034319] EEH: [c0080000045f365c] nvme_wait_ready+0xa4/0x18c [nvme_core] [10556.034333] EEH: [c008000004754750] nvme_dev_disable+0x370/0x41c [nvme] [10556.034338] EEH: [c008000004757184] nvme_reset_work+0x1f4/0x3cc [nvme] [10556.034344] EEH: [c00000000017bb8c] process_one_work+0x1f0/0x4f4 [10556.034350] EEH: [c00000000017c24c] worker_thread+0x3bc/0x590 [10556.034355] EEH: [c00000000018a87c] kthread+0x138/0x140 [10556.034358] EEH: [c00000000000dd58] start_kernel_thread+0x14/0x18 [10556.034363] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures. [10556.034368] EEH: Notify device drivers to shutdown [10556.034371] EEH: Beginning: 'error_detected(IO frozen)' [10556.034376] PCI 0524:28:00.0#280000: EEH: Invoking nvme->error_detected(IO frozen) [10556.034379] nvme nvme0: frozen state error detected, reset controller [10556.102654] nvme 0524:28:00.0: enabling device (0000 -> 0002) [10556.103171] nvme nvme0: PCI recovery is ongoing so let it finish [10556.142532] PCI 0524:28:00.0#280000: EEH: nvme driver reports: 'need reset' [10556.142535] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset' [...] [...] 
[10556.148172] EEH: Reset without hotplug activity [10558.298672] EEH: Beginning: 'slot_reset' [10558.298692] PCI 0524:28:00.0#280000: EEH: Invoking nvme->slot_reset() [10558.298696] nvme nvme0: restart after slot reset [10558.301925] PCI 0524:28:00.0#280000: EEH: nvme driver reports: 'recovered' [10558.301928] EEH: Finished:'slot_reset' with aggregate recovery state:'recovered' [10558.301939] EEH: Notify device driver to resume [10558.301944] EEH: Beginning: 'resume' [10558.301947] PCI 0524:28:00.0#280000: EEH: Invoking nvme->resume() [10558.331051] nvme nvme0: Shutdown timeout set to 10 seconds [10558.356679] nvme nvme0: 16/0/0 default/read/poll queues [10558.357026] PCI 0524:28:00.0#280000: EEH: nvme driver reports: 'none' [10558.357028] EEH: Finished:'resume' [10558.357035] EEH: Recovery successful. # nvme list-subsys nvme-subsys0 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:7DQ0A01206N3 hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988 iopolicy=numa \ +- nvme0 pcie 0524:28:00.0 live Thanks, --Nilay
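The error_detected()/slot_reset()/resume() sequence described above is the generic PCI error-recovery contract from include/linux/pci.h that EEH (and AER/DPC) drives. The sketch below shows roughly how nvme-pci wires into it; it is reconstructed from memory of drivers/nvme/host/pci.c, and the exact set of callbacks may differ between kernel versions:

/*
 * Sketch (not verbatim driver code) of how nvme-pci plugs into the generic
 * PCI error-recovery flow. The recovery core calls these hooks in the order
 * described in the message above.
 */
static const struct pci_error_handlers nvme_err_handler = {
	.error_detected	= nvme_error_detected,	/* frozen / normal / perm_failure */
	.slot_reset	= nvme_slot_reset,	/* after the slot/link reset */
	.resume		= nvme_error_resume,	/* recovery finished, resume IO */
};

static struct pci_driver nvme_driver = {
	.name		= "nvme",
	/* ... id_table, probe, remove, ... */
	.err_handler	= &nvme_err_handler,
};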
Hi Keith, On 2/28/24 16:49, Nilay Shroff wrote: > > > On 2/27/24 23:59, Keith Busch wrote: >> On Fri, Feb 09, 2024 at 10:32:16AM +0530, Nilay Shroff wrote: >>> If the nvme subsyetm reset causes the loss of communication to the nvme >>> adapter then EEH could potnetially recover the adapter. The detection of >>> comminication loss to the adapter only happens when the nvme driver >>> attempts to read an MMIO register. >>> >>> The nvme subsystem reset command writes 0x4E564D65 to NSSR register and >>> schedule adapter reset.In the case nvme subsystem reset caused the loss >>> of communication to the nvme adapter then either IO timeout event or >>> adapter reset handler could detect it. If IO timeout even could detect >>> loss of communication then EEH handler is able to recover the >>> communication to the adapter. This change was implemented in 651438bb0af5 >>> (nvme-pci: Fix EEH failure on ppc). However if the adapter communication >>> loss is detected in nvme reset work handler then EEH is unable to >>> successfully finish the adapter recovery. >>> >>> This patch ensures that, >>> - nvme driver reset handler would observer pci channel was offline after >>> a failed MMIO read and avoids marking the controller state to DEAD and >>> thus gives a fair chance to EEH handler to recover the nvme adapter. >>> >>> - if nvme controller is already in RESETTNG state and pci channel frozen >>> error is detected then nvme driver pci-error-handler code sends the >>> correct error code (PCI_ERS_RESULT_NEED_RESET) back to the EEH handler >>> so that EEH handler could proceed with the pci slot reset. >> >> A subsystem reset takes the link down. I'm pretty sure the proper way to >> recover from it requires pcie hotplug support. >> > Yes you're correct. We require pcie hotplugging to recover. However powerpc EEH > handler could able to recover the pcie adapter without physically removing and > re-inserting the adapter or in another words, it could reset adapter without > hotplug activity. In fact, powerpc EEH could isolate pcie slot and resets it > (i.e. resetting the PCI device holding the PCI #RST line high for two seconds), > followed by setting up the device config space (the base address registers > (BAR's), latency timer, cache line size, interrupt line, and so on). 
> > You may find more information about EEH recovery here: > https://www.kernel.org/doc/Documentation/powerpc/eeh-pci-error-recovery.txt > > Typically when pcie error is detected and the EEH is able to recover the device, > the EEH handler code goes through below sequence (assuming driver is EEH aware): > > eeh_handle_normal_event() > eeh_set_channel_state()-> set state to pci_channel_io_frozen > eeh_report_error() > nvme_error_detected() -> channel state "pci_channel_io_frozen"; returns PCI_ERS_RESULT_NEED_RESET > eeh_slot_reset() -> recovery successful > nvme_slot_reset() -> returns PCI_ERS_RESULT_RECOVERED > eeh_set_channel_state()-> set state to pci_channel_io_normal > nvme_error_resume() > > In case pcie erorr is detected and the EEH is unable to recover the device, > the EEH handler code goes through the below sequence: > > eeh_handle_normal_event() > eeh_set_channel_state()-> set state to pci_channel_io_frozen > eeh_report_error() > nvme_error_detected() -> channel state pci_channel_io_frozen; returns PCI_ERS_RESULT_NEED_RESET > eeh_slot_reset() -> recovery failed > eeh_report_failure() > nvme_error_detected()-> channel state pci_channel_io_perm_failure; returns PCI_ERS_RESULT_DISCONNECT > eeh_set_channel_state()-> set state to pci_channel_io_perm_failure > nvme_remove() > > > If we execute the command "nvme subsystem-reset ..." and adapter communication is > lost then in the current code (under nvme_reset_work()) we simply disable the device > and mark the controller DEAD. However we may have a chance to recover the controller > if driver is EEH aware and EEH recovery is underway. We already handle one such case > in nvme_timeout(). So this patch ensures that if we fall through nvme_reset_work() > post subsystem-reset and the EEH recovery is in progress then we give a chance to the > EEH mechanism to recover the adapter. If in case the EEH recovery is unsuccessful then > we'd anyway fall through code path I mentioned above where we invoke nvme_remove() at > the end and delete the erring controller. > BTW, the similar issue was earlier fixed in 651438bb0af5(nvme-pci: Fix EEH failure on ppc). And that fix was needed due to the controller health check polling was removed in b2a0eb1a0ac72869(nvme-pci: Remove watchdog timer). In fact, today we may be able to recover the NVMe adapter if subsystem-reset or any other PCI error occurs and at the same time some I/O request is in flight. The recovery is possible due to the in flight I/O request would eventually timeout and the nvme_timeout() has special code (added in 651438bb0af5) that gives EEH a chance to recover the adapter. However later, in 1e866afd4bcdd(nvme: ensure subsystem reset is single threaded), the nvme subsystem reset code was reworked so now when user executes command subsystem-reset, kernel first writes 0x4E564D65 to nvme NSSR register and then schedules the adapter reset. It's quite possible that when subsytem-reset is executed there were no I/O in flight and hence we may never hit the nvme_timeout(). Later when the adapter reset code (under nvme_reset_work()) start execution, it accesses MMIO registers. Hence, IMO, potentially nvme_reset_work() would also need similar changes as implemented under nvme_timeout() so that EEH recovery could be possible. > With the proposed patch, we find that EEH recovery is successful post subsystem-reset. 
> Please find below the relevant output: > # lspci > 0524:28:00.0 Non-Volatile memory controller: KIOXIA Corporation NVMe SSD Controller CM7 2.5" (rev 01) > > # nvme list-subsys > nvme-subsys0 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:7DQ0A01206N3 > hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988 > iopolicy=numa > \ > +- nvme0 pcie 0524:28:00.0 live > > # nvme subsystem-reset /dev/nvme0 > > # nvme list-subsys > nvme-subsys0 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:7DQ0A01206N3 > hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988 > iopolicy=numa > \ > +- nvme0 pcie 0524:28:00.0 resetting > > [10556.034082] EEH: Recovering PHB#524-PE#280000 > [10556.034108] EEH: PE location: N/A, PHB location: N/A > [10556.034112] EEH: Frozen PHB#524-PE#280000 detected > [10556.034115] EEH: Call Trace: > [10556.034117] EEH: [c000000000051068] __eeh_send_failure_event+0x7c/0x15c > [10556.034304] EEH: [c000000000049bcc] eeh_dev_check_failure.part.0+0x27c/0x6b0 > [10556.034310] EEH: [c008000004753d3c] nvme_pci_reg_read32+0x80/0xac [nvme] > [10556.034319] EEH: [c0080000045f365c] nvme_wait_ready+0xa4/0x18c [nvme_core] > [10556.034333] EEH: [c008000004754750] nvme_dev_disable+0x370/0x41c [nvme] > [10556.034338] EEH: [c008000004757184] nvme_reset_work+0x1f4/0x3cc [nvme] > [10556.034344] EEH: [c00000000017bb8c] process_one_work+0x1f0/0x4f4 > [10556.034350] EEH: [c00000000017c24c] worker_thread+0x3bc/0x590 > [10556.034355] EEH: [c00000000018a87c] kthread+0x138/0x140 > [10556.034358] EEH: [c00000000000dd58] start_kernel_thread+0x14/0x18 > [10556.034363] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures. > [10556.034368] EEH: Notify device drivers to shutdown > [10556.034371] EEH: Beginning: 'error_detected(IO frozen)' > [10556.034376] PCI 0524:28:00.0#280000: EEH: Invoking nvme->error_detected(IO frozen) > [10556.034379] nvme nvme0: frozen state error detected, reset controller > [10556.102654] nvme 0524:28:00.0: enabling device (0000 -> 0002) > [10556.103171] nvme nvme0: PCI recovery is ongoing so let it finish > [10556.142532] PCI 0524:28:00.0#280000: EEH: nvme driver reports: 'need reset' > [10556.142535] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset' > [...] > [...] > [10556.148172] EEH: Reset without hotplug activity > [10558.298672] EEH: Beginning: 'slot_reset' > [10558.298692] PCI 0524:28:00.0#280000: EEH: Invoking nvme->slot_reset() > [10558.298696] nvme nvme0: restart after slot reset > [10558.301925] PCI 0524:28:00.0#280000: EEH: nvme driver reports: 'recovered' > [10558.301928] EEH: Finished:'slot_reset' with aggregate recovery state:'recovered' > [10558.301939] EEH: Notify device driver to resume > [10558.301944] EEH: Beginning: 'resume' > [10558.301947] PCI 0524:28:00.0#280000: EEH: Invoking nvme->resume() > [10558.331051] nvme nvme0: Shutdown timeout set to 10 seconds > [10558.356679] nvme nvme0: 16/0/0 default/read/poll queues > [10558.357026] PCI 0524:28:00.0#280000: EEH: nvme driver reports: 'none' > [10558.357028] EEH: Finished:'resume' > [10558.357035] EEH: Recovery successful. > > # nvme list-subsys > nvme-subsys0 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:7DQ0A01206N3 > hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988 > iopolicy=numa > \ > +- nvme0 pcie 0524:28:00.0 live > > Thanks, --Nilay
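For context, the nvme_timeout() check added by 651438bb0af5 that the message above refers to looks roughly like the sketch below (reconstructed from memory; the exact signature and placement vary across kernel versions). The patch in this thread adds an equivalent check to nvme_reset_work(), because after 1e866afd4bcd a subsystem reset can reach the reset handler with no IO in flight to hit this timeout path:

/*
 * Rough sketch of the early check in nvme_timeout() that defers to PCI
 * error recovery instead of escalating to a controller reset.
 */
static enum blk_eh_timer_return nvme_timeout(struct request *req)
{
	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
	struct nvme_dev *dev = nvmeq->dev;

	/*
	 * If EEH/AER has frozen the PCI channel, let PCI error recovery run;
	 * re-arm the timer rather than resetting the controller here.
	 */
	if (pci_channel_offline(to_pci_dev(dev->dev)))
		return BLK_EH_RESET_TIMER;

	/* ... normal timeout handling: abort the command or reset the controller ... */
	return BLK_EH_RESET_TIMER;
}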
Hi Keith and Christoph, On 2/27/24 23:59, Keith Busch wrote: > On Fri, Feb 09, 2024 at 10:32:16AM +0530, Nilay Shroff wrote: >> If the nvme subsyetm reset causes the loss of communication to the nvme >> adapter then EEH could potnetially recover the adapter. The detection of >> comminication loss to the adapter only happens when the nvme driver >> attempts to read an MMIO register. >> >> The nvme subsystem reset command writes 0x4E564D65 to NSSR register and >> schedule adapter reset.In the case nvme subsystem reset caused the loss >> of communication to the nvme adapter then either IO timeout event or >> adapter reset handler could detect it. If IO timeout even could detect >> loss of communication then EEH handler is able to recover the >> communication to the adapter. This change was implemented in 651438bb0af5 >> (nvme-pci: Fix EEH failure on ppc). However if the adapter communication >> loss is detected in nvme reset work handler then EEH is unable to >> successfully finish the adapter recovery. >> >> This patch ensures that, >> - nvme driver reset handler would observer pci channel was offline after >> a failed MMIO read and avoids marking the controller state to DEAD and >> thus gives a fair chance to EEH handler to recover the nvme adapter. >> >> - if nvme controller is already in RESETTNG state and pci channel frozen >> error is detected then nvme driver pci-error-handler code sends the >> correct error code (PCI_ERS_RESULT_NEED_RESET) back to the EEH handler >> so that EEH handler could proceed with the pci slot reset. > > A subsystem reset takes the link down. I'm pretty sure the proper way to > recover from it requires pcie hotplug support. This was working earlier in kernel version 6.0.0. We were able to recover the NVMe pcie adapater on powerpc after nvme subsystem reset assuming some IO were in flight when subsysetem reset happens. However starting kernel version 6.1.0 this is broken. I 've found the offending commit 1e866afd4bcd(nvme: ensure subsystem reset is single threaded) causing this issue on kernel version 6.1.0 and above. So this seems to be a regression and the proposed patch help fix this bug. 
Please find below logs captured for both working and non-working cases: Working case (kernel version 6.0.0): ----------------------------------- # uname -r 6.0.0 # nvme list-subsys nvme-subsys0 - NQN=nqn.1994-11.com.samsung:nvme:PM1735:2.5-inch:S6EUNA0R500358 hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2 iopolicy=numa \ +- nvme0 pcie 0018:01:00.0 live # nvme subsystem-reset /dev/nvme0 # dmesg <snip> <snip> [ 3215.658378] EEH: Recovering PHB#18-PE#10000 [ 3215.658401] EEH: PE location: N/A, PHB location: N/A [ 3215.658406] EEH: Frozen PHB#18-PE#10000 detected [ 3215.658409] EEH: Call Trace: [ 3215.658411] EEH: [c00000000005130c] __eeh_send_failure_event+0x7c/0x160 [ 3215.658577] EEH: [c00000000004a104] eeh_dev_check_failure.part.0+0x254/0x670 [ 3215.658583] EEH: [c0080000044e61bc] nvme_timeout+0x254/0x4f0 [nvme] [ 3215.658591] EEH: [c00000000078d840] blk_mq_check_expired+0xa0/0x130 [ 3215.658602] EEH: [c00000000079a118] bt_iter+0xf8/0x140 [ 3215.658609] EEH: [c00000000079b29c] blk_mq_queue_tag_busy_iter+0x3cc/0x720 [ 3215.658620] EEH: [c00000000078fe74] blk_mq_timeout_work+0x84/0x1c0 [ 3215.658633] EEH: [c000000000173b08] process_one_work+0x2a8/0x570 [ 3215.658644] EEH: [c000000000173e68] worker_thread+0x98/0x5e0 [ 3215.658655] EEH: [c000000000181504] kthread+0x124/0x130 [ 3215.658666] EEH: [c00000000000cbd4] ret_from_kernel_thread+0x5c/0x64 [ 3215.658672] EEH: This PCI device has failed 5 times in the last hour and will be permanently disabled after 5 failures. [ 3215.658677] EEH: Notify device drivers to shutdown [ 3215.658681] EEH: Beginning: 'error_detected(IO frozen)' [ 3215.658688] PCI 0018:01:00.0#10000: EEH: Invoking nvme->error_detected(IO frozen) [ 3215.658692] nvme nvme0: frozen state error detected, reset controller [ 3215.788089] PCI 0018:01:00.0#10000: EEH: nvme driver reports: 'need reset' [ 3215.788092] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset' <snip> <snip> [ 3215.790666] EEH: Reset without hotplug activity [ 3218.078715] EEH: Beginning: 'slot_reset' [ 3218.078729] PCI 0018:01:00.0#10000: EEH: Invoking nvme->slot_reset() [ 3218.078734] nvme nvme0: restart after slot reset [ 3218.081088] PCI 0018:01:00.0#10000: EEH: nvme driver reports: 'recovered' [ 3218.081090] EEH: Finished:'slot_reset' with aggregate recovery state:'recovered' [ 3218.081099] EEH: Notify device driver to resume [ 3218.081101] EEH: Beginning: 'resume' <snip> [ 3218.161027] EEH: Finished:'resume' [ 3218.161038] EEH: Recovery successful. 
# nvme list-subsys nvme-subsys0 - NQN=nqn.1994-11.com.samsung:nvme:PM1735:2.5-inch:S6EUNA0R500358 hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2 iopolicy=numa \ +- nvme0 pcie 0018:01:00.0 live Non-working case (kernel verion 6.1): ------------------------------------ # uname -r 6.1.0 # nvme list-subsys nvme-subsys0 - NQN=nqn.1994-11.com.samsung:nvme:PM1735:2.5-inch:S6EUNA0R500358 hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2 iopolicy=numa \ +- nvme0 pcie 0018:01:00.0 live # nvme subsystem-reset /dev/nvme0 #dmesg [ 177.578828] EEH: Recovering PHB#18-PE#10000 [ 177.578852] EEH: PE location: N/A, PHB location: N/A [ 177.578858] EEH: Frozen PHB#18-PE#10000 detected [ 177.578869] EEH: Call Trace: [ 177.578872] EEH: [c0000000000510bc] __eeh_send_failure_event+0x7c/0x160 [ 177.579206] EEH: [c000000000049eb4] eeh_dev_check_failure.part.0+0x254/0x670 [ 177.579212] EEH: [c008000004c261cc] nvme_timeout+0x254/0x4e0 [nvme] [ 177.579221] EEH: [c00000000079cb00] blk_mq_check_expired+0xa0/0x130 [ 177.579226] EEH: [c0000000007a9628] bt_iter+0xf8/0x140 [ 177.579231] EEH: [c0000000007aa79c] blk_mq_queue_tag_busy_iter+0x3bc/0x6e0 [ 177.579237] EEH: [c00000000079f324] blk_mq_timeout_work+0x84/0x1c0 [ 177.579241] EEH: [c000000000174a28] process_one_work+0x2a8/0x570 [ 177.579247] EEH: [c000000000174d88] worker_thread+0x98/0x5e0 [ 177.579253] EEH: [c000000000182454] kthread+0x124/0x130 [ 177.579257] EEH: [c00000000000cddc] ret_from_kernel_thread+0x5c/0x64 [ 177.579263] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures. [ 177.579269] EEH: Notify device drivers to shutdown [ 177.579272] EEH: Beginning: 'error_detected(IO frozen)' [ 177.579276] PCI 0018:01:00.0#10000: EEH: Invoking nvme->error_detected(IO frozen) [ 177.579279] nvme nvme0: frozen state error detected, reset controller [ 177.658746] nvme 0018:01:00.0: enabling device (0000 -> 0002) [ 177.658967] nvme 0018:01:00.0: iommu: 64-bit OK but direct DMA is limited by 800000800000000 [ 177.658982] nvme 0018:01:00.0: iommu: 64-bit OK but direct DMA is limited by 800000800000000 [ 177.659059] nvme nvme0: Removing after probe failure status: -19 [ 177.698719] PCI 0018:01:00.0#10000: EEH: nvme driver reports: 'need reset' [ 177.698723] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset' <snip> <snip> [ 179.999828] EEH: Beginning: 'slot_reset' [ 179.999840] PCI 0018:01:00.0#10000: EEH: no driver [ 179.999842] EEH: Finished:'slot_reset' with aggregate recovery state:'none' [ 179.999848] EEH: Notify device driver to resume [ 179.999850] EEH: Beginning: 'resume' [ 179.999853] PCI 0018:01:00.0#10000: EEH: no driver <snip> # nvme list-subsys <empty> Thanks, --Nilay
On Wed, Mar 06, 2024 at 04:50:10PM +0530, Nilay Shroff wrote:
> Hi Keith and Christoph,
Sorry for the delay, been very busy recently. I'll revisit this this
week.
On Fri, Feb 09, 2024 at 10:32:16AM +0530, Nilay Shroff wrote: > @@ -2776,6 +2776,14 @@ static void nvme_reset_work(struct work_struct *work) > out_unlock: > mutex_unlock(&dev->shutdown_lock); > out: > + /* > + * If PCI recovery is ongoing then let it finish first > + */ > + if (pci_channel_offline(to_pci_dev(dev->dev))) { > + dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n"); > + return; > + } > + > /* > * Set state to deleting now to avoid blocking nvme_wait_reset(), which > * may be holding this pci_dev's device lock. > @@ -3295,9 +3303,11 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev, > case pci_channel_io_frozen: > dev_warn(dev->ctrl.device, > "frozen state error detected, reset controller\n"); > - if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) { > - nvme_dev_disable(dev, true); > - return PCI_ERS_RESULT_DISCONNECT; > + if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) { > + if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) { > + nvme_dev_disable(dev, true); > + return PCI_ERS_RESULT_DISCONNECT; > + } > } > nvme_dev_disable(dev, false); > return PCI_ERS_RESULT_NEED_RESET; I get what you're trying to do, but it looks racy. The reset_work may finish before pci sets channel offline, or the error handling work happens to see RESETTING state, but then transitions to CONNECTING state after and deadlocks on the '.resume()' side. You are counting on a very specific sequence tied to the PCIe error handling module, and maybe you are able to count on that sequence for your platform in this unique scenario, but these link errors could happen anytime. And nvme subsystem reset is just odd, it's not clear how it was intended to be handled. It takes the links down so seems like it requires re-enumeration from a pcie hotplug driver, and that's kind of how it was expected to work here, but your platform has a special way to contain the link event and bring things back up the way they were before. And the fact you *require* IO to be in flight just so the timeout handler can dispatch a non-posted transaction 30 seconds later to trigger EEH is also odd. Why can't EEH just detect the link down event directly? This driver unfortunately doesn't handle errors during a reset well. Trying to handle that has been problematic, so the driver just bails if anything goes wrong at this critical initialization point. Maybe we need to address the reset/initialization failure handling more generically and delegate the teardown or retry decision to something else. Messing with that is pretty fragile right now, though. Or you could just re-enumerate the slot. I don't know, sorry my message is not really helping much to get this fixed.
On 3/9/24 21:14, Keith Busch wrote: > On Sat, Mar 09, 2024 at 07:59:11PM +0530, Nilay Shroff wrote: >> On 3/8/24 21:11, Keith Busch wrote: >>> On Fri, Feb 09, 2024 at 10:32:16AM +0530, Nilay Shroff wrote: >>>> @@ -2776,6 +2776,14 @@ static void nvme_reset_work(struct work_struct *work) >>>> out_unlock: >>>> mutex_unlock(&dev->shutdown_lock); >>>> out: >>>> + /* >>>> + * If PCI recovery is ongoing then let it finish first >>>> + */ >>>> + if (pci_channel_offline(to_pci_dev(dev->dev))) { >>>> + dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n"); >>>> + return; >>>> + } >>>> + >>>> /* >>>> * Set state to deleting now to avoid blocking nvme_wait_reset(), which >>>> * may be holding this pci_dev's device lock. >>>> @@ -3295,9 +3303,11 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev, >>>> case pci_channel_io_frozen: >>>> dev_warn(dev->ctrl.device, >>>> "frozen state error detected, reset controller\n"); >>>> - if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) { >>>> - nvme_dev_disable(dev, true); >>>> - return PCI_ERS_RESULT_DISCONNECT; >>>> + if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) { >>>> + if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) { >>>> + nvme_dev_disable(dev, true); >>>> + return PCI_ERS_RESULT_DISCONNECT; >>>> + } >>>> } >>>> nvme_dev_disable(dev, false); >>>> return PCI_ERS_RESULT_NEED_RESET; >>> >>> I get what you're trying to do, but it looks racy. The reset_work may >>> finish before pci sets channel offline, or the error handling work >>> happens to see RESETTING state, but then transitions to CONNECTING state >>> after and deadlocks on the '.resume()' side. You are counting on a very >>> specific sequence tied to the PCIe error handling module, and maybe you >>> are able to count on that sequence for your platform in this unique >>> scenario, but these link errors could happen anytime. >>> >> I am not sure about the deadlock in '.resume()' side you mentioned above. >> Did you mean that deadlock occur due to someone holding this pci_dev's device lock? >> Or deadlock occur due to the flush_work() from nvme_error_resume() would never >> return? > > Your patch may observe a ctrl in "RESETTING" state from > error_detected(), then disable the controller, which quiesces the admin > queue. Meanwhile, reset_work may proceed to CONNECTING state and try > nvme_submit_sync_cmd(), which blocks forever because no one is going to > unquiesce that admin queue. > OK I think I got your point. However, it seems that even without my patch the above mentioned deadlock could still be possible. Without my patch, if error_detcted() observe a ctrl in "RESETTING" state then it still invokes nvme_dev_disable(). The only difference with my patch is that error_detected() returns the PCI_ERS_RESULT_NEED_RESET instead of PCI_ERS_RESULT_DISCONNECT. Regarding the deadlock, it appears to me that reset_work races with nvme_dev_disable() and we may want to extend the shutdown_lock in reset_work so that nvme_dev_disable() can't interfere with admin queue while reset_work accesses the admin queue. I think we can fix this case. I would send PATCH v2 with this fix for review, however, please let me know if you have any other concern before I spin a new patch. Thanks, --Nilay
On Sun, Mar 10, 2024 at 12:35:06AM +0530, Nilay Shroff wrote:
> On 3/9/24 21:14, Keith Busch wrote:
> > Your patch may observe a ctrl in "RESETTING" state from
> > error_detected(), then disable the controller, which quiesces the admin
> > queue. Meanwhile, reset_work may proceed to CONNECTING state and try
> > nvme_submit_sync_cmd(), which blocks forever because no one is going to
> > unquiesce that admin queue.
> >
> OK I think I got your point. However, it seems that even without my patch
> the above mentioned deadlock could still be possible.

I sure hope not. The current design should guarantee forward progress on
devices that failed initialization.

> Without my patch, if error_detected() observes a ctrl in "RESETTING" state then
> it still invokes nvme_dev_disable(). The only difference with my patch is that
> error_detected() returns PCI_ERS_RESULT_NEED_RESET instead of PCI_ERS_RESULT_DISCONNECT.

There's one more subtle difference: that condition disables with the
'shutdown' parameter set to 'true', which accomplishes a couple of things:
all entered requests are flushed to their demise via the final unquiesce,
and all request_queues are killed, which forces error returns for all new
request allocations. No thread will be left waiting for something that
won't happen.
On 3/11/24 10:11, Keith Busch wrote: > On Sun, Mar 10, 2024 at 12:35:06AM +0530, Nilay Shroff wrote: >> On 3/9/24 21:14, Keith Busch wrote: >>> Your patch may observe a ctrl in "RESETTING" state from >>> error_detected(), then disable the controller, which quiesces the admin >>> queue. Meanwhile, reset_work may proceed to CONNECTING state and try >>> nvme_submit_sync_cmd(), which blocks forever because no one is going to >>> unquiesce that admin queue. >>> >> OK I think I got your point. However, it seems that even without my patch >> the above mentioned deadlock could still be possible. > > I sure hope not. The current design should guarnatee forward progress on > initialization failed devices. > >> Without my patch, if error_detcted() observe a ctrl in "RESETTING" state then >> it still invokes nvme_dev_disable(). The only difference with my patch is that >> error_detected() returns the PCI_ERS_RESULT_NEED_RESET instead of PCI_ERS_RESULT_DISCONNECT. > > There's one more subtle difference: that condition disables with the > 'shutdown' parameter set to 'true' which accomplishes a couple things: > all entered requests are flushed to their demise via the final > unquiesce, and all request_queue's are killed which forces error returns > for all new request allocations. No thread will be left waiting for > something that won't happen. > Aaargh, I didn't notice that subtle difference. I got your all points.. After thinking for a while (as you suggested) it seems that we potentially require to contain the race between reset_work and error_detected code path as both could run in parallel on different cpu when pci recovery initiates. So I'm thinking that if we can hold on the error_detected() code path to proceed until reset_work is finished ? Particularly, in error_detcted() function if we fall through the pci_channel_io_frozen case and the ctrl state is already RESETTING (so that means that reset_work shall be running) then hold on invoking nvme_dev_diable(dev, false) until reset_work is finished. The changes should be something as below: @@ -3295,10 +3304,13 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev, case pci_channel_io_frozen: dev_warn(dev->ctrl.device, "frozen state error detected, reset controller\n"); - if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) { - nvme_dev_disable(dev, true); - return PCI_ERS_RESULT_DISCONNECT; + if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) { + if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) { + nvme_dev_disable(dev, true); + return PCI_ERS_RESULT_DISCONNECT; + } } + flush_work(&dev->ctrl.reset_work); nvme_dev_disable(dev, false); return PCI_ERS_RESULT_NEED_RESET; case pci_channel_io_perm_failure: The flush_work() would ensure that we don't disable the ctrl if reset_work is running. If the rest_work is *not* running currently then flush_work() should return immediately. Moreover, if reset_work is scheduled or start running after flush_work() returns then reset_work should not be able to get upto the CONNECTING state because pci recovery is in progress and so it should fail early. On the reset_work side other than detecting pci error recovery, I think we also need one another change where in case the ctrl state is set to CONNECTING and we detect the pci error recovery in progress then before returning from the reset_work we set the ctrl state to RESETTING so that error_detected() could forward progress. 
The changes would be something like the below:

@@ -2776,6 +2776,16 @@ static void nvme_reset_work(struct work_struct *work)
 out_unlock:
 	mutex_unlock(&dev->shutdown_lock);
 out:
+	/*
+	 * If PCI recovery is ongoing then let it finish first
+	 */
+	if (pci_channel_offline(to_pci_dev(dev->dev))) {
+		dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n");
+		if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING)
+			WRITE_ONCE(dev->ctrl.state, NVME_CTRL_RESETTING);
+		return;
+	}
+
 	/*
 	 * Set state to deleting now to avoid blocking nvme_wait_reset(), which
 	 * may be holding this pci_dev's device lock.

Please let me know if you see anything wrong with the above changes.

Thanks,
--Nilay
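A small illustration of the ordering assumption behind the flush_work() proposal above (the helper name is hypothetical and this is not driver code): flush_work() blocks until any queued or already-running instance of the work item has completed, which is only safe when the caller is not the reset work itself, nor anything that work waits on; this is exactly the point questioned in the reply that follows.

/*
 * Hedged sketch of the intended ordering: wait out a concurrent reset before
 * quiescing, relying on struct nvme_dev / nvme_dev_disable() from the driver.
 */
static void wait_for_reset_then_quiesce(struct nvme_dev *dev)
{
	/* Block until any in-flight nvme_reset_work() has run to completion. */
	flush_work(&dev->ctrl.reset_work);

	/* Now quiesce without racing the reset path. */
	nvme_dev_disable(dev, false);
}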
On Mon, Mar 11, 2024 at 06:28:21PM +0530, Nilay Shroff wrote: > @@ -3295,10 +3304,13 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev, > case pci_channel_io_frozen: > dev_warn(dev->ctrl.device, > "frozen state error detected, reset controller\n"); > - if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) { > - nvme_dev_disable(dev, true); > - return PCI_ERS_RESULT_DISCONNECT; > + if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) { > + if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) { > + nvme_dev_disable(dev, true); > + return PCI_ERS_RESULT_DISCONNECT; > + } > } > + flush_work(&dev->ctrl.reset_work); I was messing with a similar idea. I wasn't sure if EEH calls the error handler inline with the error, in which case this would try to flush the work within the same work, which obviously doesn't work. As long as its called from a different thread, then this should be fine. > nvme_dev_disable(dev, false); > return PCI_ERS_RESULT_NEED_RESET; > case pci_channel_io_perm_failure: > > The flush_work() would ensure that we don't disable the ctrl if reset_work > is running. If the rest_work is *not* running currently then flush_work() should > return immediately. Moreover, if reset_work is scheduled or start running after > flush_work() returns then reset_work should not be able to get upto the CONNECTING > state because pci recovery is in progress and so it should fail early. > > On the reset_work side other than detecting pci error recovery, I think we also > need one another change where in case the ctrl state is set to CONNECTING and we > detect the pci error recovery in progress then before returning from the reset_work > we set the ctrl state to RESETTING so that error_detected() could forward progress. > The changes should be something as below: > > @@ -2776,6 +2776,16 @@ static void nvme_reset_work(struct work_struct *work) > out_unlock: > mutex_unlock(&dev->shutdown_lock); > out: > + /* > + * If PCI recovery is ongoing then let it finish first > + */ > + if (pci_channel_offline(to_pci_dev(dev->dev))) { > + dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n"); > + if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) > + WRITE_ONCE(dev->ctrl.state, NVME_CTRL_RESETTING); This may break the state machine, like if the device was hot removed during all this error handling. This will force the state back to RESETTING when it should be DEAD. I think what you need is just allow a controller to reset from a connecting state. Have to be careful that wouldn't break any other expectations, though.
On 3/12/24 20:00, Keith Busch wrote: > On Mon, Mar 11, 2024 at 06:28:21PM +0530, Nilay Shroff wrote: >> @@ -3295,10 +3304,13 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev, >> case pci_channel_io_frozen: >> dev_warn(dev->ctrl.device, >> "frozen state error detected, reset controller\n"); >> - if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) { >> - nvme_dev_disable(dev, true); >> - return PCI_ERS_RESULT_DISCONNECT; >> + if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) { >> + if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) { >> + nvme_dev_disable(dev, true); >> + return PCI_ERS_RESULT_DISCONNECT; >> + } >> } >> + flush_work(&dev->ctrl.reset_work); > > I was messing with a similar idea. I wasn't sure if EEH calls the error > handler inline with the error, in which case this would try to flush the > work within the same work, which obviously doesn't work. As long as its > called from a different thread, then this should be fine. The EEH recovery happens from different thread and so flush work should work here as expected. >> @@ -2776,6 +2776,16 @@ static void nvme_reset_work(struct work_struct *work) >> out_unlock: >> mutex_unlock(&dev->shutdown_lock); >> out: >> + /* >> + * If PCI recovery is ongoing then let it finish first >> + */ >> + if (pci_channel_offline(to_pci_dev(dev->dev))) { >> + dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n"); >> + if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) >> + WRITE_ONCE(dev->ctrl.state, NVME_CTRL_RESETTING); > > This may break the state machine, like if the device was hot removed > during all this error handling. This will force the state back to > RESETTING when it should be DEAD. Agreed, we shouldn't force reset state to RESETTING here. > > I think what you need is just allow a controller to reset from a > connecting state. Have to be careful that wouldn't break any other > expectations, though. Yeah ok got your point, so I have reworked the ctrl state machine as you suggested and tested the changes. The updated state machine code is shown below: @@ -546,10 +546,11 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl, break; case NVME_CTRL_RESETTING: switch (old_state) { case NVME_CTRL_NEW: case NVME_CTRL_LIVE: + case NVME_CTRL_CONNECTING: changed = true; fallthrough; default: break; } And accordingly updated reset_work function is show below. Here we ensure that even though the pci error recovery is in progress, if we couldn't move ctrl state to RESETTING then we would let rest_work forward progress and set the ctrl state to DEAD. @@ -2774,10 +2774,19 @@ static void nvme_reset_work(struct work_struct *work) return; out_unlock: mutex_unlock(&dev->shutdown_lock); out: + /* + * If PCI recovery is ongoing then let it finish first + */ + if (pci_channel_offline(to_pci_dev(dev->dev))) { + if (nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) { + dev_warn(dev->ctrl.device, "PCI recovery is ongoing, let it finish\n"); + return; + } + } /* * Set state to deleting now to avoid blocking nvme_wait_reset(), which * may be holding this pci_dev's device lock. */ dev_warn(dev->ctrl.device, "Disabling device after reset failure: %d\n", Now I have also included in my test the hot removal of NVMe adapter while EEH recovery is in progress. 
And the EEH recovery code handles this case well : When EEH recovery is in progress and if we hot removes the adapter (which is being recovered) then EEH handler would stop the recovery, set the PCI channel state to "pci_channel_io_perm_failure". Collected the logs of this case, (shown below): ----------------------------------------------- # nvme list-subsys nvme-subsys0 - NQN=nqn.1994-11.com.samsung:nvme:PM1735:2.5-inch:S6EUNA0R500358 hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2 iopolicy=numa # lspci 0018:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X # dmesg [ 561.639102] EEH: Recovering PHB#18-PE#10000 [ 561.639120] EEH: PE location: N/A, PHB location: N/A [ 561.639128] EEH: Frozen PHB#18-PE#10000 detected <snip> <snip> [ 561.639428] EEH: This PCI device has failed 2 times in the last hour and will be permanently disabled after 5 failures. [ 561.639441] EEH: Notify device drivers to shutdown [ 561.639450] EEH: Beginning: 'error_detected(IO frozen)' [ 561.639458] PCI 0018:01:00.0#10000: EEH: Invoking nvme->error_detected(IO frozen) [ 561.639462] nvme nvme0: frozen state error detected, reset controller [ 561.719078] nvme 0018:01:00.0: enabling device (0000 -> 0002) [ 561.719318] nvme nvme0: PCI recovery is ongoing so let it finish <!!!! WHILE EEH RECOVERY IS IN PROGRESS WE HOT REMOVE THE NVMe ADAPTER !!!!> [ 563.850328] rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1 <snip> [ 571.879092] PCI 0018:01:00.0#10000: EEH: nvme driver reports: 'need reset' [ 571.879097] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset' <snip> <snip> [ 571.881761] EEH: Reset without hotplug activity [ 574.039807] EEH: PHB#18-PE#10000: Slot inactive after reset: 0x2 (attempt 1) [ 574.309091] EEH: Failure 2 resetting PHB#18-PE#10000 (attempt 2) [ 574.309091] [ 574.579094] EEH: Failure 2 resetting PHB#18-PE#10000 (attempt 3) [ 574.579094] [ 574.579101] eeh_handle_normal_event: Cannot reset, err=-5 [ 574.579104] EEH: Unable to recover from failure from PHB#18-PE#10000. [ 574.579104] Please try reseating or replacing it <snip> <snip> [ 574.581314] EEH: Beginning: 'error_detected(permanent failure)' [ 574.581320] PCI 0018:01:00.0#10000: EEH: no device # lspci <empty> # nvme list-subsys <empty> Another case tested, when the reset_work is ongoing post subsystem-reset command and if user immediately hot removes the NVMe adapter then EEH recovery is *not* triggered and ctrl forward progress to the "DEAD" state. Collected the logs of this case, (shown below): ----------------------------------------------- # nvme list-subsys nvme-subsys0 - NQN=nqn.1994-11.com.samsung:nvme:PM1735:2.5-inch:S6EUNA0R500358 hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2 iopolicy=numa # lspci 0018:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X # nvme subsystem-reset /dev/nvme0 <!!!! HOT REMOVE THE NVMe ADAPTER !!!!> # dmesg [ 9967.143886] eeh_handle_normal_event: Cannot find PCI bus for PHB#18-PE#10000 [ 9967.224078] eeh_handle_normal_event: Cannot find PCI bus for PHB#18-PE#10000 <snip> [ 9967.223858] nvme 0018:01:00.0: enabling device (0000 -> 0002) [ 9967.224140] nvme nvme0: Disabling device after reset failure: -19 # lspci <empty> # nvme list-subsys <empty> Please let me know if the above changes look good to you. If it looks good then I would spin a new version of the patch and send for a review. Thanks, --Nilay
Hi Keith,

A gentle ping. I am not sure whether you have had a chance to review the
last email on this subject. Please let me know your feedback/thoughts.

Thanks,
--Nilay

On 3/13/24 17:29, Nilay Shroff wrote:
>
>
> On 3/12/24 20:00, Keith Busch wrote:
>> On Mon, Mar 11, 2024 at 06:28:21PM +0530, Nilay Shroff wrote:
>>> @@ -3295,10 +3304,13 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
>>>     case pci_channel_io_frozen:
>>>             dev_warn(dev->ctrl.device,
>>>                     "frozen state error detected, reset controller\n");
>>> -           if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
>>> -                   nvme_dev_disable(dev, true);
>>> -                   return PCI_ERS_RESULT_DISCONNECT;
>>> +           if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) {
>>> +                   if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
>>> +                           nvme_dev_disable(dev, true);
>>> +                           return PCI_ERS_RESULT_DISCONNECT;
>>> +                   }
>>>             }
>>> +           flush_work(&dev->ctrl.reset_work);
>>
>> I was messing with a similar idea. I wasn't sure if EEH calls the error
>> handler inline with the error, in which case this would try to flush the
>> work within the same work, which obviously doesn't work. As long as it's
>> called from a different thread, then this should be fine.
> The EEH recovery happens from a different thread, so the flush_work() here
> should work as expected.
>
>>> @@ -2776,6 +2776,16 @@ static void nvme_reset_work(struct work_struct *work)
>>>  out_unlock:
>>>     mutex_unlock(&dev->shutdown_lock);
>>>  out:
>>> +   /*
>>> +    * If PCI recovery is ongoing then let it finish first
>>> +    */
>>> +   if (pci_channel_offline(to_pci_dev(dev->dev))) {
>>> +           dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n");
>>> +           if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING)
>>> +                   WRITE_ONCE(dev->ctrl.state, NVME_CTRL_RESETTING);
>>
>> This may break the state machine, like if the device was hot removed
>> during all this error handling. This will force the state back to
>> RESETTING when it should be DEAD.
> Agreed, we shouldn't force the state to RESETTING here.
>>
>> I think what you need is just allow a controller to reset from a
>> connecting state. Have to be careful that wouldn't break any other
>> expectations, though.
> OK, got your point. I have reworked the ctrl state machine as you suggested
> and tested the changes. The updated state machine code is shown below:
>
> @@ -546,10 +546,11 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>                 break;
>         case NVME_CTRL_RESETTING:
>                 switch (old_state) {
>                 case NVME_CTRL_NEW:
>                 case NVME_CTRL_LIVE:
> +               case NVME_CTRL_CONNECTING:
>                         changed = true;
>                         fallthrough;
>                 default:
>                         break;
>                 }
>
> And the correspondingly updated reset_work function is shown below. Here we
> ensure that, even while PCI error recovery is in progress, if we couldn't
> move the ctrl state to RESETTING then reset_work still makes forward
> progress and sets the ctrl state to DEAD.
>
> @@ -2774,10 +2774,19 @@ static void nvme_reset_work(struct work_struct *work)
>                 return;
>
>  out_unlock:
>         mutex_unlock(&dev->shutdown_lock);
>  out:
> +       /*
> +        * If PCI recovery is ongoing then let it finish first
> +        */
> +       if (pci_channel_offline(to_pci_dev(dev->dev))) {
> +               if (nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
> +                       dev_warn(dev->ctrl.device, "PCI recovery is ongoing, let it finish\n");
> +                       return;
> +               }
> +       }
>         /*
>          * Set state to deleting now to avoid blocking nvme_wait_reset(), which
>          * may be holding this pci_dev's device lock.
>          */
>         dev_warn(dev->ctrl.device, "Disabling device after reset failure: %d\n",
>
> Now I have also included in my test the hot removal of the NVMe adapter
> while EEH recovery is in progress. And the EEH recovery code handles this
> case well: when EEH recovery is in progress and we hot-remove the adapter
> that is being recovered, the EEH handler stops the recovery and sets the
> PCI channel state to "pci_channel_io_perm_failure".
>
> Collected the logs of this case (shown below):
> -----------------------------------------------
> # nvme list-subsys
> nvme-subsys0 - NQN=nqn.1994-11.com.samsung:nvme:PM1735:2.5-inch:S6EUNA0R500358
>                hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2
>                iopolicy=numa
>
> # lspci
> 0018:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X
>
> # dmesg
> [  561.639102] EEH: Recovering PHB#18-PE#10000
> [  561.639120] EEH: PE location: N/A, PHB location: N/A
> [  561.639128] EEH: Frozen PHB#18-PE#10000 detected
> <snip>
> <snip>
> [  561.639428] EEH: This PCI device has failed 2 times in the last hour and will be permanently disabled after 5 failures.
> [  561.639441] EEH: Notify device drivers to shutdown
> [  561.639450] EEH: Beginning: 'error_detected(IO frozen)'
> [  561.639458] PCI 0018:01:00.0#10000: EEH: Invoking nvme->error_detected(IO frozen)
> [  561.639462] nvme nvme0: frozen state error detected, reset controller
> [  561.719078] nvme 0018:01:00.0: enabling device (0000 -> 0002)
> [  561.719318] nvme nvme0: PCI recovery is ongoing so let it finish
>
> <!!!! WHILE EEH RECOVERY IS IN PROGRESS WE HOT REMOVE THE NVMe ADAPTER !!!!>
>
> [  563.850328] rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
> <snip>
> [  571.879092] PCI 0018:01:00.0#10000: EEH: nvme driver reports: 'need reset'
> [  571.879097] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
> <snip>
> <snip>
> [  571.881761] EEH: Reset without hotplug activity
> [  574.039807] EEH: PHB#18-PE#10000: Slot inactive after reset: 0x2 (attempt 1)
> [  574.309091] EEH: Failure 2 resetting PHB#18-PE#10000 (attempt 2)
> [  574.309091]
> [  574.579094] EEH: Failure 2 resetting PHB#18-PE#10000 (attempt 3)
> [  574.579094]
> [  574.579101] eeh_handle_normal_event: Cannot reset, err=-5
> [  574.579104] EEH: Unable to recover from failure from PHB#18-PE#10000.
> [  574.579104] Please try reseating or replacing it
> <snip>
> <snip>
> [  574.581314] EEH: Beginning: 'error_detected(permanent failure)'
> [  574.581320] PCI 0018:01:00.0#10000: EEH: no device
>
> # lspci
> <empty>
>
> # nvme list-subsys
> <empty>
>
> Another case tested: when reset_work is ongoing after the subsystem-reset
> command and the user immediately hot-removes the NVMe adapter, EEH
> recovery is *not* triggered and the ctrl makes forward progress to the
> "DEAD" state.
>
> Collected the logs of this case (shown below):
> -----------------------------------------------
> # nvme list-subsys
> nvme-subsys0 - NQN=nqn.1994-11.com.samsung:nvme:PM1735:2.5-inch:S6EUNA0R500358
>                hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2
>                iopolicy=numa
>
> # lspci
> 0018:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X
>
> # nvme subsystem-reset /dev/nvme0
>
> <!!!! HOT REMOVE THE NVMe ADAPTER !!!!>
>
> # dmesg
> [ 9967.143886] eeh_handle_normal_event: Cannot find PCI bus for PHB#18-PE#10000
> [ 9967.224078] eeh_handle_normal_event: Cannot find PCI bus for PHB#18-PE#10000
> <snip>
> [ 9967.223858] nvme 0018:01:00.0: enabling device (0000 -> 0002)
> [ 9967.224140] nvme nvme0: Disabling device after reset failure: -19
>
> # lspci
> <empty>
>
> # nvme list-subsys
> <empty>
>
> Please let me know if the above changes look good to you. If so, I will
> spin a new version of the patch and send it for review.
>
> Thanks,
> --Nilay
>
>
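[Editorial note: to make the transition-table change quoted above easier to
see in isolation, here is a tiny standalone model in plain C. It is not the
kernel's nvme_change_ctrl_state(); the enum names are simplified stand-ins,
and only the rule for entering the RESETTING state is modelled, including
the newly allowed CONNECTING -> RESETTING transition.]

#include <stdbool.h>
#include <stdio.h>

enum ctrl_state { CTRL_NEW, CTRL_LIVE, CTRL_RESETTING, CTRL_CONNECTING,
                  CTRL_DELETING, CTRL_DEAD };

/* Which old states may enter RESETTING after the proposed rework. */
static bool can_enter_resetting(enum ctrl_state old_state)
{
        switch (old_state) {
        case CTRL_NEW:
        case CTRL_LIVE:
        case CTRL_CONNECTING:   /* the newly allowed transition */
                return true;
        default:
                return false;   /* DELETING, DEAD, already RESETTING, ... */
        }
}

int main(void)
{
        printf("CONNECTING -> RESETTING: %s\n",
               can_enter_resetting(CTRL_CONNECTING) ? "allowed" : "rejected");
        printf("DEAD       -> RESETTING: %s\n",
               can_enter_resetting(CTRL_DEAD) ? "allowed" : "rejected");
        return 0;
}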
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c1d6357ec98a..a6ba46e727ba 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2776,6 +2776,14 @@ static void nvme_reset_work(struct work_struct *work)
 out_unlock:
        mutex_unlock(&dev->shutdown_lock);
 out:
+       /*
+        * If PCI recovery is ongoing then let it finish first
+        */
+       if (pci_channel_offline(to_pci_dev(dev->dev))) {
+               dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n");
+               return;
+       }
+
        /*
         * Set state to deleting now to avoid blocking nvme_wait_reset(), which
         * may be holding this pci_dev's device lock.
@@ -3295,9 +3303,11 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
        case pci_channel_io_frozen:
                dev_warn(dev->ctrl.device,
                        "frozen state error detected, reset controller\n");
-               if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
-                       nvme_dev_disable(dev, true);
-                       return PCI_ERS_RESULT_DISCONNECT;
+               if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) {
+                       if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
+                               nvme_dev_disable(dev, true);
+                               return PCI_ERS_RESULT_DISCONNECT;
+                       }
                }
                nvme_dev_disable(dev, false);
                return PCI_ERS_RESULT_NEED_RESET;
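[Editorial note: the second hunk above is what changes the aggregate EEH
recovery state from 'disconnect' to 'need reset' in the logs. The small
standalone model below (plain C, not the driver; names are simplified
stand-ins for the kernel enums and for nvme_change_ctrl_state()) mirrors
that frozen-channel decision: only when the controller is not already
resetting and cannot be moved into RESETTING does the handler give up with
DISCONNECT; otherwise it asks EEH for a slot reset.]

#include <stdbool.h>
#include <stdio.h>

enum ctrl_state { CTRL_LIVE, CTRL_RESETTING, CTRL_CONNECTING, CTRL_DEAD };
enum ers_result { ERS_NEED_RESET, ERS_DISCONNECT };

/*
 * Simplified stand-in for nvme_change_ctrl_state(): only LIVE and
 * CONNECTING may enter RESETTING in this toy model.
 */
static bool try_enter_resetting(enum ctrl_state *state)
{
        if (*state == CTRL_LIVE || *state == CTRL_CONNECTING) {
                *state = CTRL_RESETTING;
                return true;
        }
        return false;
}

/* Mirrors the frozen-channel branch of the patched error_detected(). */
static enum ers_result frozen_error_detected(enum ctrl_state *state)
{
        if (*state != CTRL_RESETTING) {
                if (!try_enter_resetting(state))
                        return ERS_DISCONNECT;  /* e.g. controller already DEAD */
        }
        /* nvme_dev_disable(dev, false) would go here. */
        return ERS_NEED_RESET;                  /* let EEH reset the slot */
}

int main(void)
{
        enum ctrl_state s = CTRL_RESETTING;     /* reset_work already running */
        printf("already RESETTING -> %s\n",
               frozen_error_detected(&s) == ERS_NEED_RESET ? "need reset" : "disconnect");

        s = CTRL_DEAD;                          /* e.g. device gone */
        printf("DEAD              -> %s\n",
               frozen_error_detected(&s) == ERS_NEED_RESET ? "need reset" : "disconnect");
        return 0;
}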
If the nvme subsystem reset causes the loss of communication to the nvme
adapter then EEH could potentially recover the adapter. The detection of
communication loss to the adapter only happens when the nvme driver
attempts to read an MMIO register.

The nvme subsystem reset command writes 0x4E564D65 to the NSSR register and
schedules an adapter reset. If the nvme subsystem reset caused the loss of
communication to the nvme adapter then either an IO timeout event or the
adapter reset handler could detect it. If the IO timeout event detects the
loss of communication then the EEH handler is able to recover the
communication to the adapter. This change was implemented in 651438bb0af5
(nvme-pci: Fix EEH failure on ppc). However, if the adapter communication
loss is detected in the nvme reset work handler then EEH is unable to
successfully finish the adapter recovery.

This patch ensures that:

- the nvme driver reset handler observes that the pci channel is offline
  after a failed MMIO read and avoids marking the controller state DEAD,
  thus giving the EEH handler a fair chance to recover the nvme adapter.

- if the nvme controller is already in RESETTING state and a pci channel
  frozen error is detected then the nvme driver pci-error-handler code
  sends the correct error code (PCI_ERS_RESULT_NEED_RESET) back to the EEH
  handler so that the EEH handler can proceed with the pci slot reset.

Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>

[  131.415601] EEH: Recovering PHB#40-PE#10000
[  131.415619] EEH: PE location: N/A, PHB location: N/A
[  131.415623] EEH: Frozen PHB#40-PE#10000 detected
[  131.415627] EEH: Call Trace:
[  131.415629] EEH: [c000000000051078] __eeh_send_failure_event+0x7c/0x15c
[  131.415782] EEH: [c000000000049bdc] eeh_dev_check_failure.part.0+0x27c/0x6b0
[  131.415789] EEH: [c000000000cb665c] nvme_pci_reg_read32+0x78/0x9c
[  131.415802] EEH: [c000000000ca07f8] nvme_wait_ready+0xa8/0x18c
[  131.415814] EEH: [c000000000cb7070] nvme_dev_disable+0x368/0x40c
[  131.415823] EEH: [c000000000cb9970] nvme_reset_work+0x198/0x348
[  131.415830] EEH: [c00000000017b76c] process_one_work+0x1f0/0x4f4
[  131.415841] EEH: [c00000000017be2c] worker_thread+0x3bc/0x590
[  131.415846] EEH: [c00000000018a46c] kthread+0x138/0x140
[  131.415854] EEH: [c00000000000dd58] start_kernel_thread+0x14/0x18
[  131.415864] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[  131.415874] EEH: Notify device drivers to shutdown
[  131.415882] EEH: Beginning: 'error_detected(IO frozen)'
[  131.415888] PCI 0040:01:00.0#10000: EEH: Invoking nvme->error_detected(IO frozen)
[  131.415891] nvme nvme1: frozen state error detected, reset controller
[  131.515358] nvme 0040:01:00.0: enabling device (0000 -> 0002)
[  131.515778] nvme nvme1: Disabling device after reset failure: -19
[  131.555336] PCI 0040:01:00.0#10000: EEH: nvme driver reports: 'disconnect'
[  131.555343] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'disconnect'
[  131.555371] EEH: Unable to recover from failure from PHB#40-PE#10000.
[  131.555371] Please try reseating or replacing it
[  131.556296] EEH: of node=0040:01:00.0
[  131.556351] EEH: PCI device/vendor: 00251e0f
[  131.556421] EEH: PCI cmd/status register: 00100142
[  131.556428] EEH: PCI-E capabilities and status follow:
[  131.556678] EEH: PCI-E 00: 0002b010 10008fe3 00002910 00436044
[  131.556859] EEH: PCI-E 10: 10440000 00000000 00000000 00000000
[  131.556869] EEH: PCI-E 20: 00000000
[  131.556875] EEH: PCI-E AER capability register set follows:
[  131.557115] EEH: PCI-E AER 00: 14820001 00000000 00400000 00462030
[  131.557294] EEH: PCI-E AER 10: 00000000 0000e000 000002a0 00000000
[  131.557469] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[  131.557523] EEH: PCI-E AER 30: 00000000 00000000
[  131.558807] EEH: Beginning: 'error_detected(permanent failure)'
[  131.558815] PCI 0040:01:00.0#10000: EEH: Invoking nvme->error_detected(permanent failure)
[  131.558818] nvme nvme1: failure state error detected, request disconnect
[  131.558839] PCI 0040:01:00.0#10000: EEH: nvme driver reports: 'disconnect'
---
 drivers/nvme/host/pci.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)
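[Editorial note: the pci_channel_offline() test that the reset_work hunk of
this patch relies on reports the channel as offline for any non-normal
channel state, so both the frozen (recoverable) case and the
permanent-failure (e.g. hot-removed) case take the early-exit path. The
standalone model below uses simplified stand-in names, not the kernel's
pci_dev or pci_channel_state_t, and only illustrates that behaviour as
understood from the discussion above.]

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-ins for the kernel's pci channel states. */
enum channel_state { CHANNEL_NORMAL, CHANNEL_FROZEN, CHANNEL_PERM_FAILURE };

struct pci_dev_model {
        enum channel_state error_state;
};

/* Models the offline check: anything but "normal" counts as offline. */
static bool channel_offline(const struct pci_dev_model *pdev)
{
        return pdev->error_state != CHANNEL_NORMAL;
}

int main(void)
{
        struct pci_dev_model dev = { .error_state = CHANNEL_FROZEN };

        /* reset_work's out: path defers to EEH recovery in this case. */
        printf("frozen       -> offline: %d\n", channel_offline(&dev));

        dev.error_state = CHANNEL_PERM_FAILURE; /* e.g. adapter hot-removed */
        printf("perm failure -> offline: %d\n", channel_offline(&dev));

        dev.error_state = CHANNEL_NORMAL;
        printf("normal       -> offline: %d\n", channel_offline(&dev));
        return 0;
}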