Message ID | 1599072130-10043-9-git-send-email-andrey.grodzovsky@amd.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
Headers | show |
Series | Implement PCI Error Recovery on Navi12 | expand |
On 9/2/20 11:42 AM, Andrey Grodzovsky wrote: > This reverts commit 6d2c89441571ea534d6240f7724f518936c44f8d. > > In the code bellow > > pci_walk_bus(bus, report_frozen_detected, &status); > - if (reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED) > + status = reset_link(dev, service); > > status returned from report_frozen_detected is unconditionally masked > by status returned from reset_link which is wrong. > > This breaks error recovery implementation for AMDGPU driver > by masking PCI_ERS_RESULT_NEED_RESET returned from amdgpu_pci_error_detected > and hence skiping slot reset callback which is necessary for proper > ASIC recovery. Effectively no other callback besides resume callback will > be called after link reset the way it is implemented now regardless of what > value error_detected callback returns. > } Instead of reverting this change, can you try following patch ? https://lore.kernel.org/linux-pci/56ad4901-725f-7b88-2117-b124b28b027f@linux.intel.com/T/#me8029c04f63c21f9d1cb3b1ba2aeffbca3a60df5
Yes, works also. Can you provide me a formal patch that i can commit into our local amd staging tree with my patch set ? Alex - is that how we want to do it, without this patch or reverting the original patch the feature is broken. Andrey On 9/2/20 3:00 PM, Kuppuswamy, Sathyanarayanan wrote: > > > On 9/2/20 11:42 AM, Andrey Grodzovsky wrote: >> This reverts commit 6d2c89441571ea534d6240f7724f518936c44f8d. >> >> In the code bellow >> >> pci_walk_bus(bus, report_frozen_detected, &status); >> - if (reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED) >> + status = reset_link(dev, service); >> >> status returned from report_frozen_detected is unconditionally masked >> by status returned from reset_link which is wrong. >> >> This breaks error recovery implementation for AMDGPU driver >> by masking PCI_ERS_RESULT_NEED_RESET returned from amdgpu_pci_error_detected >> and hence skiping slot reset callback which is necessary for proper >> ASIC recovery. Effectively no other callback besides resume callback will >> be called after link reset the way it is implemented now regardless of what >> value error_detected callback returns. >> > } > > Instead of reverting this change, can you try following patch ? > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flinux-pci%2F56ad4901-725f-7b88-2117-b124b28b027f%40linux.intel.com%2FT%2F%23me8029c04f63c21f9d1cb3b1ba2aeffbca3a60df5&data=02%7C01%7Candrey.grodzovsky%40amd.com%7C77325d6a2abc42d26ae608d84f726c51%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637346700170831846&sdata=JPo8lOXfjxpq%2BnmlVrSi93aZxGjIlbuh0rkZmNKkzQM%3D&reserved=0 > >
On 9/2/20 12:54 PM, Andrey Grodzovsky wrote: > Yes, works also. > > Can you provide me a formal patch that i can commit into our local amd staging tree with my patch set ? https://patchwork.kernel.org/patch/11684175/mbox/ > > Alex - is that how we want to do it, without this patch or reverting the original patch the feature > is broken. > > Andrey > > On 9/2/20 3:00 PM, Kuppuswamy, Sathyanarayanan wrote: >> >> >> On 9/2/20 11:42 AM, Andrey Grodzovsky wrote: >>> This reverts commit 6d2c89441571ea534d6240f7724f518936c44f8d. >>> >>> In the code bellow >>> >>> pci_walk_bus(bus, report_frozen_detected, &status); >>> - if (reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED) >>> + status = reset_link(dev, service); >>> >>> status returned from report_frozen_detected is unconditionally masked >>> by status returned from reset_link which is wrong. >>> >>> This breaks error recovery implementation for AMDGPU driver >>> by masking PCI_ERS_RESULT_NEED_RESET returned from amdgpu_pci_error_detected >>> and hence skiping slot reset callback which is necessary for proper >>> ASIC recovery. Effectively no other callback besides resume callback will >>> be called after link reset the way it is implemented now regardless of what >>> value error_detected callback returns. >>> >> } >> >> Instead of reverting this change, can you try following patch ? >> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flinux-pci%2F56ad4901-725f-7b88-2117-b124b28b027f%40linux.intel.com%2FT%2F%23me8029c04f63c21f9d1cb3b1ba2aeffbca3a60df5&data=02%7C01%7Candrey.grodzovsky%40amd.com%7C77325d6a2abc42d26ae608d84f726c51%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637346700170831846&sdata=JPo8lOXfjxpq%2BnmlVrSi93aZxGjIlbuh0rkZmNKkzQM%3D&reserved=0 >> >>
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index c543f41..81dd719 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -165,8 +165,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev, pci_dbg(dev, "broadcast error_detected message\n"); if (state == pci_channel_io_frozen) { pci_walk_bus(bus, report_frozen_detected, &status); - status = reset_link(dev); - if (status != PCI_ERS_RESULT_RECOVERED) { + if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED) { pci_warn(dev, "link reset failed\n"); goto failed; }
This reverts commit 6d2c89441571ea534d6240f7724f518936c44f8d. In the code bellow pci_walk_bus(bus, report_frozen_detected, &status); - if (reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED) + status = reset_link(dev, service); status returned from report_frozen_detected is unconditionally masked by status returned from reset_link which is wrong. This breaks error recovery implementation for AMDGPU driver by masking PCI_ERS_RESULT_NEED_RESET returned from amdgpu_pci_error_detected and hence skiping slot reset callback which is necessary for proper ASIC recovery. Effectively no other callback besides resume callback will be called after link reset the way it is implemented now regardless of what value error_detected callback returns. In general step 6.1.4 describing link reset unlike the other steps is not well defined in what are the expected return values and the appropriate next steps as it is for other stpes. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> --- drivers/pci/pcie/err.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)