Message ID | 20240617200411.1426554-3-terry.bowman@amd.com |
---|---|
State | New, archived |
Series | Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports |
On Mon, 17 Jun 2024 15:04:04 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> The AER service driver clears the AER correctable error (CE) status before
> calling the correctable error handler. This results in the error's status
> not correctly reflected if read from the CE handler.
>
> The AER CE status is needed by the portdrv's CE handler. The portdrv's
> CE handler is intended to only call the registered notifier callbacks
> if the CE error status has correctable internal error (CIE) set.
>
> This is not a problem for AER uncorrrectbale errors (UCE). The UCE status

uncorrectable

> is still present in the AER capability and available for reading, if
> needed, when the UCE handler is called.

I'm seeing the clear in the DPC path for UCE. For other cases is
it a side effect of the reset?

>
> Change the order of clearing the CE status and calling the CE handler.
> Make it to call the CE handler first and then clear the CE status
> after returning.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-pci@vger.kernel.org

Seems reasonable, but many gremlins around the ordering in these
flows, so I'm not particularly confident. With that in mind.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huwei.com>

> ---
>  drivers/pci/pcie/aer.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index ac6293c24976..4dc03cb9aff0 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1094,9 +1094,6 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  		 * Correctable error does not need software intervention.
>  		 * No need to go through error recovery process.
>  		 */
> -		if (aer)
> -			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
> -					info->status);
>  		if (pcie_aer_is_native(dev)) {
>  			struct pci_driver *pdrv = dev->driver;
>
> @@ -1105,6 +1102,10 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  				pdrv->err_handler->cor_error_detected(dev);
>  			pcie_clear_device_status(dev);
>  		}
> +		if (aer)
> +			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
> +					info->status);
> +
>  	} else if (info->severity == AER_NONFATAL)
>  		pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset);
>  	else if (info->severity == AER_FATAL)
Terry Bowman wrote:
> The AER service driver clears the AER correctable error (CE) status before
> calling the correctable error handler. This results in the error's status
> not correctly reflected if read from the CE handler.
>
> The AER CE status is needed by the portdrv's CE handler. The portdrv's
> CE handler is intended to only call the registered notifier callbacks
> if the CE error status has correctable internal error (CIE) set.

Is this a fix or a prep patch? It reads like a "fix", but there are no
notifiers to worry about today.

> This is not a problem for AER uncorrrectbale errors (UCE). The UCE status
> is still present in the AER capability and available for reading, if
> needed, when the UCE handler is called.
>
> Change the order of clearing the CE status and calling the CE handler.
> Make it to call the CE handler first and then clear the CE status
> after returning.

With the changelog clarified to indicate whether this has any impact on
current behavior you can add:

Acked-by: Dan Williams <dan.j.williams@intel.com>
On 6/20/24 06:31, Jonathan Cameron wrote:
> On Mon, 17 Jun 2024 15:04:04 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> The AER service driver clears the AER correctable error (CE) status before
>> calling the correctable error handler. This results in the error's status
>> not correctly reflected if read from the CE handler.
>>
>> The AER CE status is needed by the portdrv's CE handler. The portdrv's
>> CE handler is intended to only call the registered notifier callbacks
>> if the CE error status has correctable internal error (CIE) set.
>>
>> This is not a problem for AER uncorrrectbale errors (UCE). The UCE status
>
> uncorrectable
>

Thank you.

>> is still present in the AER capability and available for reading, if
>> needed, when the UCE handler is called.
>
> I'm seeing the clear in the DPC path for UCE. For other cases is
> it a side effect of the reset?
>

Depends on when it's being read. I'm assuming this is after recovery in
your case. And after recovery it will be zeroed.

Regards,
Terry

>>
>> Change the order of clearing the CE status and calling the CE handler.
>> Make it to call the CE handler first and then clear the CE status
>> after returning.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>> Cc: linux-pci@vger.kernel.org
> Seems reasonable, but many gremlins around the ordering in these
> flows, so I'm not particularly confident. With that in mind.
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huwei.com>
>
>> ---
>>  drivers/pci/pcie/aer.c | 7 ++++---
>>  1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index ac6293c24976..4dc03cb9aff0 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -1094,9 +1094,6 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>  		 * Correctable error does not need software intervention.
>>  		 * No need to go through error recovery process.
>>  		 */
>> -		if (aer)
>> -			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
>> -					info->status);
>>  		if (pcie_aer_is_native(dev)) {
>>  			struct pci_driver *pdrv = dev->driver;
>>
>> @@ -1105,6 +1102,10 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>  				pdrv->err_handler->cor_error_detected(dev);
>>  			pcie_clear_device_status(dev);
>>  		}
>> +		if (aer)
>> +			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
>> +					info->status);
>> +
>>  	} else if (info->severity == AER_NONFATAL)
>>  		pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset);
>>  	else if (info->severity == AER_FATAL)
>
Hi Dan,

I added a response below.

On 6/21/24 14:23, Dan Williams wrote:
> Terry Bowman wrote:
>> The AER service driver clears the AER correctable error (CE) status before
>> calling the correctable error handler. This results in the error's status
>> not correctly reflected if read from the CE handler.
>>
>> The AER CE status is needed by the portdrv's CE handler. The portdrv's
>> CE handler is intended to only call the registered notifier callbacks
>> if the CE error status has correctable internal error (CIE) set.
>
> Is this a fix or a prep patch? It reads like a "fix", but there are no
> notifiers to worry about today.
>

I will add the mention "in preparation for future patch".

>> This is not a problem for AER uncorrrectbale errors (UCE). The UCE status
>> is still present in the AER capability and available for reading, if
>> needed, when the UCE handler is called.
>>
>> Change the order of clearing the CE status and calling the CE handler.
>> Make it to call the CE handler first and then clear the CE status
>> after returning.
>
> With the changelog clarified to indicate whether this has any impact on
> current behavior you can add:
>
> Acked-by: Dan Williams <dan.j.williams@intel.com>

Regards,
Terry
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index ac6293c24976..4dc03cb9aff0 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1094,9 +1094,6 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
 		 * Correctable error does not need software intervention.
 		 * No need to go through error recovery process.
 		 */
-		if (aer)
-			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
-					info->status);
 		if (pcie_aer_is_native(dev)) {
 			struct pci_driver *pdrv = dev->driver;
 
@@ -1105,6 +1102,10 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
 				pdrv->err_handler->cor_error_detected(dev);
 			pcie_clear_device_status(dev);
 		}
+		if (aer)
+			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
+					info->status);
+
 	} else if (info->severity == AER_NONFATAL)
 		pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset);
 	else if (info->severity == AER_FATAL)
The AER service driver clears the AER correctable error (CE) status before
calling the correctable error handler. This results in the error's status
not correctly reflected if read from the CE handler.

The AER CE status is needed by the portdrv's CE handler. The portdrv's
CE handler is intended to only call the registered notifier callbacks
if the CE error status has correctable internal error (CIE) set.

This is not a problem for AER uncorrrectbale errors (UCE). The UCE status
is still present in the AER capability and available for reading, if
needed, when the UCE handler is called.

Change the order of clearing the CE status and calling the CE handler.
Make it to call the CE handler first and then clear the CE status
after returning.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
---
 drivers/pci/pcie/aer.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)