Message ID | 20230622205523.85375-26-terry.bowman@amd.com |
---|---|
State | Superseded |
Headers | show |
Series | cxl/pci: Add support for RCH RAS error handling | expand |
On 6/22/23 13:55, Terry Bowman wrote: > From: Robert Richter <rrichter@amd.com> > > In Restricted CXL Device (RCD) mode a CXL device is exposed as an > RCiEP, but CXL downstream and upstream ports are not enumerated and > not visible in the PCIe hierarchy. [1] Protocol and link errors from > these non-enumerated ports are signaled as internal AER errors, either > Uncorrectable Internal Error (UIE) or Corrected Internal Errors (CIE) > via an RCEC. > > Restricted CXL host (RCH) downstream port-detected errors have the > Requster ID of the RCEC set in the RCEC's AER Error Source ID s/Requster/Requester/ > register. A CXL handler must then inspect the error status in various > CXL registers residing in the dport's component register space (CXL > RAS capability) or the dport's RCRB (PCIe AER extended > capability). [2] > > Errors showing up in the RCEC's error handler must be handled and > connected to the CXL subsystem. Implement this by forwarding the error > to all CXL devices below the RCEC. Since the entire CXL device is > controlled only using PCIe Configuration Space of device 0, function > 0, only pass it there [3]. The error handling is limited to currently > supported devices with the Memory Device class code set (CXL Type 3 > Device, PCI_CLASS_MEMORY_CXL, 502h), handle downstream port errors in > the device's cxl_pci driver. Support for other CXL Device Types > (e.g. a CXL.cache Device) can be added later. > > To handle downstream port errors in addition to errors directed to the > CXL endpoint device, a handler must also inspect the CXL RAS and PCIe > AER capabilities of the CXL downstream port the device is connected > to. > > Since CXL downstream port errors are signaled using internal errors, > the handler requires those errors to be unmasked. This is subject of a > follow-on patch. > > The reason for choosing this implementation is that the AER service > driver claims the RCEC device, but does not allow it to register a > custom specific handler to support CXL. Connecting the RCEC hard-wired > with a CXL handler does not work, as the CXL subsystem might not be > present all the time. The alternative to add an implementation to the > portdrv to allow the registration of a custom RCEC error handler isn't > worth doing it as CXL would be its only user. Instead, just check for > an CXL RCEC and pass it down to the connected CXL device's error > handler. With this approach the code can entirely be implemented in > the PCIe AER driver and is independent of the CXL subsystem. The CXL > driver only provides the handler. > > [1] CXL 3.0 spec: 9.11.8 CXL Devices Attached to an RCH > [2] CXL 3.0 spec, 12.2.1.1 RCH Downstream Port-detected Errors > [3] CXL 3.0 spec, 8.1.3 PCIe DVSEC for CXL Devices > > Co-developed-by: Terry Bowman <terry.bowman@amd.com> > Signed-off-by: Terry Bowman <terry.bowman@amd.com> > Signed-off-by: Robert Richter <rrichter@amd.com> > Cc: "Oliver O'Halloran" <oohall@gmail.com> > Cc: Bjorn Helgaas <bhelgaas@google.com> > Cc: linuxppc-dev@lists.ozlabs.org > Cc: linux-pci@vger.kernel.org > Acked-by: Bjorn Helgaas <bhelgaas@google.com> > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> > --- > drivers/pci/pcie/Kconfig | 12 +++++ > drivers/pci/pcie/aer.c | 96 +++++++++++++++++++++++++++++++++++++++- > 2 files changed, 106 insertions(+), 2 deletions(-) > > diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig > index 228652a59f27..4f0e70fafe2d 100644 > --- a/drivers/pci/pcie/Kconfig > +++ b/drivers/pci/pcie/Kconfig > @@ -49,6 +49,18 @@ config PCIEAER_INJECT > gotten from: > https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/ > > +config PCIEAER_CXL > + bool "PCI Express CXL RAS support for Restricted Hosts (RCH)" > + default y > + depends on PCIEAER && CXL_PCI > + help > + Enables error handling of downstream ports of a CXL host > + that is operating in RCD mode (Restricted CXL Host, RCH). > + The downstream port reports AER errors to a given RCEC. > + Errors are handled by the CXL memory device driver. > + > + If unsure, say Y. > + > # > # PCI Express ECRC > # > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > index d3344fcf1f79..c354ca5e8f2b 100644 > --- a/drivers/pci/pcie/aer.c > +++ b/drivers/pci/pcie/aer.c > @@ -946,14 +946,100 @@ static bool find_source_device(struct pci_dev *parent, > return true; > } > > +#ifdef CONFIG_PCIEAER_CXL > + > +static bool is_cxl_mem_dev(struct pci_dev *dev) > +{ > + /* > + * The capability, status, and control fields in Device 0, > + * Function 0 DVSEC control the CXL functionality of the > + * entire device (CXL 3.0, 8.1.3). > + */ > + if (dev->devfn != PCI_DEVFN(0, 0)) > + return false; > + > + /* > + * CXL Memory Devices must have the 502h class code set (CXL > + * 3.0, 8.1.12.1). > + */ > + if ((dev->class >> 8) != PCI_CLASS_MEMORY_CXL) > + return false; > + > + return true; > +} > + > +static bool cxl_error_is_native(struct pci_dev *dev) > +{ > + struct pci_host_bridge *host = pci_find_host_bridge(dev->bus); > + > + if (pcie_ports_native) > + return true; > + > + return host->native_aer && host->native_cxl_error; > +} > + > +static bool is_internal_error(struct aer_err_info *info) > +{ > + if (info->severity == AER_CORRECTABLE) > + return info->status & PCI_ERR_COR_INTERNAL; > + > + return info->status & PCI_ERR_UNC_INTN; > +} > + > +static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data) > +{ > + struct aer_err_info *info = (struct aer_err_info *)data; > + const struct pci_error_handlers *err_handler; > + > + if (!is_cxl_mem_dev(dev) || !cxl_error_is_native(dev)) > + return 0; > + > + /* protect dev->driver */ > + device_lock(&dev->dev); > + > + err_handler = dev->driver ? dev->driver->err_handler : NULL; > + if (!err_handler) > + goto out; > + > + if (info->severity == AER_CORRECTABLE) { > + if (err_handler->cor_error_detected) > + err_handler->cor_error_detected(dev); > + } else if (err_handler->error_detected) { > + if (info->severity == AER_NONFATAL) > + err_handler->error_detected(dev, pci_channel_io_normal); > + else if (info->severity == AER_FATAL) > + err_handler->error_detected(dev, pci_channel_io_frozen); > + } > +out: > + device_unlock(&dev->dev); > + return 0; > +} > + > +static void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info) > +{ > + /* > + * Internal errors of an RCEC indicate an AER error in an > + * RCH's downstream port. Check and handle them in the CXL.mem > + * device driver. > + */ > + if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC && > + is_internal_error(info)) > + pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info); > +} > + > +#else > +static inline void cxl_rch_handle_error(struct pci_dev *dev, > + struct aer_err_info *info) { } > +#endif > + > /** > - * handle_error_source - handle logging error into an event log > + * pci_aer_handle_error - handle logging error into an event log > * @dev: pointer to pci_dev data structure of error source device > * @info: comprehensive error information > * > * Invoked when an error being detected by Root Port. > */ > -static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info) > +static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info) > { > int aer = dev->aer_cap; > > @@ -977,6 +1063,12 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info) > pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset); > else if (info->severity == AER_FATAL) > pcie_do_recovery(dev, pci_channel_io_frozen, aer_root_reset); > +} > + > +static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info) > +{ > + cxl_rch_handle_error(dev, info); > + pci_aer_handle_error(dev, info); > pci_dev_put(dev); > } >
Dave Jiang wrote: > > > On 6/22/23 13:55, Terry Bowman wrote: > > From: Robert Richter <rrichter@amd.com> > > > > In Restricted CXL Device (RCD) mode a CXL device is exposed as an > > RCiEP, but CXL downstream and upstream ports are not enumerated and > > not visible in the PCIe hierarchy. [1] Protocol and link errors from > > these non-enumerated ports are signaled as internal AER errors, either > > Uncorrectable Internal Error (UIE) or Corrected Internal Errors (CIE) > > via an RCEC. > > > > Restricted CXL host (RCH) downstream port-detected errors have the > > Requster ID of the RCEC set in the RCEC's AER Error Source ID > > s/Requster/Requester/ Thanks, fixed.
diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig index 228652a59f27..4f0e70fafe2d 100644 --- a/drivers/pci/pcie/Kconfig +++ b/drivers/pci/pcie/Kconfig @@ -49,6 +49,18 @@ config PCIEAER_INJECT gotten from: https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/ +config PCIEAER_CXL + bool "PCI Express CXL RAS support for Restricted Hosts (RCH)" + default y + depends on PCIEAER && CXL_PCI + help + Enables error handling of downstream ports of a CXL host + that is operating in RCD mode (Restricted CXL Host, RCH). + The downstream port reports AER errors to a given RCEC. + Errors are handled by the CXL memory device driver. + + If unsure, say Y. + # # PCI Express ECRC # diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index d3344fcf1f79..c354ca5e8f2b 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -946,14 +946,100 @@ static bool find_source_device(struct pci_dev *parent, return true; } +#ifdef CONFIG_PCIEAER_CXL + +static bool is_cxl_mem_dev(struct pci_dev *dev) +{ + /* + * The capability, status, and control fields in Device 0, + * Function 0 DVSEC control the CXL functionality of the + * entire device (CXL 3.0, 8.1.3). + */ + if (dev->devfn != PCI_DEVFN(0, 0)) + return false; + + /* + * CXL Memory Devices must have the 502h class code set (CXL + * 3.0, 8.1.12.1). + */ + if ((dev->class >> 8) != PCI_CLASS_MEMORY_CXL) + return false; + + return true; +} + +static bool cxl_error_is_native(struct pci_dev *dev) +{ + struct pci_host_bridge *host = pci_find_host_bridge(dev->bus); + + if (pcie_ports_native) + return true; + + return host->native_aer && host->native_cxl_error; +} + +static bool is_internal_error(struct aer_err_info *info) +{ + if (info->severity == AER_CORRECTABLE) + return info->status & PCI_ERR_COR_INTERNAL; + + return info->status & PCI_ERR_UNC_INTN; +} + +static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data) +{ + struct aer_err_info *info = (struct aer_err_info *)data; + const struct pci_error_handlers *err_handler; + + if (!is_cxl_mem_dev(dev) || !cxl_error_is_native(dev)) + return 0; + + /* protect dev->driver */ + device_lock(&dev->dev); + + err_handler = dev->driver ? dev->driver->err_handler : NULL; + if (!err_handler) + goto out; + + if (info->severity == AER_CORRECTABLE) { + if (err_handler->cor_error_detected) + err_handler->cor_error_detected(dev); + } else if (err_handler->error_detected) { + if (info->severity == AER_NONFATAL) + err_handler->error_detected(dev, pci_channel_io_normal); + else if (info->severity == AER_FATAL) + err_handler->error_detected(dev, pci_channel_io_frozen); + } +out: + device_unlock(&dev->dev); + return 0; +} + +static void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info) +{ + /* + * Internal errors of an RCEC indicate an AER error in an + * RCH's downstream port. Check and handle them in the CXL.mem + * device driver. + */ + if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC && + is_internal_error(info)) + pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info); +} + +#else +static inline void cxl_rch_handle_error(struct pci_dev *dev, + struct aer_err_info *info) { } +#endif + /** - * handle_error_source - handle logging error into an event log + * pci_aer_handle_error - handle logging error into an event log * @dev: pointer to pci_dev data structure of error source device * @info: comprehensive error information * * Invoked when an error being detected by Root Port. */ -static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info) +static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info) { int aer = dev->aer_cap; @@ -977,6 +1063,12 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info) pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset); else if (info->severity == AER_FATAL) pcie_do_recovery(dev, pci_channel_io_frozen, aer_root_reset); +} + +static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info) +{ + cxl_rch_handle_error(dev, info); + pci_aer_handle_error(dev, info); pci_dev_put(dev); }