Message ID | 20240617200411.1426554-10-terry.bowman@amd.com |
---|---|
State | New, archived |
Headers | show |
Series | Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports | expand |
On Mon, 17 Jun 2024 15:04:11 -0500 Terry Bowman <terry.bowman@amd.com> wrote: > CXL RAS errors are reported through AER interrupts using the AER status: > correctbale internal errors (CIE) and AER uncorrectable internal errors correctable > (UIE).[1] But, the AER CIE/UIE are disabled by default preventing > notification of CXL RAS errors.[2] > > Enable CXL PCIe port RAS notification by unmasking the ports' AER CIE > and UIE errors. > > [1] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and Upstream > Switch Ports > [2] PCI6.0 - 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h), > 7.8.4.6 Correctable Error Mask Register (Offset 14h) > > Signed-off-by: Terry Bowman <terry.bowman@amd.com> I'm not sure doing this from a driver other than the one handling the errors makes sense. It is doing a couple of RMW without any locking or guarantees that the driver bound to the PCI port might care about this changing. I'd like more info on why we don't just turn this on in general and hence avoid the need to control it from the 'wrong' place. Jonathan > --- > drivers/cxl/core/pci.c | 12 ++++++++++++ > 1 file changed, 12 insertions(+) > > diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c > index e630eccb733d..73637d39df0a 100644 > --- a/drivers/cxl/core/pci.c > +++ b/drivers/cxl/core/pci.c > @@ -861,6 +861,12 @@ void cxl_setup_parent_uport(struct device *host, struct cxl_port *port) > struct device *uport_dev = port->uport_dev; > > cxl_port_map_regs(uport_dev, map, regs); > + > + if (dev_is_pci(uport_dev)) { > + struct pci_dev *pdev = to_pci_dev(uport_dev); > + > + pci_aer_unmask_internal_errors(pdev); I'd skip the local variable for conciseness. > + } > } > EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_uport, CXL); > > @@ -878,6 +884,12 @@ void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport) > > if (dport->rch) > cxl_disable_rch_root_ints(dport); > + > + if (dev_is_pci(dport_dev)) { > + struct pci_dev *pdev = to_pci_dev(dport_dev); > + > + pci_aer_unmask_internal_errors(pdev); likewise. > + } > } > EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_dport, CXL); >
Hi Jonathan, I added responses inline below. On 6/20/24 08:15, Jonathan Cameron wrote: > On Mon, 17 Jun 2024 15:04:11 -0500 > Terry Bowman <terry.bowman@amd.com> wrote: > >> CXL RAS errors are reported through AER interrupts using the AER status: >> correctbale internal errors (CIE) and AER uncorrectable internal errors > > correctable > Thanks. >> (UIE).[1] But, the AER CIE/UIE are disabled by default preventing >> notification of CXL RAS errors.[2] >> >> Enable CXL PCIe port RAS notification by unmasking the ports' AER CIE >> and UIE errors. >> >> [1] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and Upstream >> Switch Ports >> [2] PCI6.0 - 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h), >> 7.8.4.6 Correctable Error Mask Register (Offset 14h) >> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com> > > I'm not sure doing this from a driver other than the one handling the > errors makes sense. It is doing a couple of RMW without any locking > or guarantees that the driver bound to the PCI port might care about > this changing. > I think this could fit into the helper function mentioned in our earlier discussion. When the portdrv's notifier enabler is called it could also enable the UIE/CIE. > I'd like more info on why we don't just turn this on in general > and hence avoid the need to control it from the 'wrong' place. > > Jonathan > I was trying to enable only where needed given the one case is not a pattern, yet. At this point it is only for CXL RCH downstream port and CXL VH ports (portdrv). Would you like for the UIE/CIE unmask added to the AER driver init ? > > >> --- >> drivers/cxl/core/pci.c | 12 ++++++++++++ >> 1 file changed, 12 insertions(+) >> >> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c >> index e630eccb733d..73637d39df0a 100644 >> --- a/drivers/cxl/core/pci.c >> +++ b/drivers/cxl/core/pci.c >> @@ -861,6 +861,12 @@ void cxl_setup_parent_uport(struct device *host, struct cxl_port *port) >> struct device *uport_dev = port->uport_dev; >> >> cxl_port_map_regs(uport_dev, map, regs); >> + >> + if (dev_is_pci(uport_dev)) { >> + struct pci_dev *pdev = to_pci_dev(uport_dev); >> + >> + pci_aer_unmask_internal_errors(pdev); > > I'd skip the local variable for conciseness. > >> + } >> } >> EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_uport, CXL); >> >> @@ -878,6 +884,12 @@ void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport) >> >> if (dport->rch) >> cxl_disable_rch_root_ints(dport); >> + >> + if (dev_is_pci(dport_dev)) { >> + struct pci_dev *pdev = to_pci_dev(dport_dev); >> + >> + pci_aer_unmask_internal_errors(pdev); > > likewise. > Got it. Regards, Terry >> + } >> } >> EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_dport, CXL); >> >
On Mon, 24 Jun 2024 11:46:01 -0500 Terry Bowman <Terry.Bowman@amd.com> wrote: > Hi Jonathan, > > I added responses inline below. > > On 6/20/24 08:15, Jonathan Cameron wrote: > > On Mon, 17 Jun 2024 15:04:11 -0500 > > Terry Bowman <terry.bowman@amd.com> wrote: > > > >> CXL RAS errors are reported through AER interrupts using the AER status: > >> correctbale internal errors (CIE) and AER uncorrectable internal errors > > > > correctable > > > > Thanks. > > >> (UIE).[1] But, the AER CIE/UIE are disabled by default preventing > >> notification of CXL RAS errors.[2] > >> > >> Enable CXL PCIe port RAS notification by unmasking the ports' AER CIE > >> and UIE errors. > >> > >> [1] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and Upstream > >> Switch Ports > >> [2] PCI6.0 - 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h), > >> 7.8.4.6 Correctable Error Mask Register (Offset 14h) > >> > >> Signed-off-by: Terry Bowman <terry.bowman@amd.com> > > > > I'm not sure doing this from a driver other than the one handling the > > errors makes sense. It is doing a couple of RMW without any locking > > or guarantees that the driver bound to the PCI port might care about > > this changing. > > > > I think this could fit into the helper function mentioned in our earlier > discussion. When the portdrv's notifier enabler is called it could also > enable the UIE/CIE. > > > I'd like more info on why we don't just turn this on in general > > and hence avoid the need to control it from the 'wrong' place. > > > > Jonathan > > > > I was trying to enable only where needed given the one case is not a > pattern, yet. At this point it is only for CXL RCH downstream port > and CXL VH ports (portdrv). > > Would you like for the UIE/CIE unmask added to the AER driver init ? If we can get away with it, yes!
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c index e630eccb733d..73637d39df0a 100644 --- a/drivers/cxl/core/pci.c +++ b/drivers/cxl/core/pci.c @@ -861,6 +861,12 @@ void cxl_setup_parent_uport(struct device *host, struct cxl_port *port) struct device *uport_dev = port->uport_dev; cxl_port_map_regs(uport_dev, map, regs); + + if (dev_is_pci(uport_dev)) { + struct pci_dev *pdev = to_pci_dev(uport_dev); + + pci_aer_unmask_internal_errors(pdev); + } } EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_uport, CXL); @@ -878,6 +884,12 @@ void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport) if (dport->rch) cxl_disable_rch_root_ints(dport); + + if (dev_is_pci(dport_dev)) { + struct pci_dev *pdev = to_pci_dev(dport_dev); + + pci_aer_unmask_internal_errors(pdev); + } } EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_dport, CXL);
CXL RAS errors are reported through AER interrupts using the AER status: correctbale internal errors (CIE) and AER uncorrectable internal errors (UIE).[1] But, the AER CIE/UIE are disabled by default preventing notification of CXL RAS errors.[2] Enable CXL PCIe port RAS notification by unmasking the ports' AER CIE and UIE errors. [1] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and Upstream Switch Ports [2] PCI6.0 - 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h), 7.8.4.6 Correctable Error Mask Register (Offset 14h) Signed-off-by: Terry Bowman <terry.bowman@amd.com> --- drivers/cxl/core/pci.c | 12 ++++++++++++ 1 file changed, 12 insertions(+)