Message ID | 20230411180302.2678736-5-terry.bowman@amd.com |
---|---|
State | Superseded |
Headers | show |
Series | cxl/pci: Add support for RCH RAS error handling | expand |
Hi Terry, kernel test robot noticed the following build errors: [auto build test ERROR on ca712e47054678c5ce93a0e0f686353ad5561195] url: https://github.com/intel-lab-lkp/linux/commits/Terry-Bowman/cxl-pci-Add-RCH-downstream-port-AER-and-RAS-register-discovery/20230412-020957 base: ca712e47054678c5ce93a0e0f686353ad5561195 patch link: https://lore.kernel.org/r/20230411180302.2678736-5-terry.bowman%40amd.com patch subject: [PATCH v3 4/6] cxl/pci: Add RCH downstream port error logging config: loongarch-buildonly-randconfig-r004-20230409 (https://download.01.org/0day-ci/archive/20230412/202304120926.dekDF6um-lkp@intel.com/config) compiler: loongarch64-linux-gcc (GCC) 12.1.0 reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://github.com/intel-lab-lkp/linux/commit/7f1c5cefb1e75bd709dc35c7f5e3e29dd5df65e1 git remote add linux-review https://github.com/intel-lab-lkp/linux git fetch --no-tags linux-review Terry-Bowman/cxl-pci-Add-RCH-downstream-port-AER-and-RAS-register-discovery/20230412-020957 git checkout 7f1c5cefb1e75bd709dc35c7f5e3e29dd5df65e1 # save the config file mkdir build_dir && cp config build_dir/.config COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=loongarch olddefconfig COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=loongarch SHELL=/bin/bash If you fix the issue, kindly add following tag where applicable | Reported-by: kernel test robot <lkp@intel.com> | Link: https://lore.kernel.org/oe-kbuild-all/202304120926.dekDF6um-lkp@intel.com/ All errors (new ones prefixed by >>): loongarch64-linux-ld: drivers/cxl/core/pci.o: in function `cxl_rch_log_error': drivers/cxl/core/pci.c:768: undefined reference to `cper_print_aer' >> loongarch64-linux-ld: drivers/cxl/core/pci.c:768: undefined reference to `cper_print_aer' >> loongarch64-linux-ld: drivers/cxl/core/pci.c:768: undefined reference to `cper_print_aer'
Hi Terry,
kernel test robot noticed the following build errors:
[auto build test ERROR on ca712e47054678c5ce93a0e0f686353ad5561195]
url: https://github.com/intel-lab-lkp/linux/commits/Terry-Bowman/cxl-pci-Add-RCH-downstream-port-AER-and-RAS-register-discovery/20230412-020957
base: ca712e47054678c5ce93a0e0f686353ad5561195
patch link: https://lore.kernel.org/r/20230411180302.2678736-5-terry.bowman%40amd.com
patch subject: [PATCH v3 4/6] cxl/pci: Add RCH downstream port error logging
config: riscv-randconfig-r014-20230410 (https://download.01.org/0day-ci/archive/20230412/202304121055.UceD86D7-lkp@intel.com/config)
compiler: riscv64-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/7f1c5cefb1e75bd709dc35c7f5e3e29dd5df65e1
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Terry-Bowman/cxl-pci-Add-RCH-downstream-port-AER-and-RAS-register-discovery/20230412-020957
git checkout 7f1c5cefb1e75bd709dc35c7f5e3e29dd5df65e1
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=riscv olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=riscv SHELL=/bin/bash
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304121055.UceD86D7-lkp@intel.com/
All errors (new ones prefixed by >>):
riscv64-linux-ld: riscv64-linux-ld: DWARF error: could not find abbrev number 463040
drivers/cxl/core/pci.o: in function `.L0 ':
pci.c:(.text+0x1ae2): undefined reference to `cper_print_aer'
>> riscv64-linux-ld: pci.c:(.text+0x1afa): undefined reference to `cper_print_aer'
On Tue, 11 Apr 2023 13:03:00 -0500 Terry Bowman <terry.bowman@amd.com> wrote: > RCH downstream port error logging is missing in the current CXL driver. The > missing AER and RAS error logging is needed for communicating driver error > details to userspace. Update the driver to include PCIe AER and CXL RAS > error logging. > > Add RCH downstream port error handling into the existing RCiEP handler. > The downstream port error handler is added to the RCiEP error handler > because the downstream port is implemented in a RCRB, is not PCI > enumerable, and as a result is not directly accessible to the PCI AER > root port driver. The AER root port driver calls the RCiEP handler for > handling RCD errors and RCH downstream port protocol errors. > > Update mem.c to include RAS and AER setup. This includes AER and RAS > capability discovery and mapping for later use in the error handler. > > Disable RCH downstream port's root port cmd interrupts.[1] > > Update existing RCiEP correctable and uncorrectable handlers to also call > the RCH handler. The RCH handler will read the RCH AER registers, check for > error severity, and if an error exists will log using an existing kernel > AER trace routine. The RCH handler will also log downstream port RAS errors > if they exist. > > [1] CXL 3.0 Spec, 12.2.1.1 - RCH Downstream Port Detected Errors > > Co-developed-by: Robert Richter <rrichter@amd.com> > Signed-off-by: Robert Richter <rrichter@amd.com> > Signed-off-by: Terry Bowman <terry.bowman@amd.com> Some minor stuff inline. Looks fine to me otherwise. I do find it a little confusing how often we go into an RCD or RCH specific function then drop out directly for 2.0+ case, but you do seem to be consistent with existing code so fair enough. Jonathan > --- > drivers/cxl/core/pci.c | 126 ++++++++++++++++++++++++++++++++++++---- > drivers/cxl/core/regs.c | 1 + > drivers/cxl/cxl.h | 13 +++++ > drivers/cxl/mem.c | 73 +++++++++++++++++++++++ > 4 files changed, 201 insertions(+), 12 deletions(-) > > diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c > index 523d5b9fd7fc..d435ed2ff8b6 100644 > --- a/drivers/cxl/core/pci.c > +++ b/drivers/cxl/core/pci.c > +/* > + * Copy the AER capability registers to a buffer. This is necessary > + * because RCRB AER capability is MMIO mapped. Clear the status > + * after copying. > + * > + * @aer_base: base address of AER capability block in RCRB > + * @aer_regs: destination for copying AER capability > + */ > +static bool cxl_rch_get_aer_info(void __iomem *aer_base, > + struct aer_capability_regs *aer_regs) > +{ > + int read_cnt = PCI_AER_CAPABILITY_LENGTH / sizeof(u32); > + u32 *aer_regs_buf = (u32 *)aer_regs; > + int n; > + > + if (!aer_base) > + return false; > + > + for (n = 0; n < read_cnt; n++) > + aer_regs_buf[n] = readl(aer_base + n * sizeof(u32)); Maybe add a comment here on why memcpy_fromio() doesn't work for us. I'm assuming we need these to definitely be 32bit reads. Otherwise someone will 'optimize' it in future. > + > + writel(aer_regs->uncor_status, aer_base + PCI_ERR_UNCOR_STATUS); > + writel(aer_regs->cor_status, aer_base + PCI_ERR_COR_STATUS); > + > + return true; > +} = > diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c > index bde1fffab09e..dfa6fcfc428a 100644 > --- a/drivers/cxl/core/regs.c > +++ b/drivers/cxl/core/regs.c > @@ -198,6 +198,7 @@ void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr, > > return ret_val; > } > +EXPORT_SYMBOL_NS_GPL(devm_cxl_iomap_block, CXL); > > int cxl_map_component_regs(struct device *dev, struct cxl_component_regs *regs, > struct cxl_register_map *map, unsigned long map_mask) > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h > index df64c402e6e6..dae3f141ffcb 100644 > --- a/drivers/cxl/cxl.h > +++ b/drivers/cxl/cxl.h > @@ -66,6 +66,8 @@ > #define CXL_DECODER_MIN_GRANULARITY 256 > #define CXL_DECODER_MAX_ENCODED_IG 6 > > +#define PCI_AER_CAPABILITY_LENGTH 56 Odd place to find a PCI specific define. Also a spec reference is always good for these. What's the the length of? PCI r6.0 has cap going up to address 0x5c so length 0x60. This seems to be igoring the header log register. > + > static inline int cxl_hdm_decoder_count(u32 cap_hdr) > { > int val = FIELD_GET(CXL_HDM_DECODER_COUNT_MASK, cap_hdr); > @@ -209,6 +211,15 @@ struct cxl_regs { > struct_group_tagged(cxl_device_regs, device_regs, > void __iomem *status, *mbox, *memdev; > ); > + > + /* > + * Pointer to RCH cxl_dport AER. (only for RCH/RCD mode) > + * @dport_aer: CXL 2.0 12.2.11 RCH Downstream Port-detected Errors As with other cases, I'd like full comments, so something for @aer as well. > + */ > + struct_group_tagged(cxl_rch_regs, rch_regs, > + void __iomem *aer; > + void __iomem *dport_ras; > + ); > }; > > struct cxl_reg_map { > @@ -249,6 +260,8 @@ struct cxl_register_map { > }; > }; > > +void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr, > + resource_size_t length); > void cxl_probe_component_regs(struct device *dev, void __iomem *base, > struct cxl_component_reg_map *map); > void cxl_probe_device_regs(struct device *dev, void __iomem *base, > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c > index 014295ab6bc6..dd5ae0a4560c 100644 > --- a/drivers/cxl/mem.c > +++ b/drivers/cxl/mem.c > @@ -4,6 +4,7 @@ > #include <linux/device.h> > #include <linux/module.h> > #include <linux/pci.h> > +#include <linux/aer.h> > > #include "cxlmem.h" > #include "cxlpci.h" > @@ -45,6 +46,71 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data) > return 0; > } > > +static void rch_disable_root_ints(void __iomem *aer_base) > +{ > + u32 aer_cmd_mask, aer_cmd; > + > + /* > + * Disable RCH root port command interrupts. > + * CXL3.0 12.2.1.1 - RCH Downstream Port-detected Errors > + */ > + aer_cmd_mask = (PCI_ERR_ROOT_CMD_COR_EN | > + PCI_ERR_ROOT_CMD_NONFATAL_EN | > + PCI_ERR_ROOT_CMD_FATAL_EN); > + aer_cmd = readl(aer_base + PCI_ERR_ROOT_COMMAND); > + aer_cmd &= ~aer_cmd_mask; > + writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND); Should we be touching these if firmware hasn't granted control to the OS? Description in the spec refers to 'software'. Is that the kernel? No idea. I guess this is safe even if it has already been done. Perhaps a comment to say it should already be in this state? > +} > + > +static int cxl_rch_map_ras(struct cxl_dev_state *cxlds, > + struct cxl_dport *parent_dport) > +{ > + struct device *dev = parent_dport->dport; > + resource_size_t aer_phys, ras_phys; > + void __iomem *aer, *dport_ras; > + > + if (!parent_dport->rch) > + return 0; > + > + if (!parent_dport->aer_cap || !parent_dport->ras_cap || > + parent_dport->component_reg_phys == CXL_RESOURCE_NONE) > + return -ENODEV; > + > + aer_phys = parent_dport->aer_cap + parent_dport->rcrb; > + aer = devm_cxl_iomap_block(dev, aer_phys, > + PCI_AER_CAPABILITY_LENGTH); > + > + if (!aer) > + return -ENOMEM; > + > + ras_phys = parent_dport->ras_cap + parent_dport->component_reg_phys; > + dport_ras = devm_cxl_iomap_block(dev, ras_phys, > + CXL_RAS_CAPABILITY_LENGTH); > + > + if (!dport_ras) > + return -ENOMEM; > + > + cxlds->regs.aer = aer; > + cxlds->regs.dport_ras = dport_ras; > + > + return 0; > +} > + > +static int cxl_setup_ras(struct cxl_dev_state *cxlds, > + struct cxl_dport *parent_dport) > +{ > + int rc; > + > + rc = cxl_rch_map_ras(cxlds, parent_dport); > + if (rc) > + return rc; > + > + if (cxlds->rcd) > + rch_disable_root_ints(cxlds->regs.aer); > + > + return rc; > +} > + > static void cxl_setup_rcrb(struct cxl_dev_state *cxlds, > struct cxl_dport *parent_dport) > { > @@ -91,6 +157,13 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd, > > cxl_setup_rcrb(cxlds, parent_dport); > > + rc = cxl_setup_ras(cxlds, parent_dport); > + /* Continue with RAS setup errors */ > + if (rc) > + dev_warn(&cxlmd->dev, "CXL RAS setup failed: %d\n", rc); > + else > + dev_info(&cxlmd->dev, "CXL error handling enabled\n"); This feels a little noisy as something to add given we didn't shout about it for non RCD cases (I think). Maybe a dev_dbg()? > + > endpoint = devm_cxl_add_port(host, &cxlmd->dev, cxlds->component_reg_phys, > parent_dport); > if (IS_ERR(endpoint))
Hi Jonathan, I added responses inline below. On 4/13/23 11:50, Jonathan Cameron wrote: > On Tue, 11 Apr 2023 13:03:00 -0500 > Terry Bowman <terry.bowman@amd.com> wrote: > >> RCH downstream port error logging is missing in the current CXL driver. The >> missing AER and RAS error logging is needed for communicating driver error >> details to userspace. Update the driver to include PCIe AER and CXL RAS >> error logging. >> >> Add RCH downstream port error handling into the existing RCiEP handler. >> The downstream port error handler is added to the RCiEP error handler >> because the downstream port is implemented in a RCRB, is not PCI >> enumerable, and as a result is not directly accessible to the PCI AER >> root port driver. The AER root port driver calls the RCiEP handler for >> handling RCD errors and RCH downstream port protocol errors. >> >> Update mem.c to include RAS and AER setup. This includes AER and RAS >> capability discovery and mapping for later use in the error handler. >> >> Disable RCH downstream port's root port cmd interrupts.[1] >> >> Update existing RCiEP correctable and uncorrectable handlers to also call >> the RCH handler. The RCH handler will read the RCH AER registers, check for >> error severity, and if an error exists will log using an existing kernel >> AER trace routine. The RCH handler will also log downstream port RAS errors >> if they exist. >> >> [1] CXL 3.0 Spec, 12.2.1.1 - RCH Downstream Port Detected Errors >> >> Co-developed-by: Robert Richter <rrichter@amd.com> >> Signed-off-by: Robert Richter <rrichter@amd.com> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com> > Some minor stuff inline. Looks fine to me otherwise. > > I do find it a little confusing how often we go into an RCD or RCH specific > function then drop out directly for 2.0+ case, but you do seem to be consistent > with existing code so fair enough. > > Jonathan > This was to simplify the code from the caller(s) perspective while also trying to generalize the logic. >> --- >> drivers/cxl/core/pci.c | 126 ++++++++++++++++++++++++++++++++++++---- >> drivers/cxl/core/regs.c | 1 + >> drivers/cxl/cxl.h | 13 +++++ >> drivers/cxl/mem.c | 73 +++++++++++++++++++++++ >> 4 files changed, 201 insertions(+), 12 deletions(-) >> >> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c >> index 523d5b9fd7fc..d435ed2ff8b6 100644 >> --- a/drivers/cxl/core/pci.c >> +++ b/drivers/cxl/core/pci.c > > >> +/* >> + * Copy the AER capability registers to a buffer. This is necessary >> + * because RCRB AER capability is MMIO mapped. Clear the status >> + * after copying. >> + * >> + * @aer_base: base address of AER capability block in RCRB >> + * @aer_regs: destination for copying AER capability >> + */ >> +static bool cxl_rch_get_aer_info(void __iomem *aer_base, >> + struct aer_capability_regs *aer_regs) >> +{ >> + int read_cnt = PCI_AER_CAPABILITY_LENGTH / sizeof(u32); >> + u32 *aer_regs_buf = (u32 *)aer_regs; >> + int n; >> + >> + if (!aer_base) >> + return false; >> + >> + for (n = 0; n < read_cnt; n++) >> + aer_regs_buf[n] = readl(aer_base + n * sizeof(u32)); > > Maybe add a comment here on why memcpy_fromio() doesn't work for us. > I'm assuming we need these to definitely be 32bit reads. > Otherwise someone will 'optimize' it in future. > Correct, this was to enforce 32-bit accesses. I will add a comment. >> + >> + writel(aer_regs->uncor_status, aer_base + PCI_ERR_UNCOR_STATUS); >> + writel(aer_regs->cor_status, aer_base + PCI_ERR_COR_STATUS); >> + >> + return true; >> +} > = >> diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c >> index bde1fffab09e..dfa6fcfc428a 100644 >> --- a/drivers/cxl/core/regs.c >> +++ b/drivers/cxl/core/regs.c >> @@ -198,6 +198,7 @@ void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr, >> >> return ret_val; >> } >> +EXPORT_SYMBOL_NS_GPL(devm_cxl_iomap_block, CXL); >> >> int cxl_map_component_regs(struct device *dev, struct cxl_component_regs *regs, >> struct cxl_register_map *map, unsigned long map_mask) >> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h >> index df64c402e6e6..dae3f141ffcb 100644 >> --- a/drivers/cxl/cxl.h >> +++ b/drivers/cxl/cxl.h >> @@ -66,6 +66,8 @@ >> #define CXL_DECODER_MIN_GRANULARITY 256 >> #define CXL_DECODER_MAX_ENCODED_IG 6 >> >> +#define PCI_AER_CAPABILITY_LENGTH 56 > > Odd place to find a PCI specific define. Also a spec reference is > always good for these. What's the the length of? PCI r6.0 has > cap going up to address 0x5c so length 0x60. This seems to be igoring > the header log register. > This was to avoid including the TLP log at 0x38+. I can use sizeof(struct aer_capability_regs) or sizeof(*aer_regs) instead. It's the same 38h(56) and will allow me to remove this #define in the patchset revision. >> + >> static inline int cxl_hdm_decoder_count(u32 cap_hdr) >> { >> int val = FIELD_GET(CXL_HDM_DECODER_COUNT_MASK, cap_hdr); >> @@ -209,6 +211,15 @@ struct cxl_regs { >> struct_group_tagged(cxl_device_regs, device_regs, >> void __iomem *status, *mbox, *memdev; >> ); >> + >> + /* >> + * Pointer to RCH cxl_dport AER. (only for RCH/RCD mode) >> + * @dport_aer: CXL 2.0 12.2.11 RCH Downstream Port-detected Errors > > As with other cases, I'd like full comments, so something for @aer as well. > >> + */ >> + struct_group_tagged(cxl_rch_regs, rch_regs, >> + void __iomem *aer; >> + void __iomem *dport_ras; >> + ); >> }; >> >> struct cxl_reg_map { >> @@ -249,6 +260,8 @@ struct cxl_register_map { >> }; >> }; >> >> +void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr, >> + resource_size_t length); >> void cxl_probe_component_regs(struct device *dev, void __iomem *base, >> struct cxl_component_reg_map *map); >> void cxl_probe_device_regs(struct device *dev, void __iomem *base, >> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c >> index 014295ab6bc6..dd5ae0a4560c 100644 >> --- a/drivers/cxl/mem.c >> +++ b/drivers/cxl/mem.c >> @@ -4,6 +4,7 @@ >> #include <linux/device.h> >> #include <linux/module.h> >> #include <linux/pci.h> >> +#include <linux/aer.h> >> >> #include "cxlmem.h" >> #include "cxlpci.h" >> @@ -45,6 +46,71 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data) >> return 0; >> } >> >> +static void rch_disable_root_ints(void __iomem *aer_base) >> +{ >> + u32 aer_cmd_mask, aer_cmd; >> + >> + /* >> + * Disable RCH root port command interrupts. >> + * CXL3.0 12.2.1.1 - RCH Downstream Port-detected Errors >> + */ >> + aer_cmd_mask = (PCI_ERR_ROOT_CMD_COR_EN | >> + PCI_ERR_ROOT_CMD_NONFATAL_EN | >> + PCI_ERR_ROOT_CMD_FATAL_EN); >> + aer_cmd = readl(aer_base + PCI_ERR_ROOT_COMMAND); >> + aer_cmd &= ~aer_cmd_mask; >> + writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND); > > Should we be touching these if firmware hasn't granted control to > the OS? Description in the spec refers to 'software'. Is that > the kernel? No idea. I guess this is safe even if it has already > been done. Perhaps a comment to say it should already be in this state? > > These need to be disabled because the RCH shouldn't behave as a root port/RCEC generating interrupts as a result of correctable, fatal, or non-fatal AER errors. I added this per the CXL3.0 spec but, as you mentioned, isn't likely necessary because they are disabled by default per PCI6.0.[1][2] This would be the case for OS/native and HW/FW error reporting. I'll add a comment stating it is already in this state. [1] CXL3.0 - 12.2.1.1 RCH Downstream Port-detected Errors [2] PCI 6.0 - 7.8.4.9 Root Error Command Register (Offset 2Ch) >> +} >> + >> +static int cxl_rch_map_ras(struct cxl_dev_state *cxlds, >> + struct cxl_dport *parent_dport) >> +{ >> + struct device *dev = parent_dport->dport; >> + resource_size_t aer_phys, ras_phys; >> + void __iomem *aer, *dport_ras; >> + >> + if (!parent_dport->rch) >> + return 0; >> + >> + if (!parent_dport->aer_cap || !parent_dport->ras_cap || >> + parent_dport->component_reg_phys == CXL_RESOURCE_NONE) >> + return -ENODEV; >> + >> + aer_phys = parent_dport->aer_cap + parent_dport->rcrb; >> + aer = devm_cxl_iomap_block(dev, aer_phys, >> + PCI_AER_CAPABILITY_LENGTH); >> + >> + if (!aer) >> + return -ENOMEM; >> + >> + ras_phys = parent_dport->ras_cap + parent_dport->component_reg_phys; >> + dport_ras = devm_cxl_iomap_block(dev, ras_phys, >> + CXL_RAS_CAPABILITY_LENGTH); >> + >> + if (!dport_ras) >> + return -ENOMEM; >> + >> + cxlds->regs.aer = aer; >> + cxlds->regs.dport_ras = dport_ras; >> + >> + return 0; >> +} >> + >> +static int cxl_setup_ras(struct cxl_dev_state *cxlds, >> + struct cxl_dport *parent_dport) >> +{ >> + int rc; >> + >> + rc = cxl_rch_map_ras(cxlds, parent_dport); >> + if (rc) >> + return rc; >> + >> + if (cxlds->rcd) >> + rch_disable_root_ints(cxlds->regs.aer); >> + >> + return rc; >> +} >> + >> static void cxl_setup_rcrb(struct cxl_dev_state *cxlds, >> struct cxl_dport *parent_dport) >> { >> @@ -91,6 +157,13 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd, >> >> cxl_setup_rcrb(cxlds, parent_dport); >> >> + rc = cxl_setup_ras(cxlds, parent_dport); >> + /* Continue with RAS setup errors */ >> + if (rc) >> + dev_warn(&cxlmd->dev, "CXL RAS setup failed: %d\n", rc); >> + else >> + dev_info(&cxlmd->dev, "CXL error handling enabled\n"); > > This feels a little noisy as something to add given we didn't shout about it for > non RCD cases (I think). Maybe a dev_dbg()? > Ok. Regards, Terry >> + >> endpoint = devm_cxl_add_port(host, &cxlmd->dev, cxlds->component_reg_phys, >> parent_dport); >> if (IS_ERR(endpoint)) >
> > >> + > >> + writel(aer_regs->uncor_status, aer_base + PCI_ERR_UNCOR_STATUS); > >> + writel(aer_regs->cor_status, aer_base + PCI_ERR_COR_STATUS); > >> + > >> + return true; > >> +} > > = > >> diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c > >> index bde1fffab09e..dfa6fcfc428a 100644 > >> --- a/drivers/cxl/core/regs.c > >> +++ b/drivers/cxl/core/regs.c > >> @@ -198,6 +198,7 @@ void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr, > >> > >> return ret_val; > >> } > >> +EXPORT_SYMBOL_NS_GPL(devm_cxl_iomap_block, CXL); > >> > >> int cxl_map_component_regs(struct device *dev, struct cxl_component_regs *regs, > >> struct cxl_register_map *map, unsigned long map_mask) > >> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h > >> index df64c402e6e6..dae3f141ffcb 100644 > >> --- a/drivers/cxl/cxl.h > >> +++ b/drivers/cxl/cxl.h > >> @@ -66,6 +66,8 @@ > >> #define CXL_DECODER_MIN_GRANULARITY 256 > >> #define CXL_DECODER_MAX_ENCODED_IG 6 > >> > >> +#define PCI_AER_CAPABILITY_LENGTH 56 > > > > Odd place to find a PCI specific define. Also a spec reference is > > always good for these. What's the the length of? PCI r6.0 has > > cap going up to address 0x5c so length 0x60. This seems to be igoring > > the header log register. > > > > This was to avoid including the TLP log at 0x38+. > > I can use sizeof(struct aer_capability_regs) or sizeof(*aer_regs) instead. > It's the same 38h(56) and will allow me to remove this #define in the > patchset revision. That works better than a define that people might think is more generic. Otherwise you get PCI_AER_CAP_WITHOUT_TLP_LOG_LENGTH or something equally horrible. (or define the TLP_LOG length as another define and subtract that?) >
Terry Bowman wrote: > RCH downstream port error logging is missing in the current CXL driver. The > missing AER and RAS error logging is needed for communicating driver error > details to userspace. Update the driver to include PCIe AER and CXL RAS > error logging. > > Add RCH downstream port error handling into the existing RCiEP handler. > The downstream port error handler is added to the RCiEP error handler > because the downstream port is implemented in a RCRB, is not PCI > enumerable, and as a result is not directly accessible to the PCI AER > root port driver. The AER root port driver calls the RCiEP handler for > handling RCD errors and RCH downstream port protocol errors. > > Update mem.c to include RAS and AER setup. This includes AER and RAS > capability discovery and mapping for later use in the error handler. > > Disable RCH downstream port's root port cmd interrupts.[1] > > Update existing RCiEP correctable and uncorrectable handlers to also call > the RCH handler. The RCH handler will read the RCH AER registers, check for > error severity, and if an error exists will log using an existing kernel > AER trace routine. The RCH handler will also log downstream port RAS errors > if they exist. I think this patch wants a lead in refactoring to move the existing probe of the CXL RAS capability into the cxl_port driver so that the RCH path and the VH path can be unified for register mapping and error handling invocation. I do not see a compelling rationale to have 2 separate ways to map the RAS capability. The timing of when cxl_setup_ras() is called looks problematic relative to when the first error handler callback might happen. For example what happens when an error fires after cxl_pci has registered its error handlers, but before the component registers have been mapped out of the RCRB? This implies the need for a callback for cxl_pci to notify the cxl_port driver of CXL errors to handle relative to a PCI AER event.
Hi Dan, I added comments inline below. On 4/17/23 19:06, Dan Williams wrote: > Terry Bowman wrote: >> RCH downstream port error logging is missing in the current CXL driver. The >> missing AER and RAS error logging is needed for communicating driver error >> details to userspace. Update the driver to include PCIe AER and CXL RAS >> error logging. >> >> Add RCH downstream port error handling into the existing RCiEP handler. >> The downstream port error handler is added to the RCiEP error handler >> because the downstream port is implemented in a RCRB, is not PCI >> enumerable, and as a result is not directly accessible to the PCI AER >> root port driver. The AER root port driver calls the RCiEP handler for >> handling RCD errors and RCH downstream port protocol errors. >> >> Update mem.c to include RAS and AER setup. This includes AER and RAS >> capability discovery and mapping for later use in the error handler. >> >> Disable RCH downstream port's root port cmd interrupts.[1] >> >> Update existing RCiEP correctable and uncorrectable handlers to also call >> the RCH handler. The RCH handler will read the RCH AER registers, check for >> error severity, and if an error exists will log using an existing kernel >> AER trace routine. The RCH handler will also log downstream port RAS errors >> if they exist. > > I think this patch wants a lead in refactoring to move the existing > probe of the CXL RAS capability into the cxl_port driver so that the RCH > path and the VH path can be unified for register mapping and error > handling invocation. I do not see a compelling rationale to have 2 > separate ways to map the RAS capability. The timing of when > cxl_setup_ras() is called looks problematic relative to when the first > error handler callback might happen. > With respect to timing, I see this works for probing AER and RAS. Will it work for caching the mapped AER and RAS addresses? I ask because the mapped AER and RAS addresses are stored in cxlds and cxlds is created in cxl_pci and isn't necessarily available during RCH dport discovery. RCH dport is discovered within cxl_acpi context (beginning from cxl_acpi_probe()). Also, port.c code shows cxlds is not typically used. If you like I can change RCH RAS mapping to use cxl_map_component_regs()? This was in cxl_rch_map_ras() to handle the RCH odd case for AER and RAS mapping. The RAS can be moved out but RCH AER would still need to be mapped presumably still in cxl_rch_map_ras(). > For example what happens when an error fires after cxl_pci has > registered its error handlers, but before the component registers have > been mapped out of the RCRB? > The RCiEP ISR would execute but the RCH AER and RAS would not be logged because neither are mapped and are instead NULL. The AER and RAS register status would stay resident and be logged in the next ISR entry. > This implies the need for a callback for cxl_pci to notify the cxl_port > driver of CXL errors to handle relative to a PCI AER event. Along similar lines, could the RCH AER and RAS status be checked immediately after mapping and logged if status is present? Regards, Terry
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c index 523d5b9fd7fc..d435ed2ff8b6 100644 --- a/drivers/cxl/core/pci.c +++ b/drivers/cxl/core/pci.c @@ -5,6 +5,7 @@ #include <linux/delay.h> #include <linux/pci.h> #include <linux/pci-doe.h> +#include <linux/aer.h> #include <cxlpci.h> #include <cxlmem.h> #include <cxl.h> @@ -613,32 +614,88 @@ void read_cdat_data(struct cxl_port *port) } EXPORT_SYMBOL_NS_GPL(read_cdat_data, CXL); -void cxl_cor_error_detected(struct pci_dev *pdev) +/* Get AER severity. Return false if there is no error. */ +static bool cxl_rch_get_aer_severity(struct aer_capability_regs *aer_regs, + int *severity) +{ + if (aer_regs->uncor_status & ~aer_regs->uncor_mask) { + if (aer_regs->uncor_status & PCI_ERR_ROOT_FATAL_RCV) + *severity = AER_FATAL; + else + *severity = AER_NONFATAL; + return true; + } + + if (aer_regs->cor_status & ~aer_regs->cor_mask) { + *severity = AER_CORRECTABLE; + return true; + } + + return false; +} + +/* + * Copy the AER capability registers to a buffer. This is necessary + * because RCRB AER capability is MMIO mapped. Clear the status + * after copying. + * + * @aer_base: base address of AER capability block in RCRB + * @aer_regs: destination for copying AER capability + */ +static bool cxl_rch_get_aer_info(void __iomem *aer_base, + struct aer_capability_regs *aer_regs) +{ + int read_cnt = PCI_AER_CAPABILITY_LENGTH / sizeof(u32); + u32 *aer_regs_buf = (u32 *)aer_regs; + int n; + + if (!aer_base) + return false; + + for (n = 0; n < read_cnt; n++) + aer_regs_buf[n] = readl(aer_base + n * sizeof(u32)); + + writel(aer_regs->uncor_status, aer_base + PCI_ERR_UNCOR_STATUS); + writel(aer_regs->cor_status, aer_base + PCI_ERR_COR_STATUS); + + return true; +} + +static void __cxl_log_correctable_ras(struct cxl_dev_state *cxlds, + void __iomem *ras_base) { - struct cxl_dev_state *cxlds = pci_get_drvdata(pdev); void __iomem *addr; u32 status; - if (!cxlds->regs.ras) + if (!ras_base) return; - addr = cxlds->regs.ras + CXL_RAS_CORRECTABLE_STATUS_OFFSET; + addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET; status = readl(addr); if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) { writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr); trace_cxl_aer_correctable_error(cxlds->cxlmd, status); } } -EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, CXL); + +static void cxl_log_correctable_ras_endpoint(struct cxl_dev_state *cxlds) +{ + return __cxl_log_correctable_ras(cxlds, cxlds->regs.ras); +} + +static void cxl_log_correctable_ras_dport(struct cxl_dev_state *cxlds) +{ + return __cxl_log_correctable_ras(cxlds, cxlds->regs.dport_ras); +} /* CXL spec rev3.0 8.2.4.16.1 */ -static void header_log_copy(struct cxl_dev_state *cxlds, u32 *log) +static void header_log_copy(void __iomem *ras_base, u32 *log) { void __iomem *addr; u32 *log_addr; int i, log_u32_size = CXL_HEADERLOG_SIZE / sizeof(u32); - addr = cxlds->regs.ras + CXL_RAS_HEADER_LOG_OFFSET; + addr = ras_base + CXL_RAS_HEADER_LOG_OFFSET; log_addr = log; for (i = 0; i < log_u32_size; i++) { @@ -652,17 +709,18 @@ static void header_log_copy(struct cxl_dev_state *cxlds, u32 *log) * Log the state of the RAS status registers and prepare them to log the * next error status. Return 1 if reset needed. */ -static bool cxl_report_and_clear(struct cxl_dev_state *cxlds) +static bool __cxl_report_and_clear(struct cxl_dev_state *cxlds, + void __iomem *ras_base) { u32 hl[CXL_HEADERLOG_SIZE_U32]; void __iomem *addr; u32 status; u32 fe; - if (!cxlds->regs.ras) + if (!ras_base) return false; - addr = cxlds->regs.ras + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET; + addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET; status = readl(addr); if (!(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK)) return false; @@ -670,7 +728,7 @@ static bool cxl_report_and_clear(struct cxl_dev_state *cxlds) /* If multiple errors, log header points to first error from ctrl reg */ if (hweight32(status) > 1) { void __iomem *rcc_addr = - cxlds->regs.ras + CXL_RAS_CAP_CONTROL_OFFSET; + ras_base + CXL_RAS_CAP_CONTROL_OFFSET; fe = BIT(FIELD_GET(CXL_RAS_CAP_CONTROL_FE_MASK, readl(rcc_addr))); @@ -678,13 +736,54 @@ static bool cxl_report_and_clear(struct cxl_dev_state *cxlds) fe = status; } - header_log_copy(cxlds, hl); + header_log_copy(ras_base, hl); trace_cxl_aer_uncorrectable_error(cxlds->cxlmd, status, fe, hl); writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr); return true; } +static bool cxl_report_and_clear(struct cxl_dev_state *cxlds) +{ + return __cxl_report_and_clear(cxlds, cxlds->regs.ras); +} + +static bool cxl_report_and_clear_dport(struct cxl_dev_state *cxlds) +{ + return __cxl_report_and_clear(cxlds, cxlds->regs.dport_ras); +} + +static void cxl_rch_log_error(struct cxl_dev_state *cxlds) +{ + struct pci_dev *pdev = to_pci_dev(cxlds->dev); + struct aer_capability_regs aer_regs; + int severity; + + if (!cxl_rch_get_aer_info(cxlds->regs.aer, &aer_regs)) + return; + + if (!cxl_rch_get_aer_severity(&aer_regs, &severity)) + return; + + cper_print_aer(pdev, severity, &aer_regs); + + if (severity == AER_CORRECTABLE) + cxl_log_correctable_ras_dport(cxlds); + else + cxl_report_and_clear_dport(cxlds); +} + +void cxl_cor_error_detected(struct pci_dev *pdev) +{ + struct cxl_dev_state *cxlds = pci_get_drvdata(pdev); + + if (cxlds->rcd) + cxl_rch_log_error(cxlds); + + cxl_log_correctable_ras_endpoint(cxlds); +} +EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, CXL); + pci_ers_result_t cxl_error_detected(struct pci_dev *pdev, pci_channel_state_t state) { @@ -693,6 +792,9 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev, struct device *dev = &cxlmd->dev; bool ue; + if (cxlds->rcd) + cxl_rch_log_error(cxlds); + /* * A frozen channel indicates an impending reset which is fatal to * CXL.mem operation, and will likely crash the system. On the off diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c index bde1fffab09e..dfa6fcfc428a 100644 --- a/drivers/cxl/core/regs.c +++ b/drivers/cxl/core/regs.c @@ -198,6 +198,7 @@ void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr, return ret_val; } +EXPORT_SYMBOL_NS_GPL(devm_cxl_iomap_block, CXL); int cxl_map_component_regs(struct device *dev, struct cxl_component_regs *regs, struct cxl_register_map *map, unsigned long map_mask) diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h index df64c402e6e6..dae3f141ffcb 100644 --- a/drivers/cxl/cxl.h +++ b/drivers/cxl/cxl.h @@ -66,6 +66,8 @@ #define CXL_DECODER_MIN_GRANULARITY 256 #define CXL_DECODER_MAX_ENCODED_IG 6 +#define PCI_AER_CAPABILITY_LENGTH 56 + static inline int cxl_hdm_decoder_count(u32 cap_hdr) { int val = FIELD_GET(CXL_HDM_DECODER_COUNT_MASK, cap_hdr); @@ -209,6 +211,15 @@ struct cxl_regs { struct_group_tagged(cxl_device_regs, device_regs, void __iomem *status, *mbox, *memdev; ); + + /* + * Pointer to RCH cxl_dport AER. (only for RCH/RCD mode) + * @dport_aer: CXL 2.0 12.2.11 RCH Downstream Port-detected Errors + */ + struct_group_tagged(cxl_rch_regs, rch_regs, + void __iomem *aer; + void __iomem *dport_ras; + ); }; struct cxl_reg_map { @@ -249,6 +260,8 @@ struct cxl_register_map { }; }; +void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr, + resource_size_t length); void cxl_probe_component_regs(struct device *dev, void __iomem *base, struct cxl_component_reg_map *map); void cxl_probe_device_regs(struct device *dev, void __iomem *base, diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c index 014295ab6bc6..dd5ae0a4560c 100644 --- a/drivers/cxl/mem.c +++ b/drivers/cxl/mem.c @@ -4,6 +4,7 @@ #include <linux/device.h> #include <linux/module.h> #include <linux/pci.h> +#include <linux/aer.h> #include "cxlmem.h" #include "cxlpci.h" @@ -45,6 +46,71 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data) return 0; } +static void rch_disable_root_ints(void __iomem *aer_base) +{ + u32 aer_cmd_mask, aer_cmd; + + /* + * Disable RCH root port command interrupts. + * CXL3.0 12.2.1.1 - RCH Downstream Port-detected Errors + */ + aer_cmd_mask = (PCI_ERR_ROOT_CMD_COR_EN | + PCI_ERR_ROOT_CMD_NONFATAL_EN | + PCI_ERR_ROOT_CMD_FATAL_EN); + aer_cmd = readl(aer_base + PCI_ERR_ROOT_COMMAND); + aer_cmd &= ~aer_cmd_mask; + writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND); +} + +static int cxl_rch_map_ras(struct cxl_dev_state *cxlds, + struct cxl_dport *parent_dport) +{ + struct device *dev = parent_dport->dport; + resource_size_t aer_phys, ras_phys; + void __iomem *aer, *dport_ras; + + if (!parent_dport->rch) + return 0; + + if (!parent_dport->aer_cap || !parent_dport->ras_cap || + parent_dport->component_reg_phys == CXL_RESOURCE_NONE) + return -ENODEV; + + aer_phys = parent_dport->aer_cap + parent_dport->rcrb; + aer = devm_cxl_iomap_block(dev, aer_phys, + PCI_AER_CAPABILITY_LENGTH); + + if (!aer) + return -ENOMEM; + + ras_phys = parent_dport->ras_cap + parent_dport->component_reg_phys; + dport_ras = devm_cxl_iomap_block(dev, ras_phys, + CXL_RAS_CAPABILITY_LENGTH); + + if (!dport_ras) + return -ENOMEM; + + cxlds->regs.aer = aer; + cxlds->regs.dport_ras = dport_ras; + + return 0; +} + +static int cxl_setup_ras(struct cxl_dev_state *cxlds, + struct cxl_dport *parent_dport) +{ + int rc; + + rc = cxl_rch_map_ras(cxlds, parent_dport); + if (rc) + return rc; + + if (cxlds->rcd) + rch_disable_root_ints(cxlds->regs.aer); + + return rc; +} + static void cxl_setup_rcrb(struct cxl_dev_state *cxlds, struct cxl_dport *parent_dport) { @@ -91,6 +157,13 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd, cxl_setup_rcrb(cxlds, parent_dport); + rc = cxl_setup_ras(cxlds, parent_dport); + /* Continue with RAS setup errors */ + if (rc) + dev_warn(&cxlmd->dev, "CXL RAS setup failed: %d\n", rc); + else + dev_info(&cxlmd->dev, "CXL error handling enabled\n"); + endpoint = devm_cxl_add_port(host, &cxlmd->dev, cxlds->component_reg_phys, parent_dport); if (IS_ERR(endpoint))