[v3,4/6] cxl/pci: Add RCH downstream port error logging

Message ID	20230411180302.2678736-5-terry.bowman@amd.com
State	Superseded
Headers	show Return-Path: <linux-cxl-owner@vger.kernel.org> Received-SPF: Pass (protection.outlook.com: domain of amd.com designates 165.204.84.17 as permitted sender) receiver=protection.outlook.com; client-ip=165.204.84.17; helo=SATLEXMB04.amd.com; pr=C From: Terry Bowman <terry.bowman@amd.com> To: <alison.schofield@intel.com>, <vishal.l.verma@intel.com>, <ira.weiny@intel.com>, <bwidawsk@kernel.org>, <dan.j.williams@intel.com>, <dave.jiang@intel.com>, <Jonathan.Cameron@huawei.com>, <linux-cxl@vger.kernel.org> CC: <terry.bowman@amd.com>, <rrichter@amd.com>, <linux-kernel@vger.kernel.org>, <bhelgaas@google.com> Subject: [PATCH v3 4/6] cxl/pci: Add RCH downstream port error logging Date: Tue, 11 Apr 2023 13:03:00 -0500 Message-ID: <20230411180302.2678736-5-terry.bowman@amd.com> In-Reply-To: <20230411180302.2678736-1-terry.bowman@amd.com> References: <20230411180302.2678736-1-terry.bowman@amd.com> MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Type: text/plain Precedence: bulk
Series	cxl/pci: Add support for RCH RAS error handling \| expand [v3,0/6] cxl/pci: Add support for RCH RAS error handling [v3,1/6] cxl/pci: Add RCH downstream port AER and RAS register discovery [v3,2/6] efi/cper: Export cper_mem_err_unpack() for use by modules [v3,3/6] PCI/AER: Export cper_print_aer() for use by modules [v3,4/6] cxl/pci: Add RCH downstream port error logging [v3,5/6] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler [v3,6/6] PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling

Bowman, Terry April 11, 2023, 6:03 p.m. UTC

RCH downstream port error logging is missing in the current CXL driver. The
missing AER and RAS error logging is needed for communicating driver error
details to userspace. Update the driver to include PCIe AER and CXL RAS
error logging.

Add RCH downstream port error handling into the existing RCiEP handler.
The downstream port error handler is added to the RCiEP error handler
because the downstream port is implemented in a RCRB, is not PCI
enumerable, and as a result is not directly accessible to the PCI AER
root port driver. The AER root port driver calls the RCiEP handler for
handling RCD errors and RCH downstream port protocol errors.

Update mem.c to include RAS and AER setup. This includes AER and RAS
capability discovery and mapping for later use in the error handler.

Disable RCH downstream port's root port cmd interrupts.[1]

Update existing RCiEP correctable and uncorrectable handlers to also call
the RCH handler. The RCH handler will read the RCH AER registers, check for
error severity, and if an error exists will log using an existing kernel
AER trace routine. The RCH handler will also log downstream port RAS errors
if they exist.

[1] CXL 3.0 Spec, 12.2.1.1 - RCH Downstream Port Detected Errors

Co-developed-by: Robert Richter <rrichter@amd.com>
Signed-off-by: Robert Richter <rrichter@amd.com>
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/pci.c  | 126 ++++++++++++++++++++++++++++++++++++----
 drivers/cxl/core/regs.c |   1 +
 drivers/cxl/cxl.h       |  13 +++++
 drivers/cxl/mem.c       |  73 +++++++++++++++++++++++
 4 files changed, 201 insertions(+), 12 deletions(-)

kernel test robot April 12, 2023, 1:32 a.m. UTC | #1

Hi Terry,

kernel test robot noticed the following build errors:

[auto build test ERROR on ca712e47054678c5ce93a0e0f686353ad5561195]

url:    https://github.com/intel-lab-lkp/linux/commits/Terry-Bowman/cxl-pci-Add-RCH-downstream-port-AER-and-RAS-register-discovery/20230412-020957
base:   ca712e47054678c5ce93a0e0f686353ad5561195
patch link:    https://lore.kernel.org/r/20230411180302.2678736-5-terry.bowman%40amd.com
patch subject: [PATCH v3 4/6] cxl/pci: Add RCH downstream port error logging
config: loongarch-buildonly-randconfig-r004-20230409 (https://download.01.org/0day-ci/archive/20230412/202304120926.dekDF6um-lkp@intel.com/config)
compiler: loongarch64-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/7f1c5cefb1e75bd709dc35c7f5e3e29dd5df65e1
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Terry-Bowman/cxl-pci-Add-RCH-downstream-port-AER-and-RAS-register-discovery/20230412-020957
        git checkout 7f1c5cefb1e75bd709dc35c7f5e3e29dd5df65e1
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=loongarch olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=loongarch SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304120926.dekDF6um-lkp@intel.com/

All errors (new ones prefixed by >>):

   loongarch64-linux-ld: drivers/cxl/core/pci.o: in function `cxl_rch_log_error':
   drivers/cxl/core/pci.c:768: undefined reference to `cper_print_aer'
>> loongarch64-linux-ld: drivers/cxl/core/pci.c:768: undefined reference to `cper_print_aer'
>> loongarch64-linux-ld: drivers/cxl/core/pci.c:768: undefined reference to `cper_print_aer'

kernel test robot April 12, 2023, 3:04 a.m. UTC | #2

Hi Terry,

kernel test robot noticed the following build errors:

[auto build test ERROR on ca712e47054678c5ce93a0e0f686353ad5561195]

url:    https://github.com/intel-lab-lkp/linux/commits/Terry-Bowman/cxl-pci-Add-RCH-downstream-port-AER-and-RAS-register-discovery/20230412-020957
base:   ca712e47054678c5ce93a0e0f686353ad5561195
patch link:    https://lore.kernel.org/r/20230411180302.2678736-5-terry.bowman%40amd.com
patch subject: [PATCH v3 4/6] cxl/pci: Add RCH downstream port error logging
config: riscv-randconfig-r014-20230410 (https://download.01.org/0day-ci/archive/20230412/202304121055.UceD86D7-lkp@intel.com/config)
compiler: riscv64-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/7f1c5cefb1e75bd709dc35c7f5e3e29dd5df65e1
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Terry-Bowman/cxl-pci-Add-RCH-downstream-port-AER-and-RAS-register-discovery/20230412-020957
        git checkout 7f1c5cefb1e75bd709dc35c7f5e3e29dd5df65e1
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=riscv olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=riscv SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304121055.UceD86D7-lkp@intel.com/

All errors (new ones prefixed by >>):

   riscv64-linux-ld: riscv64-linux-ld: DWARF error: could not find abbrev number 463040
   drivers/cxl/core/pci.o: in function `.L0 ':
   pci.c:(.text+0x1ae2): undefined reference to `cper_print_aer'
>> riscv64-linux-ld: pci.c:(.text+0x1afa): undefined reference to `cper_print_aer'

Jonathan Cameron April 13, 2023, 4:50 p.m. UTC | #3

On Tue, 11 Apr 2023 13:03:00 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> RCH downstream port error logging is missing in the current CXL driver. The
> missing AER and RAS error logging is needed for communicating driver error
> details to userspace. Update the driver to include PCIe AER and CXL RAS
> error logging.
> 
> Add RCH downstream port error handling into the existing RCiEP handler.
> The downstream port error handler is added to the RCiEP error handler
> because the downstream port is implemented in a RCRB, is not PCI
> enumerable, and as a result is not directly accessible to the PCI AER
> root port driver. The AER root port driver calls the RCiEP handler for
> handling RCD errors and RCH downstream port protocol errors.
> 
> Update mem.c to include RAS and AER setup. This includes AER and RAS
> capability discovery and mapping for later use in the error handler.
> 
> Disable RCH downstream port's root port cmd interrupts.[1]
> 
> Update existing RCiEP correctable and uncorrectable handlers to also call
> the RCH handler. The RCH handler will read the RCH AER registers, check for
> error severity, and if an error exists will log using an existing kernel
> AER trace routine. The RCH handler will also log downstream port RAS errors
> if they exist.
> 
> [1] CXL 3.0 Spec, 12.2.1.1 - RCH Downstream Port Detected Errors
> 
> Co-developed-by: Robert Richter <rrichter@amd.com>
> Signed-off-by: Robert Richter <rrichter@amd.com>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Some minor stuff inline. Looks fine to me otherwise.

I do find it a little confusing how often we go into an RCD or RCH specific
function then drop out directly for 2.0+ case, but you do seem to be consistent
with existing code so fair enough.

Jonathan

> ---
>  drivers/cxl/core/pci.c  | 126 ++++++++++++++++++++++++++++++++++++----
>  drivers/cxl/core/regs.c |   1 +
>  drivers/cxl/cxl.h       |  13 +++++
>  drivers/cxl/mem.c       |  73 +++++++++++++++++++++++
>  4 files changed, 201 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 523d5b9fd7fc..d435ed2ff8b6 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c


> +/*
> + * Copy the AER capability registers to a buffer. This is necessary
> + * because RCRB AER capability is MMIO mapped. Clear the status
> + * after copying.
> + *
> + * @aer_base: base address of AER capability block in RCRB
> + * @aer_regs: destination for copying AER capability
> + */
> +static bool cxl_rch_get_aer_info(void __iomem *aer_base,
> +				 struct aer_capability_regs *aer_regs)
> +{
> +	int read_cnt = PCI_AER_CAPABILITY_LENGTH / sizeof(u32);
> +	u32 *aer_regs_buf = (u32 *)aer_regs;
> +	int n;
> +
> +	if (!aer_base)
> +		return false;
> +
> +	for (n = 0; n < read_cnt; n++)
> +		aer_regs_buf[n] = readl(aer_base + n * sizeof(u32));

Maybe add a comment here on why memcpy_fromio() doesn't work for us.
I'm assuming we need these to definitely be 32bit reads.
Otherwise someone will 'optimize' it in future.

> +
> +	writel(aer_regs->uncor_status, aer_base + PCI_ERR_UNCOR_STATUS);
> +	writel(aer_regs->cor_status, aer_base + PCI_ERR_COR_STATUS);
> +
> +	return true;
> +}
=
> diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
> index bde1fffab09e..dfa6fcfc428a 100644
> --- a/drivers/cxl/core/regs.c
> +++ b/drivers/cxl/core/regs.c
> @@ -198,6 +198,7 @@ void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
>  
>  	return ret_val;
>  }
> +EXPORT_SYMBOL_NS_GPL(devm_cxl_iomap_block, CXL);
>  
>  int cxl_map_component_regs(struct device *dev, struct cxl_component_regs *regs,
>  			   struct cxl_register_map *map, unsigned long map_mask)
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index df64c402e6e6..dae3f141ffcb 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -66,6 +66,8 @@
>  #define CXL_DECODER_MIN_GRANULARITY 256
>  #define CXL_DECODER_MAX_ENCODED_IG 6
>  
> +#define PCI_AER_CAPABILITY_LENGTH 56

Odd place to find a PCI specific define. Also a spec reference is
always good for these.  What's the the length of? PCI r6.0 has
cap going up to address 0x5c  so length 0x60.  This seems to be igoring
the header log register.

> +
>  static inline int cxl_hdm_decoder_count(u32 cap_hdr)
>  {
>  	int val = FIELD_GET(CXL_HDM_DECODER_COUNT_MASK, cap_hdr);
> @@ -209,6 +211,15 @@ struct cxl_regs {
>  	struct_group_tagged(cxl_device_regs, device_regs,
>  		void __iomem *status, *mbox, *memdev;
>  	);
> +
> +	/*
> +	 * Pointer to RCH cxl_dport AER. (only for RCH/RCD mode)
> +	 * @dport_aer: CXL 2.0 12.2.11 RCH Downstream Port-detected Errors

As with other cases, I'd like full comments, so something for @aer as well.

> +	 */
> +	struct_group_tagged(cxl_rch_regs, rch_regs,
> +		void __iomem *aer;
> +		void __iomem *dport_ras;
> +	);
>  };
>  
>  struct cxl_reg_map {
> @@ -249,6 +260,8 @@ struct cxl_register_map {
>  	};
>  };
>  
> +void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
> +				   resource_size_t length);
>  void cxl_probe_component_regs(struct device *dev, void __iomem *base,
>  			      struct cxl_component_reg_map *map);
>  void cxl_probe_device_regs(struct device *dev, void __iomem *base,
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index 014295ab6bc6..dd5ae0a4560c 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -4,6 +4,7 @@
>  #include <linux/device.h>
>  #include <linux/module.h>
>  #include <linux/pci.h>
> +#include <linux/aer.h>
>  
>  #include "cxlmem.h"
>  #include "cxlpci.h"
> @@ -45,6 +46,71 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data)
>  	return 0;
>  }
>  
> +static void rch_disable_root_ints(void __iomem *aer_base)
> +{
> +	u32 aer_cmd_mask, aer_cmd;
> +
> +	/*
> +	 * Disable RCH root port command interrupts.
> +	 * CXL3.0 12.2.1.1 - RCH Downstream Port-detected Errors
> +	 */
> +	aer_cmd_mask = (PCI_ERR_ROOT_CMD_COR_EN |
> +			PCI_ERR_ROOT_CMD_NONFATAL_EN |
> +			PCI_ERR_ROOT_CMD_FATAL_EN);
> +	aer_cmd = readl(aer_base + PCI_ERR_ROOT_COMMAND);
> +	aer_cmd &= ~aer_cmd_mask;
> +	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);

Should we be touching these if firmware hasn't granted control to
the OS?  Description in the spec refers to 'software'. Is that
the kernel? No idea.  I guess this is safe even if it has already
been done. Perhaps a comment to say it should already be in this state?


> +}
> +
> +static int cxl_rch_map_ras(struct cxl_dev_state *cxlds,
> +			   struct cxl_dport *parent_dport)
> +{
> +	struct device *dev = parent_dport->dport;
> +	resource_size_t aer_phys, ras_phys;
> +	void __iomem *aer, *dport_ras;
> +
> +	if (!parent_dport->rch)
> +		return 0;
> +
> +	if (!parent_dport->aer_cap || !parent_dport->ras_cap ||
> +	    parent_dport->component_reg_phys == CXL_RESOURCE_NONE)
> +		return -ENODEV;
> +
> +	aer_phys = parent_dport->aer_cap + parent_dport->rcrb;
> +	aer = devm_cxl_iomap_block(dev, aer_phys,
> +				   PCI_AER_CAPABILITY_LENGTH);
> +
> +	if (!aer)
> +		return -ENOMEM;
> +
> +	ras_phys = parent_dport->ras_cap + parent_dport->component_reg_phys;
> +	dport_ras = devm_cxl_iomap_block(dev, ras_phys,
> +					 CXL_RAS_CAPABILITY_LENGTH);
> +
> +	if (!dport_ras)
> +		return -ENOMEM;
> +
> +	cxlds->regs.aer = aer;
> +	cxlds->regs.dport_ras = dport_ras;
> +
> +	return 0;
> +}
> +
> +static int cxl_setup_ras(struct cxl_dev_state *cxlds,
> +			 struct cxl_dport *parent_dport)
> +{
> +	int rc;
> +
> +	rc = cxl_rch_map_ras(cxlds, parent_dport);
> +	if (rc)
> +		return rc;
> +
> +	if (cxlds->rcd)
> +		rch_disable_root_ints(cxlds->regs.aer);
> +
> +	return rc;
> +}
> +
>  static void cxl_setup_rcrb(struct cxl_dev_state *cxlds,
>  			   struct cxl_dport *parent_dport)
>  {
> @@ -91,6 +157,13 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
>  
>  	cxl_setup_rcrb(cxlds, parent_dport);
>  
> +	rc = cxl_setup_ras(cxlds, parent_dport);
> +	/* Continue with RAS setup errors */
> +	if (rc)
> +		dev_warn(&cxlmd->dev, "CXL RAS setup failed: %d\n", rc);
> +	else
> +		dev_info(&cxlmd->dev, "CXL error handling enabled\n");

This feels a little noisy as something to add given we didn't shout about it for
non RCD cases (I think).  Maybe a dev_dbg()?

> +
>  	endpoint = devm_cxl_add_port(host, &cxlmd->dev, cxlds->component_reg_phys,
>  				     parent_dport);
>  	if (IS_ERR(endpoint))

Bowman, Terry April 14, 2023, 4:36 p.m. UTC | #4

Hi Jonathan,

I added responses inline below.

On 4/13/23 11:50, Jonathan Cameron wrote:
> On Tue, 11 Apr 2023 13:03:00 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
> 
>> RCH downstream port error logging is missing in the current CXL driver. The
>> missing AER and RAS error logging is needed for communicating driver error
>> details to userspace. Update the driver to include PCIe AER and CXL RAS
>> error logging.
>>
>> Add RCH downstream port error handling into the existing RCiEP handler.
>> The downstream port error handler is added to the RCiEP error handler
>> because the downstream port is implemented in a RCRB, is not PCI
>> enumerable, and as a result is not directly accessible to the PCI AER
>> root port driver. The AER root port driver calls the RCiEP handler for
>> handling RCD errors and RCH downstream port protocol errors.
>>
>> Update mem.c to include RAS and AER setup. This includes AER and RAS
>> capability discovery and mapping for later use in the error handler.
>>
>> Disable RCH downstream port's root port cmd interrupts.[1]
>>
>> Update existing RCiEP correctable and uncorrectable handlers to also call
>> the RCH handler. The RCH handler will read the RCH AER registers, check for
>> error severity, and if an error exists will log using an existing kernel
>> AER trace routine. The RCH handler will also log downstream port RAS errors
>> if they exist.
>>
>> [1] CXL 3.0 Spec, 12.2.1.1 - RCH Downstream Port Detected Errors
>>
>> Co-developed-by: Robert Richter <rrichter@amd.com>
>> Signed-off-by: Robert Richter <rrichter@amd.com>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Some minor stuff inline. Looks fine to me otherwise.
> 
> I do find it a little confusing how often we go into an RCD or RCH specific
> function then drop out directly for 2.0+ case, but you do seem to be consistent
> with existing code so fair enough.
> 
> Jonathan
> 

This was to simplify the code from the caller(s) perspective while also trying to 
generalize the logic. 

>> ---
>>  drivers/cxl/core/pci.c  | 126 ++++++++++++++++++++++++++++++++++++----
>>  drivers/cxl/core/regs.c |   1 +
>>  drivers/cxl/cxl.h       |  13 +++++
>>  drivers/cxl/mem.c       |  73 +++++++++++++++++++++++
>>  4 files changed, 201 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index 523d5b9fd7fc..d435ed2ff8b6 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
> 
> 
>> +/*
>> + * Copy the AER capability registers to a buffer. This is necessary
>> + * because RCRB AER capability is MMIO mapped. Clear the status
>> + * after copying.
>> + *
>> + * @aer_base: base address of AER capability block in RCRB
>> + * @aer_regs: destination for copying AER capability
>> + */
>> +static bool cxl_rch_get_aer_info(void __iomem *aer_base,
>> +				 struct aer_capability_regs *aer_regs)
>> +{
>> +	int read_cnt = PCI_AER_CAPABILITY_LENGTH / sizeof(u32);
>> +	u32 *aer_regs_buf = (u32 *)aer_regs;
>> +	int n;
>> +
>> +	if (!aer_base)
>> +		return false;
>> +
>> +	for (n = 0; n < read_cnt; n++)
>> +		aer_regs_buf[n] = readl(aer_base + n * sizeof(u32));
> 
> Maybe add a comment here on why memcpy_fromio() doesn't work for us.
> I'm assuming we need these to definitely be 32bit reads.
> Otherwise someone will 'optimize' it in future.
> 

Correct, this was to enforce 32-bit accesses. I will add a comment.

>> +
>> +	writel(aer_regs->uncor_status, aer_base + PCI_ERR_UNCOR_STATUS);
>> +	writel(aer_regs->cor_status, aer_base + PCI_ERR_COR_STATUS);
>> +
>> +	return true;
>> +}
> =
>> diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
>> index bde1fffab09e..dfa6fcfc428a 100644
>> --- a/drivers/cxl/core/regs.c
>> +++ b/drivers/cxl/core/regs.c
>> @@ -198,6 +198,7 @@ void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
>>  
>>  	return ret_val;
>>  }
>> +EXPORT_SYMBOL_NS_GPL(devm_cxl_iomap_block, CXL);
>>  
>>  int cxl_map_component_regs(struct device *dev, struct cxl_component_regs *regs,
>>  			   struct cxl_register_map *map, unsigned long map_mask)
>> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
>> index df64c402e6e6..dae3f141ffcb 100644
>> --- a/drivers/cxl/cxl.h
>> +++ b/drivers/cxl/cxl.h
>> @@ -66,6 +66,8 @@
>>  #define CXL_DECODER_MIN_GRANULARITY 256
>>  #define CXL_DECODER_MAX_ENCODED_IG 6
>>  
>> +#define PCI_AER_CAPABILITY_LENGTH 56
> 
> Odd place to find a PCI specific define. Also a spec reference is
> always good for these.  What's the the length of? PCI r6.0 has
> cap going up to address 0x5c  so length 0x60.  This seems to be igoring
> the header log register.
>

This was to avoid including the TLP log at 0x38+.

I can use sizeof(struct aer_capability_regs) or sizeof(*aer_regs) instead. 
It's the same 38h(56) and will allow me to remove this #define in the 
patchset revision.
 
>> +
>>  static inline int cxl_hdm_decoder_count(u32 cap_hdr)
>>  {
>>  	int val = FIELD_GET(CXL_HDM_DECODER_COUNT_MASK, cap_hdr);
>> @@ -209,6 +211,15 @@ struct cxl_regs {
>>  	struct_group_tagged(cxl_device_regs, device_regs,
>>  		void __iomem *status, *mbox, *memdev;
>>  	);
>> +
>> +	/*
>> +	 * Pointer to RCH cxl_dport AER. (only for RCH/RCD mode)
>> +	 * @dport_aer: CXL 2.0 12.2.11 RCH Downstream Port-detected Errors
> 
> As with other cases, I'd like full comments, so something for @aer as well.
> 
>> +	 */
>> +	struct_group_tagged(cxl_rch_regs, rch_regs,
>> +		void __iomem *aer;
>> +		void __iomem *dport_ras;
>> +	);
>>  };
>>  
>>  struct cxl_reg_map {
>> @@ -249,6 +260,8 @@ struct cxl_register_map {
>>  	};
>>  };
>>  
>> +void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
>> +				   resource_size_t length);
>>  void cxl_probe_component_regs(struct device *dev, void __iomem *base,
>>  			      struct cxl_component_reg_map *map);
>>  void cxl_probe_device_regs(struct device *dev, void __iomem *base,
>> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
>> index 014295ab6bc6..dd5ae0a4560c 100644
>> --- a/drivers/cxl/mem.c
>> +++ b/drivers/cxl/mem.c
>> @@ -4,6 +4,7 @@
>>  #include <linux/device.h>
>>  #include <linux/module.h>
>>  #include <linux/pci.h>
>> +#include <linux/aer.h>
>>  
>>  #include "cxlmem.h"
>>  #include "cxlpci.h"
>> @@ -45,6 +46,71 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data)
>>  	return 0;
>>  }
>>  
>> +static void rch_disable_root_ints(void __iomem *aer_base)
>> +{
>> +	u32 aer_cmd_mask, aer_cmd;
>> +
>> +	/*
>> +	 * Disable RCH root port command interrupts.
>> +	 * CXL3.0 12.2.1.1 - RCH Downstream Port-detected Errors
>> +	 */
>> +	aer_cmd_mask = (PCI_ERR_ROOT_CMD_COR_EN |
>> +			PCI_ERR_ROOT_CMD_NONFATAL_EN |
>> +			PCI_ERR_ROOT_CMD_FATAL_EN);
>> +	aer_cmd = readl(aer_base + PCI_ERR_ROOT_COMMAND);
>> +	aer_cmd &= ~aer_cmd_mask;
>> +	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
> 
> Should we be touching these if firmware hasn't granted control to
> the OS?  Description in the spec refers to 'software'. Is that
> the kernel? No idea.  I guess this is safe even if it has already
> been done. Perhaps a comment to say it should already be in this state?
> 
> 

These need to be disabled because the RCH shouldn't behave as a root 
port/RCEC generating interrupts as a result of correctable, fatal, or non-fatal
AER errors. I added this per the CXL3.0 spec but, as you mentioned, isn't 
likely necessary because they are disabled by default per PCI6.0.[1][2]

This would be the case for OS/native and HW/FW error reporting.

I'll add a comment stating it is already in this state. 

[1] CXL3.0 - 12.2.1.1 RCH Downstream Port-detected Errors
[2] PCI 6.0 - 7.8.4.9 Root Error Command Register (Offset 2Ch)

>> +}
>> +
>> +static int cxl_rch_map_ras(struct cxl_dev_state *cxlds,
>> +			   struct cxl_dport *parent_dport)
>> +{
>> +	struct device *dev = parent_dport->dport;
>> +	resource_size_t aer_phys, ras_phys;
>> +	void __iomem *aer, *dport_ras;
>> +
>> +	if (!parent_dport->rch)
>> +		return 0;
>> +
>> +	if (!parent_dport->aer_cap || !parent_dport->ras_cap ||
>> +	    parent_dport->component_reg_phys == CXL_RESOURCE_NONE)
>> +		return -ENODEV;
>> +
>> +	aer_phys = parent_dport->aer_cap + parent_dport->rcrb;
>> +	aer = devm_cxl_iomap_block(dev, aer_phys,
>> +				   PCI_AER_CAPABILITY_LENGTH);
>> +
>> +	if (!aer)
>> +		return -ENOMEM;
>> +
>> +	ras_phys = parent_dport->ras_cap + parent_dport->component_reg_phys;
>> +	dport_ras = devm_cxl_iomap_block(dev, ras_phys,
>> +					 CXL_RAS_CAPABILITY_LENGTH);
>> +
>> +	if (!dport_ras)
>> +		return -ENOMEM;
>> +
>> +	cxlds->regs.aer = aer;
>> +	cxlds->regs.dport_ras = dport_ras;
>> +
>> +	return 0;
>> +}
>> +
>> +static int cxl_setup_ras(struct cxl_dev_state *cxlds,
>> +			 struct cxl_dport *parent_dport)
>> +{
>> +	int rc;
>> +
>> +	rc = cxl_rch_map_ras(cxlds, parent_dport);
>> +	if (rc)
>> +		return rc;
>> +
>> +	if (cxlds->rcd)
>> +		rch_disable_root_ints(cxlds->regs.aer);
>> +
>> +	return rc;
>> +}
>> +
>>  static void cxl_setup_rcrb(struct cxl_dev_state *cxlds,
>>  			   struct cxl_dport *parent_dport)
>>  {
>> @@ -91,6 +157,13 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
>>  
>>  	cxl_setup_rcrb(cxlds, parent_dport);
>>  
>> +	rc = cxl_setup_ras(cxlds, parent_dport);
>> +	/* Continue with RAS setup errors */
>> +	if (rc)
>> +		dev_warn(&cxlmd->dev, "CXL RAS setup failed: %d\n", rc);
>> +	else
>> +		dev_info(&cxlmd->dev, "CXL error handling enabled\n");
> 
> This feels a little noisy as something to add given we didn't shout about it for
> non RCD cases (I think).  Maybe a dev_dbg()?
> 

Ok.

Regards,
Terry

>> +
>>  	endpoint = devm_cxl_add_port(host, &cxlmd->dev, cxlds->component_reg_phys,
>>  				     parent_dport);
>>  	if (IS_ERR(endpoint))
>

Jonathan Cameron April 17, 2023, 4:56 p.m. UTC | #5

> 
> >> +
> >> +	writel(aer_regs->uncor_status, aer_base + PCI_ERR_UNCOR_STATUS);
> >> +	writel(aer_regs->cor_status, aer_base + PCI_ERR_COR_STATUS);
> >> +
> >> +	return true;
> >> +}  
> > =  
> >> diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
> >> index bde1fffab09e..dfa6fcfc428a 100644
> >> --- a/drivers/cxl/core/regs.c
> >> +++ b/drivers/cxl/core/regs.c
> >> @@ -198,6 +198,7 @@ void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
> >>  
> >>  	return ret_val;
> >>  }
> >> +EXPORT_SYMBOL_NS_GPL(devm_cxl_iomap_block, CXL);
> >>  
> >>  int cxl_map_component_regs(struct device *dev, struct cxl_component_regs *regs,
> >>  			   struct cxl_register_map *map, unsigned long map_mask)
> >> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> >> index df64c402e6e6..dae3f141ffcb 100644
> >> --- a/drivers/cxl/cxl.h
> >> +++ b/drivers/cxl/cxl.h
> >> @@ -66,6 +66,8 @@
> >>  #define CXL_DECODER_MIN_GRANULARITY 256
> >>  #define CXL_DECODER_MAX_ENCODED_IG 6
> >>  
> >> +#define PCI_AER_CAPABILITY_LENGTH 56  
> > 
> > Odd place to find a PCI specific define. Also a spec reference is
> > always good for these.  What's the the length of? PCI r6.0 has
> > cap going up to address 0x5c  so length 0x60.  This seems to be igoring
> > the header log register.
> >  
> 
> This was to avoid including the TLP log at 0x38+.
> 
> I can use sizeof(struct aer_capability_regs) or sizeof(*aer_regs) instead. 
> It's the same 38h(56) and will allow me to remove this #define in the 
> patchset revision.

That works better than a define that people might think is more generic.
Otherwise you get PCI_AER_CAP_WITHOUT_TLP_LOG_LENGTH or
something equally horrible. (or define the TLP_LOG length as another
define and subtract that?)

>

Dan Williams April 18, 2023, 12:06 a.m. UTC | #6

Terry Bowman wrote:
> RCH downstream port error logging is missing in the current CXL driver. The
> missing AER and RAS error logging is needed for communicating driver error
> details to userspace. Update the driver to include PCIe AER and CXL RAS
> error logging.
> 
> Add RCH downstream port error handling into the existing RCiEP handler.
> The downstream port error handler is added to the RCiEP error handler
> because the downstream port is implemented in a RCRB, is not PCI
> enumerable, and as a result is not directly accessible to the PCI AER
> root port driver. The AER root port driver calls the RCiEP handler for
> handling RCD errors and RCH downstream port protocol errors.
> 
> Update mem.c to include RAS and AER setup. This includes AER and RAS
> capability discovery and mapping for later use in the error handler.
> 
> Disable RCH downstream port's root port cmd interrupts.[1]
> 
> Update existing RCiEP correctable and uncorrectable handlers to also call
> the RCH handler. The RCH handler will read the RCH AER registers, check for
> error severity, and if an error exists will log using an existing kernel
> AER trace routine. The RCH handler will also log downstream port RAS errors
> if they exist.

I think this patch wants a lead in refactoring to move the existing
probe of the CXL RAS capability into the cxl_port driver so that the RCH
path and the VH path can be unified for register mapping and error
handling invocation. I do not see a compelling rationale to have 2
separate ways to map the RAS capability. The timing of when
cxl_setup_ras() is called looks problematic relative to when the first
error handler callback might happen.

For example what happens when an error fires after cxl_pci has
registered its error handlers, but before the component registers have
been mapped out of the RCRB?

This implies the need for a callback for cxl_pci to notify the cxl_port
driver of CXL errors to handle relative to a PCI AER event.

Bowman, Terry April 24, 2023, 6:39 p.m. UTC | #7

Hi Dan,

I added comments inline below.

On 4/17/23 19:06, Dan Williams wrote:
> Terry Bowman wrote:
>> RCH downstream port error logging is missing in the current CXL driver. The
>> missing AER and RAS error logging is needed for communicating driver error
>> details to userspace. Update the driver to include PCIe AER and CXL RAS
>> error logging.
>>
>> Add RCH downstream port error handling into the existing RCiEP handler.
>> The downstream port error handler is added to the RCiEP error handler
>> because the downstream port is implemented in a RCRB, is not PCI
>> enumerable, and as a result is not directly accessible to the PCI AER
>> root port driver. The AER root port driver calls the RCiEP handler for
>> handling RCD errors and RCH downstream port protocol errors.
>>
>> Update mem.c to include RAS and AER setup. This includes AER and RAS
>> capability discovery and mapping for later use in the error handler.
>>
>> Disable RCH downstream port's root port cmd interrupts.[1]
>>
>> Update existing RCiEP correctable and uncorrectable handlers to also call
>> the RCH handler. The RCH handler will read the RCH AER registers, check for
>> error severity, and if an error exists will log using an existing kernel
>> AER trace routine. The RCH handler will also log downstream port RAS errors
>> if they exist.
> 
> I think this patch wants a lead in refactoring to move the existing
> probe of the CXL RAS capability into the cxl_port driver so that the RCH
> path and the VH path can be unified for register mapping and error
> handling invocation. I do not see a compelling rationale to have 2
> separate ways to map the RAS capability. The timing of when
> cxl_setup_ras() is called looks problematic relative to when the first
> error handler callback might happen.
> 

With respect to timing, I see this works for probing AER and RAS. Will it 
work for caching the mapped AER and RAS addresses? I ask because the mapped 
AER and RAS addresses are stored in cxlds and cxlds is created in cxl_pci 
and isn't necessarily available during RCH dport discovery. RCH dport is 
discovered within cxl_acpi context (beginning from cxl_acpi_probe()). Also, 
port.c code shows cxlds is not typically used. 

If you like I can change RCH RAS mapping to use cxl_map_component_regs()? This 
was in cxl_rch_map_ras() to handle the RCH odd case for AER and RAS mapping. 
The RAS can be moved out but RCH AER would still need to be mapped presumably still
in cxl_rch_map_ras().

> For example what happens when an error fires after cxl_pci has
> registered its error handlers, but before the component registers have
> been mapped out of the RCRB?
>

The RCiEP ISR would execute but the RCH AER and RAS would not be logged because 
neither are mapped and are instead NULL. The AER and RAS register status would stay 
resident and be logged in the next ISR entry. 

> This implies the need for a callback for cxl_pci to notify the cxl_port
> driver of CXL errors to handle relative to a PCI AER event.

Along similar lines, could the RCH AER and RAS status be checked immediately after 
mapping and logged if status is present?

Regards,
Terry

[v3,4/6] cxl/pci: Add RCH downstream port error logging

Commit Message

Comments

Patch