[v5,24/26] cxl/pci: Add RCH downstream port error logging

Message ID	20230607221651.2454764-25-terry.bowman@amd.com
State	Superseded
Series	cxl/pci: Add support for RCH RAS error handling
[v5,00/26] cxl/pci: Add support for RCH RAS error handling
[v5,01/26] cxl/acpi: Probe RCRB later during RCH downstream port creation
[v5,02/26] cxl/rch: Prepare for caching the MMIO mapped PCIe AER capability
[v5,03/26] cxl: Rename member @dport of struct cxl_dport to @dev
[v5,04/26] cxl/core/regs: Rename phys_addr in cxl_map_component_regs()
[v5,05/26] cxl/core/regs: Add @dev to cxl_register_map
[v5,06/26] cxl/pci: Refactor component register discovery for reuse
[v5,07/26] cxl/acpi: Moving add_host_bridge_uport() around
[v5,08/26] cxl/acpi: Directly bind the CEDT detected CHBCR to the Host Bridge's port
[v5,09/26] cxl/regs: Remove early capability checks in Component Register setup
[v5,10/26] cxl/mem: Prepare for early RCH dport component register setup
[v5,11/26] cxl/pci: Early setup RCH dport component registers from RCRB
[v5,12/26] cxl/port: Store the port's Component Register mappings in struct cxl_port
[v5,13/26] cxl/port: Store the downstream port's Component Register mappings in struct cxl_dport
[v5,14/26] cxl/pci: Store the endpoint's Component Register mappings in struct cxl_dev_state
[v5,15/26] cxl/hdm: Use stored Component Register mappings to map HDM decoder capability
[v5,16/26] cxl/port: Remove Component Register base address from struct cxl_port
[v5,17/26] cxl/port: Remove Component Register base address from struct cxl_dport
[v5,18/26] cxl/pci: Remove Component Register base address from struct cxl_dev_state
[v5,19/26] cxl/pci: Add RCH downstream port AER register discovery
[v5,20/26] PCI/AER: Refactor cper_print_aer() for use by CXL driver module
[v5,21/26] cxl/pci: Update CXL error logging to use RAS register address
[v5,22/26] cxl/pci: Map RCH downstream AER registers for logging protocol errors
[v5,23/26] cxl/pci: Disable root port interrupts in RCH mode
[v5,24/26] cxl/pci: Add RCH downstream port error logging
[v5,25/26] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler
[v5,26/26] PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling

Headers

Received-SPF: Pass (protection.outlook.com: domain of amd.com designates
 165.204.84.17 as permitted sender) receiver=protection.outlook.com;
 client-ip=165.204.84.17; helo=SATLEXMB04.amd.com; pr=C
From: Terry Bowman <terry.bowman@amd.com>
To: <alison.schofield@intel.com>, <vishal.l.verma@intel.com>,
        <ira.weiny@intel.com>, <bwidawsk@kernel.org>,
        <dan.j.williams@intel.com>, <dave.jiang@intel.com>,
        <Jonathan.Cameron@huawei.com>, <linux-cxl@vger.kernel.org>
CC: <terry.bowman@amd.com>, <rrichter@amd.com>,
        <linux-kernel@vger.kernel.org>, <bhelgaas@google.com>
Subject: [PATCH v5 24/26] cxl/pci: Add RCH downstream port error logging
Date: Wed, 7 Jun 2023 17:16:49 -0500
Message-ID: <20230607221651.2454764-25-terry.bowman@amd.com>
In-Reply-To: <20230607221651.2454764-1-terry.bowman@amd.com>
References: <20230607221651.2454764-1-terry.bowman@amd.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 07 Jun 2023 22:21:30.0668
 (UTC)
X-MS-Exchange-CrossTenant-Network-Message-Id: 
 876ec96e-e5ef-4eb3-f448-08db67a58af6
X-MS-Exchange-CrossTenant-Id: 3dd8961f-e488-4e60-8e11-a82d994e183d
X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: 
 TenantId=3dd8961f-e488-4e60-8e11-a82d994e183d;Ip=[165.204.84.17];Helo=[SATLEXMB04.amd.com]
X-MS-Exchange-CrossTenant-AuthSource: 
 CY4PEPF0000EE31.namprd05.prod.outlook.com
X-MS-Exchange-CrossTenant-AuthAs: Anonymous
X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem
X-MS-Exchange-Transport-CrossTenantHeadersStamped: CH0PR12MB5169
Precedence: bulk
List-ID: <linux-cxl.vger.kernel.org>
X-Mailing-List: linux-cxl@vger.kernel.org

Commit Message

Bowman, Terry June 7, 2023, 10:16 p.m. UTC

RCH downstream port error logging is missing in the current CXL driver. The
missing AER and RAS error logging is needed for communicating driver error
details to userspace. Update the driver to include PCIe AER and CXL RAS
error logging.

Add RCH downstream port error handling into the existing RCiEP handler.
The downstream port error handler is added to the RCiEP error handler
because the downstream port is implemented in a RCRB, is not PCI
enumerable, and as a result is not directly accessible to the PCI AER
root port driver. The AER root port driver calls the RCiEP handler for
handling RCD errors and RCH downstream port protocol errors.

Update existing RCiEP correctable and uncorrectable handlers to also call
the RCH handler. The RCH handler will read the RCH AER registers, check for
error severity, and if an error exists will log using an existing kernel
AER trace routine. The RCH handler will also log downstream port RAS errors
if they exist.

Co-developed-by: Robert Richter <rrichter@amd.com>
Signed-off-by: Robert Richter <rrichter@amd.com>
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/cxl/core/pci.c | 98 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 98 insertions(+)
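
The flow described above (read the AER registers, decide on a severity, then log) can be modeled in user space. The following is a minimal sketch that mirrors the severity test in the posted diff; it is not kernel code. The `sketch_` names, the enum, and the reduced register struct are hypothetical stand-ins, and the bit value used for the fatal test is assumed to match the kernel's `PCI_ERR_ROOT_FATAL_RCV`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the kernel's AER severity constants (names are hypothetical). */
enum sketch_severity { SKETCH_NONFATAL, SKETCH_FATAL, SKETCH_CORRECTABLE };

/* Bit tested by the posted patch; value assumed equal to PCI_ERR_ROOT_FATAL_RCV. */
#define SKETCH_FATAL_BIT 0x00000040u

/* Only the four fields the severity check looks at. */
struct sketch_aer_regs {
	uint32_t uncor_status;
	uint32_t uncor_mask;
	uint32_t cor_status;
	uint32_t cor_mask;
};

/* Return false if no unmasked error is pending, otherwise fill *severity. */
bool sketch_get_severity(const struct sketch_aer_regs *r,
			 enum sketch_severity *severity)
{
	assert(r && severity);

	/* Uncorrectable errors take priority over correctable ones. */
	if (r->uncor_status & ~r->uncor_mask) {
		*severity = (r->uncor_status & SKETCH_FATAL_BIT) ?
			    SKETCH_FATAL : SKETCH_NONFATAL;
		return true;
	}

	if (r->cor_status & ~r->cor_mask) {
		*severity = SKETCH_CORRECTABLE;
		return true;
	}

	return false;
}
```

A status bit whose corresponding mask bit is set is ignored, so a fully masked register reports no error even when status bits are latched.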

Comments

Dan Williams June 12, 2023, 9:38 p.m. UTC | #1

Terry Bowman wrote:
> RCH downstream port error logging is missing in the current CXL driver. The
> missing AER and RAS error logging is needed for communicating driver error
> details to userspace. Update the driver to include PCIe AER and CXL RAS
> error logging.
> 
> Add RCH downstream port error handling into the existing RCiEP handler.
> The downstream port error handler is added to the RCiEP error handler
> because the downstream port is implemented in a RCRB, is not PCI
> enumerable, and as a result is not directly accessible to the PCI AER
> root port driver. The AER root port driver calls the RCiEP handler for
> handling RCD errors and RCH downstream port protocol errors.
> 
> Update existing RCiEP correctable and uncorrectable handlers to also call
> the RCH handler. The RCH handler will read the RCH AER registers, check for
> error severity, and if an error exists will log using an existing kernel
> AER trace routine. The RCH handler will also log downstream port RAS errors
> if they exist.
> 
> Co-developed-by: Robert Richter <rrichter@amd.com>
> Signed-off-by: Robert Richter <rrichter@amd.com>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
>  drivers/cxl/core/pci.c | 98 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 98 insertions(+)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index def6ee5ab4f5..97886aacc64a 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -5,6 +5,7 @@
>  #include <linux/delay.h>
>  #include <linux/pci.h>
>  #include <linux/pci-doe.h>
> +#include <linux/aer.h>
>  #include <cxlpci.h>
>  #include <cxlmem.h>
>  #include <cxl.h>
> @@ -747,10 +748,105 @@ static bool cxl_report_and_clear(struct cxl_dev_state *cxlds)
>  	return __cxl_report_and_clear(cxlds, cxlds->regs.ras);
>  }
>  
> +#ifdef CONFIG_PCIEAER_CXL

A general reaction to the "ifdef in a .c file" style recommendation.
Maybe this section could move to a drivers/cxl/core/aer.c file, and be
optionally compiled by config in the Makefile? I.e. similar to:

cxl_core-$(CONFIG_TRACING) += trace.o
cxl_core-$(CONFIG_CXL_REGION) += region.o

...it is borderline just big enough, but I'll leave it up to you.
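
As an illustration of what that split usually implies on the C side, here is a minimal single-file sketch, under the assumption that callers see either a real prototype or an empty inline stub from a shared header. The `SKETCH_CONFIG_PCIEAER_CXL` macro and all `sketch_` names are hypothetical; in a real build the last `#if` would disappear because the Makefile would decide whether the file is compiled at all.

```c
#include <assert.h>

struct sketch_dev_state { int dport_errors_handled; };

/* Set to 0 to model a build with the option disabled. */
#ifndef SKETCH_CONFIG_PCIEAER_CXL
#define SKETCH_CONFIG_PCIEAER_CXL 1
#endif

/* What a shared header might carry: prototype or empty inline stub. */
#if SKETCH_CONFIG_PCIEAER_CXL
void sketch_handle_rch_dport_errors(struct sketch_dev_state *cxlds);
#else
static inline void sketch_handle_rch_dport_errors(struct sketch_dev_state *cxlds)
{
	(void)cxlds;
}
#endif

/* What the optionally compiled file might carry. */
#if SKETCH_CONFIG_PCIEAER_CXL
void sketch_handle_rch_dport_errors(struct sketch_dev_state *cxlds)
{
	assert(cxlds);
	cxlds->dport_errors_handled++;	/* stands in for the AER/RAS logging */
}
#endif

/* The caller stays free of conditional compilation. */
void sketch_cor_error_detected(struct sketch_dev_state *cxlds)
{
	sketch_handle_rch_dport_errors(cxlds);
}
```

With the option enabled the caller reaches the real handler; with it disabled the call compiles away to nothing.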

> +
> +static void cxl_log_correctable_ras_dport(struct cxl_dev_state *cxlds,
> +					  struct cxl_dport *dport)
> +{
> +	return __cxl_log_correctable_ras(cxlds, dport->regs.ras);
> +}
> +
> +static bool cxl_report_and_clear_dport(struct cxl_dev_state *cxlds,
> +				       struct cxl_dport *dport)
> +{
> +	return __cxl_report_and_clear(cxlds, dport->regs.ras);
> +}
> +
> +/*
> + * Copy the AER capability registers using 32 bit read accesses.
> + * This is necessary because RCRB AER capability is MMIO mapped. Clear the
> + * status after copying.
> + *
> + * @aer_base: base address of AER capability block in RCRB
> + * @aer_regs: destination for copying AER capability
> + */
> +static bool cxl_rch_get_aer_info(void __iomem *aer_base,
> +				 struct aer_capability_regs *aer_regs)
> +{
> +	int read_cnt = sizeof(struct aer_capability_regs) / sizeof(u32);
> +	u32 *aer_regs_buf = (u32 *)aer_regs;
> +	int n;
> +
> +	if (!aer_base)
> +		return false;
> +
> +	/* Use readl() to guarantee 32-bit accesses */
> +	for (n = 0; n < read_cnt; n++)
> +		aer_regs_buf[n] = readl(aer_base + n * sizeof(u32));
> +
> +	writel(aer_regs->uncor_status, aer_base + PCI_ERR_UNCOR_STATUS);
> +	writel(aer_regs->cor_status, aer_base + PCI_ERR_COR_STATUS);
> +
> +	return true;
> +}
> +
> +/* Get AER severity. Return false if there is no error. */
> +static bool cxl_rch_get_aer_severity(struct aer_capability_regs *aer_regs,
> +				     int *severity)
> +{
> +	if (aer_regs->uncor_status & ~aer_regs->uncor_mask) {
> +		if (aer_regs->uncor_status & PCI_ERR_ROOT_FATAL_RCV)
> +			*severity = AER_FATAL;
> +		else
> +			*severity = AER_NONFATAL;
> +		return true;
> +	}
> +
> +	if (aer_regs->cor_status & ~aer_regs->cor_mask) {
> +		*severity = AER_CORRECTABLE;
> +		return true;
> +	}
> +
> +	return false;
> +}
> +
> +static void cxl_handle_rch_dport_errors(struct cxl_dev_state *cxlds)
> +{
> +	struct pci_dev *pdev = to_pci_dev(cxlds->dev);
> +	struct aer_capability_regs aer_regs;
> +	struct cxl_dport *dport;
> +	int severity;
> +
> +	if (!cxlds->rcd)
> +		return;

Small quibble, but I think this check belongs in the caller.

> +
> +	if (!cxl_pci_find_port(pdev, &dport) || !dport->rch)
> +		return;

The reference for the @port return from cxl_pci_find_port() is leaked
here.

How can dport->rch be false while cxlds->rcd is true? Is that check
required?
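
The leak can be pictured with a small user-space model: a lookup that hands back an object with an extra reference must be paired with a release on every path after it succeeds. This is only a sketch; the `sketch_` types and helpers are hypothetical, and the assumption that the kernel-side release would be a `put_device()` on the port's device is not confirmed by this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_port { int refcount; };
struct sketch_dport { bool rch; };

/* Model of a lookup that returns the port with an extra reference held. */
struct sketch_port *sketch_find_port(struct sketch_port *port,
				     struct sketch_dport *candidate,
				     struct sketch_dport **dport)
{
	if (!port)
		return NULL;
	port->refcount++;
	*dport = candidate;
	return port;
}

/* Drop the reference taken by sketch_find_port(). */
void sketch_put_port(struct sketch_port *port)
{
	assert(port && port->refcount > 0);
	port->refcount--;
}

/* Returns true if the downstream port qualified for error handling. */
bool sketch_handle_dport_errors(struct sketch_port *registry,
				struct sketch_dport *candidate)
{
	struct sketch_dport *dport = NULL;
	struct sketch_port *port = sketch_find_port(registry, candidate, &dport);
	bool handled = false;

	if (!port)
		return false;

	if (dport && dport->rch)
		handled = true;	/* error logging would happen here */

	/* Single exit after a successful lookup: the reference is always dropped. */
	sketch_put_port(port);
	return handled;
}
```

Keeping the returned pointer in a local variable, rather than testing the call result inline, is what makes the release possible.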

> +
> +	if (!cxl_rch_get_aer_info(dport->regs.dport_aer, &aer_regs))
> +		return;
> +
> +	if (!cxl_rch_get_aer_severity(&aer_regs, &severity))
> +		return;
> +
> +	pci_print_aer(pdev, severity, &aer_regs);
> +
> +	if (severity == AER_CORRECTABLE)
> +		cxl_log_correctable_ras_dport(cxlds, dport);
> +	else
> +		cxl_report_and_clear_dport(cxlds, dport);

This is the code that made me go back and have heartburn about the
naming choices. I.e. would a casual reader assume that correctable
errors are not cleared, and that reporting is different than logging?

> +}
> +
> +#else
> +static void cxl_handle_rch_dport_errors(struct cxl_dev_state *cxlds) { }
> +#endif
> +
>  void cxl_cor_error_detected(struct pci_dev *pdev)
>  {
>  	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
>  
> +	cxl_handle_rch_dport_errors(cxlds);
> +
>  	cxl_log_correctable_ras_endpoint(cxlds);
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, CXL);
> @@ -763,6 +859,8 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>  	struct device *dev = &cxlmd->dev;
>  	bool ue;
>  
> +	cxl_handle_rch_dport_errors(cxlds);

Per above comment on "cxlds->rcd" conditional, it is mildly surprising
that even the VH path calls this helper.

Bowman, Terry June 16, 2023, 4:17 p.m. UTC | #2

Hi Dan,

I added responses below.

On 6/12/23 16:38, Dan Williams wrote:
> Terry Bowman wrote:
>> RCH downstream port error logging is missing in the current CXL driver. The
>> missing AER and RAS error logging is needed for communicating driver error
>> details to userspace. Update the driver to include PCIe AER and CXL RAS
>> error logging.
>>
>> Add RCH downstream port error handling into the existing RCiEP handler.
>> The downstream port error handler is added to the RCiEP error handler
>> because the downstream port is implemented in a RCRB, is not PCI
>> enumerable, and as a result is not directly accessible to the PCI AER
>> root port driver. The AER root port driver calls the RCiEP handler for
>> handling RCD errors and RCH downstream port protocol errors.
>>
>> Update existing RCiEP correctable and uncorrectable handlers to also call
>> the RCH handler. The RCH handler will read the RCH AER registers, check for
>> error severity, and if an error exists will log using an existing kernel
>> AER trace routine. The RCH handler will also log downstream port RAS errors
>> if they exist.
>>
>> Co-developed-by: Robert Richter <rrichter@amd.com>
>> Signed-off-by: Robert Richter <rrichter@amd.com>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> ---
>>  drivers/cxl/core/pci.c | 98 ++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 98 insertions(+)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index def6ee5ab4f5..97886aacc64a 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -5,6 +5,7 @@
>>  #include <linux/delay.h>
>>  #include <linux/pci.h>
>>  #include <linux/pci-doe.h>
>> +#include <linux/aer.h>
>>  #include <cxlpci.h>
>>  #include <cxlmem.h>
>>  #include <cxl.h>
>> @@ -747,10 +748,105 @@ static bool cxl_report_and_clear(struct cxl_dev_state *cxlds)
>>  	return __cxl_report_and_clear(cxlds, cxlds->regs.ras);
>>  }
>>  
>> +#ifdef CONFIG_PCIEAER_CXL
> 
> A general reaction to the "ifdef in a .c file" style recommendation.
> Maybe this section could move to a drivers/cxl/core/aer.c file, and be
> optionally compiled by config in the Makefile? I.e. similar to:
> 
> cxl_core-$(CONFIG_TRACING) += trace.o
> cxl_core-$(CONFIG_CXL_REGION) += region.o
> 
> ...it is borderline just big enough, but I'll leave it up to you.
> 


I'll take a look at this. We have most of the patchset requests implemented,
which will give me time to look at this.

>> +
>> +static void cxl_log_correctable_ras_dport(struct cxl_dev_state *cxlds,
>> +					  struct cxl_dport *dport)
>> +{
>> +	return __cxl_log_correctable_ras(cxlds, dport->regs.ras);
>> +}
>> +
>> +static bool cxl_report_and_clear_dport(struct cxl_dev_state *cxlds,
>> +				       struct cxl_dport *dport)
>> +{
>> +	return __cxl_report_and_clear(cxlds, dport->regs.ras);
>> +}
>> +
>> +/*
>> + * Copy the AER capability registers using 32 bit read accesses.
>> + * This is necessary because RCRB AER capability is MMIO mapped. Clear the
>> + * status after copying.
>> + *
>> + * @aer_base: base address of AER capability block in RCRB
>> + * @aer_regs: destination for copying AER capability
>> + */
>> +static bool cxl_rch_get_aer_info(void __iomem *aer_base,
>> +				 struct aer_capability_regs *aer_regs)
>> +{
>> +	int read_cnt = sizeof(struct aer_capability_regs) / sizeof(u32);
>> +	u32 *aer_regs_buf = (u32 *)aer_regs;
>> +	int n;
>> +
>> +	if (!aer_base)
>> +		return false;
>> +
>> +	/* Use readl() to guarantee 32-bit accesses */
>> +	for (n = 0; n < read_cnt; n++)
>> +		aer_regs_buf[n] = readl(aer_base + n * sizeof(u32));
>> +
>> +	writel(aer_regs->uncor_status, aer_base + PCI_ERR_UNCOR_STATUS);
>> +	writel(aer_regs->cor_status, aer_base + PCI_ERR_COR_STATUS);
>> +
>> +	return true;
>> +}
>> +
>> +/* Get AER severity. Return false if there is no error. */
>> +static bool cxl_rch_get_aer_severity(struct aer_capability_regs *aer_regs,
>> +				     int *severity)
>> +{
>> +	if (aer_regs->uncor_status & ~aer_regs->uncor_mask) {
>> +		if (aer_regs->uncor_status & PCI_ERR_ROOT_FATAL_RCV)
>> +			*severity = AER_FATAL;
>> +		else
>> +			*severity = AER_NONFATAL;
>> +		return true;
>> +	}
>> +
>> +	if (aer_regs->cor_status & ~aer_regs->cor_mask) {
>> +		*severity = AER_CORRECTABLE;
>> +		return true;
>> +	}
>> +
>> +	return false;
>> +}
>> +
>> +static void cxl_handle_rch_dport_errors(struct cxl_dev_state *cxlds)
>> +{
>> +	struct pci_dev *pdev = to_pci_dev(cxlds->dev);
>> +	struct aer_capability_regs aer_regs;
>> +	struct cxl_dport *dport;
>> +	int severity;
>> +
>> +	if (!cxlds->rcd)
>> +		return;
> 
> Small quibble, but I think this check belongs in the caller.
> 

Ok.

>> +
>> +	if (!cxl_pci_find_port(pdev, &dport) || !dport->rch)
>> +		return;
> 
> The reference for the @port return from cxl_pci_find_port() is leaked
> here.
> 
> How can dport->rch be false while cxlds->rcd is true? Is that check
> required?
> 

I will remove the rch check.

>> +
>> +	if (!cxl_rch_get_aer_info(dport->regs.dport_aer, &aer_regs))
>> +		return;
>> +
>> +	if (!cxl_rch_get_aer_severity(&aer_regs, &severity))
>> +		return;
>> +
>> +	pci_print_aer(pdev, severity, &aer_regs);
>> +
>> +	if (severity == AER_CORRECTABLE)
>> +		cxl_log_correctable_ras_dport(cxlds, dport);
>> +	else
>> +		cxl_report_and_clear_dport(cxlds, dport);
> 
> This is the code that made me go back and have heartburn about the
> naming choices. I.e. would a casual reader assume that correctable
> errors are not cleared, and that reporting is different than logging?
>

Yes, the names are ready for reworking. I have updated the functions to use 
consistent naming in the v6 patchset.
 
>> +}
>> +
>> +#else
>> +static void cxl_handle_rch_dport_errors(struct cxl_dev_state *cxlds) { }
>> +#endif
>> +
>>  void cxl_cor_error_detected(struct pci_dev *pdev)
>>  {
>>  	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
>>  
>> +	cxl_handle_rch_dport_errors(cxlds);
>> +
>>  	cxl_log_correctable_ras_endpoint(cxlds);
>>  }
>>  EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, CXL);
>> @@ -763,6 +859,8 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>>  	struct device *dev = &cxlmd->dev;
>>  	bool ue;
>>  
>> +	cxl_handle_rch_dport_errors(cxlds);
> 
> Per above comment on "cxlds->rcd" conditional, it is mildly surprising
> that even the VH path calls this helper.

The 'if (cxlds->rcd)' check will be moved here per your request above. Strictly speaking,
this is still in the VH path, but it is an improvement. This is really an endpoint path
for both RCH (RCD) and VH endpoints.

An alternative solution we considered was using a separate RCH dport error handler but 
that requires further AER port driver plumbing rework (for only CXL) or changing 
the assigned error handlers depending on RCH-VH mode at runtime. I spent time 
implementing and testing these options and we found it added significant complexity 
for a limited use case.
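
A rough user-space sketch of the caller-side check described above, assuming both endpoint handlers gate the RCH downstream port helper on the `rcd` flag. All `sketch_` names and the counters are hypothetical and exist only to make the control flow observable.

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_dev_state {
	bool rcd;		/* true for a restricted CXL device (RCH mode) */
	int dport_checks;	/* times the RCH dport helper ran */
	int endpoint_checks;	/* times endpoint handling ran */
};

/* Stands in for reading and logging the RCH downstream port errors. */
static void sketch_handle_rch_dport_errors(struct sketch_dev_state *cxlds)
{
	cxlds->dport_checks++;
}

/* Correctable path: shared by RCH (RCD) and VH endpoints. */
void sketch_cor_error_detected(struct sketch_dev_state *cxlds)
{
	assert(cxlds);
	if (cxlds->rcd)
		sketch_handle_rch_dport_errors(cxlds);
	cxlds->endpoint_checks++;
}

/* Uncorrectable path: same gating, same shared endpoint handling. */
void sketch_error_detected(struct sketch_dev_state *cxlds)
{
	assert(cxlds);
	if (cxlds->rcd)
		sketch_handle_rch_dport_errors(cxlds);
	cxlds->endpoint_checks++;
}
```

Both handlers still run for VH endpoints, but the downstream port helper is only entered in RCH mode.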

Regards,
Terry

Bowman, Terry June 16, 2023, 4:28 p.m. UTC | #3

Hi Dan,

On 6/16/23 11:17, Terry Bowman wrote:
> Hi Dan,
> 
> I added responses below.
> 
> On 6/12/23 16:38, Dan Williams wrote:
>> Terry Bowman wrote:
>>> RCH downstream port error logging is missing in the current CXL driver. The
>>> missing AER and RAS error logging is needed for communicating driver error
>>> details to userspace. Update the driver to include PCIe AER and CXL RAS
>>> error logging.
>>>
>>> Add RCH downstream port error handling into the existing RCiEP handler.
>>> The downstream port error handler is added to the RCiEP error handler
>>> because the downstream port is implemented in a RCRB, is not PCI
>>> enumerable, and as a result is not directly accessible to the PCI AER
>>> root port driver. The AER root port driver calls the RCiEP handler for
>>> handling RCD errors and RCH downstream port protocol errors.
>>>
>>> Update existing RCiEP correctable and uncorrectable handlers to also call
>>> the RCH handler. The RCH handler will read the RCH AER registers, check for
>>> error severity, and if an error exists will log using an existing kernel
>>> AER trace routine. The RCH handler will also log downstream port RAS errors
>>> if they exist.
>>>
>>> Co-developed-by: Robert Richter <rrichter@amd.com>
>>> Signed-off-by: Robert Richter <rrichter@amd.com>
>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>> ---
>>>  drivers/cxl/core/pci.c | 98 ++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 98 insertions(+)
>>>
>>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>>> index def6ee5ab4f5..97886aacc64a 100644
>>> --- a/drivers/cxl/core/pci.c
>>> +++ b/drivers/cxl/core/pci.c
>>> @@ -5,6 +5,7 @@
>>>  #include <linux/delay.h>
>>>  #include <linux/pci.h>
>>>  #include <linux/pci-doe.h>
>>> +#include <linux/aer.h>
>>>  #include <cxlpci.h>
>>>  #include <cxlmem.h>
>>>  #include <cxl.h>
>>> @@ -747,10 +748,105 @@ static bool cxl_report_and_clear(struct cxl_dev_state *cxlds)
>>>  	return __cxl_report_and_clear(cxlds, cxlds->regs.ras);
>>>  }
>>>  
>>> +#ifdef CONFIG_PCIEAER_CXL
>>
>> A general reaction to the "ifdef in a .c file" style recommendation.
>> Maybe this section could move to a drivers/cxl/core/aer.c file, and be
>> optionally compiled by config in the Makefile? I.e. similar to:
>>
>> cxl_core-$(CONFIG_TRACING) += trace.o
>> cxl_core-$(CONFIG_CXL_REGION) += region.o
>>
>> ...it is borderline just big enough, but I'll leave it up to you.
>>
> 
> 
> I'll take a look at this. We have most of the patchset requests implemented,
> which will give me time to look at this.
> 
>>> +
>>> +static void cxl_log_correctable_ras_dport(struct cxl_dev_state *cxlds,
>>> +					  struct cxl_dport *dport)
>>> +{
>>> +	return __cxl_log_correctable_ras(cxlds, dport->regs.ras);
>>> +}
>>> +
>>> +static bool cxl_report_and_clear_dport(struct cxl_dev_state *cxlds,
>>> +				       struct cxl_dport *dport)
>>> +{
>>> +	return __cxl_report_and_clear(cxlds, dport->regs.ras);
>>> +}
>>> +
>>> +/*
>>> + * Copy the AER capability registers using 32 bit read accesses.
>>> + * This is necessary because RCRB AER capability is MMIO mapped. Clear the
>>> + * status after copying.
>>> + *
>>> + * @aer_base: base address of AER capability block in RCRB
>>> + * @aer_regs: destination for copying AER capability
>>> + */
>>> +static bool cxl_rch_get_aer_info(void __iomem *aer_base,
>>> +				 struct aer_capability_regs *aer_regs)
>>> +{
>>> +	int read_cnt = sizeof(struct aer_capability_regs) / sizeof(u32);
>>> +	u32 *aer_regs_buf = (u32 *)aer_regs;
>>> +	int n;
>>> +
>>> +	if (!aer_base)
>>> +		return false;
>>> +
>>> +	/* Use readl() to guarantee 32-bit accesses */
>>> +	for (n = 0; n < read_cnt; n++)
>>> +		aer_regs_buf[n] = readl(aer_base + n * sizeof(u32));
>>> +
>>> +	writel(aer_regs->uncor_status, aer_base + PCI_ERR_UNCOR_STATUS);
>>> +	writel(aer_regs->cor_status, aer_base + PCI_ERR_COR_STATUS);
>>> +
>>> +	return true;
>>> +}
>>> +
>>> +/* Get AER severity. Return false if there is no error. */
>>> +static bool cxl_rch_get_aer_severity(struct aer_capability_regs *aer_regs,
>>> +				     int *severity)
>>> +{
>>> +	if (aer_regs->uncor_status & ~aer_regs->uncor_mask) {
>>> +		if (aer_regs->uncor_status & PCI_ERR_ROOT_FATAL_RCV)
>>> +			*severity = AER_FATAL;
>>> +		else
>>> +			*severity = AER_NONFATAL;
>>> +		return true;
>>> +	}
>>> +
>>> +	if (aer_regs->cor_status & ~aer_regs->cor_mask) {
>>> +		*severity = AER_CORRECTABLE;
>>> +		return true;
>>> +	}
>>> +
>>> +	return false;
>>> +}
>>> +
>>> +static void cxl_handle_rch_dport_errors(struct cxl_dev_state *cxlds)
>>> +{
>>> +	struct pci_dev *pdev = to_pci_dev(cxlds->dev);
>>> +	struct aer_capability_regs aer_regs;
>>> +	struct cxl_dport *dport;
>>> +	int severity;
>>> +
>>> +	if (!cxlds->rcd)
>>> +		return;
>>
>> Small quibble, but I think this check belongs in the caller.
>>
> 
> Ok.
> 
>>> +
>>> +	if (!cxl_pci_find_port(pdev, &dport) || !dport->rch)
>>> +		return;
>>
>> The reference for the @port return from cxl_pci_find_port() is leaked
>> here.
>>


I will address this as well. Thanks for pointing this out.

Regards,
Terry

Patch

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index def6ee5ab4f5..97886aacc64a 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -5,6 +5,7 @@ 
 #include <linux/delay.h>
 #include <linux/pci.h>
 #include <linux/pci-doe.h>
+#include <linux/aer.h>
 #include <cxlpci.h>
 #include <cxlmem.h>
 #include <cxl.h>
@@ -747,10 +748,105 @@  static bool cxl_report_and_clear(struct cxl_dev_state *cxlds)
 	return __cxl_report_and_clear(cxlds, cxlds->regs.ras);
 }
 
+#ifdef CONFIG_PCIEAER_CXL
+
+static void cxl_log_correctable_ras_dport(struct cxl_dev_state *cxlds,
+					  struct cxl_dport *dport)
+{
+	return __cxl_log_correctable_ras(cxlds, dport->regs.ras);
+}
+
+static bool cxl_report_and_clear_dport(struct cxl_dev_state *cxlds,
+				       struct cxl_dport *dport)
+{
+	return __cxl_report_and_clear(cxlds, dport->regs.ras);
+}
+
+/*
+ * Copy the AER capability registers using 32 bit read accesses.
+ * This is necessary because RCRB AER capability is MMIO mapped. Clear the
+ * status after copying.
+ *
+ * @aer_base: base address of AER capability block in RCRB
+ * @aer_regs: destination for copying AER capability
+ */
+static bool cxl_rch_get_aer_info(void __iomem *aer_base,
+				 struct aer_capability_regs *aer_regs)
+{
+	int read_cnt = sizeof(struct aer_capability_regs) / sizeof(u32);
+	u32 *aer_regs_buf = (u32 *)aer_regs;
+	int n;
+
+	if (!aer_base)
+		return false;
+
+	/* Use readl() to guarantee 32-bit accesses */
+	for (n = 0; n < read_cnt; n++)
+		aer_regs_buf[n] = readl(aer_base + n * sizeof(u32));
+
+	writel(aer_regs->uncor_status, aer_base + PCI_ERR_UNCOR_STATUS);
+	writel(aer_regs->cor_status, aer_base + PCI_ERR_COR_STATUS);
+
+	return true;
+}
+
+/* Get AER severity. Return false if there is no error. */
+static bool cxl_rch_get_aer_severity(struct aer_capability_regs *aer_regs,
+				     int *severity)
+{
+	if (aer_regs->uncor_status & ~aer_regs->uncor_mask) {
+		if (aer_regs->uncor_status & PCI_ERR_ROOT_FATAL_RCV)
+			*severity = AER_FATAL;
+		else
+			*severity = AER_NONFATAL;
+		return true;
+	}
+
+	if (aer_regs->cor_status & ~aer_regs->cor_mask) {
+		*severity = AER_CORRECTABLE;
+		return true;
+	}
+
+	return false;
+}
+
+static void cxl_handle_rch_dport_errors(struct cxl_dev_state *cxlds)
+{
+	struct pci_dev *pdev = to_pci_dev(cxlds->dev);
+	struct aer_capability_regs aer_regs;
+	struct cxl_dport *dport;
+	int severity;
+
+	if (!cxlds->rcd)
+		return;
+
+	if (!cxl_pci_find_port(pdev, &dport) || !dport->rch)
+		return;
+
+	if (!cxl_rch_get_aer_info(dport->regs.dport_aer, &aer_regs))
+		return;
+
+	if (!cxl_rch_get_aer_severity(&aer_regs, &severity))
+		return;
+
+	pci_print_aer(pdev, severity, &aer_regs);
+
+	if (severity == AER_CORRECTABLE)
+		cxl_log_correctable_ras_dport(cxlds, dport);
+	else
+		cxl_report_and_clear_dport(cxlds, dport);
+}
+
+#else
+static void cxl_handle_rch_dport_errors(struct cxl_dev_state *cxlds) { }
+#endif
+
 void cxl_cor_error_detected(struct pci_dev *pdev)
 {
 	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
 
+	cxl_handle_rch_dport_errors(cxlds);
+
 	cxl_log_correctable_ras_endpoint(cxlds);
 }
 EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, CXL);
@@ -763,6 +859,8 @@  pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
 	struct device *dev = &cxlmd->dev;
 	bool ue;
 
+	cxl_handle_rch_dport_errors(cxlds);
+
 	/*
 	 * A frozen channel indicates an impending reset which is fatal to
 	 * CXL.mem operation, and will likely crash the system. On the off
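
The snapshot-then-clear step in cxl_rch_get_aer_info() can be modeled outside the kernel. The sketch below assumes the AER status registers are write-1-to-clear and sit at byte offsets 0x04 (uncorrectable) and 0x10 (correctable) within the capability; the fake MMIO block and all `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_UNCOR_STATUS	0x04	/* byte offsets within the capability */
#define SKETCH_COR_STATUS	0x10
#define SKETCH_NREGS		8

/* Fake register block standing in for the MMIO-mapped AER capability. */
struct sketch_mmio { uint32_t reg[SKETCH_NREGS]; };

static int sketch_is_w1c(size_t off)
{
	return off == SKETCH_UNCOR_STATUS || off == SKETCH_COR_STATUS;
}

/* 32-bit read at a dword-aligned byte offset. */
uint32_t sketch_readl(const struct sketch_mmio *m, size_t off)
{
	assert(off % 4 == 0 && off / 4 < SKETCH_NREGS);
	return m->reg[off / 4];
}

/* 32-bit write; status registers clear the bits that are written as 1. */
void sketch_writel(struct sketch_mmio *m, uint32_t val, size_t off)
{
	assert(off % 4 == 0 && off / 4 < SKETCH_NREGS);
	if (sketch_is_w1c(off))
		m->reg[off / 4] &= ~val;
	else
		m->reg[off / 4] = val;
}

/* Copy every dword, then clear exactly the status bits that were copied. */
void sketch_snapshot_and_clear(struct sketch_mmio *m, uint32_t *out)
{
	for (size_t n = 0; n < SKETCH_NREGS; n++)
		out[n] = sketch_readl(m, n * sizeof(uint32_t));

	sketch_writel(m, out[SKETCH_UNCOR_STATUS / 4], SKETCH_UNCOR_STATUS);
	sketch_writel(m, out[SKETCH_COR_STATUS / 4], SKETCH_COR_STATUS);
}
```

Writing back the snapshot value, rather than all ones, means a status bit that latches after the copy stays set and is seen on the next pass.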
