diff mbox series

[v5,06/16] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe Port devices

Message ID 20250107143852.3692571-7-terry.bowman@amd.com
State New
Headers show
Series Enable CXL PCIe port protocol error handling and logging | expand

Commit Message

Bowman, Terry Jan. 7, 2025, 2:38 p.m. UTC
The AER service driver's aer_get_device_error_info() function doesn't read
uncorrectable (UCE) fatal error status from PCIe Upstream Port devices,
including CXL Upstream Switch Ports. As a result, fatal errors are not
logged or handled as needed for CXL PCIe Upstream Switch Port devices.

Update the aer_get_device_error_info() function to read the UCE fatal
status for all CXL PCIe devices. Make the change such that non-CXL devices
are not affected.

The fatal error status will be used in future patches implementing
CXL PCIe Port uncorrectable error handling and logging.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/pci/pcie/aer.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Jonathan Cameron Jan. 14, 2025, 11:32 a.m. UTC | #1
On Tue, 7 Jan 2025 08:38:42 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> The AER service driver's aer_get_device_error_info() function doesn't read
> uncorrectable (UCE) fatal error status from PCIe Upstream Port devices,
> including CXL Upstream Switch Ports. As a result, fatal errors are not
> logged or handled as needed for CXL PCIe Upstream Switch Port devices.
> 
> Update the aer_get_device_error_info() function to read the UCE fatal
> status for all CXL PCIe devices. Make the change such that non-CXL devices
> are not affected.
> 
> The fatal error status will be used in future patches implementing
> CXL PCIe Port uncorrectable error handling and logging.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

This clashes with Shuai's series adding link healthy checks.
Maybe we can reuse that logic to incorporate the condition we
care about here?


> ---
>  drivers/pci/pcie/aer.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 62be599e3bee..79c828bdcb6d 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1253,7 +1253,8 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
>  	} else if (type == PCI_EXP_TYPE_ROOT_PORT ||
>  		   type == PCI_EXP_TYPE_RC_EC ||
>  		   type == PCI_EXP_TYPE_DOWNSTREAM ||
> -		   info->severity == AER_NONFATAL) {
> +		   info->severity == AER_NONFATAL ||
> +		   (pcie_is_cxl(dev) && type == PCI_EXP_TYPE_UPSTREAM)) {
>  
>  		/* Link is still healthy for IO reads */
>  		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS,
Ira Weiny Jan. 14, 2025, 4:57 p.m. UTC | #2
Terry Bowman wrote:
> The AER service driver's aer_get_device_error_info() function doesn't read
> uncorrectable (UCE) fatal error status from PCIe Upstream Port devices,
> including CXL Upstream Switch Ports. As a result, fatal errors are not
> logged or handled as needed for CXL PCIe Upstream Switch Port devices.
> 
> Update the aer_get_device_error_info() function to read the UCE fatal
> status for all CXL PCIe devices. Make the change such that non-CXL devices
> are not affected.
> 
> The fatal error status will be used in future patches implementing
> CXL PCIe Port uncorrectable error handling and logging.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

[snip]
Bowman, Terry Jan. 14, 2025, 8:44 p.m. UTC | #3
On 1/14/2025 5:32 AM, Jonathan Cameron wrote:
> On Tue, 7 Jan 2025 08:38:42 -0600
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> The AER service driver's aer_get_device_error_info() function doesn't read
>> uncorrectable (UCE) fatal error status from PCIe Upstream Port devices,
>> including CXL Upstream Switch Ports. As a result, fatal errors are not
>> logged or handled as needed for CXL PCIe Upstream Switch Port devices.
>>
>> Update the aer_get_device_error_info() function to read the UCE fatal
>> status for all CXL PCIe devices. Make the change such that non-CXL devices
>> are not affected.
>>
>> The fatal error status will be used in future patches implementing
>> CXL PCIe Port uncorrectable error handling and logging.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> This clashes with Shuai's series adding link healthy checks.
> Maybe we can reuse that logic to incorporate the condition we
> care about here?
>

I'll add changes to query Upstream Port link status. I'll borrow from Shuai's patch.

Regards,
Terry

>> ---
>>  drivers/pci/pcie/aer.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 62be599e3bee..79c828bdcb6d 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -1253,7 +1253,8 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
>>  	} else if (type == PCI_EXP_TYPE_ROOT_PORT ||
>>  		   type == PCI_EXP_TYPE_RC_EC ||
>>  		   type == PCI_EXP_TYPE_DOWNSTREAM ||
>> -		   info->severity == AER_NONFATAL) {
>> +		   info->severity == AER_NONFATAL ||
>> +		   (pcie_is_cxl(dev) && type == PCI_EXP_TYPE_UPSTREAM)) {
>>  
>>  		/* Link is still healthy for IO reads */
>>  		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS,
diff mbox series

Patch

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 62be599e3bee..79c828bdcb6d 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1253,7 +1253,8 @@  int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
 	} else if (type == PCI_EXP_TYPE_ROOT_PORT ||
 		   type == PCI_EXP_TYPE_RC_EC ||
 		   type == PCI_EXP_TYPE_DOWNSTREAM ||
-		   info->severity == AER_NONFATAL) {
+		   info->severity == AER_NONFATAL ||
+		   (pcie_is_cxl(dev) && type == PCI_EXP_TYPE_UPSTREAM)) {
 
 		/* Link is still healthy for IO reads */
 		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS,