[RFC,0/6] Add support for root port RAS error handling

Message ID

20240313083602.239201-1-ming4.li@intel.com

Headers

Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.8])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 17E2B1B7E8;
	Wed, 13 Mar 2024 09:04:35 +0000 (UTC)
From: Li Ming <ming4.li@intel.com>
To: dan.j.williams@intel.com,
	rrichter@amd.com,
	terry.bowman@amd.com
Cc: linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	Li Ming <ming4.li@intel.com>
Subject: [RFC PATCH 0/6] Add support for root port RAS error handling
Date: Wed, 13 Mar 2024 08:35:56 +0000
Message-Id: <20240313083602.239201-1-ming4.li@intel.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

Add support for root port RAS error handling
[RFC,0/6] Add support for root port RAS error handling
[RFC,1/6] PCI/RCEC: Introduce pcie_walk_rcec_all()
[RFC,2/6] PCI/CXL: A new attribute to indicate CXL-capable host bridge
[RFC,3/6] PCI/AER: Enable RCEC to report internal error for CXL root port
[RFC,4/6] PCI/AER: Extend RCH RAS error handling to support VH topology case
[RFC,5/6] cxl: Use __free() for cxl_pci/mem_find_port() to drop put_device()
[RFC,6/6] cxl/pci: Support to handle root port RAS errors captured by RCEC

Message

Li, Ming4 March 13, 2024, 8:35 a.m. UTC

Protocol errors signaled to a CXL root port may be captured by a Root
Complex Event Collector (RCEC). If those errors are not cleared and
reported, the system owner loses forensic information for system failure
analysis.

Per CXL r3.1 section 9.18.1.5, the CXL specification's recommendation
for this case is the 'Else' statement in the 'IMPLEMENTATION NOTE' under
'Table 9-24 RDPAS Structure':

	"Probe all CXL Downstream Ports and determine whether they have logged an
	error in the CXL.io or CXL.cachemem status registers."
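
As a rough illustration of what that recommendation amounts to, the
following userspace sketch models the probe. The structure and function
names (cxl_dport_model, cxl_probe_dports) are hypothetical and are not
taken from this series; real code would read the AER and CXL RAS
capability registers of each port instead of plain struct fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a CXL downstream port; not a kernel structure. */
struct cxl_dport_model {
	uint32_t io_status;       /* stands in for the CXL.io (PCIe AER) status */
	uint32_t cachemem_status; /* stands in for the CXL.cachemem RAS status */
};

/*
 * Probe every port and record the index of each one that has logged an
 * error in either status register. Returns the number of such ports.
 * The caller provides a hits[] array with room for n entries.
 */
size_t cxl_probe_dports(const struct cxl_dport_model *ports, size_t n,
			size_t *hits)
{
	size_t i, nr_hits = 0;

	assert(ports != NULL && hits != NULL);

	for (i = 0; i < n; i++) {
		if (ports[i].io_status || ports[i].cachemem_status)
			hits[nr_hits++] = i;
	}

	return nr_hits;
}
```

The point of the sketch is only that every port is visited, since the
RCEC notification by itself does not identify the port that logged the
error.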

The CXL subsystem already supports RCH RAS error handling, which depends
on the RCEC. Reuse and extend that RCH topology support to handle
reported errors in the VH topology case. The implementation is
composed of:
* Provide a new interface on the RCEC side to walk all devices under
  the RCEC and its associated bus ranges. The PCIe AER core uses this
  interface to walk all CXL endpoints and all CXL root ports under the
  bus ranges.

* Update the PCIe AER core to enable reporting of Uncorrectable
  Internal Errors and Correctable Internal Errors for root ports.

* Invoke the cxl_pci error handler for RCEC reported errors.

* Handle root-port errors in the cxl_pci handler when the device is
  direct attached.
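
For the internal-error item in the list above, enabling the reporting
amounts to clearing the internal-error bits in the AER mask registers.
The sketch below models that in userspace. The two bit values are
believed to match PCI_ERR_UNC_INTN and PCI_ERR_COR_INTERNAL in
include/uapi/linux/pci_regs.h; struct aer_masks_model and
aer_unmask_internal_errors() are hypothetical stand-ins for the config
space accesses and are not code from this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_ERR_UNC_INTN     0x00400000u /* Uncorrectable Internal Error */
#define PCI_ERR_COR_INTERNAL 0x00004000u /* Corrected Internal Error */

/* Hypothetical model of the two AER mask registers of one port. */
struct aer_masks_model {
	uint32_t uncor_mask; /* Uncorrectable Error Mask register */
	uint32_t cor_mask;   /* Correctable Error Mask register */
};

/* A set mask bit suppresses reporting, so unmasking clears the bit. */
void aer_unmask_internal_errors(struct aer_masks_model *aer)
{
	assert(aer != NULL);

	aer->uncor_mask &= ~PCI_ERR_UNC_INTN;
	aer->cor_mask &= ~PCI_ERR_COR_INTERNAL;
}
```

All other mask bits are left untouched, so only internal-error
reporting changes.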

The implementation only covers the above case without a CXL switch. Two
open questions remain to be discussed:
1. Is it compatible with CXL switch port error handling?
The CXL switch port error handling proposal has not been finalized yet.
We should confirm that this implementation will be compatible with it.

2. How to handle the case where a CXL root port reports CXL.CM protocol
errors by itself?
This case is not supported in the patchset at present. My opinion is that
the cxl_pci handler should be invoked to deal with this case as well.

base-commit: 73bf93edeeea866b0b6efbc8d2595bdaaba7f1a5 branch: next

Li Ming (6):
  PCI/RCEC: Introduce pcie_walk_rcec_all()
  PCI/CXL: A new attribute to indicate if a host bridge is CXL capable
  PCI/AER: Enable RCEC to report internal error for CXL root port
  PCI/AER: Support to handle errors detected by CXL root port
  cxl: Use __free() for cxl_pci/mem_find_port() to drop put_device()
  cxl/pci: Add support for the RAS handling of RCEC captured errors on
    RP

 drivers/acpi/pci_root.c |  1 +
 drivers/cxl/core/pci.c  | 89 +++++++++++++++++++++++++++--------------
 drivers/cxl/core/port.c |  9 +++++
 drivers/cxl/cxl.h       |  2 +
 drivers/cxl/mem.c       |  5 +--
 drivers/cxl/pci.c       | 12 +++---
 drivers/pci/pci.h       |  6 +++
 drivers/pci/pcie/aer.c  | 44 +++++++++++++-------
 drivers/pci/pcie/rcec.c | 44 +++++++++++++++++++-
 include/linux/pci.h     |  1 +
 10 files changed, 155 insertions(+), 58 deletions(-)

Comments

Dan Williams March 15, 2024, 1:45 a.m. UTC | #1

Li Ming wrote:
> Protocol errors signaled to a CXL root port may be captured by a Root
> Complex Event Collector (RCEC). If those errors are not cleared and
> reported, the system owner loses forensic information for system failure
> analysis.
> 
> Per CXL r3.1 section 9.18.1.5, the CXL specification's recommendation
> for this case is the 'Else' statement in the 'IMPLEMENTATION NOTE' under
> 'Table 9-24 RDPAS Structure':
> 
> 	"Probe all CXL Downstream Ports and determine whether they have logged an
> 	error in the CXL.io or CXL.cachemem status registers."
> 
> The CXL subsystem already supports RCH RAS error handling, which depends
> on the RCEC. Reuse and extend that RCH topology support to handle
> reported errors in the VH topology case. The implementation is
> composed of:
> * Provide a new interface on the RCEC side to walk all devices under
>   the RCEC and its associated bus ranges. The PCIe AER core uses this
>   interface to walk all CXL endpoints and all CXL root ports under the
>   bus ranges.
> * Update the PCIe AER core to enable reporting of Uncorrectable
>   Internal Errors and Correctable Internal Errors for root ports.

Thanks for the above background.

> * Invoke the cxl_pci error handler for RCEC reported errors.

So what do you expect happens when a switch is involved? In the RCH case
it knows that the only thing that can fire RCEC is a root complex
integrated endpoint implementation driven by cxl_pci. In the VH case it
could be a switch.

> * Handle root-port errors in the cxl_pci handler when the device is
>   direct attached.

I do expect direct-attach to be a predominant use case, but I want to
make sure that the implementation at least does not make the switch port
error handling case more difficult to implement.

Li, Ming4 March 15, 2024, 8:40 a.m. UTC | #2

On 3/15/2024 9:45 AM, Dan Williams wrote:
> Li Ming wrote:
>> Protocol errors signaled to a CXL root port may be captured by a Root
>> Complex Event Collector (RCEC). If those errors are not cleared and
>> reported, the system owner loses forensic information for system failure
>> analysis.
>>
>> Per CXL r3.1 section 9.18.1.5, the CXL specification's recommendation
>> for this case is the 'Else' statement in the 'IMPLEMENTATION NOTE' under
>> 'Table 9-24 RDPAS Structure':
>>
>> 	"Probe all CXL Downstream Ports and determine whether they have logged an
>> 	error in the CXL.io or CXL.cachemem status registers."
>>
>> The CXL subsystem already supports RCH RAS error handling, which depends
>> on the RCEC. Reuse and extend that RCH topology support to handle
>> reported errors in the VH topology case. The implementation is
>> composed of:
>> * Provide a new interface on the RCEC side to walk all devices under
>>   the RCEC and its associated bus ranges. The PCIe AER core uses this
>>   interface to walk all CXL endpoints and all CXL root ports under the
>>   bus ranges.
>> * Update the PCIe AER core to enable reporting of Uncorrectable
>>   Internal Errors and Correctable Internal Errors for root ports.
> 
> Thanks for the above background.
> 
>> * Invoke the cxl_pci error handler for RCEC reported errors.
> 
> So what do you expect happens when a switch is involved? In the RCH case
> it knows that the only thing that can fire RCEC is a root complex
> integrated endpoint implementation driven by cxl_pci. In the VH case it
> could be a switch.
> 
>> * Handle root-port errors in the cxl_pci handler when the device is
>>   direct attached.
> 
> I do expect direct-attach to be a predominant use case, but I want to
> make sure that the implementation at least does not make the switch port
> error handling case more difficult to implement.

Hi Dan,

Currently, a rough idea I have is this:
If a CXL switch is connected to the CXL RP, there are two cases:
1. No CXL memory device is connected to the switch. In this case, I'm
not sure whether CXL.cachemem protocol errors can still occur between
the RP and the switch without a CXL memory device. If not, maybe we
don't need to consider this case?

2. A CXL memory device is connected to the switch. I think the cxl_pci
error handler could also help to handle CXL.cachemem protocol errors
that occur in the switch USP/DSP.

Dan Williams March 15, 2024, 6:21 p.m. UTC | #3

Li, Ming wrote:
[..]
> > I do expect direct-attach to be a predominant use case, but I want to
> > make sure that the implementation at least does not make the switch port
> > error handling case more difficult to implement.
> 
> Hi Dan,
> 
> Currently, a rough idea I have is this:
> If a CXL switch is connected to the CXL RP, there are two cases:
> 1. No CXL memory device is connected to the switch. In this case, I'm
> not sure whether CXL.cachemem protocol errors can still occur between
> the RP and the switch without a CXL memory device. If not, maybe we
> don't need to consider this case?

Protocol errors can happen between any 2 ports, just like PCI protocol
errors.

> 2. A CXL memory device is connected to the switch. I think the cxl_pci
> error handler could also help to handle CXL.cachemem protocol errors
> that occur in the switch USP/DSP.

No, for 2 reasons:

* The cxl_pci driver is only for general CXL type-3 memory
  expanders. Even though no CXL.cache devices have upstream drivers they
  do exist and they would experience protocol errors that the PCI core
  needs to consider.

* When a switch is present it is possible to have a protocol error
  between the switch upstream port and the root port, and not between
  the switch downstream port and the endpoint.

The more I think about it, I do not think it is appropriate for cxl_pci
to be involved in clearing root port errors in the VH case. It only
works in the RCH case because of the way the device and the root-port
get combined.

Li, Ming4 March 20, 2024, 12:48 p.m. UTC | #4

On 3/16/2024 2:21 AM, Dan Williams wrote:
> Li, Ming wrote:
> [..]
>>> I do expect direct-attach to be a predominant use case, but I want to
>>> make sure that the implementation at least does not make the switch port
>>> error handling case more difficult to implement.
>>
>> Hi Dan,
>>
>> Currently, a rough idea I have is this:
>> If a CXL switch is connected to the CXL RP, there are two cases:
>> 1. No CXL memory device is connected to the switch. In this case, I'm
>> not sure whether CXL.cachemem protocol errors can still occur between
>> the RP and the switch without a CXL memory device. If not, maybe we
>> don't need to consider this case?
> 
> Protocol errors can happen between any 2 ports, just like PCI protocol
> errors.

It seems that if we want to handle CXL protocol errors for this case on
the CXL driver side, we need some changes to CXL switch port/dport
enumeration. As I recall, the switch USP and DSP are enumerated only
when a CXL memory device is attached.

> 
>> 2. A CXL memory device is connected to the switch. I think the cxl_pci
>> error handler could also help to handle CXL.cachemem protocol errors
>> that occur in the switch USP/DSP.
> 
> No, for 2 reasons:
> 
> * The cxl_pci driver is only for general CXL type-3 memory
>   expanders. Even though no CXL.cache devices have upstream drivers they
>   do exist and they would experience protocol errors that the PCI core
>   needs to consider.
> 
> * When a switch is present it is possible to have a protocol error
>   between the switch upstream port and the root port, and not between
>   the switch downstream port and the endpoint.
> 
> The more I think about it, I do not think it is appropriate for cxl_pci
> to be involved in clearing root port errors in the VH case. It only
> works in the RCH case because of the way the device and the root-port
> get combined.

Thank you for your explanation.

I think I need some time to consider an appropriate proposal. Maybe we
can register a CXL error handler into struct pci_driver during CXL
port/dport enumeration for the root port and switch cases, and the PCIe
AER driver will invoke this registered CXL error handler to handle
errors on the CXL subsystem side. But I think this solution also runs
into the issue above (the current CXL subsystem won't enumerate CXL
switch ports/dports without a CXL mem device).
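
A minimal userspace sketch of that registration idea follows, to make
the shape of the proposal concrete. Everything in it is hypothetical:
the port kinds, the handler type, and the function names are made up
for illustration and are not an existing kernel API. In particular, it
uses a simple table instead of a struct pci_driver callback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical port kinds that could own a CXL error handler. */
enum cxl_port_kind {
	CXL_ROOT_PORT,
	CXL_SWITCH_USP,
	CXL_SWITCH_DSP,
	CXL_PORT_KIND_MAX,
};

/* Hypothetical handler type: takes a status value, returns a result code. */
typedef int (*cxl_err_handler_t)(unsigned int status);

static cxl_err_handler_t cxl_err_handlers[CXL_PORT_KIND_MAX];

/* Would be called during port/dport enumeration. Returns 0 on success. */
int cxl_register_err_handler(enum cxl_port_kind kind, cxl_err_handler_t fn)
{
	/* An out-of-range kind is a programming error in this sketch. */
	assert((unsigned int)kind < CXL_PORT_KIND_MAX);

	if (!fn)
		return -1;
	cxl_err_handlers[kind] = fn;
	return 0;
}

/*
 * What the AER side would call. Returns -1 when no handler is
 * registered, which is what happens for ports that were never
 * enumerated.
 */
int aer_dispatch_cxl_error(enum cxl_port_kind kind, unsigned int status)
{
	if ((unsigned int)kind >= CXL_PORT_KIND_MAX || !cxl_err_handlers[kind])
		return -1;
	return cxl_err_handlers[kind](status);
}

/* Sample handler: reports 1 if any error bit was logged, 0 otherwise. */
int cxl_sample_handler(unsigned int status)
{
	return status ? 1 : 0;
}
```

The -1 return from aer_dispatch_cxl_error() corresponds to the
enumeration issue mentioned above: a switch port that was never
enumerated has no handler registered, so its errors go unhandled.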