[RFC,1/6] PCI/RCEC: Introduce pcie_walk_rcec_all()

Message ID	20240313083602.239201-2-ming4.li@intel.com
State	New, archived
Headers	show Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.8]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9671B1B971; Wed, 13 Mar 2024 09:04:37 +0000 (UTC) From: Li Ming <ming4.li@intel.com> To: dan.j.williams@intel.com, rrichter@amd.com, terry.bowman@amd.com Cc: linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org, Li Ming <ming4.li@intel.com> Subject: [RFC PATCH 1/6] PCI/RCEC: Introduce pcie_walk_rcec_all() Date: Wed, 13 Mar 2024 08:35:57 +0000 Message-Id: <20240313083602.239201-2-ming4.li@intel.com> In-Reply-To: <20240313083602.239201-1-ming4.li@intel.com> References: <20240313083602.239201-1-ming4.li@intel.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	Add support for root port RAS error handling \| expand [RFC,0/6] Add support for root port RAS error handling [RFC,1/6] PCI/RCEC: Introduce pcie_walk_rcec_all() [RFC,2/6] PCI/CXL: A new attribute to indicate CXL-capable host bridge [RFC,3/6] PCI/AER: Enable RCEC to report internal error for CXL root port [RFC,4/6] PCI/AER: Extend RCH RAS error handling to support VH topology case [RFC,5/6] cxl: Use __free() for cxl_pci/mem_find_port() to drop put_device() [RFC,6/6] cxl/pci: Support to handle root port RAS errors captured by RCEC

Li, Ming4 March 13, 2024, 8:35 a.m. UTC

PCIe RCEC core only provides pcie_walk_rcec() to walk all RCiEP devices
associating with RCEC, but CXL subsystem needs a helper function which
can walk all devices in RCEC associated bus range other than RCiEPs for
below RAS error case.

CXL r3.1 section 12.2.2 mentions that the CXL.cachemem protocol errors
detected by a CXL root port could be logged in RCEC AER Extended
Capability. The recommendation solution from CXL r3.1 section 9.18.1.5
is:

	"Probe all CXL Downstream Ports and determine whether they have
	logged an error in the CXL.io or CXL.cachemem status registers."

The new helper function called pcie_walk_rcec_all(), CXL RAS error
handler can use it to locate all CXL root ports or CXL devices in RCEC
associated bus range.

Signed-off-by: Li Ming <ming4.li@intel.com>
---
 drivers/pci/pci.h       |  6 ++++++
 drivers/pci/pcie/rcec.c | 44 +++++++++++++++++++++++++++++++++++++++--
 2 files changed, 48 insertions(+), 2 deletions(-)

Terry Bowman March 25, 2024, 8:15 p.m. UTC | #1

Hi Li,

I added comments below.

On 3/13/24 03:35, Li Ming wrote:
> PCIe RCEC core only provides pcie_walk_rcec() to walk all RCiEP devices
> associating with RCEC, but CXL subsystem needs a helper function which
> can walk all devices in RCEC associated bus range other than RCiEPs for
> below RAS error case.
> 
> CXL r3.1 section 12.2.2 mentions that the CXL.cachemem protocol errors
> detected by a CXL root port could be logged in RCEC AER Extended
> Capability. The recommendation solution from CXL r3.1 section 9.18.1.5
> is:
> 
> 	"Probe all CXL Downstream Ports and determine whether they have
> 	logged an error in the CXL.io or CXL.cachemem status registers."
> 
> The new helper function called pcie_walk_rcec_all(), CXL RAS error
> handler can use it to locate all CXL root ports or CXL devices in RCEC
> associated bus range.

The RCEC-root port relation you mention is new to me. Typically, not in 
all cases, RCH-RCD has a RCEC. And a VH mode system has a root port 
instead. The RCH RCEC and VH root port are both bound to the PCIeport 
bus driver that supports handling and logging AER. This allows the PCIe 
port bus driver to handle AER in a RCEC and root port AER using the same 
procedure and accesses to the AER capability registers. 

This is oversimplified but are you looking to handle root port AER error 
in the RCEC from the below diagram? 

RCEC <--> CXL root port (bridge) <--> Endpoint

> 
> Signed-off-by: Li Ming <ming4.li@intel.com>
> ---
>  drivers/pci/pci.h       |  6 ++++++
>  drivers/pci/pcie/rcec.c | 44 +++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 48 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 5ecbcf041179..a068f2d7dd28 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -444,6 +444,9 @@ void pcie_link_rcec(struct pci_dev *rcec);
>  void pcie_walk_rcec(struct pci_dev *rcec,
>  		    int (*cb)(struct pci_dev *, void *),
>  		    void *userdata);
> +void pcie_walk_rcec_all(struct pci_dev *rcec,
> +			int (*cb)(struct pci_dev *, void *),
> +			void *userdata);
>  #else
>  static inline void pci_rcec_init(struct pci_dev *dev) { }
>  static inline void pci_rcec_exit(struct pci_dev *dev) { }
> @@ -451,6 +454,9 @@ static inline void pcie_link_rcec(struct pci_dev *rcec) { }
>  static inline void pcie_walk_rcec(struct pci_dev *rcec,
>  				  int (*cb)(struct pci_dev *, void *),
>  				  void *userdata) { }
> +static inline void pcie_walk_rcec_all(struct pci_dev *rcec,
> +				      int (*cb)(struct pci_dev *, void *),
> +				      void *userdata) { }
>  #endif
>  
>  #ifdef CONFIG_PCI_ATS
> diff --git a/drivers/pci/pcie/rcec.c b/drivers/pci/pcie/rcec.c
> index d0bcd141ac9c..189de280660c 100644
> --- a/drivers/pci/pcie/rcec.c
> +++ b/drivers/pci/pcie/rcec.c
> @@ -65,6 +65,15 @@ static int walk_rcec_helper(struct pci_dev *dev, void *data)
>  	return 0;
>  }
>  
> +static int walk_rcec_all_helper(struct pci_dev *dev, void *data)
> +{
> +	struct walk_rcec_data *rcec_data = data;
> +
> +	rcec_data->user_callback(dev, rcec_data->user_data);
> +
> +	return 0;
> +}
> +
>  static void walk_rcec(int (*cb)(struct pci_dev *dev, void *data),
>  		      void *userdata)
>  {
> @@ -83,7 +92,7 @@ static void walk_rcec(int (*cb)(struct pci_dev *dev, void *data),
>  	nextbusn = rcec->rcec_ea->nextbusn;
>  	lastbusn = rcec->rcec_ea->lastbusn;
>  
> -	/* All RCiEP devices are on the same bus as the RCEC */
> +	/* All devices are on the same bus as the RCEC */

RCiEPs are not guaranteed to be on same bus as RCEC. Details for associated 
next and last busses:

"This register does not indicate association between an Event Collector and 
any Function on the same Bus Number as the Event Collector itself, however 
it is permitted for the Association Bus Range to include the Bus Number of 
the Root Complex Event Collector."[1]

[1] PCI Spec 6.0 - RCEC Associated Bus Numbers Register 9Ofset 08h)


>  	if (nextbusn == 0xff && lastbusn == 0x00)
>  		return;
>  
> @@ -96,7 +105,7 @@ static void walk_rcec(int (*cb)(struct pci_dev *dev, void *data),
>  		if (!bus)
>  			continue;
>  
> -		/* Find RCiEP devices on the given bus ranges */
> +		/* Find devices on the given bus ranges */
>  		pci_walk_bus(bus, cb, rcec_data);
>  	}
>  }
> @@ -146,6 +155,37 @@ void pcie_walk_rcec(struct pci_dev *rcec, int (*cb)(struct pci_dev *, void *),
>  	walk_rcec(walk_rcec_helper, &rcec_data);
>  }
>  
> +/**
> + * pcie_walk_rcec_all - Walk all devices in RCEC's bus range.
> + * @rcec:	RCEC whose devices should be walked
> + * @cb:		Callback to be called for each device found
> + * @userdata:	Arbitrary pointer to be passed to callback
> + *
> + * It is implemented only for CXL cases.
> + * Per CXL r3.1 section 12.2.2, CXL protocol errors detected by
> + * CXL root port could be logged in an RCEC's AER Extended Capability.
> + * And per CXL r3.1 section 9.18.1.5, the recommendation is that probing
> + * all CXL root ports to determine whether they have logged an error.
> + * So provide this function for CXL to walk the given RCEC, CXL driver
> + * will figure out which CXL root ports detected errors.
> + *
> + * If @cb returns anything other than 0, break out.
> + */
> +void pcie_walk_rcec_all(struct pci_dev *rcec, int (*cb)(struct pci_dev *, void *),
> +			void *userdata)
> +{

Is this identical to pcie_walk_rcec()? Is this addition necessary?

Regards,
Terry

> +	struct walk_rcec_data rcec_data;
> +
> +	if (!rcec->rcec_ea)
> +		return;
> +
> +	rcec_data.rcec = rcec;
> +	rcec_data.user_callback = cb;
> +	rcec_data.user_data = userdata;
> +
> +	walk_rcec(walk_rcec_all_helper, &rcec_data);
> +}
> +
>  void pci_rcec_init(struct pci_dev *dev)
>  {
>  	struct rcec_ea *rcec_ea;

Dan Williams April 16, 2024, 4:39 a.m. UTC | #2

Terry Bowman wrote:
> Hi Li,
> 
> I added comments below.
> 
> On 3/13/24 03:35, Li Ming wrote:
> > PCIe RCEC core only provides pcie_walk_rcec() to walk all RCiEP devices
> > associating with RCEC, but CXL subsystem needs a helper function which
> > can walk all devices in RCEC associated bus range other than RCiEPs for
> > below RAS error case.
> > 
> > CXL r3.1 section 12.2.2 mentions that the CXL.cachemem protocol errors
> > detected by a CXL root port could be logged in RCEC AER Extended
> > Capability. The recommendation solution from CXL r3.1 section 9.18.1.5
> > is:
> > 
> > 	"Probe all CXL Downstream Ports and determine whether they have
> > 	logged an error in the CXL.io or CXL.cachemem status registers."
> > 
> > The new helper function called pcie_walk_rcec_all(), CXL RAS error
> > handler can use it to locate all CXL root ports or CXL devices in RCEC
> > associated bus range.
> 
> The RCEC-root port relation you mention is new to me. Typically, not in 
> all cases, RCH-RCD has a RCEC. And a VH mode system has a root port 
> instead. The RCH RCEC and VH root port are both bound to the PCIeport 
> bus driver that supports handling and logging AER. This allows the PCIe 
> port bus driver to handle AER in a RCEC and root port AER using the same 
> procedure and accesses to the AER capability registers. 
> 
> This is oversimplified but are you looking to handle root port AER error 
> in the RCEC from the below diagram? 
> 
> RCEC <--> CXL root port (bridge) <--> Endpoint
> 
> > 
> > Signed-off-by: Li Ming <ming4.li@intel.com>
> > ---
> >  drivers/pci/pci.h       |  6 ++++++
> >  drivers/pci/pcie/rcec.c | 44 +++++++++++++++++++++++++++++++++++++++--
> >  2 files changed, 48 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > index 5ecbcf041179..a068f2d7dd28 100644
> > --- a/drivers/pci/pci.h
> > +++ b/drivers/pci/pci.h
> > @@ -444,6 +444,9 @@ void pcie_link_rcec(struct pci_dev *rcec);
> >  void pcie_walk_rcec(struct pci_dev *rcec,
> >  		    int (*cb)(struct pci_dev *, void *),
> >  		    void *userdata);
> > +void pcie_walk_rcec_all(struct pci_dev *rcec,
> > +			int (*cb)(struct pci_dev *, void *),
> > +			void *userdata);
> >  #else
> >  static inline void pci_rcec_init(struct pci_dev *dev) { }
> >  static inline void pci_rcec_exit(struct pci_dev *dev) { }
> > @@ -451,6 +454,9 @@ static inline void pcie_link_rcec(struct pci_dev *rcec) { }
> >  static inline void pcie_walk_rcec(struct pci_dev *rcec,
> >  				  int (*cb)(struct pci_dev *, void *),
> >  				  void *userdata) { }
> > +static inline void pcie_walk_rcec_all(struct pci_dev *rcec,
> > +				      int (*cb)(struct pci_dev *, void *),
> > +				      void *userdata) { }
> >  #endif
> >  
> >  #ifdef CONFIG_PCI_ATS
> > diff --git a/drivers/pci/pcie/rcec.c b/drivers/pci/pcie/rcec.c
> > index d0bcd141ac9c..189de280660c 100644
> > --- a/drivers/pci/pcie/rcec.c
> > +++ b/drivers/pci/pcie/rcec.c
> > @@ -65,6 +65,15 @@ static int walk_rcec_helper(struct pci_dev *dev, void *data)
> >  	return 0;
> >  }
> >  
> > +static int walk_rcec_all_helper(struct pci_dev *dev, void *data)
> > +{
> > +	struct walk_rcec_data *rcec_data = data;
> > +
> > +	rcec_data->user_callback(dev, rcec_data->user_data);
> > +
> > +	return 0;
> > +}
> > +
> >  static void walk_rcec(int (*cb)(struct pci_dev *dev, void *data),
> >  		      void *userdata)
> >  {
> > @@ -83,7 +92,7 @@ static void walk_rcec(int (*cb)(struct pci_dev *dev, void *data),
> >  	nextbusn = rcec->rcec_ea->nextbusn;
> >  	lastbusn = rcec->rcec_ea->lastbusn;
> >  
> > -	/* All RCiEP devices are on the same bus as the RCEC */
> > +	/* All devices are on the same bus as the RCEC */
> 
> RCiEPs are not guaranteed to be on same bus as RCEC. Details for associated 
> next and last busses:
> 
> "This register does not indicate association between an Event Collector and 
> any Function on the same Bus Number as the Event Collector itself, however 
> it is permitted for the Association Bus Range to include the Bus Number of 
> the Root Complex Event Collector."[1]
> 
> [1] PCI Spec 6.0 - RCEC Associated Bus Numbers Register 9Ofset 08h)

Hi Terry,

This patchset is responding to the implications of the implementation
note in 9.18.1.5 RCEC Downstream Port Association Structure (RDPAS).
That says that CXL.io and CXL.cachemem errors in Root Ports may indeed
be signaled to an RCEC. Do you expect that implementation note to cause
any issues on platforms that do not follow that CXL spec behavior?

My expectation is that it may just cause extra polling for errors, but
not cause any harm.

Li, Ming4 April 16, 2024, 7:23 a.m. UTC | #3

On 3/26/2024 4:15 AM, Terry Bowman wrote:
> Hi Li,
> 
> I added comments below.
> 
> On 3/13/24 03:35, Li Ming wrote:
>> PCIe RCEC core only provides pcie_walk_rcec() to walk all RCiEP devices
>> associating with RCEC, but CXL subsystem needs a helper function which
>> can walk all devices in RCEC associated bus range other than RCiEPs for
>> below RAS error case.
>>
>> CXL r3.1 section 12.2.2 mentions that the CXL.cachemem protocol errors
>> detected by a CXL root port could be logged in RCEC AER Extended
>> Capability. The recommendation solution from CXL r3.1 section 9.18.1.5
>> is:
>>
>> 	"Probe all CXL Downstream Ports and determine whether they have
>> 	logged an error in the CXL.io or CXL.cachemem status registers."
>>
>> The new helper function called pcie_walk_rcec_all(), CXL RAS error
>> handler can use it to locate all CXL root ports or CXL devices in RCEC
>> associated bus range.
> 
> The RCEC-root port relation you mention is new to me. Typically, not in 
> all cases, RCH-RCD has a RCEC. And a VH mode system has a root port 
> instead. The RCH RCEC and VH root port are both bound to the PCIeport 
> bus driver that supports handling and logging AER. This allows the PCIe 
> port bus driver to handle AER in a RCEC and root port AER using the same 
> procedure and accesses to the AER capability registers. 
> 
> This is oversimplified but are you looking to handle root port AER error 
> in the RCEC from the below diagram? 
> 
> RCEC <--> CXL root port (bridge) <--> Endpoint

Hi Terry,

Thank you for your comments.
Yes, if CXL root port detected a CXL.cachemem protocol error, the error is possible to be logged in RCEC AER Extended Capability as Uncorrectable Internal Error or Correctable Internal Error.
But the cxl.cachemem protocol error is logged in the CXL root port RAS capability at the same time. So the implementation is not only handle Uncorrectable Internal Error/Correctable Internal Error on RCEC AER capability, but also find out the CXL root port reported this error and handle the RAS error in the CXL root port.



> 
>>
>> Signed-off-by: Li Ming <ming4.li@intel.com>
>> ---
>>  drivers/pci/pci.h       |  6 ++++++
>>  drivers/pci/pcie/rcec.c | 44 +++++++++++++++++++++++++++++++++++++++--
>>  2 files changed, 48 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index 5ecbcf041179..a068f2d7dd28 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -444,6 +444,9 @@ void pcie_link_rcec(struct pci_dev *rcec);
>>  void pcie_walk_rcec(struct pci_dev *rcec,
>>  		    int (*cb)(struct pci_dev *, void *),
>>  		    void *userdata);
>> +void pcie_walk_rcec_all(struct pci_dev *rcec,
>> +			int (*cb)(struct pci_dev *, void *),
>> +			void *userdata);
>>  #else
>>  static inline void pci_rcec_init(struct pci_dev *dev) { }
>>  static inline void pci_rcec_exit(struct pci_dev *dev) { }
>> @@ -451,6 +454,9 @@ static inline void pcie_link_rcec(struct pci_dev *rcec) { }
>>  static inline void pcie_walk_rcec(struct pci_dev *rcec,
>>  				  int (*cb)(struct pci_dev *, void *),
>>  				  void *userdata) { }
>> +static inline void pcie_walk_rcec_all(struct pci_dev *rcec,
>> +				      int (*cb)(struct pci_dev *, void *),
>> +				      void *userdata) { }
>>  #endif
>>  
>>  #ifdef CONFIG_PCI_ATS
>> diff --git a/drivers/pci/pcie/rcec.c b/drivers/pci/pcie/rcec.c
>> index d0bcd141ac9c..189de280660c 100644
>> --- a/drivers/pci/pcie/rcec.c
>> +++ b/drivers/pci/pcie/rcec.c
>> @@ -65,6 +65,15 @@ static int walk_rcec_helper(struct pci_dev *dev, void *data)
>>  	return 0;
>>  }
>>  
>> +static int walk_rcec_all_helper(struct pci_dev *dev, void *data)
>> +{
>> +	struct walk_rcec_data *rcec_data = data;
>> +
>> +	rcec_data->user_callback(dev, rcec_data->user_data);
>> +
>> +	return 0;
>> +}
>> +
>>  static void walk_rcec(int (*cb)(struct pci_dev *dev, void *data),
>>  		      void *userdata)
>>  {
>> @@ -83,7 +92,7 @@ static void walk_rcec(int (*cb)(struct pci_dev *dev, void *data),
>>  	nextbusn = rcec->rcec_ea->nextbusn;
>>  	lastbusn = rcec->rcec_ea->lastbusn;
>>  
>> -	/* All RCiEP devices are on the same bus as the RCEC */
>> +	/* All devices are on the same bus as the RCEC */
> 
> RCiEPs are not guaranteed to be on same bus as RCEC. Details for associated 
> next and last busses:
> 
> "This register does not indicate association between an Event Collector and 
> any Function on the same Bus Number as the Event Collector itself, however 
> it is permitted for the Association Bus Range to include the Bus Number of 
> the Root Complex Event Collector."[1]

I agree with you, the comment in code makes confusion, actually, walk_rcec() already walks the bus same as RCEC's and the bus range defined in RCEC Associated Bus Number Register.

> 
> [1] PCI Spec 6.0 - RCEC Associated Bus Numbers Register 9Ofset 08h)
> 
> 
>>  	if (nextbusn == 0xff && lastbusn == 0x00)
>>  		return;
>>  
>> @@ -96,7 +105,7 @@ static void walk_rcec(int (*cb)(struct pci_dev *dev, void *data),
>>  		if (!bus)
>>  			continue;
>>  
>> -		/* Find RCiEP devices on the given bus ranges */
>> +		/* Find devices on the given bus ranges */
>>  		pci_walk_bus(bus, cb, rcec_data);
>>  	}
>>  }
>> @@ -146,6 +155,37 @@ void pcie_walk_rcec(struct pci_dev *rcec, int (*cb)(struct pci_dev *, void *),
>>  	walk_rcec(walk_rcec_helper, &rcec_data);
>>  }
>>  
>> +/**
>> + * pcie_walk_rcec_all - Walk all devices in RCEC's bus range.
>> + * @rcec:	RCEC whose devices should be walked
>> + * @cb:		Callback to be called for each device found
>> + * @userdata:	Arbitrary pointer to be passed to callback
>> + *
>> + * It is implemented only for CXL cases.
>> + * Per CXL r3.1 section 12.2.2, CXL protocol errors detected by
>> + * CXL root port could be logged in an RCEC's AER Extended Capability.
>> + * And per CXL r3.1 section 9.18.1.5, the recommendation is that probing
>> + * all CXL root ports to determine whether they have logged an error.
>> + * So provide this function for CXL to walk the given RCEC, CXL driver
>> + * will figure out which CXL root ports detected errors.
>> + *
>> + * If @cb returns anything other than 0, break out.
>> + */
>> +void pcie_walk_rcec_all(struct pci_dev *rcec, int (*cb)(struct pci_dev *, void *),
>> +			void *userdata)
>> +{
> 
> Is this identical to pcie_walk_rcec()? Is this addition necessary?
> 

pcie_walk_rcec() only walks all RCiEPs in the RCEC associated bus range, 
My implementation needs to walk all cxl root ports in the RCEC associated bus range to handle CXL root port error logged in RCEC AER Extended Capability case. 

Thanks
Ming

> Regards,
> Terry
> 
>> +	struct walk_rcec_data rcec_data;
>> +
>> +	if (!rcec->rcec_ea)
>> +		return;
>> +
>> +	rcec_data.rcec = rcec;
>> +	rcec_data.user_callback = cb;
>> +	rcec_data.user_data = userdata;
>> +
>> +	walk_rcec(walk_rcec_all_helper, &rcec_data);
>> +}
>> +
>>  void pci_rcec_init(struct pci_dev *dev)
>>  {
>>  	struct rcec_ea *rcec_ea;

Terry Bowman April 22, 2024, 2:34 p.m. UTC | #4

Hi Dan,

On 4/15/24 23:39, Dan Williams wrote:
> Terry Bowman wrote:
>> Hi Li,
>>
>> I added comments below.
>>
>> On 3/13/24 03:35, Li Ming wrote:
>>> PCIe RCEC core only provides pcie_walk_rcec() to walk all RCiEP devices
>>> associating with RCEC, but CXL subsystem needs a helper function which
>>> can walk all devices in RCEC associated bus range other than RCiEPs for
>>> below RAS error case.
>>>
>>> CXL r3.1 section 12.2.2 mentions that the CXL.cachemem protocol errors
>>> detected by a CXL root port could be logged in RCEC AER Extended
>>> Capability. The recommendation solution from CXL r3.1 section 9.18.1.5
>>> is:
>>>
>>> 	"Probe all CXL Downstream Ports and determine whether they have
>>> 	logged an error in the CXL.io or CXL.cachemem status registers."
>>>
>>> The new helper function called pcie_walk_rcec_all(), CXL RAS error
>>> handler can use it to locate all CXL root ports or CXL devices in RCEC
>>> associated bus range.
>>
>> The RCEC-root port relation you mention is new to me. Typically, not in 
>> all cases, RCH-RCD has a RCEC. And a VH mode system has a root port 
>> instead. The RCH RCEC and VH root port are both bound to the PCIeport 
>> bus driver that supports handling and logging AER. This allows the PCIe 
>> port bus driver to handle AER in a RCEC and root port AER using the same 
>> procedure and accesses to the AER capability registers. 
>>
>> This is oversimplified but are you looking to handle root port AER error 
>> in the RCEC from the below diagram? 
>>
>> RCEC <--> CXL root port (bridge) <--> Endpoint
>>
>>>
>>> Signed-off-by: Li Ming <ming4.li@intel.com>
>>> ---
>>>  drivers/pci/pci.h       |  6 ++++++
>>>  drivers/pci/pcie/rcec.c | 44 +++++++++++++++++++++++++++++++++++++++--
>>>  2 files changed, 48 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>>> index 5ecbcf041179..a068f2d7dd28 100644
>>> --- a/drivers/pci/pci.h
>>> +++ b/drivers/pci/pci.h
>>> @@ -444,6 +444,9 @@ void pcie_link_rcec(struct pci_dev *rcec);
>>>  void pcie_walk_rcec(struct pci_dev *rcec,
>>>  		    int (*cb)(struct pci_dev *, void *),
>>>  		    void *userdata);
>>> +void pcie_walk_rcec_all(struct pci_dev *rcec,
>>> +			int (*cb)(struct pci_dev *, void *),
>>> +			void *userdata);
>>>  #else
>>>  static inline void pci_rcec_init(struct pci_dev *dev) { }
>>>  static inline void pci_rcec_exit(struct pci_dev *dev) { }
>>> @@ -451,6 +454,9 @@ static inline void pcie_link_rcec(struct pci_dev *rcec) { }
>>>  static inline void pcie_walk_rcec(struct pci_dev *rcec,
>>>  				  int (*cb)(struct pci_dev *, void *),
>>>  				  void *userdata) { }
>>> +static inline void pcie_walk_rcec_all(struct pci_dev *rcec,
>>> +				      int (*cb)(struct pci_dev *, void *),
>>> +				      void *userdata) { }
>>>  #endif
>>>  
>>>  #ifdef CONFIG_PCI_ATS
>>> diff --git a/drivers/pci/pcie/rcec.c b/drivers/pci/pcie/rcec.c
>>> index d0bcd141ac9c..189de280660c 100644
>>> --- a/drivers/pci/pcie/rcec.c
>>> +++ b/drivers/pci/pcie/rcec.c
>>> @@ -65,6 +65,15 @@ static int walk_rcec_helper(struct pci_dev *dev, void *data)
>>>  	return 0;
>>>  }
>>>  
>>> +static int walk_rcec_all_helper(struct pci_dev *dev, void *data)
>>> +{
>>> +	struct walk_rcec_data *rcec_data = data;
>>> +
>>> +	rcec_data->user_callback(dev, rcec_data->user_data);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>>  static void walk_rcec(int (*cb)(struct pci_dev *dev, void *data),
>>>  		      void *userdata)
>>>  {
>>> @@ -83,7 +92,7 @@ static void walk_rcec(int (*cb)(struct pci_dev *dev, void *data),
>>>  	nextbusn = rcec->rcec_ea->nextbusn;
>>>  	lastbusn = rcec->rcec_ea->lastbusn;
>>>  
>>> -	/* All RCiEP devices are on the same bus as the RCEC */
>>> +	/* All devices are on the same bus as the RCEC */
>>
>> RCiEPs are not guaranteed to be on same bus as RCEC. Details for associated 
>> next and last busses:
>>
>> "This register does not indicate association between an Event Collector and 
>> any Function on the same Bus Number as the Event Collector itself, however 
>> it is permitted for the Association Bus Range to include the Bus Number of 
>> the Root Complex Event Collector."[1]
>>
>> [1] PCI Spec 6.0 - RCEC Associated Bus Numbers Register 9Ofset 08h)
> 
> Hi Terry,
> 
> This patchset is responding to the implications of the implementation
> note in 9.18.1.5 RCEC Downstream Port Association Structure (RDPAS).
> That says that CXL.io and CXL.cachemem errors in Root Ports may indeed
> be signaled to an RCEC. Do you expect that implementation note to cause
> any issues on platforms that do not follow that CXL spec behavior?
> 
> My expectation is that it may just cause extra polling for errors, but
> not cause any harm.

AMD platforms in RCH/RCD mode consume protocol errors in the RCEC's AER driver. AMD 
platforms in VH mode consume protocol errors (including root port errors) in the 
root port's AER driver. The exception is the VH mode host with CXL1.1 endpoint and 
RCH downstream errors. CXL1.1 endpoint and RCH downstream errors in a VH host are 
consumed in the RCEC.

I don't believe these patchset changes would affect this behavior. But, I will need 
to test to confirm.

Regards,
Terry

Dan Williams April 22, 2024, 11:03 p.m. UTC | #5

Terry Bowman wrote:
[..]
> > Hi Terry,
> > 
> > This patchset is responding to the implications of the implementation
> > note in 9.18.1.5 RCEC Downstream Port Association Structure (RDPAS).
> > That says that CXL.io and CXL.cachemem errors in Root Ports may indeed
> > be signaled to an RCEC. Do you expect that implementation note to cause
> > any issues on platforms that do not follow that CXL spec behavior?
> > 
> > My expectation is that it may just cause extra polling for errors, but
> > not cause any harm.
> 
> AMD platforms in RCH/RCD mode consume protocol errors in the RCEC's AER driver. AMD 
> platforms in VH mode consume protocol errors (including root port errors) in the 
> root port's AER driver. The exception is the VH mode host with CXL1.1 endpoint and 
> RCH downstream errors. CXL1.1 endpoint and RCH downstream errors in a VH host are 
> consumed in the RCEC.

I agree that's the most compatible path for existing software.

> I don't believe these patchset changes would affect this behavior. But, I will need 
> to test to confirm.

As I wrote to Li Ming, I think any potential conflict can further be
limited by the fact that this extra scanning is limited to CXL.cachemem,
not typical PCI AER flows.

Li, Ming4 April 23, 2024, 2:33 a.m. UTC | #6

On 4/23/2024 7:03 AM, Dan Williams wrote:
> Terry Bowman wrote:
> [..]
>>> Hi Terry,
>>>
>>> This patchset is responding to the implications of the implementation
>>> note in 9.18.1.5 RCEC Downstream Port Association Structure (RDPAS).
>>> That says that CXL.io and CXL.cachemem errors in Root Ports may indeed
>>> be signaled to an RCEC. Do you expect that implementation note to cause
>>> any issues on platforms that do not follow that CXL spec behavior?
>>>
>>> My expectation is that it may just cause extra polling for errors, but
>>> not cause any harm.
>>
>> AMD platforms in RCH/RCD mode consume protocol errors in the RCEC's AER driver. AMD 
>> platforms in VH mode consume protocol errors (including root port errors) in the 
>> root port's AER driver. The exception is the VH mode host with CXL1.1 endpoint and 
>> RCH downstream errors. CXL1.1 endpoint and RCH downstream errors in a VH host are 
>> consumed in the RCEC.
> 
> I agree that's the most compatible path for existing software.
> 
>> I don't believe these patchset changes would affect this behavior. But, I will need 
>> to test to confirm.
> 
> As I wrote to Li Ming, I think any potential conflict can further be
> limited by the fact that this extra scanning is limited to CXL.cachemem,
> not typical PCI AER flows.

Agree with Dan, but I think that software does not have a chance to know if the error is a CXL.cachemem error withour RDPAS(only knows it is a uncor_internal_error/cor_internal_error reported by RCEC), maybe we can limit this extra scanning for the RPs working on CXL mode?

[RFC,1/6] PCI/RCEC: Introduce pcie_walk_rcec_all()

Commit Message

Comments

Patch