diff mbox series

[RFC,1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers

Message ID 20240617200411.1426554-2-terry.bowman@amd.com (mailing list archive)
State New
Delegated to: Bjorn Helgaas
Headers show
Series [RFC,1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers | expand

Commit Message

Terry Bowman June 17, 2024, 8:04 p.m. UTC
The AER service driver does not currently call a handler for AER
uncorrectable errors (UCE) detected in root ports or downstream
ports. This is not needed in most cases because common PCIe port
functionality is handled by portdrv service drivers.

CXL root ports include CXL specific RAS registers that need logging
before starting do_recovery() in the UCE case.

Update the AER service driver to call the UCE handler for root ports
and downstream ports. These PCIe port devices are bound to the portdrv
driver that includes a CE and UCE handler to be called.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
---
 drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

Comments

Jonathan Cameron June 20, 2024, 11:21 a.m. UTC | #1
On Mon, 17 Jun 2024 15:04:03 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> The AER service driver does not currently call a handler for AER
> uncorrectable errors (UCE) detected in root ports or downstream
> ports. This is not needed in most cases because common PCIe port
> functionality is handled by portdrv service drivers.
> 
> CXL root ports include CXL specific RAS registers that need logging
> before starting do_recovery() in the UCE case.
> 
> Update the AER service driver to call the UCE handler for root ports
> and downstream ports. These PCIe port devices are bound to the portdrv
> driver that includes a CE and UCE handler to be called.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-pci@vger.kernel.org
> ---
>  drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 705893b5f7b0..a4db474b2be5 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>  	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>  	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>  
> +	/*
> +	 * PCIe ports may include functionality beyond the standard
> +	 * extended port capabilities. This may present a need to log and
> +	 * handle errors not addressed in this driver. Examples are CXL
> +	 * root ports and CXL downstream switch ports using AER UIE to
> +	 * indicate CXL UCE RAS protocol errors.
> +	 */
> +	if (type == PCI_EXP_TYPE_ROOT_PORT ||
> +	    type == PCI_EXP_TYPE_DOWNSTREAM) {
> +		struct pci_driver *pdrv = dev->driver;
> +
> +		if (pdrv && pdrv->err_handler &&
> +		    pdrv->err_handler->error_detected) {
> +			const struct pci_error_handlers *err_handler;
> +
> +			err_handler = pdrv->err_handler;
> +			status = err_handler->error_detected(dev, state);

This status is going to get overridden by one of the pci_walk_bridge()
calls.  Should it be kept around and acted on, or dropped silently?
(I'd guess no for silent!).

> +		}
> +	}
> +
>  	/*
>  	 * If the error was detected by a Root Port, Downstream Port, RCEC,
>  	 * or RCiEP, recovery runs on the device itself.  For Ports, that
Dan Williams June 21, 2024, 7:17 p.m. UTC | #2
Terry Bowman wrote:
> The AER service driver does not currently call a handler for AER
> uncorrectable errors (UCE) detected in root ports or downstream
> ports. This is not needed in most cases because common PCIe port
> functionality is handled by portdrv service drivers.
> 
> CXL root ports include CXL specific RAS registers that need logging
> before starting do_recovery() in the UCE case.
> 
> Update the AER service driver to call the UCE handler for root ports
> and downstream ports. These PCIe port devices are bound to the portdrv
> driver that includes a CE and UCE handler to be called.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-pci@vger.kernel.org
> ---
>  drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 705893b5f7b0..a4db474b2be5 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>  	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>  	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>  
> +	/*
> +	 * PCIe ports may include functionality beyond the standard
> +	 * extended port capabilities. This may present a need to log and
> +	 * handle errors not addressed in this driver. Examples are CXL
> +	 * root ports and CXL downstream switch ports using AER UIE to
> +	 * indicate CXL UCE RAS protocol errors.
> +	 */
> +	if (type == PCI_EXP_TYPE_ROOT_PORT ||
> +	    type == PCI_EXP_TYPE_DOWNSTREAM) {
> +		struct pci_driver *pdrv = dev->driver;
> +
> +		if (pdrv && pdrv->err_handler &&
> +		    pdrv->err_handler->error_detected) {
> +			const struct pci_error_handlers *err_handler;
> +
> +			err_handler = pdrv->err_handler;
> +			status = err_handler->error_detected(dev, state);
> +		}
> +	}
> +

Would not a more appropriate place for this be pci_walk_bridge() where
the ->subordinate == NULL and these type-check cases are unified?
Terry Bowman June 24, 2024, 2:58 p.m. UTC | #3
Hi Jonathan,
I added a response below.

On 6/20/24 06:21, Jonathan Cameron wrote:
> On Mon, 17 Jun 2024 15:04:03 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
> 
>> The AER service driver does not currently call a handler for AER
>> uncorrectable errors (UCE) detected in root ports or downstream
>> ports. This is not needed in most cases because common PCIe port
>> functionality is handled by portdrv service drivers.
>>
>> CXL root ports include CXL specific RAS registers that need logging
>> before starting do_recovery() in the UCE case.
>>
>> Update the AER service driver to call the UCE handler for root ports
>> and downstream ports. These PCIe port devices are bound to the portdrv
>> driver that includes a CE and UCE handler to be called.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>> Cc: linux-pci@vger.kernel.org
>> ---
>>  drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
>>  1 file changed, 20 insertions(+)
>>
>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>> index 705893b5f7b0..a4db474b2be5 100644
>> --- a/drivers/pci/pcie/err.c
>> +++ b/drivers/pci/pcie/err.c
>> @@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>  	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>>  	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>>  
>> +	/*
>> +	 * PCIe ports may include functionality beyond the standard
>> +	 * extended port capabilities. This may present a need to log and
>> +	 * handle errors not addressed in this driver. Examples are CXL
>> +	 * root ports and CXL downstream switch ports using AER UIE to
>> +	 * indicate CXL UCE RAS protocol errors.
>> +	 */
>> +	if (type == PCI_EXP_TYPE_ROOT_PORT ||
>> +	    type == PCI_EXP_TYPE_DOWNSTREAM) {
>> +		struct pci_driver *pdrv = dev->driver;
>> +
>> +		if (pdrv && pdrv->err_handler &&
>> +		    pdrv->err_handler->error_detected) {
>> +			const struct pci_error_handlers *err_handler;
>> +
>> +			err_handler = pdrv->err_handler;
>> +			status = err_handler->error_detected(dev, state);
> 
> This status is going to get overridden by one of the pci_walk_bridge()
> calls.  Should it be kept around and acted on, or dropped silently?
> (I'd guess no for silent!).
> 

It should be used later.

According to PCI spec "The only method of recovering from an Uncorrectable
Internal Error is reset or hardware replacement."

I need to make certain that carries through below.

Regards,
Terry

>> +		}
>> +	}
>> +
>>  	/*
>>  	 * If the error was detected by a Root Port, Downstream Port, RCEC,
>>  	 * or RCiEP, recovery runs on the device itself.  For Ports, that
>
Terry Bowman June 24, 2024, 5:56 p.m. UTC | #4
Hi Dan,

I added a response below.

On 6/21/24 14:17, Dan Williams wrote:
> Terry Bowman wrote:
>> The AER service driver does not currently call a handler for AER
>> uncorrectable errors (UCE) detected in root ports or downstream
>> ports. This is not needed in most cases because common PCIe port
>> functionality is handled by portdrv service drivers.
>>
>> CXL root ports include CXL specific RAS registers that need logging
>> before starting do_recovery() in the UCE case.
>>
>> Update the AER service driver to call the UCE handler for root ports
>> and downstream ports. These PCIe port devices are bound to the portdrv
>> driver that includes a CE and UCE handler to be called.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>> Cc: linux-pci@vger.kernel.org
>> ---
>>  drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
>>  1 file changed, 20 insertions(+)
>>
>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>> index 705893b5f7b0..a4db474b2be5 100644
>> --- a/drivers/pci/pcie/err.c
>> +++ b/drivers/pci/pcie/err.c
>> @@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>  	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>>  	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>>  
>> +	/*
>> +	 * PCIe ports may include functionality beyond the standard
>> +	 * extended port capabilities. This may present a need to log and
>> +	 * handle errors not addressed in this driver. Examples are CXL
>> +	 * root ports and CXL downstream switch ports using AER UIE to
>> +	 * indicate CXL UCE RAS protocol errors.
>> +	 */
>> +	if (type == PCI_EXP_TYPE_ROOT_PORT ||
>> +	    type == PCI_EXP_TYPE_DOWNSTREAM) {
>> +		struct pci_driver *pdrv = dev->driver;
>> +
>> +		if (pdrv && pdrv->err_handler &&
>> +		    pdrv->err_handler->error_detected) {
>> +			const struct pci_error_handlers *err_handler;
>> +
>> +			err_handler = pdrv->err_handler;
>> +			status = err_handler->error_detected(dev, state);
>> +		}
>> +	}
>> +
> 
> Would not a more appropriate place for this be pci_walk_bridge() where
> the ->subordinate == NULL and these type-check cases are unified?

It does. I can take a look at moving that.

Regards,
Terry
Fan Ni July 10, 2024, 8:48 p.m. UTC | #5
On Mon, Jun 24, 2024 at 12:56:29PM -0500, Terry Bowman wrote:
> Hi Dan,
> 
> I added a response below.
> 
> On 6/21/24 14:17, Dan Williams wrote:
> > Terry Bowman wrote:
> >> The AER service driver does not currently call a handler for AER
> >> uncorrectable errors (UCE) detected in root ports or downstream
> >> ports. This is not needed in most cases because common PCIe port
> >> functionality is handled by portdrv service drivers.
> >>
> >> CXL root ports include CXL specific RAS registers that need logging
> >> before starting do_recovery() in the UCE case.
> >>
> >> Update the AER service driver to call the UCE handler for root ports
> >> and downstream ports. These PCIe port devices are bound to the portdrv
> >> driver that includes a CE and UCE handler to be called.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >> Cc: Bjorn Helgaas <bhelgaas@google.com>
> >> Cc: linux-pci@vger.kernel.org
> >> ---
> >>  drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
> >>  1 file changed, 20 insertions(+)
> >>
> >> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> >> index 705893b5f7b0..a4db474b2be5 100644
> >> --- a/drivers/pci/pcie/err.c
> >> +++ b/drivers/pci/pcie/err.c
> >> @@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> >>  	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
> >>  	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
> >>  
> >> +	/*
> >> +	 * PCIe ports may include functionality beyond the standard
> >> +	 * extended port capabilities. This may present a need to log and
> >> +	 * handle errors not addressed in this driver. Examples are CXL
> >> +	 * root ports and CXL downstream switch ports using AER UIE to
> >> +	 * indicate CXL UCE RAS protocol errors.
> >> +	 */
> >> +	if (type == PCI_EXP_TYPE_ROOT_PORT ||
> >> +	    type == PCI_EXP_TYPE_DOWNSTREAM) {
> >> +		struct pci_driver *pdrv = dev->driver;
> >> +
> >> +		if (pdrv && pdrv->err_handler &&
> >> +		    pdrv->err_handler->error_detected) {
> >> +			const struct pci_error_handlers *err_handler;
> >> +
> >> +			err_handler = pdrv->err_handler;
> >> +			status = err_handler->error_detected(dev, state);
> >> +		}
> >> +	}
> >> +
> > 
> > Would not a more appropriate place for this be pci_walk_bridge() where
> > the ->subordinate == NULL and these type-check cases are unified?
> 
> It does. I can take a look at moving that.

Has that already been handled in pci_walk_bridge?

The function pci_walk_bridge() will call report_error_detected, where
the err handler will be called. 
https://elixir.bootlin.com/linux/v6.10-rc6/source/drivers/pci/pcie/err.c#L80

Fan

> 
> Regards,
> Terry
Terry Bowman July 10, 2024, 9:48 p.m. UTC | #6
Hi Fan,

On 7/10/24 15:48, nifan.cxl@gmail.com wrote:
> On Mon, Jun 24, 2024 at 12:56:29PM -0500, Terry Bowman wrote:
>> Hi Dan,
>>
>> I added a response below.
>>
>> On 6/21/24 14:17, Dan Williams wrote:
>>> Terry Bowman wrote:
>>>> The AER service driver does not currently call a handler for AER
>>>> uncorrectable errors (UCE) detected in root ports or downstream
>>>> ports. This is not needed in most cases because common PCIe port
>>>> functionality is handled by portdrv service drivers.
>>>>
>>>> CXL root ports include CXL specific RAS registers that need logging
>>>> before starting do_recovery() in the UCE case.
>>>>
>>>> Update the AER service driver to call the UCE handler for root ports
>>>> and downstream ports. These PCIe port devices are bound to the portdrv
>>>> driver that includes a CE and UCE handler to be called.
>>>>
>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>>>> Cc: linux-pci@vger.kernel.org
>>>> ---
>>>>  drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
>>>>  1 file changed, 20 insertions(+)
>>>>
>>>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>>>> index 705893b5f7b0..a4db474b2be5 100644
>>>> --- a/drivers/pci/pcie/err.c
>>>> +++ b/drivers/pci/pcie/err.c
>>>> @@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>>>  	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>>>>  	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>>>>  
>>>> +	/*
>>>> +	 * PCIe ports may include functionality beyond the standard
>>>> +	 * extended port capabilities. This may present a need to log and
>>>> +	 * handle errors not addressed in this driver. Examples are CXL
>>>> +	 * root ports and CXL downstream switch ports using AER UIE to
>>>> +	 * indicate CXL UCE RAS protocol errors.
>>>> +	 */
>>>> +	if (type == PCI_EXP_TYPE_ROOT_PORT ||
>>>> +	    type == PCI_EXP_TYPE_DOWNSTREAM) {
>>>> +		struct pci_driver *pdrv = dev->driver;
>>>> +
>>>> +		if (pdrv && pdrv->err_handler &&
>>>> +		    pdrv->err_handler->error_detected) {
>>>> +			const struct pci_error_handlers *err_handler;
>>>> +
>>>> +			err_handler = pdrv->err_handler;
>>>> +			status = err_handler->error_detected(dev, state);
>>>> +		}
>>>> +	}
>>>> +
>>>
>>> Would not a more appropriate place for this be pci_walk_bridge() where
>>> the ->subordinate == NULL and these type-check cases are unified?
>>
>> It does. I can take a look at moving that.
> 
> Has that already been handled in pci_walk_bridge?
> 
> The function pci_walk_bridge() will call report_error_detected, where
> the err handler will be called. 
> https://elixir.bootlin.com/linux/v6.10-rc6/source/drivers/pci/pcie/err.c#L80
> 
> Fan
> 

You would think so but the UCE handler was not called in my testing for the PCIe 
ports (RP,USP,DSP). The pci_walk_bridge() function has 2 cases:
- If there is a subordinate/secondary bus then the callback is called for
those downstream devices but not the port itself.
- If there is no subordinate/secondary bus then the callback is invoked for the 
port itself.

The function header comment may explain it better:
/**                                                                                                                                                                                                                
 * pci_walk_bridge - walk bridges potentially AER affected                                                                                                                                                         
 * @bridge:     bridge which may be a Port, an RCEC, or an RCiEP                                                                                                                                                   
 * @cb:         callback to be called for each device found                                                                                                                                                        
 * @userdata:   arbitrary pointer to be passed to callback                                                                                                                                                         
 *                                                             
 * If the device provided is a bridge, walk the subordinate bus, including                                                                                                                                         
 * any bridged devices on buses under this bus.  Call the provided callback                                                                                                                                        
 * on each device found.                                                                                                                                                                                           
 *                                                                                                                                                                                                                 
 * If the device provided has no subordinate bus, e.g., an RCEC or RCiEP,                                                                                                                                          
 * call the callback on the device itself. 
 */

Regards,
Terry

>>
>> Regards,
>> Terry
Fan Ni July 11, 2024, 1:14 a.m. UTC | #7
On Wed, Jul 10, 2024 at 04:48:09PM -0500, Terry Bowman wrote:
> Hi Fan,
> 
> On 7/10/24 15:48, nifan.cxl@gmail.com wrote:
> > On Mon, Jun 24, 2024 at 12:56:29PM -0500, Terry Bowman wrote:
> >> Hi Dan,
> >>
> >> I added a response below.
> >>
> >> On 6/21/24 14:17, Dan Williams wrote:
> >>> Terry Bowman wrote:
> >>>> The AER service driver does not currently call a handler for AER
> >>>> uncorrectable errors (UCE) detected in root ports or downstream
> >>>> ports. This is not needed in most cases because common PCIe port
> >>>> functionality is handled by portdrv service drivers.
> >>>>
> >>>> CXL root ports include CXL specific RAS registers that need logging
> >>>> before starting do_recovery() in the UCE case.
> >>>>
> >>>> Update the AER service driver to call the UCE handler for root ports
> >>>> and downstream ports. These PCIe port devices are bound to the portdrv
> >>>> driver that includes a CE and UCE handler to be called.
> >>>>
> >>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >>>> Cc: Bjorn Helgaas <bhelgaas@google.com>
> >>>> Cc: linux-pci@vger.kernel.org
> >>>> ---
> >>>>  drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
> >>>>  1 file changed, 20 insertions(+)
> >>>>
> >>>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> >>>> index 705893b5f7b0..a4db474b2be5 100644
> >>>> --- a/drivers/pci/pcie/err.c
> >>>> +++ b/drivers/pci/pcie/err.c
> >>>> @@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> >>>>  	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
> >>>>  	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
> >>>>  
> >>>> +	/*
> >>>> +	 * PCIe ports may include functionality beyond the standard
> >>>> +	 * extended port capabilities. This may present a need to log and
> >>>> +	 * handle errors not addressed in this driver. Examples are CXL
> >>>> +	 * root ports and CXL downstream switch ports using AER UIE to
> >>>> +	 * indicate CXL UCE RAS protocol errors.
> >>>> +	 */
> >>>> +	if (type == PCI_EXP_TYPE_ROOT_PORT ||
> >>>> +	    type == PCI_EXP_TYPE_DOWNSTREAM) {
> >>>> +		struct pci_driver *pdrv = dev->driver;
> >>>> +
> >>>> +		if (pdrv && pdrv->err_handler &&
> >>>> +		    pdrv->err_handler->error_detected) {
> >>>> +			const struct pci_error_handlers *err_handler;
> >>>> +
> >>>> +			err_handler = pdrv->err_handler;
> >>>> +			status = err_handler->error_detected(dev, state);
> >>>> +		}
> >>>> +	}
> >>>> +
> >>>
> >>> Would not a more appropriate place for this be pci_walk_bridge() where
> >>> the ->subordinate == NULL and these type-check cases are unified?
> >>
> >> It does. I can take a look at moving that.
> > 
> > Has that already been handled in pci_walk_bridge?
> > 
> > The function pci_walk_bridge() will call report_error_detected, where
> > the err handler will be called. 
> > https://elixir.bootlin.com/linux/v6.10-rc6/source/drivers/pci/pcie/err.c#L80
> > 
> > Fan
> > 
> 
> You would think so but the UCE handler was not called in my testing for the PCIe 
> ports (RP,USP,DSP). The pci_walk_bridge() function has 2 cases:
> - If there is a subordinate/secondary bus then the callback is called for
> those downstream devices but not the port itself.
> - If there is no subordinate/secondary bus then the callback is invoked for the 
> port itself.
> 
> The function header comment may explain it better:
> /**                                                                                                                                                                                                                
>  * pci_walk_bridge - walk bridges potentially AER affected                                                                                                                                                         
>  * @bridge:     bridge which may be a Port, an RCEC, or an RCiEP                                                                                                                                                   
>  * @cb:         callback to be called for each device found                                                                                                                                                        
>  * @userdata:   arbitrary pointer to be passed to callback                                                                                                                                                         
>  *                                                             
>  * If the device provided is a bridge, walk the subordinate bus, including                                                                                                                                         
>  * any bridged devices on buses under this bus.  Call the provided callback                                                                                                                                        
>  * on each device found.                                                                                                                                                                                           
>  *                                                                                                                                                                                                                 
>  * If the device provided has no subordinate bus, e.g., an RCEC or RCiEP,                                                                                                                                          
>  * call the callback on the device itself. 
>  */
> 
> Regards,
> Terry

OK, interesting.
Btw, what is the "state" passed to pcie_do_recovery(...state...)?

Fan

> 
> >>
> >> Regards,
> >> Terry
Fan Ni Aug. 19, 2024, 6:35 p.m. UTC | #8
On Mon, Jun 24, 2024 at 12:56:29PM -0500, Terry Bowman wrote:
> Hi Dan,
> 
> I added a response below.
> 
> On 6/21/24 14:17, Dan Williams wrote:
> > Terry Bowman wrote:
> >> The AER service driver does not currently call a handler for AER
> >> uncorrectable errors (UCE) detected in root ports or downstream
> >> ports. This is not needed in most cases because common PCIe port
> >> functionality is handled by portdrv service drivers.
> >>
> >> CXL root ports include CXL specific RAS registers that need logging
> >> before starting do_recovery() in the UCE case.
> >>
> >> Update the AER service driver to call the UCE handler for root ports
> >> and downstream ports. These PCIe port devices are bound to the portdrv
> >> driver that includes a CE and UCE handler to be called.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >> Cc: Bjorn Helgaas <bhelgaas@google.com>
> >> Cc: linux-pci@vger.kernel.org
> >> ---
> >>  drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
> >>  1 file changed, 20 insertions(+)
> >>
> >> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> >> index 705893b5f7b0..a4db474b2be5 100644
> >> --- a/drivers/pci/pcie/err.c
> >> +++ b/drivers/pci/pcie/err.c
> >> @@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> >>  	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
> >>  	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
> >>  
> >> +	/*
> >> +	 * PCIe ports may include functionality beyond the standard
> >> +	 * extended port capabilities. This may present a need to log and
> >> +	 * handle errors not addressed in this driver. Examples are CXL
> >> +	 * root ports and CXL downstream switch ports using AER UIE to
> >> +	 * indicate CXL UCE RAS protocol errors.
> >> +	 */
> >> +	if (type == PCI_EXP_TYPE_ROOT_PORT ||
> >> +	    type == PCI_EXP_TYPE_DOWNSTREAM) {
> >> +		struct pci_driver *pdrv = dev->driver;
> >> +
> >> +		if (pdrv && pdrv->err_handler &&
> >> +		    pdrv->err_handler->error_detected) {
> >> +			const struct pci_error_handlers *err_handler;
> >> +
> >> +			err_handler = pdrv->err_handler;
> >> +			status = err_handler->error_detected(dev, state);
> >> +		}
> >> +	}
> >> +
> > 
> > Would not a more appropriate place for this be pci_walk_bridge() where
> > the ->subordinate == NULL and these type-check cases are unified?
> 
> It does. I can take a look at moving that.
> 

Based on current code logic, the code added here will be executed as
long as the type matches (downstream port or root port), and I also
noticed the case ->subordinate == NULL never gets touched when I try to
inject an error through the aer_inject module and the user space tool. 
If my way to do error injection is right, it means the behaviour will
get changed after the code move.

Here is some of my experimental setup:

QEMU +  cxl topology (one type3 memdev directly attached to a HB with a
single root port).

1. Load the cxl related drivers before error injection

2. Do aer inject with aer_inject inside the QEMU VM

# aer_inject ~/nonfatal

aer inject input file looks like below
-----------------------------------------------------
fan:~/cxl/linux-fixes$ cat ~/nonfatal 
# Inject an uncorrectable/non-fatal training error into the device
# with header log words 0 1 2 3.
#
# Either specify the PCI id on the command-line option or uncomment and edit
# the PCI_ID line below using the correct PCI ID.
#
# Note that system firmware/BIOS may mask certain errors, change their severity
# and/or not report header log words.
#
AER
PCI_ID 0000:0c:00.0
UNCOR_STATUS COMP_ABORT
HEADER_LOG 0 1 2 3
-----------------------------------------------------

The "lspci" output on the VM looks like below
----------------------------------------------------
Qemu: execute "lspci" on VM
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:02.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
00:03.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
00:04.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
00:05.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
0c:00.0 PCI bridge: Intel Corporation Device 7075
0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
--------------------------------------------------

Fan


> Regards,
> Terry
diff mbox series

Patch

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 705893b5f7b0..a4db474b2be5 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -203,6 +203,26 @@  pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
 	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
 
+	/*
+	 * PCIe ports may include functionality beyond the standard
+	 * extended port capabilities. This may present a need to log and
+	 * handle errors not addressed in this driver. Examples are CXL
+	 * root ports and CXL downstream switch ports using AER UIE to
+	 * indicate CXL UCE RAS protocol errors.
+	 */
+	if (type == PCI_EXP_TYPE_ROOT_PORT ||
+	    type == PCI_EXP_TYPE_DOWNSTREAM) {
+		struct pci_driver *pdrv = dev->driver;
+
+		if (pdrv && pdrv->err_handler &&
+		    pdrv->err_handler->error_detected) {
+			const struct pci_error_handlers *err_handler;
+
+			err_handler = pdrv->err_handler;
+			status = err_handler->error_detected(dev, state);
+		}
+	}
+
 	/*
 	 * If the error was detected by a Root Port, Downstream Port, RCEC,
 	 * or RCiEP, recovery runs on the device itself.  For Ports, that