diff mbox series

[RFC,9/9] cxl/pci: Enable interrupts for CXL PCIe ports' AER internal errors

Message ID 20240617200411.1426554-10-terry.bowman@amd.com
State New, archived
Headers show
Series Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports | expand

Commit Message

Bowman, Terry June 17, 2024, 8:04 p.m. UTC
CXL RAS errors are reported through AER interrupts using the AER status:
correctbale internal errors (CIE) and AER uncorrectable internal errors
(UIE).[1] But, the AER CIE/UIE are disabled by default preventing
notification of CXL RAS errors.[2]

Enable CXL PCIe port RAS notification by unmasking the ports' AER CIE
and UIE errors.

[1] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and Upstream
             Switch Ports
[2] PCI6.0 - 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h),
             7.8.4.6 Correctable Error Mask Register (Offset 14h)

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/pci.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

Comments

Jonathan Cameron June 20, 2024, 1:15 p.m. UTC | #1
On Mon, 17 Jun 2024 15:04:11 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> CXL RAS errors are reported through AER interrupts using the AER status:
> correctbale internal errors (CIE) and AER uncorrectable internal errors

correctable

> (UIE).[1] But, the AER CIE/UIE are disabled by default preventing
> notification of CXL RAS errors.[2]
> 
> Enable CXL PCIe port RAS notification by unmasking the ports' AER CIE
> and UIE errors.
> 
> [1] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and Upstream
>              Switch Ports
> [2] PCI6.0 - 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h),
>              7.8.4.6 Correctable Error Mask Register (Offset 14h)
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

I'm not sure doing this from a driver other than the one handling the
errors makes sense.  It is doing a couple of RMW without any locking
or guarantees that the driver bound to the PCI port might care about
this changing.

I'd like more info on why we don't just turn this on in general
and hence avoid the need to control it from the 'wrong' place.

Jonathan



> ---
>  drivers/cxl/core/pci.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index e630eccb733d..73637d39df0a 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -861,6 +861,12 @@ void cxl_setup_parent_uport(struct device *host, struct cxl_port *port)
>  	struct device *uport_dev = port->uport_dev;
>  
>  	cxl_port_map_regs(uport_dev, map, regs);
> +
> +	if (dev_is_pci(uport_dev)) {
> +		struct pci_dev *pdev = to_pci_dev(uport_dev);
> +
> +		pci_aer_unmask_internal_errors(pdev);

I'd skip the local variable for conciseness.

> +	}
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_uport, CXL);
>  
> @@ -878,6 +884,12 @@ void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport)
>  
>  	if (dport->rch)
>  		cxl_disable_rch_root_ints(dport);
> +
> +	if (dev_is_pci(dport_dev)) {
> +		struct pci_dev *pdev = to_pci_dev(dport_dev);
> +
> +		pci_aer_unmask_internal_errors(pdev);

likewise.

> +	}
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_dport, CXL);
>
Bowman, Terry June 24, 2024, 4:46 p.m. UTC | #2
Hi Jonathan,

I added responses inline below.

On 6/20/24 08:15, Jonathan Cameron wrote:
> On Mon, 17 Jun 2024 15:04:11 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
> 
>> CXL RAS errors are reported through AER interrupts using the AER status:
>> correctbale internal errors (CIE) and AER uncorrectable internal errors
> 
> correctable
> 

Thanks.

>> (UIE).[1] But, the AER CIE/UIE are disabled by default preventing
>> notification of CXL RAS errors.[2]
>>
>> Enable CXL PCIe port RAS notification by unmasking the ports' AER CIE
>> and UIE errors.
>>
>> [1] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and Upstream
>>              Switch Ports
>> [2] PCI6.0 - 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h),
>>              7.8.4.6 Correctable Error Mask Register (Offset 14h)
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> 
> I'm not sure doing this from a driver other than the one handling the
> errors makes sense.  It is doing a couple of RMW without any locking
> or guarantees that the driver bound to the PCI port might care about
> this changing.
> 

I think this could fit into the helper function mentioned in our earlier 
discussion. When the portdrv's notifier enabler is called it could also
enable the UIE/CIE.

> I'd like more info on why we don't just turn this on in general
> and hence avoid the need to control it from the 'wrong' place.
> 
> Jonathan
> 

I was trying to enable only where needed given the one case is not a 
pattern, yet. At this point it is only for CXL RCH downstream port 
and CXL VH ports (portdrv).

Would you like for the UIE/CIE unmask added to the AER driver init ?

> 
> 
>> ---
>>  drivers/cxl/core/pci.c | 12 ++++++++++++
>>  1 file changed, 12 insertions(+)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index e630eccb733d..73637d39df0a 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -861,6 +861,12 @@ void cxl_setup_parent_uport(struct device *host, struct cxl_port *port)
>>  	struct device *uport_dev = port->uport_dev;
>>  
>>  	cxl_port_map_regs(uport_dev, map, regs);
>> +
>> +	if (dev_is_pci(uport_dev)) {
>> +		struct pci_dev *pdev = to_pci_dev(uport_dev);
>> +
>> +		pci_aer_unmask_internal_errors(pdev);
> 
> I'd skip the local variable for conciseness.
> 
>> +	}
>>  }
>>  EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_uport, CXL);
>>  
>> @@ -878,6 +884,12 @@ void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport)
>>  
>>  	if (dport->rch)
>>  		cxl_disable_rch_root_ints(dport);
>> +
>> +	if (dev_is_pci(dport_dev)) {
>> +		struct pci_dev *pdev = to_pci_dev(dport_dev);
>> +
>> +		pci_aer_unmask_internal_errors(pdev);
> 
> likewise.
> 

Got it.

Regards,
Terry

>> +	}
>>  }
>>  EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_dport, CXL);
>>  
>
Jonathan Cameron July 2, 2024, 4 p.m. UTC | #3
On Mon, 24 Jun 2024 11:46:01 -0500
Terry Bowman <Terry.Bowman@amd.com> wrote:

> Hi Jonathan,
> 
> I added responses inline below.
> 
> On 6/20/24 08:15, Jonathan Cameron wrote:
> > On Mon, 17 Jun 2024 15:04:11 -0500
> > Terry Bowman <terry.bowman@amd.com> wrote:
> >   
> >> CXL RAS errors are reported through AER interrupts using the AER status:
> >> correctbale internal errors (CIE) and AER uncorrectable internal errors  
> > 
> > correctable
> >   
> 
> Thanks.
> 
> >> (UIE).[1] But, the AER CIE/UIE are disabled by default preventing
> >> notification of CXL RAS errors.[2]
> >>
> >> Enable CXL PCIe port RAS notification by unmasking the ports' AER CIE
> >> and UIE errors.
> >>
> >> [1] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and Upstream
> >>              Switch Ports
> >> [2] PCI6.0 - 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h),
> >>              7.8.4.6 Correctable Error Mask Register (Offset 14h)
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>  
> > 
> > I'm not sure doing this from a driver other than the one handling the
> > errors makes sense.  It is doing a couple of RMW without any locking
> > or guarantees that the driver bound to the PCI port might care about
> > this changing.
> >   
> 
> I think this could fit into the helper function mentioned in our earlier 
> discussion. When the portdrv's notifier enabler is called it could also
> enable the UIE/CIE.
> 
> > I'd like more info on why we don't just turn this on in general
> > and hence avoid the need to control it from the 'wrong' place.
> > 
> > Jonathan
> >   
> 
> I was trying to enable only where needed given the one case is not a 
> pattern, yet. At this point it is only for CXL RCH downstream port 
> and CXL VH ports (portdrv).
> 
> Would you like for the UIE/CIE unmask added to the AER driver init ?

If we can get away with it, yes!
diff mbox series

Patch

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index e630eccb733d..73637d39df0a 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -861,6 +861,12 @@  void cxl_setup_parent_uport(struct device *host, struct cxl_port *port)
 	struct device *uport_dev = port->uport_dev;
 
 	cxl_port_map_regs(uport_dev, map, regs);
+
+	if (dev_is_pci(uport_dev)) {
+		struct pci_dev *pdev = to_pci_dev(uport_dev);
+
+		pci_aer_unmask_internal_errors(pdev);
+	}
 }
 EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_uport, CXL);
 
@@ -878,6 +884,12 @@  void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport)
 
 	if (dport->rch)
 		cxl_disable_rch_root_ints(dport);
+
+	if (dev_is_pci(dport_dev)) {
+		struct pci_dev *pdev = to_pci_dev(dport_dev);
+
+		pci_aer_unmask_internal_errors(pdev);
+	}
 }
 EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_dport, CXL);