diff mbox series

[RFC,2/9] PCI/AER: Call AER CE handler before clearing AER CE status register

Message ID 20240617200411.1426554-3-terry.bowman@amd.com
State New, archived
Headers show
Series Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports | expand

Commit Message

Bowman, Terry June 17, 2024, 8:04 p.m. UTC
The AER service driver clears the AER correctable error (CE) status before
calling the correctable error handler. This results in the error's status
not correctly reflected if read from the CE handler.

The AER CE status is needed by the portdrv's CE handler. The portdrv's
CE handler is intended to only call the registered notifier callbacks
if the CE error status has correctable internal error (CIE) set.

This is not a problem for AER uncorrrectbale errors (UCE). The UCE status
is still present in the AER capability and available for reading, if
needed, when the UCE handler is called.

Change the order of clearing the CE status and calling the CE handler.
Make it to call the CE handler first and then clear the CE status
after returning.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
---
 drivers/pci/pcie/aer.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Comments

Jonathan Cameron June 20, 2024, 11:31 a.m. UTC | #1
On Mon, 17 Jun 2024 15:04:04 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> The AER service driver clears the AER correctable error (CE) status before
> calling the correctable error handler. This results in the error's status
> not correctly reflected if read from the CE handler.
> 
> The AER CE status is needed by the portdrv's CE handler. The portdrv's
> CE handler is intended to only call the registered notifier callbacks
> if the CE error status has correctable internal error (CIE) set.
> 
> This is not a problem for AER uncorrrectbale errors (UCE). The UCE status

uncorrectable

> is still present in the AER capability and available for reading, if
> needed, when the UCE handler is called.

I'm seeing the clear in the DPC path for UCE. For other cases is
it a side effect of the reset?

> 
> Change the order of clearing the CE status and calling the CE handler.
> Make it to call the CE handler first and then clear the CE status
> after returning.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-pci@vger.kernel.org
Seems reasonable, but many gremlins around the ordering in these
flows, so I'm to particularly confident. With that in mind.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huwei.com>

> ---
>  drivers/pci/pcie/aer.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index ac6293c24976..4dc03cb9aff0 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1094,9 +1094,6 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  		 * Correctable error does not need software intervention.
>  		 * No need to go through error recovery process.
>  		 */
> -		if (aer)
> -			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
> -					info->status);
>  		if (pcie_aer_is_native(dev)) {
>  			struct pci_driver *pdrv = dev->driver;
>  
> @@ -1105,6 +1102,10 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  				pdrv->err_handler->cor_error_detected(dev);
>  			pcie_clear_device_status(dev);
>  		}
> +		if (aer)
> +			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
> +					info->status);
> +
>  	} else if (info->severity == AER_NONFATAL)
>  		pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset);
>  	else if (info->severity == AER_FATAL)
Dan Williams June 21, 2024, 7:23 p.m. UTC | #2
Terry Bowman wrote:
> The AER service driver clears the AER correctable error (CE) status before
> calling the correctable error handler. This results in the error's status
> not correctly reflected if read from the CE handler.
> 
> The AER CE status is needed by the portdrv's CE handler. The portdrv's
> CE handler is intended to only call the registered notifier callbacks
> if the CE error status has correctable internal error (CIE) set.

Is this a fix or a prep patch? It reads like a "fix", but there are no
notifiers to worry about today.

> This is not a problem for AER uncorrrectbale errors (UCE). The UCE status
> is still present in the AER capability and available for reading, if
> needed, when the UCE handler is called.
> 
> Change the order of clearing the CE status and calling the CE handler.
> Make it to call the CE handler first and then clear the CE status
> after returning.

With the changelog clarified to indicate whether this has any impact on
current behavior you can add:

Acked-by: Dan Williams <dan.j.williams@intel.com>
Bowman, Terry June 24, 2024, 3:08 p.m. UTC | #3
On 6/20/24 06:31, Jonathan Cameron wrote:
> On Mon, 17 Jun 2024 15:04:04 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
> 
>> The AER service driver clears the AER correctable error (CE) status before
>> calling the correctable error handler. This results in the error's status
>> not correctly reflected if read from the CE handler.
>>
>> The AER CE status is needed by the portdrv's CE handler. The portdrv's
>> CE handler is intended to only call the registered notifier callbacks
>> if the CE error status has correctable internal error (CIE) set.
>>
>> This is not a problem for AER uncorrrectbale errors (UCE). The UCE status
> 
> uncorrectable
> 

Thank you.

>> is still present in the AER capability and available for reading, if
>> needed, when the UCE handler is called.
> 
> I'm seeing the clear in the DPC path for UCE. For other cases is
> it a side effect of the reset?
> 

Depends on when its being read. I'm assuming this is after recovery in your case. 
And after recovery it will be zeroed.

Regards,
Terry

>>
>> Change the order of clearing the CE status and calling the CE handler.
>> Make it to call the CE handler first and then clear the CE status
>> after returning.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>> Cc: linux-pci@vger.kernel.org
> Seems reasonable, but many gremlins around the ordering in these
> flows, so I'm to particularly confident. With that in mind.
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huwei.com>
> 
>> ---
>>  drivers/pci/pcie/aer.c | 7 ++++---
>>  1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index ac6293c24976..4dc03cb9aff0 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -1094,9 +1094,6 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>  		 * Correctable error does not need software intervention.
>>  		 * No need to go through error recovery process.
>>  		 */
>> -		if (aer)
>> -			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
>> -					info->status);
>>  		if (pcie_aer_is_native(dev)) {
>>  			struct pci_driver *pdrv = dev->driver;
>>  
>> @@ -1105,6 +1102,10 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>  				pdrv->err_handler->cor_error_detected(dev);
>>  			pcie_clear_device_status(dev);
>>  		}
>> +		if (aer)
>> +			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
>> +					info->status);
>> +
>>  	} else if (info->severity == AER_NONFATAL)
>>  		pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset);
>>  	else if (info->severity == AER_FATAL)
>
Bowman, Terry June 24, 2024, 6 p.m. UTC | #4
Hi Dan, I added a response below.

On 6/21/24 14:23, Dan Williams wrote:
> Terry Bowman wrote:
>> The AER service driver clears the AER correctable error (CE) status before
>> calling the correctable error handler. This results in the error's status
>> not correctly reflected if read from the CE handler.
>>
>> The AER CE status is needed by the portdrv's CE handler. The portdrv's
>> CE handler is intended to only call the registered notifier callbacks
>> if the CE error status has correctable internal error (CIE) set.
> 
> Is this a fix or a prep patch? It reads like a "fix", but there are no
> notifiers to worry about today.
> 

I will add mention "in preparation for future patch".

>> This is not a problem for AER uncorrrectbale errors (UCE). The UCE status
>> is still present in the AER capability and available for reading, if
>> needed, when the UCE handler is called.
>>
>> Change the order of clearing the CE status and calling the CE handler.
>> Make it to call the CE handler first and then clear the CE status
>> after returning.
> 
> With the changelog clarified to indicate whether this has any impact on
> current behavior you can add:
> 
> Acked-by: Dan Williams <dan.j.williams@intel.com>

Regards,
Terry
diff mbox series

Patch

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index ac6293c24976..4dc03cb9aff0 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1094,9 +1094,6 @@  static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
 		 * Correctable error does not need software intervention.
 		 * No need to go through error recovery process.
 		 */
-		if (aer)
-			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
-					info->status);
 		if (pcie_aer_is_native(dev)) {
 			struct pci_driver *pdrv = dev->driver;
 
@@ -1105,6 +1102,10 @@  static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
 				pdrv->err_handler->cor_error_detected(dev);
 			pcie_clear_device_status(dev);
 		}
+		if (aer)
+			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
+					info->status);
+
 	} else if (info->severity == AER_NONFATAL)
 		pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset);
 	else if (info->severity == AER_FATAL)