diff mbox series

[v2] PCI/AER: Do not reset the port device status if doing firmware first handling.

Message ID 20200622113523.891666-1-Jonathan.Cameron@huawei.com (mailing list archive)
State Accepted, archived
Delegated to: Bjorn Helgaas
Headers show
Series [v2] PCI/AER: Do not reset the port device status if doing firmware first handling. | expand

Commit Message

Jonathan Cameron June 22, 2020, 11:35 a.m. UTC
pci_aer_clear_device_status() currently resets the device status
(PCI_EXP_DEVSTA) on the downstream port above a device, or the port itself
if the port is the reported AER error source.  This happens even when error
handling is firmware first.

Our interpretation is that firmware first handling means that the firmware
will deal with clearing all relevant error reporting registers
including this one.

Bjorn Helgaas reports that this has been clarified in sec 4.5.1 of:

  System Firmware Intermediary (SFI) _OSC and DPC Updates ECN, Feb 24,
    2020, affecting PCI Firmware Specification, Rev. 3.2
      https://members.pcisig.com/wg/PCI-SIG/document/14076

The call path that triggers this unwanted clear is:

ghes_do_proc->
ghes_handle_aer->
aer_recover_queue->
aer_recover_work_func->
pcie_do_recovery->
pci_aer_clear_device_status

I believe this extra status clear is probably harmless so probably not
worth backporting.  I'm not aware of any reports of issues caused by
this and only identified it as incorrect during some emulated reset
flow testing.

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---

Changes since v1:

* As this is independent of the RCiEP APEI error handling patch
  I have separated them.
* Rebase on mainline including changing to new handling of firmware
  first vs native handling.
* More detail added to patch description including the reference
  Bjorn suggested.

 drivers/pci/pcie/aer.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Bjorn Helgaas July 17, 2020, 8:01 p.m. UTC | #1
[+cc Austin, _OSC expert]

On Mon, Jun 22, 2020 at 07:35:23PM +0800, Jonathan Cameron wrote:
> pci_aer_clear_device_status() currently resets the device status
> (PCI_EXP_DEVSTA) on the downstream port above a device, or the port itself
> if the port is the reported AER error source.  This happens even when error
> handling is firmware first.
> 
> Our interpretation is that firmware first handling means that the firmware
> will deal with clearing all relevant error reporting registers
> including this one.

IMO "firmware-first" is meaningless to the kernel.  I see the bit
defined in the ACPI HEST records (ACPI v6.3, sec 18.3.2.4), but there
is no indication of anything the OS needs to *do* with it.  It does
not influence the result of pcie_aer_is_native().  So I don't want to
mention it in the subject or commit log.

But I think what the _OSC negotiation for AER ownership is relevant,
and that's what your patch tests, so I think this is the right thing
to do.

So I applied this as below to pci/error for v5.8, thanks a lot!

Oh, I also propose a preliminary patch (posted and cc'd to you) to
rename pci_aer_clear_device_status():

  https://lore.kernel.org/r/20200717195619.766662-1-helgaas@kernel.org

commit d6c8d24e3d5d ("PCI/ERR: Clear PCIe Device Status errors only if OS owns AER")
Author: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Date:   Mon Jun 22 19:35:23 2020 +0800

    PCI/ERR: Clear PCIe Device Status errors only if OS owns AER
    
    pcie_clear_device_status() resets the error bits in the PCIe Device Status
    Register (PCI_EXP_DEVSTA).
    
    Previously we did this unconditionally, but on ACPI systems, the _OSC AER
    bit negotiates control of the AER capability.  Per sec 4.5.1 of the System
    Firmware Intermediary _OSC and DPC Updates ECN [1], this bit also covers
    other error enable/status bits including the following:
    
      Correctable Error Reporting Enable
      Non-Fatal Error Reporting Enable
      Fatal Error Reporting Enable
      Unsupported Request Reporting Enable
    
    These bits are all in the PCIe Device Control register (the ECN omitted
    "Reporting", but I think that's a typo), so by implication the _OSC AER bit
    also applies to the error status bits in the PCIe Device Status register:
    
      Correctable Error Detected
      Non-Fatal Error Detected
      Fatal Error Detected
      Unsupported Request Detected
    
    Clear the PCIe Device Status error bits only when the OS controls the AER
    capability and related error enable/status bits.  If platform firmware
    controls the AER capability, firmware is responsible for clearing these
    bits.
    
    One call path leading here is:
    
      ghes_do_proc
        ghes_handle_aer
          aer_recover_queue
            schedule_work(&aer_recover_work)
      ...
      aer_recover_work_func
        pcie_do_recovery
          pcie_clear_device_status
    
    [1] System Firmware Intermediary (SFI) _OSC and DPC Updates ECN, Feb 24,
        2020, affecting PCI Firmware Specification, Rev. 3.2
        https://members.pcisig.com/wg/PCI-SIG/document/14076
    [bhelgaas: commit log]
    Link: https://lore.kernel.org/r/20200622113523.891666-1-Jonathan.Cameron@huawei.com
    Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index d3ea667c8520..34bfea5c52b3 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -245,6 +245,9 @@ void pcie_clear_device_status(struct pci_dev *dev)
 {
 	u16 sta;
 
+	if (!pcie_aer_is_native(dev))
+		return;
+
 	pcie_capability_read_word(dev, PCI_EXP_DEVSTA, &sta);
 	pcie_capability_write_word(dev, PCI_EXP_DEVSTA, sta);
 }
Bjorn Helgaas July 17, 2020, 9:39 p.m. UTC | #2
On Fri, Jul 17, 2020 at 03:01:29PM -0500, Bjorn Helgaas wrote:
> On Mon, Jun 22, 2020 at 07:35:23PM +0800, Jonathan Cameron wrote:
> > pci_aer_clear_device_status() currently resets the device status
> > (PCI_EXP_DEVSTA) on the downstream port above a device, or the port itself
> > if the port is the reported AER error source.  This happens even when error
> > handling is firmware first.
> > 
> > Our interpretation is that firmware first handling means that the firmware
> > will deal with clearing all relevant error reporting registers
> > including this one.
> 
> ...
> But I think what the _OSC negotiation for AER ownership is relevant,
> and that's what your patch tests, so I think this is the right thing
> to do.
> 
> So I applied this as below to pci/error for v5.8, thanks a lot!

Oops, sorry, I meant for v5.9.  It doesn't seem to be an urgent bug
fix that we could justify merging for v5.8 after the merge window.
diff mbox series

Patch

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 3acf56683915..c7cdeaff4350 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -245,6 +245,9 @@  void pci_aer_clear_device_status(struct pci_dev *dev)
 {
 	u16 sta;
 
+	if (!pcie_aer_is_native(dev))
+		return;
+
 	pcie_capability_read_word(dev, PCI_EXP_DEVSTA, &sta);
 	pcie_capability_write_word(dev, PCI_EXP_DEVSTA, sta);
 }