Message ID | 20200622113523.891666-1-Jonathan.Cameron@huawei.com (mailing list archive) |
---|---|
State | Accepted, archived |
Delegated to: | Bjorn Helgaas |
Headers | show |
Series | [v2] PCI/AER: Do not reset the port device status if doing firmware first handling. | expand |
[+cc Austin, _OSC expert] On Mon, Jun 22, 2020 at 07:35:23PM +0800, Jonathan Cameron wrote: > pci_aer_clear_device_status() currently resets the device status > (PCI_EXP_DEVSTA) on the downstream port above a device, or the port itself > if the port is the reported AER error source. This happens even when error > handling is firmware first. > > Our interpretation is that firmware first handling means that the firmware > will deal with clearing all relevant error reporting registers > including this one. IMO "firmware-first" is meaningless to the kernel. I see the bit defined in the ACPI HEST records (ACPI v6.3, sec 18.3.2.4), but there is no indication of anything the OS needs to *do* with it. It does not influence the result of pcie_aer_is_native(). So I don't want to mention it in the subject or commit log. But I think what the _OSC negotiation for AER ownership is relevant, and that's what your patch tests, so I think this is the right thing to do. So I applied this as below to pci/error for v5.8, thanks a lot! Oh, I also propose a preliminary patch (posted and cc'd to you) to rename pci_aer_clear_device_status(): https://lore.kernel.org/r/20200717195619.766662-1-helgaas@kernel.org commit d6c8d24e3d5d ("PCI/ERR: Clear PCIe Device Status errors only if OS owns AER") Author: Jonathan Cameron <Jonathan.Cameron@huawei.com> Date: Mon Jun 22 19:35:23 2020 +0800 PCI/ERR: Clear PCIe Device Status errors only if OS owns AER pcie_clear_device_status() resets the error bits in the PCIe Device Status Register (PCI_EXP_DEVSTA). Previously we did this unconditionally, but on ACPI systems, the _OSC AER bit negotiates control of the AER capability. Per sec 4.5.1 of the System Firmware Intermediary _OSC and DPC Updates ECN [1], this bit also covers other error enable/status bits including the following: Correctable Error Reporting Enable Non-Fatal Error Reporting Enable Fatal Error Reporting Enable Unsupported Request Reporting Enable These bits are all in the PCIe Device Control register (the ECN omitted "Reporting", but I think that's a typo), so by implication the _OSC AER bit also applies to the error status bits in the PCIe Device Status register: Correctable Error Detected Non-Fatal Error Detected Fatal Error Detected Unsupported Request Detected Clear the PCIe Device Status error bits only when the OS controls the AER capability and related error enable/status bits. If platform firmware controls the AER capability, firmware is responsible for clearing these bits. One call path leading here is: ghes_do_proc ghes_handle_aer aer_recover_queue schedule_work(&aer_recover_work) ... aer_recover_work_func pcie_do_recovery pcie_clear_device_status [1] System Firmware Intermediary (SFI) _OSC and DPC Updates ECN, Feb 24, 2020, affecting PCI Firmware Specification, Rev. 3.2 https://members.pcisig.com/wg/PCI-SIG/document/14076 [bhelgaas: commit log] Link: https://lore.kernel.org/r/20200622113523.891666-1-Jonathan.Cameron@huawei.com Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index d3ea667c8520..34bfea5c52b3 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -245,6 +245,9 @@ void pcie_clear_device_status(struct pci_dev *dev) { u16 sta; + if (!pcie_aer_is_native(dev)) + return; + pcie_capability_read_word(dev, PCI_EXP_DEVSTA, &sta); pcie_capability_write_word(dev, PCI_EXP_DEVSTA, sta); }
On Fri, Jul 17, 2020 at 03:01:29PM -0500, Bjorn Helgaas wrote: > On Mon, Jun 22, 2020 at 07:35:23PM +0800, Jonathan Cameron wrote: > > pci_aer_clear_device_status() currently resets the device status > > (PCI_EXP_DEVSTA) on the downstream port above a device, or the port itself > > if the port is the reported AER error source. This happens even when error > > handling is firmware first. > > > > Our interpretation is that firmware first handling means that the firmware > > will deal with clearing all relevant error reporting registers > > including this one. > > ... > But I think what the _OSC negotiation for AER ownership is relevant, > and that's what your patch tests, so I think this is the right thing > to do. > > So I applied this as below to pci/error for v5.8, thanks a lot! Oops, sorry, I meant for v5.9. It doesn't seem to be an urgent bug fix that we could justify merging for v5.8 after the merge window.
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 3acf56683915..c7cdeaff4350 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -245,6 +245,9 @@ void pci_aer_clear_device_status(struct pci_dev *dev) { u16 sta; + if (!pcie_aer_is_native(dev)) + return; + pcie_capability_read_word(dev, PCI_EXP_DEVSTA, &sta); pcie_capability_write_word(dev, PCI_EXP_DEVSTA, sta); }
pci_aer_clear_device_status() currently resets the device status (PCI_EXP_DEVSTA) on the downstream port above a device, or the port itself if the port is the reported AER error source. This happens even when error handling is firmware first. Our interpretation is that firmware first handling means that the firmware will deal with clearing all relevant error reporting registers including this one. Bjorn Helgaas reports that this has been clarified in sec 4.5.1 of: System Firmware Intermediary (SFI) _OSC and DPC Updates ECN, Feb 24, 2020, affecting PCI Firmware Specification, Rev. 3.2 https://members.pcisig.com/wg/PCI-SIG/document/14076 The call path that triggers this unwanted clear is: ghes_do_proc-> ghes_handle_aer-> aer_recover_queue-> aer_recover_work_func-> pcie_do_recovery-> pci_aer_clear_device_status I believe this extra status clear is probably harmless so probably not worth backporting. I'm not aware of any reports of issues caused by this and only identified it as incorrect during some emulated reset flow testing. Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> --- Changes since v1: * As this is independent of the RCiEP APEI error handling patch I have separated them. * Rebase on mainline including changing to new handling of firmware first vs native handling. * More detail added to patch description including the reference Bjorn suggested. drivers/pci/pcie/aer.c | 3 +++ 1 file changed, 3 insertions(+)