[1/2] PCI: pciehp: Add support for OS-First Hotplug and AER/DPC

Message ID 20221101000719.36828-2-Smita.KoralahalliChannabasappa@amd.com (mailing list archive)
State Changes Requested
Delegated to: Bjorn Helgaas
Series PCI: pciehp: Add support for OS-First Hotplug

Commit Message

Smita Koralahalli Nov. 1, 2022, 12:07 a.m. UTC
Current systems support Firmware-First model for hot-plug. In this model,
firmware holds the responsibility for executing the HW sequencing actions on
async or surprise add and removal events. Additionally, according to
Section 6.7.6 of PCIe Base Specification [1], firmware must also handle
the side-effects (DPC/AER events) reported on an async removal, and this
handling is abstracted from the OS.

This model, however, poses issues while rolling out updates or fixing bugs
as the servers need to be brought down for firmware updates. Hence,
introduce support for OS-First hot-plug and AER/DPC. Here, OS is
responsible for handling async add and remove along with handling of
AER/DPC events which are generated as a side-effect of async remove.

The implementation is as follows: On an async remove, a DPC is triggered as
a side-effect along with an MSI to the OS. Determine whether it is an async
remove by checking that DPC Trigger Status in the DPC Status Register and
Surprise Down Error Status in the AER Uncorrectable Error Status Register
are both set. If so, treat the DPC event as a side-effect of async remove,
clear the error status registers and continue with the hot-plug tear down
routines. If not, follow the existing routine to handle AER/DPC errors.
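
For illustration only (this is not part of the patch), the check described above can be sketched as a user-space C helper that operates on register values already read from config space. The bit definitions match include/uapi/linux/pci_regs.h; the helper names dpc_event_is_surprise_down() and dpc_trigger_reason() are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit definitions as found in include/uapi/linux/pci_regs.h. */
#define PCI_EXP_DPC_STATUS_TRIGGER	0x0001		/* DPC Trigger Status */
#define PCI_EXP_DPC_STATUS_TRIGGER_RSN	0x0006		/* DPC Trigger Reason */
#define PCI_ERR_UNC_SURPDN		0x00000020	/* Surprise Down Error */

/*
 * Hypothetical helper: decide whether a DPC event looks like a side
 * effect of an async removal, given register values the caller has
 * already read from the port's config space.
 */
static bool dpc_event_is_surprise_down(uint16_t dpc_status,
				       uint32_t aer_uncor_status)
{
	if (!(dpc_status & PCI_EXP_DPC_STATUS_TRIGGER))
		return false;	/* containment was not triggered */

	return (aer_uncor_status & PCI_ERR_UNC_SURPDN) != 0;
}

/* Extract the Trigger Reason field; 0 is "unmasked uncorrectable error". */
static unsigned int dpc_trigger_reason(uint16_t dpc_status)
{
	return (dpc_status & PCI_EXP_DPC_STATUS_TRIGGER_RSN) >> 1;
}
```

With the values from the "Dmesg before" log (DPC status 0x1f01, AER uncorrectable error status 0x00000020), dpc_event_is_surprise_down() returns true and dpc_trigger_reason() returns 0, which is the "unmasked uncorrectable error" reason.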

Dmesg before:

pcieport 0000:00:01.4: DPC: containment event, status:0x1f01 source:0x0000
pcieport 0000:00:01.4: DPC: unmasked uncorrectable error detected
pcieport 0000:00:01.4: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:00:01.4:   device [1022:14ab] error status/mask=00000020/04004000
pcieport 0000:00:01.4:    [ 5] SDES (First)
nvme nvme2: frozen state error detected, reset controller
pcieport 0000:00:01.4: DPC: Data Link Layer Link Active not set in 1000 msec
pcieport 0000:00:01.4: AER: subordinate device reset failed
pcieport 0000:00:01.4: AER: device recovery failed
pcieport 0000:00:01.4: pciehp: Slot(16): Link Down
nvme2n1: detected capacity change from 1953525168 to 0
pci 0000:04:00.0: Removing from iommu group 49

Dmesg after:

pcieport 0000:00:01.4: pciehp: Slot(16): Link Down
nvme1n1: detected capacity change from 1953525168 to 0
pci 0000:04:00.0: Removing from iommu group 37
pcieport 0000:00:01.4: pciehp: Slot(16): Card present
pci 0000:04:00.0: [8086:0a54] type 00 class 0x010802
pci 0000:04:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
pci 0000:04:00.0: Max Payload Size set to 512 (was 128, max 512)
pci 0000:04:00.0: enabling Extended Tags
pci 0000:04:00.0: Adding to iommu group 37
pci 0000:04:00.0: BAR 0: assigned [mem 0xf2400000-0xf2403fff 64bit]
pcieport 0000:00:01.4: PCI bridge to [bus 04]
pcieport 0000:00:01.4:   bridge window [io 0x1000-0x1fff]
pcieport 0000:00:01.4:   bridge window [mem 0xf2400000-0xf24fffff]
pcieport 0000:00:01.4:   bridge window [mem 0x20080800000-0x200809fffff 64bit pref]
nvme nvme1: pci function 0000:04:00.0
nvme 0000:04:00.0: enabling device (0000 -> 0002)
nvme nvme1: 128/0/0 default/read/poll queues

[1] PCI Express Base Specification Revision 6.0, Dec 16 2021.
    https://members.pcisig.com/wg/PCI-SIG/document/16609

Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
---
 drivers/pci/pcie/dpc.c | 61 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

Comments

Bjorn Helgaas Nov. 2, 2022, 11:21 p.m. UTC | #1
On Tue, Nov 01, 2022 at 12:07:18AM +0000, Smita Koralahalli wrote:
> Current systems support Firmware-First model for hot-plug. In this model,

I'm familiar with "firmware first" in the context of ACPI APEI.

Is there more "firmware first" language in the spec related to
hotplug?  Or is this just the ACPI hotplug implemented by acpiphp?  Or
is there something in the PCIe spec that talks about some firmware
interfaces needed in pciehp?  If so, please cite the specific
sections.  I see you cite PCIe r6.0, sec 6.7.6, below, but I don't see
the firmware mention there.

> firmware holds the responsibility for executing the HW sequencing actions on
> an async or surprise add and removal events. Additionally, according to
> Section 6.7.6 of PCIe Base Specification [1], firmware must also handle
> the side-effects (DPC/AER events) reported on an async removal and is
> abstract to the OS.
> 
> This model however, poses issues while rolling out updates or fixing bugs
> as the servers need to be brought down for firmware updates. Hence,
> introduce support for OS-First hot-plug and AER/DPC. Here, OS is
> responsible for handling async add and remove along with handling of
> AER/DPC events which are generated as a side-effect of async remove.
> 
> The implementation is as follows: On an async remove a DPC is triggered as
> a side-effect along with an MSI to the OS. Determine it's an async remove
> by checking for DPC Trigger Status in DPC Status Register and Surprise
> Down Error Status in AER Uncorrected Error Status to be non-zero. If true,
> treat the DPC event as a side-effect of async remove, clear the error
> status registers and continue with hot-plug tear down routines. If not,
> follow the existing routine to handle AER/DPC errors.
> 
> Dmesg before:
> 
> pcieport 0000:00:01.4: DPC: containment event, status:0x1f01 source:0x0000
> pcieport 0000:00:01.4: DPC: unmasked uncorrectable error detected
> pcieport 0000:00:01.4: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
> pcieport 0000:00:01.4:   device [1022:14ab] error status/mask=00000020/04004000
> pcieport 0000:00:01.4:    [ 5] SDES (First)
> nvme nvme2: frozen state error detected, reset controller
> pcieport 0000:00:01.4: DPC: Data Link Layer Link Active not set in 1000 msec
> pcieport 0000:00:01.4: AER: subordinate device reset failed
> pcieport 0000:00:01.4: AER: device recovery failed
> pcieport 0000:00:01.4: pciehp: Slot(16): Link Down
> nvme2n1: detected capacity change from 1953525168 to 0
> pci 0000:04:00.0: Removing from iommu group 49
> 
> Dmesg after:
> 
> pcieport 0000:00:01.4: pciehp: Slot(16): Link Down
> nvme1n1: detected capacity change from 1953525168 to 0
> pci 0000:04:00.0: Removing from iommu group 37
> pcieport 0000:00:01.4: pciehp: Slot(16): Card present
> pci 0000:04:00.0: [8086:0a54] type 00 class 0x010802
> pci 0000:04:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
> pci 0000:04:00.0: Max Payload Size set to 512 (was 128, max 512)
> pci 0000:04:00.0: enabling Extended Tags
> pci 0000:04:00.0: Adding to iommu group 37
> pci 0000:04:00.0: BAR 0: assigned [mem 0xf2400000-0xf2403fff 64bit]
> pcieport 0000:00:01.4: PCI bridge to [bus 04]
> pcieport 0000:00:01.4:   bridge window [io 0x1000-0x1fff]
> pcieport 0000:00:01.4:   bridge window [mem 0xf2400000-0xf24fffff]
> pcieport 0000:00:01.4:   bridge window [mem 0x20080800000-0x200809fffff 64bit pref]
> nvme nvme1: pci function 0000:04:00.0
> nvme 0000:04:00.0: enabling device (0000 -> 0002)
> nvme nvme1: 128/0/0 default/read/poll queues

Remove any lines that are not specifically relevant, e.g., I'm not
sure whether the BARs, iommu, MPS, extended tags info is essential.

Please indent the quoted material two spaces so it doesn't look like
the narrative text.

Thanks for working on this!

Bjorn
Lukas Wunner Nov. 4, 2022, 10:15 a.m. UTC | #2
On Tue, Nov 01, 2022 at 12:07:18AM +0000, Smita Koralahalli wrote:
> The implementation is as follows: On an async remove a DPC is triggered as
> a side-effect along with an MSI to the OS. Determine it's an async remove
> by checking for DPC Trigger Status in DPC Status Register and Surprise
> Down Error Status in AER Uncorrected Error Status to be non-zero. If true,
> treat the DPC event as a side-effect of async remove, clear the error
> status registers and continue with hot-plug tear down routines. If not,
> follow the existing routine to handle AER/DPC errors.

Instead of having the OS recognize and filter Surprise Down events,
it would also be possible to simply set the Surprise Down bit in the
Uncorrectable Error Mask Register.  This could be constrained to
Downstream Ports capable of surprise removal, i.e. those where the
is_hotplug_bridge in struct pci_dev is set.  And that check and the
register change could be performed in pci_dpc_init().

Have you considered such an alternative approach?  If you have, what
was the reason to prefer the more complex solution you're proposing?
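
A minimal user-space sketch of this alternative, assuming a simplified stand-in for struct pci_dev. The names struct fake_port and mask_surprise_down() are hypothetical; a real change would use the config-space accessors on the AER capability rather than a plain struct field.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_ERR_UNC_SURPDN	0x00000020	/* Surprise Down, from pci_regs.h */

/* Hypothetical stand-in for the few struct pci_dev fields used here. */
struct fake_port {
	bool is_hotplug_bridge;		/* port is capable of surprise removal */
	uint32_t aer_uncor_mask;	/* models the Uncorrectable Error Mask register */
};

/*
 * Set the Surprise Down bit in the mask, but only on ports that are
 * capable of surprise removal. Returns true if the mask was updated.
 */
static bool mask_surprise_down(struct fake_port *port)
{
	if (!port->is_hotplug_bridge)
		return false;

	port->aer_uncor_mask |= PCI_ERR_UNC_SURPDN;
	return true;
}
```

Starting from the mask value 0x04004000 seen in the dmesg output, a hotplug-capable port would end up with 0x04004020, while a port without is_hotplug_bridge set keeps its mask unchanged.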


> +static void pci_clear_surpdn_errors(struct pci_dev *pdev)
> +{
> +	u16 reg16;
> +	u32 reg32;
> +
> +	pci_read_config_dword(pdev, pdev->dpc_cap + PCI_EXP_DPC_RP_PIO_STATUS, &reg32);
> +	pci_write_config_dword(pdev, pdev->dpc_cap + PCI_EXP_DPC_RP_PIO_STATUS, reg32);
> +
> +	pci_read_config_word(pdev, PCI_STATUS, &reg16);
> +	pci_write_config_word(pdev, PCI_STATUS, reg16);
> +
> +	pcie_capability_read_word(pdev, PCI_EXP_DEVSTA, &reg16);
> +	pcie_capability_write_word(pdev, PCI_EXP_DEVSTA, reg16);
> +}

I don't understand why PCI_STATUS and PCI_EXP_DEVSTA need to be
touched here?


> +static void pciehp_handle_surprise_removal(struct pci_dev *pdev)

Since this function is located in dpc.c and is strictly called from
other functions in the same file, it should be prefixed dpc_, not
pciehp_.


> +	/*
> +	 * According to Section 6.13 and 6.15 of the PCIe Base Spec 6.0,
> +	 * following a hot-plug event, clear the ARI Forwarding Enable bit
> +	 * and AtomicOp Requester Enable as its not determined whether the
> +	 * next device inserted will support these capabilities. AtomicOp
> +	 * capabilities are not supported on PCI Express to PCI/PCI-X Bridges
> +	 * and any newly added component may not be an ARI device.
> +	 */
> +	pcie_capability_clear_word(pdev, PCI_EXP_DEVCTL2,
> +				   (PCI_EXP_DEVCTL2_ARI | PCI_EXP_DEVCTL2_ATOMIC_REQ));

That looks like a reasonable change, but it belongs in a separate
patch.  And I think it should be performed as part of (de-)enumeration,
not as part of DPC error handling.  What about Downstream Ports which
are not DPC-capable, I guess the bits should be cleared as well, no?

How about clearing the bits in pciehp_unconfigure_device()?
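
The bit manipulation itself is small. A sketch, assuming the Device Control 2 value has already been read from the port (devctl2_after_removal() is a hypothetical name; the bit definitions are the ones in pci_regs.h):

```c
#include <stdint.h>

#define PCI_EXP_DEVCTL2_ARI		0x0020	/* ARI Forwarding Enable */
#define PCI_EXP_DEVCTL2_ATOMIC_REQ	0x0040	/* AtomicOp Requester Enable */

/*
 * Return the Device Control 2 value with ARI Forwarding and AtomicOp
 * Requester Enable cleared, leaving all other bits untouched.
 */
static uint16_t devctl2_after_removal(uint16_t devctl2)
{
	uint16_t clear = PCI_EXP_DEVCTL2_ARI | PCI_EXP_DEVCTL2_ATOMIC_REQ;

	return (uint16_t)(devctl2 & ~clear);
}
```

In the kernel this corresponds to the pcie_capability_clear_word() call quoted above; only the place it is called from would change.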

Thanks,

Lukas
Kuppuswamy Sathyanarayanan Nov. 9, 2022, 7:12 p.m. UTC | #3
Hi,

On 10/31/22 5:07 PM, Smita Koralahalli wrote:
> Current systems support Firmware-First model for hot-plug. In this model,
> firmware holds the responsibility for executing the HW sequencing actions on
> an async or surprise add and removal events. Additionally, according to
> Section 6.7.6 of PCIe Base Specification [1], firmware must also handle
> the side-effects (DPC/AER events) reported on an async removal and is
> abstract to the OS.

I don't see anything about hotplug firmware-first info in the above
specification reference and I don't think firmware first logic exists for
hotplug. Are you referring to AER/DPC firmware handling support?

> 
> This model however, poses issues while rolling out updates or fixing bugs
> as the servers need to be brought down for firmware updates. Hence,
> introduce support for OS-First hot-plug and AER/DPC. Here, OS is
> responsible for handling async add and remove along with handling of
> AER/DPC events which are generated as a side-effect of async remove.

I think for OS handling we use the term "native". So use it instead of "OS-First".

> 
> The implementation is as follows: On an async remove a DPC is triggered as
> a side-effect along with an MSI to the OS. Determine it's an async remove
> by checking for DPC Trigger Status in DPC Status Register and Surprise
> Down Error Status in AER Uncorrected Error Status to be non-zero. If true,
> treat the DPC event as a side-effect of async remove, clear the error
> status registers and continue with hot-plug tear down routines. If not,
> follow the existing routine to handle AER/DPC errors.

I am wondering why your recovery fails below? Even if the device is disconnected,
error report handler will return PCI_ERS_RESULT_NEED_RESET and then attempt to do
reset using report_slot_reset(). This should work for your case, right?

> 
> Dmesg before:
> 
> pcieport 0000:00:01.4: DPC: containment event, status:0x1f01 source:0x0000
> pcieport 0000:00:01.4: DPC: unmasked uncorrectable error detected
> pcieport 0000:00:01.4: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
> pcieport 0000:00:01.4:   device [1022:14ab] error status/mask=00000020/04004000
> pcieport 0000:00:01.4:    [ 5] SDES (First)
> nvme nvme2: frozen state error detected, reset controller
> pcieport 0000:00:01.4: DPC: Data Link Layer Link Active not set in 1000 msec
> pcieport 0000:00:01.4: AER: subordinate device reset failed
> pcieport 0000:00:01.4: AER: device recovery failed
> pcieport 0000:00:01.4: pciehp: Slot(16): Link Down
> nvme2n1: detected capacity change from 1953525168 to 0
> pci 0000:04:00.0: Removing from iommu group 49
> 
> Dmesg after:
> 
> pcieport 0000:00:01.4: pciehp: Slot(16): Link Down
> nvme1n1: detected capacity change from 1953525168 to 0
> pci 0000:04:00.0: Removing from iommu group 37
> pcieport 0000:00:01.4: pciehp: Slot(16): Card present
> pci 0000:04:00.0: [8086:0a54] type 00 class 0x010802
> pci 0000:04:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
> pci 0000:04:00.0: Max Payload Size set to 512 (was 128, max 512)
> pci 0000:04:00.0: enabling Extended Tags
> pci 0000:04:00.0: Adding to iommu group 37
> pci 0000:04:00.0: BAR 0: assigned [mem 0xf2400000-0xf2403fff 64bit]
> pcieport 0000:00:01.4: PCI bridge to [bus 04]
> pcieport 0000:00:01.4:   bridge window [io 0x1000-0x1fff]
> pcieport 0000:00:01.4:   bridge window [mem 0xf2400000-0xf24fffff]
> pcieport 0000:00:01.4:   bridge window [mem 0x20080800000-0x200809fffff 64bit pref]
> nvme nvme1: pci function 0000:04:00.0
> nvme 0000:04:00.0: enabling device (0000 -> 0002)
> nvme nvme1: 128/0/0 default/read/poll queues
> 
> [1] PCI Express Base Specification Revision 6.0, Dec 16 2021.
>     https://members.pcisig.com/wg/PCI-SIG/document/16609
> 
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> ---
>  drivers/pci/pcie/dpc.c | 61 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 61 insertions(+)
> 
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index f5ffea17c7f8..e422876f51ad 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -293,10 +293,71 @@ void dpc_process_error(struct pci_dev *pdev)
>  	}
>  }
>  
> +static void pci_clear_surpdn_errors(struct pci_dev *pdev)
> +{
> +	u16 reg16;
> +	u32 reg32;
> +
> +	pci_read_config_dword(pdev, pdev->dpc_cap + PCI_EXP_DPC_RP_PIO_STATUS, &reg32);
> +	pci_write_config_dword(pdev, pdev->dpc_cap + PCI_EXP_DPC_RP_PIO_STATUS, reg32);
> +
> +	pci_read_config_word(pdev, PCI_STATUS, &reg16);
> +	pci_write_config_word(pdev, PCI_STATUS, reg16);
> +
> +	pcie_capability_read_word(pdev, PCI_EXP_DEVSTA, &reg16);
> +	pcie_capability_write_word(pdev, PCI_EXP_DEVSTA, reg16);
> +}
> +
> +static void pciehp_handle_surprise_removal(struct pci_dev *pdev)
> +{
> +	if (pdev->dpc_rp_extensions && dpc_wait_rp_inactive(pdev))
> +		return;
> +
> +	/*
> +	 * According to Section 6.7.6 of the PCIe Base Spec 6.0, since async
> +	 * removal might be unexpected, errors might be reported as a side
> +	 * effect of the event and software should handle them as an expected
> +	 * part of this event.
> +	 */
> +	pci_aer_raw_clear_status(pdev);
> +	pci_clear_surpdn_errors(pdev);
> +
> +	/*
> +	 * According to Section 6.13 and 6.15 of the PCIe Base Spec 6.0,
> +	 * following a hot-plug event, clear the ARI Forwarding Enable bit
> +	 * and AtomicOp Requester Enable as its not determined whether the
> +	 * next device inserted will support these capabilities. AtomicOp
> +	 * capabilities are not supported on PCI Express to PCI/PCI-X Bridges
> +	 * and any newly added component may not be an ARI device.
> +	 */
> +	pcie_capability_clear_word(pdev, PCI_EXP_DEVCTL2,
> +				   (PCI_EXP_DEVCTL2_ARI | PCI_EXP_DEVCTL2_ATOMIC_REQ));
> +
> +	pci_write_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_STATUS,
> +			      PCI_EXP_DPC_STATUS_TRIGGER);
> +}
> +
> +static bool pciehp_is_surprise_removal(struct pci_dev *pdev)
> +{
> +	u16 status;
> +
> +	pci_read_config_word(pdev, pdev->aer_cap + PCI_ERR_UNCOR_STATUS, &status);
> +
> +	if (!(status & PCI_ERR_UNC_SURPDN))
> +		return false;
> +
> +	pciehp_handle_surprise_removal(pdev);
> +
> +	return true;
> +}
> +
>  static irqreturn_t dpc_handler(int irq, void *context)
>  {
>  	struct pci_dev *pdev = context;
>  
> +	if (pciehp_is_surprise_removal(pdev))
> +		return IRQ_HANDLED;
> +
>  	dpc_process_error(pdev);
>  
>  	/* We configure DPC so it only triggers on ERR_FATAL */
Smita Koralahalli Feb. 14, 2023, 9:31 a.m. UTC | #4
Apologies for the delay in replying. Please let me know if I should resend
the patch. Responses inline.

On 11/2/2022 4:21 PM, Bjorn Helgaas wrote:
> On Tue, Nov 01, 2022 at 12:07:18AM +0000, Smita Koralahalli wrote:
>> Current systems support Firmware-First model for hot-plug. In this model,
> I'm familiar with "firmware first" in the context of ACPI APEI.
>
> Is there more "firmware first" language in the spec related to
> hotplug?  Or is this just the ACPI hotplug implemented by acpiphp?  Or
> is there something in the PCIe spec that talks about some firmware
> interfaces needed in pciehp?  If so, please cite the specific
> sections.  I see you cite PCIe r6.0, sec 6.7.6, below, but I don't see
> the firmware mention there.

Firmware-first refers to AER/DPC firmware handling support, i.e. when FW is in
full control of AER/DPC. The terms "FW-First Hotplug/OS-First Hotplug" might
look confusing here as they don't exist in the spec. Will rephrase them in the
next revision.

In simple words, this patch follows the sequencing actions of a hot remove
when DPC is enabled and HPS is suppressed, and fixes the side effects of the
remove when the OS is in full control of AER/DPC.

Another relevant reference is the PCI Firmware Specification, Revision 3.3,
section 4.6.12, "_DSM for Downstream Port Containment and Hot-Plug Surprise
Control". The spec allows for this flow: "When the operating system controls
DPC, this section describes how the operating system can request the firmware
to suppress Hot-Plug Surprise for a given DPC capable root port or a switch
port ..." "... The operating system must evaluate this _DSM function when
enabling or disabling DPC regardless of whether the operating system or system
firmware owns DPC. If the operating system owns DPC then evaluating this _DSM
function lets the system firmware know when the operating system is ready to
handle DPC events and gives the system firmware an opportunity to clear the
Hot-Plug Surprise bit, if applicable."

>
>> firmware holds the responsibility for executing the HW sequencing actions on
>> an async or surprise add and removal events. Additionally, according to
>> Section 6.7.6 of PCIe Base Specification [1], firmware must also handle
>> the side-effects (DPC/AER events) reported on an async removal and is
>> abstract to the OS.
>>
>> This model however, poses issues while rolling out updates or fixing bugs
>> as the servers need to be brought down for firmware updates. Hence,
>> introduce support for OS-First hot-plug and AER/DPC. Here, OS is
>> responsible for handling async add and remove along with handling of
>> AER/DPC events which are generated as a side-effect of async remove.
>>
>> The implementation is as follows: On an async remove a DPC is triggered as
>> a side-effect along with an MSI to the OS. Determine it's an async remove
>> by checking for DPC Trigger Status in DPC Status Register and Surprise
>> Down Error Status in AER Uncorrected Error Status to be non-zero. If true,
>> treat the DPC event as a side-effect of async remove, clear the error
>> status registers and continue with hot-plug tear down routines. If not,
>> follow the existing routine to handle AER/DPC errors.
>>
>> Dmesg before:
>>
>> pcieport 0000:00:01.4: DPC: containment event, status:0x1f01 source:0x0000
>> pcieport 0000:00:01.4: DPC: unmasked uncorrectable error detected
>> pcieport 0000:00:01.4: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
>> pcieport 0000:00:01.4:   device [1022:14ab] error status/mask=00000020/04004000
>> pcieport 0000:00:01.4:    [ 5] SDES (First)
>> nvme nvme2: frozen state error detected, reset controller
>> pcieport 0000:00:01.4: DPC: Data Link Layer Link Active not set in 1000 msec
>> pcieport 0000:00:01.4: AER: subordinate device reset failed
>> pcieport 0000:00:01.4: AER: device recovery failed
>> pcieport 0000:00:01.4: pciehp: Slot(16): Link Down
>> nvme2n1: detected capacity change from 1953525168 to 0
>> pci 0000:04:00.0: Removing from iommu group 49
>>
>> Dmesg after:
>>
>> pcieport 0000:00:01.4: pciehp: Slot(16): Link Down
>> nvme1n1: detected capacity change from 1953525168 to 0
>> pci 0000:04:00.0: Removing from iommu group 37
>> pcieport 0000:00:01.4: pciehp: Slot(16): Card present
>> pci 0000:04:00.0: [8086:0a54] type 00 class 0x010802
>> pci 0000:04:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
>> pci 0000:04:00.0: Max Payload Size set to 512 (was 128, max 512)
>> pci 0000:04:00.0: enabling Extended Tags
>> pci 0000:04:00.0: Adding to iommu group 37
>> pci 0000:04:00.0: BAR 0: assigned [mem 0xf2400000-0xf2403fff 64bit]
>> pcieport 0000:00:01.4: PCI bridge to [bus 04]
>> pcieport 0000:00:01.4:   bridge window [io 0x1000-0x1fff]
>> pcieport 0000:00:01.4:   bridge window [mem 0xf2400000-0xf24fffff]
>> pcieport 0000:00:01.4:   bridge window [mem 0x20080800000-0x200809fffff 64bit pref]
>> nvme nvme1: pci function 0000:04:00.0
>> nvme 0000:04:00.0: enabling device (0000 -> 0002)
>> nvme nvme1: 128/0/0 default/read/poll queues
> Remove any lines that are not specifically relevant, e.g., I'm not
> sure whether the BARs, iommu, MPS, extended tags info is essential.
>
> Please indent the quoted material two spaces so it doesn't look like
> the narrative text.
>
> Thanks for working on this!
>
> Bjorn

Will do in v2. Thanks for reviewing this.

Smita
Smita Koralahalli Feb. 14, 2023, 9:33 a.m. UTC | #5
On 11/4/2022 3:15 AM, Lukas Wunner wrote:
> On Tue, Nov 01, 2022 at 12:07:18AM +0000, Smita Koralahalli wrote:
>> The implementation is as follows: On an async remove a DPC is triggered as
>> a side-effect along with an MSI to the OS. Determine it's an async remove
>> by checking for DPC Trigger Status in DPC Status Register and Surprise
>> Down Error Status in AER Uncorrected Error Status to be non-zero. If true,
>> treat the DPC event as a side-effect of async remove, clear the error
>> status registers and continue with hot-plug tear down routines. If not,
>> follow the existing routine to handle AER/DPC errors.
> Instead of having the OS recognize and filter Surprise Down events,
> it would also be possible to simply set the Surprise Down bit in the
> Uncorrectable Error Mask Register.  This could be constrained to
> Downstream Ports capable of surprise removal, i.e. those where the
> is_hotplug_bridge in struct pci_dev is set.  And that check and the
> register change could be performed in pci_dpc_init().
>
> Have you considered such an alternative approach?  If you have, what
> was the reason to prefer the more complex solution you're proposing?

If we set the Surprise Down bit in the Uncorrectable Error Mask register, we
will not get a DPC event. As far as I know, we cannot set this bit at run-time
after we determine it's a surprise down event (once a pciehp interrupt is
triggered), or perhaps I just don't know how to do it!

And setting this bit at initialization might prevent true DPC events from
being triggered.

Secondly, masking the Surprise Down bit has no impact on logging errors in
the AER registers.

So I think that approach probably will not resolve the issue of clearing the
logs in the AER registers, and it complicates differentiating true errors from
surprise down events. Please correct me if I'm wrong!

I did some testing after I read your comments. What I realized is that these
DPC events (side effects of hot remove) are actually benign on AMD systems.
On a hot remove I see a Presence Detect change and a DPC event. This PD state
change triggers the pciehp ISR, which calls
pciehp_handle_presence_or_link_change() and disables the slot normally. So
essentially, this patch boils down to clearing the logs in the AER registers
and handling the error messages ("device recovery failed"...) in dmesg, which
might confuse users on a hot remove. What do you think?

Now, I'm not sure whether there will be a PD state change on other systems on
a hot remove when AER/DPC is native (OS First), in which case we should call
pciehp_disable_slot() from the dpc handler as well. Any inputs would be
appreciated here.
>
>
>> +static void pci_clear_surpdn_errors(struct pci_dev *pdev)
>> +{
>> +	u16 reg16;
>> +	u32 reg32;
>> +
>> +	pci_read_config_dword(pdev, pdev->dpc_cap + PCI_EXP_DPC_RP_PIO_STATUS, &reg32);
>> +	pci_write_config_dword(pdev, pdev->dpc_cap + PCI_EXP_DPC_RP_PIO_STATUS, reg32);
>> +
>> +	pci_read_config_word(pdev, PCI_STATUS, &reg16);
>> +	pci_write_config_word(pdev, PCI_STATUS, reg16);
>> +
>> +	pcie_capability_read_word(pdev, PCI_EXP_DEVSTA, &reg16);
>> +	pcie_capability_write_word(pdev, PCI_EXP_DEVSTA, reg16);
>> +}
> I don't understand why PCI_STATUS and PCI_EXP_DEVSTA need to be
> touched here?

This is just to mask any kind of appearance that there was an error since
the errors would have been induced by the hot plug event (just duplicating
our BIOS functionality here..).  But, please let me know if OS is already
handling the things here and if it is not required.
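
For readers unfamiliar with the read-then-write-back pattern in pci_clear_surpdn_errors(): the error bits in these status registers are write-1-to-clear (RW1C), so writing back the value just read clears exactly the bits that were set. A small user-space model of that behavior, under the simplifying assumption that all non-RW1C bits are read-only (struct rw1c_reg and its helpers are hypothetical names, not kernel APIs):

```c
#include <stdint.h>

/* Hypothetical model of a status register with write-1-to-clear bits. */
struct rw1c_reg {
	uint16_t value;		/* current register contents */
	uint16_t rw1c_mask;	/* which bits are write-1-to-clear */
};

static uint16_t reg_read(const struct rw1c_reg *reg)
{
	return reg->value;
}

/*
 * Writing 1 to an RW1C bit clears it. All other bits are treated as
 * read-only in this simplified model, so writes to them are ignored.
 */
static void reg_write(struct rw1c_reg *reg, uint16_t data)
{
	reg->value = (uint16_t)(reg->value & ~(data & reg->rw1c_mask));
}

/* Read the register and write the same value back, as the patch does. */
static void clear_status(struct rw1c_reg *reg)
{
	reg_write(reg, reg_read(reg));
}
```

As an example, modelling Device Status with its four low error-detected bits as RW1C (mask 0x000f) and a starting value of 0x0014 (Fatal Error Detected plus AUX Power Detected), clear_status() leaves 0x0010: the error bit is gone and the read-only bit is untouched.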
>
>
>> +static void pciehp_handle_surprise_removal(struct pci_dev *pdev)
> Since this function is located in dpc.c and is strictly called from
> other functions in the same file, it should be prefixed dpc_, not
> pciehp_.

Sure, will fix in v2.
>
>
>> +	/*
>> +	 * According to Section 6.13 and 6.15 of the PCIe Base Spec 6.0,
>> +	 * following a hot-plug event, clear the ARI Forwarding Enable bit
>> +	 * and AtomicOp Requester Enable as its not determined whether the
>> +	 * next device inserted will support these capabilities. AtomicOp
>> +	 * capabilities are not supported on PCI Express to PCI/PCI-X Bridges
>> +	 * and any newly added component may not be an ARI device.
>> +	 */
>> +	pcie_capability_clear_word(pdev, PCI_EXP_DEVCTL2,
>> +				   (PCI_EXP_DEVCTL2_ARI | PCI_EXP_DEVCTL2_ATOMIC_REQ));
> That looks like a reasonable change, but it belongs in a separate
> patch.  And I think it should be performed as part of (de-)enumeration,
> not as part of DPC error handling.  What about Downstream Ports which
> are not DPC-capable, I guess the bits should be cleared as well, no?

AFAIK, DPC will work on all AMD root ports. I'm not sure how we could handle
this on a per-port basis if the bridges/ports downstream of root ports don't
support DPC.
>
> How about clearing the bits in pciehp_unconfigure_device()?

Okay.

Thanks,
Smita
>
> Thanks,
>
> Lukas
Smita Koralahalli Feb. 14, 2023, 9:34 a.m. UTC | #6
On 11/9/2022 11:12 AM, Sathyanarayanan Kuppuswamy wrote:
> Hi,
>
> On 10/31/22 5:07 PM, Smita Koralahalli wrote:
>> Current systems support Firmware-First model for hot-plug. In this model,
>> firmware holds the responsibility for executing the HW sequencing actions on
>> an async or surprise add and removal events. Additionally, according to
>> Section 6.7.6 of PCIe Base Specification [1], firmware must also handle
>> the side-effects (DPC/AER events) reported on an async removal and is
>> abstract to the OS.
> I don't see anything about hotplug firmware-first info in the above
> specification reference and I don't think firmware first logic exists for
> hotplug. Are you referring to AER/DPC firmware handling support?

Yes, sorry for the confusion here. AER/DPC FW vs OS handling support is what
the patch addresses.
>
>> This model however, poses issues while rolling out updates or fixing bugs
>> as the servers need to be brought down for firmware updates. Hence,
>> introduce support for OS-First hot-plug and AER/DPC. Here, OS is
>> responsible for handling async add and remove along with handling of
>> AER/DPC events which are generated as a side-effect of async remove.
> I think for OS handling we use "native" term. So use it instead of "OS-First".

Will take care.
>
>> The implementation is as follows: On an async remove a DPC is triggered as
>> a side-effect along with an MSI to the OS. Determine it's an async remove
>> by checking for DPC Trigger Status in DPC Status Register and Surprise
>> Down Error Status in AER Uncorrected Error Status to be non-zero. If true,
>> treat the DPC event as a side-effect of async remove, clear the error
>> status registers and continue with hot-plug tear down routines. If not,
>> follow the existing routine to handle AER/DPC errors.
> I am wondering why your recovery fails below? Even if the device is disconnected,
> error report handler will return PCI_ERS_RESULT_NEED_RESET and then attempt to do
> reset using report_slot_reset(). This should work for your case, right?

Not sure, it might be because two interrupts are being fired simultaneously
here (pciehp and dpc). The device is already brought down by the pciehp
handler and the dpc handler is trying to reset the bridge after the device is
brought down?

Thanks,
Smita
>
>> Dmesg before:
>>
>> pcieport 0000:00:01.4: DPC: containment event, status:0x1f01 source:0x0000
>> pcieport 0000:00:01.4: DPC: unmasked uncorrectable error detected
>> pcieport 0000:00:01.4: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
>> pcieport 0000:00:01.4:   device [1022:14ab] error status/mask=00000020/04004000
>> pcieport 0000:00:01.4:    [ 5] SDES (First)
>> nvme nvme2: frozen state error detected, reset controller
>> pcieport 0000:00:01.4: DPC: Data Link Layer Link Active not set in 1000 msec
>> pcieport 0000:00:01.4: AER: subordinate device reset failed
>> pcieport 0000:00:01.4: AER: device recovery failed
>> pcieport 0000:00:01.4: pciehp: Slot(16): Link Down
>> nvme2n1: detected capacity change from 1953525168 to 0
>> pci 0000:04:00.0: Removing from iommu group 49
>>
>> Dmesg after:
>>
>> pcieport 0000:00:01.4: pciehp: Slot(16): Link Down
>> nvme1n1: detected capacity change from 1953525168 to 0
>> pci 0000:04:00.0: Removing from iommu group 37
>> pcieport 0000:00:01.4: pciehp: Slot(16): Card present
>> pci 0000:04:00.0: [8086:0a54] type 00 class 0x010802
>> pci 0000:04:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
>> pci 0000:04:00.0: Max Payload Size set to 512 (was 128, max 512)
>> pci 0000:04:00.0: enabling Extended Tags
>> pci 0000:04:00.0: Adding to iommu group 37
>> pci 0000:04:00.0: BAR 0: assigned [mem 0xf2400000-0xf2403fff 64bit]
>> pcieport 0000:00:01.4: PCI bridge to [bus 04]
>> pcieport 0000:00:01.4:   bridge window [io 0x1000-0x1fff]
>> pcieport 0000:00:01.4:   bridge window [mem 0xf2400000-0xf24fffff]
>> pcieport 0000:00:01.4:   bridge window [mem 0x20080800000-0x200809fffff 64bit pref]
>> nvme nvme1: pci function 0000:04:00.0
>> nvme 0000:04:00.0: enabling device (0000 -> 0002)
>> nvme nvme1: 128/0/0 default/read/poll queues
>>
>> [1] PCI Express Base Specification Revision 6.0, Dec 16 2021.
>>      https://members.pcisig.com/wg/PCI-SIG/document/16609
>>
>> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
>> ---
>>   drivers/pci/pcie/dpc.c | 61 ++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 61 insertions(+)
>>
>> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
>> index f5ffea17c7f8..e422876f51ad 100644
>> --- a/drivers/pci/pcie/dpc.c
>> +++ b/drivers/pci/pcie/dpc.c
>> @@ -293,10 +293,71 @@ void dpc_process_error(struct pci_dev *pdev)
>>   	}
>>   }
>>   
>> +static void pci_clear_surpdn_errors(struct pci_dev *pdev)
>> +{
>> +	u16 reg16;
>> +	u32 reg32;
>> +
>> +	pci_read_config_dword(pdev, pdev->dpc_cap + PCI_EXP_DPC_RP_PIO_STATUS, &reg32);
>> +	pci_write_config_dword(pdev, pdev->dpc_cap + PCI_EXP_DPC_RP_PIO_STATUS, reg32);
>> +
>> +	pci_read_config_word(pdev, PCI_STATUS, &reg16);
>> +	pci_write_config_word(pdev, PCI_STATUS, reg16);
>> +
>> +	pcie_capability_read_word(pdev, PCI_EXP_DEVSTA, &reg16);
>> +	pcie_capability_write_word(pdev, PCI_EXP_DEVSTA, reg16);
>> +}
>> +
>> +static void pciehp_handle_surprise_removal(struct pci_dev *pdev)
>> +{
>> +	if (pdev->dpc_rp_extensions && dpc_wait_rp_inactive(pdev))
>> +		return;
>> +
>> +	/*
>> +	 * According to Section 6.7.6 of the PCIe Base Spec 6.0, since async
>> +	 * removal might be unexpected, errors might be reported as a side
>> +	 * effect of the event and software should handle them as an expected
>> +	 * part of this event.
>> +	 */
>> +	pci_aer_raw_clear_status(pdev);
>> +	pci_clear_surpdn_errors(pdev);
>> +
>> +	/*
>> +	 * According to Section 6.13 and 6.15 of the PCIe Base Spec 6.0,
>> +	 * following a hot-plug event, clear the ARI Forwarding Enable bit
>> +	 * and AtomicOp Requester Enable as its not determined whether the
>> +	 * next device inserted will support these capabilities. AtomicOp
>> +	 * capabilities are not supported on PCI Express to PCI/PCI-X Bridges
>> +	 * and any newly added component may not be an ARI device.
>> +	 */
>> +	pcie_capability_clear_word(pdev, PCI_EXP_DEVCTL2,
>> +				   (PCI_EXP_DEVCTL2_ARI | PCI_EXP_DEVCTL2_ATOMIC_REQ));
>> +
>> +	pci_write_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_STATUS,
>> +			      PCI_EXP_DPC_STATUS_TRIGGER);
>> +}
>> +
>> +static bool pciehp_is_surprise_removal(struct pci_dev *pdev)
>> +{
>> +	u16 status;
>> +
>> +	pci_read_config_word(pdev, pdev->aer_cap + PCI_ERR_UNCOR_STATUS, &status);
>> +
>> +	if (!(status & PCI_ERR_UNC_SURPDN))
>> +		return false;
>> +
>> +	pciehp_handle_surprise_removal(pdev);
>> +
>> +	return true;
>> +}
>> +
>>   static irqreturn_t dpc_handler(int irq, void *context)
>>   {
>>   	struct pci_dev *pdev = context;
>>   
>> +	if (pciehp_is_surprise_removal(pdev))
>> +		return IRQ_HANDLED;
>> +
>>   	dpc_process_error(pdev);
>>   
>>   	/* We configure DPC so it only triggers on ERR_FATAL */
Smita Koralahalli March 14, 2023, 7:31 p.m. UTC | #7
Hi,

Please let me know if I should redo this on the latest tree and discuss my
comments there.

Thanks,
Smita

On 2/14/2023 1:33 AM, Smita Koralahalli wrote:
> On 11/4/2022 3:15 AM, Lukas Wunner wrote:
>> On Tue, Nov 01, 2022 at 12:07:18AM +0000, Smita Koralahalli wrote:
>>> The implementation is as follows: On an async remove a DPC is triggered as
>>> a side-effect along with an MSI to the OS. Determine it's an async remove
>>> by checking for DPC Trigger Status in DPC Status Register and Surprise
>>> Down Error Status in AER Uncorrected Error Status to be non-zero. If true,
>>> treat the DPC event as a side-effect of async remove, clear the error
>>> status registers and continue with hot-plug tear down routines. If not,
>>> follow the existing routine to handle AER/DPC errors.
>> Instead of having the OS recognize and filter Surprise Down events,
>> it would also be possible to simply set the Surprise Down bit in the
>> Uncorrectable Error Mask Register.  This could be constrained to
>> Downstream Ports capable of surprise removal, i.e. those where the
>> is_hotplug_bridge in struct pci_dev is set.  And that check and the
>> register change could be performed in pci_dpc_init().
>>
>> Have you considered such an alternative approach?  If you have, what
>> was the reason to prefer the more complex solution you're proposing?
>
> By setting the Surprise Down bit in the Uncorrectable Error Mask register
> we will not get a DPC event. What I know so far is, we cannot set this bit
> at run-time after we determine it's a surprise down event, or probably I
> don't know enough about how to do it! (once a pciehp interrupt is
> triggered..).
>
> And setting this bit at initialization might not trigger true DPC events..
>
> Second thing, is masking Surprise Down bit has no impact on logging errors
> in AER registers.
>
> So, I think that approach probably will not resolve the issue of clearing
> the logs in AER registers and will complicate things while differentiating
> true errors vs surprise down events. Please correct me if I'm wrong!!
>
> I did some testing after I read your comments.  What I realized is that
> these DPC events (side effects of hot remove) are actually benign on AMD
> systems. On a hot remove I see a Presence Detect change and a DPC event.
> This PD state change triggers the pciehp ISR, which calls
> pciehp_handle_presence_or_link_change() and disables the slot normally.
> So essentially, this patch boils down to clearing the logs in the AER
> registers and also handling those error messages ("device recovery
> failed"....) in dmesg which might confuse users on a hot remove.
> What do you think?
>
> Now, I'm not sure whether there will be a PD state change across other
> systems on a hot remove when AER/DPC is native (OS First), in which case we
> should call pciehp_disable_slot() from the dpc handler as well.. Any inputs
> would be appreciated here..
>>
>>
>>> +static void pci_clear_surpdn_errors(struct pci_dev *pdev)
>>> +{
>>> +    u16 reg16;
>>> +    u32 reg32;
>>> +
>>> +    pci_read_config_dword(pdev, pdev->dpc_cap + PCI_EXP_DPC_RP_PIO_STATUS, &reg32);
>>> +    pci_write_config_dword(pdev, pdev->dpc_cap + PCI_EXP_DPC_RP_PIO_STATUS, reg32);
>>> +
>>> +    pci_read_config_word(pdev, PCI_STATUS, &reg16);
>>> +    pci_write_config_word(pdev, PCI_STATUS, reg16);
>>> +
>>> +    pcie_capability_read_word(pdev, PCI_EXP_DEVSTA, &reg16);
>>> +    pcie_capability_write_word(pdev, PCI_EXP_DEVSTA, reg16);
>>> +}
>> I don't understand why PCI_STATUS and PCI_EXP_DEVSTA need to be
>> touched here?
>
> This is just to mask any kind of appearance that there was an error since
> the errors would have been induced by the hot plug event (just duplicating
> our BIOS functionality here..).  But, please let me know if OS is already
> handling the things here and if it is not required.
>>
>>
>>> +static void pciehp_handle_surprise_removal(struct pci_dev *pdev)
>> Since this function is located in dpc.c and is strictly called from
>> other functions in the same file, it should be prefixed dpc_, not
>> pciehp_.
>
> Sure, will fix in v2.
>>
>>
>>> +    /*
>>> +     * According to Section 6.13 and 6.15 of the PCIe Base Spec 6.0,
>>> +     * following a hot-plug event, clear the ARI Forwarding Enable bit
>>> +     * and AtomicOp Requester Enable as its not determined whether the
>>> +     * next device inserted will support these capabilities. AtomicOp
>>> +     * capabilities are not supported on PCI Express to PCI/PCI-X Bridges
>>> +     * and any newly added component may not be an ARI device.
>>> +     */
>>> +    pcie_capability_clear_word(pdev, PCI_EXP_DEVCTL2,
>>> +                   (PCI_EXP_DEVCTL2_ARI | PCI_EXP_DEVCTL2_ATOMIC_REQ));
>> That looks like a reasonable change, but it belongs in a separate
>> patch.  And I think it should be performed as part of (de-)enumeration,
>> not as part of DPC error handling.  What about Downstream Ports which
>> are not DPC-capable, I guess the bits should be cleared as well, no?
>
> AFAIK, DPC will work on all AMD root ports. I'm not sure how we could
> handle this on a per-port basis if the bridges/ports downstream of root
> ports don't support DPC..
>>
>> How about clearing the bits in pciehp_unconfigure_device()?
>
> Okay.
>
> Thanks,
> Smita
>>
>> Thanks,
>>
>> Lukas
>
>
Lukas Wunner May 10, 2023, 8:19 p.m. UTC | #8
On Tue, Feb 14, 2023 at 01:33:54AM -0800, Smita Koralahalli wrote:
> On 11/4/2022 3:15 AM, Lukas Wunner wrote:
> > On Tue, Nov 01, 2022 at 12:07:18AM +0000, Smita Koralahalli wrote:
> > > The implementation is as follows: On an async remove a DPC is triggered as
> > > a side-effect along with an MSI to the OS. Determine it's an async remove
> > > by checking for DPC Trigger Status in DPC Status Register and Surprise
> > > Down Error Status in AER Uncorrected Error Status to be non-zero. If true,
> > > treat the DPC event as a side-effect of async remove, clear the error
> > > status registers and continue with hot-plug tear down routines. If not,
> > > follow the existing routine to handle AER/DPC errors.
> > 
> > Instead of having the OS recognize and filter Surprise Down events,
> > it would also be possible to simply set the Surprise Down bit in the
> > Uncorrectable Error Mask Register.  This could be constrained to
> > Downstream Ports capable of surprise removal, i.e. those where the
> > is_hotplug_bridge in struct pci_dev is set.  And that check and the
> > register change could be performed in pci_dpc_init().
> > 
> > Have you considered such an alternative approach?  If you have, what
> > was the reason to prefer the more complex solution you're proposing?
[...]
> Second thing, is masking Surprise Down bit has no impact on logging errors
> in AER registers.

Why do you think so?  PCIe r6.0.1 sec 7.8.4.3 says:

   "A masked error [...] is not recorded or reported in the Header Log,
    TLP Prefix Log, or First Error Pointer, and is not reported to the
    PCI Express Root Complex by this Function."

So if you set the Surprise Down Error Mask bit on hotplug ports
capable of surprise removal, there should be no logging and thus
no logs to clear.


> So, I think that approach probably will not resolve the issue of clearing
> the logs in AER registers and complicate things while differentiating true
> errors vs surprise down events. Please correct me if I'm wrong!!

I disagree, I think it's worth a try.  Below please find a patch which
sets the Surprise Down Error mask bit.  Could you test if this fixes
the issue for you?


> And setting this bit at initialization might not trigger true DPC events..

I think we cannot discern whether a Surprise Down Error is caused by
surprise removal or is a true error.  We must assume the former on
surprise-capable hotplug ports.

Thanks,

Lukas

-- >8 --

From: Lukas Wunner <lukas@wunner.de>
Subject: [PATCH] PCI: pciehp: Disable Surprise Down Error reporting

On hotplug ports capable of surprise removal, Surprise Down Errors are
expected and no reason for AER or DPC to spring into action.  Although
a Surprise Down event might be caused by an error, software cannot
discern that from regular surprise removal.

Any well-behaved BIOS should mask such errors, but Smita reports a case
where hot-removing an Intel NVMe SSD [8086:0a54] from an AMD Root Port
[1022:14ab] results in irritating AER log messages and a delay of more
than 1 second caused by DPC handling:

  pcieport 0000:00:01.4: DPC: containment event, status:0x1f01 source:0x0000
  pcieport 0000:00:01.4: DPC: unmasked uncorrectable error detected
  pcieport 0000:00:01.4: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
  pcieport 0000:00:01.4:   device [1022:14ab] error status/mask=00000020/04004000
  pcieport 0000:00:01.4:    [ 5] SDES (First)
  nvme nvme2: frozen state error detected, reset controller
  pcieport 0000:00:01.4: DPC: Data Link Layer Link Active not set in 1000 msec
  pcieport 0000:00:01.4: AER: subordinate device reset failed
  pcieport 0000:00:01.4: AER: device recovery failed
  pcieport 0000:00:01.4: pciehp: Slot(16): Link Down
  nvme2n1: detected capacity change from 1953525168 to 0
  pci 0000:04:00.0: Removing from iommu group 49

Avoid by masking Surprise Down Errors on hotplug ports capable of
surprise removal.

Mask them even if AER or DPC is handled by firmware because if hotplug
control was granted to the operating system, it owns hotplug and thus
Surprise Down events.  So firmware has no business reporting or reacting
to them.

Reported-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Link: https://lore.kernel.org/all/20221101000719.36828-2-Smita.KoralahalliChannabasappa@amd.com/
Signed-off-by: Lukas Wunner <lukas@wunner.de>
---
 drivers/pci/hotplug/pciehp_hpc.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index f8c70115b691..2a206dbd76b6 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -985,6 +985,7 @@ struct controller *pcie_init(struct pcie_device *dev)
 {
 	struct controller *ctrl;
 	u32 slot_cap, slot_cap2, link_cap;
+	u16 aer_cap;
 	u8 poweron;
 	struct pci_dev *pdev = dev->port;
 	struct pci_bus *subordinate = pdev->subordinate;
@@ -1030,6 +1031,15 @@ struct controller *pcie_init(struct pcie_device *dev)
 	if (dmi_first_match(inband_presence_disabled_dmi_table))
 		ctrl->inband_presence_disabled = 1;
 
+	/*
+	 * Surprise Down Errors are par for the course on Hot-Plug Surprise
+	 * capable ports, so disable reporting in case BIOS left it enabled.
+	 */
+	aer_cap = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR);
+	if (aer_cap && slot_cap & PCI_EXP_SLTCAP_HPS)
+		pcie_capability_set_dword(pdev, aer_cap + PCI_ERR_UNCOR_MASK,
+					  PCI_ERR_UNC_SURPDN);
+
 	/* Check if Data Link Layer Link Active Reporting is implemented */
 	pcie_capability_read_dword(pdev, PCI_EXP_LNKCAP, &link_cap);
Lukas Wunner May 11, 2023, 3:23 p.m. UTC | #9
On Wed, May 10, 2023 at 10:19:37PM +0200, Lukas Wunner wrote:
> Below please find a patch which
> sets the Surprise Down Error mask bit.  Could you test if this fixes
> the issue for you?

Sorry, I failed to appreciate that pcie_capability_set_dword()
can't be used to RMW the AER capability.  Replacement patch below.
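
For readers following along: the pcie_capability_*() accessors take an offset relative to the PCI Express Capability structure, whereas the AER registers sit in an extended capability whose offset comes from pci_find_ext_capability(). The read-modify-write therefore has to use the plain config accessors at that offset. Below is a minimal user-space sketch of the same sequence. The fake_cfg array and the cfg_read32()/cfg_write32() helpers are stand-ins invented for illustration, not kernel APIs; the register offset and bit value mirror PCI_ERR_UNCOR_MASK and PCI_ERR_UNC_SURPDN from pci_regs.h.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CFG_SPACE_SIZE  4096         /* size of PCIe extended config space */
#define ERR_UNCOR_MASK  0x08         /* offset inside the AER capability */
#define ERR_UNC_SURPDN  0x00000020u  /* Surprise Down, bit 5 */

/* Hypothetical stand-in for a device's config space. */
static uint8_t fake_cfg[CFG_SPACE_SIZE];

static uint32_t cfg_read32(uint16_t off)
{
	uint32_t val;

	assert((size_t)off + sizeof(val) <= CFG_SPACE_SIZE);
	memcpy(&val, &fake_cfg[off], sizeof(val));
	return val;
}

static void cfg_write32(uint16_t off, uint32_t val)
{
	assert((size_t)off + sizeof(val) <= CFG_SPACE_SIZE);
	memcpy(&fake_cfg[off], &val, sizeof(val));
}

/* Set the Surprise Down mask bit while preserving the other mask bits. */
static void mask_surprise_down(uint16_t aer)
{
	uint32_t mask = cfg_read32(aer + ERR_UNCOR_MASK);

	mask |= ERR_UNC_SURPDN;
	cfg_write32(aer + ERR_UNCOR_MASK, mask);
}
```

With an assumed AER offset of 0x150 and the mask value 04004000 from the dmesg above, calling mask_surprise_down(0x150) would leave the mask at 04004020.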

-- >8 --

From: Lukas Wunner <lukas@wunner.de>
Subject: [PATCH] PCI: pciehp: Disable Surprise Down Error reporting

On hotplug ports capable of surprise removal, Surprise Down Errors are
expected and no reason for AER or DPC to spring into action.  Although
a Surprise Down event might be caused by an error, software cannot
discern that from regular surprise removal.

Any well-behaved BIOS should mask such errors, but Smita reports a case
where hot-removing an Intel NVMe SSD [8086:0a54] from an AMD Root Port
[1022:14ab] results in irritating AER log messages and a delay of more
than 1 second caused by DPC handling:

  pcieport 0000:00:01.4: DPC: containment event, status:0x1f01 source:0x0000
  pcieport 0000:00:01.4: DPC: unmasked uncorrectable error detected
  pcieport 0000:00:01.4: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
  pcieport 0000:00:01.4:   device [1022:14ab] error status/mask=00000020/04004000
  pcieport 0000:00:01.4:    [ 5] SDES (First)
  nvme nvme2: frozen state error detected, reset controller
  pcieport 0000:00:01.4: DPC: Data Link Layer Link Active not set in 1000 msec
  pcieport 0000:00:01.4: AER: subordinate device reset failed
  pcieport 0000:00:01.4: AER: device recovery failed
  pcieport 0000:00:01.4: pciehp: Slot(16): Link Down
  nvme2n1: detected capacity change from 1953525168 to 0
  pci 0000:04:00.0: Removing from iommu group 49

Avoid by masking Surprise Down Errors on hotplug ports capable of
surprise removal.

Mask them even if AER or DPC is handled by firmware because if hotplug
control was granted to the operating system, it owns hotplug and thus
Surprise Down events.  So firmware has no business reporting or reacting
to them.

Reported-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Link: https://lore.kernel.org/all/20221101000719.36828-2-Smita.KoralahalliChannabasappa@amd.com/
Signed-off-by: Lukas Wunner <lukas@wunner.de>
---
 drivers/pci/hotplug/pciehp_hpc.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index f8c70115b691..40a721f3b713 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -984,8 +984,9 @@ static inline int pcie_hotplug_depth(struct pci_dev *dev)
 struct controller *pcie_init(struct pcie_device *dev)
 {
 	struct controller *ctrl;
-	u32 slot_cap, slot_cap2, link_cap;
+	u32 slot_cap, slot_cap2, link_cap, aer_cap;
 	u8 poweron;
+	u16 aer;
 	struct pci_dev *pdev = dev->port;
 	struct pci_bus *subordinate = pdev->subordinate;
 
@@ -1030,6 +1031,17 @@ struct controller *pcie_init(struct pcie_device *dev)
 	if (dmi_first_match(inband_presence_disabled_dmi_table))
 		ctrl->inband_presence_disabled = 1;
 
+	/*
+	 * Surprise Down Errors are par for the course on Hot-Plug Surprise
+	 * capable ports, so disable reporting in case BIOS left it enabled.
+	 */
+	aer = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR);
+	if (aer && slot_cap & PCI_EXP_SLTCAP_HPS) {
+		pci_read_config_dword(pdev, aer + PCI_ERR_UNCOR_MASK, &aer_cap);
+		aer_cap |= PCI_ERR_UNC_SURPDN;
+		pci_write_config_dword(pdev, aer + PCI_ERR_UNCOR_MASK, aer_cap);
+	}
+
 	/* Check if Data Link Layer Link Active Reporting is implemented */
 	pcie_capability_read_dword(pdev, PCI_EXP_LNKCAP, &link_cap);
Smita Koralahalli May 15, 2023, 7:20 p.m. UTC | #10
Hi Lukas,

On 5/11/2023 8:23 AM, Lukas Wunner wrote:
> On Wed, May 10, 2023 at 10:19:37PM +0200, Lukas Wunner wrote:
>> Below please find a patch which
>> sets the Surprise Down Error mask bit.  Could you test if this fixes
>> the issue for you?
> 
> Sorry, I failed to appreciate that pcie_capability_set_dword()
> can't be used to RMW the AER capability.  Replacement patch below.
> 
> -- >8 --
> 
> From: Lukas Wunner <lukas@wunner.de>
> Subject: [PATCH] PCI: pciehp: Disable Surprise Down Error reporting
> 
> On hotplug ports capable of surprise removal, Surprise Down Errors are
> expected and no reason for AER or DPC to spring into action.  Although
> a Surprise Down event might be caused by an error, software cannot
> discern that from regular surprise removal.
> 
> Any well-behaved BIOS should mask such errors, but Smita reports a case
> where hot-removing an Intel NVMe SSD [8086:0a54] from an AMD Root Port
> [1022:14ab] results in irritating AER log messages and a delay of more
> than 1 second caused by DPC handling:
> 
>    pcieport 0000:00:01.4: DPC: containment event, status:0x1f01 source:0x0000
>    pcieport 0000:00:01.4: DPC: unmasked uncorrectable error detected
>    pcieport 0000:00:01.4: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
>    pcieport 0000:00:01.4:   device [1022:14ab] error status/mask=00000020/04004000
>    pcieport 0000:00:01.4:    [ 5] SDES (First)
>    nvme nvme2: frozen state error detected, reset controller
>    pcieport 0000:00:01.4: DPC: Data Link Layer Link Active not set in 1000 msec
>    pcieport 0000:00:01.4: AER: subordinate device reset failed
>    pcieport 0000:00:01.4: AER: device recovery failed
>    pcieport 0000:00:01.4: pciehp: Slot(16): Link Down
>    nvme2n1: detected capacity change from 1953525168 to 0
>    pci 0000:04:00.0: Removing from iommu group 49
> 
> Avoid by masking Surprise Down Errors on hotplug ports capable of
> surprise removal.
> 
> Mask them even if AER or DPC is handled by firmware because if hotplug
> control was granted to the operating system, it owns hotplug and thus
> Surprise Down events.  So firmware has no business reporting or reacting
> to them.
> 
> Reported-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> Link: https://lore.kernel.org/all/20221101000719.36828-2-Smita.KoralahalliChannabasappa@amd.com/
> Signed-off-by: Lukas Wunner <lukas@wunner.de>

Thanks for the patch. I tested it and I notice that the AER status 
registers will still be set. I just don't see a DPC event with these 
settings.

I logged the status registers after the device is removed in
pciehp_handle_presence_or_link_change().

[  467.597119] PCI_ERR_COR_STATUS 0x0
[  467.597119] PCI_ERR_UNCOR_STATUS 0x20
[  467.597120] PCI_ERR_ROOT_STATUS 0x0
[  467.597121] PCI_EXP_DPC_RP_PIO_STATUS 0x10000
[  467.597122] PCI_STATUS 0x10
[  467.597123] PCI_EXP_DEVSTA 0x604

Section 6.2.3.2.2 in PCIe Spec v6.0 also states that:
"If an individual error is masked when it is detected, its error status 
bit is still affected, but no error reporting Message is sent to the 
Root Complex, and the error is not recorded in the Header Log, TLP 
Prefix Log, or First Error Pointer"..

So we think masking will not prevent errors from being logged in the
status registers..
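
The behaviour quoted from the spec can be sketched as a toy model. This is only an illustration under simplifying assumptions: struct aer_model and aer_detect() are hypothetical names, and the model ignores severity, First Error Pointer rules and everything else a real port does. It shows only that a masked error still sets its status bit while logging and the message to the Root Complex are suppressed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ERR_UNC_SURPDN 0x00000020u /* Surprise Down, bit 5 */

/* Hypothetical, heavily simplified view of one port's AER state. */
struct aer_model {
	uint32_t uncor_status;  /* sticky error status bits */
	uint32_t uncor_mask;    /* 1 = error is masked */
	bool     header_logged; /* Header Log was written */
	unsigned msgs_sent;     /* error messages sent to the Root Complex */
};

/* Model the detection of a single uncorrectable error. */
static void aer_detect(struct aer_model *m, uint32_t err)
{
	assert(err != 0 && (err & (err - 1)) == 0); /* exactly one bit set */

	m->uncor_status |= err; /* status bit is affected even when masked */
	if (m->uncor_mask & err)
		return;         /* masked: nothing logged, nothing reported */

	m->header_logged = true;
	m->msgs_sent++;
}
```

Feeding ERR_UNC_SURPDN into a model whose mask includes that bit leaves uncor_status at 0x20 with no message sent, which matches the PCI_ERR_UNCOR_STATUS 0x20 reading above together with the absence of a DPC event.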

Let me know what you think..

Thanks,
Smita

> ---
>   drivers/pci/hotplug/pciehp_hpc.c | 14 +++++++++++++-
>   1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
> index f8c70115b691..40a721f3b713 100644
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -984,8 +984,9 @@ static inline int pcie_hotplug_depth(struct pci_dev *dev)
>   struct controller *pcie_init(struct pcie_device *dev)
>   {
>   	struct controller *ctrl;
> -	u32 slot_cap, slot_cap2, link_cap;
> +	u32 slot_cap, slot_cap2, link_cap, aer_cap;
>   	u8 poweron;
> +	u16 aer;
>   	struct pci_dev *pdev = dev->port;
>   	struct pci_bus *subordinate = pdev->subordinate;
>   
> @@ -1030,6 +1031,17 @@ struct controller *pcie_init(struct pcie_device *dev)
>   	if (dmi_first_match(inband_presence_disabled_dmi_table))
>   		ctrl->inband_presence_disabled = 1;
>   
> +	/*
> +	 * Surprise Down Errors are par for the course on Hot-Plug Surprise
> +	 * capable ports, so disable reporting in case BIOS left it enabled.
> +	 */
> +	aer = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR);
> +	if (aer && slot_cap & PCI_EXP_SLTCAP_HPS) {
> +		pci_read_config_dword(pdev, aer + PCI_ERR_UNCOR_MASK, &aer_cap);
> +		aer_cap |= PCI_ERR_UNC_SURPDN;
> +		pci_write_config_dword(pdev, aer + PCI_ERR_UNCOR_MASK, aer_cap);
> +	}
> +
>   	/* Check if Data Link Layer Link Active Reporting is implemented */
>   	pcie_capability_read_dword(pdev, PCI_EXP_LNKCAP, &link_cap);
>
Lukas Wunner May 15, 2023, 7:38 p.m. UTC | #11
On Mon, May 15, 2023 at 12:20:42PM -0700, Smita Koralahalli wrote:
> On 5/11/2023 8:23 AM, Lukas Wunner wrote:
> > Subject: [PATCH] PCI: pciehp: Disable Surprise Down Error reporting
> > 
> > On hotplug ports capable of surprise removal, Surprise Down Errors are
> > expected and no reason for AER or DPC to spring into action.  Although
> > a Surprise Down event might be caused by an error, software cannot
> > discern that from regular surprise removal.
> > 
> > Any well-behaved BIOS should mask such errors, but Smita reports a case
> > where hot-removing an Intel NVMe SSD [8086:0a54] from an AMD Root Port
> > [1022:14ab] results in irritating AER log messages and a delay of more
> > than 1 second caused by DPC handling:
[...]
> Thanks for the patch. I tested it and I notice that the AER status registers
> will still be set. I just don't see a DPC event with these settings.
> 
> I have logged in the status registers after the device is removed in
> pciehp_handle_presence_or_link_change().
[...]
> Section 6.2.3.2.2 in PCIe Spec v6.0 has also mentioned that:
> "If an individual error is masked when it is detected, its error status bit
> is still affected, but no error reporting Message is sent to the Root
> Complex, and the error is not recorded in the Header Log, TLP Prefix Log, or
> First Error Pointer"..

Thanks for the thorough testing.  So the error is logged and next time
a reporting message for a different error is sent to the Root Complex,
that earlier Surprise Down Error will be seen and you'd get belated
log messages for it, is that what you're saying?

I guess I could amend the patch to let pciehp unconditionally clear
the Surprise Down Error Status bit upon a DLLSC event.

Does the patch otherwise do what you want, i.e. no irritating messages
and no extra delay incurred by AER/DPC handling?

Thanks!

Lukas
Smita Koralahalli May 15, 2023, 8:56 p.m. UTC | #12
On 5/15/2023 12:38 PM, Lukas Wunner wrote:
> On Mon, May 15, 2023 at 12:20:42PM -0700, Smita Koralahalli wrote:
>> On 5/11/2023 8:23 AM, Lukas Wunner wrote:
>>> Subject: [PATCH] PCI: pciehp: Disable Surprise Down Error reporting
>>>
[...]
>>
>> I have logged in the status registers after the device is removed in
>> pciehp_handle_presence_or_link_change().
> [...]
>> Section 6.2.3.2.2 in PCIe Spec v6.0 has also mentioned that:
>> "If an individual error is masked when it is detected, its error status bit
>> is still affected, but no error reporting Message is sent to the Root
>> Complex, and the error is not recorded in the Header Log, TLP Prefix Log, or
>> First Error Pointer"..
> 
> Thanks for the thorough testing.  So the error is logged and next time
> a reporting message for a different error is sent to the Root Complex,
> that earlier Surprise Down Error will be seen and you'd get belated
> log messages for it, is that what you're saying?

Yes, thereby confusing the user with a false error.
> 
> I guess I could amend the patch to let pciehp unconditionally clear
> the Surprise Down Error Status bit upon a DLLSC event.

OK. First, mask the Surprise Down bit in the AER Uncorrectable Error Mask
register during initialization, and then clear all status registers
unconditionally, including DPC_RP_PIO and others, inside
pciehp_handle_presence_or_link_change()..?

Could I please know why you think masking surprise down during
initialization would be a better approach than reading the surprise down
error status on a DPC event? In both approaches we would still have to
clear the status registers, right?

> 
> Does the patch otherwise do what you want, i.e. no irritating messages
> and no extra delay incurred by AER/DPC handling?

Yes: no irritating messages and no delay, as there is no DPC error logged now.

Thanks,
Smita
> 
> Thanks!
> 
> Lukas
Lukas Wunner May 16, 2023, 10:14 a.m. UTC | #13
On Mon, May 15, 2023 at 01:56:25PM -0700, Smita Koralahalli wrote:
> Could I please know, why do you think masking surprise down during
> initialization would be a better approach than reading surprise down error
> status on a DPC event? Because in both approaches we should be however
> clearing status registers right?

Masking seemed much simpler, more elegant, less code.

I wasn't aware that masking the error merely suppresses the message to
the Root Complex plus the resulting interrupt, but still logs the error.
That's kind of a bummer, so I think your approach is fine and I've just
sent you some review feedback on your patch.

Thanks,

Lukas

Patch

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index f5ffea17c7f8..e422876f51ad 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -293,10 +293,71 @@  void dpc_process_error(struct pci_dev *pdev)
 	}
 }
 
+static void pci_clear_surpdn_errors(struct pci_dev *pdev)
+{
+	u16 reg16;
+	u32 reg32;
+
+	pci_read_config_dword(pdev, pdev->dpc_cap + PCI_EXP_DPC_RP_PIO_STATUS, &reg32);
+	pci_write_config_dword(pdev, pdev->dpc_cap + PCI_EXP_DPC_RP_PIO_STATUS, reg32);
+
+	pci_read_config_word(pdev, PCI_STATUS, &reg16);
+	pci_write_config_word(pdev, PCI_STATUS, reg16);
+
+	pcie_capability_read_word(pdev, PCI_EXP_DEVSTA, &reg16);
+	pcie_capability_write_word(pdev, PCI_EXP_DEVSTA, reg16);
+}
+
+static void pciehp_handle_surprise_removal(struct pci_dev *pdev)
+{
+	if (pdev->dpc_rp_extensions && dpc_wait_rp_inactive(pdev))
+		return;
+
+	/*
+	 * According to Section 6.7.6 of the PCIe Base Spec 6.0, since async
+	 * removal might be unexpected, errors might be reported as a side
+	 * effect of the event and software should handle them as an expected
+	 * part of this event.
+	 */
+	pci_aer_raw_clear_status(pdev);
+	pci_clear_surpdn_errors(pdev);
+
+	/*
+	 * According to Section 6.13 and 6.15 of the PCIe Base Spec 6.0,
+	 * following a hot-plug event, clear the ARI Forwarding Enable bit
+	 * and AtomicOp Requester Enable as its not determined whether the
+	 * next device inserted will support these capabilities. AtomicOp
+	 * capabilities are not supported on PCI Express to PCI/PCI-X Bridges
+	 * and any newly added component may not be an ARI device.
+	 */
+	pcie_capability_clear_word(pdev, PCI_EXP_DEVCTL2,
+				   (PCI_EXP_DEVCTL2_ARI | PCI_EXP_DEVCTL2_ATOMIC_REQ));
+
+	pci_write_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_STATUS,
+			      PCI_EXP_DPC_STATUS_TRIGGER);
+}
+
+static bool pciehp_is_surprise_removal(struct pci_dev *pdev)
+{
+	u16 status;
+
+	pci_read_config_word(pdev, pdev->aer_cap + PCI_ERR_UNCOR_STATUS, &status);
+
+	if (!(status & PCI_ERR_UNC_SURPDN))
+		return false;
+
+	pciehp_handle_surprise_removal(pdev);
+
+	return true;
+}
+
 static irqreturn_t dpc_handler(int irq, void *context)
 {
 	struct pci_dev *pdev = context;
 
+	if (pciehp_is_surprise_removal(pdev))
+		return IRQ_HANDLED;
+
 	dpc_process_error(pdev);
 
 	/* We configure DPC so it only triggers on ERR_FATAL */