Message ID | cover.1744298239.git.lukas@wunner.de (mailing list archive) |
---|---|
Headers | show |
Series | Ignore spurious PCIe hotplug events | expand |
On Thu, Apr 10, 2025 at 05:27:10PM +0200, Lukas Wunner wrote: > Trying to kill several birds with one stone here: > > First of all, PCIe hotplug is deliberately ignoring link events occurring > as a side effect of Downstream Port Containment. But it's not yet ignoring > Presence Detect Changed events. These can happen if a hotplug bridge uses > in-band presence detect. Reported by Keith Busch, patch [1/2] seeks to > fix it. > > Second, PCIe hotplug is deliberately ignoring link events and Presence > Detect Changed events occurring as a side effect of a Secondary Bus Reset. > But that's no longer working properly since the introduction of bandwidth > control in v6.13-rc1. Actually it never worked properly, but bandwidth > control is now mercilessly exposing the issue. VFIO is thus broken, > it resets the device on passthrough. Reported by Joel Mathew Thomas. > > Third, link or presence events can not only occur as a side effect of DPC > or SBR, but also because of suspend to D3cold, a firmware update or FPGA > reconfiguration. In particular, Altera engineers report that the link > goes down as a side effect of FPGA reconfiguration and the PCIe hotplug > driver responds by disabling slot power. Obviously not what you'd want > while the FPGA is being reconfigured! > > This leads me to believe that we need a generic mechanism to tell hotplug > drivers that spurious link changes are ongoing which need to be ignored. > Patch [2/2] introduces an API for it and the first user is SBR handling > in PCIe hotplug. This fixes the issue exposed by bandwidth control. > It also aligns DPC and SBR handling in the PCIe hotplug driver such that > they use the same code path. > > The API pci_hp_ignore_link_change() / pci_hp_unignore_link_change() is > initially not exported. It can be once the first modular user shows up. > > Although these are technically fixes, they're slightly intrusive, so it > would be good to let them simmer in linux-next for a while. One option > would be to apply for v6.16 and let Greg & Sasha do the backporting. > Another would be to apply to the for-linus branch for v6.15 but wait > maybe 4 weeks before a pull request is sent. > > Please review and test. Thanks! > > Lukas Wunner (2): > PCI: pciehp: Ignore Presence Detect Changed caused by DPC > PCI: pciehp: Ignore Link Down/Up caused by Secondary Bus Reset > > drivers/pci/hotplug/pci_hotplug_core.c | 69 ++++++++++++++++++++++++++++++ > drivers/pci/hotplug/pciehp.h | 1 + > drivers/pci/hotplug/pciehp_core.c | 29 ------------- > drivers/pci/hotplug/pciehp_hpc.c | 78 ++++++++++++++++++++++------------ > drivers/pci/pci.h | 3 ++ > include/linux/pci.h | 8 ++++ > 6 files changed, 132 insertions(+), 56 deletions(-) Applied to pci/hotplug for now, thanks! We can get it into -next and if all goes well easily move to for-linus for v6.15 if that's the right thing.
On Thu, Apr 10, 2025 at 05:27:10PM +0200, Lukas Wunner wrote: > Trying to kill several birds with one stone here: > > First of all, PCIe hotplug is deliberately ignoring link events occurring > as a side effect of Downstream Port Containment. But it's not yet ignoring > Presence Detect Changed events. These can happen if a hotplug bridge uses > in-band presence detect. Reported by Keith Busch, patch [1/2] seeks to > fix it. > > Second, PCIe hotplug is deliberately ignoring link events and Presence > Detect Changed events occurring as a side effect of a Secondary Bus Reset. > But that's no longer working properly since the introduction of bandwidth > control in v6.13-rc1. Actually it never worked properly, but bandwidth > control is now mercilessly exposing the issue. VFIO is thus broken, > it resets the device on passthrough. Reported by Joel Mathew Thomas. > > Third, link or presence events can not only occur as a side effect of DPC > or SBR, but also because of suspend to D3cold, a firmware update or FPGA > reconfiguration. In particular, Altera engineers report that the link > goes down as a side effect of FPGA reconfiguration and the PCIe hotplug > driver responds by disabling slot power. Obviously not what you'd want > while the FPGA is being reconfigured! > > This leads me to believe that we need a generic mechanism to tell hotplug > drivers that spurious link changes are ongoing which need to be ignored. > Patch [2/2] introduces an API for it and the first user is SBR handling > in PCIe hotplug. This fixes the issue exposed by bandwidth control. > It also aligns DPC and SBR handling in the PCIe hotplug driver such that > they use the same code path. > > The API pci_hp_ignore_link_change() / pci_hp_unignore_link_change() is > initially not exported. It can be once the first modular user shows up. > > Although these are technically fixes, they're slightly intrusive, so it > would be good to let them simmer in linux-next for a while. One option > would be to apply for v6.16 and let Greg & Sasha do the backporting. > Another would be to apply to the for-linus branch for v6.15 but wait > maybe 4 weeks before a pull request is sent. I'm a bit conflicted on this because it does appear to help. But it is ignoring a PDC and there are times where it is legit telling the host the device presence really has changed. There are switches that let you toggle downstream connections to change what's attached and it causes a DPC event, swapping out the downstream device at the same time. So this change has the pci driver resume with the wrong device if you happen to be in such a situation. I don't have such switches anymore, so I'd hate to stand in the way over some theoretical issue when this patch helps a more immediate one.