diff mbox series

[v9,12/14] PCI: Provide stub failed link recovery for device probing and hot plug

Message ID alpine.DEB.2.21.2306111619570.64925@angie.orcam.me.uk (mailing list archive)
State Not Applicable
Headers show
Series pci: Work around ASMedia ASM2824 PCIe link training failures | expand

Commit Message

Maciej W. Rozycki June 11, 2023, 5:20 p.m. UTC
This now fails unconditionally and will always be optimised away, but 
it provides a hook for quirks to implement recovery for failed links 
detected in device probing and device hot-plug events.

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
---
New change in v9, factored out from 7/7:

- Rename `pcie_downstream_link_retrain' to `pcie_failed_link_retrain'.

- Add stub implementation in "pci.h".
---
 drivers/pci/pci.c   |    2 ++
 drivers/pci/pci.h   |    4 ++++
 drivers/pci/probe.c |    2 ++
 3 files changed, 8 insertions(+)
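
For context, one plausible way for a later change to supersede the stub 
added here (a sketch only, not this series' exact hunk): guard the "pci.h" 
declaration so that, with quirks enabled, the real implementation comes 
from drivers/pci/quirks.c while the always-false inline remains otherwise:

	/* Sketch only: how a quirk could replace the stub. */
	#ifdef CONFIG_PCI_QUIRKS
	bool pcie_failed_link_retrain(struct pci_dev *dev);
	#else
	static inline bool pcie_failed_link_retrain(struct pci_dev *dev)
	{
		return false;
	}
	#endif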

linux-pcie-failed-link-retrain.diff

Comments

Matthew W Carlis July 22, 2024, 7:34 p.m. UTC | #1
Sorry to resurrect this one, but I was wondering why the
PCI device ID in drivers/pci/quirks.c for the ASMedia ASM2824
isn't checked before forcing the link down to Gen1... We have
had to revert this patch during our kernel migration due to it
interacting poorly with at least one older Gen3 PLX PCIe switch
vendor/generation while using DPC. In another context we have
found similar issues during system bringup without DPC while
using a more legacy hot-plug model (BIOS defaults for us..).
In both contexts our devices are stuck at Gen1 after a physical
hot-plug/insert or power-cycle.

Tried reading through the patch history/review but it was still
a little bit unclear to me. Can we add the device ID check as a
precondition to forcing link to Gen1?
Maciej W. Rozycki July 22, 2024, 8:40 p.m. UTC | #2
[+cc Ilpo for his previous involvement here]

On Mon, 22 Jul 2024, Matthew W Carlis wrote:

> Sorry to resurrect this one, but I was wondering why the
> PCI device ID in drivers/pci/quirks.c for the ASMedia ASM2824
> isn't checked before forcing the link down to Gen1... We have
> had to revert this patch during our kernel migration due to it
> interacting poorly with at least one older Gen3 PLX PCIe switch
> vendor/generation while using DPC. In another context we have
> found similar issues during system bringup without DPC while
> using a more legacy hot-plug model (BIOS defaults for us..).
> In both contexts our devices are stuck at Gen1 after a physical
> hot-plug/insert or power-cycle.

 Sorry to hear about your problems.  However, the workaround is supposed to 
trigger only if the link has already failed negotiation.  Could you please 
be more specific as to the actual scenario where it triggers?

 A scenario was mentioned earlier on where a downstream device had been 
removed from a slot and left the LBMS bit set in the corresponding 
downstream port of the upstream device.  The workaround then triggered 
when the port was rescanned with the slot still empty, which left the 
link capped at 2.5GT/s for a device subsequently inserted.  Is that what 
happens for you?
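
 For illustration only, a minimal sketch of how a removal path could clear 
a stale LBMS indication so a later empty-slot rescan does not mistake it 
for a failed link.  This is not code from the series; it only assumes the 
existing pcie_downstream_port() and pcie_capability_write_word() helpers, 
and that LBMS is RW1C:

	/*
	 * Hypothetical helper, not part of this series: clear a stale
	 * Link Bandwidth Management Status bit left behind by a removed
	 * device.  LBMS is RW1C, so writing the bit back clears it.
	 */
	static void pcie_clear_stale_lbms(struct pci_dev *bridge)
	{
		if (!pcie_downstream_port(bridge))
			return;

		pcie_capability_write_word(bridge, PCI_EXP_LNKSTA,
					   PCI_EXP_LNKSTA_LBMS);
	}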

 As I recall Ilpo has been working on changes that, among other things, 
should make sure no stale LBMS bit is left set, but I'm not sure what the 
state of affairs is here.  I've myself been too swamped in recent months 
and consequently haven't looked into any improvements in this area (and 
unrelated issues involving the system in question in my remote lab have 
further impeded me).

> Tried reading through the patch history/review but it was still
> a little bit unclear to me. Can we add the device ID check as a
> precondition to forcing link to Gen1?

 The main reason is that the downstream device is believed to be the one 
causing the issue, and obviously you can't fetch its ID if you can't 
negotiate the link so as to talk to it in the first place.

  Maciej
Matthew W Carlis July 24, 2024, 7:18 p.m. UTC | #3
Sorry for the belated response. I wasn't really sure when you first asked & I
still only have a 'hand wavy' theory here. I think one thing that is getting
us in trouble is when we turn the endpoint device on, then off, wait a
little while, then turn it back on. It seems that the port in this case
is forced to Gen1 & there is no path for the kernel to let it try another
speed without an informed user writing the register.

I'm still trying to barter for the time to dive really deeply into this, so
I must apologize if this sounds crazy or couldn't be correct.

- Matt
Matthew W Carlis July 26, 2024, 8:04 a.m. UTC | #4
On Mon, 22 Jul 2024, Maciej W. Rozycki wrote:

> The main reason is that the downstream device is believed to be the one
> causing the issue, and obviously you can't fetch its ID if you can't
> negotiate the link so as to talk to it in the first place.

I've had some more time to look into this issue. I think the problem
with this change is that it is quite strict in its assumptions about what
it means when a device fails to train, but in an environment where hot-plug
is exercised frequently you are essentially bound to have something interrupt
the link training. In the first case where we caught this problem our test
automation was doing power-cycle tortures on our endpoints. If you hit
the right timing the link will be forced down to Gen1 forever, with no other
automation to recover you, unless your device is the single device in the
allowlist which had the hardware bug in the first place.

I wonder if we can come up with some kind of alternative.

- Matt
Ilpo Järvinen July 29, 2024, 10:27 a.m. UTC | #5
On Fri, 26 Jul 2024, Matthew W Carlis wrote:

> On Mon, 22 Jul 2024, Maciej W. Rozycki wrote:
> 
> > The main reason is that the downstream device is believed to be the one
> > causing the issue, and obviously you can't fetch its ID if you can't
> > negotiate the link so as to talk to it in the first place.
> 
> I've had some more time to look into this issue. I think the problem
> with this change is that it is quite strict in its assumptions about what
> it means when a device fails to train, but in an environment where hot-plug
> is exercised frequently you are essentially bound to have something interrupt
> the link training. In the first case where we caught this problem our test
> automation was doing power-cycle tortures on our endpoints. If you hit
> the right timing the link will be forced down to Gen1 forever, with no other
> automation to recover you, unless your device is the single device in the
> allowlist which had the hardware bug in the first place.
> 
> I wonder if we can come up with some kind of alternative.

The most obvious solution is not to leave the speed at Gen1 on failure in 
the Target Speed quirk but to restore the original Target Speed value. The 
downside is that, if the current retraining interface (function) is used, 
it adds delay. But the retraining functions could be reworked such that 
retraining is only triggered when the Target Speed quirk fails, without 
waiting for its result (which will very likely fail anyway).
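
For illustration, a rough sketch of that idea (not an actual patch; it 
assumes the in-tree pcie_retrain_link() helper and the standard 
PCI_EXP_LNKCTL2 Target Link Speed field, and it fires the final retrain 
by setting PCI_EXP_LNKCTL_RL directly so there is no wait):

	/* Sketch only: restore the original Target Speed on failure. */
	static bool pcie_try_failed_link_retrain(struct pci_dev *bridge)
	{
		u16 oldtls, lnkctl2;

		pcie_capability_read_word(bridge, PCI_EXP_LNKCTL2, &lnkctl2);
		oldtls = lnkctl2 & PCI_EXP_LNKCTL2_TLS;

		/* Clamp to 2.5GT/s and retrain, waiting for the outcome. */
		lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
		lnkctl2 |= PCI_EXP_LNKCTL2_TLS_2_5GT;
		pcie_capability_write_word(bridge, PCI_EXP_LNKCTL2, lnkctl2);
		if (!pcie_retrain_link(bridge, true))
			return true;

		/*
		 * Failed: put the original Target Speed back and trigger
		 * one more retrain without waiting for its (likely bad)
		 * result.
		 */
		lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
		lnkctl2 |= oldtls;
		pcie_capability_write_word(bridge, PCI_EXP_LNKCTL2, lnkctl2);
		pcie_capability_set_word(bridge, PCI_EXP_LNKCTL,
					 PCI_EXP_LNKCTL_RL);
		return false;
	}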
Maciej W. Rozycki July 29, 2024, 2:51 p.m. UTC | #6
On Mon, 29 Jul 2024, Ilpo Järvinen wrote:

> > > The main reason is that the downstream device is believed to be the one
> > > causing the issue, and obviously you can't fetch its ID if you can't
> > > negotiate the link so as to talk to it in the first place.
> > 
> > I've had some more time to look into this issue. I think the problem
> > with this change is that it is quite strict in its assumptions about what
> > it means when a device fails to train, but in an environment where hot-plug
> > is exercised frequently you are essentially bound to have something interrupt
> > the link training. In the first case where we caught this problem our test
> > automation was doing power-cycle tortures on our endpoints. If you hit
> > the right timing the link will be forced down to Gen1 forever, with no other
> > automation to recover you, unless your device is the single device in the
> > allowlist which had the hardware bug in the first place.
> > 
> > I wonder if we can come up with some kind of alternative.
> 
> The most obvious solution is not to leave the speed at Gen1 on failure in 
> the Target Speed quirk but to restore the original Target Speed value. The 
> downside is that, if the current retraining interface (function) is used, 
> it adds delay. But the retraining functions could be reworked such that 
> retraining is only triggered when the Target Speed quirk fails, without 
> waiting for its result (which will very likely fail anyway).

 This is what I have also been thinking of.

 After the many years it took from the inception of this change until it 
landed upstream I'm not sure anymore what my original idea was behind 
leaving the link clamped on a retrain failure, but I think it was either 
not to fiddle with the setting beyond the absolute necessity at hand 
(which scenarios such as Matthew's prove wrong) or to leave the setting 
in the hope that training would eventually succeed (but that makes little 
sense, as there'd be nothing there to actually observe the success unless 
the bus gets rescanned for another reason).

 I'll be at my lab towards the end of the week for a maintenance visit, 
so I'll allocate some time to fiddle with this issue on that occasion and 
implement such an update.

  Maciej
Matthew W Carlis July 29, 2024, 6:56 p.m. UTC | #7
On Mon, 29 Jul 2024, Ilpo Järvinen wrote:

> The most obvious solution is not to leave the speed at Gen1 on failure in
> the Target Speed quirk but to restore the original Target Speed value. The
> downside is that, if the current retraining interface (function) is used,
> it adds delay.

It tends to be that I care less about how long a device is gone & more about
how it will behave once it reappears. For our purposes we don't even tend
to notice a few seconds of wiggle in this area, but we do notice an impact
if the kernel creates the nvme device & it is degraded in some way. Even
though we might have automation to recover the device, we will already have
lost more time than the proposed delay would cost, afaik.

Some of the time a human will have hot-inserted a new device, but much of
the time the device will be coming back from Downstream Port Containment,
where there won't be a person to check the correctness of the link
speed/width. In the DPC case the endpoint itself may have
reset/rebooted/crashed, where you already suffer a few hundred ms of delay
from the EP's boot time.

I would be interested to know what kind of maximum delay we would all be
willing to tolerate & what applications might care.

On Mon, 29 Jul 2024, Maciej W. Rozycki wrote:

> After the many years it took from the inception of this change until it
> landed upstream I'm not sure anymore what my original idea was behind
> leaving the link clamped

A familiar question I have been known to ask myself: "Why did I do this again?"
The scary/funny thing is that there is almost always a reason.

I do think there might be some benefit to overall system stability in having
some kind of damping on the link retraining rate, because I have also seen a
device stuck in an infinite cycle of many retrains per second. But each time
we come through the hot-insert code path the kernel should let the link
partners try to get to their maximum speeds, because it could in theory be a
totally new EP. In the handful of cases I have seen there was some kind of
defect with a particular device & replacement resolved it.
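
A minimal sketch of what such damping could look like (purely illustrative;
the 'last_retrain' field and the interval are hypothetical, not existing
struct pci_dev members, and a hot-insert path would reset the timestamp so
a freshly inserted EP still gets a full-speed attempt):

	/* Arbitrary interval, chosen only for the sake of the example. */
	#define PCIE_RETRAIN_MIN_INTERVAL	(5 * HZ)

	static bool pcie_retrain_allowed(struct pci_dev *bridge)
	{
		/* 'last_retrain' is a hypothetical unsigned long field. */
		if (time_before(jiffies,
				bridge->last_retrain + PCIE_RETRAIN_MIN_INTERVAL))
			return false;

		bridge->last_retrain = jiffies;
		return true;
	}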

- Matt
diff mbox series

Patch

Index: linux-macro/drivers/pci/pci.c
===================================================================
--- linux-macro.orig/drivers/pci/pci.c
+++ linux-macro/drivers/pci/pci.c
@@ -4912,6 +4912,8 @@  static bool pcie_wait_for_link_delay(str
 	if (active)
 		msleep(20);
 	ret = pcie_wait_for_link_status(pdev, false, active);
+	if (active && !ret)
+		ret = pcie_failed_link_retrain(pdev);
 	if (active && ret)
 		msleep(delay);
 
Index: linux-macro/drivers/pci/pci.h
===================================================================
--- linux-macro.orig/drivers/pci/pci.h
+++ linux-macro/drivers/pci/pci.h
@@ -554,6 +554,10 @@  static inline int pci_dev_specific_disab
 	return -ENOTTY;
 }
 #endif
+static inline bool pcie_failed_link_retrain(struct pci_dev *dev)
+{
+	return false;
+}
 
 /* PCI error reporting and recovery */
 pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
Index: linux-macro/drivers/pci/probe.c
===================================================================
--- linux-macro.orig/drivers/pci/probe.c
+++ linux-macro/drivers/pci/probe.c
@@ -2549,6 +2549,8 @@  void pci_device_add(struct pci_dev *dev,
 	dma_set_max_seg_size(&dev->dev, 65536);
 	dma_set_seg_boundary(&dev->dev, 0xffffffff);
 
+	pcie_failed_link_retrain(dev);
+
 	/* Fix up broken headers */
 	pci_fixup_device(pci_fixup_header, dev);