Message ID | alpine.DEB.2.21.2305310024400.59226@angie.orcam.me.uk (mailing list archive) |
---|---|
Series | pci: Work around ASMedia ASM2824 PCIe link training failures |
On Sun, Jun 11, 2023 at 06:19:08PM +0100, Maciej W. Rozycki wrote:
> Hi,
>
> This is v9 of the change to work around a PCIe link training phenomenon
> where a pair of devices both capable of operating at a link speed above
> 2.5GT/s seems unable to negotiate the link speed and continues training
> indefinitely with the Link Training bit switching on and off repeatedly
> and the data link layer never reaching the active state.
>
> With several requests addressed and a few extra issues spotted this
> version has now grown to 14 patches.  It has been verified for device
> enumeration with and without PCI_QUIRKS enabled, using the same piece of
> RISC-V hardware as previously.  Hot plug or reset events have not been
> verified, as this is difficult if at all feasible with hardware in
> question.
>
> Last iteration:
> <https://lore.kernel.org/r/alpine.DEB.2.21.2304060100160.13659@angie.orcam.me.uk/>,
> and my input to it:
> <https://lore.kernel.org/r/alpine.DEB.2.21.2306080224280.36323@angie.orcam.me.uk/>.

Thanks, I applied these to pci/enumeration for v6.5.

I tweaked a few things, so double-check to be sure I didn't break
something:

  - Moved dev->link_active_reporting init to set_pcie_port_type()
    because it does other PCIe-related stuff.

  - Reordered to keep all the link_active_reporting things together.

  - Reordered to clean up & factor pcie_retrain_link() before exposing
    it to the rest of the PCI core.

  - Moved pcie_retrain_link() a little earlier to keep it next to
    pcie_wait_for_link_status().

  - Squashed the stubs into the actual quirk so we don't have the
    intermediate state where we call the stubs but they never do
    anything (let me know if there's a reason we need your order).

  - Inline pcie_parent_link_retrain(), which seemed like it didn't add
    enough to be worthwhile.

Interdiff below:

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 80694e2574b8..f11268924c8f 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1153,27 +1153,16 @@ void pci_resume_bus(struct pci_bus *bus)
         pci_walk_bus(bus, pci_resume_one, NULL);
 }
 
-/**
- * pcie_parent_link_retrain - Check and retrain link we are downstream from
- * @dev: PCI device to handle.
- *
- * Return TRUE if the link was retrained, FALSE otherwise.
- */
-static bool pcie_parent_link_retrain(struct pci_dev *dev)
-{
-        struct pci_dev *bridge;
-
-        bridge = pci_upstream_bridge(dev);
-        if (bridge)
-                return pcie_failed_link_retrain(bridge);
-        else
-                return false;
-}
-
 static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
 {
-        bool retrain = true;
         int delay = 1;
+        bool retrain = false;
+        struct pci_dev *bridge;
+
+        if (pci_is_pcie(dev)) {
+                retrain = true;
+                bridge = pci_upstream_bridge(dev);
+        }
 
         /*
          * After reset, the device should not silently discard config
@@ -1201,9 +1190,9 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
                 }
 
                 if (delay > PCI_RESET_WAIT) {
-                        if (retrain) {
+                        if (retrain && bridge) {
                                 retrain = false;
-                                if (pcie_parent_link_retrain(dev)) {
+                                if (pcie_failed_link_retrain(bridge)) {
                                         delay = 1;
                                         continue;
                                 }
@@ -4914,6 +4903,38 @@ static bool pcie_wait_for_link_status(struct pci_dev *pdev,
         return (lnksta & lnksta_mask) == lnksta_match;
 }
 
+/**
+ * pcie_retrain_link - Request a link retrain and wait for it to complete
+ * @pdev: Device whose link to retrain.
+ * @use_lt: Use the LT bit if TRUE, or the DLLLA bit if FALSE, for status.
+ *
+ * Retrain completion status is retrieved from the Link Status Register
+ * according to @use_lt.  It is not verified whether the use of the DLLLA
+ * bit is valid.
+ *
+ * Return TRUE if successful, or FALSE if training has not completed
+ * within PCIE_LINK_RETRAIN_TIMEOUT_MS milliseconds.
+ */
+bool pcie_retrain_link(struct pci_dev *pdev, bool use_lt)
+{
+        u16 lnkctl;
+
+        pcie_capability_read_word(pdev, PCI_EXP_LNKCTL, &lnkctl);
+        lnkctl |= PCI_EXP_LNKCTL_RL;
+        pcie_capability_write_word(pdev, PCI_EXP_LNKCTL, lnkctl);
+        if (pdev->clear_retrain_link) {
+                /*
+                 * Due to an erratum in some devices the Retrain Link bit
+                 * needs to be cleared again manually to allow the link
+                 * training to succeed.
+                 */
+                lnkctl &= ~PCI_EXP_LNKCTL_RL;
+                pcie_capability_write_word(pdev, PCI_EXP_LNKCTL, lnkctl);
+        }
+
+        return pcie_wait_for_link_status(pdev, use_lt, !use_lt);
+}
+
 /**
  * pcie_wait_for_link_delay - Wait until link is active or inactive
  * @pdev: Bridge device
@@ -4968,37 +4989,6 @@ bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
         return pcie_wait_for_link_delay(pdev, active, 100);
 }
 
-/**
- * pcie_retrain_link - Request a link retrain and wait for it to complete
- * @pdev: Device whose link to retrain.
- * @use_lt: Use the LT bit if TRUE, or the DLLLA bit if FALSE, for status.
- *
- * Retrain completion status is retrieved from the Link Status Register
- * according to @use_lt.  It is not verified whether the use of the DLLLA
- * bit is valid.
- *
- * Return TRUE if successful, or FALSE if training has not completed.
- */
-bool pcie_retrain_link(struct pci_dev *pdev, bool use_lt)
-{
-        u16 lnkctl;
-
-        pcie_capability_read_word(pdev, PCI_EXP_LNKCTL, &lnkctl);
-        lnkctl |= PCI_EXP_LNKCTL_RL;
-        pcie_capability_write_word(pdev, PCI_EXP_LNKCTL, lnkctl);
-        if (pdev->clear_retrain_link) {
-                /*
-                 * Due to an erratum in some devices the Retrain Link bit
-                 * needs to be cleared again manually to allow the link
-                 * training to succeed.
-                 */
-                lnkctl &= ~PCI_EXP_LNKCTL_RL;
-                pcie_capability_write_word(pdev, PCI_EXP_LNKCTL, lnkctl);
-        }
-
-        return pcie_wait_for_link_status(pdev, use_lt, !use_lt);
-}
-
 /*
  * Find maximum D3cold delay required by all the devices on the bus.  The
  * spec says 100 ms, but firmware can lower it and we allow drivers to
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 016a9d4a61f7..f547db0a728f 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1526,6 +1526,7 @@ void set_pcie_port_type(struct pci_dev *pdev)
 {
         int pos;
         u16 reg16;
+        u32 reg32;
         int type;
         struct pci_dev *parent;
 
@@ -1539,6 +1540,10 @@ void set_pcie_port_type(struct pci_dev *pdev)
         pci_read_config_dword(pdev, pos + PCI_EXP_DEVCAP, &pdev->devcap);
         pdev->pcie_mpss = FIELD_GET(PCI_EXP_DEVCAP_PAYLOAD, pdev->devcap);
 
+        pcie_capability_read_dword(pdev, PCI_EXP_LNKCAP, &reg32);
+        if (reg32 & PCI_EXP_LNKCAP_DLLLARC)
+                pdev->link_active_reporting = 1;
+
         parent = pci_upstream_bridge(pdev);
         if (!parent)
                 return;
@@ -1828,7 +1833,6 @@ int pci_setup_device(struct pci_dev *dev)
         int err, pos = 0;
         struct pci_bus_region region;
         struct resource *res;
-        u32 linkcap;
 
         hdr_type = pci_hdr_type(dev);
 
@@ -1876,10 +1880,6 @@ int pci_setup_device(struct pci_dev *dev)
         /* "Unknown power state" */
         dev->current_state = PCI_UNKNOWN;
 
-        /* Set it early to make it available to fixups, etc. */
-        pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &linkcap);
-        dev->link_active_reporting = !!(linkcap & PCI_EXP_LNKCAP_DLLLARC);
-
         /* Early fixups, before probing the BARs */
         pci_fixup_device(pci_fixup_early, dev);
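For readers skimming the interdiff: the key piece it exposes to the rest of
the PCI core is pcie_retrain_link(), whose @use_lt flag picks which Link
Status bit signals completion.  A minimal sketch of how a hypothetical
in-core caller might use it, assuming a port with Link Active reporting
prefers the DLLLA bit (example_retrain_and_check() is illustrative only and
not part of the series):

  #include <linux/errno.h>
  #include <linux/pci.h>

  /*
   * Illustrative use of pcie_retrain_link(): with use_lt == true completion
   * is judged by the Link Training (LT) bit clearing, with use_lt == false
   * by the Data Link Layer Link Active (DLLLA) bit being set, which needs
   * Link Active reporting (pdev->link_active_reporting).
   */
  static int example_retrain_and_check(struct pci_dev *bridge)
  {
          bool use_lt = !bridge->link_active_reporting;

          if (!pcie_retrain_link(bridge, use_lt))
                  return -ETIMEDOUT;  /* Training did not complete in time. */

          return 0;
  }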
On Wed, 14 Jun 2023, Bjorn Helgaas wrote:

> > This is v9 of the change to work around a PCIe link training phenomenon
> > where a pair of devices both capable of operating at a link speed above
> > 2.5GT/s seems unable to negotiate the link speed and continues training
> > indefinitely with the Link Training bit switching on and off repeatedly
> > and the data link layer never reaching the active state.
> >
> > With several requests addressed and a few extra issues spotted this
> > version has now grown to 14 patches.  It has been verified for device
> > enumeration with and without PCI_QUIRKS enabled, using the same piece of
> > RISC-V hardware as previously.  Hot plug or reset events have not been
> > verified, as this is difficult if at all feasible with hardware in
> > question.
> >
> > Last iteration:
> > <https://lore.kernel.org/r/alpine.DEB.2.21.2304060100160.13659@angie.orcam.me.uk/>,
> > and my input to it:
> > <https://lore.kernel.org/r/alpine.DEB.2.21.2306080224280.36323@angie.orcam.me.uk/>.
>
> Thanks, I applied these to pci/enumeration for v6.5.

Great, thanks!

> I tweaked a few things, so double-check to be sure I didn't break
> something:
>
> - Moved dev->link_active_reporting init to set_pcie_port_type()
>   because it does other PCIe-related stuff.
>
> - Reordered to keep all the link_active_reporting things together.
>
> - Reordered to clean up & factor pcie_retrain_link() before exposing
>   it to the rest of the PCI core.
>
> - Moved pcie_retrain_link() a little earlier to keep it next to
>   pcie_wait_for_link_status().
>
> - Squashed the stubs into the actual quirk so we don't have the
>   intermediate state where we call the stubs but they never do
>   anything (let me know if there's a reason we need your order).
>
> - Inline pcie_parent_link_retrain(), which seemed like it didn't add
>   enough to be worthwhile.

Ack, I'll double-check and report back.  A minor nit I've spotted below:

> static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
> {
> -        bool retrain = true;
>          int delay = 1;
> +        bool retrain = false;
> +        struct pci_dev *bridge;
> +
> +        if (pci_is_pcie(dev)) {
> +                retrain = true;
> +                bridge = pci_upstream_bridge(dev);
> +        }

If doing it this way, which I actually like, I think it would be a little
bit better performance- and style-wise if this was written as:

        if (pci_is_pcie(dev)) {
                bridge = pci_upstream_bridge(dev);
                retrain = !!bridge;
        }

(or "retrain = bridge != NULL" if you prefer this style), and then we
don't have to repeatedly check two variables iff (pcie && !bridge) in the
loop below:

> @@ -1201,9 +1190,9 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
>                  }
>
>                  if (delay > PCI_RESET_WAIT) {
> -                        if (retrain) {
> +                        if (retrain && bridge) {

-- i.e. code can stay then as:

                        if (retrain) {

here.  I hope you find this observation rather obvious, so will you amend
your tree, or shall I send an incremental update?

Otherwise I don't find anything suspicious with the interdiff itself
(thanks for posting it, that's really useful indeed!), but as I say I'll
yet double-check how things look and work with your tree.  Hopefully
tomorrow (Thu), as I have other stuff yet to complete tonight.

  Maciej
On Thu, Jun 15, 2023 at 01:41:10AM +0100, Maciej W. Rozycki wrote:
> On Wed, 14 Jun 2023, Bjorn Helgaas wrote:
>
> > > This is v9 of the change to work around a PCIe link training phenomenon
> > > where a pair of devices both capable of operating at a link speed above
> > > 2.5GT/s seems unable to negotiate the link speed and continues training
> > > indefinitely with the Link Training bit switching on and off repeatedly
> > > and the data link layer never reaching the active state.
> > >
> > > With several requests addressed and a few extra issues spotted this
> > > version has now grown to 14 patches.  It has been verified for device
> > > enumeration with and without PCI_QUIRKS enabled, using the same piece of
> > > RISC-V hardware as previously.  Hot plug or reset events have not been
> > > verified, as this is difficult if at all feasible with hardware in
> > > question.
> > static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
> > {
> > -        bool retrain = true;
> >          int delay = 1;
> > +        bool retrain = false;
> > +        struct pci_dev *bridge;
> > +
> > +        if (pci_is_pcie(dev)) {
> > +                retrain = true;
> > +                bridge = pci_upstream_bridge(dev);
> > +        }
>
> If doing it this way, which I actually like, I think it would be a little
> bit better performance- and style-wise if this was written as:
>
>         if (pci_is_pcie(dev)) {
>                 bridge = pci_upstream_bridge(dev);
>                 retrain = !!bridge;
>         }
>
> (or "retrain = bridge != NULL" if you prefer this style), and then we
> don't have to repeatedly check two variables iff (pcie && !bridge) in the
> loop below:

Done, thanks, I do like that better.  I did:

        bridge = pci_upstream_bridge(dev);
        if (bridge)
                retrain = true;

because it seems like it flows more naturally when reading.

Bjorn
On Thu, 15 Jun 2023, Bjorn Helgaas wrote:

> > If doing it this way, which I actually like, I think it would be a little
> > bit better performance- and style-wise if this was written as:
> >
> >         if (pci_is_pcie(dev)) {
> >                 bridge = pci_upstream_bridge(dev);
> >                 retrain = !!bridge;
> >         }
> >
> > (or "retrain = bridge != NULL" if you prefer this style), and then we
> > don't have to repeatedly check two variables iff (pcie && !bridge) in the
> > loop below:
>
> Done, thanks, I do like that better.  I did:
>
>         bridge = pci_upstream_bridge(dev);
>         if (bridge)
>                 retrain = true;
>
> because it seems like it flows more naturally when reading.

Perfect, and good timing too, as I have just started checking your tree as
your message arrived.  I ran my usual tests with and w/o PCI_QUIRKS
enabled and results were as expected.  As before I didn't check hot plug
and reset paths as these features are awkward with the HiFive Unmatched
system involved.

I have skimmed over the changes as committed to pci/enumeration and found
nothing suspicious.  I have verified that the tree builds as at each of
them with my configuration.

As per my earlier remark:

> I think making a system halfway-fixed would make little sense, but with
> the actual fix actually made last as you suggested I think this can be
> split off, because it'll make no functional change by itself.

I am not perfectly happy with your rearrangement to fold the !PCI_QUIRKS
stub into the change carrying the actual workaround and then have the
reset path update with a follow-up change only, but I won't fight over it.
It's only one tree revision that will be in this halfway-fixed state and
I'll trust your judgement here.

Let me know if anything pops up related to these changes anytime and I'll
be happy to look into it.  The system involved is nearing two years since
its deployment already, but hopefully it has many years to go yet and will
continue being ready to verify things.  It's not that there's lots of real
RISC-V hardware available, let alone with PCI/e connectivity.

Thank you for staying with me and reviewing this patch series through all
the iterations.

  Maciej
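For reference, the pattern settled on above can be read as: derive the
upstream bridge and the retrain decision once, up front, so the wait loop
only ever needs to test a single flag.  A sketch of that pattern as a
hypothetical helper, reconstructed from the hunks quoted in this thread
rather than copied from the tree (the real code simply open-codes it at
the top of pci_dev_wait()):

  #include <linux/pci.h>

  /*
   * Decide @retrain and @bridge once so that the reset polling loop only
   * has to test @retrain before giving pcie_failed_link_retrain() a single
   * chance to recover the link.
   */
  static bool example_reset_retrain_setup(struct pci_dev *dev,
                                          struct pci_dev **bridge)
  {
          bool retrain = false;

          *bridge = NULL;
          if (pci_is_pcie(dev)) {
                  *bridge = pci_upstream_bridge(dev);
                  if (*bridge)
                          retrain = true;
          }

          return retrain;
  }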
On Fri, Jun 16, 2023 at 01:27:52PM +0100, Maciej W. Rozycki wrote:
> On Thu, 15 Jun 2023, Bjorn Helgaas wrote:
> As per my earlier remark:
>
> > I think making a system halfway-fixed would make little sense, but with
> > the actual fix actually made last as you suggested I think this can be
> > split off, because it'll make no functional change by itself.
>
> I am not perfectly happy with your rearrangement to fold the !PCI_QUIRKS
> stub into the change carrying the actual workaround and then have the
> reset path update with a follow-up change only, but I won't fight over it.
> It's only one tree revision that will be in this halfway-fixed state and
> I'll trust your judgement here.

Thanks for raising this.  Here's my thought process:

  12 PCI: Provide stub failed link recovery for device probing and hot plug
  13 PCI: Add failed link recovery for device reset events
  14 PCI: Work around PCIe link training failures

Patch 12 [1] adds calls to pcie_failed_link_retrain(), which does nothing
and returns false.  Functionally, it's a no-op, but the structure is
important later.

Patch 13 [2] claims to request failed link recovery after resets, but
actually doesn't do anything yet because pcie_failed_link_retrain() is
still a no-op, so this was a bit confusing.

Patch 14 [3] implements pcie_failed_link_retrain(), so the recovery
mentioned in 12 and 13 actually happens.  But this patch doesn't add the
call to pcie_failed_link_retrain(), so it's a little bit hard to connect
the dots.

I agree that as I rearranged it, the workaround doesn't apply in all
cases simultaneously.  Maybe not ideal, but maybe not terrible either.
Looking at it again, maybe it would have made more sense to move the
pcie_wait_for_link_delay() change to the last patch along with the
pci_dev_wait() change.  I dunno.

Bjorn

[1] 12 https://lore.kernel.org/r/alpine.DEB.2.21.2306111619570.64925@angie.orcam.me.uk
[2] 13 https://lore.kernel.org/r/alpine.DEB.2.21.2306111631050.64925@angie.orcam.me.uk
[3] 14 https://lore.kernel.org/r/alpine.DEB.2.21.2305310038540.59226@angie.orcam.me.uk
On Fri, 16 Jun 2023, Bjorn Helgaas wrote:

> I agree that as I rearranged it, the workaround doesn't apply in all
> cases simultaneously.  Maybe not ideal, but maybe not terrible either.
> Looking at it again, maybe it would have made more sense to move the
> pcie_wait_for_link_delay() change to the last patch along with the
> pci_dev_wait() change.  I dunno.

I think the order of the changes is not important enough to justify
spending a lot of time and mental effort on it.  You decided, so be it.
Thank you for your effort made with this review.

With this series out of the way I have now posted a small clean-up for
SBR code duplication between the PCI core and an InfiniBand driver I came
across in the course of working on this series.  See
<https://lore.kernel.org/r/alpine.DEB.2.21.2306200153110.14084@angie.orcam.me.uk/>.
Please have a look at your convenience.

  Maciej
Hello again. I just realized that my first response to this thread two weeks
ago did not actually start from the end of the discussion. I hope I found
it now... Must say sorry for this, I am still figuring out how to follow these
threads.

I need to ask if we can either revert this patch or modify the quirk to
only run on the device in question (the ASMedia ASM2824). We have now identified
it as causing devices to get stuck at Gen1 in multiple generations of our
hardware & across product lines on ports where hot-plug is common. To be a
little more specific, this includes Intel root ports, Broadcom PCIe switch
ports and also Microchip PCIe switch ports.

The most common place where we see our systems getting stuck at Gen1 is with
device power cycling. If a device is powered on and then off quickly then the
link will of course fail to train & the consequence here is that the port is
forced to Gen1 forever. Does anybody know why the patch will only remove the
forced Gen1 speed from the ASMedia device?

- Matt
On Mon, Aug 05, 2024 at 06:06:59PM -0600, Matthew W Carlis wrote:
> Hello again. I just realized that my first response to this thread two weeks
> ago did not actually start from the end of the discussion. I hope I found
> it now... Must say sorry for this, I am still figuring out how to follow these
> threads.
>
> I need to ask if we can either revert this patch or modify the quirk to
> only run on the device in question (the ASMedia ASM2824). We have now identified
> it as causing devices to get stuck at Gen1 in multiple generations of our
> hardware & across product lines on ports where hot-plug is common. To be a
> little more specific, this includes Intel root ports, Broadcom PCIe switch
> ports and also Microchip PCIe switch ports.
>
> The most common place where we see our systems getting stuck at Gen1 is with
> device power cycling. If a device is powered on and then off quickly then the
> link will of course fail to train & the consequence here is that the port is
> forced to Gen1 forever. Does anybody know why the patch will only remove the
> forced Gen1 speed from the ASMedia device?

Thanks for keeping this thread alive.

I don't know the fix, but it does seem like this series made ASMedia ASM2824
work better but caused regressions elsewhere, so maybe we just need to accept
that ASM2824 is slightly broken and doesn't work as well as it should.

Bjorn
On Tue, 06 Aug 2024, Bjorn Helgaas wrote:

> it does seem like this series made ASMedia ASM2824 work better but
> caused regressions elsewhere, so maybe we just need to accept that
> ASM2824 is slightly broken and doesn't work as well as it should.

One of my colleagues challenged me to provide a more concrete example
where the change will cause problems. One such configuration would be not
implementing the Power Controller Control in the Slot Capabilities Register.
Then, powering off the slot via out-of-band interfaces would result in the
kernel forcing the DSP to Gen1 100% of the time as far as I can tell.
The aspect of this force to Gen1 that is the most concerning to my team is
that it isn't cleaned up even if we replaced the EP with some other EP.

I was curious about the PCIe devices mentioned in the commit because I
look at crazy malfunctioning devices too often so I pasted the following:
"Delock Riser Card PCI Expres 41433" into Google.
I'm not really a physical layer guy, but is it possible that the reported
issue is due to signal integrity? I'm not sure if sending PCIe over a USB
cable is "reliable".

I've never worked with an ASMedia switch and don't have a reliable way to
reproduce anything like the interaction between the two devices at hand. As
much as I hate to make the request, my thinking is that the patch should be
reverted until there is a solution that doesn't leave the link forced to
Gen1 forever for every EP thereafter.

- Matt
On Wed, 7 Aug 2024, Matthew W Carlis wrote:

> > it does seem like this series made ASMedia ASM2824 work better but
> > caused regressions elsewhere, so maybe we just need to accept that
> > ASM2824 is slightly broken and doesn't work as well as it should.
>
> One of my colleagues challenged me to provide a more concrete example
> where the change will cause problems. One such configuration would be not
> implementing the Power Controller Control in the Slot Capabilities Register.
> Then, powering off the slot via out-of-band interfaces would result in the
> kernel forcing the DSP to Gen1 100% of the time as far as I can tell.
> The aspect of this force to Gen1 that is the most concerning to my team is
> that it isn't cleaned up even if we replaced the EP with some other EP.

Why does that happen?

For the quirk to trigger, the link has to be down and there has to be the
LBMS Link Status bit set from link management events as per the PCIe spec
while the link was previously up, and then both of that while rescanning
the PCIe device in question, so there's a lot of conditions to meet.  Is
it the case that in your setup there is no device at this point, but one
gets plugged in later?

One aspect to mention here is that when taking a device offline the LBMS
Link Status bit really ought to be cleared by Linux in the corresponding
downstream port of the parent device.  As I recall Ilpo was working on a
broader link bandwidth management subsystem for Linux, which would do that
among others.  He asked me to make experiments with my problematic machine
to see if this would interfere, but unrelated issues with the DRAM
controller (now fixed by reducing the DDR clock rate) have prevented me
from doing that.  I'll see if I can get back to that soon.

> I was curious about the PCIe devices mentioned in the commit because I
> look at crazy malfunctioning devices too often so I pasted the following:
> "Delock Riser Card PCI Expres 41433" into Google.
> I'm not really a physical layer guy, but is it possible that the reported
> issue is due to signal integrity? I'm not sure if sending PCIe over a USB
> cable is "reliable".

Well, it's a transmission line and as long as bandwidth and latency
requirements are met it's as good as any.  My understanding has been that
one of the objectives of PCIe over conventional PCI was to make external
links such as to expansion boxes easier to implement.  Please note that it
is a purpose-made cable too, rather than just an off-the-shelf USB cable.
Then I have solutions deployed with PCIe routed over DVI cables too.

I doubt it could be a signal integrity issue, because once the link has
negotiated the highest speed possible (Gen2) via this complicated dance it
works with no issues for months.  I'm not sure what the highest uptime for
the system in question exactly was, but it was in the range of half a
year, and I have a network interface downstream which I regularly use for
heavy NFS traffic in GNU tool chain regression verification, so any issue
would have shown up pretty quickly.

Given how switching between speed rates works with PCIe (by establishing a
link at 2.5GT/s first and then exchanging rates available as data before
choosing the highest supported by both endpoints) I suspect that it is a
protocol issue: either or both devices have got it slightly wrong, which
breaks it when they're combined together.  Otherwise why would retraining
to 5GT/s by hand work while it doesn't when done by the hardware itself?
There is not much if any difference here between both scenarios really.

> I've never worked with an ASMedia switch and don't have a reliable way to
> reproduce anything like the interaction between the two devices at hand. As
> much as I hate to make the request, my thinking is that the patch should be
> reverted until there is a solution that doesn't leave the link forced to
> Gen1 forever for every EP thereafter.

I'm working on such a change this week.  It's just that my primary
objectives for this maintenance visit at my lab were to fix a pair of
broken PSUs (which I have now done) and upgrade the firmware of a console
manager device to fix an issue with a remote serial BREAK feature
affecting Magic SysRq (also completed), plus I have a day job too that is
unrelated.

I also bought a PCIe to dual M.2 M-Key option card with the ASM2824 switch
onboard to have a different setup to evaluate and determine if this issue
is specific to the RISC-V board or not.  Unfortunately the ASM2824 does
not bring the downstream ports up if the card is placed in a slot that
does not supply Vaux (which is a conforming arrangement according to the
PCIe spec), apparently due to a quirk in the ASM2824 switch according to
the option card manufacturer, and there is no software workaround
possible.  So I won't be able to use this alternative arrangement until I
have modified the option card, which I won't be able to do before October
at the earliest.

I just can't help that there are so many broken hardware designs out there
and I have to strive to navigate through and do my job regardless.

  Maciej
On Mon, 5 Aug 2024, Matthew W Carlis wrote:

> The most common place where we see our systems getting stuck at Gen1 is with
> device power cycling. If a device is powered on and then off quickly then the
> link will of course fail to train & the consequence here is that the port is
> forced to Gen1 forever. Does anybody know why the patch will only remove the
> forced Gen1 speed from the ASMedia device?

I know, as the author of the change.

The reason for this is safety: it's better to have a link run at 2.5GT/s
than not at all, and from the nature of the issue there is no guarantee
that if you remove the speed clamp, then the link won't go back into the
infinite retraining loop that the workaround is supposed to break.

I was only able to verify that the speed clamp is safe to lift with this
particular device, but if you hit the actual issue with any other device
and find that it is safe to remove the clamp as well, then you're welcome
to add it to the list too.

  Maciej
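To make the trade-off concrete: the clamp being discussed amounts to
forcing the Target Link Speed field in Link Control 2 down to 2.5GT/s and
retraining, which is the step pcie_failed_link_retrain() takes for the
allow-listed device.  The sketch below paraphrases that step
(example_clamp_to_2_5gt() is an illustrative helper, not the quirk itself):

  #include <linux/pci.h>

  /*
   * Paraphrase of the clamp step: restrict the Target Link Speed (TLS) to
   * 2.5GT/s in the Link Control 2 register of the downstream port, then
   * retrain and wait on the DLLLA bit (use_lt == false).  Because TLS is
   * sticky, the restriction survives PCIe resets until software lifts it.
   */
  static bool example_clamp_to_2_5gt(struct pci_dev *bridge)
  {
          pcie_capability_clear_and_set_word(bridge, PCI_EXP_LNKCTL2,
                                             PCI_EXP_LNKCTL2_TLS,
                                             PCI_EXP_LNKCTL2_TLS_2_5GT);

          return pcie_retrain_link(bridge, false);
  }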
On Wed, Aug 7, 2024 at 9:14 PM Maciej W. Rozycki <macro@orcam.me.uk> wrote:
>
> On Wed, 7 Aug 2024, Matthew W Carlis wrote:
>
> > > it does seem like this series made ASMedia ASM2824 work better but
> > > caused regressions elsewhere, so maybe we just need to accept that
> > > ASM2824 is slightly broken and doesn't work as well as it should.
> >
> > One of my colleagues challenged me to provide a more concrete example
> > where the change will cause problems. One such configuration would be not
> > implementing the Power Controller Control in the Slot Capabilities Register.
> > Then, powering off the slot via out-of-band interfaces would result in the
> > kernel forcing the DSP to Gen1 100% of the time as far as I can tell.
> > The aspect of this force to Gen1 that is the most concerning to my team is
> > that it isn't cleaned up even if we replaced the EP with some other EP.
>
> Why does that happen?
>
> For the quirk to trigger, the link has to be down and there has to be the
> LBMS Link Status bit set from link management events as per the PCIe spec
> while the link was previously up, and then both of that while rescanning
> the PCIe device in question, so there's a lot of conditions to meet.  Is
> it the case that in your setup there is no device at this point, but one
> gets plugged in later?

My read was that Matt is essentially doing a surprise hot-unplug by
removing power to the card without notifying the OS. I thought the
LBMS bit wouldn't be set in that case since the link goes down rather
than changes speed, but the spec is a little vague and that appears to
be happening in Matt's testing. It might be worth disabling the
workaround if the port has the surprise hotplug capability bit set.
It's fairly common for ports on NVMe drive backplanes to have it set
and a lot of people would be unhappy about those being forced to Gen 1
by accident.
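One way the suggestion above could be expressed, assuming the check keys
off the Hot-Plug Surprise bit in the port's Slot Capabilities
(example_port_has_surprise_hotplug() is a hypothetical helper, not code
from the series or from any posted patch):

  #include <linux/pci.h>

  /*
   * Hypothetical check for the suggestion above: report whether a
   * downstream port advertises Hot-Plug Surprise, so callers could skip
   * the 2.5GT/s clamp on slots where devices are expected to vanish
   * without notice.
   */
  static bool example_port_has_surprise_hotplug(struct pci_dev *bridge)
  {
          u32 sltcap;

          if (!bridge->is_hotplug_bridge)
                  return false;

          pcie_capability_read_dword(bridge, PCI_EXP_SLTCAP, &sltcap);
          return !!(sltcap & PCI_EXP_SLTCAP_HPS);
  }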
On Wed, 7 Aug 2024 22:29:35 +1000, Oliver O'Halloran wrote:
> My read was that Matt is essentially doing a surprise hot-unplug by
> removing power to the card without notifying the OS. I thought the
> LBMS bit wouldn't be set in that case since the link goes down rather
> than changes speed, but the spec is a little vague and that appears to
> be happening in Matt's testing. It might be worth disabling the
> workaround if the port has the surprise hotplug capability bit set.

Most of the systems I have are using downstream port containment which does
not recommend setting the Hot-Plug Surprise bit in Slot Capabilities & therefore
we do not. The first time we noticed an issue with this patch was in test
automation which was power cycling the endpoints & injecting uncorrectable
errors to ensure our hosts are robust in the face of PCIe chaos & that they
will recover. Later we started to see other teams on other products
encountering the same bug in simpler cases where humans turn on and off
EP power for development purposes. Although we are not using Hot-Plug
Surprise we are often triggering DPC on the Surprise Down Uncorrectable
Error when we hit this Gen1 issue.

On Wed, 7 Aug 2024 12:14:13 +0100, Maciej W. Rozycki wrote:
> For the quirk to trigger, the link has to be down and there has to be the
> LBMS Link Status bit set from link management events as per the PCIe spec
> while the link was previously up, and then both of that while rescanning
> the PCIe device in question, so there's a lot of conditions to meet.

If there is nothing clearing the bit then why is there any expectation that
it wouldn't be set? We have 20/30/40 endpoints in a host that can be hot-removed,
hot-added at any point in time in any combination & it's often the case that
the system uptime is hundreds of days. Eventually the bit will just become set
as a result of time and scale.

On Wed, 7 Aug 2024 12:14:13 +0100, Maciej W. Rozycki wrote:
> The reason for this is safety: it's better to have a link run at 2.5GT/s
> than not at all, and from the nature of the issue there is no guarantee
> that if you remove the speed clamp, then the link won't go back into the
> infinite retraining loop that the workaround is supposed to break.

I guess I don't really understand why it wouldn't be safe for every device
other than the ASMedia since they aren't the device that had the issue in the
first place. The main problem in my mind is the system doesn't actually have to
be retraining at all, it can occur the first time you replace a device or
power cycle the device or the first time the device goes into DPC & comes back.

If the quirk simply returned without doing anything when the ASMedia is not in the
allow-list it would make me more at ease. I can also imagine some other implementations
that would work well, but it doesn't seem correct that we could only give a single
opportunity to a device before forcing it to live at Gen1. Perhaps it should be
aware of the rate or a count or something... I can only assume that there will be
more systems that start to run into issues with the patch as they strive to keep
up with LTS & they exercise the hot-plug path.
On Thu, Aug 8, 2024 at 12:08 PM Matthew W Carlis <mattc@purestorage.com> wrote:
>
> On Wed, 7 Aug 2024 22:29:35 +1000, Oliver O'Halloran wrote:
> > My read was that Matt is essentially doing a surprise hot-unplug by
> > removing power to the card without notifying the OS. I thought the
> > LBMS bit wouldn't be set in that case since the link goes down rather
> > than changes speed, but the spec is a little vague and that appears to
> > be happening in Matt's testing. It might be worth disabling the
> > workaround if the port has the surprise hotplug capability bit set.
>
> Most of the systems I have are using downstream port containment which does
> not recommend setting the Hot-Plug Surprise bit in Slot Capabilities & therefore
> we do not. The first time we noticed an issue with this patch was in test
> automation which was power cycling the endpoints & injecting uncorrectable
> errors to ensure our hosts are robust in the face of PCIe chaos & that they
> will recover. Later we started to see other teams on other products
> encountering the same bug in simpler cases where humans turn on and off
> EP power for development purposes.

Ok? If we have to check for DPC being enabled in addition to checking
the surprise bit in the slot capabilities then that's fine, we can do
that. The question to be answered here is: how should this feature
work on ports where it's normal for a device to be removed without any
notice?
On Wed, 7 Aug 2024, Matthew W Carlis wrote:

> > For the quirk to trigger, the link has to be down and there has to be the
> > LBMS Link Status bit set from link management events as per the PCIe spec
> > while the link was previously up, and then both of that while rescanning
> > the PCIe device in question, so there's a lot of conditions to meet.
>
> If there is nothing clearing the bit then why is there any expectation that
> it wouldn't be set? We have 20/30/40 endpoints in a host that can be hot-removed,
> hot-added at any point in time in any combination & it's often the case that
> the system uptime is hundreds of days. Eventually the bit will just become set
> as a result of time and scale.

Well, in principle in a setup with reliable links the LBMS bit may never
be set, e.g. this system of mine has been in 24/7 operation since the last
reboot 410 days ago and for the devices that support Link Active reporting
it shows:

LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt-
LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk- DLActive+ BWMgmt+ ABWMgmt+
LnkSta: Speed 5GT/s, Width x1, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt+
LnkSta: Speed 5GT/s, Width x2, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt+
LnkSta: Speed 5GT/s, Width x1, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt+
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt+
LnkSta: Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
LnkSta: Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
LnkSta: Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
LnkSta: Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt-

so out of 11 devices 6 have the LBMS bit clear.  But then 5 have it set,
perhaps worryingly, so of course you're right, that it will get set in the
field, though it's not enough by itself for your problem to trigger.  Then
there is manual link retraining, which we do from time to time as well,
which will set the bit too, which I overlooked.

To cut the story short, it was an oversight of mine, especially as I don't
really make any use myself of hot plug scenarios.

> > The reason for this is safety: it's better to have a link run at 2.5GT/s
> > than not at all, and from the nature of the issue there is no guarantee
> > that if you remove the speed clamp, then the link won't go back into the
> > infinite retraining loop that the workaround is supposed to break.
>
> I guess I don't really understand why it wouldn't be safe for every device
> other than the ASMedia since they aren't the device that had the issue in the
> first place. The main problem in my mind is the system doesn't actually have to
> be retraining at all, it can occur the first time you replace a device or
> power cycle the device or the first time the device goes into DPC & comes back.
> If the quirk simply returned without doing anything when the ASMedia is not in the
> allow-list it would make me more at ease. I can also imagine some other implementations
> that would work well, but it doesn't seem correct that we could only give a single
> opportunity to a device before forcing it to live at Gen1. Perhaps it should be
> aware of the rate or a count or something...

It's a complex matter.  For a starter please read the introduction at
`pcie_failed_link_retrain'.

When the problem triggers the link goes into an infinite link training
loop, with the Link Training (LT) bit flipping on and off.  I've made a
complementary workaround for U-Boot (since your bootstrap device can be
downstream of such a broken link), where a statistical probe is done for
the LT bit flipping as discovered by polling the bit in a tight loop.
This is fine for a piece of firmware that has nothing better to do, but in
an OS kernel we just can't do it.

Also U-Boot does not remove the 2.5GT/s clamp because it does not have to
set up the system in an optimal way, but just sufficiently to successfully
boot.  An OS kernel can then adjust the configuration if it knows about
the workaround, or leave it as it is.  In the latter case things will
continue working across PCIe resets, because the TLS field is sticky.

For Linux I went for just checking the DLLLA bit as it seems good enough
for devices that support Link Active reporting.  And I admit that the
workaround is quite aggressive, but this was a deliberate decision
following the robustness principle: the end user may not be qualified
enough to diagnose the problem and given its nature not even think it's
something that can be made to work.  The downstream device does not show
up in the PCIe hierarchy and just looks like it's broken.  It takes a lot
of knowledge and experience to notice the LT bit flipping and even myself,
quite a seasoned Linux kernel developer, only noticed it by chance.

It was discussed to death with the original submission, the rationale is
given in the introductory comment and I'd prefer that it stays as it is.

The ASM2824 switch works just fine with other downstream devices and so
does the PI7C9X2G304 switch with other upstream devices.  Both vendors
refused to help narrow it down and declined to comment (a person from
Diodes née Pericom replied, but when they learnt it was about a PCIe
switch they told me they were from another department and couldn't help),
which would have helped (I now just recommend staying away from both
manufacturers).  Therefore my assumption is this can potentially trigger
with any pair of devices.

I have now posted a patch series to address this problem of yours and a
previous issue reported by Ilpo.  See:
<https://patchwork.kernel.org/project/linux-pci/list/?series=878216>, and
I've included you in the list of direct recipients too.  I will appreciate
it if you give it a try, and again, apologies for the trouble caused, but
such things happen in engineering.

  Maciej
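For illustration only, the statistical probe described for U-Boot boils
down to sampling the Link Training bit repeatedly and counting transitions;
a rough paraphrase in kernel style follows (example_lt_is_flipping() is
hypothetical, and the sample count and threshold are arbitrary, not taken
from U-Boot):

  #include <linux/pci.h>

  /*
   * Rough paraphrase of the "LT bit flipping" probe described above: read
   * the Link Status register in a tight loop and count Link Training (LT)
   * bit transitions.  A healthy, settled link shows none; a link stuck in
   * the training loop shows many.
   */
  static bool example_lt_is_flipping(struct pci_dev *bridge)
  {
          int i, flips = 0;
          u16 lnksta, prev;

          pcie_capability_read_word(bridge, PCI_EXP_LNKSTA, &prev);
          for (i = 0; i < 1000; i++) {
                  pcie_capability_read_word(bridge, PCI_EXP_LNKSTA, &lnksta);
                  if ((lnksta ^ prev) & PCI_EXP_LNKSTA_LT)
                          flips++;
                  prev = lnksta;
          }

          /* Threshold chosen arbitrarily for illustration. */
          return flips > 10;
  }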
Sorry for the delay in my responses here; I had some things get in my way.

On Fri, 9 Aug 2024 09:13:52, Oliver O'Halloran <oohall@gmail.com> wrote:
> Ok? If we have to check for DPC being enabled in addition to checking
> the surprise bit in the slot capabilities then that's fine, we can do
> that. The question to be answered here is: how should this feature
> work on ports where it's normal for a device to be removed without any
> notice?

I'm not sure if it's the correct thing to check however. I assumed that ports
using the pciehp driver would usually consider it "normal" for a device to be
removed actually, but maybe I have the idea of hot plug reversed.

On Fri, 9 Aug 2024 14:34:04, Maciej W. Rozycki <macro@orcam.me.uk> wrote:
> Well, in principle in a setup with reliable links the LBMS bit may never
> be set, e.g. this system of mine has been in 24/7 operation since the last
> reboot 410 days ago and for the devices that support Link Active reporting
> it shows:
> ...
> so out of 11 devices 6 have the LBMS bit clear.  But then 5 have it set,
> perhaps worryingly, so of course you're right, that it will get set in the
> field, though it's not enough by itself for your problem to trigger.

The way I look at it is that it's essentially a probability distribution with time,
but I try to avoid learning too much about the physical layer because I would find
myself debugging more hardware issues lol. I also don't think LBMS/LABS being set
by itself is very interesting without knowing the rate at which it is being set.

FWIW I have seen some devices in the past going into recovery state many times a
second & still never downtrain, but at the same time they were setting the
LBMS/LABS bits, which is maybe not quite spec compliant.

I would like to help test these changes, but I would like to avoid having to test
each mentioned change individually. Does anyone have any preferences in how I batch
the patches for testing? Would it be ok if I just pulled them all together in one go?

- Matt
On Thu, 15 Aug 2024, Matthew W Carlis wrote:

> > Well, in principle in a setup with reliable links the LBMS bit may never
> > be set, e.g. this system of mine has been in 24/7 operation since the last
> > reboot 410 days ago and for the devices that support Link Active reporting
> > it shows:
> > ...
> > so out of 11 devices 6 have the LBMS bit clear.  But then 5 have it set,
> > perhaps worryingly, so of course you're right, that it will get set in the
> > field, though it's not enough by itself for your problem to trigger.
>
> The way I look at it is that it's essentially a probability distribution with time,
> but I try to avoid learning too much about the physical layer because I would find
> myself debugging more hardware issues lol. I also don't think LBMS/LABS being set
> by itself is very interesting without knowing the rate at which it is being set.

Agreed.  Ilpo's upcoming bandwidth controller will hopefully give us such
data.

> FWIW I have seen some devices in the past going into recovery state many times a
> second & still never downtrain, but at the same time they were setting the
> LBMS/LABS bits, which is maybe not quite spec compliant.
>
> I would like to help test these changes, but I would like to avoid having to test
> each mentioned change individually. Does anyone have any preferences in how I batch
> the patches for testing? Would it be ok if I just pulled them all together in one go?

Certainly fine with me, especially as 3/4 and 4/4 aren't really related to
your failure scenario, and then you need 1/4 and 2/4 both at a time to
address both aspects of the issue you have reported.

  Maciej
I just wanted to follow up with our testing results for the mentioned
patches. It took me a while to get them running in our test pool, but
we just got it going yesterday and the initial results look really good.
We will continue running them in our testing from now on & if any issues
come up I'll try to report them, but otherwise I wanted to say thank you
for entertaining the discussion & reacting so quickly with new patches.

The patches we pulled into our testing:

[PATCH v3 1/4] PCI: Clear the LBMS bit after a link retrain
[PATCH v3 2/4] PCI: Revert to the original speed after PCIe failed link retraining

- Matt
On Tue, 1 Oct 2024, Matthew W Carlis wrote:

> I just wanted to follow up with our testing results for the mentioned
> patches. It took me a while to get them running in our test pool, but
> we just got it going yesterday and the initial results look really good.
> We will continue running them in our testing from now on & if any issues
> come up I'll try to report them, but otherwise I wanted to say thank you
> for entertaining the discussion & reacting so quickly with new patches.

My pleasure.  I'm glad that the solution works for you.  Let me know if
you need anything else.

  Maciej
On Wed, Oct 02, 2024 at 01:58:15PM +0100, Maciej W. Rozycki wrote:
> On Tue, 1 Oct 2024, Matthew W Carlis wrote:
>
> > I just wanted to follow up with our testing results for the mentioned
> > patches. It took me a while to get them running in our test pool, but
> > we just got it going yesterday and the initial results look really good.
> > We will continue running them in our testing from now on & if any issues
> > come up I'll try to report them, but otherwise I wanted to say thank you
> > for entertaining the discussion & reacting so quickly with new patches.
>
> My pleasure.  I'm glad that the solution works for you.  Let me know if
> you need anything else.

If there's anything missing that still needs to be added to v6.13-rc1,
can somebody repost those?  I lost track of what's still outstanding.

Bjorn
On Wed, 2 Oct 2024, Bjorn Helgaas wrote:

> If there's anything missing that still needs to be added to v6.13-rc1,
> can somebody repost those?  I lost track of what's still outstanding.

I have nothing outstanding to add right away.  Thank you for asking.

  Maciej