Message ID | 20240202134108.4096-1-ilpo.jarvinen@linux.intel.com (mailing list archive) |
---|---|
State | Superseded, archived |
Delegated to: | Bjorn Helgaas |
Series | [1/1] PCI: Cleanup link activation wait logic |
On Fri, 2 Feb 2024, Ilpo Järvinen wrote:

> 1. Change pcie_failed_link_retrain() to return true only if link was
>    retrained successfully due to the Target Speed quirk. If there is no
>    LBMS set, return false instead of true because no retraining was
>    even attempted. This seems correct considering expectations of both
>    callers of pcie_failed_link_retrain().

You change the logic here in that the second conditional isn't run if the
first has not. This is wrong, unclamping is not supposed to rely on LBMS.
It is supposed to be always run and any failure has to be reported too, as
a retraining error. I'll send an update according to what I have outlined
before then, with some testing done first.

> 2. Handle link-was-not-retrained-successfully return (false) from
>    pcie_failed_link_retrain() properly in pcie_wait_for_link_delay() by
>    directly returning false.

I think it has to be a separate change, as a fix to what I can see is an
issue with a three-way-merge done with commit 1abb47390350 ("Merge branch
'pci/enumeration'"); surely a bool result wasn't supposed to be assigned
to an int variable carrying an error code.

Either or both changes may have to be backported independently.

  Maciej

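The bool-into-int concern raised above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct fake_link`, `fake_wait_for_link_status()`, `fake_failed_link_retrain()`, `wait_merged()` and `wait_patched()` are hypothetical stand-ins, not kernel functions, and only the `active` branch of pcie_wait_for_link_delay() is modelled, with the sleeps left out. It assumes the wait helper returns 0 or a negative errno and the retrain helper returns true on success.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the device state; not a kernel structure. */
struct fake_link {
	int wait_rc;		/* result of the link status wait: 0 or -errno */
	bool retrain_ok;	/* whether the retrain quirk would succeed */
};

/* Stand-in for the wait helper: returns an int error code. */
static int fake_wait_for_link_status(const struct fake_link *l)
{
	return l->wait_rc;
}

/* Stand-in for the quirk: returns a bool, true meaning success. */
static bool fake_failed_link_retrain(const struct fake_link *l)
{
	return l->retrain_ok;
}

/* Shape of the lines the patch removes: a bool lands in the int. */
static bool wait_merged(const struct fake_link *l)
{
	int rc = fake_wait_for_link_status(l);

	if (rc)
		rc = fake_failed_link_retrain(l);	/* true becomes 1 */
	if (rc)
		return false;	/* taken when the retrain succeeded */
	return true;
}

/* Shape of the line the patch adds. */
static bool wait_patched(const struct fake_link *l)
{
	int rc = fake_wait_for_link_status(l);

	if (rc < 0 && !fake_failed_link_retrain(l))
		return false;
	return true;
}

/* Walk the three interesting cases through both variants. */
static void check_examples(void)
{
	struct fake_link recovered = { .wait_rc = -ETIMEDOUT, .retrain_ok = true };
	struct fake_link dead = { .wait_rc = -ETIMEDOUT, .retrain_ok = false };
	struct fake_link fine = { .wait_rc = 0, .retrain_ok = false };

	assert(!wait_merged(&recovered));	/* success reported as failure */
	assert(wait_merged(&dead));		/* failure reported as success */
	assert(wait_patched(&recovered));
	assert(!wait_patched(&dead));
	assert(wait_merged(&fine) && wait_patched(&fine));
}
```

In this model the merged variant inverts the outcome whenever the initial wait fails, while both variants agree when the link comes up on its own.
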
On Fri, 2 Feb 2024, Maciej W. Rozycki wrote:

> On Fri, 2 Feb 2024, Ilpo Järvinen wrote:
>
> > 1. Change pcie_failed_link_retrain() to return true only if link was
> >    retrained successfully due to the Target Speed quirk. If there is no
> >    LBMS set, return false instead of true because no retraining was
> >    even attempted. This seems correct considering expectations of both
> >    callers of pcie_failed_link_retrain().
>
> You change the logic here in that the second conditional isn't run if the
> first has not. This is wrong, unclamping is not supposed to rely on LBMS.
> It is supposed to be always run and any failure has to be reported too, as
> a retraining error. I'll send an update according to what I have outlined
> before then, with some testing done first.

Oh I see now, I'm sorry, I didn't read all the way to the last paragraph
of the commit message because the earlier one in the commit message hinted
the restriction is removed afterwards so I thought it was only linked to
the first part of the quirk.

> > 2. Handle link-was-not-retrained-successfully return (false) from
> >    pcie_failed_link_retrain() properly in pcie_wait_for_link_delay() by
> >    directly returning false.
>
> I think it has to be a separate change, as a fix to what I can see is an
> issue with a three-way-merge done with commit 1abb47390350 ("Merge branch
> 'pci/enumeration'"); surely a bool result wasn't supposed to be assigned
> to an int variable carrying an error code.
>
> Either or both changes may have to be backported independently.

But can it be? Won't the intermediate state cause more breakage? (although
that obviously can only hit some very unfortunate bisecter so perhaps not
a big deal because one would need many holes to align, the biggest being
the link has to fail training which is rare to begin with).

On Fri, 2 Feb 2024, Ilpo Järvinen wrote:

> > You change the logic here in that the second conditional isn't run if the
> > first has not. This is wrong, unclamping is not supposed to rely on LBMS.
> > It is supposed to be always run and any failure has to be reported too, as
> > a retraining error. I'll send an update according to what I have outlined
> > before then, with some testing done first.
>
> Oh I see now, I'm sorry, I didn't read all the way to the last paragraph
> of the commit message because the earlier one in the commit message hinted
> the restriction is removed afterwards so I thought it was only linked to
> the first part of the quirk.

No worries. I have submitted the changes now:
<https://lore.kernel.org/r/alpine.DEB.2.21.2402092125070.2376@angie.orcam.me.uk/>.

Unfortunately due to a sudden hardware failure I wasn't able to do any
non-trivial verification, as explained in the cover letter. I'll try to
get back to it as soon as reasonably possible, hopefully next month.

> > I think it has to be a separate change, as a fix to what I can see is an
> > issue with a three-way-merge done with commit 1abb47390350 ("Merge branch
> > 'pci/enumeration'"); surely a bool result wasn't supposed to be assigned
> > to an int variable carrying an error code.
> >
> > Either or both changes may have to be backported independently.
>
> But can it be? Won't the intermediate state cause more breakage? (although
> that obviously can only hit some very unfortunate bisecter so perhaps not
> a big deal because one would need many holes to align, the biggest being
> the link has to fail training which is rare to begin with).

I decided to do it differently after all, still with two changes. There
doesn't have to be any intermediate state, and returning failure where no
retraining has been done is no different from the case where a failure was
caused by the lack of hardware features required, such as link active
reporting, so if callers work without this update, then so they will with
it in place (i.e. no regression).

  Maciej

On Fri, 2 Feb 2024, Maciej W. Rozycki wrote:

> On Fri, 2 Feb 2024, Ilpo Järvinen wrote:
>
> > 1. Change pcie_failed_link_retrain() to return true only if link was
> >    retrained successfully due to the Target Speed quirk. If there is no
> >    LBMS set, return false instead of true because no retraining was
> >    even attempted. This seems correct considering expectations of both
> >    callers of pcie_failed_link_retrain().
>
> You change the logic here in that the second conditional isn't run if the
> first has not. This is wrong, unclamping is not supposed to rely on LBMS.
> It is supposed to be always run and any failure has to be reported too, as
> a retraining error.

Now that (I think) I fully understand the intent of the second
condition/block one additional question occurred to me.

How is the 2nd condition even supposed to work in the current place when
firmware has pre-arranged the 2.5GT/s restriction? Wouldn't the link come
up fine in that case and the quirk code is not called at all since the
link came up successfully?

Yet another thing in this quirk code I don't like is how it can leave the
target speed at 2.5GT/s when the quirk fails to get the link working
(which actually does happen in the disconnection cases because DLLLA won't
be set so the target speed will not be restored).

On Fri, 16 Feb 2024, Ilpo Järvinen wrote:

> > You change the logic here in that the second conditional isn't run if the
> > first has not. This is wrong, unclamping is not supposed to rely on LBMS.
> > It is supposed to be always run and any failure has to be reported too, as
> > a retraining error.
>
> Now that (I think) I fully understand the intent of the second
> condition/block one additional question occurred to me.
>
> How is the 2nd condition even supposed to work in the current place when
> firmware has pre-arranged the 2.5GT/s restriction? Wouldn't the link come
> up fine in that case and the quirk code is not called at all since the
> link came up successfully?

The quirk is called unconditionally from `pci_device_add', so an attempt
to unclamp will always happen with a working link for qualifying devices.

> Yet another thing in this quirk code I don't like is how it can leave the
> target speed at 2.5GT/s when the quirk fails to get the link working
> (which actually does happen in the disconnection cases because DLLLA won't
> be set so the target speed will not be restored).

I chose to leave the target speed at the most recent setting, because the
link doesn't work in that case anyway, so I concluded it doesn't matter,
but reduces messing with the device; technically you should retrain again
afterwards. I'm not opposed to changing this if you have a use case.

  Maciej

On Fri, 16 Feb 2024, Maciej W. Rozycki wrote:

> On Fri, 16 Feb 2024, Ilpo Järvinen wrote:
>
> > > You change the logic here in that the second conditional isn't run if the
> > > first has not. This is wrong, unclamping is not supposed to rely on LBMS.
> > > It is supposed to be always run and any failure has to be reported too, as
> > > a retraining error.
> >
> > Now that (I think) I fully understand the intent of the second
> > condition/block one additional question occurred to me.
> >
> > How is the 2nd condition even supposed to work in the current place when
> > firmware has pre-arranged the 2.5GT/s restriction? Wouldn't the link come
> > up fine in that case and the quirk code is not called at all since the
> > link came up successfully?
>
> The quirk is called unconditionally from `pci_device_add', so an attempt
> to unclamp will always happen with a working link for qualifying devices.

Ah, thanks. I'd stared at the other two calls enough times that I'd
forgotten the 3rd one even existed.

> > Yet another thing in this quirk code I don't like is how it can leave the
> > target speed at 2.5GT/s when the quirk fails to get the link working
> > (which actually does happen in the disconnection cases because DLLLA won't
> > be set so the target speed will not be restored).
>
> I chose to leave the target speed at the most recent setting, because the
> link doesn't work in that case anyway, so I concluded it doesn't matter,
> but reduces messing with the device; technically you should retrain again
> afterwards. I'm not opposed to changing this if you have a use case.

It remains suboptimally set in a case where something is plugged again
into that port, for Thunderbolt it doesn't matter as the PCIe speed picked
is quite bogus anyway, but disconnect then plug something again is not
limited to Thunderbolt.

I've no immediate plans on changing it now but it may become relevant when
attempting to make the bandwidth controller trigger the quirk. To me there
are two quirks, not just one, so I might have to split them to make it
better suited for triggering them from bwctrl.

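The "two quirks" view can be sketched as two independent predicates over the register values. This is a minimal userspace illustration, not the kernel code: the helper names `needs_2_5gt_retrain()` and `may_unclamp()` are hypothetical, `id_matches` stands in for the `pci_match_id(ids, dev)` check, and the bit values are copied from the kernel's `include/uapi/linux/pci_regs.h`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit values as defined in include/uapi/linux/pci_regs.h. */
#define PCI_EXP_LNKSTA_DLLLA		0x2000	/* Data Link Layer Link Active */
#define PCI_EXP_LNKSTA_LBMS		0x4000	/* Link Bandwidth Management Status */
#define PCI_EXP_LNKCTL2_TLS		0x000f	/* Target Link Speed field */
#define PCI_EXP_LNKCTL2_TLS_2_5GT	0x0001	/* Target speed 2.5GT/s */

/*
 * Hypothetical helper for the first half: the link is down (DLLLA clear)
 * but LBMS is set, so a retrain clamped to 2.5GT/s would be attempted.
 */
static bool needs_2_5gt_retrain(uint16_t lnksta)
{
	return (lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
	       PCI_EXP_LNKSTA_LBMS;
}

/*
 * Hypothetical helper for the second half: the link is up, the target
 * speed is clamped to 2.5GT/s and the device is on the ID list, so the
 * clamp may be lifted. Note that LBMS plays no part here.
 */
static bool may_unclamp(uint16_t lnksta, uint16_t lnkctl2, bool id_matches)
{
	return (lnksta & PCI_EXP_LNKSTA_DLLLA) &&
	       (lnkctl2 & PCI_EXP_LNKCTL2_TLS) == PCI_EXP_LNKCTL2_TLS_2_5GT &&
	       id_matches;
}

/* A few register snapshots run through both predicates. */
static void check_predicates(void)
{
	assert(needs_2_5gt_retrain(PCI_EXP_LNKSTA_LBMS));
	assert(!needs_2_5gt_retrain(PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA));
	assert(!needs_2_5gt_retrain(0));

	/* Firmware pre-clamped link that is already up, LBMS clear. */
	assert(may_unclamp(PCI_EXP_LNKSTA_DLLLA, PCI_EXP_LNKCTL2_TLS_2_5GT, true));
	assert(!may_unclamp(0, PCI_EXP_LNKCTL2_TLS_2_5GT, true));
	assert(!may_unclamp(PCI_EXP_LNKSTA_DLLLA, 0x0003, true));
}
```

Keeping the predicates separate makes the earlier point in the thread visible: the unclamp check can hold on a working link where LBMS was never set.
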
On Fri, 16 Feb 2024, Ilpo Järvinen wrote:

> > > Yet another thing in this quirk code I don't like is how it can leave the
> > > target speed at 2.5GT/s when the quirk fails to get the link working
> > > (which actually does happen in the disconnection cases because DLLLA won't
> > > be set so the target speed will not be restored).
> >
> > I chose to leave the target speed at the most recent setting, because the
> > link doesn't work in that case anyway, so I concluded it doesn't matter,
> > but reduces messing with the device; technically you should retrain again
> > afterwards. I'm not opposed to changing this if you have a use case.
>
> It remains suboptimally set in a case where something is plugged again
> into that port, for Thunderbolt it doesn't matter as the PCIe speed picked
> is quite bogus anyway, but disconnect then plug something again is not
> limited to Thunderbolt.

Sure, my understanding has been all PCIe option devices are supposed to
be hot-pluggable, at least those in the regular form factor (which is why
PCIe edge connector contacts have varying lengths, unlike conventional
PCI).

  Maciej

On Sat, 10 Feb 2024, Maciej W. Rozycki wrote:

> > > You change the logic here in that the second conditional isn't run if the
> > > first has not. This is wrong, unclamping is not supposed to rely on LBMS.
> > > It is supposed to be always run and any failure has to be reported too, as
> > > a retraining error. I'll send an update according to what I have outlined
> > > before then, with some testing done first.
> >
> > Oh I see now, I'm sorry, I didn't read all the way to the last paragraph
> > of the commit message because the earlier one in the commit message hinted
> > the restriction is removed afterwards so I thought it was only linked to
> > the first part of the quirk.
>
> No worries. I have submitted the changes now:
> <https://lore.kernel.org/r/alpine.DEB.2.21.2402092125070.2376@angie.orcam.me.uk/>.
>
> Unfortunately due to a sudden hardware failure I wasn't able to do any
> non-trivial verification, as explained in the cover letter. I'll try to
> get back to it as soon as reasonably possible, hopefully next month.

I have the failure sorted now, simply reseating one of the connections
made all the option modules work again as previously. I'm not impressed
by this lack of reliability, but maybe it was just bad luck, as things
used to work just fine for over 2 years before I replaced the mainboard.

I'll see if I can do some verification this week.

  Maciej

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index d8f11a078924..ca4159472a72 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5068,9 +5068,7 @@ static bool pcie_wait_for_link_delay(struct pci_dev *pdev, bool active,
 		msleep(20);
 	rc = pcie_wait_for_link_status(pdev, false, active);
 	if (active) {
-		if (rc)
-			rc = pcie_failed_link_retrain(pdev);
-		if (rc)
+		if (rc < 0 && !pcie_failed_link_retrain(pdev))
 			return false;
 
 		msleep(delay);
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index efb2ddbff115..e729157be95d 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -88,24 +88,25 @@ bool pcie_failed_link_retrain(struct pci_dev *dev)
 	    !pcie_cap_has_lnkctl2(dev) || !dev->link_active_reporting)
 		return false;
 
-	pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
 	pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
-	if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
-	    PCI_EXP_LNKSTA_LBMS) {
-		pci_info(dev, "broken device, retraining non-functional downstream link at 2.5GT/s\n");
+	if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) !=
+	    PCI_EXP_LNKSTA_LBMS)
+		return false;
 
-		lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
-		lnkctl2 |= PCI_EXP_LNKCTL2_TLS_2_5GT;
-		pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
+	pci_info(dev, "broken device, retraining non-functional downstream link at 2.5GT/s\n");
 
-		if (pcie_retrain_link(dev, false)) {
-			pci_info(dev, "retraining failed\n");
-			return false;
-		}
+	pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
+	lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
+	lnkctl2 |= PCI_EXP_LNKCTL2_TLS_2_5GT;
+	pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
 
-		pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
+	if (pcie_retrain_link(dev, false)) {
+		pci_info(dev, "retraining failed\n");
+		return false;
 	}
 
+	pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
+
 	if ((lnksta & PCI_EXP_LNKSTA_DLLLA) &&
 	    (lnkctl2 & PCI_EXP_LNKCTL2_TLS) == PCI_EXP_LNKCTL2_TLS_2_5GT &&
 	    pci_match_id(ids, dev)) {
The combination of logic in pcie_failed_link_retrain() and
pcie_wait_for_link_delay() is hard to track and does not really make
sense in some cases. To clean up the logic mess:

1. Change pcie_failed_link_retrain() to return true only if link was
   retrained successfully due to the Target Speed quirk. If there is no
   LBMS set, return false instead of true because no retraining was
   even attempted. This seems correct considering expectations of both
   callers of pcie_failed_link_retrain().

2. Handle link-was-not-retrained-successfully return (false) from
   pcie_failed_link_retrain() properly in pcie_wait_for_link_delay() by
   directly returning false.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
---
 drivers/pci/pci.c    |  4 +---
 drivers/pci/quirks.c | 25 +++++++++++++------------
 2 files changed, 14 insertions(+), 15 deletions(-)