Message ID | 1554919668-24873-1-git-send-email-spencer@mellanox.com (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
Headers | show |
Series | PCI: pciehp: Fix false command timeouts on boot | expand |
[cc += Mika, Sergey, Keith] On Wed, Apr 10, 2019 at 02:07:48PM -0400, Spencer Lingard wrote: > During command writes, pcie_poll_cmd() is invoked if > Command Completed notifications are disabled. When polling, > if Command Completed is set, the bit is reset and success is returned, > but ctrl->cmd_busy is not set back to 0. The next command write > then attempts to wait on a command that has already been completed, > timing out after 2 seconds. This delay occurs more frequently at > boot time, since pcie_init() disables notifications when powering > down empty slots. > > Clear cmd_busy upon successful command completion during > pcie_poll_cmd(). > > Signed-off-by: Spencer Lingard <spencer@mellanox.com> I suppose this can happen if a write to the Slot Control register is performed while HPIE and/or CCIE is disabled, the two notifications are subsequently enabled and another write to the Slot Control is performed. That second write will then call wait_event_timeout() because of the stale ctrl->cmd_busy == 1, but the Command Complete notification has already happened and was cleared by pcie_poll_cmd(), hence it times out. Sounds reasonable, I'm a little suprised though that I've never seen this myself. I guess we've been doing this wrong for years, so: Cc: stable@vger.kernel.org Reviewed-by: Lukas Wunner <lukas@wunner.de> Thanks, Lukas > --- > drivers/pci/hotplug/pciehp_hpc.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c > index 6a2365c..28c70cf 100644 > --- a/drivers/pci/hotplug/pciehp_hpc.c > +++ b/drivers/pci/hotplug/pciehp_hpc.c > @@ -77,6 +77,7 @@ static int pcie_poll_cmd(struct controller *ctrl, int timeout) > if (slot_status & PCI_EXP_SLTSTA_CC) { > pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, > PCI_EXP_SLTSTA_CC); > + ctrl->cmd_busy = 0; > return 1; > } > if (timeout < 0) > -- > 2.1.2
On Sun, Apr 14, 2019 at 09:36:41PM +0200, Lukas Wunner wrote: > I suppose this can happen if a write to the Slot Control register is > performed while HPIE and/or CCIE is disabled, the two notifications > are subsequently enabled and another write to the Slot Control is > performed. That second write will then call wait_event_timeout() > because of the stale ctrl->cmd_busy == 1, but the Command Complete > notification has already happened and was cleared by pcie_poll_cmd(), > hence it times out. > > Sounds reasonable, I'm a little suprised though that I've never seen > this myself. I guess we've been doing this wrong for years, so: On second thought, it's not surprising at all that I never saw this because Thunderbolt sets NoCompl+, so doesn't use Command Complete notifications. I suspect that even though we've been doing this wrong for a long time, the bug was exposed by a recent change to pciehp. Do you happen to know since which kernel version or commit you've been witnessing the timeouts? Thanks, Lukas
On 4/14/2019 10:59 PM, Lukas Wunner wrote: > On Sun, Apr 14, 2019 at 09:36:41PM +0200, Lukas Wunner wrote: >> I suppose this can happen if a write to the Slot Control register is >> performed while HPIE and/or CCIE is disabled, the two notifications >> are subsequently enabled and another write to the Slot Control is >> performed. That second write will then call wait_event_timeout() >> because of the stale ctrl->cmd_busy == 1, but the Command Complete >> notification has already happened and was cleared by pcie_poll_cmd(), >> hence it times out. >> >> Sounds reasonable, I'm a little suprised though that I've never seen >> this myself. I guess we've been doing this wrong for years, so: > On second thought, it's not surprising at all that I never saw this > because Thunderbolt sets NoCompl+, so doesn't use Command Complete > notifications. > > I suspect that even though we've been doing this wrong for a long time, > the bug was exposed by a recent change to pciehp. Do you happen to > know since which kernel version or commit you've been witnessing the > timeouts? Hi Lukas, thank you for your time. We started seeing these timeouts when we went to 4.20.5 from 4.14.61. In pcie_init(), there's a check that turns off a slot if it's powered on but unoccupied. Before 4e6a13356f1c ("PCI: pciehp: Deduplicate presence check on probe & resume"), that power check was at the end of pcie_probe(), after the IRQ is requested. I've investigated a little and found that the delays go away if the power check is moved back where it was before that commit. - Spencer
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index 6a2365c..28c70cf 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -77,6 +77,7 @@ static int pcie_poll_cmd(struct controller *ctrl, int timeout) if (slot_status & PCI_EXP_SLTSTA_CC) { pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, PCI_EXP_SLTSTA_CC); + ctrl->cmd_busy = 0; return 1; } if (timeout < 0)
During command writes, pcie_poll_cmd() is invoked if Command Completed notifications are disabled. When polling, if Command Completed is set, the bit is reset and success is returned, but ctrl->cmd_busy is not set back to 0. The next command write then attempts to wait on a command that has already been completed, timing out after 2 seconds. This delay occurs more frequently at boot time, since pcie_init() disables notifications when powering down empty slots. Clear cmd_busy upon successful command completion during pcie_poll_cmd(). Signed-off-by: Spencer Lingard <spencer@mellanox.com> Cc: Lukas Wunner <lukas@wunner.de> --- drivers/pci/hotplug/pciehp_hpc.c | 1 + 1 file changed, 1 insertion(+)