PCI: pciehp: Fix false command timeouts on boot

Message ID	1554919668-24873-1-git-send-email-spencer@mellanox.com (mailing list archive)
State	New, archived
Delegated to:	Bjorn Helgaas
Headers	show Return-Path: <linux-pci-owner@kernel.org> From: Spencer Lingard <spencer@mellanox.com> To: lukas@wunner.de Cc: linux-pci@vger.kernel.org, Spencer Lingard <spencer@mellanox.com> Subject: [PATCH] PCI: pciehp: Fix false command timeouts on boot Date: Wed, 10 Apr 2019 14:07:48 -0400 Message-Id: <1554919668-24873-1-git-send-email-spencer@mellanox.com> Sender: linux-pci-owner@vger.kernel.org Precedence: bulk
Series	PCI: pciehp: Fix false command timeouts on boot \| expand PCI: pciehp: Fix false command timeouts on boot

Message ID

1554919668-24873-1-git-send-email-spencer@mellanox.com (mailing list archive)

State

New, archived

Delegated to:

Bjorn Helgaas

Headers

From: Spencer Lingard <spencer@mellanox.com>
To: lukas@wunner.de
Cc: linux-pci@vger.kernel.org, Spencer Lingard <spencer@mellanox.com>
Subject: [PATCH] PCI: pciehp: Fix false command timeouts on boot
Date: Wed, 10 Apr 2019 14:07:48 -0400
Message-Id: <1554919668-24873-1-git-send-email-spencer@mellanox.com>
Sender: linux-pci-owner@vger.kernel.org
Precedence: bulk

Series

PCI: pciehp: Fix false command timeouts on boot | expand

Commit Message

Spencer Lingard April 10, 2019, 6:07 p.m. UTC

During command writes, pcie_poll_cmd() is invoked if
Command Completed notifications are disabled. When polling,
if Command Completed is set, the bit is reset and success is returned,
but ctrl->cmd_busy is not set back to 0. The next command write
then attempts to wait on a command that has already been completed,
timing out after 2 seconds. This delay occurs more frequently at
boot time, since pcie_init() disables notifications when powering
down empty slots.

Clear cmd_busy upon successful command completion during
pcie_poll_cmd().

Signed-off-by: Spencer Lingard <spencer@mellanox.com>
Cc: Lukas Wunner <lukas@wunner.de>
---
 drivers/pci/hotplug/pciehp_hpc.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Lukas Wunner April 14, 2019, 7:36 p.m. UTC | #1

[cc += Mika, Sergey, Keith]

On Wed, Apr 10, 2019 at 02:07:48PM -0400, Spencer Lingard wrote:
> During command writes, pcie_poll_cmd() is invoked if
> Command Completed notifications are disabled. When polling,
> if Command Completed is set, the bit is reset and success is returned,
> but ctrl->cmd_busy is not set back to 0. The next command write
> then attempts to wait on a command that has already been completed,
> timing out after 2 seconds. This delay occurs more frequently at
> boot time, since pcie_init() disables notifications when powering
> down empty slots.
> 
> Clear cmd_busy upon successful command completion during
> pcie_poll_cmd().
> 
> Signed-off-by: Spencer Lingard <spencer@mellanox.com>

I suppose this can happen if a write to the Slot Control register is
performed while HPIE and/or CCIE is disabled, the two notifications
are subsequently enabled and another write to the Slot Control is
performed.  That second write will then call wait_event_timeout()
because of the stale ctrl->cmd_busy == 1, but the Command Complete
notification has already happened and was cleared by pcie_poll_cmd(),
hence it times out.

Sounds reasonable, I'm a little suprised though that I've never seen
this myself.  I guess we've been doing this wrong for years, so:

Cc: stable@vger.kernel.org
Reviewed-by: Lukas Wunner <lukas@wunner.de>

Thanks,

Lukas

> ---
>  drivers/pci/hotplug/pciehp_hpc.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
> index 6a2365c..28c70cf 100644
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -77,6 +77,7 @@ static int pcie_poll_cmd(struct controller *ctrl, int timeout)
>  		if (slot_status & PCI_EXP_SLTSTA_CC) {
>  			pcie_capability_write_word(pdev, PCI_EXP_SLTSTA,
>  						   PCI_EXP_SLTSTA_CC);
> +			ctrl->cmd_busy = 0;
>  			return 1;
>  		}
>  		if (timeout < 0)
> -- 
> 2.1.2

Lukas Wunner April 15, 2019, 2:59 a.m. UTC | #2

On Sun, Apr 14, 2019 at 09:36:41PM +0200, Lukas Wunner wrote:
> I suppose this can happen if a write to the Slot Control register is
> performed while HPIE and/or CCIE is disabled, the two notifications
> are subsequently enabled and another write to the Slot Control is
> performed.  That second write will then call wait_event_timeout()
> because of the stale ctrl->cmd_busy == 1, but the Command Complete
> notification has already happened and was cleared by pcie_poll_cmd(),
> hence it times out.
> 
> Sounds reasonable, I'm a little suprised though that I've never seen
> this myself.  I guess we've been doing this wrong for years, so:

On second thought, it's not surprising at all that I never saw this
because Thunderbolt sets NoCompl+, so doesn't use Command Complete
notifications.

I suspect that even though we've been doing this wrong for a long time,
the bug was exposed by a recent change to pciehp.  Do you happen to
know since which kernel version or commit you've been witnessing the
timeouts?

Thanks,

Lukas

Spencer Lingard April 16, 2019, 6:32 p.m. UTC | #3

On 4/14/2019 10:59 PM, Lukas Wunner wrote:
> On Sun, Apr 14, 2019 at 09:36:41PM +0200, Lukas Wunner wrote:
>> I suppose this can happen if a write to the Slot Control register is
>> performed while HPIE and/or CCIE is disabled, the two notifications
>> are subsequently enabled and another write to the Slot Control is
>> performed.  That second write will then call wait_event_timeout()
>> because of the stale ctrl->cmd_busy == 1, but the Command Complete
>> notification has already happened and was cleared by pcie_poll_cmd(),
>> hence it times out.
>>
>> Sounds reasonable, I'm a little suprised though that I've never seen
>> this myself.  I guess we've been doing this wrong for years, so:
> On second thought, it's not surprising at all that I never saw this
> because Thunderbolt sets NoCompl+, so doesn't use Command Complete
> notifications.
>
> I suspect that even though we've been doing this wrong for a long time,
> the bug was exposed by a recent change to pciehp.  Do you happen to
> know since which kernel version or commit you've been witnessing the
> timeouts?

Hi Lukas, thank you for your time.

We started seeing these timeouts when we went to 4.20.5 from 4.14.61.

In pcie_init(), there's a check that turns off a slot if it's powered on
but unoccupied. Before 4e6a13356f1c ("PCI: pciehp: Deduplicate presence
check on probe & resume"), that power check was at the end of
pcie_probe(), after the IRQ is requested. I've investigated a little and
found that the delays go away if the power check is moved back where it
was before that commit.

- Spencer

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 6a2365c..28c70cf 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -77,6 +77,7 @@  static int pcie_poll_cmd(struct controller *ctrl, int timeout)
 		if (slot_status & PCI_EXP_SLTSTA_CC) {
 			pcie_capability_write_word(pdev, PCI_EXP_SLTSTA,
 						   PCI_EXP_SLTSTA_CC);
+			ctrl->cmd_busy = 0;
 			return 1;
 		}
 		if (timeout < 0)

PCI: pciehp: Fix false command timeouts on boot

Commit Message

Comments

Patch