[v4,0/3] Improve PCI device post-reset readiness polling

Message ID 20200307172044.29645-1-stanspas@amazon.com (mailing list archive)

Stanislav Spassov March 7, 2020, 5:20 p.m. UTC
From: Stanislav Spassov <stanspas@amazon.de>

The first version of this patch series can be found here:
https://lore.kernel.org/linux-pci/20200223122057.6504-1-stanspas@amazon.com

The goal of this patch series is to solve an issue where pci_dev_wait
can cause system crashes. After a reset, a hung device may keep
responding with CRS completions indefinitely. If CRS Software Visibility
is enabled on the Root Port, attempting to read any register other than
PCI_VENDOR_ID will cause the Root Port to autonomously retry the request
without reporting back to the CPU core. Unless the number of retries or
the amount of time spent retrying is limited by platform-specific means,
this scenario leads to low-level platform timeouts (such as a TOR
Timeout), which can easily escalate to a crash.
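
As an illustration only (this is not code from the series), the idea of
polling PCI_VENDOR_ID with a bounded budget can be sketched in user-space C.
The sketch assumes a hypothetical read callback in place of a real config
read, and simulates the delay instead of sleeping. The only hardware fact it
relies on is that, with CRS Software Visibility enabled, a Vendor ID read
completed with CRS returns the reserved value 0x0001. The names
poll_vendor_id, fake_dev and fake_read_vendor are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Reserved Vendor ID reported for a CRS completion when CRS Software
 * Visibility is enabled. */
#define CRS_VENDOR_ID 0x0001u

/* Hypothetical stand-in for a config-space read of the Vendor ID. */
typedef uint16_t (*read_vendor_fn)(void *ctx);

/*
 * Poll the Vendor ID with exponential backoff until the device stops
 * answering with CRS or the budget is used up. Returns true if CRS
 * stopped in time. *waited_ms receives the total (simulated) delay.
 */
static bool poll_vendor_id(read_vendor_fn read_vendor, void *ctx,
                           int timeout_ms, int *waited_ms)
{
    int delay = 1;
    int waited = 0;

    assert(read_vendor != NULL && waited_ms != NULL);

    while (read_vendor(ctx) == CRS_VENDOR_ID) {
        if (waited >= timeout_ms) {
            *waited_ms = waited;
            return false;   /* give up instead of retrying forever */
        }
        /* Real code would sleep for 'delay' ms here. */
        waited += delay;
        delay *= 2;
    }

    *waited_ms = waited;
    return true;
}

/* Simulated device: answers with CRS for the first crs_reads reads. */
struct fake_dev {
    int crs_reads;
    uint16_t vendor;    /* arbitrary ID returned once CRS stops */
};

static uint16_t fake_read_vendor(void *ctx)
{
    struct fake_dev *d = ctx;

    if (d->crs_reads > 0) {
        d->crs_reads--;
        return CRS_VENDOR_ID;
    }
    return d->vendor;
}
```

The point of the sketch is that every retry is visible to software and
counted against the budget, so a device that never stops returning CRS ends
in an ordinary timeout rather than an unbounded hardware retry.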

Feedback on v1 inspired many additional improvements to the device reset
codepaths, including reduced post-reset delays. These improvements were
published as part of v2 (v3 contains only small build fixes).

It looks like there is immediate demand specifically for the CRS work,
so I am once again reducing the series to just that. The rest will be
posted as a separate patch series that will likely require more time and
iterations to stabilize.

Changes since v3:
- In pci_dev_wait(), added "timeout -= waited" to account for the time spent
  polling PCI_VENDOR_ID before falling back to polling PCI_COMMAND if
  device readiness could not be positively established via CRS (i.e.,
  if we stopped receiving CRS completions but did not receive a valid
  vendor ID due to dealing with an SR-IOV VF, or due to a different error)
- Simplified the commit message of "PCI: Add CRS handling to pci_dev_wait()"
  to avoid confusion as to when Root Ports will autonomously retry requests
  that resulted in CRS completions.
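
The budget accounting described in the first item above can be sketched as
follows. This is a simplified user-space illustration, not the code from the
patch: the two polling phases are passed in as hypothetical callbacks, and
the names wait_two_phase, phase_result and fake_ctx are invented for the
example. It only shows how "timeout -= waited" keeps the total wait within
the original timeout when the first phase cannot establish readiness.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of one polling phase. */
struct phase_result {
    bool ready;
    int waited_ms;
};

typedef struct phase_result (*poll_phase_fn)(int timeout_ms, void *ctx);

/* Returns 0 once the device is considered ready, -1 on timeout. */
static int wait_two_phase(poll_phase_fn poll_vendor,
                          poll_phase_fn poll_command,
                          void *ctx, int timeout_ms)
{
    struct phase_result r;

    assert(poll_vendor != NULL && poll_command != NULL);

    r = poll_vendor(timeout_ms, ctx);
    if (r.ready)
        return 0;

    /* Charge the time already spent against the overall budget. */
    timeout_ms -= r.waited_ms;
    if (timeout_ms <= 0)
        return -1;

    r = poll_command(timeout_ms, ctx);
    return r.ready ? 0 : -1;
}

/* Simulated phases used to exercise the accounting. */
struct fake_ctx {
    int vendor_waited_ms;   /* time phase 1 spends before giving up */
    int command_needs_ms;   /* time phase 2 needs to see readiness */
    int command_budget_ms;  /* records the budget phase 2 was given */
};

static struct phase_result fake_poll_vendor(int timeout_ms, void *ctx)
{
    struct fake_ctx *c = ctx;
    int waited = c->vendor_waited_ms < timeout_ms ?
                 c->vendor_waited_ms : timeout_ms;

    /* Phase 1 never positively establishes readiness in this fake. */
    return (struct phase_result){ .ready = false, .waited_ms = waited };
}

static struct phase_result fake_poll_command(int timeout_ms, void *ctx)
{
    struct fake_ctx *c = ctx;
    bool ready = c->command_needs_ms <= timeout_ms;

    c->command_budget_ms = timeout_ms;
    return (struct phase_result){
        .ready = ready,
        .waited_ms = ready ? c->command_needs_ms : timeout_ms,
    };
}
```

With a 1000 ms budget and a first phase that spends 400 ms, the second phase
is handed 600 ms rather than a fresh 1000 ms.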

Stanislav Spassov (3):
  PCI: Refactor polling loop out of pci_dev_wait
  PCI: Cache CRS Software Visibility in struct pci_dev
  PCI: Add CRS handling to pci_dev_wait()

 drivers/pci/pci.c   | 109 +++++++++++++++++++++++++++++++++++---------
 drivers/pci/probe.c |   8 +++-
 include/linux/pci.h |   3 ++
 3 files changed, 98 insertions(+), 22 deletions(-)


base-commit: bb6d3fb354c5ee8d6bde2d576eb7220ea09862b9

Comments

David Woodhouse Jan. 22, 2021, 8:54 a.m. UTC | #1
On Sat, 2020-03-07 at 18:20 +0100, Stanislav Spassov wrote:
> From: Stanislav Spassov <stanspas@amazon.de>
> 
> The first version of this patch series can be found here:
> https://lore.kernel.org/linux-pci/20200223122057.6504-1-stanspas@amazon.com
> 
> The goal of this patch series is to solve an issue where pci_dev_wait
> can cause system crashes. After a reset, a hung device may keep
> responding with CRS completions indefinitely. If CRS Software Visibility
> is enabled on the Root Port, attempting to read any register other than
> PCI_VENDOR_ID will cause the Root Port to autonomously retry the request
> without reporting back to the CPU core. Unless the number of retries or
> the amount of time spent retrying is limited by platform-specific means,
> this scenario leads to low-level platform timeouts (such as a TOR
> Timeout), which can easily escalate to a crash.
> 
> Feedback on the v1 inspired a lot of additional improvements all around the
> device reset codepaths and reducing post-reset delays. These improvements
> were published as part of v2 (v3 is just small build fixes).
> 
> It looks like there is immediate demand specifically for the CRS work,
> so I am once again reducing the series to just that. The reset will be
> posted as a separate patch series that will likely require more time and
> iterations to stabilize.

Hm, what happened to this?

Bjorn?

David Woodhouse Sept. 10, 2021, 9:32 a.m. UTC | #2
On Fri, 2021-01-22 at 08:54 +0000, David Woodhouse wrote:
> On Sat, 2020-03-07 at 18:20 +0100, Stanislav Spassov wrote:
> > From: Stanislav Spassov <stanspas@amazon.de>
> > 
> > The first version of this patch series can be found here:
> > https://lore.kernel.org/linux-pci/20200223122057.6504-1-stanspas@amazon.com
> > 
> > The goal of this patch series is to solve an issue where pci_dev_wait
> > can cause system crashes. After a reset, a hung device may keep
> > responding with CRS completions indefinitely. If CRS Software Visibility
> > is enabled on the Root Port, attempting to read any register other than
> > PCI_VENDOR_ID will cause the Root Port to autonomously retry the request
> > without reporting back to the CPU core. Unless the number of retries or
> > the amount of time spent retrying is limited by platform-specific means,
> > this scenario leads to low-level platform timeouts (such as a TOR
> > Timeout), which can easily escalate to a crash.
> > 
> > Feedback on the v1 inspired a lot of additional improvements all around the
> > device reset codepaths and reducing post-reset delays. These improvements
> > were published as part of v2 (v3 is just small build fixes).
> > 
> > It looks like there is immediate demand specifically for the CRS work,
> > so I am once again reducing the series to just that. The reset will be
> > posted as a separate patch series that will likely require more time and
> > iterations to stabilize.
> 
> Hm, what happened to this?
> 
> Bjorn?

Ping?