diff mbox series

[v2,1/3] PCI/PM: Observe reset delay irrespective of bridge_d3

Message ID eb37fa345285ec8bacabbf06b020b803f77bdd3d.1673769517.git.lukas@wunner.de (mailing list archive)
State Accepted
Delegated to: Bjorn Helgaas
Headers show
Series PCI reset delay fixes | expand

Commit Message

Lukas Wunner Jan. 15, 2023, 8:20 a.m. UTC
If a PCI bridge is suspended to D3cold upon entering system sleep,
resuming it entails a Fundamental Reset per PCIe r6.0 sec 5.8.

The delay prescribed after a Fundamental Reset in PCIe r6.0 sec 6.6.1
is sought to be observed by:

  pci_pm_resume_noirq()
    pci_pm_bridge_power_up_actions()
      pci_bridge_wait_for_secondary_bus()

However, pci_bridge_wait_for_secondary_bus() bails out if the bridge_d3
flag is not set.  That flag indicates whether a bridge is allowed to
suspend to D3cold at *runtime*.

Hence *no* delay is observed on resume from system sleep if runtime
D3cold is forbidden.  That doesn't make any sense, so drop the bridge_d3
check from pci_bridge_wait_for_secondary_bus().

The purpose of the bridge_d3 check was probably to avoid delays if a
bridge remained in D0 during suspend.  However the sole caller of
pci_bridge_wait_for_secondary_bus(), pci_pm_bridge_power_up_actions(),
is only invoked if the previous power state was D3cold.  Hence the
additional bridge_d3 check seems superfluous.

Fixes: ad9001f2f411 ("PCI/PM: Add missing link delays required by the PCIe spec")
Tested-by: Ravi Kishore Koppuravuri <ravi.kishore.koppuravuri@intel.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Cc: stable@vger.kernel.org # v5.5+
---
Changes v1 -> v2:
 * Add Reviewed-by tags (Mika, Sathyanarayanan)

 drivers/pci/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Yang Su Feb. 18, 2023, 1:22 p.m. UTC | #1
Hi Lucas,

I figue out the reason of pci_bridge_secondary_bus_reset() why not work 
for NVIDIA GPU T4

which bind vfio passthrough hypervisor. I used the original func 
pci_bridge_secondary_bus_reset()

not your patch, your patch remove bridge_d3 flag, the real reason is 
bridge_d3 flag.

> However, pci_bridge_wait_for_secondary_bus() bails out if the bridge_d3
> flag is not set.  That flag indicates whether a bridge is allowed to
> suspend to D3cold at *runtime*.
>
>   drivers/pci/pci.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index fba95486caaf..f43f3e84f634 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4964,7 +4964,7 @@ void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
>   	if (pci_dev_is_disconnected(dev))
>   		return;
>   
> -	if (!pci_is_bridge(dev) || !dev->bridge_d3)
> +	if (!pci_is_bridge(dev))
>   		return;
>   
>   	down_read(&pci_bus_sem);

When I test the original func pci_bridge_secondary_bus_reset() in 
different machine node

which all consist of the same type NVIDIA GPU T4, I found 
pci_bridge_wait_for_secondary_bus()

bails out if the bridge_d3 flag is not set, but I still confused why 
same gpu some machine node not set

the bridge_d3 flag.

I find the linux kernel only two func will init bridge_d3 which is func 
pci_pm_init() and pci_bridge_d3_update().

If you know, please give me some hint.

On 2023/1/15 16:20, Lukas Wunner wrote:
> If a PCI bridge is suspended to D3cold upon entering system sleep,
> resuming it entails a Fundamental Reset per PCIe r6.0 sec 5.8.
>
> The delay prescribed after a Fundamental Reset in PCIe r6.0 sec 6.6.1
> is sought to be observed by:
>
>    pci_pm_resume_noirq()
>      pci_pm_bridge_power_up_actions()
>        pci_bridge_wait_for_secondary_bus()
>
> However, pci_bridge_wait_for_secondary_bus() bails out if the bridge_d3
> flag is not set.  That flag indicates whether a bridge is allowed to
> suspend to D3cold at *runtime*.
>
> Hence *no* delay is observed on resume from system sleep if runtime
> D3cold is forbidden.  That doesn't make any sense, so drop the bridge_d3
> check from pci_bridge_wait_for_secondary_bus().
>
> The purpose of the bridge_d3 check was probably to avoid delays if a
> bridge remained in D0 during suspend.  However the sole caller of
> pci_bridge_wait_for_secondary_bus(), pci_pm_bridge_power_up_actions(),
> is only invoked if the previous power state was D3cold.  Hence the
> additional bridge_d3 check seems superfluous.
>
> Fixes: ad9001f2f411 ("PCI/PM: Add missing link delays required by the PCIe spec")
> Tested-by: Ravi Kishore Koppuravuri<ravi.kishore.koppuravuri@intel.com>
> Signed-off-by: Lukas Wunner<lukas@wunner.de>
> Reviewed-by: Mika Westerberg<mika.westerberg@linux.intel.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan<sathyanarayanan.kuppuswamy@linux.intel.com>
> Cc:stable@vger.kernel.org  # v5.5+
> ---
> Changes v1 -> v2:
>   * Add Reviewed-by tags (Mika, Sathyanarayanan)
>
>   drivers/pci/pci.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index fba95486caaf..f43f3e84f634 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4964,7 +4964,7 @@ void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
>   	if (pci_dev_is_disconnected(dev))
>   		return;
>   
> -	if (!pci_is_bridge(dev) || !dev->bridge_d3)
> +	if (!pci_is_bridge(dev))
>   		return;
>   
>   	down_read(&pci_bus_sem);



Below is reply for your another email about the device invoke secondary 
bus reset.

Yes, as your previous email describes the endpoint can not reset its 
secondary bus.

> Normally pci_bridge_secondary_bus_reset() should never be executed for
> an endpoint device such as the Nvidia GPU T4.  It should only be executed
> for one of the bridges above the GPU.  An endpoint cannot reset its
> secondary bus, it doesn't have one.

As previous email I give the pci topology for the NVIDIA GPU T4, here I 
copy as below,

you say that means:

0000:17:00.0 = Root Port
0000:18:00.0 = Switch Upstream Port of PLX 9797
0000:19:04.0 = Switch Downstream Port of PLX 9797
0000:19:08.0 = Switch Downstream Port of PLX 9797
0000:4f:00.0 = Nvidia GPU T4

I apply your debug patch, the pci_info print the device is 0000:19:08.0 
which is the parent of

the endpoint device 0000:4f:00.0.

> pci_parent_bus_reset() invokes pci_bridge_secondary_bus_reset() on the
> parent of the device to be reset.  In this case that's the Switch Downstream
> Port 0000:19:08.0.  It finds the parent by following dev->bus->self.
>
> Perhaps you can apply the small debug patch below.  It should print a message
> when entering pci_bridge_secondary_bus_reset() and the message contains the
> pci_name() of the device for which the function is executed.  If this is
> not the Switch Downstream Port, you've found a bug.
>
> Usually a better way of finding the parent device is to call
> pci_upstream_bridge() instead of following dev->bus->self.  As you can
> see in the definition of pci_upstream_bridge() in include/linux/pci.h,
> this will find the parent of pci_phys_fn(dev).  So in virtualization
> scenarios, the result may indeed be different from dev->bus->self,
> but I still don't understand how.  You may want to try replacing
> dev->bus->self with pci_upstream_bridge(dev) in pci_parent_bus_reset()
> and see if that fixes the issue for you.
>
> Thanks,
>
> Lukas
>
> -- >8 --
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 19fe0ef0e583..f383e5d29bb1 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5073,6 +5073,9 @@ void __weak pcibios_reset_secondary_bus(struct pci_dev *dev)
>    */
>   int pci_bridge_secondary_bus_reset(struct pci_dev *dev)
>   {
> +	pci_info(dev, "%s\n", __func__);
> +	dump_stack();
> +
>   	pcibios_reset_secondary_bus(dev);
>   
>   	return pci_bridge_wait_for_secondary_bus(dev, "bus reset",



Thanks,

Yang
Lukas Wunner Feb. 19, 2023, 5:07 a.m. UTC | #2
On Sat, Feb 18, 2023 at 09:22:37PM +0800, Yang Su wrote:
> I figue out the reason of pci_bridge_secondary_bus_reset() why not work
> for NVIDIA GPU T4 which bind vfio passthrough hypervisor. I used the
> original func pci_bridge_secondary_bus_reset() not your patch, your
> patch remove bridge_d3 flag, the real reason is bridge_d3 flag.
> 
> When I test the original func pci_bridge_secondary_bus_reset() in
> different machine node which all consist of the same type NVIDIA GPU T4,
> I found pci_bridge_wait_for_secondary_bus() bails out if the bridge_d3
> flag is not set, but I still confused why same gpu some machine node
> not set the bridge_d3 flag.
> 
> I find the linux kernel only two func will init bridge_d3 which is func
> pci_pm_init() and pci_bridge_d3_update().
> 
> If you know, please give me some hint.

It sounds like patch [1/3] in this series fixes the issue your seeing.
That's good to hear.

The bridge_d3 flag indicates whether a PCIe port is allowed to
runtime suspend to D3:

First of all, pci_bridge_d3_possible() decides whether the PCIe port
may runtime suspend at all.  E.g. hotplug ports are generally not
allowed to runtime suspend except in certain known-to-work situations,
such as Thunderbolt or when specific ACPI properties are present.

Second, a device below the PCIe port may block the port from runtime
suspending to D3.  That is decided in pci_dev_check_d3cold().  E.g. if
user space disallows D3 via the "d3cold_allowed" attribute in sysfs,
that will block D3 on PCIe ports in the ancestry.

If you're seeing different values for bridge_d3 on different machines,
even though the device below the PCIe port is always the same type of
GPU, then either pci_bridge_d3_possible() returns a different result
for the PCIe port in question (because it's a hotplug port or has
different ACPI properties etc), or one of its children blocks runtime
suspend to D3 (e.g. via sysfs).

Thanks,

Lukas
diff mbox series

Patch

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index fba95486caaf..f43f3e84f634 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4964,7 +4964,7 @@  void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
 	if (pci_dev_is_disconnected(dev))
 		return;
 
-	if (!pci_is_bridge(dev) || !dev->bridge_d3)
+	if (!pci_is_bridge(dev))
 		return;
 
 	down_read(&pci_bus_sem);