diff mbox series

[v2] PCI/PM: Target PM state is D3cold if the upstream bridge is power manageable

Message ID 20210531133435.53259-1-mika.westerberg@linux.intel.com (mailing list archive)
State Changes Requested
Delegated to: Bjorn Helgaas
Headers show
Series [v2] PCI/PM: Target PM state is D3cold if the upstream bridge is power manageable | expand

Commit Message

Mika Westerberg May 31, 2021, 1:34 p.m. UTC
Some PCIe devices only support PME (Power Management Event) from D3cold.
One example is ASMedia xHCI controller:

11:00.0 USB controller: ASMedia Technology Inc. ASM1042A USB 3.0 Host Controller (prog-if 30 [XHCI])
  ...
  Capabilities: [78] Power Management version 3
  	  Flags: PMEClk- DSI- D1- D2- AuxCurrent=55mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
	  Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-

With such devices, if it has wake enabled, the kernel selects lowest
possible power state to be D0 in pci_target_state(). This is problematic
because it prevents the root port it is connected to enter low power
state too which makes the system consume more energy than necessary.

The problem in pci_target_state() is that it only accounts the "current"
device state, so when the bridge above it (a root port for instance) is
transitioned into D3hot the device transitions into D3cold. This is
because when the root port is first transitioned into D3hot then the
ACPI power resource is turned off which puts the PCIe link to L2/L3 (and
the root port and the device are in D3cold). If the root port is kept in
D3hot it still means that the device below it is still effectively in
D3cold as no configuration messages pass through. Furthermore the
implementation note of PCIe 5.0 sec 5.3.1.4 says that the device should
expect to be transitioned into D3cold soon after its link transitions
into L2/L3 Ready state.

Taking the above into consideration, instead of forcing the device stay
in D0 we look at the upstream bridge and whether it is allowed to enter
D3 (hot/cold). If this is the case we conclude that the actual target
state of the device is D3cold. This also follows the logic in
pci_set_power_state() that sets power state of the subordinate devices
to D3cold after the bridge itself is transitioned into D3cold.

Reported-by: Utkarsh H Patel <utkarsh.h.patel@intel.com>
Reported-by: Koba Ko <koba.ko@canonical.com>
Tested-by: Koba Ko <koba.ko@canonical.com>
Acked-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
---
Previous version of the patch can be found here:

  https://patchwork.kernel.org/project/linux-pci/patch/20210510102647.40322-1-mika.westerberg@linux.intel.com/

Changes from the previous version:

  * Added Ack from Kai-Heng
  * Reworked the commit log according to Bjorn's comments (I tried my best
    to answer the questions and explain the issue).
  * Expanded the comment to mention why the target state is D3cold.

 drivers/pci/pci.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

Comments

Rafael J. Wysocki May 31, 2021, 5:37 p.m. UTC | #1
On Monday, May 31, 2021 3:34:35 PM CEST Mika Westerberg wrote:
> Some PCIe devices only support PME (Power Management Event) from D3cold.
> One example is ASMedia xHCI controller:
> 
> 11:00.0 USB controller: ASMedia Technology Inc. ASM1042A USB 3.0 Host Controller (prog-if 30 [XHCI])
>   ...
>   Capabilities: [78] Power Management version 3
>   	  Flags: PMEClk- DSI- D1- D2- AuxCurrent=55mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
> 	  Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> 
> With such devices, if it has wake enabled, the kernel selects lowest
> possible power state to be D0 in pci_target_state(). This is problematic
> because it prevents the root port it is connected to enter low power
> state too which makes the system consume more energy than necessary.

But this is not the only case affected by the patch AFAICS.

> The problem in pci_target_state() is that it only accounts the "current"
> device state, so when the bridge above it (a root port for instance) is
> transitioned into D3hot the device transitions into D3cold.

Well, as designed, pci_target_state() is about states the the device can
be programmed into, which cannot be D3cold if the device has not platform PM.

> This is because when the root port is first transitioned into D3hot then the
> ACPI power resource is turned off which puts the PCIe link to L2/L3 (and
> the root port and the device are in D3cold). If the root port is kept in
> D3hot it still means that the device below it is still effectively in
> D3cold as no configuration messages pass through. Furthermore the
> implementation note of PCIe 5.0 sec 5.3.1.4 says that the device should
> expect to be transitioned into D3cold soon after its link transitions
> into L2/L3 Ready state.

That's true, but the prerequisite is to put the endpoint device into D3hot
and not to attempt to put it into D3cold.

> Taking the above into consideration, instead of forcing the device stay
> in D0 we look at the upstream bridge and whether it is allowed to enter
> D3 (hot/cold). If this is the case we conclude that the actual target
> state of the device is D3cold. This also follows the logic in
> pci_set_power_state() that sets power state of the subordinate devices
> to D3cold after the bridge itself is transitioned into D3cold.

IMO what needs to be fixed is what happens when the "wakeup" argument is "true"
and that simply needs to special-case D3hot.

Namely, if wakeup from D3hot is not supported, D3hot should still be returned
if wakeup from D3cold is supported and the upstream bridge supports D3cold.

Returning D3cold from pci_target_state() for devices that cannot be programmed
into D3cold would be confusing IMV.

> Reported-by: Utkarsh H Patel <utkarsh.h.patel@intel.com>
> Reported-by: Koba Ko <koba.ko@canonical.com>
> Tested-by: Koba Ko <koba.ko@canonical.com>
> Acked-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> ---
> Previous version of the patch can be found here:
> 
>   https://patchwork.kernel.org/project/linux-pci/patch/20210510102647.40322-1-mika.westerberg@linux.intel.com/
> 
> Changes from the previous version:
> 
>   * Added Ack from Kai-Heng
>   * Reworked the commit log according to Bjorn's comments (I tried my best
>     to answer the questions and explain the issue).
>   * Expanded the comment to mention why the target state is D3cold.
> 
>  drivers/pci/pci.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index b717680377a9..71c6a6437406 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -2578,8 +2578,21 @@ static pci_power_t pci_target_state(struct pci_dev *dev, bool wakeup)
>  		return target_state;
>  	}
>  
> -	if (!dev->pm_cap)
> +	if (!dev->pm_cap) {
>  		target_state = PCI_D0;
> +	} else {
> +		struct pci_dev *bridge;
> +
> +		/*
> +		 * Look at the upstream bridge and whether it is allowed to
> +		 * enter D3hot (or D3cold). In both cases this device is
> +		 * not accessible anymore and its effective power state
> +		 * becomes D3cold.
> +		 */
> +		bridge = pci_upstream_bridge(dev);
> +		if (bridge && pci_bridge_d3_possible(bridge))
> +			target_state = PCI_D3cold;
> +	}
>  
>  	/*
>  	 * If the device is in D3cold even though it's not power-manageable by
>
Mika Westerberg June 1, 2021, 7:46 a.m. UTC | #2
Hi Rafael,

On Mon, May 31, 2021 at 07:37:45PM +0200, Rafael J. Wysocki wrote:
> On Monday, May 31, 2021 3:34:35 PM CEST Mika Westerberg wrote:
> > Some PCIe devices only support PME (Power Management Event) from D3cold.
> > One example is ASMedia xHCI controller:
> > 
> > 11:00.0 USB controller: ASMedia Technology Inc. ASM1042A USB 3.0 Host Controller (prog-if 30 [XHCI])
> >   ...
> >   Capabilities: [78] Power Management version 3
> >   	  Flags: PMEClk- DSI- D1- D2- AuxCurrent=55mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
> > 	  Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> > 
> > With such devices, if it has wake enabled, the kernel selects lowest
> > possible power state to be D0 in pci_target_state(). This is problematic
> > because it prevents the root port it is connected to enter low power
> > state too which makes the system consume more energy than necessary.
> 
> But this is not the only case affected by the patch AFAICS.

Right.

> > The problem in pci_target_state() is that it only accounts the "current"
> > device state, so when the bridge above it (a root port for instance) is
> > transitioned into D3hot the device transitions into D3cold.
> 
> Well, as designed, pci_target_state() is about states the the device can
> be programmed into, which cannot be D3cold if the device has not platform PM.

Fair enough but the device itself does not need to have platform PM. It
is enough that the root port far up in the hierarchy has it.

> > This is because when the root port is first transitioned into D3hot then the
> > ACPI power resource is turned off which puts the PCIe link to L2/L3 (and
> > the root port and the device are in D3cold). If the root port is kept in
> > D3hot it still means that the device below it is still effectively in
> > D3cold as no configuration messages pass through. Furthermore the
> > implementation note of PCIe 5.0 sec 5.3.1.4 says that the device should
> > expect to be transitioned into D3cold soon after its link transitions
> > into L2/L3 Ready state.
> 
> That's true, but the prerequisite is to put the endpoint device into D3hot
> and not to attempt to put it into D3cold.

Okay.

> > Taking the above into consideration, instead of forcing the device stay
> > in D0 we look at the upstream bridge and whether it is allowed to enter
> > D3 (hot/cold). If this is the case we conclude that the actual target
> > state of the device is D3cold. This also follows the logic in
> > pci_set_power_state() that sets power state of the subordinate devices
> > to D3cold after the bridge itself is transitioned into D3cold.
> 
> IMO what needs to be fixed is what happens when the "wakeup" argument is "true"
> and that simply needs to special-case D3hot.
> 
> Namely, if wakeup from D3hot is not supported, D3hot should still be returned
> if wakeup from D3cold is supported and the upstream bridge supports D3cold.

This sounds like a good solution except that we probably need to look
further up then to see if any of the bridges above support D3cold,
right? For instance if the device is part of TBT topology there may be
multiple bridges (PCIe upstream ports, downstream ports) between it and
the root port that has the platform PM "support".
Rafael J. Wysocki June 2, 2021, 6:02 p.m. UTC | #3
On Tuesday, June 1, 2021 9:46:20 AM CEST Mika Westerberg wrote:
> Hi Rafael,
> 
> On Mon, May 31, 2021 at 07:37:45PM +0200, Rafael J. Wysocki wrote:
> > On Monday, May 31, 2021 3:34:35 PM CEST Mika Westerberg wrote:
> > > Some PCIe devices only support PME (Power Management Event) from D3cold.
> > > One example is ASMedia xHCI controller:
> > > 
> > > 11:00.0 USB controller: ASMedia Technology Inc. ASM1042A USB 3.0 Host Controller (prog-if 30 [XHCI])
> > >   ...
> > >   Capabilities: [78] Power Management version 3
> > >   	  Flags: PMEClk- DSI- D1- D2- AuxCurrent=55mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
> > > 	  Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> > > 
> > > With such devices, if it has wake enabled, the kernel selects lowest
> > > possible power state to be D0 in pci_target_state(). This is problematic
> > > because it prevents the root port it is connected to enter low power
> > > state too which makes the system consume more energy than necessary.
> > 
> > But this is not the only case affected by the patch AFAICS.
> 
> Right.
> 
> > > The problem in pci_target_state() is that it only accounts the "current"
> > > device state, so when the bridge above it (a root port for instance) is
> > > transitioned into D3hot the device transitions into D3cold.
> > 
> > Well, as designed, pci_target_state() is about states the the device can
> > be programmed into, which cannot be D3cold if the device has not platform PM.
> 
> Fair enough but the device itself does not need to have platform PM. It
> is enough that the root port far up in the hierarchy has it.
> 
> > > This is because when the root port is first transitioned into D3hot then the
> > > ACPI power resource is turned off which puts the PCIe link to L2/L3 (and
> > > the root port and the device are in D3cold). If the root port is kept in
> > > D3hot it still means that the device below it is still effectively in
> > > D3cold as no configuration messages pass through. Furthermore the
> > > implementation note of PCIe 5.0 sec 5.3.1.4 says that the device should
> > > expect to be transitioned into D3cold soon after its link transitions
> > > into L2/L3 Ready state.
> > 
> > That's true, but the prerequisite is to put the endpoint device into D3hot
> > and not to attempt to put it into D3cold.
> 
> Okay.
> 
> > > Taking the above into consideration, instead of forcing the device stay
> > > in D0 we look at the upstream bridge and whether it is allowed to enter
> > > D3 (hot/cold). If this is the case we conclude that the actual target
> > > state of the device is D3cold. This also follows the logic in
> > > pci_set_power_state() that sets power state of the subordinate devices
> > > to D3cold after the bridge itself is transitioned into D3cold.
> > 
> > IMO what needs to be fixed is what happens when the "wakeup" argument is "true"
> > and that simply needs to special-case D3hot.
> > 
> > Namely, if wakeup from D3hot is not supported, D3hot should still be returned
> > if wakeup from D3cold is supported and the upstream bridge supports D3cold.
> 
> This sounds like a good solution except that we probably need to look
> further up then to see if any of the bridges above support D3cold,
> right? For instance if the device is part of TBT topology there may be
> multiple bridges (PCIe upstream ports, downstream ports) between it and
> the root port that has the platform PM "support".

Moreover, pci_enable_wake() would need to be modified to also enable PME if the
target state is D3hot and only wakeup from D3cold is supported.

But if pci_enable_wake() is modified this way, we don't need to worry about the
upstream bridges.  Worst-case wakeup will not work (even though enabled) if the
device stays in D3hot.

So pci_target_state() should return D3hot even if wakeup from D3hot itself is
not supported, but wakeup from D3cold is supported, regardless of what the
upstream bridges can do (with the assumption that power will get removed from
the device via an upstream bridge) and pci_enable_wake() should be modified to
enable PME if the target state is D3hot, but the device can only signal wakeup
from D3cold.

IMO, allowing PME support from D3cold only without providing any means to put
the device into D3cold is not a valid configuration and it need not be taken
into account here.
Mika Westerberg June 3, 2021, 11:06 a.m. UTC | #4
On Wed, Jun 02, 2021 at 08:02:38PM +0200, Rafael J. Wysocki wrote:
> On Tuesday, June 1, 2021 9:46:20 AM CEST Mika Westerberg wrote:
> > Hi Rafael,
> > 
> > On Mon, May 31, 2021 at 07:37:45PM +0200, Rafael J. Wysocki wrote:
> > > On Monday, May 31, 2021 3:34:35 PM CEST Mika Westerberg wrote:
> > > > Some PCIe devices only support PME (Power Management Event) from D3cold.
> > > > One example is ASMedia xHCI controller:
> > > > 
> > > > 11:00.0 USB controller: ASMedia Technology Inc. ASM1042A USB 3.0 Host Controller (prog-if 30 [XHCI])
> > > >   ...
> > > >   Capabilities: [78] Power Management version 3
> > > >   	  Flags: PMEClk- DSI- D1- D2- AuxCurrent=55mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
> > > > 	  Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> > > > 
> > > > With such devices, if it has wake enabled, the kernel selects lowest
> > > > possible power state to be D0 in pci_target_state(). This is problematic
> > > > because it prevents the root port it is connected to enter low power
> > > > state too which makes the system consume more energy than necessary.
> > > 
> > > But this is not the only case affected by the patch AFAICS.
> > 
> > Right.
> > 
> > > > The problem in pci_target_state() is that it only accounts the "current"
> > > > device state, so when the bridge above it (a root port for instance) is
> > > > transitioned into D3hot the device transitions into D3cold.
> > > 
> > > Well, as designed, pci_target_state() is about states the the device can
> > > be programmed into, which cannot be D3cold if the device has not platform PM.
> > 
> > Fair enough but the device itself does not need to have platform PM. It
> > is enough that the root port far up in the hierarchy has it.
> > 
> > > > This is because when the root port is first transitioned into D3hot then the
> > > > ACPI power resource is turned off which puts the PCIe link to L2/L3 (and
> > > > the root port and the device are in D3cold). If the root port is kept in
> > > > D3hot it still means that the device below it is still effectively in
> > > > D3cold as no configuration messages pass through. Furthermore the
> > > > implementation note of PCIe 5.0 sec 5.3.1.4 says that the device should
> > > > expect to be transitioned into D3cold soon after its link transitions
> > > > into L2/L3 Ready state.
> > > 
> > > That's true, but the prerequisite is to put the endpoint device into D3hot
> > > and not to attempt to put it into D3cold.
> > 
> > Okay.
> > 
> > > > Taking the above into consideration, instead of forcing the device stay
> > > > in D0 we look at the upstream bridge and whether it is allowed to enter
> > > > D3 (hot/cold). If this is the case we conclude that the actual target
> > > > state of the device is D3cold. This also follows the logic in
> > > > pci_set_power_state() that sets power state of the subordinate devices
> > > > to D3cold after the bridge itself is transitioned into D3cold.
> > > 
> > > IMO what needs to be fixed is what happens when the "wakeup" argument is "true"
> > > and that simply needs to special-case D3hot.
> > > 
> > > Namely, if wakeup from D3hot is not supported, D3hot should still be returned
> > > if wakeup from D3cold is supported and the upstream bridge supports D3cold.
> > 
> > This sounds like a good solution except that we probably need to look
> > further up then to see if any of the bridges above support D3cold,
> > right? For instance if the device is part of TBT topology there may be
> > multiple bridges (PCIe upstream ports, downstream ports) between it and
> > the root port that has the platform PM "support".
> 
> Moreover, pci_enable_wake() would need to be modified to also enable PME if the
> target state is D3hot and only wakeup from D3cold is supported.
> 
> But if pci_enable_wake() is modified this way, we don't need to worry about the
> upstream bridges.  Worst-case wakeup will not work (even though enabled) if the
> device stays in D3hot.
> 
> So pci_target_state() should return D3hot even if wakeup from D3hot itself is
> not supported, but wakeup from D3cold is supported, regardless of what the
> upstream bridges can do (with the assumption that power will get removed from
> the device via an upstream bridge) and pci_enable_wake() should be modified to
> enable PME if the target state is D3hot, but the device can only signal wakeup
> from D3cold.

OK thanks. Let me try this in v3.

> IMO, allowing PME support from D3cold only without providing any means to put
> the device into D3cold is not a valid configuration and it need not be taken
> into account here.

AFAIK you don't need the device to have any other means than that it
responds to the L2/3 handshake properly according to the spec (so that
the link can be put to L2). I think this is perfectly valid configuration :)
diff mbox series

Patch

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b717680377a9..71c6a6437406 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2578,8 +2578,21 @@  static pci_power_t pci_target_state(struct pci_dev *dev, bool wakeup)
 		return target_state;
 	}
 
-	if (!dev->pm_cap)
+	if (!dev->pm_cap) {
 		target_state = PCI_D0;
+	} else {
+		struct pci_dev *bridge;
+
+		/*
+		 * Look at the upstream bridge and whether it is allowed to
+		 * enter D3hot (or D3cold). In both cases this device is
+		 * not accessible anymore and its effective power state
+		 * becomes D3cold.
+		 */
+		bridge = pci_upstream_bridge(dev);
+		if (bridge && pci_bridge_d3_possible(bridge))
+			target_state = PCI_D3cold;
+	}
 
 	/*
 	 * If the device is in D3cold even though it's not power-manageable by