diff mbox

PCI: Revert "PCI: Add runtime PM support for PCIe ports"

Message ID 20170102114040.GA20127@wunner.de (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Headers show

Commit Message

Lukas Wunner Jan. 2, 2017, 11:40 a.m. UTC
On Fri, Dec 30, 2016 at 01:16:17AM +0100, Kilian Singer wrote:
> I did the debug message on the 4.10-rc1 for now. I could go back to 4.9
> if that helps but needs some time again to compile.
> The debug messages from the first rpm_... to the crash are:
[...]
> [   24.831417] nouveau 0000:01:00.0: rpm_suspend
> [   24.831427] nouveau 0000:01:00.0: DRM: suspending console...
> [   24.831432] nouveau 0000:01:00.0: DRM: suspending display...
> [   24.831477] nouveau 0000:01:00.0: DRM: evicting buffers...
> [   24.865243] nouveau 0000:01:00.0: DRM: waiting for kernel channels to go idle...
> [   24.865269] nouveau 0000:01:00.0: DRM: suspending client object trees...
> [   24.870724] nouveau 0000:01:00.0: DRM: suspending kernel object tree...
> [   26.080300] thinkpad_acpi: EC reports that Thermal Table has changed
> [   26.207691] pcieport 0000:00:01.0: rpm_idle
> [   26.207693] pcieport 0000:00:01.0: rpm_suspend
> [   28.927640] snd_hda_codec_hdmi hdaudioC0D0: rpm_suspend
> SYSTEM IS NOW NOT RESPONSIVE

So two seconds before the system became unresponsive, the root port above
the discrete GPU suspended, suggesting that's the culprit.  Could you test
either of the attached patches to confirm this theory?  They disable
runtime PM on this specific root port but allow it on all the others.

You've got an Optimus laptop, i.e. power to the discrete GPU can be cut.
Traditionally this is achieved by invoking an ACPI _DSM (Device Specific
Method).  That's what we did up until v4.7.

However on newer laptops Windows no longer cuts power to the discrete GPU
by invoking the _DSM, but rather by suspending the root port above the
GPU.  (More specifically by turning off Power Resources required for D3
of the root port, those are specified in a _PR3 object.)  We started
supporting this with v4.8.

If the above theory is correct, we need to involve Optimus experts
because this is not an issue then with powering down root ports in
general, but rather specific to this Optimus use case.

Thanks,

Lukas

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index a881c0d..2e4b32bd 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2240,6 +2240,10 @@ bool pci_bridge_d3_possible(struct pci_dev *bridge)
 		if (pci_bridge_d3_disable)
 			return false;
 
+		if (bridge->vendor == PCI_VENDOR_ID_INTEL &&
+		    bridge->device == 0x0c01)
+			return false;
+
 		/*
 		 * Hotplug ports handled by firmware in System Management Mode
 		 * may not be put into D3 by the OS (Thunderbolt on non-Macs).

Comments

Mika Westerberg Jan. 2, 2017, 12:10 p.m. UTC | #1
On Mon, Jan 02, 2017 at 12:40:40PM +0100, Lukas Wunner wrote:
> On Fri, Dec 30, 2016 at 01:16:17AM +0100, Kilian Singer wrote:
> > I did the debug message on the 4.10-rc1 for now. I could go back to 4.9
> > if that helps but needs some time again to compile.
> > The debug messages from the first rpm_... to the crash are:
> [...]
> > [   24.831417] nouveau 0000:01:00.0: rpm_suspend
> > [   24.831427] nouveau 0000:01:00.0: DRM: suspending console...
> > [   24.831432] nouveau 0000:01:00.0: DRM: suspending display...
> > [   24.831477] nouveau 0000:01:00.0: DRM: evicting buffers...
> > [   24.865243] nouveau 0000:01:00.0: DRM: waiting for kernel channels to go idle...
> > [   24.865269] nouveau 0000:01:00.0: DRM: suspending client object trees...
> > [   24.870724] nouveau 0000:01:00.0: DRM: suspending kernel object tree...
> > [   26.080300] thinkpad_acpi: EC reports that Thermal Table has changed
> > [   26.207691] pcieport 0000:00:01.0: rpm_idle
> > [   26.207693] pcieport 0000:00:01.0: rpm_suspend
> > [   28.927640] snd_hda_codec_hdmi hdaudioC0D0: rpm_suspend
> > SYSTEM IS NOW NOT RESPONSIVE
> 
> So two seconds before the system became unresponsive, the root port above
> the discrete GPU suspended, suggesting that's the culprit.  Could you test
> either of the attached patches to confirm this theory?  They disable
> runtime PM on this specific root port but allow it on all the others.
> 
> You've got an Optimus laptop, i.e. power to the discrete GPU can be cut.
> Traditionally this is achieved by invoking an ACPI _DSM (Device Specific
> Method).  That's what we did up until v4.7.
> 
> However on newer laptops Windows no longer cuts power to the discrete GPU
> by invoking the _DSM, but rather by suspending the root port above the
> GPU.  (More specifically by turning off Power Resources required for D3
> of the root port, those are specified in a _PR3 object.)  We started
> supporting this with v4.8.
> 
> If the above theory is correct, we need to involve Optimus experts
> because this is not an issue then with powering down root ports in
> general, but rather specific to this Optimus use case.

[Back from vacation now]

I've checked the acpidump of this machine and it does not seem to be a
traditional Optimus machine. At least this one is missing the magic _DSM
which is used to gather capabilities of the graphics device.

However, it does have _PR3 and it is attached to the device
(_SB.PCI0.PEG) itself, not the root port.

One thing you could try in addition to Lucas' patches is just to prevent
D3cold from the device by doing this:

  # echo 0 > /sys/bus/pci/devices/0000:01:00.0/d3cold_allowed
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mika Westerberg Jan. 2, 2017, 1:53 p.m. UTC | #2
On Mon, Jan 02, 2017 at 02:10:19PM +0200, Mika Westerberg wrote:
> I've checked the acpidump of this machine and it does not seem to be a
> traditional Optimus machine. At least this one is missing the magic _DSM
> which is used to gather capabilities of the graphics device.
> 
> However, it does have _PR3 and it is attached to the device
> (_SB.PCI0.PEG) itself, not the root port.
> 
> One thing you could try in addition to Lucas' patches is just to prevent
> D3cold from the device by doing this:
> 
>   # echo 0 > /sys/bus/pci/devices/0000:01:00.0/d3cold_allowed

Following messages look like the device fails to resume properly from D3cold:

Dec 30 08:45:06 klaptop kernel: [   27.775949] nouveau 0000:01:00.0: rpm_resume
Dec 30 08:45:06 klaptop kernel: [   27.776316] nouveau 0000:01:00.0: Refused to change power state, currently in D3
Dec 30 08:45:06 klaptop kernel: [   27.836049] nouveau 0000:01:00.0: Refused to change power state, currently in D3
Dec 30 08:45:06 klaptop kernel: [   27.836053] nouveau 0000:01:00.0: Refused to change power state, currently in D3

This happens if we read back 0xffffffff from PM register.

Dec 30 08:45:06 klaptop kernel: [   27.836055] nouveau 0000:01:00.0: DRM: resuming kernel object tree...
Dec 30 08:45:06 klaptop kernel: [   27.836127] nouveau 0000:01:00.0: pci: failed to adjust cap speed
Dec 30 08:45:06 klaptop kernel: [   27.836131] nouveau 0000:01:00.0: pci: failed to adjust lnkctl speed

Preventing D3cold should at least show some difference on resume path.
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mika Westerberg Jan. 2, 2017, 2:48 p.m. UTC | #3
On Mon, Jan 02, 2017 at 02:10:19PM +0200, Mika Westerberg wrote:
> I've checked the acpidump of this machine and it does not seem to be a
> traditional Optimus machine. At least this one is missing the magic _DSM
> which is used to gather capabilities of the graphics device.
> 
> However, it does have _PR3 and it is attached to the device
> (_SB.PCI0.PEG) itself, not the root port.

Nah, actually PEG is the root port. So it certainly looks like
a traditional Optimus machine.
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Jan. 2, 2017, 9:31 p.m. UTC | #4
On Monday, January 02, 2017 04:48:52 PM Mika Westerberg wrote:
> On Mon, Jan 02, 2017 at 02:10:19PM +0200, Mika Westerberg wrote:
> > I've checked the acpidump of this machine and it does not seem to be a
> > traditional Optimus machine. At least this one is missing the magic _DSM
> > which is used to gather capabilities of the graphics device.
> > 
> > However, it does have _PR3 and it is attached to the device
> > (_SB.PCI0.PEG) itself, not the root port.
> 
> Nah, actually PEG is the root port. So it certainly looks like
> a traditional Optimus machine.

So can we quirk that thing somehow and see if that helps (for debugging
purposes at least)?

Thanks,
Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kilian Singer Jan. 3, 2017, 4:59 p.m. UTC | #5
I tried the 4.9.0 kernel and the patch fixes both the screen lock and firefox issue.


----- Original Message -----
From: "Lukas Wunner" <lukas@wunner.de>
To: "Kilian Singer" <kilian.singer@quantumtechnology.info>
Cc: "Bjorn Helgaas" <helgaas@kernel.org>, "linux-pci" <linux-pci@vger.kernel.org>, "Mika Westerberg" <mika.westerberg@linux.intel.com>, "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Sent: Monday, January 2, 2017 12:40:40 PM
Subject: Re: PCI: Revert "PCI: Add runtime PM support for PCIe ports"

On Fri, Dec 30, 2016 at 01:16:17AM +0100, Kilian Singer wrote:
> I did the debug message on the 4.10-rc1 for now. I could go back to 4.9
> if that helps but needs some time again to compile.
> The debug messages from the first rpm_... to the crash are:
[...]
> [   24.831417] nouveau 0000:01:00.0: rpm_suspend
> [   24.831427] nouveau 0000:01:00.0: DRM: suspending console...
> [   24.831432] nouveau 0000:01:00.0: DRM: suspending display...
> [   24.831477] nouveau 0000:01:00.0: DRM: evicting buffers...
> [   24.865243] nouveau 0000:01:00.0: DRM: waiting for kernel channels to go idle...
> [   24.865269] nouveau 0000:01:00.0: DRM: suspending client object trees...
> [   24.870724] nouveau 0000:01:00.0: DRM: suspending kernel object tree...
> [   26.080300] thinkpad_acpi: EC reports that Thermal Table has changed
> [   26.207691] pcieport 0000:00:01.0: rpm_idle
> [   26.207693] pcieport 0000:00:01.0: rpm_suspend
> [   28.927640] snd_hda_codec_hdmi hdaudioC0D0: rpm_suspend
> SYSTEM IS NOW NOT RESPONSIVE

So two seconds before the system became unresponsive, the root port above
the discrete GPU suspended, suggesting that's the culprit.  Could you test
either of the attached patches to confirm this theory?  They disable
runtime PM on this specific root port but allow it on all the others.

You've got an Optimus laptop, i.e. power to the discrete GPU can be cut.
Traditionally this is achieved by invoking an ACPI _DSM (Device Specific
Method).  That's what we did up until v4.7.

However on newer laptops Windows no longer cuts power to the discrete GPU
by invoking the _DSM, but rather by suspending the root port above the
GPU.  (More specifically by turning off Power Resources required for D3
of the root port, those are specified in a _PR3 object.)  We started
supporting this with v4.8.

If the above theory is correct, we need to involve Optimus experts
because this is not an issue then with powering down root ports in
general, but rather specific to this Optimus use case.

Thanks,

Lukas
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kilian Singer Jan. 3, 2017, 5:08 p.m. UTC | #6
Sorry I should mention the patch.
I tried the 4.9.0 kernel and the patch:
disable_pm_v4.9.patch

 fixes both the screen lock and firefox issue.

----- Original Message -----
From: "Lukas Wunner" <lukas@wunner.de>
To: "Kilian Singer" <kilian.singer@quantumtechnology.info>
Cc: "Bjorn Helgaas" <helgaas@kernel.org>, "linux-pci" <linux-pci@vger.kernel.org>, "Mika Westerberg" <mika.westerberg@linux.intel.com>, "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Sent: Monday, January 2, 2017 12:40:40 PM
Subject: Re: PCI: Revert "PCI: Add runtime PM support for PCIe ports"

On Fri, Dec 30, 2016 at 01:16:17AM +0100, Kilian Singer wrote:
> I did the debug message on the 4.10-rc1 for now. I could go back to 4.9
> if that helps but needs some time again to compile.
> The debug messages from the first rpm_... to the crash are:
[...]
> [   24.831417] nouveau 0000:01:00.0: rpm_suspend
> [   24.831427] nouveau 0000:01:00.0: DRM: suspending console...
> [   24.831432] nouveau 0000:01:00.0: DRM: suspending display...
> [   24.831477] nouveau 0000:01:00.0: DRM: evicting buffers...
> [   24.865243] nouveau 0000:01:00.0: DRM: waiting for kernel channels to go idle...
> [   24.865269] nouveau 0000:01:00.0: DRM: suspending client object trees...
> [   24.870724] nouveau 0000:01:00.0: DRM: suspending kernel object tree...
> [   26.080300] thinkpad_acpi: EC reports that Thermal Table has changed
> [   26.207691] pcieport 0000:00:01.0: rpm_idle
> [   26.207693] pcieport 0000:00:01.0: rpm_suspend
> [   28.927640] snd_hda_codec_hdmi hdaudioC0D0: rpm_suspend
> SYSTEM IS NOW NOT RESPONSIVE

So two seconds before the system became unresponsive, the root port above
the discrete GPU suspended, suggesting that's the culprit.  Could you test
either of the attached patches to confirm this theory?  They disable
runtime PM on this specific root port but allow it on all the others.

You've got an Optimus laptop, i.e. power to the discrete GPU can be cut.
Traditionally this is achieved by invoking an ACPI _DSM (Device Specific
Method).  That's what we did up until v4.7.

However on newer laptops Windows no longer cuts power to the discrete GPU
by invoking the _DSM, but rather by suspending the root port above the
GPU.  (More specifically by turning off Power Resources required for D3
of the root port, those are specified in a _PR3 object.)  We started
supporting this with v4.8.

If the above theory is correct, we need to involve Optimus experts
because this is not an issue then with powering down root ports in
general, but rather specific to this Optimus use case.

Thanks,

Lukas
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kilian Singer Jan. 3, 2017, 5:10 p.m. UTC | #7
This makes the bash where executed locked but system stays responsive.
Lockscreen still leads to system becoming unresponsive.

----- Original Message -----
From: "Mika Westerberg" <mika.westerberg@linux.intel.com>
To: "Lukas Wunner" <lukas@wunner.de>
Cc: "Kilian Singer" <kilian.singer@quantumtechnology.info>, "Bjorn Helgaas" <helgaas@kernel.org>, "linux-pci" <linux-pci@vger.kernel.org>, "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Sent: Monday, January 2, 2017 1:10:19 PM
Subject: Re: PCI: Revert "PCI: Add runtime PM support for PCIe ports"

On Mon, Jan 02, 2017 at 12:40:40PM +0100, Lukas Wunner wrote:
> On Fri, Dec 30, 2016 at 01:16:17AM +0100, Kilian Singer wrote:
> > I did the debug message on the 4.10-rc1 for now. I could go back to 4.9
> > if that helps but needs some time again to compile.
> > The debug messages from the first rpm_... to the crash are:
> [...]
> > [   24.831417] nouveau 0000:01:00.0: rpm_suspend
> > [   24.831427] nouveau 0000:01:00.0: DRM: suspending console...
> > [   24.831432] nouveau 0000:01:00.0: DRM: suspending display...
> > [   24.831477] nouveau 0000:01:00.0: DRM: evicting buffers...
> > [   24.865243] nouveau 0000:01:00.0: DRM: waiting for kernel channels to go idle...
> > [   24.865269] nouveau 0000:01:00.0: DRM: suspending client object trees...
> > [   24.870724] nouveau 0000:01:00.0: DRM: suspending kernel object tree...
> > [   26.080300] thinkpad_acpi: EC reports that Thermal Table has changed
> > [   26.207691] pcieport 0000:00:01.0: rpm_idle
> > [   26.207693] pcieport 0000:00:01.0: rpm_suspend
> > [   28.927640] snd_hda_codec_hdmi hdaudioC0D0: rpm_suspend
> > SYSTEM IS NOW NOT RESPONSIVE
> 
> So two seconds before the system became unresponsive, the root port above
> the discrete GPU suspended, suggesting that's the culprit.  Could you test
> either of the attached patches to confirm this theory?  They disable
> runtime PM on this specific root port but allow it on all the others.
> 
> You've got an Optimus laptop, i.e. power to the discrete GPU can be cut.
> Traditionally this is achieved by invoking an ACPI _DSM (Device Specific
> Method).  That's what we did up until v4.7.
> 
> However on newer laptops Windows no longer cuts power to the discrete GPU
> by invoking the _DSM, but rather by suspending the root port above the
> GPU.  (More specifically by turning off Power Resources required for D3
> of the root port, those are specified in a _PR3 object.)  We started
> supporting this with v4.8.
> 
> If the above theory is correct, we need to involve Optimus experts
> because this is not an issue then with powering down root ports in
> general, but rather specific to this Optimus use case.

[Back from vacation now]

I've checked the acpidump of this machine and it does not seem to be a
traditional Optimus machine. At least this one is missing the magic _DSM
which is used to gather capabilities of the graphics device.

However, it does have _PR3 and it is attached to the device
(_SB.PCI0.PEG) itself, not the root port.

One thing you could try in addition to Lucas' patches is just to prevent
D3cold from the device by doing this:

  # echo 0 > /sys/bus/pci/devices/0000:01:00.0/d3cold_allowed
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index ba34907..c9dd1e0 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2242,6 +2242,10 @@  static bool pci_bridge_d3_possible(struct pci_dev *bridge)
 		if (pci_bridge_d3_force)
 			return true;
 
+		if (bridge->vendor == PCI_VENDOR_ID_INTEL &&
+		    bridge->device == 0x0c01)
+			return false;
+
 		/*
 		 * It should be safe to put PCIe ports from 2015 or newer
 		 * to D3.