[0/5] Fix deadlock on runtime suspend in DRM drivers

Message ID 20180211194154.GB22869@wunner.de
State New

Commit Message

Lukas Wunner Feb. 11, 2018, 7:41 p.m. UTC
On Sun, Feb 11, 2018 at 08:23:14PM +0100, Lukas Wunner wrote:
> On Sun, Feb 11, 2018 at 06:58:11PM +0000, Mike Lothian wrote:
> > On 11 February 2018 at 09:38, Lukas Wunner <lukas@wunner.de> wrote:
> > > The patches for radeon and amdgpu are compile-tested only, I only have a
> > > MacBook Pro with an Nvidia GK107 to test.  To test the patches, add an
> > > "msleep(12*1000);" at the top of the driver's ->runtime_suspend hook.
> > > This ensures that the poll worker runs after ->runtime_suspend has begun.
> > > Wait 12 sec after the GPU has begun runtime suspend, then check
> > > /sys/bus/pci/devices/0000:01:00.0/power/runtime_status.  Without this
> > > series, the status will be stuck at "suspending" and you'll get hung task
> > > errors in dmesg after a few minutes.
> > 
> > I wasn't quite sure where to add that msleep. I've tested the patches
> > as is on top of agd5f's wip branch without ill effects
> > 
> > I've had a radeon and now an amdgpu PRIME setup and don't believe I've
> > ever seen this issue
> > 
> > If you could pop a patch together for the msleep I'll give it a test on
> > amdgpu
> 
> Here you go, this is for all 3 drivers.
> Should deadlock without the series.
> Thanks!

Sorry, I missed that amdgpu_drv.c and radeon_drv.c don't include delay.h,
rectified testing patch below:

Comments

Mike Lothian Feb. 12, 2018, 12:35 a.m. UTC | #1
Hi

I've not been able to reproduce the original problem you're trying to
solve on amdgpu; that's with or without your patch set, and with the
above "trigger" too.

Is anything else required to trigger it? I started multiple DRI_PRIME
glxgears in parallel, serially waiting the 12 seconds, and serially
within the 12 seconds, and I couldn't reproduce it.

Regards

Mike
Lukas Wunner Feb. 12, 2018, 3:39 a.m. UTC | #2
On Mon, Feb 12, 2018 at 12:35:51AM +0000, Mike Lothian wrote:
> I've not been able to reproduce the original problem you're trying to
> solve on amdgpu; that's with or without your patch set, and with the
> above "trigger" too.
> 
> Is anything else required to trigger it? I started multiple DRI_PRIME
> glxgears in parallel, serially waiting the 12 seconds, and serially
> within the 12 seconds, and I couldn't reproduce it.

The discrete GPU needs to runtime suspend, that's the trigger,
so no DRI_PRIME executables should be running.  Just let it
autosuspend after boot.  Do you see "waiting 12 sec" messages
in dmesg?  If not it's not autosuspending.
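
Besides watching dmesg, the runtime PM state can be read from the sysfs attribute mentioned at the start of the thread. The following is a minimal userspace sketch, not part of the patch series. The helper names read_runtime_status() and is_suspending() are made up for illustration, and the PCI address in the comment is the one quoted above, which will differ on other machines.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read a one-line sysfs attribute, e.g.
 * /sys/bus/pci/devices/0000:01:00.0/power/runtime_status, into buf.
 * Returns 0 on success, -1 if the file cannot be opened or read.
 * (Hypothetical helper; name and signature are illustrative only.)
 */
int read_runtime_status(const char *path, char *buf, size_t len)
{
	FILE *f;

	assert(path && buf && len > 0);

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	/* sysfs attributes end with a newline; strip it. */
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/* Nonzero if the status is the value the thread reports as stuck. */
int is_suspending(const char *status)
{
	return strcmp(status, "suspending") == 0;
}
```

Calling read_runtime_status() on the attribute path roughly 12 seconds after "waiting 12 sec" appears, and then again later, should show whether the device moved on from "suspending" or stayed there.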

Thanks,

Lukas
Mike Lothian Feb. 12, 2018, 9:03 a.m. UTC | #3
On 12 February 2018 at 03:39, Lukas Wunner <lukas@wunner.de> wrote:
> On Mon, Feb 12, 2018 at 12:35:51AM +0000, Mike Lothian wrote:
>> I've not been able to reproduce the original problem you're trying to
>> solve on amdgpu; that's with or without your patch set, and with the
>> above "trigger" too.
>>
>> Is anything else required to trigger it? I started multiple DRI_PRIME
>> glxgears in parallel, serially waiting the 12 seconds, and serially
>> within the 12 seconds, and I couldn't reproduce it.
>
> The discrete GPU needs to runtime suspend, that's the trigger,
> so no DRI_PRIME executables should be running.  Just let it
> autosuspend after boot.  Do you see "waiting 12 sec" messages
> in dmesg?  If not it's not autosuspending.
>
> Thanks,
>
> Lukas

Hi

Yes I'm seeing those messages, I'm just not seeing the hangs

I've attached the dmesg in case you're interested

Regards

Mike
Lukas Wunner Feb. 12, 2018, 9:45 a.m. UTC | #4
On Mon, Feb 12, 2018 at 09:03:26AM +0000, Mike Lothian wrote:
> On 12 February 2018 at 03:39, Lukas Wunner <lukas@wunner.de> wrote:
> > On Mon, Feb 12, 2018 at 12:35:51AM +0000, Mike Lothian wrote:
> > > I've not been able to reproduce the original problem you're trying to
> > > solve on amdgpu; that's with or without your patch set, and with the
> > > above "trigger" too.
> > >
> > > Is anything else required to trigger it? I started multiple DRI_PRIME
> > > glxgears in parallel, serially waiting the 12 seconds, and serially
> > > within the 12 seconds, and I couldn't reproduce it.
> >
> > The discrete GPU needs to runtime suspend, that's the trigger,
> > so no DRI_PRIME executables should be running.  Just let it
> > autosuspend after boot.  Do you see "waiting 12 sec" messages
> > in dmesg?  If not it's not autosuspending.
> 
> Yes I'm seeing those messages, I'm just not seeing the hangs
> 
> I've attached the dmesg in case you're interested

Okay the reason you're not seeing deadlocks is because the output poll
worker is not enabled.  And the output poll worker is not enabled
because your discrete GPU doesn't have any outputs:

[    0.265568] [drm:dc_create] *ERROR* DC: Number of connectors is zero!

The outputs are only polled if there are connectors which have the
DRM_CONNECTOR_POLL_CONNECT or DRM_CONNECTOR_POLL_DISCONNECT flag set.
And that only ever seems to be the case for VGA and DVI.
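
The condition described above can be sketched in userspace as follows. This is an illustration, not the code in drm_probe_helper.c: struct fake_connector and needs_output_poll() are hypothetical stand-ins, and the flag values are redefined here on the assumption that they match the kernel's DRM header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Flag values assumed to mirror the kernel's DRM connector header;
 * redefined here so the sketch builds outside the kernel tree.
 */
#define DRM_CONNECTOR_POLL_HPD		(1 << 0)
#define DRM_CONNECTOR_POLL_CONNECT	(1 << 1)
#define DRM_CONNECTOR_POLL_DISCONNECT	(1 << 2)

/* Hypothetical stand-in for struct drm_connector; only "polled" is modeled. */
struct fake_connector {
	unsigned char polled;
};

/*
 * Returns true if at least one connector asks for periodic polling,
 * i.e. has the CONNECT or DISCONNECT poll flag set.  With zero
 * connectors, or with HPD-only connectors, no polling is needed.
 */
bool needs_output_poll(const struct fake_connector *conns, size_t n)
{
	size_t i;

	assert(conns || n == 0);

	for (i = 0; i < n; i++) {
		if (conns[i].polled & (DRM_CONNECTOR_POLL_CONNECT |
				       DRM_CONNECTOR_POLL_DISCONNECT))
			return true;
	}
	return false;
}
```

A GPU that reports zero connectors, as in the dmesg line above, corresponds to calling needs_output_poll() with n == 0, which returns false, so the poll worker never gets a chance to race with runtime suspend.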

We know based on bugzilla reports that hybrid graphics laptops do exist
which poll outputs with radeon and nouveau.  If there are no laptops
supported by amdgpu whose discrete GPU has polled connectors, then
patch [5/5] would be unnecessary.  That is for Alex to decide.

However that is very good to know, so thanks a lot for your testing
efforts, much appreciated!

Kind regards,

Lukas
Alex Deucher Feb. 12, 2018, 6:58 p.m. UTC | #5
On Mon, Feb 12, 2018 at 4:45 AM, Lukas Wunner <lukas@wunner.de> wrote:
> On Mon, Feb 12, 2018 at 09:03:26AM +0000, Mike Lothian wrote:
>> On 12 February 2018 at 03:39, Lukas Wunner <lukas@wunner.de> wrote:
>> > On Mon, Feb 12, 2018 at 12:35:51AM +0000, Mike Lothian wrote:
>> > > I've not been able to reproduce the original problem you're trying to
>> > > solve on amdgpu; that's with or without your patch set, and with the
>> > > above "trigger" too.
>> > >
>> > > Is anything else required to trigger it? I started multiple DRI_PRIME
>> > > glxgears in parallel, serially waiting the 12 seconds, and serially
>> > > within the 12 seconds, and I couldn't reproduce it.
>> >
>> > The discrete GPU needs to runtime suspend, that's the trigger,
>> > so no DRI_PRIME executables should be running.  Just let it
>> > autosuspend after boot.  Do you see "waiting 12 sec" messages
>> > in dmesg?  If not it's not autosuspending.
>>
>> Yes I'm seeing those messages, I'm just not seeing the hangs
>>
>> I've attached the dmesg in case you're interested
>
> Okay the reason you're not seeing deadlocks is because the output poll
> worker is not enabled.  And the output poll worker is not enabled
> because your discrete GPU doesn't have any outputs:
>
> [    0.265568] [drm:dc_create] *ERROR* DC: Number of connectors is zero!
>
> The outputs are only polled if there are connectors which have the
> DRM_CONNECTOR_POLL_CONNECT or DRM_CONNECTOR_POLL_DISCONNECT flag set.
> And that only ever seems to be the case for VGA and DVI.
>
> We know based on bugzilla reports that hybrid graphics laptops do exist
> which poll outputs with radeon and nouveau.  If there are no laptops
> supported by amdgpu whose discrete GPU has polled connectors, then
> patch [5/5] would be unnecessary.  That is for Alex to decide.

Most hybrid laptops don't have display connectors on the dGPU and we
only use polling on analog connectors, so you are not likely to run
into this on recent laptops.  That said, I don't know if there is some
OEM system out there with a VGA port on the dGPU in a hybrid laptop.
I guess another option is to just disable polling on hybrid laptops.

Alex

>
> However that is very good to know, so thanks a lot for your testing
> efforts, much appreciated!
>
> Kind regards,
>
> Lukas
Lukas Wunner Feb. 13, 2018, 8:17 a.m. UTC | #6
On Mon, Feb 12, 2018 at 01:58:32PM -0500, Alex Deucher wrote:
> On Mon, Feb 12, 2018 at 4:45 AM, Lukas Wunner <lukas@wunner.de> wrote:
> > On Mon, Feb 12, 2018 at 09:03:26AM +0000, Mike Lothian wrote:
> >> On 12 February 2018 at 03:39, Lukas Wunner <lukas@wunner.de> wrote:
> >> > On Mon, Feb 12, 2018 at 12:35:51AM +0000, Mike Lothian wrote:
> >> > > I've not been able to reproduce the original problem you're trying to
> >> > > solve on amdgpu; that's with or without your patch set, and with the
> >> > > above "trigger" too.
> >
> > Okay the reason you're not seeing deadlocks is because the output poll
> > worker is not enabled.  And the output poll worker is not enabled
> > because your discrete GPU doesn't have any outputs:
> >
> > [    0.265568] [drm:dc_create] *ERROR* DC: Number of connectors is zero!
> >
> > The outputs are only polled if there are connectors which have the
> > DRM_CONNECTOR_POLL_CONNECT or DRM_CONNECTOR_POLL_DISCONNECT flag set.
> > And that only ever seems to be the case for VGA and DVI.
> >
> > We know based on bugzilla reports that hybrid graphics laptops do exist
> > which poll outputs with radeon and nouveau.  If there are no laptops
> > supported by amdgpu whose discrete GPU has polled connectors, then
> > patch [5/5] would be unnecessary.  That is for Alex to decide.
> 
> Most hybrid laptops don't have display connectors on the dGPU and we
> only use polling on analog connectors, so you are not likely to run
> into this on recent laptops.  That said, I don't know if there is some
> OEM system out there with a VGA port on the dGPU in a hybrid laptop.
> I guess another option is to just disable polling on hybrid laptops.

If we don't know for sure, applying patch [5/5] would seem to be the
safest approach.  (Assuming it doesn't break anything else.)

Right now runtime PM is only used on hybrid graphics dGPUs by nouveau,
radeon and amdgpu.  Would it be conceivable that its use is expanded
beyond that in the future?  E.g. on a desktop machine, if DPMS is off
on all screens, why keep the GPU in D0?  If that is conceivable, chances
that analog connectors are present are higher, and then the patch would
be necessary again.  (Of course this would mean that analog screens
wouldn't light up automatically if they're attached while the GPU is
in D3hot, but the user may forbid runtime PM via sysfs if that is
unwanted.)
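
For reference, forbidding runtime PM via sysfs means writing "on" to the device's power/control attribute, and writing "auto" allows it again. A minimal userspace sketch, assuming the caller passes a path such as /sys/bus/pci/devices/0000:01:00.0/power/control; the helper name set_runtime_pm_control() is made up for illustration, and writing the real attribute needs root.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write "on" (forbid runtime PM, keep the device active) or "auto"
 * (allow runtime PM) to a power/control attribute.
 * Returns 0 on success, -1 on an invalid mode or an I/O error.
 * (Hypothetical helper; name and signature are illustrative only.)
 */
int set_runtime_pm_control(const char *path, const char *mode)
{
	FILE *f;
	int ok;

	assert(path && mode);

	/* The attribute only understands these two values. */
	if (strcmp(mode, "on") != 0 && strcmp(mode, "auto") != 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fputs(mode, f) >= 0;
	/* The write may only reach the kernel on flush, so check fclose(). */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

With "on" written, the GPU stays in D0, so an analog screen attached later would still be picked up by the poll worker, at the cost of the power savings.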

Thanks,

Lukas
Alex Deucher Feb. 13, 2018, 3:19 p.m. UTC | #7
On Tue, Feb 13, 2018 at 3:17 AM, Lukas Wunner <lukas@wunner.de> wrote:
> On Mon, Feb 12, 2018 at 01:58:32PM -0500, Alex Deucher wrote:
>> On Mon, Feb 12, 2018 at 4:45 AM, Lukas Wunner <lukas@wunner.de> wrote:
>> > On Mon, Feb 12, 2018 at 09:03:26AM +0000, Mike Lothian wrote:
>> >> On 12 February 2018 at 03:39, Lukas Wunner <lukas@wunner.de> wrote:
>> >> > On Mon, Feb 12, 2018 at 12:35:51AM +0000, Mike Lothian wrote:
>> >> > > I've not been able to reproduce the original problem you're trying to
>> >> > > solve on amdgpu; that's with or without your patch set, and with the
>> >> > > above "trigger" too.
>> >
>> > Okay the reason you're not seeing deadlocks is because the output poll
>> > worker is not enabled.  And the output poll worker is not enabled
>> > because your discrete GPU doesn't have any outputs:
>> >
>> > [    0.265568] [drm:dc_create] *ERROR* DC: Number of connectors is zero!
>> >
>> > The outputs are only polled if there are connectors which have the
>> > DRM_CONNECTOR_POLL_CONNECT or DRM_CONNECTOR_POLL_DISCONNECT flag set.
>> > And that only ever seems to be the case for VGA and DVI.
>> >
>> > We know based on bugzilla reports that hybrid graphics laptops do exist
>> > which poll outputs with radeon and nouveau.  If there are no laptops
>> > supported by amdgpu whose discrete GPU has polled connectors, then
>> > patch [5/5] would be unnecessary.  That is for Alex to decide.
>>
>> Most hybrid laptops don't have display connectors on the dGPU and we
>> only use polling on analog connectors, so you are not likely to run
>> into this on recent laptops.  That said, I don't know if there is some
>> OEM system out there with a VGA port on the dGPU in a hybrid laptop.
>> I guess another option is to just disable polling on hybrid laptops.
>
> If we don't know for sure, applying patch [5/5] would seem to be the
> safest approach.  (Assuming it doesn't break anything else.)


I don't have any objections.  I see no reason to leave out the amdgpu changes.

Alex

Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 50afcf6..beaaf2c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -36,6 +36,7 @@ 
 
 #include <drm/drm_pciids.h>
 #include <linux/console.h>
+#include <linux/delay.h>
 #include <linux/module.h>
 #include <linux/pm_runtime.h>
 #include <linux/vga_switcheroo.h>
@@ -718,6 +719,9 @@  static int amdgpu_pmops_runtime_suspend(struct device *dev)
 		return -EBUSY;
 	}
 
+	printk("waiting 12 sec\n");
+	msleep(12*1000);
+	printk("done waiting 12 sec\n");
 	drm_dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
 	drm_kms_helper_poll_disable(drm_dev);
 	vga_switcheroo_set_dynamic_switch(pdev, VGA_SWITCHEROO_OFF);
diff --git a/drivers/gpu/drm/drm_probe_helper.c b/drivers/gpu/drm/drm_probe_helper.c
index 555fbe5..ee7cf0d 100644
--- a/drivers/gpu/drm/drm_probe_helper.c
+++ b/drivers/gpu/drm/drm_probe_helper.c
@@ -586,6 +586,7 @@  static void output_poll_execute(struct work_struct *work)
 		repoll = true;
 		goto out;
 	}
+	dev_info(&dev->pdev->dev, "begin poll\n");
 
 	drm_connector_list_iter_begin(dev, &conn_iter);
 	drm_for_each_connector_iter(connector, &conn_iter) {
@@ -651,6 +652,7 @@  static void output_poll_execute(struct work_struct *work)
 
 	if (repoll)
 		schedule_delayed_work(delayed_work, DRM_OUTPUT_POLL_PERIOD);
+	dev_info(&dev->pdev->dev, "end poll\n");
 }
 
 /**
diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
index 3e29302..f9da5bc 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -855,6 +855,9 @@  static int nouveau_drm_probe(struct pci_dev *pdev,
 		return -EBUSY;
 	}
 
+	printk("waiting 12 sec\n");
+	msleep(12*1000);
+	printk("done waiting 12 sec\n");
 	drm_kms_helper_poll_disable(drm_dev);
 	vga_switcheroo_set_dynamic_switch(pdev, VGA_SWITCHEROO_OFF);
 	nouveau_switcheroo_optimus_dsm();
diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c
index 31dd04f..2b4e7e0 100644
--- a/drivers/gpu/drm/radeon/radeon_drv.c
+++ b/drivers/gpu/drm/radeon/radeon_drv.c
@@ -35,6 +35,7 @@ 
 
 #include <drm/drm_pciids.h>
 #include <linux/console.h>
+#include <linux/delay.h>
 #include <linux/module.h>
 #include <linux/pm_runtime.h>
 #include <linux/vga_switcheroo.h>
@@ -413,6 +414,9 @@  static int radeon_pmops_runtime_suspend(struct device *dev)
 		return -EBUSY;
 	}
 
+	printk("waiting 12 sec\n");
+	msleep(12*1000);
+	printk("done waiting 12 sec\n");
 	drm_dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
 	drm_kms_helper_poll_disable(drm_dev);
 	vga_switcheroo_set_dynamic_switch(pdev, VGA_SWITCHEROO_OFF);