diff mbox

[v2] drm/i915: Cancel the hotplug work when unregistering the connector

Message ID 20171006171841.24492-1-chris@chris-wilson.co.uk (mailing list archive)
State New, archived
Headers show

Commit Message

Chris Wilson Oct. 6, 2017, 5:18 p.m. UTC
When we unregister the connector, we may have a pending hotplug work.
This needs to be cancel early during the teardown so that it does not
fire after we have freed the connector. Or else we may see something like:

 DEBUG_LOCKS_WARN_ON(mutex_is_locked(lock))
 ------------[ cut here ]------------
 WARNING: CPU: 4 PID: 5010 at kernel/locking/mutex-debug.c:103 mutex_destroy+0x4e/0x60
 Modules linked in: i915(-) snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec snd_hwdep snd_hda_core snd_pcm vgem ax88179_178a usbnet mii x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel e1000e ptp pps_core prime_numbers i2c_hid [last unloaded: snd_hda_intel]
 CPU: 4 PID: 5010 Comm: drv_module_relo Tainted: G     U          4.14.0-rc3-CI-CI_DRM_3186+ #1
 Hardware name: Intel Corporation CoffeeLake Client Platform/CoffeeLake S UDIMM RVP, BIOS CNLSFWX1.R00.X104.A03.1709140524 09/14/2017
 task: ffff8803c827aa40 task.stack: ffffc90000520000
 RIP: 0010:mutex_destroy+0x4e/0x60
 RSP: 0018:ffffc90000523d58 EFLAGS: 00010292
 RAX: 000000000000002a RBX: ffff88044fbef648 RCX: 0000000000000000
 RDX: 0000000080000001 RSI: 0000000000000001 RDI: ffffffff810f0cf0
 RBP: ffffc90000523d60 R08: 0000000000000001 R09: 0000000000000001
 R10: 000000000f21cb81 R11: 0000000000000000 R12: ffff88044f71efc8
 R13: ffffffffa02b3d20 R14: ffffffffa02b3d90 R15: ffff880459b29308
 FS:  00007f5df4d6e8c0(0000) GS:ffff88045d300000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000055ec51f00a18 CR3: 0000000451782006 CR4: 00000000003606e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  drm_fb_helper_fini+0xd9/0x130
  intel_fbdev_destroy+0x12/0x60 [i915]
  intel_fbdev_fini+0x28/0x30 [i915]
  intel_modeset_cleanup+0x45/0xa0 [i915]
  i915_driver_unload+0x92/0x180 [i915]
  i915_pci_remove+0x19/0x30 [i915]
  pci_device_remove+0x39/0xb0
  device_release_driver_internal+0x15d/0x220
  driver_detach+0x40/0x80
  bus_remove_driver+0x58/0xd0
  driver_unregister+0x2c/0x40
  pci_unregister_driver+0x36/0xb0
  i915_exit+0x1a/0x8b [i915]
  SyS_delete_module+0x18c/0x1e0
  entry_SYSCALL_64_fastpath+0x1c/0xb1
 RIP: 0033:0x7f5df3286287
 RSP: 002b:00007fff8e107cc8 EFLAGS: 00000246 ORIG_RAX: 00000000000000b0
 RAX: ffffffffffffffda RBX: ffffffff81493a03 RCX: 00007f5df3286287
 RDX: 0000000000000001 RSI: 0000000000000800 RDI: 0000564c7be02e48
 RBP: ffffc90000523f88 R08: 0000000000000000 R09: 0000000000000080
 R10: 00007f5df4d6e8c0 R11: 0000000000000246 R12: 0000000000000000
 R13: 00007fff8e107eb0 R14: 0000000000000000 R15: 0000000000000000
  ? __this_cpu_preempt_check+0x13/0x20
 Code: 00 00 5b 5d c3 e8 93 b9 3a 00 85 c0 74 ec 8b 05 e1 53 c3 01 85 c0 75 e2 48 c7 c6 86 a6 c7 81 48 c7 c7 8b 8d c6 81 e8 03 ae 01 00 <0f> ff eb cb 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 55 48 b8
 ---[ end trace 08901ff1a77d30c6 ]---
 [drm:wait_panel_status [i915]] mask b800000f value 00000000 status 00000000 control 00000060
 [drm:wait_panel_status [i915]] Wait complete
 [drm:edp_panel_vdd_on [i915]] PP_STATUS: 0x00000000 PP_CONTROL: 0x00000068
 [drm:edp_panel_vdd_on [i915]] eDP port A panel power wasn't enabled
 [drm:drm_dp_read_desc] DP sink: OUI 00-1c-f8 dev-ID  HW-rev 0.0 SW-rev 7.49 quirks 0x0000
 [drm:drm_edid_to_eld] ELD: no CEA Extension found
 [drm:drm_helper_probe_single_connector_modes] [CONNECTOR:48:eDP-1] probed modes :
 [drm:drm_mode_debug_printmodeline] Modeline 49:"1920x1080" 60 138780 1920 1966 1996 2080 1080 1082 1086 1112 0x48 0xa
 [drm:drm_mode_debug_printmodeline] Modeline 50:"1920x1080" 40 92520 1920 1966 1996 2080 1080 1082 1086 1112 0x40 0xa
 general protection fault: 0000 [#1] PREEMPT SMP
 Modules linked in: i915(-) snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec snd_hwdep snd_hda_core snd_pcm vgem ax88179_178a usbnet mii x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel e1000e ptp pps_core prime_numbers i2c_hid [last unloaded: snd_hda_intel]
 CPU: 0 PID: 82 Comm: kworker/0:1 Tainted: G     U  W       4.14.0-rc3-CI-CI_DRM_3186+ #1
 Hardware name: Intel Corporation CoffeeLake Client Platform/CoffeeLake S UDIMM RVP, BIOS CNLSFWX1.R00.X104.A03.1709140524 09/14/2017
 Workqueue: events intel_dp_modeset_retry_work_fn [i915]
 task: ffff88045a5caa40 task.stack: ffffc90000378000
 RIP: 0010:drm_setup_crtcs+0x143/0xbf0
 RSP: 0018:ffffc9000037bd20 EFLAGS: 00010202
 RAX: 6b6b6b6b6b6b6b6b RBX: 0000000000000002 RCX: 0000000000000001
 RDX: 0000000000000001 RSI: 0000000000000780 RDI: 00000000ffffffff
 RBP: ffffc9000037bdb8 R08: 0000000000000001 R09: 0000000000000001
 R10: 0000000000000780 R11: 0000000000000000 R12: 0000000000000002
 R13: ffff88044fbef4e8 R14: 0000000000000780 R15: 0000000000000438
 FS:  0000000000000000(0000) GS:ffff88045d200000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000055ec51ee5168 CR3: 000000044c89d003 CR4: 00000000003606f0
 Call Trace:
  drm_fb_helper_hotplug_event.part.18+0x7e/0xc0
  drm_fb_helper_hotplug_event+0x1a/0x20
  intel_fbdev_output_poll_changed+0x1a/0x20 [i915]
  drm_kms_helper_hotplug_event+0x27/0x30
  intel_dp_modeset_retry_work_fn+0x77/0x80 [i915]
  process_one_work+0x233/0x660
  worker_thread+0x206/0x3b0
  kthread+0x152/0x190
  ? process_one_work+0x660/0x660
  ? kthread_create_on_node+0x40/0x40
  ret_from_fork+0x27/0x40
 Code: 06 00 00 45 8b 45 20 31 db 45 31 e4 45 85 c0 0f 8e 91 06 00 00 44 8b 75 94 44 8b 7d 90 49 8b 45 28 49 63 d4 44 89 f6 41 83 c4 01 <48> 8b 04 d0 44 89 fa 48 8b 38 48 8b 87 a8 01 00 00 ff 50 20 01
 RIP: drm_setup_crtcs+0x143/0xbf0 RSP: ffffc9000037bd20
 ---[ end trace 08901ff1a77d30c7 ]---

v2: Only cancel if the work has been initialised.

Suggested-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Fixes: 9301397a63b3 ("drm/i915: Implement Link Rate fallback on Link training failure")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tony Cheng <tony.cheng@amd.com>
Cc: Harry Wentland <Harry.wentland@amd.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
Cc: Manasi Navare <manasi.d.navare@intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
---

I'm not sure if this cancellation is early enough in the sequence,
considering it's called modeset_retry_work I presume we should perhaps
be doing this inside an intel_modeset function.

---
 drivers/gpu/drm/i915/intel_display.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Maarten Lankhorst Oct. 9, 2017, 9:41 a.m. UTC | #1
Op 06-10-17 om 19:18 schreef Chris Wilson:
> When we unregister the connector, we may have a pending hotplug work.
> This needs to be cancel early during the teardown so that it does not
> fire after we have freed the connector. Or else we may see something like:
Well the nice thing is even if it's called modeset_retry_work, it just sets the link status to bad for DP.

I worry it might be too early, wouldn't intel_dp_connector_destroy be a better place? At that point we know userspace can no longer use it,
because the last reference has been removed.

~Maarten
Chris Wilson Oct. 9, 2017, 9:59 a.m. UTC | #2
Quoting Maarten Lankhorst (2017-10-09 10:41:29)
> Op 06-10-17 om 19:18 schreef Chris Wilson:
> > When we unregister the connector, we may have a pending hotplug work.
> > This needs to be cancel early during the teardown so that it does not
> > fire after we have freed the connector. Or else we may see something like:
> Well the nice thing is even if it's called modeset_retry_work, it just sets the link status to bad for DP.

Ok, and sends a hotplug event. At what point in the shutdown sequence
does that drm_kms_helper_hotplug_event() become invalid?

> I worry it might be too early, wouldn't intel_dp_connector_destroy be a better place? At that point we know userspace can no longer use it,
> because the last reference has been removed.

connector_destroy is after drm_kms_helper_poll_fini(), so that seems
suspect given a query about drm_kms_helper_hotplug_event()

A hook from drm_atomic_helper_shutdown? Extending unregister to have a
late phase?
-Chris
Maarten Lankhorst Oct. 9, 2017, 10:33 a.m. UTC | #3
Op 09-10-17 om 11:59 schreef Chris Wilson:
> Quoting Maarten Lankhorst (2017-10-09 10:41:29)
>> Op 06-10-17 om 19:18 schreef Chris Wilson:
>>> When we unregister the connector, we may have a pending hotplug work.
>>> This needs to be cancel early during the teardown so that it does not
>>> fire after we have freed the connector. Or else we may see something like:
>> Well the nice thing is even if it's called modeset_retry_work, it just sets the link status to bad for DP.
> Ok, and sends a hotplug event. At what point in the shutdown sequence
> does that drm_kms_helper_hotplug_event() become invalid?
Some digging makes me suspect at the very end of driver unload. So for this either is fine. But after some more digging, it's not enough, see below.
>> I worry it might be too early, wouldn't intel_dp_connector_destroy be a better place? At that point we know userspace can no longer use it,
>> because the last reference has been removed.
> connector_destroy is after drm_kms_helper_poll_fini(), so that seems
> suspect given a query about drm_kms_helper_hotplug_event()
>
> A hook from drm_atomic_helper_shutdown? Extending unregister to have a
> late phase?
>
Well after some more digging the only case where early_unregister vs unregister matters is DP-MST, where we can lose connectors dynamically.
As long as we never use modeset_retry_work on DP-MST connectors, everything is fine. Fortunately we don't do that, only used in DP connectors for now.

But if you look at the backtrace, the first error is from intel_fbdev_fini, so we need to cancel the work after poll_fini, and before fbdev_fini.

Fixing it in early_unregister is simply too late..

~Maarten
Chris Wilson Oct. 25, 2017, 4:52 p.m. UTC | #4
Quoting Maarten Lankhorst (2017-10-09 11:33:21)
> Op 09-10-17 om 11:59 schreef Chris Wilson:
> > Quoting Maarten Lankhorst (2017-10-09 10:41:29)
> >> Op 06-10-17 om 19:18 schreef Chris Wilson:
> >>> When we unregister the connector, we may have a pending hotplug work.
> >>> This needs to be cancel early during the teardown so that it does not
> >>> fire after we have freed the connector. Or else we may see something like:
> >> Well the nice thing is even if it's called modeset_retry_work, it just sets the link status to bad for DP.
> > Ok, and sends a hotplug event. At what point in the shutdown sequence
> > does that drm_kms_helper_hotplug_event() become invalid?
> Some digging makes me suspect at the very end of driver unload. So for this either is fine. But after some more digging, it's not enough, see below.
> >> I worry it might be too early, wouldn't intel_dp_connector_destroy be a better place? At that point we know userspace can no longer use it,
> >> because the last reference has been removed.
> > connector_destroy is after drm_kms_helper_poll_fini(), so that seems
> > suspect given a query about drm_kms_helper_hotplug_event()
> >
> > A hook from drm_atomic_helper_shutdown? Extending unregister to have a
> > late phase?
> >
> Well after some more digging the only case where early_unregister vs unregister matters is DP-MST, where we can lose connectors dynamically.
> As long as we never use modeset_retry_work on DP-MST connectors, everything is fine. Fortunately we don't do that, only used in DP connectors for now.
> 
> But if you look at the backtrace, the first error is from intel_fbdev_fini, so we need to cancel the work after poll_fini, and before fbdev_fini.
> 
> Fixing it in early_unregister is simply too late..

early_unregister is before fbdev_fini. You may mean that it's too early?

Anyway, shall we just revert

commit 9301397a63b3bf1090dffe846c6f1c8efa032236
Author: Manasi Navare <manasi.d.navare@intel.com>
Date:   Thu Apr 6 16:44:19 2017 +0300

    drm/i915: Implement Link Rate fallback on Link training failure

as the authors seem not to care about the kernel oopses?
-Chris
Navare, Manasi Oct. 25, 2017, 7:23 p.m. UTC | #5
On Wed, Oct 25, 2017 at 05:52:35PM +0100, Chris Wilson wrote:
> Quoting Maarten Lankhorst (2017-10-09 11:33:21)
> > Op 09-10-17 om 11:59 schreef Chris Wilson:
> > > Quoting Maarten Lankhorst (2017-10-09 10:41:29)
> > >> Op 06-10-17 om 19:18 schreef Chris Wilson:
> > >>> When we unregister the connector, we may have a pending hotplug work.
> > >>> This needs to be cancel early during the teardown so that it does not
> > >>> fire after we have freed the connector. Or else we may see something like:
> > >> Well the nice thing is even if it's called modeset_retry_work, it just sets the link status to bad for DP.
> > > Ok, and sends a hotplug event. At what point in the shutdown sequence
> > > does that drm_kms_helper_hotplug_event() become invalid?
> > Some digging makes me suspect at the very end of driver unload. So for this either is fine. But after some more digging, it's not enough, see below.
> > >> I worry it might be too early, wouldn't intel_dp_connector_destroy be a better place? At that point we know userspace can no longer use it,
> > >> because the last reference has been removed.
> > > connector_destroy is after drm_kms_helper_poll_fini(), so that seems
> > > suspect given a query about drm_kms_helper_hotplug_event()
> > >
> > > A hook from drm_atomic_helper_shutdown? Extending unregister to have a
> > > late phase?
> > >
> > Well after some more digging the only case where early_unregister vs unregister matters is DP-MST, where we can lose connectors dynamically.
> > As long as we never use modeset_retry_work on DP-MST connectors, everything is fine. Fortunately we don't do that, only used in DP connectors for now.
> >

This function actually would also get called for MST connectors since it gets scheduled anytime intel_dp_start_link_train() fails
and that can happen even for DP MST connectors.

This work function will get scheduled after the existing modeset is completed. So if the driver is unloaded before that, then
why dont we cancel this work first thing in the intel_modeset_cleanup()?
Now the question what if it was already scheduled and it has sent the hotplug event, how can we make sure the hotplu event is killed?

 
> > But if you look at the backtrace, the first error is from intel_fbdev_fini, so we need to cancel the work after poll_fini, and before fbdev_fini.
> > 
> > Fixing it in early_unregister is simply too late..
> 
> early_unregister is before fbdev_fini. You may mean that it's too early?
> 
> Anyway, shall we just revert

I dont think reverting this patch is gonna help since its gonna start
introducing random black screens as a result of unaddressed link failures that this
patch attempts on fixing.

> 
> commit 9301397a63b3bf1090dffe846c6f1c8efa032236
> Author: Manasi Navare <manasi.d.navare@intel.com>
> Date:   Thu Apr 6 16:44:19 2017 +0300
> 
>     drm/i915: Implement Link Rate fallback on Link training failure
> 
> as the authors seem not to care about the kernel oopses?
> -Chris

On it to find a proper place for adding this cancel work call to fix the kernel oopses.
Thanks for catching this.

Manasi
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 9f2bf3b3f759..32c46444aa69 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -15238,6 +15238,9 @@  void intel_connector_unregister(struct drm_connector *connector)
 {
 	struct intel_connector *intel_connector = to_intel_connector(connector);
 
+	if (intel_connector->modeset_retry_work.func)
+		cancel_work_sync(&intel_connector->modeset_retry_work);
+
 	intel_backlight_device_unregister(intel_connector);
 	intel_panel_destroy_backlight(connector);
 }