diff mbox

4.10-rc2 oops in DRM connector code

Message ID 7fd16549-1349-a9e5-ceff-9aa6f748caae@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Dave Hansen Jan. 5, 2017, 7:03 p.m. UTC
My Thinkpad x260 doesn't like to be unplugged from its dock.  I don't
think this is a new bug.  It's happening on my distro's 4.4 kernel
as well.

The actual oops is in device_del().  It appears to have been passed a
null 'struct device *'.

There appears to have been a race _around_ here fixed in 1f7717552e.
I've looked for and tried to find the locking that prevents
drm_connector_unregister() from being called twice concurrently.  I'm
unable to find anything.

drm_dp_destroy_connector_work() has some locking that looks useful:

	mutex_lock(&mgr->destroy_connector_lock)

but it's released before the offending call:

	mgr->cbs->destroy_connector(mgr, port->connector);

which actually calls intel_dp_destroy_mst_connector().  I have no idea
if it's correct (and haven't even run it with lockdep), but the attached
patch does seem to fix my oopses.

Any ideas?

> Jan  5 10:22:32 ray kernel: [  537.087042] BUG: unable to handle kernel NULL pointer dereference at 000000000000009e
> Jan  5 10:22:32 ray kernel: [  537.087954] IP: device_del+0x19/0x330
> Jan  5 10:22:32 ray kernel: [  537.088860] PGD 0
> Jan  5 10:22:32 ray kernel: [  537.088860]
> Jan  5 10:22:32 ray kernel: [  537.090578] Oops: 0000 [#1] SMP
> Jan  5 10:22:32 ray kernel: [  537.091406] Modules linked in: ctr ccm ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp bridge stp llc iptable_filter ip_tables ebtable_nat ebtables x_tables cmac rfcomm bnep dm_crypt arc4 iwlmvm mac80211 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic iwlwifi intel_rapl snd_hda_intel iosf_mbi hid_logitech_hidpp snd_seq_midi cfg80211 x86_pkg_temp_thermal snd_hda_codec snd_seq_midi_event snd_hwdep btusb snd_rawmidi snd_hda_core btrtl coretemp snd_seq snd_pcm btbcm btintel joydev bluetooth ghash_clmulni_intel snd_timer shpchp thinkpad_acpi snd_seq_device nvram wmi snd soundcore mac_hid aesni_intel aes_x86_64 crypto_simd cryptd glue_helper kvm_intel
> Jan  5 10:22:32 ray kernel: [  537.095222]  kvm irqbypass hid_generic hid_logitech_dj usbhid hid
> Jan  5 10:22:32 ray kernel: [  537.096272] CPU: 2 PID: 23 Comm: kworker/2:0 Tainted: G        W       4.10.0-rc2 #47
> Jan  5 10:22:32 ray kernel: [  537.097263] Hardware name: LENOVO 20F5S7V800/20F5S7V800, BIOS R02ET50W (1.23 ) 09/20/2016
> Jan  5 10:22:32 ray kernel: [  537.098291] Workqueue: events drm_dp_destroy_connector_work
> Jan  5 10:22:32 ray kernel: [  537.099328] task: ffff88040f2f1e00 task.stack: ffffc9000198c000
> Jan  5 10:22:32 ray kernel: [  537.100335] RIP: 0010:device_del+0x19/0x330
> Jan  5 10:22:32 ray kernel: [  537.101340] RSP: 0018:ffffc9000198fd58 EFLAGS: 00010282
> Jan  5 10:22:32 ray kernel: [  537.102361] RAX: 0000000000000000 RBX: fffffffffffffffe RCX: ffff88040c5191b0
> Jan  5 10:22:32 ray kernel: [  537.103418] RDX: ffffffff81cb6246 RSI: 0000000000000001 RDI: fffffffffffffffe
> Jan  5 10:22:32 ray kernel: [  537.104473] RBP: ffffc9000198fd90 R08: 0000000000000000 R09: ffff880421517780
> Jan  5 10:22:32 ray kernel: [  537.105574] R10: 0000007d0ce17c93 R11: 0000000000000001 R12: fffffffffffffffe
> Jan  5 10:22:32 ray kernel: [  537.106636] R13: ffff88040ed36bd8 R14: ffff88040ed36788 R15: ffff88040c728810
> Jan  5 10:22:32 ray kernel: [  537.107728] FS:  0000000000000000(0000) GS:ffff880421500000(0000) knlGS:0000000000000000
> Jan  5 10:22:32 ray kernel: [  537.108812] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Jan  5 10:22:32 ray kernel: [  537.109937] CR2: 000000000000009e CR3: 0000000384894000 CR4: 00000000003406e0
> Jan  5 10:22:32 ray kernel: [  537.111038] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Jan  5 10:22:32 ray kernel: [  537.112142] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Jan  5 10:22:32 ray kernel: [  537.113223] Call Trace:
> Jan  5 10:22:32 ray kernel: [  537.114293]  device_unregister+0x12/0x30
> Jan  5 10:22:32 ray kernel: [  537.115354]  drm_sysfs_connector_remove+0x3b/0x50
> Jan  5 10:22:32 ray kernel: [  537.116391]  drm_connector_unregister.part.8+0x27/0x40
> Jan  5 10:22:32 ray kernel: [  537.117433]  drm_connector_unregister+0x14/0x20
> Jan  5 10:22:32 ray kernel: [  537.118478]  intel_dp_destroy_mst_connector+0x1a/0x80
> Jan  5 10:22:32 ray kernel: [  537.119513]  drm_dp_destroy_connector_work+0xa9/0x150
> Jan  5 10:22:32 ray kernel: [  537.120539]  process_one_work+0x14b/0x430
> Jan  5 10:22:32 ray kernel: [  537.121568]  worker_thread+0x12b/0x4a0
> Jan  5 10:22:32 ray kernel: [  537.122581]  kthread+0x10c/0x140
> Jan  5 10:22:32 ray kernel: [  537.123583]  ? process_one_work+0x430/0x430
> Jan  5 10:22:32 ray kernel: [  537.124584]  ? kthread_create_on_node+0x40/0x40
> Jan  5 10:22:32 ray kernel: [  537.125574]  ret_from_fork+0x27/0x40
> Jan  5 10:22:32 ray kernel: [  537.126562] Code: 00 00 00 00 00 00 00 5b 41 5c 41 5d 5d c3 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 <48> 8b 87 a0 00 00 00 4c 8b 2f 48 85 c0 74 1b 48 8b b8 90 00 00
> Jan  5 10:22:32 ray kernel: [  537.127644] RIP: device_del+0x19/0x330 RSP: ffffc9000198fd58
> Jan  5 10:22:32 ray kernel: [  537.128690] CR2: 000000000000009e
> Jan  5 10:22:32 ray kernel: [  537.129759] ---[ end trace 7e17c77627e8f513 ]---

Comments

Daniel Vetter Jan. 9, 2017, 10:15 a.m. UTC | #1
On Thu, Jan 05, 2017 at 11:03:44AM -0800, Dave Hansen wrote:
> My Thinkpad x260 doesn't like to be unplugged from its dock.  I don't
> think this is a new bug.  It's happening on my distro's 4.4 kernel
> as well.
> 
> The actual oops is in device_del().  It appears to have been passed a
> null 'struct device *'.
> 
> There appears to have been a race _around_ here fixed in 1f7717552e.
> I've looked for and tried to find the locking that prevents
> drm_connector_unregister() from being called twice concurrently.  I'm
> unable to find anything.
> 
> drm_dp_destroy_connector_work() has some locking that looks useful:
> 
> 	mutex_lock(&mgr->destroy_connector_lock)
> 
> but it's released before the offending call:
> 
> 	mgr->cbs->destroy_connector(mgr, port->connector);
> 
> which actually calls intel_dp_destroy_mst_connector().  I have no idea
> if it's correct (and haven't even run it with lockdep), but the attached
> patch does seem to fix my oopses.
> 
> Any ideas?
> 
> > Jan  5 10:22:32 ray kernel: [  537.087042] BUG: unable to handle kernel NULL pointer dereference at 000000000000009e
> > Jan  5 10:22:32 ray kernel: [  537.087954] IP: device_del+0x19/0x330
> > Jan  5 10:22:32 ray kernel: [  537.088860] PGD 0
> > Jan  5 10:22:32 ray kernel: [  537.088860]
> > Jan  5 10:22:32 ray kernel: [  537.090578] Oops: 0000 [#1] SMP
> > Jan  5 10:22:32 ray kernel: [  537.091406] Modules linked in: ctr ccm ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp bridge stp llc iptable_filter ip_tables ebtable_nat ebtables x_tables cmac rfcomm bnep dm_crypt arc4 iwlmvm mac80211 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic iwlwifi intel_rapl snd_hda_intel iosf_mbi hid_logitech_hidpp snd_seq_midi cfg80211 x86_pkg_temp_thermal snd_hda_codec snd_seq_midi_event snd_hwdep btusb snd_rawmidi snd_hda_core btrtl coretemp snd_seq snd_pcm btbcm btintel joydev bluetooth ghash_clmulni_intel snd_timer shpchp thinkpad_acpi snd_seq_device nvram wmi snd soundcore mac_hid aesni_intel aes_x86_64 crypto_simd cryptd glue_helper kvm_intel
> > Jan  5 10:22:32 ray kernel: [  537.095222]  kvm irqbypass hid_generic hid_logitech_dj usbhid hid
> > Jan  5 10:22:32 ray kernel: [  537.096272] CPU: 2 PID: 23 Comm: kworker/2:0 Tainted: G        W       4.10.0-rc2 #47
> > Jan  5 10:22:32 ray kernel: [  537.097263] Hardware name: LENOVO 20F5S7V800/20F5S7V800, BIOS R02ET50W (1.23 ) 09/20/2016
> > Jan  5 10:22:32 ray kernel: [  537.098291] Workqueue: events drm_dp_destroy_connector_work
> > Jan  5 10:22:32 ray kernel: [  537.099328] task: ffff88040f2f1e00 task.stack: ffffc9000198c000
> > Jan  5 10:22:32 ray kernel: [  537.100335] RIP: 0010:device_del+0x19/0x330
> > Jan  5 10:22:32 ray kernel: [  537.101340] RSP: 0018:ffffc9000198fd58 EFLAGS: 00010282
> > Jan  5 10:22:32 ray kernel: [  537.102361] RAX: 0000000000000000 RBX: fffffffffffffffe RCX: ffff88040c5191b0
> > Jan  5 10:22:32 ray kernel: [  537.103418] RDX: ffffffff81cb6246 RSI: 0000000000000001 RDI: fffffffffffffffe
> > Jan  5 10:22:32 ray kernel: [  537.104473] RBP: ffffc9000198fd90 R08: 0000000000000000 R09: ffff880421517780
> > Jan  5 10:22:32 ray kernel: [  537.105574] R10: 0000007d0ce17c93 R11: 0000000000000001 R12: fffffffffffffffe
> > Jan  5 10:22:32 ray kernel: [  537.106636] R13: ffff88040ed36bd8 R14: ffff88040ed36788 R15: ffff88040c728810
> > Jan  5 10:22:32 ray kernel: [  537.107728] FS:  0000000000000000(0000) GS:ffff880421500000(0000) knlGS:0000000000000000
> > Jan  5 10:22:32 ray kernel: [  537.108812] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > Jan  5 10:22:32 ray kernel: [  537.109937] CR2: 000000000000009e CR3: 0000000384894000 CR4: 00000000003406e0
> > Jan  5 10:22:32 ray kernel: [  537.111038] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > Jan  5 10:22:32 ray kernel: [  537.112142] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Jan  5 10:22:32 ray kernel: [  537.113223] Call Trace:
> > Jan  5 10:22:32 ray kernel: [  537.114293]  device_unregister+0x12/0x30
> > Jan  5 10:22:32 ray kernel: [  537.115354]  drm_sysfs_connector_remove+0x3b/0x50
> > Jan  5 10:22:32 ray kernel: [  537.116391]  drm_connector_unregister.part.8+0x27/0x40
> > Jan  5 10:22:32 ray kernel: [  537.117433]  drm_connector_unregister+0x14/0x20
> > Jan  5 10:22:32 ray kernel: [  537.118478]  intel_dp_destroy_mst_connector+0x1a/0x80
> > Jan  5 10:22:32 ray kernel: [  537.119513]  drm_dp_destroy_connector_work+0xa9/0x150
> > Jan  5 10:22:32 ray kernel: [  537.120539]  process_one_work+0x14b/0x430
> > Jan  5 10:22:32 ray kernel: [  537.121568]  worker_thread+0x12b/0x4a0
> > Jan  5 10:22:32 ray kernel: [  537.122581]  kthread+0x10c/0x140
> > Jan  5 10:22:32 ray kernel: [  537.123583]  ? process_one_work+0x430/0x430
> > Jan  5 10:22:32 ray kernel: [  537.124584]  ? kthread_create_on_node+0x40/0x40
> > Jan  5 10:22:32 ray kernel: [  537.125574]  ret_from_fork+0x27/0x40
> > Jan  5 10:22:32 ray kernel: [  537.126562] Code: 00 00 00 00 00 00 00 5b 41 5c 41 5d 5d c3 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 <48> 8b 87 a0 00 00 00 4c 8b 2f 48 85 c0 74 1b 48 8b b8 90 00 00
> > Jan  5 10:22:32 ray kernel: [  537.127644] RIP: device_del+0x19/0x330 RSP: ffffc9000198fd58
> > Jan  5 10:22:32 ray kernel: [  537.128690] CR2: 000000000000009e
> > Jan  5 10:22:32 ray kernel: [  537.129759] ---[ end trace 7e17c77627e8f513 ]---

> diff --git a/drivers/gpu/drm/drm_dp_mst_topology.c b/drivers/gpu/drm/drm_dp_mst_topology.c
> index aa64448..85beebc 100644
> --- a/drivers/gpu/drm/drm_dp_mst_topology.c
> +++ b/drivers/gpu/drm/drm_dp_mst_topology.c
> @@ -2914,13 +2914,14 @@ static void drm_dp_destroy_connector_work(struct work_struct *work)
>  			break;
>  		}
>  		list_del(&port->next);
> -		mutex_unlock(&mgr->destroy_connector_lock);
>  
>  		kref_init(&port->kref);
>  		INIT_LIST_HEAD(&port->next);
>  
>  		mgr->cbs->destroy_connector(mgr, port->connector);
>  
> +		mutex_unlock(&mgr->destroy_connector_lock);

The lock here is just for port->next, and that should ensure that you're
double-releasing the same connector. We do still have lifetime issues with
connectors in 4.10 (getting fixed in 4.11, it's a bit a mess), but I don't
think those could be blamed for this oops.

The other funky thing is that this is from a worker, and it's the only
place that ever calls ->destroy_connector. It /should/ already be
single-threaded afaik connector destruction goes.

Can you pls do some printk tracing to make sure that without your patch
we're indeed releasing the same connector twice from this loop? I suspect
you're just ever-so-slightly shifting the timing and things blow up
somewhre else. But no idea where :(
-Daniel
Dave Hansen Jan. 9, 2017, 12:59 p.m. UTC | #2
On 01/09/2017 02:15 AM, Daniel Vetter wrote:
...
> Can you pls do some printk tracing to make sure that without your patch
> we're indeed releasing the same connector twice from this loop? I suspect
> you're just ever-so-slightly shifting the timing and things blow up
> somewhre else. But no idea where :(

Your suspicious appear correct.  I still get an oops even with my patch
applied.  I did add some printk's but they're weird.  I'm seeing a
'kdev' that looks to be -2, although it should never get set to that
unless device_create_with_groups() and th error checking failed somehow.

I'll take out the locking portion of the patch and add some more
printk's and see if anything else shows up.  Here's a copy of the
printk's and associated oops, with my locking modification patch
applied, if this helps at all:

> [ 7927.049763] drm_dp_destroy_port() kfree(ffff88040c759000)
> [ 7927.050039] drm_dp_destroy_connector_work() port: ffff88040c759800 connector: ffff88040c75a000
> [ 7927.050061] drm_sysfs_connector_remove() connector: ffff88040c75a000 kdev: fffffffffffffffe
> [ 7927.050106] BUG: unable to handle kernel NULL pointer dereference at 000000000000009e
> [ 7927.050242] IP: device_del+0x19/0x330
> [ 7927.050300] PGD 0 
> [ 7927.050302] 
> [ 7927.050364] Oops: 0000 [#1] SMP
> [ 7927.050414] Modules linked in: netconsole ctr ccm ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp bridge stp llc iptable_filter ip_tables ebtable_nat ebtables x_tables cmac rfcomm bnep dm_crypt arc4 iwlmvm mac80211 snd_hda_codec_hdmi iwlwifi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_seq_midi snd_seq_midi_event snd_hda_codec cfg80211 snd_hwdep snd_rawmidi intel_rapl snd_hda_core iosf_mbi snd_pcm snd_seq btusb x86_pkg_temp_thermal hid_logitech_hidpp btrtl btbcm btintel coretemp shpchp snd_timer ghash_clmulni_intel bluetooth joydev thinkpad_acpi snd_seq_device nvram wmi snd soundcore mac_hid aesni_intel aes_x86_64 crypto_simd cryptd glue_helper
> [ 7927.051492]  kvm_intel kvm irqbypass hid_generic hid_logitech_dj usbhid hid [last unloaded: netconsole]
> [ 7927.051642] CPU: 2 PID: 123 Comm: kworker/2:2 Tainted: G        W       4.10.0-rc2-dirty #53
> [ 7927.051763] Hardware name: LENOVO 20F5S7V800/20F5S7V800, BIOS R02ET50W (1.23 ) 09/20/2016
> [ 7927.051887] Workqueue: events drm_dp_destroy_connector_work
> [ 7927.051971] task: ffff88040c21bc00 task.stack: ffffc90002484000
> [ 7927.052065] RIP: 0010:device_del+0x19/0x330
> [ 7927.052129] RSP: 0018:ffffc90002487d58 EFLAGS: 00010282
> [ 7927.052209] RAX: 0000000000000000 RBX: fffffffffffffffe RCX: ffff88040c72d1f8
> [ 7927.052310] RDX: ffffffff81cb69b2 RSI: 0000000000000001 RDI: fffffffffffffffe
> [ 7927.052412] RBP: ffffc90002487d90 R08: 0000000000000000 R09: 0000000000000392
> [ 7927.052517] R10: 00000003e05de802 R11: 0000000000000000 R12: fffffffffffffffe
> [ 7927.052677] R13: ffff88040c516c18 R14: 0000000000000000 R15: ffff88040c516bd8
> [ 7927.052781] FS:  0000000000000000(0000) GS:ffff880421500000(0000) knlGS:0000000000000000
> [ 7927.052897] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 7927.052981] CR2: 000000000000009e CR3: 0000000001e0b000 CR4: 00000000003406e0
> [ 7927.053082] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 7927.053186] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 7927.053288] Call Trace:
> [ 7927.053338]  ? printk+0x4d/0x4f
> [ 7927.053396]  device_unregister+0x12/0x30
> [ 7927.053464]  drm_sysfs_connector_remove+0x57/0x70
> [ 7927.053538]  drm_connector_unregister.part.8+0x27/0x40
> [ 7927.053616]  drm_connector_unregister+0x14/0x20
> [ 7927.053690]  intel_dp_destroy_mst_connector+0x1a/0x80
> [ 7927.053771]  drm_dp_destroy_connector_work+0xce/0x150
> [ 7927.053850]  process_one_work+0x14b/0x430
> [ 7927.053914]  worker_thread+0x12b/0x4a0
> [ 7927.053982]  kthread+0x10c/0x140
> [ 7927.054037]  ? process_one_work+0x430/0x430
> [ 7927.054104]  ? kthread_create_on_node+0x40/0x40
> [ 7927.054179]  ret_from_fork+0x27/0x40
> [ 7927.054237] Code: 00 00 00 00 00 00 00 5b 41 5c 41 5d 5d c3 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 <48> 8b 87 a0 00 00 00 4c 8b 2f 48 85 c0 74 1b 48 8b b8 90 00 00 
> [ 7927.054462] RIP: device_del+0x19/0x330 RSP: ffffc90002487d58
> [ 7927.054486] CR2: 000000000000009e
> [ 7927.065278] ---[ end trace 1a2a5ca2cf6c3a4f ]---
Dave Hansen Jan. 9, 2017, 1:40 p.m. UTC | #3
Well, now I found where the -2 comes from.
intel_dp_register_mst_connector() calls drm_connector_register(), which
fails to add the kobject (warning below).  But, it does zero error
checking on the drm_connector_register() call and leaves the
partially-constructed connector in place.

The next time some poor, hapless code goes and tries to do anything with
that kdev, they oops.  I'm perplexed by this, though.  The
drm_dp_mst_topology_cbs->register_connector just returns void.  It seems
a bit goofy that it can't even _return_ failure.

Is there some stable code to go back to here?  Or, is there something
about my configuration that's unique?  I really wonder why nobody else
is running into this.

There's probably some other race going on here.  This warning doesn't
happen on every boot.

> [    1.396475] ------------[ cut here ]------------
> [    1.396486] WARNING: CPU: 0 PID: 91 at lib/kobject.c:244 kobject_add_internal+0x108/0x2d0
> [    1.396492] kobject_add_internal failed for card0-DP-3 (error: -2 parent: card0)
> [    1.396496] Modules linked in:
> [    1.396500] CPU: 0 PID: 91 Comm: kworker/0:2 Not tainted 4.10.0-rc2-dirty #58
> [    1.396503] Hardware name: LENOVO 20F5S7V800/20F5S7V800, BIOS R02ET50W (1.23 ) 09/20/2016
> [    1.396509] Workqueue: events_long drm_dp_mst_link_probe_work
> [    1.396511] Call Trace:
> [    1.396517]  dump_stack+0x67/0x99
> [    1.396521]  __warn+0xd1/0xf0
> [    1.396524]  warn_slowpath_fmt+0x4f/0x60
> [    1.396527]  kobject_add_internal+0x108/0x2d0
> [    1.396531]  ? vsnprintf+0x112/0x4b0
> [    1.396533]  kobject_add+0x65/0xb0
> [    1.396537]  ? klist_init+0x31/0x40
> [    1.396542]  device_add+0x102/0x5d0
> [    1.396546]  ? kfree_const+0x22/0x30
> [    1.396549]  device_create_groups_vargs+0xd8/0x100
> [    1.396553]  device_create_with_groups+0x36/0x40
> [    1.396557]  ? drm_fb_helper_add_one_connector+0x57/0xd0
> [    1.396561]  ? kmem_cache_alloc_trace+0x1d2/0x1f0
> [    1.396566]  drm_sysfs_connector_add+0x60/0xe0
> [    1.396570]  drm_connector_register+0x21/0xc0
> [    1.396575]  intel_dp_register_mst_connector+0x41/0x50
> [    1.396579]  drm_dp_add_port+0x350/0x450
> [    1.396583]  ? rcu_early_boot_tests+0x1/0x10
> [    1.396587]  ? schedule_timeout+0x1cd/0x390
> [    1.396591]  ? __might_sleep+0x4a/0x90
> [    1.396594]  ? mutex_lock+0x25/0x50
> [    1.396597]  ? drm_dp_mst_wait_tx_reply+0x118/0x1e0
> [    1.396600]  ? prepare_to_wait_event+0x120/0x120
> [    1.396603]  ? drm_dp_check_mstb_guid+0x3d/0x120
> [    1.396606] drm_sysfs_connector_add() connector: ffff88040ed3f000 kdev: ffff88040c55dc00
> [    1.396610]  drm_dp_send_link_address+0x185/0x1f0
> [    1.396614]  drm_dp_check_and_send_link_address+0xad/0xc0
> [    1.396617]  drm_dp_mst_link_probe_work+0x57/0xa0
> [    1.396620]  process_one_work+0x14b/0x430
> [    1.396623]  worker_thread+0x12b/0x4a0
> [    1.396626]  kthread+0x10c/0x140
> [    1.396629]  ? process_one_work+0x430/0x430
> [    1.396632]  ? kthread_create_on_node+0x40/0x40
> [    1.396635]  ret_from_fork+0x27/0x40
> [    1.396639] ---[ end trace dd05bb2cd9ebd900 ]---
> [    1.396643] drm_sysfs_connector_add() connector: ffff88040c75a000 kdev: fffffffffffffffe
> [    1.396648] [drm:drm_sysfs_connector_add] *ERROR* failed to register connector device: -2
> [    1.396651] drm_connector_register(ffff88040ed3f000) connector->registered: 1
> [    1.396666] drm_sysfs_connector_add() connector: ffff88040ed3f800 kdev: ffff88040c55e400
> [    1.396700] drm_connector_register(ffff88040ed3f800) connector->registered: 1
> [    1.396717] drm_sysfs_connector_add() connector: ffff88040ed39800 kdev: ffff88040c55e800
> [    1.396726] drm_connector_register(ffff88040ed39800) connector->registered: 1
> [    1.396743] drm_sysfs_connector_add() connector: ffff88040c758000 kdev: ffff88040c55ec00
> [    1.396769] drm_connector_register(ffff88040c758000) connector->registered: 1
> [    1.396786] drm_sysfs_connector_add() connector: ffff88040c758800 kdev: ffff88040c55f000
> [    1.396799] drm_connector_register(ffff88040c758800) connector->registered: 1
> [    1.396804] drm_connector_register(ffff88040c75a000) connector->registered: 1
> [    1.398665] ACPI: Video Device [GFX0] (multi-head: yes  rom: no  post: no)
> [    1.398893] drm_sysfs_connector_add() connector: ffff88040c75b000 kdev: ffff88040c790400
> [    1.398907] drm_connector_register(ffff88040c75b000) connector->registered: 1
> [    1.398930] drm_sysfs_connector_add() connector: ffff88040c75c000 kdev: ffff88040c790800
> [    1.398939] drm_connector_register(ffff88040c75c000) connector->registered: 1
Dave Hansen Jan. 9, 2017, 1:46 p.m. UTC | #4
On 01/09/2017 05:40 AM, Dave Hansen wrote:
> Is there some stable code to go back to here?  Or, is there something
> about my configuration that's unique?  I really wonder why nobody else
> is running into this.

Here are a couple of similar-looking reports, if that helps:

https://bugzilla.redhat.com/attachment.cgi?id=1209947
https://bugzilla.redhat.com/show_bug.cgi?id=1382342
http://marc.info/?l=openbsd-bugs&m=147487082821348&w=2
Daniel Vetter Jan. 9, 2017, 4:41 p.m. UTC | #5
On Mon, Jan 9, 2017 at 2:40 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> Well, now I found where the -2 comes from.
> intel_dp_register_mst_connector() calls drm_connector_register(), which
> fails to add the kobject (warning below).  But, it does zero error
> checking on the drm_connector_register() call and leaves the
> partially-constructed connector in place.
>
> The next time some poor, hapless code goes and tries to do anything with
> that kdev, they oops.  I'm perplexed by this, though.  The
> drm_dp_mst_topology_cbs->register_connector just returns void.  It seems
> a bit goofy that it can't even _return_ failure.
>
> Is there some stable code to go back to here?  Or, is there something
> about my configuration that's unique?  I really wonder why nobody else
> is running into this.
>
> There's probably some other race going on here.  This warning doesn't
> happen on every boot.

This smells more like the root-cause: Something goes wrong on boot
that prevents connectors from properly registering, then we fall over
later on. And the register callback is intentionally void, assuming
that any prep work has been done earlier and that therefore the
register step can't fail. Can you pls check whether the oops later on
only happens together with this warning at boot, or whether they're
not correlated?

Still no idea though why this blows up for you.
-Daniel

>
>> [    1.396475] ------------[ cut here ]------------
>> [    1.396486] WARNING: CPU: 0 PID: 91 at lib/kobject.c:244 kobject_add_internal+0x108/0x2d0
>> [    1.396492] kobject_add_internal failed for card0-DP-3 (error: -2 parent: card0)
>> [    1.396496] Modules linked in:
>> [    1.396500] CPU: 0 PID: 91 Comm: kworker/0:2 Not tainted 4.10.0-rc2-dirty #58
>> [    1.396503] Hardware name: LENOVO 20F5S7V800/20F5S7V800, BIOS R02ET50W (1.23 ) 09/20/2016
>> [    1.396509] Workqueue: events_long drm_dp_mst_link_probe_work
>> [    1.396511] Call Trace:
>> [    1.396517]  dump_stack+0x67/0x99
>> [    1.396521]  __warn+0xd1/0xf0
>> [    1.396524]  warn_slowpath_fmt+0x4f/0x60
>> [    1.396527]  kobject_add_internal+0x108/0x2d0
>> [    1.396531]  ? vsnprintf+0x112/0x4b0
>> [    1.396533]  kobject_add+0x65/0xb0
>> [    1.396537]  ? klist_init+0x31/0x40
>> [    1.396542]  device_add+0x102/0x5d0
>> [    1.396546]  ? kfree_const+0x22/0x30
>> [    1.396549]  device_create_groups_vargs+0xd8/0x100
>> [    1.396553]  device_create_with_groups+0x36/0x40
>> [    1.396557]  ? drm_fb_helper_add_one_connector+0x57/0xd0
>> [    1.396561]  ? kmem_cache_alloc_trace+0x1d2/0x1f0
>> [    1.396566]  drm_sysfs_connector_add+0x60/0xe0
>> [    1.396570]  drm_connector_register+0x21/0xc0
>> [    1.396575]  intel_dp_register_mst_connector+0x41/0x50
>> [    1.396579]  drm_dp_add_port+0x350/0x450
>> [    1.396583]  ? rcu_early_boot_tests+0x1/0x10
>> [    1.396587]  ? schedule_timeout+0x1cd/0x390
>> [    1.396591]  ? __might_sleep+0x4a/0x90
>> [    1.396594]  ? mutex_lock+0x25/0x50
>> [    1.396597]  ? drm_dp_mst_wait_tx_reply+0x118/0x1e0
>> [    1.396600]  ? prepare_to_wait_event+0x120/0x120
>> [    1.396603]  ? drm_dp_check_mstb_guid+0x3d/0x120
>> [    1.396606] drm_sysfs_connector_add() connector: ffff88040ed3f000 kdev: ffff88040c55dc00
>> [    1.396610]  drm_dp_send_link_address+0x185/0x1f0
>> [    1.396614]  drm_dp_check_and_send_link_address+0xad/0xc0
>> [    1.396617]  drm_dp_mst_link_probe_work+0x57/0xa0
>> [    1.396620]  process_one_work+0x14b/0x430
>> [    1.396623]  worker_thread+0x12b/0x4a0
>> [    1.396626]  kthread+0x10c/0x140
>> [    1.396629]  ? process_one_work+0x430/0x430
>> [    1.396632]  ? kthread_create_on_node+0x40/0x40
>> [    1.396635]  ret_from_fork+0x27/0x40
>> [    1.396639] ---[ end trace dd05bb2cd9ebd900 ]---
>> [    1.396643] drm_sysfs_connector_add() connector: ffff88040c75a000 kdev: fffffffffffffffe
>> [    1.396648] [drm:drm_sysfs_connector_add] *ERROR* failed to register connector device: -2
>> [    1.396651] drm_connector_register(ffff88040ed3f000) connector->registered: 1
>> [    1.396666] drm_sysfs_connector_add() connector: ffff88040ed3f800 kdev: ffff88040c55e400
>> [    1.396700] drm_connector_register(ffff88040ed3f800) connector->registered: 1
>> [    1.396717] drm_sysfs_connector_add() connector: ffff88040ed39800 kdev: ffff88040c55e800
>> [    1.396726] drm_connector_register(ffff88040ed39800) connector->registered: 1
>> [    1.396743] drm_sysfs_connector_add() connector: ffff88040c758000 kdev: ffff88040c55ec00
>> [    1.396769] drm_connector_register(ffff88040c758000) connector->registered: 1
>> [    1.396786] drm_sysfs_connector_add() connector: ffff88040c758800 kdev: ffff88040c55f000
>> [    1.396799] drm_connector_register(ffff88040c758800) connector->registered: 1
>> [    1.396804] drm_connector_register(ffff88040c75a000) connector->registered: 1
>> [    1.398665] ACPI: Video Device [GFX0] (multi-head: yes  rom: no  post: no)
>> [    1.398893] drm_sysfs_connector_add() connector: ffff88040c75b000 kdev: ffff88040c790400
>> [    1.398907] drm_connector_register(ffff88040c75b000) connector->registered: 1
>> [    1.398930] drm_sysfs_connector_add() connector: ffff88040c75c000 kdev: ffff88040c790800
>> [    1.398939] drm_connector_register(ffff88040c75c000) connector->registered: 1
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Dave Hansen Jan. 9, 2017, 4:50 p.m. UTC | #6
On 01/09/2017 08:41 AM, Daniel Vetter wrote:
> On Mon, Jan 9, 2017 at 2:40 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>> Well, now I found where the -2 comes from.
>> intel_dp_register_mst_connector() calls drm_connector_register(), which
>> fails to add the kobject (warning below).  But, it does zero error
>> checking on the drm_connector_register() call and leaves the
>> partially-constructed connector in place.
>>
>> The next time some poor, hapless code goes and tries to do anything with
>> that kdev, they oops.  I'm perplexed by this, though.  The
>> drm_dp_mst_topology_cbs->register_connector just returns void.  It seems
>> a bit goofy that it can't even _return_ failure.
>>
>> Is there some stable code to go back to here?  Or, is there something
>> about my configuration that's unique?  I really wonder why nobody else
>> is running into this.
>>
>> There's probably some other race going on here.  This warning doesn't
>> happen on every boot.
> This smells more like the root-cause: Something goes wrong on boot
> that prevents connectors from properly registering, then we fall over
> later on. And the register callback is intentionally void, assuming
> that any prep work has been done earlier and that therefore the
> register step can't fail. Can you pls check whether the oops later on
> only happens together with this warning at boot, or whether they're
> not correlated?

Looking through my logs, I can't find any instance of the oops without
the warning at boot.  So I do think the later oops is entirely caused by
the issue warned about in early boot.

My distro kernel (4.4.0-57-generic) is also unstable, but I haven't
managed to capture a good oops there.  It's hitting this, which I assume
is unrelated:

	WARNING: CPU: 0 PID: 41 at /build/linux-lts-xenial-FdAdUy/linux-
	lts-xenial-4.4.0/ubuntu/i915/intel_pm.c:3675
	skl_update_other_pipe_wm+0x191/0x1a0 [i915_bpo]()
Daniel Vetter Jan. 9, 2017, 4:59 p.m. UTC | #7
On Mon, Jan 9, 2017 at 5:50 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 01/09/2017 08:41 AM, Daniel Vetter wrote:
>> On Mon, Jan 9, 2017 at 2:40 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>>> Well, now I found where the -2 comes from.
>>> intel_dp_register_mst_connector() calls drm_connector_register(), which
>>> fails to add the kobject (warning below).  But, it does zero error
>>> checking on the drm_connector_register() call and leaves the
>>> partially-constructed connector in place.
>>>
>>> The next time some poor, hapless code goes and tries to do anything with
>>> that kdev, they oops.  I'm perplexed by this, though.  The
>>> drm_dp_mst_topology_cbs->register_connector just returns void.  It seems
>>> a bit goofy that it can't even _return_ failure.
>>>
>>> Is there some stable code to go back to here?  Or, is there something
>>> about my configuration that's unique?  I really wonder why nobody else
>>> is running into this.
>>>
>>> There's probably some other race going on here.  This warning doesn't
>>> happen on every boot.
>> This smells more like the root-cause: Something goes wrong on boot
>> that prevents connectors from properly registering, then we fall over
>> later on. And the register callback is intentionally void, assuming
>> that any prep work has been done earlier and that therefore the
>> register step can't fail. Can you pls check whether the oops later on
>> only happens together with this warning at boot, or whether they're
>> not correlated?
>
> Looking through my logs, I can't find any instance of the oops without
> the warning at boot.  So I do think the later oops is entirely caused by
> the issue warned about in early boot.

Hm, I guess then we'd need to fix that boot-up warning. Can you try to
figure out why it's unhappy? On a hunch it could be that we call
drm_connector_register from the mst probe worker before the main
driver load thread has reached the drm_dev_register call. A few printk
to decide whether that's the case (plus a few boot-up tests to gather
the statistics, sorry about that) would be real great.

If that's inconclusive I'm again a bit low on ideas ...

> My distro kernel (4.4.0-57-generic) is also unstable, but I haven't
> managed to capture a good oops there.  It's hitting this, which I assume
> is unrelated:
>
>         WARNING: CPU: 0 PID: 41 at /build/linux-lts-xenial-FdAdUy/linux-
>         lts-xenial-4.4.0/ubuntu/i915/intel_pm.c:3675
>         skl_update_other_pipe_wm+0x191/0x1a0 [i915_bpo]()

wm programming issues, which will kill your box. Needs a newer kernel
to fix (both the wm programming issues, and that wm programming issues
lead to system death).
-Daniel
Dave Hansen Jan. 9, 2017, 5:22 p.m. UTC | #8
On 01/09/2017 08:59 AM, Daniel Vetter wrote:
> On Mon, Jan 9, 2017 at 5:50 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>> On 01/09/2017 08:41 AM, Daniel Vetter wrote:
>>> On Mon, Jan 9, 2017 at 2:40 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>>>> Well, now I found where the -2 comes from.
>>>> intel_dp_register_mst_connector() calls drm_connector_register(), which
>>>> fails to add the kobject (warning below).  But, it does zero error
>>>> checking on the drm_connector_register() call and leaves the
>>>> partially-constructed connector in place.
>>>>
>>>> The next time some poor, hapless code goes and tries to do anything with
>>>> that kdev, they oops.  I'm perplexed by this, though.  The
>>>> drm_dp_mst_topology_cbs->register_connector just returns void.  It seems
>>>> a bit goofy that it can't even _return_ failure.
>>>>
>>>> Is there some stable code to go back to here?  Or, is there something
>>>> about my configuration that's unique?  I really wonder why nobody else
>>>> is running into this.
>>>>
>>>> There's probably some other race going on here.  This warning doesn't
>>>> happen on every boot.
>>> This smells more like the root-cause: Something goes wrong on boot
>>> that prevents connectors from properly registering, then we fall over
>>> later on. And the register callback is intentionally void, assuming
>>> that any prep work has been done earlier and that therefore the
>>> register step can't fail. Can you pls check whether the oops later on
>>> only happens together with this warning at boot, or whether they're
>>> not correlated?
>>
>> Looking through my logs, I can't find any instance of the oops without
>> the warning at boot.  So I do think the later oops is entirely caused by
>> the issue warned about in early boot.
> 
> Hm, I guess then we'd need to fix that boot-up warning. Can you try to
> figure out why it's unhappy? On a hunch it could be that we call
> drm_connector_register from the mst probe worker before the main
> driver load thread has reached the drm_dev_register call. A few printk
> to decide whether that's the case (plus a few boot-up tests to gather
> the statistics, sorry about that) would be real great.
> 
> If that's inconclusive I'm again a bit low on ideas ...

I'll do that shortly.  But, for now I can confirm that the failure is
precipitated by the !parent check in sysfs_create_dir_ns().

I also can't reproduce this if I build i915 as a module.  It only
happens when built in.

> Jan  9 09:07:34 ray kernel: [    1.400547] sysfs_create_dir_ns()::53 error: -2
> Jan  9 09:07:34 ray kernel: [    1.400554] create_dir()::75 error: -2
> Jan  9 09:07:34 ray kernel: [    1.400558] ------------[ cut here ]------------
> Jan  9 09:07:34 ray kernel: [    1.400565] WARNING: CPU: 1 PID: 90 at lib/kobject.c:249 kobject_add_internal+0x273/0x320
> Jan  9 09:07:34 ray kernel: [    1.400569] kobject_add_internal failed for card0-DP-3 (error: -2 parent: card0)
> Jan  9 09:07:34 ray kernel: [    1.400572] Modules linked in:
> Jan  9 09:07:34 ray kernel: [    1.400577] CPU: 1 PID: 90 Comm: kworker/1:2 Not tainted 4.10.0-rc3-dirty #61
> Jan  9 09:07:34 ray kernel: [    1.400579] Hardware name: LENOVO 20F5S7V800/20F5S7V800, BIOS R02ET50W (1.23 ) 09/20/2016
> Jan  9 09:07:34 ray kernel: [    1.400585] Workqueue: events_long drm_dp_mst_link_probe_work
> Jan  9 09:07:34 ray kernel: [    1.400588] Call Trace:
> Jan  9 09:07:34 ray kernel: [    1.400593]  dump_stack+0x67/0x99
> Jan  9 09:07:34 ray kernel: [    1.400598]  __warn+0xd1/0xf0
> Jan  9 09:07:34 ray kernel: [    1.400601]  warn_slowpath_fmt+0x4f/0x60
> Jan  9 09:07:34 ray kernel: [    1.400604]  kobject_add_internal+0x273/0x320
> Jan  9 09:07:34 ray kernel: [    1.400607]  kobject_add+0x65/0xb0
> Jan  9 09:07:34 ray kernel: [    1.400611]  ? klist_init+0x31/0x40
> Jan  9 09:07:34 ray kernel: [    1.400615]  device_add+0x102/0x5d0
> Jan  9 09:07:34 ray kernel: [    1.400619]  ? kfree_const+0x22/0x30
> Jan  9 09:07:34 ray kernel: [    1.400623]  device_create_groups_vargs+0xd8/0x100
> Jan  9 09:07:34 ray kernel: [    1.400626]  device_create_with_groups+0x36/0x40
> Jan  9 09:07:34 ray kernel: [    1.400631]  ? drm_fb_helper_add_one_connector+0x57/0xd0
> Jan  9 09:07:34 ray kernel: [    1.400636]  ? kmem_cache_alloc_trace+0x1d2/0x1f0
> Jan  9 09:07:34 ray kernel: [    1.400641]  drm_sysfs_connector_add+0x60/0xe0
> Jan  9 09:07:34 ray kernel: [    1.400645]  drm_connector_register+0x21/0xc0
> Jan  9 09:07:34 ray kernel: [    1.400649]  intel_dp_register_mst_connector+0x41/0x50
> Jan  9 09:07:34 ray kernel: [    1.400653]  drm_dp_add_port+0x350/0x450
> Jan  9 09:07:34 ray kernel: [    1.400657]  ? rcu_early_boot_tests+0x1/0x10
> Jan  9 09:07:34 ray kernel: [    1.400660]  ? schedule_timeout+0x1cd/0x390
> Jan  9 09:07:34 ray kernel: [    1.400664]  ? __might_sleep+0x4a/0x90
> Jan  9 09:07:34 ray kernel: [    1.400667]  ? mutex_lock+0x25/0x50
> Jan  9 09:07:34 ray kernel: [    1.400670]  ? drm_dp_mst_wait_tx_reply+0x118/0x1e0
> Jan  9 09:07:34 ray kernel: [    1.400673]  ? prepare_to_wait_event+0x120/0x120
> Jan  9 09:07:34 ray kernel: [    1.400675] drm_sysfs_connector_add() connector: ffff88040c778000 kdev: ffff88040ef15000
> Jan  9 09:07:34 ray kernel: [    1.400681]  ? drm_dp_check_mstb_guid+0x3d/0x120
> Jan  9 09:07:34 ray kernel: [    1.400684]  drm_dp_send_link_address+0x185/0x1f0
> Jan  9 09:07:34 ray kernel: [    1.400688]  drm_dp_check_and_send_link_address+0xad/0xc0
> Jan  9 09:07:34 ray kernel: [    1.400691]  drm_dp_mst_link_probe_work+0x57/0xa0
> Jan  9 09:07:34 ray kernel: [    1.400694]  process_one_work+0x14b/0x430
> Jan  9 09:07:34 ray kernel: [    1.400697]  worker_thread+0x12b/0x4a0
> Jan  9 09:07:34 ray kernel: [    1.400700]  kthread+0x10c/0x140
> Jan  9 09:07:34 ray kernel: [    1.400703]  ? process_one_work+0x430/0x430
> Jan  9 09:07:34 ray kernel: [    1.400706]  ? kthread_create_on_node+0x40/0x40
> Jan  9 09:07:34 ray kernel: [    1.400709]  ret_from_fork+0x27/0x40
> Jan  9 09:07:34 ray kernel: [    1.400714] ---[ end trace 0009c9dc7b253d9c ]---
Dave Hansen Jan. 9, 2017, 5:42 p.m. UTC | #9
On 01/09/2017 08:59 AM, Daniel Vetter wrote:
> On Mon, Jan 9, 2017 at 5:50 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>> On 01/09/2017 08:41 AM, Daniel Vetter wrote:
>>> On Mon, Jan 9, 2017 at 2:40 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>>>> Well, now I found where the -2 comes from.
>>>> intel_dp_register_mst_connector() calls drm_connector_register(), which
>>>> fails to add the kobject (warning below).  But, it does zero error
>>>> checking on the drm_connector_register() call and leaves the
>>>> partially-constructed connector in place.
>>>>
>>>> The next time some poor, hapless code goes and tries to do anything with
>>>> that kdev, they oops.  I'm perplexed by this, though.  The
>>>> drm_dp_mst_topology_cbs->register_connector just returns void.  It seems
>>>> a bit goofy that it can't even _return_ failure.
>>>>
>>>> Is there some stable code to go back to here?  Or, is there something
>>>> about my configuration that's unique?  I really wonder why nobody else
>>>> is running into this.
>>>>
>>>> There's probably some other race going on here.  This warning doesn't
>>>> happen on every boot.
>>> This smells more like the root-cause: Something goes wrong on boot
>>> that prevents connectors from properly registering, then we fall over
>>> later on. And the register callback is intentionally void, assuming
>>> that any prep work has been done earlier and that therefore the
>>> register step can't fail. Can you pls check whether the oops later on
>>> only happens together with this warning at boot, or whether they're
>>> not correlated?
>>
>> Looking through my logs, I can't find any instance of the oops without
>> the warning at boot.  So I do think the later oops is entirely caused by
>> the issue warned about in early boot.
> 
> Hm, I guess then we'd need to fix that boot-up warning. Can you try to
> figure out why it's unhappy? On a hunch it could be that we call
> drm_connector_register from the mst probe worker before the main
> driver load thread has reached the drm_dev_register call. A few printk
> to decide whether that's the case (plus a few boot-up tests to gather
> the statistics, sorry about that) would be real great.

I think your suspicion is correct.

The WARNING is getting printed from CPU0 (the
drm_dp_mst_link_probe_work() thread) and we can see the
drm_connector_register()s happening in parallel on CPU1 while the
WARNING is getting printed out.

This seems to happen very consistently when i915 is built into the kernel.

> [    1.404237] drm_dev_register(00000000000002da)::209092608 cpu: 1
> [    1.404284] drm_connector_register(ffff88040ed3e800)::379 connector->registered: 0 cpu: 0
> [    1.404293] sysfs_create_dir_ns()::53 error: -2
> [    1.404295] create_dir()::75 error: -2 cpu: 0
> [    1.404298] ------------[ cut here ]------------
> [    1.404304] WARNING: CPU: 0 PID: 3 at lib/kobject.c:249 kobject_add_internal+0x273/0x330
> [    1.404308] kobject_add_internal failed for card0-DP-3 (error: -2 parent: card0)
> [    1.404311] Modules linked in:
> [    1.404316] CPU: 0 PID: 3 Comm: kworker/0:0 Not tainted 4.10.0-rc3-i915borked-dirty #66
> [    1.404319] Hardware name: LENOVO 20F5S7V800/20F5S7V800, BIOS R02ET50W (1.23 ) 09/20/2016
> [    1.404325] Workqueue: events_long drm_dp_mst_link_probe_work
> [    1.404328] Call Trace:
> [    1.404334]  dump_stack+0x67/0x99
> [    1.404338]  __warn+0xd1/0xf0
> [    1.404342]  warn_slowpath_fmt+0x4f/0x60
> [    1.404345]  kobject_add_internal+0x273/0x330
> [    1.404347]  kobject_add+0x65/0xb0
> [    1.404352]  ? klist_init+0x31/0x40
> [    1.404356]  device_add+0x102/0x5d0
> [    1.404360]  ? kfree_const+0x22/0x30
> [    1.404364]  device_create_groups_vargs+0xd8/0x100
> [    1.404367]  device_create_with_groups+0x36/0x40
> [    1.404370] drm_connector_register(ffff88040c779000)::379 connector->registered: 0 cpu: 1
> [    1.404374]  ? vprintk_default+0x29/0x50
> [    1.404375]  ? printk+0x4d/0x4f
> [    1.404378]  drm_sysfs_connector_add+0x60/0xe0
> [    1.404381]  drm_connector_register+0x4e/0x100
> [    1.404383]  intel_dp_register_mst_connector+0x41/0x50
> [    1.404385]  drm_dp_add_port+0x350/0x450
> [    1.404388]  ? rcu_early_boot_tests+0x1/0x10
> [    1.404390]  ? schedule_timeout+0x1cd/0x390
> [    1.404392]  ? __might_sleep+0x4a/0x90
> [    1.404394]  ? mutex_lock+0x25/0x50
> [    1.404395]  ? drm_dp_mst_wait_tx_reply+0x118/0x1e0
> [    1.404397]  ? prepare_to_wait_event+0x120/0x120
> [    1.404399]  ? drm_dp_check_mstb_guid+0x3d/0x120
> [    1.404401]  drm_dp_send_link_address+0x185/0x1f0
> [    1.404403]  drm_dp_check_and_send_link_address+0xad/0xc0
> [    1.404405]  drm_dp_mst_link_probe_work+0x57/0xa0
> [    1.404407]  process_one_work+0x14b/0x430
> [    1.404408]  worker_thread+0x12b/0x4a0
> [    1.404410]  kthread+0x10c/0x140
> [    1.404411]  ? process_one_work+0x430/0x430
> [    1.404413]  ? kthread_create_on_node+0x40/0x40
> [    1.404415]  ret_from_fork+0x27/0x40
> [    1.404418] ---[ end trace 952a97b7cba4984f ]---
> [    1.404420] drm_sysfs_connector_add() connector: ffff88040ed3e800 kdev: fffffffffffffffe
> [    1.404422] [drm:drm_sysfs_connector_add] *ERROR* failed to register connector device: -2
> [    1.404453] drm_sysfs_connector_add() connector: ffff88040c779000 kdev: ffff88040c771c00
> [    1.404508] drm_connector_register(ffff88040c779000)::402 connector->registered: 1 cpu: 1
> [    1.404512] drm_connector_register(ffff88040ed3b800)::379 connector->registered: 0 cpu: 1
> [    1.404531] drm_sysfs_connector_add() connector: ffff88040ed3b800 kdev: ffff88040c772000
> [    1.404559] drm_connector_register(ffff88040ed3b800)::402 connector->registered: 1 cpu: 1
> [    1.404564] drm_connector_register(ffff88040ed3c000)::379 connector->registered: 0 cpu: 1
> [    1.404583] drm_sysfs_connector_add() connector: ffff88040ed3c000 kdev: ffff88040c772400
Alex Deucher Jan. 9, 2017, 7:34 p.m. UTC | #10
On Mon, Jan 9, 2017 at 12:22 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 01/09/2017 08:59 AM, Daniel Vetter wrote:
>> On Mon, Jan 9, 2017 at 5:50 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>>> On 01/09/2017 08:41 AM, Daniel Vetter wrote:
>>>> On Mon, Jan 9, 2017 at 2:40 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>>>>> Well, now I found where the -2 comes from.
>>>>> intel_dp_register_mst_connector() calls drm_connector_register(), which
>>>>> fails to add the kobject (warning below).  But, it does zero error
>>>>> checking on the drm_connector_register() call and leaves the
>>>>> partially-constructed connector in place.
>>>>>
>>>>> The next time some poor, hapless code goes and tries to do anything with
>>>>> that kdev, they oops.  I'm perplexed by this, though.  The
>>>>> drm_dp_mst_topology_cbs->register_connector just returns void.  It seems
>>>>> a bit goofy that it can't even _return_ failure.
>>>>>
>>>>> Is there some stable code to go back to here?  Or, is there something
>>>>> about my configuration that's unique?  I really wonder why nobody else
>>>>> is running into this.
>>>>>
>>>>> There's probably some other race going on here.  This warning doesn't
>>>>> happen on every boot.
>>>> This smells more like the root-cause: Something goes wrong on boot
>>>> that prevents connectors from properly registering, then we fall over
>>>> later on. And the register callback is intentionally void, assuming
>>>> that any prep work has been done earlier and that therefore the
>>>> register step can't fail. Can you pls check whether the oops later on
>>>> only happens together with this warning at boot, or whether they're
>>>> not correlated?
>>>
>>> Looking through my logs, I can't find any instance of the oops without
>>> the warning at boot.  So I do think the later oops is entirely caused by
>>> the issue warned about in early boot.
>>
>> Hm, I guess then we'd need to fix that boot-up warning. Can you try to
>> figure out why it's unhappy? On a hunch it could be that we call
>> drm_connector_register from the mst probe worker before the main
>> driver load thread has reached the drm_dev_register call. A few printk
>> to decide whether that's the case (plus a few boot-up tests to gather
>> the statistics, sorry about that) would be real great.
>>
>> If that's inconclusive I'm again a bit low on ideas ...
>
> I'll do that shortly.  But, for now I can confirm that the failure is
> precipitated by the !parent check in sysfs_create_dir_ns().
>
> I also can't reproduce this if I build i915 as a module.  It only
> happens when built in.

FWIW, we ran into a race with fbdev and mst topology discovery.  Maybe
you are seeing something similar?

Alex


>
>> Jan  9 09:07:34 ray kernel: [    1.400547] sysfs_create_dir_ns()::53 error: -2
>> Jan  9 09:07:34 ray kernel: [    1.400554] create_dir()::75 error: -2
>> Jan  9 09:07:34 ray kernel: [    1.400558] ------------[ cut here ]------------
>> Jan  9 09:07:34 ray kernel: [    1.400565] WARNING: CPU: 1 PID: 90 at lib/kobject.c:249 kobject_add_internal+0x273/0x320
>> Jan  9 09:07:34 ray kernel: [    1.400569] kobject_add_internal failed for card0-DP-3 (error: -2 parent: card0)
>> Jan  9 09:07:34 ray kernel: [    1.400572] Modules linked in:
>> Jan  9 09:07:34 ray kernel: [    1.400577] CPU: 1 PID: 90 Comm: kworker/1:2 Not tainted 4.10.0-rc3-dirty #61
>> Jan  9 09:07:34 ray kernel: [    1.400579] Hardware name: LENOVO 20F5S7V800/20F5S7V800, BIOS R02ET50W (1.23 ) 09/20/2016
>> Jan  9 09:07:34 ray kernel: [    1.400585] Workqueue: events_long drm_dp_mst_link_probe_work
>> Jan  9 09:07:34 ray kernel: [    1.400588] Call Trace:
>> Jan  9 09:07:34 ray kernel: [    1.400593]  dump_stack+0x67/0x99
>> Jan  9 09:07:34 ray kernel: [    1.400598]  __warn+0xd1/0xf0
>> Jan  9 09:07:34 ray kernel: [    1.400601]  warn_slowpath_fmt+0x4f/0x60
>> Jan  9 09:07:34 ray kernel: [    1.400604]  kobject_add_internal+0x273/0x320
>> Jan  9 09:07:34 ray kernel: [    1.400607]  kobject_add+0x65/0xb0
>> Jan  9 09:07:34 ray kernel: [    1.400611]  ? klist_init+0x31/0x40
>> Jan  9 09:07:34 ray kernel: [    1.400615]  device_add+0x102/0x5d0
>> Jan  9 09:07:34 ray kernel: [    1.400619]  ? kfree_const+0x22/0x30
>> Jan  9 09:07:34 ray kernel: [    1.400623]  device_create_groups_vargs+0xd8/0x100
>> Jan  9 09:07:34 ray kernel: [    1.400626]  device_create_with_groups+0x36/0x40
>> Jan  9 09:07:34 ray kernel: [    1.400631]  ? drm_fb_helper_add_one_connector+0x57/0xd0
>> Jan  9 09:07:34 ray kernel: [    1.400636]  ? kmem_cache_alloc_trace+0x1d2/0x1f0
>> Jan  9 09:07:34 ray kernel: [    1.400641]  drm_sysfs_connector_add+0x60/0xe0
>> Jan  9 09:07:34 ray kernel: [    1.400645]  drm_connector_register+0x21/0xc0
>> Jan  9 09:07:34 ray kernel: [    1.400649]  intel_dp_register_mst_connector+0x41/0x50
>> Jan  9 09:07:34 ray kernel: [    1.400653]  drm_dp_add_port+0x350/0x450
>> Jan  9 09:07:34 ray kernel: [    1.400657]  ? rcu_early_boot_tests+0x1/0x10
>> Jan  9 09:07:34 ray kernel: [    1.400660]  ? schedule_timeout+0x1cd/0x390
>> Jan  9 09:07:34 ray kernel: [    1.400664]  ? __might_sleep+0x4a/0x90
>> Jan  9 09:07:34 ray kernel: [    1.400667]  ? mutex_lock+0x25/0x50
>> Jan  9 09:07:34 ray kernel: [    1.400670]  ? drm_dp_mst_wait_tx_reply+0x118/0x1e0
>> Jan  9 09:07:34 ray kernel: [    1.400673]  ? prepare_to_wait_event+0x120/0x120
>> Jan  9 09:07:34 ray kernel: [    1.400675] drm_sysfs_connector_add() connector: ffff88040c778000 kdev: ffff88040ef15000
>> Jan  9 09:07:34 ray kernel: [    1.400681]  ? drm_dp_check_mstb_guid+0x3d/0x120
>> Jan  9 09:07:34 ray kernel: [    1.400684]  drm_dp_send_link_address+0x185/0x1f0
>> Jan  9 09:07:34 ray kernel: [    1.400688]  drm_dp_check_and_send_link_address+0xad/0xc0
>> Jan  9 09:07:34 ray kernel: [    1.400691]  drm_dp_mst_link_probe_work+0x57/0xa0
>> Jan  9 09:07:34 ray kernel: [    1.400694]  process_one_work+0x14b/0x430
>> Jan  9 09:07:34 ray kernel: [    1.400697]  worker_thread+0x12b/0x4a0
>> Jan  9 09:07:34 ray kernel: [    1.400700]  kthread+0x10c/0x140
>> Jan  9 09:07:34 ray kernel: [    1.400703]  ? process_one_work+0x430/0x430
>> Jan  9 09:07:34 ray kernel: [    1.400706]  ? kthread_create_on_node+0x40/0x40
>> Jan  9 09:07:34 ray kernel: [    1.400709]  ret_from_fork+0x27/0x40
>> Jan  9 09:07:34 ray kernel: [    1.400714] ---[ end trace 0009c9dc7b253d9c ]---
>
>
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
Daniel Vetter Jan. 10, 2017, 10:31 a.m. UTC | #11
On Mon, Jan 09, 2017 at 09:42:22AM -0800, Dave Hansen wrote:
> On 01/09/2017 08:59 AM, Daniel Vetter wrote:
> > On Mon, Jan 9, 2017 at 5:50 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> >> On 01/09/2017 08:41 AM, Daniel Vetter wrote:
> >>> On Mon, Jan 9, 2017 at 2:40 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> >>>> Well, now I found where the -2 comes from.
> >>>> intel_dp_register_mst_connector() calls drm_connector_register(), which
> >>>> fails to add the kobject (warning below).  But, it does zero error
> >>>> checking on the drm_connector_register() call and leaves the
> >>>> partially-constructed connector in place.
> >>>>
> >>>> The next time some poor, hapless code goes and tries to do anything with
> >>>> that kdev, they oops.  I'm perplexed by this, though.  The
> >>>> drm_dp_mst_topology_cbs->register_connector just returns void.  It seems
> >>>> a bit goofy that it can't even _return_ failure.
> >>>>
> >>>> Is there some stable code to go back to here?  Or, is there something
> >>>> about my configuration that's unique?  I really wonder why nobody else
> >>>> is running into this.
> >>>>
> >>>> There's probably some other race going on here.  This warning doesn't
> >>>> happen on every boot.
> >>> This smells more like the root-cause: Something goes wrong on boot
> >>> that prevents connectors from properly registering, then we fall over
> >>> later on. And the register callback is intentionally void, assuming
> >>> that any prep work has been done earlier and that therefore the
> >>> register step can't fail. Can you pls check whether the oops later on
> >>> only happens together with this warning at boot, or whether they're
> >>> not correlated?
> >>
> >> Looking through my logs, I can't find any instance of the oops without
> >> the warning at boot.  So I do think the later oops is entirely caused by
> >> the issue warned about in early boot.
> > 
> > Hm, I guess then we'd need to fix that boot-up warning. Can you try to
> > figure out why it's unhappy? On a hunch it could be that we call
> > drm_connector_register from the mst probe worker before the main
> > driver load thread has reached the drm_dev_register call. A few printk
> > to decide whether that's the case (plus a few boot-up tests to gather
> > the statistics, sorry about that) would be real great.
> 
> I think your suspicion is correct.
> 
> The WARNING is getting printed from CPU0 (the
> drm_dp_mst_link_probe_work() thread) and we can see the
> drm_connector_register()s happening in parallel on CPU1 while the
> WARNING is getting printed out.
> 
> This seems to happen very consistently when i915 is built into the kernel.

Do you have

commit e73ab00e9a0f1731f34d0620a9c55f5c30c4ad4e
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Sun Dec 18 14:35:45 2016 +0100

    drm: prevent double-(un)registration for connectors

Lack of that would perfectly explain that oops ... Otherwise still no idea
what's going wrong.
-Daniel

> 
> > [    1.404237] drm_dev_register(00000000000002da)::209092608 cpu: 1
> > [    1.404284] drm_connector_register(ffff88040ed3e800)::379 connector->registered: 0 cpu: 0
> > [    1.404293] sysfs_create_dir_ns()::53 error: -2
> > [    1.404295] create_dir()::75 error: -2 cpu: 0
> > [    1.404298] ------------[ cut here ]------------
> > [    1.404304] WARNING: CPU: 0 PID: 3 at lib/kobject.c:249 kobject_add_internal+0x273/0x330
> > [    1.404308] kobject_add_internal failed for card0-DP-3 (error: -2 parent: card0)
> > [    1.404311] Modules linked in:
> > [    1.404316] CPU: 0 PID: 3 Comm: kworker/0:0 Not tainted 4.10.0-rc3-i915borked-dirty #66
> > [    1.404319] Hardware name: LENOVO 20F5S7V800/20F5S7V800, BIOS R02ET50W (1.23 ) 09/20/2016
> > [    1.404325] Workqueue: events_long drm_dp_mst_link_probe_work
> > [    1.404328] Call Trace:
> > [    1.404334]  dump_stack+0x67/0x99
> > [    1.404338]  __warn+0xd1/0xf0
> > [    1.404342]  warn_slowpath_fmt+0x4f/0x60
> > [    1.404345]  kobject_add_internal+0x273/0x330
> > [    1.404347]  kobject_add+0x65/0xb0
> > [    1.404352]  ? klist_init+0x31/0x40
> > [    1.404356]  device_add+0x102/0x5d0
> > [    1.404360]  ? kfree_const+0x22/0x30
> > [    1.404364]  device_create_groups_vargs+0xd8/0x100
> > [    1.404367]  device_create_with_groups+0x36/0x40
> > [    1.404370] drm_connector_register(ffff88040c779000)::379 connector->registered: 0 cpu: 1
> > [    1.404374]  ? vprintk_default+0x29/0x50
> > [    1.404375]  ? printk+0x4d/0x4f
> > [    1.404378]  drm_sysfs_connector_add+0x60/0xe0
> > [    1.404381]  drm_connector_register+0x4e/0x100
> > [    1.404383]  intel_dp_register_mst_connector+0x41/0x50
> > [    1.404385]  drm_dp_add_port+0x350/0x450
> > [    1.404388]  ? rcu_early_boot_tests+0x1/0x10
> > [    1.404390]  ? schedule_timeout+0x1cd/0x390
> > [    1.404392]  ? __might_sleep+0x4a/0x90
> > [    1.404394]  ? mutex_lock+0x25/0x50
> > [    1.404395]  ? drm_dp_mst_wait_tx_reply+0x118/0x1e0
> > [    1.404397]  ? prepare_to_wait_event+0x120/0x120
> > [    1.404399]  ? drm_dp_check_mstb_guid+0x3d/0x120
> > [    1.404401]  drm_dp_send_link_address+0x185/0x1f0
> > [    1.404403]  drm_dp_check_and_send_link_address+0xad/0xc0
> > [    1.404405]  drm_dp_mst_link_probe_work+0x57/0xa0
> > [    1.404407]  process_one_work+0x14b/0x430
> > [    1.404408]  worker_thread+0x12b/0x4a0
> > [    1.404410]  kthread+0x10c/0x140
> > [    1.404411]  ? process_one_work+0x430/0x430
> > [    1.404413]  ? kthread_create_on_node+0x40/0x40
> > [    1.404415]  ret_from_fork+0x27/0x40
> > [    1.404418] ---[ end trace 952a97b7cba4984f ]---
> > [    1.404420] drm_sysfs_connector_add() connector: ffff88040ed3e800 kdev: fffffffffffffffe
> > [    1.404422] [drm:drm_sysfs_connector_add] *ERROR* failed to register connector device: -2
> > [    1.404453] drm_sysfs_connector_add() connector: ffff88040c779000 kdev: ffff88040c771c00
> > [    1.404508] drm_connector_register(ffff88040c779000)::402 connector->registered: 1 cpu: 1
> > [    1.404512] drm_connector_register(ffff88040ed3b800)::379 connector->registered: 0 cpu: 1
> > [    1.404531] drm_sysfs_connector_add() connector: ffff88040ed3b800 kdev: ffff88040c772000
> > [    1.404559] drm_connector_register(ffff88040ed3b800)::402 connector->registered: 1 cpu: 1
> > [    1.404564] drm_connector_register(ffff88040ed3c000)::379 connector->registered: 0 cpu: 1
> > [    1.404583] drm_sysfs_connector_add() connector: ffff88040ed3c000 kdev: ffff88040c772400
>
Dave Hansen Jan. 10, 2017, 4:52 p.m. UTC | #12
On 01/10/2017 02:31 AM, Daniel Vetter wrote:
> commit e73ab00e9a0f1731f34d0620a9c55f5c30c4ad4e
> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
> Date:   Sun Dec 18 14:35:45 2016 +0100
> 
>     drm: prevent double-(un)registration for connectors
> 
> Lack of that would perfectly explain that oops ... Otherwise still no idea
> what's going wrong.

No...  That's not in mainline as far as I can see.  Should I test with
it applied?
Daniel Vetter Jan. 11, 2017, 7:43 a.m. UTC | #13
On Tue, Jan 10, 2017 at 08:52:47AM -0800, Dave Hansen wrote:
> On 01/10/2017 02:31 AM, Daniel Vetter wrote:
> > commit e73ab00e9a0f1731f34d0620a9c55f5c30c4ad4e
> > Author: Daniel Vetter <daniel.vetter@ffwll.ch>
> > Date:   Sun Dec 18 14:35:45 2016 +0100
> > 
> >     drm: prevent double-(un)registration for connectors
> > 
> > Lack of that would perfectly explain that oops ... Otherwise still no idea
> > what's going wrong.
> 
> No...  That's not in mainline as far as I can see.  Should I test with
> it applied?

Hm, I guess failed to cc: stable that one properly, iirc we decided the
race fix is too academic and can't be hit in reality ;-)

Testing would be great. Probably conflicts because we extracted
drm_connector.c only recently, but running s/drm_connector\.c/drm_crtc.c/
over the diff and then applying with some fudge should take care of that.

Thanks, Daniel
Dave Hansen Jan. 11, 2017, 3:24 p.m. UTC | #14
On 01/10/2017 11:43 PM, Daniel Vetter wrote:
> On Tue, Jan 10, 2017 at 08:52:47AM -0800, Dave Hansen wrote:
>> On 01/10/2017 02:31 AM, Daniel Vetter wrote:
>>> commit e73ab00e9a0f1731f34d0620a9c55f5c30c4ad4e
>>> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
>>> Date:   Sun Dec 18 14:35:45 2016 +0100
>>>
>>>     drm: prevent double-(un)registration for connectors
>>>
>>> Lack of that would perfectly explain that oops ... Otherwise still no idea
>>> what's going wrong.
>> No...  That's not in mainline as far as I can see.  Should I test with
>> it applied?
> Hm, I guess failed to cc: stable that one properly, iirc we decided the
> race fix is too academic and can't be hit in reality ;-)
> 
> Testing would be great. Probably conflicts because we extracted
> drm_connector.c only recently, but running s/drm_connector\.c/drm_crtc.c/
> over the diff and then applying with some fudge should take care of that.

It doesn't apply to mainline, with or without the substitution you suggest.

Do you want that commit itself tested from -next?
Daniel Vetter Jan. 11, 2017, 3:39 p.m. UTC | #15
On Wed, Jan 11, 2017 at 4:24 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 01/10/2017 11:43 PM, Daniel Vetter wrote:
>> On Tue, Jan 10, 2017 at 08:52:47AM -0800, Dave Hansen wrote:
>>> On 01/10/2017 02:31 AM, Daniel Vetter wrote:
>>>> commit e73ab00e9a0f1731f34d0620a9c55f5c30c4ad4e
>>>> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
>>>> Date:   Sun Dec 18 14:35:45 2016 +0100
>>>>
>>>>     drm: prevent double-(un)registration for connectors
>>>>
>>>> Lack of that would perfectly explain that oops ... Otherwise still no idea
>>>> what's going wrong.
>>> No...  That's not in mainline as far as I can see.  Should I test with
>>> it applied?
>> Hm, I guess failed to cc: stable that one properly, iirc we decided the
>> race fix is too academic and can't be hit in reality ;-)
>>
>> Testing would be great. Probably conflicts because we extracted
>> drm_connector.c only recently, but running s/drm_connector\.c/drm_crtc.c/
>> over the diff and then applying with some fudge should take care of that.
>
> It doesn't apply to mainline, with or without the substitution you suggest.
>
> Do you want that commit itself tested from -next?

Hm, just cherry-picked it on top of Linus' latest 4.10 git, applies
cleanly there. The substituation was for 4.9. I can send you the patch
here, but seems all fine from what I can tell ...
-Daniel
Dave Hansen Jan. 11, 2017, 4:16 p.m. UTC | #16
On 01/11/2017 07:39 AM, Daniel Vetter wrote:
> Hm, just cherry-picked it on top of Linus' latest 4.10 git, applies
> cleanly there. The substituation was for 4.9. I can send you the patch
> here, but seems all fine from what I can tell ...

All of the printk's that I added were making it fail to apply.

So, I took a 4.10-rc3 kernel with i915 compiled in (not as a module) and
applied e73ab00e9a0f17, which I grabbed from linux-next.

I'm seeing basically the same behavior that I did before applying
e73ab00e9a0f17.  sysfs_create_dir_ns() fails because of a NULL kobj->parent.

Have you guys tried testing this yourselves?  It seems really easy to
reproduce if you just compile the driver in.

> [    1.400797] drm_dev_register(ffff88040c730000)::730 cpu: 2
> [    1.400860] drm_connector_register(ffff88040c76b000)::382 connector->registered: 0 cpu: 1
> [    1.400870] sysfs_create_dir_ns()::53 error: -2
> [    1.400874] create_dir()::75 error: -2 cpu: 1
> [    1.400878] ------------[ cut here ]------------
> [    1.400884] WARNING: CPU: 1 PID: 91 at lib/kobject.c:249 kobject_add_internal+0x273/0x330
> [    1.400888] kobject_add_internal failed for card0-DP-3 (error: -2 parent: card0)
> [    1.400892] Modules linked in:
> [    1.400896] CPU: 1 PID: 91 Comm: kworker/1:2 Not tainted 4.10.0-rc3-i915borked-dirty #67
> [    1.400900] Hardware name: LENOVO 20F5S7V800/20F5S7V800, BIOS R02ET50W (1.23 ) 09/20/2016
> [    1.400906] Workqueue: events_long drm_dp_mst_link_probe_work
> [    1.400909] Call Trace:
> [    1.400914]  dump_stack+0x67/0x99
> [    1.400918]  __warn+0xd1/0xf0
> [    1.400922]  warn_slowpath_fmt+0x4f/0x60
> [    1.400925]  kobject_add_internal+0x273/0x330
> [    1.400927]  kobject_add+0x65/0xb0
> [    1.400931]  ? klist_init+0x31/0x40
> [    1.400936]  device_add+0x102/0x5d0
> [    1.400940]  ? kfree_const+0x22/0x30
> [    1.400944]  device_create_groups_vargs+0xd8/0x100
> [    1.400947]  device_create_with_groups+0x36/0x40
> [    1.400952]  ? vprintk_default+0x29/0x50
> [    1.400957]  ? __might_sleep+0x4a/0x90
> [    1.400962]  drm_sysfs_connector_add+0x60/0xe0
> [    1.400967]  drm_connector_register+0x74/0xd0
> [    1.400971]  intel_dp_register_mst_connector+0x41/0x50
> [    1.400975]  drm_dp_add_port+0x350/0x450
> [    1.400977] drm_connector_register(ffff88040ee6f800)::382 connector->registered: 0 cpu: 2
> [    1.400982]  ? rcu_early_boot_tests+0x1/0x10
> [    1.400986]  ? schedule_timeout+0x1cd/0x390
> [    1.400989]  ? __might_sleep+0x4a/0x90
> [    1.400992]  ? mutex_lock+0x25/0x50
> [    1.400995]  ? drm_dp_mst_wait_tx_reply+0x118/0x1e0
> [    1.400996] drm_sysfs_connector_add() connector: ffff88040ee6f800 kdev: ffff88040eef9c00
> [    1.401002]  ? prepare_to_wait_event+0x120/0x120
> [    1.401005]  ? drm_dp_check_mstb_guid+0x3d/0x120
> [    1.401008]  drm_dp_send_link_address+0x185/0x1f0
> [    1.401012]  drm_dp_check_and_send_link_address+0xad/0xc0
> [    1.401015]  drm_dp_mst_link_probe_work+0x57/0xa0
> [    1.401018]  process_one_work+0x14b/0x430
> [    1.401021]  worker_thread+0x12b/0x4a0
> [    1.401025]  kthread+0x10c/0x140
> [    1.401027]  ? process_one_work+0x430/0x430
> [    1.401030]  ? kthread_create_on_node+0x40/0x40
> [    1.401034]  ret_from_fork+0x27/0x40
> [    1.401038] ---[ end trace ba43fc250fbf282d ]---
> [    1.401041] drm_sysfs_connector_add() connector: ffff88040c76b000 kdev: fffffffffffffffe
> [    1.401043] drm_connector_register(ffff88040c768000)::382 connector->registered: 0 cpu: 2
> [    1.401050] [drm:drm_sysfs_connector_add] *ERROR* failed to register connector device: -2
> [    1.401057] drm_sysfs_connector_add() connector: ffff88040c768000 kdev: ffff88040eefa000
> [    1.401093] drm_connector_register(ffff88040c768800)::382 connector->registered: 0 cpu: 2
> [    1.401113] drm_sysfs_connector_add() connector: ffff88040c768800 kdev: ffff88040eefa400
> [    1.401122] drm_connector_register(ffff88040c769000)::382 connector->registered: 0 cpu: 2
> [    1.401140] drm_sysfs_connector_add() connector: ffff88040c769000 kdev: ffff88040eefa800
> [    1.401167] drm_connector_register(ffff88040c769800)::382 connector->registered: 0 cpu: 2
> [    1.401186] drm_sysfs_connector_add() connector: ffff88040c769800 kdev: ffff88040eefac00
> [    1.401195] drm_connector_register(ffff88040c76b000)::382 connector->registered: 0 cpu: 2
Daniel Vetter Jan. 11, 2017, 10:25 p.m. UTC | #17
On Wed, Jan 11, 2017 at 08:16:56AM -0800, Dave Hansen wrote:
> On 01/11/2017 07:39 AM, Daniel Vetter wrote:
> > Hm, just cherry-picked it on top of Linus' latest 4.10 git, applies
> > cleanly there. The substituation was for 4.9. I can send you the patch
> > here, but seems all fine from what I can tell ...
> 
> All of the printk's that I added were making it fail to apply.
> 
> So, I took a 4.10-rc3 kernel with i915 compiled in (not as a module) and
> applied e73ab00e9a0f17, which I grabbed from linux-next.
> 
> I'm seeing basically the same behavior that I did before applying
> e73ab00e9a0f17.  sysfs_create_dir_ns() fails because of a NULL kobj->parent.
> 
> Have you guys tried testing this yourselves?  It seems really easy to
> reproduce if you just compile the driver in.

With some screaming I can reproduce it. I'll try to take a look, but since
I'm travelling to lca might need someone else to look into this too (I
don't have an mst machine to carry around). And I still don't get why this
blows up even.
-Daniel

> 
> > [    1.400797] drm_dev_register(ffff88040c730000)::730 cpu: 2
> > [    1.400860] drm_connector_register(ffff88040c76b000)::382 connector->registered: 0 cpu: 1
> > [    1.400870] sysfs_create_dir_ns()::53 error: -2
> > [    1.400874] create_dir()::75 error: -2 cpu: 1
> > [    1.400878] ------------[ cut here ]------------
> > [    1.400884] WARNING: CPU: 1 PID: 91 at lib/kobject.c:249 kobject_add_internal+0x273/0x330
> > [    1.400888] kobject_add_internal failed for card0-DP-3 (error: -2 parent: card0)
> > [    1.400892] Modules linked in:
> > [    1.400896] CPU: 1 PID: 91 Comm: kworker/1:2 Not tainted 4.10.0-rc3-i915borked-dirty #67
> > [    1.400900] Hardware name: LENOVO 20F5S7V800/20F5S7V800, BIOS R02ET50W (1.23 ) 09/20/2016
> > [    1.400906] Workqueue: events_long drm_dp_mst_link_probe_work
> > [    1.400909] Call Trace:
> > [    1.400914]  dump_stack+0x67/0x99
> > [    1.400918]  __warn+0xd1/0xf0
> > [    1.400922]  warn_slowpath_fmt+0x4f/0x60
> > [    1.400925]  kobject_add_internal+0x273/0x330
> > [    1.400927]  kobject_add+0x65/0xb0
> > [    1.400931]  ? klist_init+0x31/0x40
> > [    1.400936]  device_add+0x102/0x5d0
> > [    1.400940]  ? kfree_const+0x22/0x30
> > [    1.400944]  device_create_groups_vargs+0xd8/0x100
> > [    1.400947]  device_create_with_groups+0x36/0x40
> > [    1.400952]  ? vprintk_default+0x29/0x50
> > [    1.400957]  ? __might_sleep+0x4a/0x90
> > [    1.400962]  drm_sysfs_connector_add+0x60/0xe0
> > [    1.400967]  drm_connector_register+0x74/0xd0
> > [    1.400971]  intel_dp_register_mst_connector+0x41/0x50
> > [    1.400975]  drm_dp_add_port+0x350/0x450
> > [    1.400977] drm_connector_register(ffff88040ee6f800)::382 connector->registered: 0 cpu: 2
> > [    1.400982]  ? rcu_early_boot_tests+0x1/0x10
> > [    1.400986]  ? schedule_timeout+0x1cd/0x390
> > [    1.400989]  ? __might_sleep+0x4a/0x90
> > [    1.400992]  ? mutex_lock+0x25/0x50
> > [    1.400995]  ? drm_dp_mst_wait_tx_reply+0x118/0x1e0
> > [    1.400996] drm_sysfs_connector_add() connector: ffff88040ee6f800 kdev: ffff88040eef9c00
> > [    1.401002]  ? prepare_to_wait_event+0x120/0x120
> > [    1.401005]  ? drm_dp_check_mstb_guid+0x3d/0x120
> > [    1.401008]  drm_dp_send_link_address+0x185/0x1f0
> > [    1.401012]  drm_dp_check_and_send_link_address+0xad/0xc0
> > [    1.401015]  drm_dp_mst_link_probe_work+0x57/0xa0
> > [    1.401018]  process_one_work+0x14b/0x430
> > [    1.401021]  worker_thread+0x12b/0x4a0
> > [    1.401025]  kthread+0x10c/0x140
> > [    1.401027]  ? process_one_work+0x430/0x430
> > [    1.401030]  ? kthread_create_on_node+0x40/0x40
> > [    1.401034]  ret_from_fork+0x27/0x40
> > [    1.401038] ---[ end trace ba43fc250fbf282d ]---
> > [    1.401041] drm_sysfs_connector_add() connector: ffff88040c76b000 kdev: fffffffffffffffe
> > [    1.401043] drm_connector_register(ffff88040c768000)::382 connector->registered: 0 cpu: 2
> > [    1.401050] [drm:drm_sysfs_connector_add] *ERROR* failed to register connector device: -2
> > [    1.401057] drm_sysfs_connector_add() connector: ffff88040c768000 kdev: ffff88040eefa000
> > [    1.401093] drm_connector_register(ffff88040c768800)::382 connector->registered: 0 cpu: 2
> > [    1.401113] drm_sysfs_connector_add() connector: ffff88040c768800 kdev: ffff88040eefa400
> > [    1.401122] drm_connector_register(ffff88040c769000)::382 connector->registered: 0 cpu: 2
> > [    1.401140] drm_sysfs_connector_add() connector: ffff88040c769000 kdev: ffff88040eefa800
> > [    1.401167] drm_connector_register(ffff88040c769800)::382 connector->registered: 0 cpu: 2
> > [    1.401186] drm_sysfs_connector_add() connector: ffff88040c769800 kdev: ffff88040eefac00
> > [    1.401195] drm_connector_register(ffff88040c76b000)::382 connector->registered: 0 cpu: 2
>
diff mbox

Patch

diff --git a/drivers/gpu/drm/drm_dp_mst_topology.c b/drivers/gpu/drm/drm_dp_mst_topology.c
index aa64448..85beebc 100644
--- a/drivers/gpu/drm/drm_dp_mst_topology.c
+++ b/drivers/gpu/drm/drm_dp_mst_topology.c
@@ -2914,13 +2914,14 @@  static void drm_dp_destroy_connector_work(struct work_struct *work)
 			break;
 		}
 		list_del(&port->next);
-		mutex_unlock(&mgr->destroy_connector_lock);
 
 		kref_init(&port->kref);
 		INIT_LIST_HEAD(&port->next);
 
 		mgr->cbs->destroy_connector(mgr, port->connector);
 
+		mutex_unlock(&mgr->destroy_connector_lock);
+
 		drm_dp_port_teardown_pdt(port, port->pdt);
 		port->pdt = DP_PEER_DEVICE_NONE;