diff mbox

drm/i915/guc: Protect against no desc-pool on premature shutdown

Message ID 20180713172658.14070-1-chris@chris-wilson.co.uk (mailing list archive)
State New, archived
Headers show

Commit Message

Chris Wilson July 13, 2018, 5:26 p.m. UTC
Hopefully the final hack to get guc fault-injection happy before we can
clean it up again, starting from a known good baseline...

[  383.017530] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a0
[  383.017556] Oops: 0000 [#1] PREEMPT SMP PTI
[  383.017566] CPU: 7 PID: 4725 Comm: drv_module_relo Tainted: G     U            4.18.0-rc4-CI-CI_DRM_4485+ #1
[  383.017581] Hardware name: Micro-Star International Co., Ltd. MS-7B54/Z370M MORTAR (MS-7B54), BIOS 1.10 12/28/2017
[  383.017664] RIP: 0010:guc_stage_desc_pool_destroy+0x17/0xe0 [i915]
[  383.017674] Code: 59 a0 c6 05 02 59 18 00 01 e8 5e 01 c3 e0 eb b1 0f 1f 00 53 48 89 fb 48 81 c7 90 02 00 00 e8 60 64 45 e1 48 8b 83 80 02 00 00 <48> 8b 80 a0 00 00 00 48 8b 90 68 02 00 00 48 83 ea 01 48 81 fa ff
[  383.017771] RSP: 0018:ffffc900004bbdd0 EFLAGS: 00010282
[  383.017782] RAX: 0000000000000000 RBX: ffff88012ff41300 RCX: 0000000000000000
[  383.017794] RDX: 0000000000000000 RSI: ffffc900004bbd80 RDI: 0000000000000000
[  383.017805] RBP: ffff88012ff40000 R08: 00000000d876ee11 R09: 0000000000000000
[  383.017817] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88012ff47770
[  383.017828] R13: ffff88012ff40068 R14: ffff880264392ef8 R15: ffffffffa0639950
[  383.017840] FS:  00007fb9c18c8980(0000) GS:ffff8802663c0000(0000) knlGS:0000000000000000
[  383.017853] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  383.017864] CR2: 00000000000000a0 CR3: 00000001df6cc003 CR4: 00000000003606e0
[  383.017875] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  383.017887] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  383.017898] Call Trace:
[  383.017962]  intel_uc_fini+0x34/0xd0 [i915]
[  383.018020]  i915_gem_fini+0x5c/0x100 [i915]
[  383.018093]  i915_driver_unload+0xd2/0x110 [i915]
[  383.018150]  i915_pci_remove+0x10/0x20 [i915]
[  383.018165]  pci_device_remove+0x36/0xb0
[  383.018179]  device_release_driver_internal+0x185/0x250
[  383.018193]  driver_detach+0x35/0x70
[  383.018205]  bus_remove_driver+0x53/0xd0
[  383.018217]  pci_unregister_driver+0x25/0xa0
[  383.018232]  __se_sys_delete_module+0x162/0x210
[  383.018245]  ? do_syscall_64+0xd/0x190
[  383.018257]  do_syscall_64+0x55/0x190
[  383.018270]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  383.018282] RIP: 0033:0x7fb9c0f7c1b7
[  383.018290] Code: 73 01 c3 48 8b 0d d1 8c 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a1 8c 2c 00 f7 d8 64 89 01 48
[  383.018408] RSP: 002b:00007fffa01c2aa8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[  383.018425] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb9c0f7c1b7
[  383.018440] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 0000560b96856d48
[  383.018454] RBP: 0000560b96856ce0 R08: 0000560b96856d4c R09: 00007fffa01c2ae8
[  383.018468] R10: 00007fffa01c1aa4 R11: 0000000000000206 R12: 0000560b954f7470

Testcase: igt/drv_module_reload/basic-reload-inject
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
---
 drivers/gpu/drm/i915/intel_guc_submission.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Michal Wajdeczko July 13, 2018, 5:48 p.m. UTC | #1
On Fri, 13 Jul 2018 19:26:58 +0200, Chris Wilson  
<chris@chris-wilson.co.uk> wrote:

> Hopefully the final hack to get guc fault-injection happy before we can
> clean it up again, starting from a known good baseline...
>
> [  383.017530] BUG: unable to handle kernel NULL pointer dereference at  
> 00000000000000a0
> [  383.017556] Oops: 0000 [#1] PREEMPT SMP PTI
> [  383.017566] CPU: 7 PID: 4725 Comm: drv_module_relo Tainted: G      
> U            4.18.0-rc4-CI-CI_DRM_4485+ #1
> [  383.017581] Hardware name: Micro-Star International Co., Ltd.  
> MS-7B54/Z370M MORTAR (MS-7B54), BIOS 1.10 12/28/2017
> [  383.017664] RIP: 0010:guc_stage_desc_pool_destroy+0x17/0xe0 [i915]
> [  383.017674] Code: 59 a0 c6 05 02 59 18 00 01 e8 5e 01 c3 e0 eb b1 0f  
> 1f 00 53 48 89 fb 48 81 c7 90 02 00 00 e8 60 64 45 e1 48 8b 83 80 02 00  
> 00 <48> 8b 80 a0 00 00 00 48 8b 90 68 02 00 00 48 83 ea 01 48 81 fa ff
> [  383.017771] RSP: 0018:ffffc900004bbdd0 EFLAGS: 00010282
> [  383.017782] RAX: 0000000000000000 RBX: ffff88012ff41300 RCX:  
> 0000000000000000
> [  383.017794] RDX: 0000000000000000 RSI: ffffc900004bbd80 RDI:  
> 0000000000000000
> [  383.017805] RBP: ffff88012ff40000 R08: 00000000d876ee11 R09:  
> 0000000000000000
> [  383.017817] R10: 0000000000000000 R11: 0000000000000000 R12:  
> ffff88012ff47770
> [  383.017828] R13: ffff88012ff40068 R14: ffff880264392ef8 R15:  
> ffffffffa0639950
> [  383.017840] FS:  00007fb9c18c8980(0000) GS:ffff8802663c0000(0000)  
> knlGS:0000000000000000
> [  383.017853] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  383.017864] CR2: 00000000000000a0 CR3: 00000001df6cc003 CR4:  
> 00000000003606e0
> [  383.017875] DR0: 0000000000000000 DR1: 0000000000000000 DR2:  
> 0000000000000000
> [  383.017887] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:  
> 0000000000000400
> [  383.017898] Call Trace:
> [  383.017962]  intel_uc_fini+0x34/0xd0 [i915]
> [  383.018020]  i915_gem_fini+0x5c/0x100 [i915]
> [  383.018093]  i915_driver_unload+0xd2/0x110 [i915]
> [  383.018150]  i915_pci_remove+0x10/0x20 [i915]
> [  383.018165]  pci_device_remove+0x36/0xb0
> [  383.018179]  device_release_driver_internal+0x185/0x250
> [  383.018193]  driver_detach+0x35/0x70
> [  383.018205]  bus_remove_driver+0x53/0xd0
> [  383.018217]  pci_unregister_driver+0x25/0xa0
> [  383.018232]  __se_sys_delete_module+0x162/0x210
> [  383.018245]  ? do_syscall_64+0xd/0x190
> [  383.018257]  do_syscall_64+0x55/0x190
> [  383.018270]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  383.018282] RIP: 0033:0x7fb9c0f7c1b7
> [  383.018290] Code: 73 01 c3 48 8b 0d d1 8c 2c 00 f7 d8 64 89 01 48 83  
> c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f  
> 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a1 8c 2c 00 f7 d8 64 89 01 48
> [  383.018408] RSP: 002b:00007fffa01c2aa8 EFLAGS: 00000206 ORIG_RAX:  
> 00000000000000b0
> [  383.018425] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:  
> 00007fb9c0f7c1b7
> [  383.018440] RDX: 0000000000000000 RSI: 0000000000000800 RDI:  
> 0000560b96856d48
> [  383.018454] RBP: 0000560b96856ce0 R08: 0000560b96856d4c R09:  
> 00007fffa01c2ae8
> [  383.018468] R10: 00007fffa01c1aa4 R11: 0000000000000206 R12:  
> 0000560b954f7470
>
> Testcase: igt/drv_module_reload/basic-reload-inject
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_guc_submission.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_guc_submission.c  
> b/drivers/gpu/drm/i915/intel_guc_submission.c
> index 22367131d6a1..cc444dc5f3ad 100644
> --- a/drivers/gpu/drm/i915/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/intel_guc_submission.c
> @@ -1184,7 +1184,8 @@ void intel_guc_submission_fini(struct intel_guc  
> *guc)
>  	guc_clients_destroy(guc);
>  	WARN_ON(!guc_verify_doorbells(guc));
> -	guc_stage_desc_pool_destroy(guc);
> +	if (guc->stage_desc_pool)
> +		guc_stage_desc_pool_destroy(guc);

As short-term hack this is probably ok, but maybe to avoid such case by
case hacks we should add single flag at UC level (intel_uc_init) that we
have completed our initialization and then use this flag at cleanup phase
(intel_uc_fini) just once.

Michal

ps.
I recall some earlier reviews saying that using "if" at fini is wrong ;)
Chris Wilson July 13, 2018, 5:52 p.m. UTC | #2
Quoting Michal Wajdeczko (2018-07-13 18:48:05)
> On Fri, 13 Jul 2018 19:26:58 +0200, Chris Wilson  
> <chris@chris-wilson.co.uk> wrote:
> 
> > Hopefully the final hack to get guc fault-injection happy before we can
> > clean it up again, starting from a known good baseline...
> >
> > [  383.017530] BUG: unable to handle kernel NULL pointer dereference at  
> > 00000000000000a0
> > [  383.017556] Oops: 0000 [#1] PREEMPT SMP PTI
> > [  383.017566] CPU: 7 PID: 4725 Comm: drv_module_relo Tainted: G      
> > U            4.18.0-rc4-CI-CI_DRM_4485+ #1
> > [  383.017581] Hardware name: Micro-Star International Co., Ltd.  
> > MS-7B54/Z370M MORTAR (MS-7B54), BIOS 1.10 12/28/2017
> > [  383.017664] RIP: 0010:guc_stage_desc_pool_destroy+0x17/0xe0 [i915]
> > [  383.017674] Code: 59 a0 c6 05 02 59 18 00 01 e8 5e 01 c3 e0 eb b1 0f  
> > 1f 00 53 48 89 fb 48 81 c7 90 02 00 00 e8 60 64 45 e1 48 8b 83 80 02 00  
> > 00 <48> 8b 80 a0 00 00 00 48 8b 90 68 02 00 00 48 83 ea 01 48 81 fa ff
> > [  383.017771] RSP: 0018:ffffc900004bbdd0 EFLAGS: 00010282
> > [  383.017782] RAX: 0000000000000000 RBX: ffff88012ff41300 RCX:  
> > 0000000000000000
> > [  383.017794] RDX: 0000000000000000 RSI: ffffc900004bbd80 RDI:  
> > 0000000000000000
> > [  383.017805] RBP: ffff88012ff40000 R08: 00000000d876ee11 R09:  
> > 0000000000000000
> > [  383.017817] R10: 0000000000000000 R11: 0000000000000000 R12:  
> > ffff88012ff47770
> > [  383.017828] R13: ffff88012ff40068 R14: ffff880264392ef8 R15:  
> > ffffffffa0639950
> > [  383.017840] FS:  00007fb9c18c8980(0000) GS:ffff8802663c0000(0000)  
> > knlGS:0000000000000000
> > [  383.017853] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  383.017864] CR2: 00000000000000a0 CR3: 00000001df6cc003 CR4:  
> > 00000000003606e0
> > [  383.017875] DR0: 0000000000000000 DR1: 0000000000000000 DR2:  
> > 0000000000000000
> > [  383.017887] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:  
> > 0000000000000400
> > [  383.017898] Call Trace:
> > [  383.017962]  intel_uc_fini+0x34/0xd0 [i915]
> > [  383.018020]  i915_gem_fini+0x5c/0x100 [i915]
> > [  383.018093]  i915_driver_unload+0xd2/0x110 [i915]
> > [  383.018150]  i915_pci_remove+0x10/0x20 [i915]
> > [  383.018165]  pci_device_remove+0x36/0xb0
> > [  383.018179]  device_release_driver_internal+0x185/0x250
> > [  383.018193]  driver_detach+0x35/0x70
> > [  383.018205]  bus_remove_driver+0x53/0xd0
> > [  383.018217]  pci_unregister_driver+0x25/0xa0
> > [  383.018232]  __se_sys_delete_module+0x162/0x210
> > [  383.018245]  ? do_syscall_64+0xd/0x190
> > [  383.018257]  do_syscall_64+0x55/0x190
> > [  383.018270]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > [  383.018282] RIP: 0033:0x7fb9c0f7c1b7
> > [  383.018290] Code: 73 01 c3 48 8b 0d d1 8c 2c 00 f7 d8 64 89 01 48 83  
> > c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f  
> > 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a1 8c 2c 00 f7 d8 64 89 01 48
> > [  383.018408] RSP: 002b:00007fffa01c2aa8 EFLAGS: 00000206 ORIG_RAX:  
> > 00000000000000b0
> > [  383.018425] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:  
> > 00007fb9c0f7c1b7
> > [  383.018440] RDX: 0000000000000000 RSI: 0000000000000800 RDI:  
> > 0000560b96856d48
> > [  383.018454] RBP: 0000560b96856ce0 R08: 0000560b96856d4c R09:  
> > 00007fffa01c2ae8
> > [  383.018468] R10: 00007fffa01c1aa4 R11: 0000000000000206 R12:  
> > 0000560b954f7470
> >
> > Testcase: igt/drv_module_reload/basic-reload-inject
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > ---
> >  drivers/gpu/drm/i915/intel_guc_submission.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/i915/intel_guc_submission.c  
> > b/drivers/gpu/drm/i915/intel_guc_submission.c
> > index 22367131d6a1..cc444dc5f3ad 100644
> > --- a/drivers/gpu/drm/i915/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/intel_guc_submission.c
> > @@ -1184,7 +1184,8 @@ void intel_guc_submission_fini(struct intel_guc  
> > *guc)
> >       guc_clients_destroy(guc);
> >       WARN_ON(!guc_verify_doorbells(guc));
> > -     guc_stage_desc_pool_destroy(guc);
> > +     if (guc->stage_desc_pool)
> > +             guc_stage_desc_pool_destroy(guc);
> 
> As short-term hack this is probably ok, but maybe to avoid such case by
> case hacks we should add single flag at UC level (intel_uc_init) that we
> have completed our initialization and then use this flag at cleanup phase
> (intel_uc_fini) just once.
> 
> Michal
> 
> ps.
> I recall some earlier reviews saying that using "if" at fini is wrong ;)

It is so wrong, and I'm cutting every corner in order to get the test in
place for someone else to clean up. Judging by the number of bugs that
have swept by with the non-functional drv_module_reload fault-injection,
we need to get it back working.
-Chris
Rodrigo Vivi July 13, 2018, 5:54 p.m. UTC | #3
On Fri, Jul 13, 2018 at 06:26:58PM +0100, Chris Wilson wrote:
> Hopefully the final hack to get guc fault-injection happy before we can
> clean it up again, starting from a known good baseline...
> 
> [  383.017530] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a0
> [  383.017556] Oops: 0000 [#1] PREEMPT SMP PTI
> [  383.017566] CPU: 7 PID: 4725 Comm: drv_module_relo Tainted: G     U            4.18.0-rc4-CI-CI_DRM_4485+ #1
> [  383.017581] Hardware name: Micro-Star International Co., Ltd. MS-7B54/Z370M MORTAR (MS-7B54), BIOS 1.10 12/28/2017
> [  383.017664] RIP: 0010:guc_stage_desc_pool_destroy+0x17/0xe0 [i915]
> [  383.017674] Code: 59 a0 c6 05 02 59 18 00 01 e8 5e 01 c3 e0 eb b1 0f 1f 00 53 48 89 fb 48 81 c7 90 02 00 00 e8 60 64 45 e1 48 8b 83 80 02 00 00 <48> 8b 80 a0 00 00 00 48 8b 90 68 02 00 00 48 83 ea 01 48 81 fa ff
> [  383.017771] RSP: 0018:ffffc900004bbdd0 EFLAGS: 00010282
> [  383.017782] RAX: 0000000000000000 RBX: ffff88012ff41300 RCX: 0000000000000000
> [  383.017794] RDX: 0000000000000000 RSI: ffffc900004bbd80 RDI: 0000000000000000
> [  383.017805] RBP: ffff88012ff40000 R08: 00000000d876ee11 R09: 0000000000000000
> [  383.017817] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88012ff47770
> [  383.017828] R13: ffff88012ff40068 R14: ffff880264392ef8 R15: ffffffffa0639950
> [  383.017840] FS:  00007fb9c18c8980(0000) GS:ffff8802663c0000(0000) knlGS:0000000000000000
> [  383.017853] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  383.017864] CR2: 00000000000000a0 CR3: 00000001df6cc003 CR4: 00000000003606e0
> [  383.017875] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  383.017887] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  383.017898] Call Trace:
> [  383.017962]  intel_uc_fini+0x34/0xd0 [i915]
> [  383.018020]  i915_gem_fini+0x5c/0x100 [i915]
> [  383.018093]  i915_driver_unload+0xd2/0x110 [i915]
> [  383.018150]  i915_pci_remove+0x10/0x20 [i915]
> [  383.018165]  pci_device_remove+0x36/0xb0
> [  383.018179]  device_release_driver_internal+0x185/0x250
> [  383.018193]  driver_detach+0x35/0x70
> [  383.018205]  bus_remove_driver+0x53/0xd0
> [  383.018217]  pci_unregister_driver+0x25/0xa0
> [  383.018232]  __se_sys_delete_module+0x162/0x210
> [  383.018245]  ? do_syscall_64+0xd/0x190
> [  383.018257]  do_syscall_64+0x55/0x190
> [  383.018270]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  383.018282] RIP: 0033:0x7fb9c0f7c1b7
> [  383.018290] Code: 73 01 c3 48 8b 0d d1 8c 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a1 8c 2c 00 f7 d8 64 89 01 48
> [  383.018408] RSP: 002b:00007fffa01c2aa8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
> [  383.018425] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb9c0f7c1b7
> [  383.018440] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 0000560b96856d48
> [  383.018454] RBP: 0000560b96856ce0 R08: 0000560b96856d4c R09: 00007fffa01c2ae8
> [  383.018468] R10: 00007fffa01c1aa4 R11: 0000000000000206 R12: 0000560b954f7470
> 
> Testcase: igt/drv_module_reload/basic-reload-inject
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_guc_submission.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_guc_submission.c b/drivers/gpu/drm/i915/intel_guc_submission.c
> index 22367131d6a1..cc444dc5f3ad 100644
> --- a/drivers/gpu/drm/i915/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/intel_guc_submission.c
> @@ -1184,7 +1184,8 @@ void intel_guc_submission_fini(struct intel_guc *guc)
>  	guc_clients_destroy(guc);
>  	WARN_ON(!guc_verify_doorbells(guc));
>  
> -	guc_stage_desc_pool_destroy(guc);
> +	if (guc->stage_desc_pool)
> +		guc_stage_desc_pool_destroy(guc);

it seems all this fini vs init could have something more solid

but these checks are really needed on the current code and I have
no other suggestion, so:

Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

>  }
>  
>  static void guc_interrupts_capture(struct drm_i915_private *dev_priv)
> -- 
> 2.18.0
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Michal Wajdeczko July 13, 2018, 5:57 p.m. UTC | #4
On Fri, 13 Jul 2018 19:52:09 +0200, Chris Wilson  
<chris@chris-wilson.co.uk> wrote:

> Quoting Michal Wajdeczko (2018-07-13 18:48:05)
>> On Fri, 13 Jul 2018 19:26:58 +0200, Chris Wilson
>> <chris@chris-wilson.co.uk> wrote:
>>
>> > Hopefully the final hack to get guc fault-injection happy before we  
>> can
>> > clean it up again, starting from a known good baseline...
>> >
>> > [  383.017530] BUG: unable to handle kernel NULL pointer dereference  
>> at
>> > 00000000000000a0
>> > [  383.017556] Oops: 0000 [#1] PREEMPT SMP PTI
>> > [  383.017566] CPU: 7 PID: 4725 Comm: drv_module_relo Tainted: G
>> > U            4.18.0-rc4-CI-CI_DRM_4485+ #1
>> > [  383.017581] Hardware name: Micro-Star International Co., Ltd.
>> > MS-7B54/Z370M MORTAR (MS-7B54), BIOS 1.10 12/28/2017
>> > [  383.017664] RIP: 0010:guc_stage_desc_pool_destroy+0x17/0xe0 [i915]
>> > [  383.017674] Code: 59 a0 c6 05 02 59 18 00 01 e8 5e 01 c3 e0 eb b1  
>> 0f
>> > 1f 00 53 48 89 fb 48 81 c7 90 02 00 00 e8 60 64 45 e1 48 8b 83 80 02  
>> 00
>> > 00 <48> 8b 80 a0 00 00 00 48 8b 90 68 02 00 00 48 83 ea 01 48 81 fa ff
>> > [  383.017771] RSP: 0018:ffffc900004bbdd0 EFLAGS: 00010282
>> > [  383.017782] RAX: 0000000000000000 RBX: ffff88012ff41300 RCX:
>> > 0000000000000000
>> > [  383.017794] RDX: 0000000000000000 RSI: ffffc900004bbd80 RDI:
>> > 0000000000000000
>> > [  383.017805] RBP: ffff88012ff40000 R08: 00000000d876ee11 R09:
>> > 0000000000000000
>> > [  383.017817] R10: 0000000000000000 R11: 0000000000000000 R12:
>> > ffff88012ff47770
>> > [  383.017828] R13: ffff88012ff40068 R14: ffff880264392ef8 R15:
>> > ffffffffa0639950
>> > [  383.017840] FS:  00007fb9c18c8980(0000) GS:ffff8802663c0000(0000)
>> > knlGS:0000000000000000
>> > [  383.017853] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > [  383.017864] CR2: 00000000000000a0 CR3: 00000001df6cc003 CR4:
>> > 00000000003606e0
>> > [  383.017875] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> > 0000000000000000
>> > [  383.017887] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>> > 0000000000000400
>> > [  383.017898] Call Trace:
>> > [  383.017962]  intel_uc_fini+0x34/0xd0 [i915]
>> > [  383.018020]  i915_gem_fini+0x5c/0x100 [i915]
>> > [  383.018093]  i915_driver_unload+0xd2/0x110 [i915]
>> > [  383.018150]  i915_pci_remove+0x10/0x20 [i915]
>> > [  383.018165]  pci_device_remove+0x36/0xb0
>> > [  383.018179]  device_release_driver_internal+0x185/0x250
>> > [  383.018193]  driver_detach+0x35/0x70
>> > [  383.018205]  bus_remove_driver+0x53/0xd0
>> > [  383.018217]  pci_unregister_driver+0x25/0xa0
>> > [  383.018232]  __se_sys_delete_module+0x162/0x210
>> > [  383.018245]  ? do_syscall_64+0xd/0x190
>> > [  383.018257]  do_syscall_64+0x55/0x190
>> > [  383.018270]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> > [  383.018282] RIP: 0033:0x7fb9c0f7c1b7
>> > [  383.018290] Code: 73 01 c3 48 8b 0d d1 8c 2c 00 f7 d8 64 89 01 48  
>> 83
>> > c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00  
>> 0f
>> > 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a1 8c 2c 00 f7 d8 64 89 01 48
>> > [  383.018408] RSP: 002b:00007fffa01c2aa8 EFLAGS: 00000206 ORIG_RAX:
>> > 00000000000000b0
>> > [  383.018425] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
>> > 00007fb9c0f7c1b7
>> > [  383.018440] RDX: 0000000000000000 RSI: 0000000000000800 RDI:
>> > 0000560b96856d48
>> > [  383.018454] RBP: 0000560b96856ce0 R08: 0000560b96856d4c R09:
>> > 00007fffa01c2ae8
>> > [  383.018468] R10: 00007fffa01c1aa4 R11: 0000000000000206 R12:
>> > 0000560b954f7470
>> >
>> > Testcase: igt/drv_module_reload/basic-reload-inject
>> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>> > Cc: Michał Winiarski <michal.winiarski@intel.com>
>> > Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
>> > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
>> > ---
>> >  drivers/gpu/drm/i915/intel_guc_submission.c | 3 ++-
>> >  1 file changed, 2 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/drivers/gpu/drm/i915/intel_guc_submission.c
>> > b/drivers/gpu/drm/i915/intel_guc_submission.c
>> > index 22367131d6a1..cc444dc5f3ad 100644
>> > --- a/drivers/gpu/drm/i915/intel_guc_submission.c
>> > +++ b/drivers/gpu/drm/i915/intel_guc_submission.c
>> > @@ -1184,7 +1184,8 @@ void intel_guc_submission_fini(struct intel_guc
>> > *guc)
>> >       guc_clients_destroy(guc);
>> >       WARN_ON(!guc_verify_doorbells(guc));
>> > -     guc_stage_desc_pool_destroy(guc);
>> > +     if (guc->stage_desc_pool)
>> > +             guc_stage_desc_pool_destroy(guc);
>>
>> As short-term hack this is probably ok, but maybe to avoid such case by
>> case hacks we should add single flag at UC level (intel_uc_init) that we
>> have completed our initialization and then use this flag at cleanup  
>> phase
>> (intel_uc_fini) just once.
>>
>> Michal
>>
>> ps.
>> I recall some earlier reviews saying that using "if" at fini is wrong ;)
>
> It is so wrong, and I'm cutting every corner in order to get the test in
> place for someone else to clean up. Judging by the number of bugs that
> have swept by with the non-functional drv_module_reload fault-injection,
> we need to get it back working.

ok, I hope 'someone' will find time to finish this cleanup, so

Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Chris Wilson July 13, 2018, 6:24 p.m. UTC | #5
Quoting Michal Wajdeczko (2018-07-13 18:57:33)
> On Fri, 13 Jul 2018 19:52:09 +0200, Chris Wilson  
> <chris@chris-wilson.co.uk> wrote:
> 
> > Quoting Michal Wajdeczko (2018-07-13 18:48:05)
> >> On Fri, 13 Jul 2018 19:26:58 +0200, Chris Wilson
> >> <chris@chris-wilson.co.uk> wrote:
> >>
> >> > Hopefully the final hack to get guc fault-injection happy before we  
> >> can
> >> > clean it up again, starting from a known good baseline...
> >> >
> >> > [  383.017530] BUG: unable to handle kernel NULL pointer dereference  
> >> at
> >> > 00000000000000a0
> >> > [  383.017556] Oops: 0000 [#1] PREEMPT SMP PTI
> >> > [  383.017566] CPU: 7 PID: 4725 Comm: drv_module_relo Tainted: G
> >> > U            4.18.0-rc4-CI-CI_DRM_4485+ #1
> >> > [  383.017581] Hardware name: Micro-Star International Co., Ltd.
> >> > MS-7B54/Z370M MORTAR (MS-7B54), BIOS 1.10 12/28/2017
> >> > [  383.017664] RIP: 0010:guc_stage_desc_pool_destroy+0x17/0xe0 [i915]
> >> > [  383.017674] Code: 59 a0 c6 05 02 59 18 00 01 e8 5e 01 c3 e0 eb b1  
> >> 0f
> >> > 1f 00 53 48 89 fb 48 81 c7 90 02 00 00 e8 60 64 45 e1 48 8b 83 80 02  
> >> 00
> >> > 00 <48> 8b 80 a0 00 00 00 48 8b 90 68 02 00 00 48 83 ea 01 48 81 fa ff
> >> > [  383.017771] RSP: 0018:ffffc900004bbdd0 EFLAGS: 00010282
> >> > [  383.017782] RAX: 0000000000000000 RBX: ffff88012ff41300 RCX:
> >> > 0000000000000000
> >> > [  383.017794] RDX: 0000000000000000 RSI: ffffc900004bbd80 RDI:
> >> > 0000000000000000
> >> > [  383.017805] RBP: ffff88012ff40000 R08: 00000000d876ee11 R09:
> >> > 0000000000000000
> >> > [  383.017817] R10: 0000000000000000 R11: 0000000000000000 R12:
> >> > ffff88012ff47770
> >> > [  383.017828] R13: ffff88012ff40068 R14: ffff880264392ef8 R15:
> >> > ffffffffa0639950
> >> > [  383.017840] FS:  00007fb9c18c8980(0000) GS:ffff8802663c0000(0000)
> >> > knlGS:0000000000000000
> >> > [  383.017853] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> > [  383.017864] CR2: 00000000000000a0 CR3: 00000001df6cc003 CR4:
> >> > 00000000003606e0
> >> > [  383.017875] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> >> > 0000000000000000
> >> > [  383.017887] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> >> > 0000000000000400
> >> > [  383.017898] Call Trace:
> >> > [  383.017962]  intel_uc_fini+0x34/0xd0 [i915]
> >> > [  383.018020]  i915_gem_fini+0x5c/0x100 [i915]
> >> > [  383.018093]  i915_driver_unload+0xd2/0x110 [i915]
> >> > [  383.018150]  i915_pci_remove+0x10/0x20 [i915]
> >> > [  383.018165]  pci_device_remove+0x36/0xb0
> >> > [  383.018179]  device_release_driver_internal+0x185/0x250
> >> > [  383.018193]  driver_detach+0x35/0x70
> >> > [  383.018205]  bus_remove_driver+0x53/0xd0
> >> > [  383.018217]  pci_unregister_driver+0x25/0xa0
> >> > [  383.018232]  __se_sys_delete_module+0x162/0x210
> >> > [  383.018245]  ? do_syscall_64+0xd/0x190
> >> > [  383.018257]  do_syscall_64+0x55/0x190
> >> > [  383.018270]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >> > [  383.018282] RIP: 0033:0x7fb9c0f7c1b7
> >> > [  383.018290] Code: 73 01 c3 48 8b 0d d1 8c 2c 00 f7 d8 64 89 01 48  
> >> 83
> >> > c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00  
> >> 0f
> >> > 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a1 8c 2c 00 f7 d8 64 89 01 48
> >> > [  383.018408] RSP: 002b:00007fffa01c2aa8 EFLAGS: 00000206 ORIG_RAX:
> >> > 00000000000000b0
> >> > [  383.018425] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
> >> > 00007fb9c0f7c1b7
> >> > [  383.018440] RDX: 0000000000000000 RSI: 0000000000000800 RDI:
> >> > 0000560b96856d48
> >> > [  383.018454] RBP: 0000560b96856ce0 R08: 0000560b96856d4c R09:
> >> > 00007fffa01c2ae8
> >> > [  383.018468] R10: 00007fffa01c1aa4 R11: 0000000000000206 R12:
> >> > 0000560b954f7470
> >> >
> >> > Testcase: igt/drv_module_reload/basic-reload-inject
> >> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >> > Cc: Michał Winiarski <michal.winiarski@intel.com>
> >> > Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> >> > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> >> > ---
> >> >  drivers/gpu/drm/i915/intel_guc_submission.c | 3 ++-
> >> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/drivers/gpu/drm/i915/intel_guc_submission.c
> >> > b/drivers/gpu/drm/i915/intel_guc_submission.c
> >> > index 22367131d6a1..cc444dc5f3ad 100644
> >> > --- a/drivers/gpu/drm/i915/intel_guc_submission.c
> >> > +++ b/drivers/gpu/drm/i915/intel_guc_submission.c
> >> > @@ -1184,7 +1184,8 @@ void intel_guc_submission_fini(struct intel_guc
> >> > *guc)
> >> >       guc_clients_destroy(guc);
> >> >       WARN_ON(!guc_verify_doorbells(guc));
> >> > -     guc_stage_desc_pool_destroy(guc);
> >> > +     if (guc->stage_desc_pool)
> >> > +             guc_stage_desc_pool_destroy(guc);
> >>
> >> As short-term hack this is probably ok, but maybe to avoid such case by
> >> case hacks we should add single flag at UC level (intel_uc_init) that we
> >> have completed our initialization and then use this flag at cleanup  
> >> phase
> >> (intel_uc_fini) just once.
> >>
> >> Michal
> >>
> >> ps.
> >> I recall some earlier reviews saying that using "if" at fini is wrong ;)
> >
> > It is so wrong, and I'm cutting every corner in order to get the test in
> > place for someone else to clean up. Judging by the number of bugs that
> > have swept by with the non-functional drv_module_reload fault-injection,
> > we need to get it back working.
> 
> ok, I hope 'someone' will find time to finish this cleanup, so

Let's have a game of musical chairs!

Pushed, now I just need to apply an igt patch and then can try again to
see if drv_module_reload is ready.
-Chris
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/intel_guc_submission.c b/drivers/gpu/drm/i915/intel_guc_submission.c
index 22367131d6a1..cc444dc5f3ad 100644
--- a/drivers/gpu/drm/i915/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/intel_guc_submission.c
@@ -1184,7 +1184,8 @@  void intel_guc_submission_fini(struct intel_guc *guc)
 	guc_clients_destroy(guc);
 	WARN_ON(!guc_verify_doorbells(guc));
 
-	guc_stage_desc_pool_destroy(guc);
+	if (guc->stage_desc_pool)
+		guc_stage_desc_pool_destroy(guc);
 }
 
 static void guc_interrupts_capture(struct drm_i915_private *dev_priv)