diff mbox series

drm/i915/guc: Fix recursive lock in GuC submission

Message ID 20211020192147.8048-1-matthew.brost@intel.com (mailing list archive)
State New, archived
Series drm/i915/guc: Fix recursive lock in GuC submission

Commit Message

Matthew Brost Oct. 20, 2021, 7:21 p.m. UTC
Use __release_guc_id (lock held) rather than release_guc_id (acquires
lock), add lockdep annotations.

[ 213.280129] i915: Running i915_perf_live_selftests/live_noa_gpr
[ 213.283459] ============================================
[ 213.283462] WARNING: possible recursive locking detected
[ 213.283466] 5.15.0-rc6+ #18 Tainted: G U W
[ 213.283470] --------------------------------------------
[ 213.283472] kworker/u24:0/8 is trying to acquire lock:
[ 213.283475] ffff8ffc4f6cc1e8 (&guc->submission_state.lock){....}-{2:2}, at: destroyed_worker_func+0x2df/0x350 [i915]
[ 213.283618] but task is already holding lock:
[ 213.283621] ffff8ffc4f6cc1e8 (&guc->submission_state.lock){....}-{2:2}, at: destroyed_worker_func+0x4f/0x350 [i915]
[ 213.283720] other info that might help us debug this:
[ 213.283724] Possible unsafe locking scenario:
[ 213.283727] CPU0
[ 213.283728] ----
[ 213.283730] lock(&guc->submission_state.lock);
[ 213.283734] lock(&guc->submission_state.lock);
[ 213.283737] *** DEADLOCK ***
[ 213.283740] May be due to missing lock nesting notation
[ 213.283744] 3 locks held by kworker/u24:0/8:
[ 213.283747] #0: ffff8ffb80059d38 ((wq_completion)events_unbound){..}-{0:0}, at: process_one_work+0x1f3/0x550
[ 213.283757] #1: ffffb509000e3e78 ((work_completion)(&guc->submission_state.destroyed_worker)){..}-{0:0}, at: process_one_work+0x1f3/0x550
[ 213.283766] #2: ffff8ffc4f6cc1e8 (&guc->submission_state.lock){....}-{2:2}, at: destroyed_worker_func+0x4f/0x350 [i915]
[ 213.283860] stack backtrace:
[ 213.283863] CPU: 8 PID: 8 Comm: kworker/u24:0 Tainted: G U W 5.15.0-rc6+ #18
[ 213.283868] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 0403 01/26/2021
[ 213.283873] Workqueue: events_unbound destroyed_worker_func [i915]
[ 213.283957] Call Trace:
[ 213.283960] dump_stack_lvl+0x57/0x72
[ 213.283966] __lock_acquire.cold+0x191/0x2d3
[ 213.283972] lock_acquire+0xb5/0x2b0
[ 213.283978] ? destroyed_worker_func+0x2df/0x350 [i915]
[ 213.284059] ? destroyed_worker_func+0x2d7/0x350 [i915]
[ 213.284139] ? lock_release+0xb9/0x280
[ 213.284143] _raw_spin_lock_irqsave+0x48/0x60
[ 213.284148] ? destroyed_worker_func+0x2df/0x350 [i915]
[ 213.284226] destroyed_worker_func+0x2df/0x350 [i915]
[ 213.284310] process_one_work+0x270/0x550
[ 213.284315] worker_thread+0x52/0x3b0
[ 213.284319] ? process_one_work+0x550/0x550
[ 213.284322] kthread+0x135/0x160
[ 213.284326] ? set_kthread_struct+0x40/0x40
[ 213.284331] ret_from_fork+0x1f/0x30

and a bit later in the trace:

[ 227.499864] do_raw_spin_lock+0x94/0xa0
[ 227.499868] _raw_spin_lock_irqsave+0x50/0x60
[ 227.499871] ? guc_flush_destroyed_contexts+0x4f/0xf0 [i915]
[ 227.499995] guc_flush_destroyed_contexts+0x4f/0xf0 [i915]
[ 227.500104] intel_guc_submission_reset_prepare+0x99/0x4b0 [i915]
[ 227.500209] ? mark_held_locks+0x49/0x70
[ 227.500212] intel_uc_reset_prepare+0x46/0x50 [i915]
[ 227.500320] reset_prepare+0x78/0x90 [i915]
[ 227.500412] __intel_gt_set_wedged.part.0+0x13/0xe0 [i915]
[ 227.500485] intel_gt_set_wedged.part.0+0x54/0x100 [i915]
[ 227.500556] intel_gt_set_wedged_on_fini+0x1a/0x30 [i915]
[ 227.500622] intel_gt_driver_unregister+0x1e/0x60 [i915]
[ 227.500694] i915_driver_remove+0x4a/0xf0 [i915]
[ 227.500767] i915_pci_probe+0x84/0x170 [i915]
[ 227.500838] local_pci_probe+0x42/0x80
[ 227.500842] pci_device_probe+0xd9/0x190
[ 227.500844] really_probe+0x1f2/0x3f0
[ 227.500847] __driver_probe_device+0xfe/0x180
[ 227.500848] driver_probe_device+0x1e/0x90
[ 227.500850] __driver_attach+0xc4/0x1d0
[ 227.500851] ? __device_attach_driver+0xe0/0xe0
[ 227.500853] ? __device_attach_driver+0xe0/0xe0
[ 227.500854] bus_for_each_dev+0x64/0x90
[ 227.500856] bus_add_driver+0x12e/0x1f0
[ 227.500857] driver_register+0x8f/0xe0
[ 227.500859] i915_init+0x1d/0x8f [i915]
[ 227.500934] ? 0xffffffffc144a000
[ 227.500936] do_one_initcall+0x58/0x2d0
[ 227.500938] ? rcu_read_lock_sched_held+0x3f/0x80
[ 227.500940] ? kmem_cache_alloc_trace+0x238/0x2d0
[ 227.500944] do_init_module+0x5c/0x270
[ 227.500946] __do_sys_finit_module+0x95/0xe0
[ 227.500949] do_syscall_64+0x38/0x90
[ 227.500951] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 227.500953] RIP: 0033:0x7ffa59d2ae0d
[ 227.500954] Code: c8 0c 00 0f 05 eb a9 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 3b 80 0c 00 f7 d8 64 89 01 48
[ 227.500955] RSP: 002b:00007fff320bbf48 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[ 227.500956] RAX: ffffffffffffffda RBX: 00000000022ea710 RCX: 00007ffa59d2ae0d
[ 227.500957] RDX: 0000000000000000 RSI: 00000000022e1d90 RDI: 0000000000000004
[ 227.500958] RBP: 0000000000000020 R08: 00007ffa59df3a60 R09: 0000000000000070
[ 227.500958] R10: 00000000022e1d90 R11: 0000000000000246 R12: 00000000022e1d90
[ 227.500959] R13: 00000000022e58e0 R14: 0000000000000043 R15: 00000000022e42c0

v2:
 (CI build)
  - Fix build error

Fixes: 1a52faed31311 ("drm/i915/guc: Take engine PM when a context is pinned with GuC submission")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Thomas Hellström Oct. 21, 2021, 5:39 a.m. UTC | #1
On Wed, 2021-10-20 at 12:21 -0700, Matthew Brost wrote:
> Use __release_guc_id (lock held) rather than release_guc_id (acquires
> lock), add lockdep annotations.
> 
> <SNIP>
> 
> v2:
>  (CI build)
>   - Fix build error
> 
> Fixes: 1a52faed31311 ("drm/i915/guc: Take engine PM when a context is
> pinned with GuC submission")
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Cc: stable@vger.kernel.org

Looks correct to me.
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>

> ---
>  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index d7710debcd47..38b47e73e35d 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -2373,6 +2373,7 @@ static inline void guc_lrc_desc_unpin(struct
> intel_context *ce)
>         unsigned long flags;
>         bool disabled;
>  
> +       lockdep_assert_held(&guc->submission_state.lock);
>         GEM_BUG_ON(!intel_gt_pm_is_awake(gt));
>         GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id.id));
>         GEM_BUG_ON(ce != __get_context(guc, ce->guc_id.id));
> @@ -2388,7 +2389,7 @@ static inline void guc_lrc_desc_unpin(struct
> intel_context *ce)
>         }
>         spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>         if (unlikely(disabled)) {
> -               release_guc_id(guc, ce);
> +               __release_guc_id(guc, ce);
>                 __guc_context_destroy(ce);
>                 return;
>         }
Joonas Lahtinen Oct. 25, 2021, 12:23 p.m. UTC | #2
Quoting Thomas Hellström (2021-10-21 08:39:48)
> On Wed, 2021-10-20 at 12:21 -0700, Matthew Brost wrote:

<SNIP>

> > Fixes: 1a52faed31311 ("drm/i915/guc: Take engine PM when a context is
> > pinned with GuC submission")
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Cc: stable@vger.kernel.org

This Cc: stable annotation is unnecessary.

Please always use "dim fixes 1a52faed31311" to help decide which Cc's
are needed. In this case stable is not needed. If it were, there would
be an indication of the kernel version. This one is fine to be picked
up in the drm-intel-next-fixes PR.

Let's pay attention to the right Fixes: annotation while submitting and
reviewing patches.

Regards, Joonas
Matthew Brost Oct. 25, 2021, 5:13 p.m. UTC | #3
On Mon, Oct 25, 2021 at 03:23:00PM +0300, Joonas Lahtinen wrote:
> Quoting Thomas Hellström (2021-10-21 08:39:48)
> > On Wed, 2021-10-20 at 12:21 -0700, Matthew Brost wrote:
> 
> <SNIP>
> 
> > > Fixes: 1a52faed31311 ("drm/i915/guc: Take engine PM when a context is
> > > pinned with GuC submission")
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > Cc: stable@vger.kernel.org
> 
> This Cc: stable annotation is unnecessary.
> 
> Please always use "dim fixes 1a52faed31311" to help decide which Cc's
> are needed. In this case stable is not needed. If it were, there would
> be an indication of the kernel version. This one is fine to be picked
> up in the drm-intel-next-fixes PR.
> 
> Let's pay attention to the right Fixes: annotation while submitting and
> reviewing patches.
> 

Will do. I'm working on getting push rights. Is there any documentation
covering all the rules for pushing? It seems like there are a lot of them.

Matt 

> Regards, Joonas
Joonas Lahtinen Oct. 26, 2021, 9:15 a.m. UTC | #4
Quoting Matthew Brost (2021-10-25 20:13:22)
> On Mon, Oct 25, 2021 at 03:23:00PM +0300, Joonas Lahtinen wrote:
> > Quoting Thomas Hellström (2021-10-21 08:39:48)
> > > On Wed, 2021-10-20 at 12:21 -0700, Matthew Brost wrote:
> > 
> > <SNIP>
> > 
> > > > Fixes: 1a52faed31311 ("drm/i915/guc: Take engine PM when a context is
> > > > pinned with GuC submission")
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > Cc: stable@vger.kernel.org
> > 
> > This Cc: stable annotation is unnecessary.
> > 
> > Please always use "dim fixes 1a52faed31311" to help decide which Cc's
> > are needed. In this case stable is not needed. If it were, there would
> > be an indication of the kernel version. This one is fine to be picked
> > up in the drm-intel-next-fixes PR.
> > 
> > Let's pay attention to the right Fixes: annotation while submitting and
> > reviewing patches.
> > 
> 
> Will do. I'm working on getting push rights. Is there any documentation
> covering all the rules for pushing? It seems like there are a lot of them.

Yes, we have the documentation here:

https://drm.pages.freedesktop.org/maintainer-tools/committer-guidelines.html

And more specifically this topic:

https://drm.pages.freedesktop.org/maintainer-tools/committer-drm-intel.html#labeling-fixes-before-pushing

I would even recommend at least a cursory read through the wider
documentation about how the different trees interact:

https://drm.pages.freedesktop.org/maintainer-tools/index.html

Makes it easier to understand how the tags are used.

Regards, Joonas

> 
> Matt 
> 
> > Regards, Joonas

Patch

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index d7710debcd47..38b47e73e35d 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -2373,6 +2373,7 @@  static inline void guc_lrc_desc_unpin(struct intel_context *ce)
 	unsigned long flags;
 	bool disabled;
 
+	lockdep_assert_held(&guc->submission_state.lock);
 	GEM_BUG_ON(!intel_gt_pm_is_awake(gt));
 	GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id.id));
 	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id.id));
@@ -2388,7 +2389,7 @@  static inline void guc_lrc_desc_unpin(struct intel_context *ce)
 	}
 	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 	if (unlikely(disabled)) {
-		release_guc_id(guc, ce);
+		__release_guc_id(guc, ce);
 		__guc_context_destroy(ce);
 		return;
 	}
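
For readers less familiar with the convention the patch relies on, here is a
minimal userspace sketch of the same bug and fix. It is not the i915 code: the
id_pool struct and every function name below are hypothetical, a POSIX
error-checking mutex stands in for the spinlock, and a plain bool plus assert()
stands in for lockdep_assert_held(). The kernel's leading double underscore is
replaced by a _locked suffix because such names are reserved in userspace C.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for guc->submission_state: a lock and what it guards. */
struct id_pool {
	pthread_mutex_t lock;
	bool lock_held;	/* crude, single-threaded stand-in for lockdep tracking */
	int free_ids;
};

static void id_pool_init(struct id_pool *pool)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	/* An error-checking mutex reports a relock by its owner instead of hanging. */
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	pthread_mutex_init(&pool->lock, &attr);
	pthread_mutexattr_destroy(&attr);
	pool->lock_held = false;
	pool->free_ids = 0;
}

static int id_pool_lock(struct id_pool *pool)
{
	int err = pthread_mutex_lock(&pool->lock);

	if (!err)
		pool->lock_held = true;
	return err;
}

static void id_pool_unlock(struct id_pool *pool)
{
	pool->lock_held = false;
	pthread_mutex_unlock(&pool->lock);
}

/* Caller must already hold pool->lock (the role of __release_guc_id). */
static void release_id_locked(struct id_pool *pool)
{
	assert(pool->lock_held);	/* plays the part of lockdep_assert_held() */
	pool->free_ids++;
}

/* Takes pool->lock itself (the role of release_guc_id). */
static int release_id(struct id_pool *pool)
{
	int err = id_pool_lock(pool);

	if (err)
		return err;	/* EDEADLK if the caller already owns the lock */
	release_id_locked(pool);
	id_pool_unlock(pool);
	return 0;
}

/* Before the fix: the worker holds the lock and calls the locking variant. */
static int worker_buggy(struct id_pool *pool)
{
	int err;

	id_pool_lock(pool);
	err = release_id(pool);	/* second acquisition of the same lock */
	id_pool_unlock(pool);
	return err;
}

/* After the fix: the worker holds the lock and calls the lock-held variant. */
static int worker_fixed(struct id_pool *pool)
{
	id_pool_lock(pool);
	release_id_locked(pool);
	id_pool_unlock(pool);
	return 0;
}
```

With an initialized pool, worker_buggy() should return EDEADLK and leave
free_ids untouched, while worker_fixed() returns 0 and increments it. A real
spinlock has no such error return and would simply spin, which is why the
kernel needs lockdep to flag the recursion.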