[v3,0/2] Fix error propagation amongst request

Message ID 20230228021142.1905349-1-andi.shyti@linux.intel.com (mailing list archive)

Andi Shyti Feb. 28, 2023, 2:11 a.m. UTC
Hi,

This series of two patches fixes the issue introduced in
cf586021642d80 ("drm/i915/gt: Pipelined page migration") where,
as reported by Matt, in a chain of requests an error is reported
only if it happens in the last request.

However, Chris noticed that without ensuring exclusivity in the
locking we might end up in a deadlock. That's why patch 1
throttles for ring space prior to taking the timeline mutex, so
that no one ends up waiting for space while holding the lock.

Version 1 of this series was reviewed by Matt, and this version
adds Chris' exclusive locking.

Thanks, Chris, for this work.

Andi

Changelog
=========
v1 -> v2
 - Add patch 1 for ensuring exclusive locking of the timeline
 - Reword git commit of patch 2.

Andi Shyti (1):
  drm/i915/gt: Make sure that errors are propagated through request
    chains

Chris Wilson (1):
  drm/i915: Throttle for ringspace prior to taking the timeline mutex

 drivers/gpu/drm/i915/gt/intel_context.c | 41 +++++++++++++++++++++++++
 drivers/gpu/drm/i915/gt/intel_context.h |  2 ++
 drivers/gpu/drm/i915/gt/intel_migrate.c | 39 +++++++++++++++++------
 drivers/gpu/drm/i915/i915_request.c     |  3 ++
 4 files changed, 75 insertions(+), 10 deletions(-)

Comments

Gwan-gyeong Mun March 7, 2023, 7:33 a.m. UTC | #1
Hi Andi,

After applying these two patches, a deadlock is detected in the
call stack below. Please check whether the changes to the
intel_context_migrate_copy() path introduced the deadlock.


https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v1/bat-dg2-8/igt@i915_module_load@load.html#dmesg-warnings1037

<4> [33.070967] ============================================
<4> [33.070968] WARNING: possible recursive locking detected
<4> [33.070969] 6.2.0-Patchwork_114451v1-g8589fd9227ca+ #1 Not tainted
<4> [33.070970] --------------------------------------------
<4> [33.070971] i915_module_loa/948 is trying to acquire lock:
<4> [33.070972] ffff8881127f0478 (migrate){+.+.}-{3:3}, at: 
i915_request_create+0x1c6/0x230 [i915]
<4> [33.071215]
but task is already holding lock:
<4> [33.071235] ffff8881127f0478 (migrate){+.+.}-{3:3}, at: 
intel_context_migrate_copy+0x1b3/0xa80 [i915]
<4> [33.071484]
other info that might help us debug this:
<4> [33.071504]  Possible unsafe locking scenario:
<4> [33.071522]        CPU0
<4> [33.071532]        ----
<4> [33.071541]   lock(migrate);
<4> [33.071554]   lock(migrate);
<4> [33.071567]
  *** DEADLOCK ***
<4> [33.071585]  May be due to missing lock nesting notation
<4> [33.071606] 3 locks held by i915_module_loa/948:
<4> [33.071622]  #0: ffffc90001eb7b70 
(reservation_ww_class_acquire){+.+.}-{0:0}, at: 
i915_gem_do_execbuffer+0xae2/0x21c0 [i915]
<4> [33.071893]  #1: ffff8881127b9c28 
(reservation_ww_class_mutex){+.+.}-{3:3}, at: 
__intel_context_do_pin_ww+0x7a/0xa30 [i915]
<4> [33.072133]  #2: ffff8881127f0478 (migrate){+.+.}-{3:3}, at: 
intel_context_migrate_copy+0x1b3/0xa80 [i915]
<4> [33.072384]
stack backtrace:
<4> [33.072399] CPU: 7 PID: 948 Comm: i915_module_loa Not tainted 
6.2.0-Patchwork_114451v1-g8589fd9227ca+ #1
<4> [33.072428] Hardware name: Intel Corporation CoffeeLake Client 
Platform/CoffeeLake S UDIMM RVP, BIOS CNLSFWR1.R00.X220.B00.2103302221 
03/30/2021
<4> [33.072465] Call Trace:
<4> [33.072475]  <TASK>
<4> [33.072486]  dump_stack_lvl+0x5b/0x85
<4> [33.072503]  __lock_acquire.cold+0x158/0x33b
<4> [33.072524]  lock_acquire+0xd6/0x310
<4> [33.072541]  ? i915_request_create+0x1c6/0x230 [i915]
<4> [33.072812]  __mutex_lock+0x95/0xf40
<4> [33.072829]  ? i915_request_create+0x1c6/0x230 [i915]
<4> [33.073093]  ? rcu_read_lock_sched_held+0x55/0x80
<4> [33.073112]  ? __mutex_lock+0x133/0xf40
<4> [33.073128]  ? i915_request_create+0x1c6/0x230 [i915]
<4> [33.073388]  ? intel_context_migrate_copy+0x1b3/0xa80 [i915]
<4> [33.073619]  ? i915_request_create+0x1c6/0x230 [i915]
<4> [33.073876]  i915_request_create+0x1c6/0x230 [i915]
<4> [33.074135]  intel_context_migrate_copy+0x1d0/0xa80 [i915]
<4> [33.074360]  __i915_ttm_move+0x7a8/0x940 [i915]
<4> [33.074538]  ? _raw_spin_unlock_irqrestore+0x41/0x70
<4> [33.074552]  ? dma_resv_iter_next+0x91/0xb0
<4> [33.074564]  ? dma_resv_iter_first+0x42/0xb0
<4> [33.074576]  ? i915_deps_add_resv+0x4c/0xc0 [i915]
<4> [33.074744]  i915_ttm_move+0x2ac/0x430 [i915]
<4> [33.074910]  ttm_bo_handle_move_mem+0xb5/0x140 [ttm]
<4> [33.074930]  ttm_bo_validate+0xe9/0x1a0 [ttm]
<4> [33.074947]  __i915_ttm_get_pages+0x4e/0x190 [i915]
<4> [33.075112]  i915_ttm_get_pages+0xf3/0x160 [i915]
<4> [33.075280]  ____i915_gem_object_get_pages+0x36/0xb0 [i915]
<4> [33.075446]  __i915_gem_object_get_pages+0x95/0xa0 [i915]
<4> [33.075608]  i915_vma_get_pages+0xfa/0x160 [i915]
<4> [33.075779]  i915_vma_pin_ww+0xdc/0xb50 [i915]
<4> [33.075953]  eb_validate_vmas+0x1c6/0xac0 [i915]
<4> [33.076114]  i915_gem_do_execbuffer+0xb2a/0x21c0 [i915]
<4> [33.076276]  ? __stack_depot_save+0x3f/0x4e0
<4> [33.076292]  ? 0xffffffff81000000
<4> [33.076301]  ? _raw_spin_unlock_irq+0x41/0x50
<4> [33.076312]  ? lockdep_hardirqs_on+0xc3/0x140
<4> [33.076325]  ? set_track_update+0x25/0x50
<4> [33.076338]  ? __lock_acquire+0x5f2/0x2130
<4> [33.076356]  i915_gem_execbuffer2_ioctl+0x123/0x2e0 [i915]
<4> [33.076519]  ? __pfx_i915_gem_execbuffer2_ioctl+0x10/0x10 [i915]
<4> [33.076679]  drm_ioctl_kernel+0xb4/0x150
<4> [33.076692]  drm_ioctl+0x21d/0x420
<4> [33.076703]  ? __pfx_i915_gem_execbuffer2_ioctl+0x10/0x10 [i915]
<4> [33.076864]  ? __vm_munmap+0xd3/0x170
<4> [33.076877]  __x64_sys_ioctl+0x76/0xb0
<4> [33.076889]  do_syscall_64+0x3c/0x90
<4> [33.076900]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
<4> [33.076913] RIP: 0033:0x7f304aa903ab
<4> [33.076923] Code: 0f 1e fa 48 8b 05 e5 7a 0d 00 64 c7 00 26 00 00 00 
48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 
05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b5 7a 0d 00 f7 d8 64 89 01 48
<4> [33.076957] RSP: 002b:00007fffb1424cf8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000010
<4> [33.076975] RAX: ffffffffffffffda RBX: 00007fffb1424da0 RCX: 
00007f304aa903ab
<4> [33.076990] RDX: 00007fffb1424da0 RSI: 0000000040406469 RDI: 
0000000000000005
<4> [33.077004] RBP: 0000000040406469 R08: 0000000000000005 R09: 
0000000100003000
<4> [33.077019] R10: 0000000000000001 R11: 0000000000000246 R12: 
0000000000010000
<4> [33.077034] R13: 0000000000000005 R14: 00000000ffffffff R15: 
00000000000056a0
<4> [33.077052]  </TASK>

Br,

G.G.

On 2/28/23 4:11 AM, Andi Shyti wrote:
> [snip]
Andi Shyti March 7, 2023, 9:45 a.m. UTC | #2
Hi GG,

On Tue, Mar 07, 2023 at 09:33:12AM +0200, Gwan-gyeong Mun wrote:
> Hi Andi,
> 
> After applying these two patches, deadlock is being detected in the call
> stack below. Please review whether the patch to update the
> intel_context_migrate_copy() part affected the deadlock.
> 
> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v1/bat-dg2-8/igt@i915_module_load@load.html#dmesg-warnings1037

Thanks for looking into this. Yes, there is a basic locking
issue here coming from the migrate code: it takes the timeline
lock and then calls i915_request_create(), which tries to take
the same lock again, so we inevitably fall into a deadlock.

The locking of the timeline is quite exotic: the lock is taken
in request_create() and only released in request_add().

It's still in trybot, but this is supposed to be the next
version:

https://patchwork.freedesktop.org/series/114645/

It adds new request_create_locked() and request_add_locked()
variants that do not take the timeline lock themselves.

There are still some selftests that need to be fixed, though.

Andi

> [snip]