diff mbox

drm/i915: Check rq->timeline before deference

Message ID 20180314101630.8933-1-chris@chris-wilson.co.uk (mailing list archive)
State New, archived
Headers show

Commit Message

Chris Wilson March 14, 2018, 10:16 a.m. UTC
Not only is the context suspect to disappearing, but so is it's
timeline. Under a lockless inspection of the requests for
debugging from intel_engine_dump(), the context may already have been
freed and we have to check before chasing the dangling pointer.

[28033.681755] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_intel crct10dif_pclmul crc32_pclmul snd_hda_codec snd_hwdep snd_hda_core ghash_clmulni_intel snd_pcm mei_me mei i915 r8169 mii prime_numbers i2c_hid
[28033.681796] CPU: 3 PID: 3058 Comm: gem_exec_schedu Tainted: G     U           4.16.0-rc5+ #9
[28033.681804] Hardware name: Acer Aspire E5-575G/Ironman_SK  , BIOS V1.12 08/02/2016
[28033.681834] RIP: 0010:print_request+0x2b/0xb0 [i915]
[28033.681840] RSP: 0018:ffffc90004afbc18 EFLAGS: 00010202
[28033.681847] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8801921b5a40 RCX: 0000000000000006
[28033.681854] RDX: ffffc90004afbc60 RSI: ffff8801921b5a40 RDI: 0000000000000004
[28033.681861] RBP: ffffc90004afbd80 R08: 0000000000000000 R09: 0000000000000001
[28033.681868] R10: ffffc90004afbbd0 R11: ffffc90004afbc73 R12: ffffc90004afbc60
[28033.681875] R13: ffffc90004afbd80 R14: ffff8801d40ec670 R15: ffff8801921b5a40
[28033.681883] FS:  00007fbba5f6c8c0(0000) GS:ffff8801e8400000(0000) knlGS:0000000000000000
[28033.681891] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[28033.681897] CR2: 00007fbba5f8f000 CR3: 00000001b2efa002 CR4: 00000000003606e0
[28033.681904] Call Trace:
[28033.681932]  intel_engine_print_registers+0x6a7/0x930 [i915]
[28033.681962]  intel_engine_dump+0x30d/0x740 [i915]
[28033.681971]  ? seq_printf+0x3a/0x50
[28033.681995]  i915_engine_info+0xb8/0xe0 [i915]
[28033.682003]  ? drm_get_color_range_name+0x20/0x20
[28033.682010]  seq_read+0xe1/0x440
[28033.682018]  full_proxy_read+0x51/0x80
[28033.682025]  __vfs_read+0x21/0x130
[28033.682031]  ? do_sys_open+0x134/0x220
[28033.682037]  ? kmem_cache_free+0x177/0x2b0
[28033.682043]  vfs_read+0xa1/0x150
[28033.682049]  SyS_read+0x40/0xa0
[28033.682055]  do_syscall_64+0x6b/0x1b0
[28033.682063]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
[28033.682069] RIP: 0033:0x7fbba4655d11
[28033.682074] RSP: 002b:00007ffd8c49da58 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[28033.682082] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbba4655d11
[28033.682089] RDX: 000000000000003f RSI: 00005647bfbfc260 RDI: 0000000000000006
[28033.682096] RBP: 000000000000003f R08: 00000000ffffffff R09: 0000000000000000
[28033.682104] R10: 0000000000000000 R11: 0000000000000246 R12: 00005647bfbfc260
[28033.682111] R13: 0000000000000006 R14: 0000000000000000 R15: 00005647bfbfc260
[28033.682119] Code: 41 55 41 54 49 89 d4 55 53 48 89 fd 48 8b 86 c8 00 00 00 48 8b 3d d6 1e 14 e2 48 89 f3 48 2b be a8 02 00 00 48 8b 80 b0 00 00 00 <4c> 8b 68 18 e8 bc 80 02 e1 8b 8b 70 02 00 00 8b b3 28 02 00 00
[28033.682206] RIP: print_request+0x2b/0xb0 [i915] RSP: ffffc90004afbc18

Reported-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/intel_engine_cs.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Mika Kuoppala March 14, 2018, 2:16 p.m. UTC | #1
Chris Wilson <chris@chris-wilson.co.uk> writes:

> Not only is the context suspect to disappearing, but so is it's
> timeline. Under a lockless inspection of the requests for
> debugging from intel_engine_dump(), the context may already have been
> freed and we have to check before chasing the dangling pointer.
>
> [28033.681755] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_intel crct10dif_pclmul crc32_pclmul snd_hda_codec snd_hwdep snd_hda_core ghash_clmulni_intel snd_pcm mei_me mei i915 r8169 mii prime_numbers i2c_hid
> [28033.681796] CPU: 3 PID: 3058 Comm: gem_exec_schedu Tainted: G     U           4.16.0-rc5+ #9
> [28033.681804] Hardware name: Acer Aspire E5-575G/Ironman_SK  , BIOS V1.12 08/02/2016
> [28033.681834] RIP: 0010:print_request+0x2b/0xb0 [i915]
> [28033.681840] RSP: 0018:ffffc90004afbc18 EFLAGS: 00010202
> [28033.681847] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8801921b5a40 RCX: 0000000000000006
> [28033.681854] RDX: ffffc90004afbc60 RSI: ffff8801921b5a40 RDI: 0000000000000004
> [28033.681861] RBP: ffffc90004afbd80 R08: 0000000000000000 R09: 0000000000000001
> [28033.681868] R10: ffffc90004afbbd0 R11: ffffc90004afbc73 R12: ffffc90004afbc60
> [28033.681875] R13: ffffc90004afbd80 R14: ffff8801d40ec670 R15: ffff8801921b5a40
> [28033.681883] FS:  00007fbba5f6c8c0(0000) GS:ffff8801e8400000(0000) knlGS:0000000000000000
> [28033.681891] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [28033.681897] CR2: 00007fbba5f8f000 CR3: 00000001b2efa002 CR4: 00000000003606e0
> [28033.681904] Call Trace:
> [28033.681932]  intel_engine_print_registers+0x6a7/0x930 [i915]
> [28033.681962]  intel_engine_dump+0x30d/0x740 [i915]
> [28033.681971]  ? seq_printf+0x3a/0x50
> [28033.681995]  i915_engine_info+0xb8/0xe0 [i915]
> [28033.682003]  ? drm_get_color_range_name+0x20/0x20
> [28033.682010]  seq_read+0xe1/0x440
> [28033.682018]  full_proxy_read+0x51/0x80
> [28033.682025]  __vfs_read+0x21/0x130
> [28033.682031]  ? do_sys_open+0x134/0x220
> [28033.682037]  ? kmem_cache_free+0x177/0x2b0
> [28033.682043]  vfs_read+0xa1/0x150
> [28033.682049]  SyS_read+0x40/0xa0
> [28033.682055]  do_syscall_64+0x6b/0x1b0
> [28033.682063]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> [28033.682069] RIP: 0033:0x7fbba4655d11
> [28033.682074] RSP: 002b:00007ffd8c49da58 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [28033.682082] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbba4655d11
> [28033.682089] RDX: 000000000000003f RSI: 00005647bfbfc260 RDI: 0000000000000006
> [28033.682096] RBP: 000000000000003f R08: 00000000ffffffff R09: 0000000000000000
> [28033.682104] R10: 0000000000000000 R11: 0000000000000246 R12: 00005647bfbfc260
> [28033.682111] R13: 0000000000000006 R14: 0000000000000000 R15: 00005647bfbfc260
> [28033.682119] Code: 41 55 41 54 49 89 d4 55 53 48 89 fd 48 8b 86 c8 00 00 00 48 8b 3d d6 1e 14 e2 48 89 f3 48 2b be a8 02 00 00 48 8b 80 b0 00 00 00 <4c> 8b 68 18 e8 bc 80 02 e1 8b 8b 70 02 00 00 8b b3 28 02 00 00
> [28033.682206] RIP: print_request+0x2b/0xb0 [i915] RSP: ffffc90004afbc18
>
> Reported-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/intel_engine_cs.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
> index a2b1e9e2c008..f22c5f72df8d 100644
> --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> @@ -1716,13 +1716,15 @@ static void print_request(struct drm_printer *m,
>  			  struct i915_request *rq,
>  			  const char *prefix)
>  {
> +	const char *name = rq->fence.ops->get_timeline_name(&rq->fence);
> +

So we get "signaled"

Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

>  	drm_printf(m, "%s%x%s [%llx:%x] prio=%d @ %dms: %s\n", prefix,
>  		   rq->global_seqno,
>  		   i915_request_completed(rq) ? "!" : "",
>  		   rq->fence.context, rq->fence.seqno,
>  		   rq->priotree.priority,
>  		   jiffies_to_msecs(jiffies - rq->emitted_jiffies),
> -		   rq->timeline->common->name);
> +		   name);
>  }
>  
>  static void hexdump(struct drm_printer *m, const void *buf, size_t len)
> -- 
> 2.16.2
Chris Wilson March 14, 2018, 2:33 p.m. UTC | #2
Quoting Mika Kuoppala (2018-03-14 14:16:08)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > Not only is the context suspect to disappearing, but so is it's
> > timeline. Under a lockless inspection of the requests for
> > debugging from intel_engine_dump(), the context may already have been
> > freed and we have to check before chasing the dangling pointer.
> >
> > [28033.681755] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_intel crct10dif_pclmul crc32_pclmul snd_hda_codec snd_hwdep snd_hda_core ghash_clmulni_intel snd_pcm mei_me mei i915 r8169 mii prime_numbers i2c_hid
> > [28033.681796] CPU: 3 PID: 3058 Comm: gem_exec_schedu Tainted: G     U           4.16.0-rc5+ #9
> > [28033.681804] Hardware name: Acer Aspire E5-575G/Ironman_SK  , BIOS V1.12 08/02/2016
> > [28033.681834] RIP: 0010:print_request+0x2b/0xb0 [i915]
> > [28033.681840] RSP: 0018:ffffc90004afbc18 EFLAGS: 00010202
> > [28033.681847] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8801921b5a40 RCX: 0000000000000006
> > [28033.681854] RDX: ffffc90004afbc60 RSI: ffff8801921b5a40 RDI: 0000000000000004
> > [28033.681861] RBP: ffffc90004afbd80 R08: 0000000000000000 R09: 0000000000000001
> > [28033.681868] R10: ffffc90004afbbd0 R11: ffffc90004afbc73 R12: ffffc90004afbc60
> > [28033.681875] R13: ffffc90004afbd80 R14: ffff8801d40ec670 R15: ffff8801921b5a40
> > [28033.681883] FS:  00007fbba5f6c8c0(0000) GS:ffff8801e8400000(0000) knlGS:0000000000000000
> > [28033.681891] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [28033.681897] CR2: 00007fbba5f8f000 CR3: 00000001b2efa002 CR4: 00000000003606e0
> > [28033.681904] Call Trace:
> > [28033.681932]  intel_engine_print_registers+0x6a7/0x930 [i915]
> > [28033.681962]  intel_engine_dump+0x30d/0x740 [i915]
> > [28033.681971]  ? seq_printf+0x3a/0x50
> > [28033.681995]  i915_engine_info+0xb8/0xe0 [i915]
> > [28033.682003]  ? drm_get_color_range_name+0x20/0x20
> > [28033.682010]  seq_read+0xe1/0x440
> > [28033.682018]  full_proxy_read+0x51/0x80
> > [28033.682025]  __vfs_read+0x21/0x130
> > [28033.682031]  ? do_sys_open+0x134/0x220
> > [28033.682037]  ? kmem_cache_free+0x177/0x2b0
> > [28033.682043]  vfs_read+0xa1/0x150
> > [28033.682049]  SyS_read+0x40/0xa0
> > [28033.682055]  do_syscall_64+0x6b/0x1b0
> > [28033.682063]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > [28033.682069] RIP: 0033:0x7fbba4655d11
> > [28033.682074] RSP: 002b:00007ffd8c49da58 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> > [28033.682082] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbba4655d11
> > [28033.682089] RDX: 000000000000003f RSI: 00005647bfbfc260 RDI: 0000000000000006
> > [28033.682096] RBP: 000000000000003f R08: 00000000ffffffff R09: 0000000000000000
> > [28033.682104] R10: 0000000000000000 R11: 0000000000000246 R12: 00005647bfbfc260
> > [28033.682111] R13: 0000000000000006 R14: 0000000000000000 R15: 00005647bfbfc260
> > [28033.682119] Code: 41 55 41 54 49 89 d4 55 53 48 89 fd 48 8b 86 c8 00 00 00 48 8b 3d d6 1e 14 e2 48 89 f3 48 2b be a8 02 00 00 48 8b 80 b0 00 00 00 <4c> 8b 68 18 e8 bc 80 02 e1 8b 8b 70 02 00 00 8b b3 28 02 00 00
> > [28033.682206] RIP: print_request+0x2b/0xb0 [i915] RSP: ffffc90004afbc18
> >
> > Reported-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > ---
> >  drivers/gpu/drm/i915/intel_engine_cs.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
> > index a2b1e9e2c008..f22c5f72df8d 100644
> > --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> > +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> > @@ -1716,13 +1716,15 @@ static void print_request(struct drm_printer *m,
> >                         struct i915_request *rq,
> >                         const char *prefix)
> >  {
> > +     const char *name = rq->fence.ops->get_timeline_name(&rq->fence);
> > +
> 
> So we get "signaled"

And as we hold the rcu lock here, context can't also vanish between the
test inside get_timeline_name() and our use in the printf.
(Probably could do with another sentence to in get_timeline_name() to
illustrate the rcu_read_lock() requirement in the caller.)

> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

Thanks, hopefully you will be able to run gem_ctx_switch all night now!
-Chris
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index a2b1e9e2c008..f22c5f72df8d 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -1716,13 +1716,15 @@  static void print_request(struct drm_printer *m,
 			  struct i915_request *rq,
 			  const char *prefix)
 {
+	const char *name = rq->fence.ops->get_timeline_name(&rq->fence);
+
 	drm_printf(m, "%s%x%s [%llx:%x] prio=%d @ %dms: %s\n", prefix,
 		   rq->global_seqno,
 		   i915_request_completed(rq) ? "!" : "",
 		   rq->fence.context, rq->fence.seqno,
 		   rq->priotree.priority,
 		   jiffies_to_msecs(jiffies - rq->emitted_jiffies),
-		   rq->timeline->common->name);
+		   name);
 }
 
 static void hexdump(struct drm_printer *m, const void *buf, size_t len)