drm/i915: Fix the use-after-free between i915_active_ref and __active_retire

Message ID 20191206115635.46931-1-chuansheng.liu@intel.com (mailing list archive)
State New, archived
Series: drm/i915: Fix the use-after-free between i915_active_ref and __active_retire

Commit Message

Chuansheng Liu Dec. 6, 2019, 11:56 a.m. UTC
We can easily hit a drm/i915 panic on TGL when running glmark2, and
finally caught the use-after-free race condition by enabling KASAN.

The call stack is below:
===
[  534.472675] BUG: KASAN: use-after-free in __i915_active_fence_set+0x26d/0x3d0 [i915]
[  534.472679] Write of size 8 at addr ffff8883f0372388 by task glmark2/3199

[  534.472684] CPU: 3 PID: 3199 Comm: glmark2 Tainted: G     U      E     5.4.0-rc8 #8
[  534.472687] Call Trace:
[  534.472693]  dump_stack+0x95/0xd5
[  534.472722]  ? __i915_active_fence_set+0x26d/0x3d0 [i915]
[  534.472727]  print_address_description.constprop.5+0x20/0x320
[  534.472751]  ? __i915_active_fence_set+0x26d/0x3d0 [i915]
[  534.472792]  ? __i915_active_fence_set+0x26d/0x3d0 [i915]
[  534.472794]  __kasan_report+0x149/0x18c
[  534.472798]  ? _raw_spin_lock+0x1/0xd0
[  534.472820]  ? __i915_active_fence_set+0x26d/0x3d0 [i915]
[  534.472822]  kasan_report+0x12/0x20
[  534.472825]  __asan_report_store8_noabort+0x17/0x20
[  534.472847]  __i915_active_fence_set+0x26d/0x3d0 [i915]
[  534.472870]  i915_active_ref+0x2c8/0x530 [i915]
[  534.472874]  ? dma_resv_add_shared_fence+0x291/0x460
[  534.472902]  __i915_vma_move_to_active+0x56/0x70 [i915]
[  534.472927]  i915_vma_move_to_active+0x54/0x420 [i915]
[  534.472931]  ? mutex_unlock+0x22/0x40
[  534.472957]  i915_gem_do_execbuffer+0x1d45/0x3e20 [i915]
[  534.472959]  ? __kmalloc_node+0x12c/0x350
[  534.472983]  ? eb_relocate_slow+0xb40/0xb40 [i915]
[  534.472985]  ? _raw_write_trylock+0x110/0x110
[  534.472987]  ? get_partial_node.isra.72+0x51/0x260
[  534.472991]  ? unix_stream_read_generic+0x583/0x1a80
[  534.472994]  ? ___slab_alloc+0x1d8/0x550
[  534.472998]  ? kvmalloc_node+0x31/0x80
[  534.473000]  ? kasan_unpoison_shadow+0x35/0x50
[  534.473002]  ? _raw_spin_lock+0x7b/0xd0
[  534.473004]  ? radix_tree_lookup+0xd/0x10
[  534.473006]  ? idr_find+0x3b/0x60
[  534.473029]  i915_gem_execbuffer2_ioctl+0x634/0x8a0 [i915]
[  534.473052]  ? i915_gem_execbuffer_ioctl+0xd50/0xd50 [i915]
[  534.473054]  ? unix_stream_recvmsg+0x97/0xd0
[  534.473056]  ? unix_stream_splice_read+0x1c0/0x1c0
[  534.473058]  ? __unix_insert_socket+0x180/0x180
[  534.473081]  ? i915_gem_execbuffer_ioctl+0xd50/0xd50 [i915]
[  534.473094]  drm_ioctl_kernel+0x1ed/0x2b0 [drm]
[  534.473103]  ? drm_setversion+0x8c0/0x8c0 [drm]
[  534.473106]  ? __kasan_check_write+0x14/0x20
[  534.473115]  drm_ioctl+0x68b/0xaa0 [drm]
...

[  534.473239] Allocated by task 3199:
[  534.473241]  save_stack+0x21/0x90
[  534.473243]  __kasan_kmalloc.constprop.8+0xa7/0xd0
[  534.473245]  kasan_slab_alloc+0x11/0x20
[  534.473246]  kmem_cache_alloc+0xce/0x240
[  534.473273]  i915_active_ref+0xc2/0x530 [i915]
[  534.473302]  __i915_vma_move_to_active+0x56/0x70 [i915]
[  534.473328]  i915_vma_move_to_active+0x54/0x420 [i915]
[  534.473355]  i915_gem_do_execbuffer+0x1d45/0x3e20 [i915]
[  534.473381]  i915_gem_execbuffer2_ioctl+0x634/0x8a0 [i915]
[  534.473392]  drm_ioctl_kernel+0x1ed/0x2b0 [drm]
[  534.473402]  drm_ioctl+0x68b/0xaa0 [drm]
[  534.473404]  do_vfs_ioctl+0x19a/0xf10
[  534.473405]  ksys_ioctl+0x75/0x80
[  534.473407]  __x64_sys_ioctl+0x73/0xb0
[  534.473408]  do_syscall_64+0x9f/0x3a0
[  534.473410]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[  534.473412] Freed by task 0:
[  534.473414]  save_stack+0x21/0x90
[  534.473415]  __kasan_slab_free+0x137/0x190
[  534.473417]  kasan_slab_free+0xe/0x10
[  534.473418]  kmem_cache_free+0xeb/0x2c0
[  534.473444]  __active_retire+0x1f2/0x240 [i915]
[  534.473471]  active_retire+0x13b/0x1b0 [i915]
[  534.473496]  node_retire+0x54/0x80 [i915]
[  534.473523]  intel_engine_breadcrumbs_irq+0x5f0/0xd10 [i915]
[  534.473549]  cs_irq_handler+0x66/0xb0 [i915]
[  534.473575]  gen11_gt_irq_handler+0x26c/0x400 [i915]
[  534.473600]  gen11_irq_handler+0xc3/0x250 [i915]
[  534.473603]  __handle_irq_event_percpu+0xe0/0x4c0
[  534.473605]  handle_irq_event_percpu+0x71/0x140
[  534.473606]  handle_irq_event+0xad/0x140
[  534.473608]  handle_edge_irq+0x1f6/0x780
[  534.473610]  do_IRQ+0x9f/0x1f0

[  534.473612] The buggy address belongs to the object at ffff8883f0372380
                which belongs to the cache active_node of size 72
[  534.473615] The buggy address is located 8 bytes inside of

===

The race scenario is as follows. Initially ref->count is 1, and the
interrupt handler is about to retire and free the node:

===
CPUA in interrupt context                CPUB in i915_gem_execbuffer2_ioctl
__active_retire -->
  spin_lock(&ref->tree_lock)
  decrease ref->count to 0
                                         i915_active_ref -->
                                           increase ref->count to 1
                                           (i915_active_acquire)

                                           get the dirty ref->cache
                                              (READ_ONCE(ref->cache))

                                           return the dirty node

  set ref->cache to NULL
  spin_unlock(&ref->tree_lock)
  free the node

                                           hit use-after-free in
                                              __i915_active_fence_set()

===
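
In code terms, the two paths look roughly like this (condensed from
drivers/gpu/drm/i915/i915_active.c as of v5.4; bodies trimmed to the
relevant lines, so treat this as an orientation sketch rather than
compilable code):

static void __active_retire(struct i915_active *ref)
{
	struct rb_root root;
	unsigned long flags;

	/* Drop the last reference and take ref->tree_lock atomically. */
	if (!atomic_dec_and_lock_irqsave(&ref->count, &ref->tree_lock, flags))
		return;

	/* Steal the tree and clear the cache under the lock... */
	root = ref->tree;
	ref->tree = RB_ROOT;
	ref->cache = NULL;
	spin_unlock_irqrestore(&ref->tree_lock, flags);

	/* ...then free every node, including the one ref->cache pointed at. */
	/* (postorder rbtree walk calling kmem_cache_free() omitted) */
}

static struct i915_active_fence *
active_instance(struct i915_active *ref, struct intel_timeline *tl)
{
	struct active_node *node;

	node = READ_ONCE(ref->cache);	/* unlocked: may see a dying node */
	if (node && node->timeline == tl->fence_context)
		return &node->base;	/* the use-after-free window */

	/* (locked rbtree lookup and node allocation omitted) */
}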

Here we need to take the spinlock ref->tree_lock around the access to
READ_ONCE(ref->cache), serialising it against __active_retire() and
resolving the race.

With this patch applied, our stress test passed for a very long time.

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
---
 drivers/gpu/drm/i915/i915_active.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Chris Wilson Dec. 6, 2019, 12:04 p.m. UTC | #1
Quoting Chuansheng Liu (2019-12-06 11:56:35)
> We can easily hit a drm/i915 panic on TGL when running glmark2, and
> finally caught the use-after-free race condition by enabling KASAN.
> 
> [KASAN report and race description snipped]
> 
> diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
> index dca15ace88f6..3d68b910e949 100644
> --- a/drivers/gpu/drm/i915/i915_active.c
> +++ b/drivers/gpu/drm/i915/i915_active.c
> @@ -214,7 +214,10 @@ active_instance(struct i915_active *ref, struct intel_timeline *tl)
>          * after the previous activity has been retired, or if it matches the
>          * current timeline.
>          */
> +       spin_lock_irq(&ref->tree_lock);
>         node = READ_ONCE(ref->cache);
> +       spin_unlock_irq(&ref->tree_lock);

Incorrect. The serialisation with __active_retire is required at
i915_active_acquire. The problem is that serialisation was provided by
ODEBUG for our CI so it went under the radar.
-Chris
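
(For context: serialising at i915_active_acquire() means taking
ref->tree_lock around the idle-to-active 0 -> 1 transition of
ref->count, so that it cannot interleave with __active_retire() above.
A minimal sketch of that direction, assuming the v5.4-era fields
ref->count, ref->mutex and ref->tree_lock plus the ODEBUG hook
debug_active_activate(); it illustrates the suggestion rather than the
exact fix that eventually landed:)

int i915_active_acquire(struct i915_active *ref)
{
	int err;

	/* Fast path: already active, just take another reference. */
	if (atomic_add_unless(&ref->count, 1, 0))
		return 0;

	err = mutex_lock_interruptible(&ref->mutex);
	if (err)
		return err;

	/* Recheck under the mutex; someone may have activated us. */
	if (!atomic_add_unless(&ref->count, 1, 0)) {
		if (ref->active)
			err = ref->active(ref);
		if (!err) {
			/*
			 * Serialise the 0 -> 1 transition against
			 * __active_retire(), which holds ref->tree_lock
			 * while it clears ref->cache and frees the nodes.
			 */
			spin_lock_irq(&ref->tree_lock);
			debug_active_activate(ref);
			atomic_inc(&ref->count);
			spin_unlock_irq(&ref->tree_lock);
		}
	}

	mutex_unlock(&ref->mutex);

	return err;
}
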
Chuansheng Liu Dec. 6, 2019, 12:10 p.m. UTC | #2
Chris,

Thanks for reviewing, please see below comments.

> -----Original Message-----
> From: Chris Wilson <chris@chris-wilson.co.uk>
> Sent: Friday, December 6, 2019 8:04 PM
> Subject: Re: [Intel-gfx] [PATCH] drm/i915: Fix the use-after-free between
> i915_active_ref and __active_retire
> 
> [quoted patch snipped]
> 
> > +       spin_lock_irq(&ref->tree_lock);
> >         node = READ_ONCE(ref->cache);
> > +       spin_unlock_irq(&ref->tree_lock);
> 
> Incorrect. The serialisation with __active_retire is required at
> i915_active_acquire.
Do you suggest the change should be made in i915_active_acquire(), so
that the ref->count manipulation is kept closely together with
tree_lock and the ODEBUG handling?

If so, I can prepare a new patch.
Chris Wilson Dec. 6, 2019, 12:15 p.m. UTC | #3
Quoting Liu, Chuansheng (2019-12-06 12:10:25)
> Chris,
> 
> Thanks for reviewing, please see below comments.
> 
> [quoted thread snipped]
> 
> Do you suggest the change should be made in i915_active_acquire(), so
> that the ref->count manipulation is kept closely together with
> tree_lock and the ODEBUG handling?
> 
> If so, I can prepare a new patch.
Chuansheng Liu Dec. 7, 2019, 1:50 a.m. UTC | #4
> -----Original Message-----
> From: Chris Wilson <chris@chris-wilson.co.uk>
> Sent: Friday, December 6, 2019 8:15 PM
> Subject: RE: [Intel-gfx] [PATCH] drm/i915: Fix the use-after-free between
> i915_active_ref and __active_retire
> 
> [quoted thread snipped]

Patch

diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
index dca15ace88f6..3d68b910e949 100644
--- a/drivers/gpu/drm/i915/i915_active.c
+++ b/drivers/gpu/drm/i915/i915_active.c
@@ -214,7 +214,10 @@ active_instance(struct i915_active *ref, struct intel_timeline *tl)
 	 * after the previous activity has been retired, or if it matches the
 	 * current timeline.
 	 */
+	spin_lock_irq(&ref->tree_lock);
 	node = READ_ONCE(ref->cache);
+	spin_unlock_irq(&ref->tree_lock);
+
 	if (node && node->timeline == idx)
 		return &node->base;