drm/scheduler: Add missing RCU flag to fence slab

Message ID 20230710205625.130664-1-robdclark@gmail.com (mailing list archive)
State New, archived
Series drm/scheduler: Add missing RCU flag to fence slab

Commit Message

Rob Clark July 10, 2023, 8:56 p.m. UTC
From: Rob Clark <robdclark@chromium.org>

Fixes the KASAN splat:

   ==================================================================
   BUG: KASAN: use-after-free in msm_ioctl_wait_fence+0x31c/0x7b0
   Read of size 4 at addr ffffff808cb7c2f8 by task syz-executor/12236
   CPU: 6 PID: 12236 Comm: syz-executor Tainted: G        W         5.15.119-lockdep-19932-g4a017c53fe63 #1 b15455e5b94c63032dd99eb0190c27e582b357ed
   Hardware name: Google Homestar (rev3) (DT)
   Call trace:
    dump_backtrace+0x0/0x4e8
    show_stack+0x34/0x50
    dump_stack_lvl+0xdc/0x11c
    print_address_description+0x30/0x2d8
    kasan_report+0x178/0x1e4
    kasan_check_range+0x1b0/0x1b8
    __kasan_check_read+0x44/0x54
    msm_ioctl_wait_fence+0x31c/0x7b0
    drm_ioctl_kernel+0x214/0x418
    drm_ioctl+0x524/0xbe8
    __arm64_sys_ioctl+0x154/0x1d0
    invoke_syscall+0x98/0x278
    el0_svc_common+0x214/0x274
    do_el0_svc+0x9c/0x19c
    el0_svc+0x5c/0xc0
    el0t_64_sync_handler+0x78/0x108
    el0t_64_sync+0x1a4/0x1a8
   Allocated by task 12224:
    kasan_save_stack+0x38/0x68
    __kasan_slab_alloc+0x6c/0x88
    kmem_cache_alloc+0x1b8/0x428
    drm_sched_fence_alloc+0x30/0x94
    drm_sched_job_init+0x7c/0x178
    msm_ioctl_gem_submit+0x2b8/0x5ac4
    drm_ioctl_kernel+0x214/0x418
    drm_ioctl+0x524/0xbe8
    __arm64_sys_ioctl+0x154/0x1d0
    invoke_syscall+0x98/0x278
    el0_svc_common+0x214/0x274
    do_el0_svc+0x9c/0x19c
    el0_svc+0x5c/0xc0
    el0t_64_sync_handler+0x78/0x108
    el0t_64_sync+0x1a4/0x1a8
   Freed by task 32:
    kasan_save_stack+0x38/0x68
    kasan_set_track+0x28/0x3c
    kasan_set_free_info+0x28/0x4c
    ____kasan_slab_free+0x110/0x164
    __kasan_slab_free+0x18/0x28
    kmem_cache_free+0x1e0/0x904
    drm_sched_fence_free_rcu+0x80/0x9c
    rcu_do_batch+0x318/0xcf0
    rcu_nocb_cb_kthread+0x1a0/0xc4c
    kthread+0x2e4/0x3a0
    ret_from_fork+0x10/0x20
   Last potentially related work creation:
    kasan_save_stack+0x38/0x68
    kasan_record_aux_stack+0xd4/0x114
    __call_rcu_common+0xd4/0x1478
    call_rcu+0x1c/0x28
    drm_sched_fence_release_scheduled+0x108/0x158
    dma_fence_release+0x178/0x564
    drm_sched_fence_release_finished+0xb4/0x124
    dma_fence_release+0x178/0x564
    __msm_gem_submit_destroy+0x150/0x488
    msm_job_free+0x9c/0xdc
    drm_sched_main+0xec/0x9ac
    kthread+0x2e4/0x3a0
    ret_from_fork+0x10/0x20
   Second to last potentially related work creation:
    kasan_save_stack+0x38/0x68
    kasan_record_aux_stack+0xd4/0x114
    __call_rcu_common+0xd4/0x1478
    call_rcu+0x1c/0x28
    drm_sched_fence_release_scheduled+0x108/0x158
    dma_fence_release+0x178/0x564
    drm_sched_fence_release_finished+0xb4/0x124
    dma_fence_release+0x178/0x564
    drm_sched_entity_fini+0x170/0x238
    drm_sched_entity_destroy+0x34/0x44
    __msm_file_private_destroy+0x60/0x118
    msm_submitqueue_destroy+0xd0/0x110
    __msm_gem_submit_destroy+0x384/0x488
    retire_submits+0x6a8/0xa14
    recover_worker+0x764/0xa50
    kthread_worker_fn+0x3f4/0x9ec
    kthread+0x2e4/0x3a0
    ret_from_fork+0x10/0x20
   The buggy address belongs to the object at ffffff808cb7c280
   The buggy address is located 120 bytes inside of
   The buggy address belongs to the page:
   page:000000008b01d27d refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10cb7c
   head:000000008b01d27d order:1 compound_mapcount:0
   flags: 0x8000000000010200(slab|head|zone=2)
   raw: 8000000000010200 fffffffe06844d80 0000000300000003 ffffff80860dca00
   raw: 0000000000000000 0000000000190019 00000001ffffffff 0000000000000000
   page dumped because: kasan: bad access detected
   Memory state around the buggy address:
    ffffff808cb7c180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ffffff808cb7c200: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc
   >ffffff808cb7c280: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                                   ^
    ffffff808cb7c300: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc
    ffffff808cb7c380: fc fc fc fc fc fc fc fc 00 00 00 00 00 00 00 00
   ==================================================================

Suggested-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/gpu/drm/scheduler/sched_fence.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Luben Tuikov July 10, 2023, 9:15 p.m. UTC | #1
On 2023-07-10 16:56, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
> 
> Fixes the KASAN splat:
> 
> [SNIP]
> 
> Suggested-by: Alexander Potapenko <glider@google.com>
> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>  drivers/gpu/drm/scheduler/sched_fence.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
> index ef120475e7c6..b624711c6e03 100644
> --- a/drivers/gpu/drm/scheduler/sched_fence.c
> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
> @@ -35,7 +35,7 @@ static int __init drm_sched_fence_slab_init(void)
>  {
>  	sched_fence_slab = kmem_cache_create(
>  		"drm_sched_fence", sizeof(struct drm_sched_fence), 0,
> -		SLAB_HWCACHE_ALIGN, NULL);
> +		SLAB_HWCACHE_ALIGN | SLAB_TYPESAFE_BY_RCU, NULL);
>  	if (!sched_fence_slab)
>  		return -ENOMEM;
>  

Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>

But let it simmer for 24 hours so Christian can see it too (CC-ed).
Christian König July 11, 2023, 7:46 a.m. UTC | #2
Am 10.07.23 um 23:15 schrieb Luben Tuikov:
> On 2023-07-10 16:56, Rob Clark wrote:
>> From: Rob Clark <robdclark@chromium.org>
>>
>> Fixes the KASAN splat:
>>
>> [SNIP]
>>
>> Suggested-by: Alexander Potapenko <glider@google.com>
>> Signed-off-by: Rob Clark <robdclark@chromium.org>
>> ---
>>   drivers/gpu/drm/scheduler/sched_fence.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
>> index ef120475e7c6..b624711c6e03 100644
>> --- a/drivers/gpu/drm/scheduler/sched_fence.c
>> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
>> @@ -35,7 +35,7 @@ static int __init drm_sched_fence_slab_init(void)
>>   {
>>   	sched_fence_slab = kmem_cache_create(
>>   		"drm_sched_fence", sizeof(struct drm_sched_fence), 0,
>> -		SLAB_HWCACHE_ALIGN, NULL);
>> +		SLAB_HWCACHE_ALIGN | SLAB_TYPESAFE_BY_RCU, NULL);
>>   	if (!sched_fence_slab)
>>   		return -ENOMEM;
>>   
> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
>
> But let it simmer for 24 hours so Christian can see it too (CC-ed).

Well that won't work like this; using the SLAB_TYPESAFE_BY_RCU flag 
was suggested before and is not allowed for dma_fence objects.

The flag SLAB_TYPESAFE_BY_RCU can only be used if the objects in the 
slab can't be reused within an RCU time period or if that reuse doesn't 
matter for the logic. And exactly that is not guaranteed for dma_fence 
objects.

It should also not be necessary because the scheduler fences are 
released using call_rcu:

static void drm_sched_fence_release_scheduled(struct dma_fence *f)
{
         struct drm_sched_fence *fence = to_drm_sched_fence(f);

         dma_fence_put(fence->parent);
         call_rcu(&fence->finished.rcu, drm_sched_fence_free_rcu);
}

That looks more like you have a reference count problem in MSM.

Regards,
Christian.
Rob Clark July 11, 2023, 2:49 p.m. UTC | #3
On Tue, Jul 11, 2023 at 12:46 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 10.07.23 um 23:15 schrieb Luben Tuikov:
> > On 2023-07-10 16:56, Rob Clark wrote:
> >> From: Rob Clark <robdclark@chromium.org>
> >>
> >> Fixes the KASAN splat:
> >>
> >> [SNIP]
> >>
> >> Suggested-by: Alexander Potapenko <glider@google.com>
> >> Signed-off-by: Rob Clark <robdclark@chromium.org>
> >> ---
> >>   drivers/gpu/drm/scheduler/sched_fence.c | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
> >> index ef120475e7c6..b624711c6e03 100644
> >> --- a/drivers/gpu/drm/scheduler/sched_fence.c
> >> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
> >> @@ -35,7 +35,7 @@ static int __init drm_sched_fence_slab_init(void)
> >>   {
> >>      sched_fence_slab = kmem_cache_create(
> >>              "drm_sched_fence", sizeof(struct drm_sched_fence), 0,
> >> -            SLAB_HWCACHE_ALIGN, NULL);
> >> +            SLAB_HWCACHE_ALIGN | SLAB_TYPESAFE_BY_RCU, NULL);
> >>      if (!sched_fence_slab)
> >>              return -ENOMEM;
> >>
> > Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
> >
> > But let it simmer for 24 hours so Christian can see it too (CC-ed).
>
> Well that won't work like this; using the SLAB_TYPESAFE_BY_RCU flag
> was suggested before and is not allowed for dma_fence objects.
>
> The flag SLAB_TYPESAFE_BY_RCU can only be used if the objects in the
> slab can't be reused within an RCU time period or if that reuse doesn't
> matter for the logic. And exactly that is not guaranteed for dma_fence
> objects.

I think that is only true because of the drm_sched_fence_free() path?
But that could also use call_rcu().  (It looks like it is only an
error path.)

> It should also not be necessary because the scheduler fences are
> released using call_rcu:
>
> static void drm_sched_fence_release_scheduled(struct dma_fence *f)
> {
>          struct drm_sched_fence *fence = to_drm_sched_fence(f);
>
>          dma_fence_put(fence->parent);
>          call_rcu(&fence->finished.rcu, drm_sched_fence_free_rcu);
> }
>
> That looks more like you have a reference count problem in MSM.

Possibly I need to use rcu_read_lock()?  But I think the idr_lock
which protected dma_fence_get_rcu() (and is held until the fence is
removed from fence_idr, before its reference is dropped in
__msm_gem_submit_destroy()) makes that unnecessary.

BR,
-R

> Regards,
> Christian.
Christian König July 12, 2023, 6:45 a.m. UTC | #4
Am 11.07.23 um 16:49 schrieb Rob Clark:
> On Tue, Jul 11, 2023 at 12:46 AM Christian König
> <christian.koenig@amd.com> wrote:
>> [SNIP]
>>>> ---
>>>>    drivers/gpu/drm/scheduler/sched_fence.c | 2 +-
>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
>>>> index ef120475e7c6..b624711c6e03 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_fence.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
>>>> @@ -35,7 +35,7 @@ static int __init drm_sched_fence_slab_init(void)
>>>>    {
>>>>       sched_fence_slab = kmem_cache_create(
>>>>               "drm_sched_fence", sizeof(struct drm_sched_fence), 0,
>>>> -            SLAB_HWCACHE_ALIGN, NULL);
>>>> +            SLAB_HWCACHE_ALIGN | SLAB_TYPESAFE_BY_RCU, NULL);
>>>>       if (!sched_fence_slab)
>>>>               return -ENOMEM;
>>>>
>>> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
>>>
>>> But let it simmer for 24 hours so Christian can see it too (CC-ed).
>> Well that won't work like this; using the SLAB_TYPESAFE_BY_RCU flag
>> was suggested before and is not allowed for dma_fence objects.
>>
>> The flag SLAB_TYPESAFE_BY_RCU can only be used if the objects in the
>> slab can't be reused within an RCU time period or if that reuse doesn't
>> matter for the logic. And exactly that is not guaranteed for dma_fence
>> objects.
> I think that is only true because of the drm_sched_fence_free() path?
> But that could also use call_rcu().  (It looks like it is only an
> error path.)

No, that's completely unrelated to that.

The SLAB_TYPESAFE_BY_RCU flag works only if you don't use 
kref_get_unless_zero() on the object.

The problem is basically that objects allocated with that flag can be 
re-used under RCU, but only for the same type of object.

This is ok as long as you only need information from the object to 
decide something and can still double check if you got the right object 
through different means.

But when the object can be re-used while in the critical section, you can 
end up, for example, grabbing a reference to something completely 
unrelated to your code path.

The Intel guys had some very bad surprises with that, both with dma_fence 
and with other objects.
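
To make the failure mode concrete, here is a rough reader-side sketch 
(made-up code, not taken from MSM or the scheduler; the function and the 
fence_ptr parameter are invented for illustration) of the kind of lookup 
that becomes unsafe once the slab is type-stable:

#include <linux/dma-fence.h>
#include <linux/rcupdate.h>

/*
 * Illustrative sketch only.  Under SLAB_TYPESAFE_BY_RCU the memory stays
 * type-stable for the RCU read section, but the object itself may have
 * been freed and recycled into a *different* fence in the meantime, so
 * the reference taken below can end up belonging to an unrelated
 * submission.
 */
static struct dma_fence *get_fence_unsafe(struct dma_fence __rcu **fence_ptr)
{
	struct dma_fence *fence;

	rcu_read_lock();
	fence = rcu_dereference(*fence_ptr);
	if (fence && !dma_fence_get_rcu(fence))	/* kref_get_unless_zero() */
		fence = NULL;			/* refcount already hit zero */
	rcu_read_unlock();

	/*
	 * Missing here: re-reading *fence_ptr after taking the reference
	 * and retrying if it changed, which is what dma_fence_get_rcu_safe()
	 * does to cope with object reuse.
	 */
	return fence;
}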

>> It should also not be necessary because the scheduler fences are
>> released using call_rcu:
>>
>> static void drm_sched_fence_release_scheduled(struct dma_fence *f)
>> {
>>           struct drm_sched_fence *fence = to_drm_sched_fence(f);
>>
>>           dma_fence_put(fence->parent);
>>           call_rcu(&fence->finished.rcu, drm_sched_fence_free_rcu);
>> }
>>
>> That looks more like you have a reference count problem in MSM.
> Possibly I need to use rcu_read_lock()?  But I think the idr_lock
> which protected dma_fence_get_rcu() (and is held until the fence is
> removed from fence_idr, before its reference is dropped in
> __msm_gem_submit_destroy()) makes that unnecessary.

Well you can either use an RCU-protected pointer with dma_fence_get_rcu() 
inside a rcu_read_lock()/unlock() critical section.

Or you have some protection in the form of a lock, but then you should 
use dma_fence_get() instead.

Mixing those two doesn't make much sense.
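
A minimal sketch of those two consistent schemes (the ctx_rcu/ctx_locked 
containers and their fields are invented for illustration; only the 
dma_fence and RCU helpers are real kernel APIs):

#include <linux/dma-fence.h>
#include <linux/rcupdate.h>
#include <linux/spinlock.h>

/* Scheme A: the fence pointer is RCU-managed and readers take no lock;
 * this relies on the fence being freed via call_rcu() so the memory is
 * still valid inside the read-side critical section.
 */
struct ctx_rcu {
	struct dma_fence __rcu *fence;
};

static struct dma_fence *ctx_rcu_get_fence(struct ctx_rcu *ctx)
{
	struct dma_fence *fence;

	rcu_read_lock();
	fence = rcu_dereference(ctx->fence);
	if (fence && !dma_fence_get_rcu(fence))
		fence = NULL;		/* it was already on its way out */
	rcu_read_unlock();

	return fence;
}

/* Scheme B: the fence pointer is only ever read or written under a lock,
 * so a plain dma_fence_get() is enough and no RCU accessors are needed.
 */
struct ctx_locked {
	spinlock_t lock;
	struct dma_fence *fence;
};

static struct dma_fence *ctx_locked_get_fence(struct ctx_locked *ctx)
{
	struct dma_fence *fence;

	spin_lock(&ctx->lock);
	fence = dma_fence_get(ctx->fence);	/* handles NULL */
	spin_unlock(&ctx->lock);

	return fence;
}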

Regards,
Christian.

>
> BR,
> -R
>
>> Regards,
>> Christian.

Patch

diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
index ef120475e7c6..b624711c6e03 100644
--- a/drivers/gpu/drm/scheduler/sched_fence.c
+++ b/drivers/gpu/drm/scheduler/sched_fence.c
@@ -35,7 +35,7 @@  static int __init drm_sched_fence_slab_init(void)
 {
 	sched_fence_slab = kmem_cache_create(
 		"drm_sched_fence", sizeof(struct drm_sched_fence), 0,
-		SLAB_HWCACHE_ALIGN, NULL);
+		SLAB_HWCACHE_ALIGN | SLAB_TYPESAFE_BY_RCU, NULL);
 	if (!sched_fence_slab)
 		return -ENOMEM;