Message ID | 50D1C7E4.1060701@canonical.com (mailing list archive) |
---|---|
State | New, archived |
On 2012.12.19 at 14:57 +0100, Maarten Lankhorst wrote:
> Op 18-12-12 17:12, Markus Trippelsdorf schreef:
> > With your proposed debugging BUG_ONs added I still get:
> >
> > Dec 18 17:01:15 x4 kernel: ------------[ cut here ]------------
> > Dec 18 17:01:15 x4 kernel: WARNING: at include/linux/kref.h:42 radeon_fence_ref+0x2c/0x40()
> > Dec 18 17:01:15 x4 kernel: Hardware name: System Product Name
> > Dec 18 17:01:15 x4 kernel: Pid: 157, comm: X Not tainted 3.7.0-rc7-00520-g85b144f-dirty #174
> > Dec 18 17:01:15 x4 kernel: Call Trace:
> > Dec 18 17:01:15 x4 kernel: [<ffffffff81058c84>] ? warn_slowpath_common+0x74/0xb0
> > Dec 18 17:01:15 x4 kernel: [<ffffffff8129273c>] ? radeon_fence_ref+0x2c/0x40
> > Dec 18 17:01:15 x4 kernel: [<ffffffff8125e95c>] ? ttm_bo_cleanup_refs_and_unlock+0x18c/0x2d0
> > Dec 18 17:01:15 x4 kernel: [<ffffffff8125f17c>] ? ttm_mem_evict_first+0x1dc/0x2a0
> > Dec 18 17:01:15 x4 kernel: [<ffffffff81264452>] ? ttm_bo_man_get_node+0x62/0xb0
> > Dec 18 17:01:15 x4 kernel: [<ffffffff8125f4ce>] ? ttm_bo_mem_space+0x28e/0x340
> > Dec 18 17:01:15 x4 kernel: [<ffffffff8125fb0c>] ? ttm_bo_move_buffer+0xfc/0x170
> > Dec 18 17:01:15 x4 kernel: [<ffffffff810de172>] ? kmem_cache_alloc+0xb2/0xc0
> > Dec 18 17:01:15 x4 kernel: [<ffffffff8125fc15>] ? ttm_bo_validate+0x95/0x110
> > Dec 18 17:01:15 x4 kernel: [<ffffffff8125ff7c>] ? ttm_bo_init+0x2ec/0x3b0
> > Dec 18 17:01:15 x4 kernel: [<ffffffff8129419a>] ? radeon_bo_create+0x18a/0x200
> > Dec 18 17:01:15 x4 kernel: [<ffffffff81293e80>] ? radeon_bo_clear_va+0x40/0x40
> > Dec 18 17:01:15 x4 kernel: [<ffffffff812a5342>] ? radeon_gem_object_create+0x92/0x160
> > Dec 18 17:01:15 x4 kernel: [<ffffffff812a575c>] ? radeon_gem_create_ioctl+0x6c/0x150
> > Dec 18 17:01:15 x4 kernel: [<ffffffff812a529f>] ? radeon_gem_object_free+0x2f/0x40
> > Dec 18 17:01:15 x4 kernel: [<ffffffff81246b60>] ? drm_ioctl+0x420/0x4f0
> > Dec 18 17:01:15 x4 kernel: [<ffffffff812a56f0>] ? radeon_gem_pwrite_ioctl+0x20/0x20
> > Dec 18 17:01:15 x4 kernel: [<ffffffff810f53a4>] ? do_vfs_ioctl+0x2e4/0x4e0
> > Dec 18 17:01:15 x4 kernel: [<ffffffff810e5588>] ? vfs_read+0x118/0x160
> > Dec 18 17:01:15 x4 kernel: [<ffffffff810f55ec>] ? sys_ioctl+0x4c/0xa0
> > Dec 18 17:01:15 x4 kernel: [<ffffffff810e5851>] ? sys_read+0x51/0xa0
> > Dec 18 17:01:15 x4 kernel: [<ffffffff814b0612>] ? system_call_fastpath+0x16/0x1b
>
> So the fence's kref count is zero here. This should be impossible and
> indicates a bug in the refcounting somewhere, or possibly memory
> corruption.
>
> Let's first look at where things could go wrong...
>
> The sync_obj member requires fence_lock to be taken, but radeon code in
> general doesn't do that, hm...
>
> I think radeon_cs_sync_rings needs to take fence_lock during the
> iteration and take a refcount on the fence, and radeon_crtc_page_flip
> and radeon_move_blit are missing the refcount under fence_lock as well.
>
> But that would probably still not explain why it crashes in
> radeon_vm_bo_invalidate shortly after, so it seems just as likely that
> it's operating on freed memory there or something.
>
> But none of the code touches refcounting for that bo, and I really
> don't see how I messed up anything there.
>
> I seem to be able to reproduce it if I add a hack, though. Can you test
> whether you get the exact same issues if you apply this patch?

Your patch doesn't apply, unfortunately:

markus@x4 linux % patch -p1 --dry-run < ~/maarten.patch
checking file drivers/gpu/drm/ttm/ttm_bo.c
Hunk #1 succeeded at 512 with fuzz 1.
Hunk #6 FAILED at 814.
1 out of 6 hunks FAILED
markus@x4 linux % git describe
v3.7-10833-g752451f
markus@x4 linux %
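A minimal sketch of the locking rule Maarten describes, assuming the v3.7 TTM/radeon APIs; take_bo_fence is a hypothetical helper for illustration, not existing radeon code:

/*
 * Hypothetical helper, sketch only: bo->sync_obj may only be read while
 * bdev->fence_lock is held, and a reference must be taken before the
 * lock is dropped so the fence cannot be freed out from under the caller.
 */
static struct radeon_fence *take_bo_fence(struct ttm_buffer_object *bo)
{
	struct radeon_fence *fence = NULL;

	spin_lock(&bo->bdev->fence_lock);
	if (bo->sync_obj)
		fence = radeon_fence_ref(bo->sync_obj);
	spin_unlock(&bo->bdev->fence_lock);

	return fence;	/* drop with radeon_fence_unref(&fence) when done */
}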
Op 19-12-12 15:20, Markus Trippelsdorf schreef:
> Your patch doesn't apply, unfortunately:
>
> markus@x4 linux % patch -p1 --dry-run < ~/maarten.patch
> checking file drivers/gpu/drm/ttm/ttm_bo.c
> Hunk #1 succeeded at 512 with fuzz 1.
> Hunk #6 FAILED at 814.
> 1 out of 6 hunks FAILED
> markus@x4 linux % git describe
> v3.7-10833-g752451f
> markus@x4 linux %

It applies on top of the regressed commit. It should probably not be too
hard to make it apply manually on whatever you're using.

But the real fix will be "drm/ttm: fix delayed ttm_bo_cleanup_refs_and_unlock
delayed handling", which I cc'd you on. The patch I posted earlier in this
thread just aggressively stress-tests the code path.

~Maarten
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 0bf66f9..9a8f0d8 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -512,6 +512,7 @@ static void ttm_bo_cleanup_refs_or_queue(struct ttm_buffer_object *bo)
 	spin_lock(&glob->lru_lock);
 	ret = ttm_bo_reserve_locked(bo, false, true, false, 0);
+	goto skip;
 	spin_lock(&bdev->fence_lock);
 	(void) ttm_bo_wait(bo, false, false, true);
 	if (!ret && !bo->sync_obj && 0) {
@@ -529,6 +530,7 @@ static void ttm_bo_cleanup_refs_or_queue(struct ttm_buffer_object *bo)
 		sync_obj = driver->sync_obj_ref(bo->sync_obj);
 	spin_unlock(&bdev->fence_lock);
+skip:
 	if (!ret) {
 		atomic_set(&bo->reserved, 0);
 		wake_up_all(&bo->event_queue);
@@ -542,8 +544,7 @@ static void ttm_bo_cleanup_refs_or_queue(struct ttm_buffer_object *bo)
 		driver->sync_obj_flush(sync_obj);
 		driver->sync_obj_unref(&sync_obj);
 	}
-	schedule_delayed_work(&bdev->wq,
-			      ((HZ / 100) < 1) ? 1 : HZ / 100);
+	schedule_delayed_work(&bdev->wq, HZ * 100);
 }
 
 /**
@@ -699,8 +700,7 @@ static void ttm_bo_delayed_workqueue(struct work_struct *work)
 	    container_of(work, struct ttm_bo_device, wq.work);
 
 	if (ttm_bo_delayed_delete(bdev, false)) {
-		schedule_delayed_work(&bdev->wq,
-				      ((HZ / 100) < 1) ? 1 : HZ / 100);
+		schedule_delayed_work(&bdev->wq, HZ * 100);
 	}
 }
 
@@ -743,8 +743,7 @@ EXPORT_SYMBOL(ttm_bo_lock_delayed_workqueue);
 void ttm_bo_unlock_delayed_workqueue(struct ttm_bo_device *bdev, int resched)
 {
 	if (resched)
-		schedule_delayed_work(&bdev->wq,
-				      ((HZ / 100) < 1) ? 1 : HZ / 100);
+		schedule_delayed_work(&bdev->wq, HZ * 100);
 }
 EXPORT_SYMBOL(ttm_bo_unlock_delayed_workqueue);
 
@@ -815,12 +814,15 @@ static int ttm_mem_evict_first(struct ttm_bo_device *bdev,
 
 retry:
 	spin_lock(&glob->lru_lock);
-	if (list_empty(&man->lru)) {
-		spin_unlock(&glob->lru_lock);
-		return -EBUSY;
-	}
+	if (list_empty(&bdev->ddestroy)) {
+		if (list_empty(&man->lru)) {
+			spin_unlock(&glob->lru_lock);
+			return -EBUSY;
+		}
+		bo = list_entry(man->lru.prev, struct ttm_buffer_object, lru);
+	} else
+		bo = list_entry(bdev->ddestroy.prev, struct ttm_buffer_object, ddestroy);
 
-	bo = list_first_entry(&man->lru, struct ttm_buffer_object, lru);
 	kref_get(&bo->list_kref);
 
 	if (!list_empty(&bo->ddestroy)) {
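Reading the hack (an interpretation of the diff, not a statement from the thread): the stock requeue delay, ((HZ / 100) < 1) ? 1 : HZ / 100, is about 10 ms, while HZ * 100 is 100 seconds, so the delayed-delete workqueue effectively never runs. Together with the goto skip, which bypasses the immediate-destruction path in ttm_bo_cleanup_refs_or_queue, every freed buffer lands on bdev->ddestroy, and the modified ttm_mem_evict_first prefers ddestroy entries, forcing eviction through ttm_bo_cleanup_refs_and_unlock, the exact path in the warning above.

The warning itself comes from kref_get(); this is roughly what sits at include/linux/kref.h:42 in v3.7 (quoted from memory, so verify against the actual header):

/*
 * kref_get(), approximately as of v3.7: taking a reference on an object
 * whose refcount has already dropped to zero means the object may
 * already have been freed, hence the WARN_ON that fires via
 * radeon_fence_ref() in the trace above.
 */
static inline void kref_get(struct kref *kref)
{
	WARN_ON(!atomic_read(&kref->refcount));
	atomic_inc(&kref->refcount);
}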