diff mbox

ttm_bo_add_to_lru mystery

Message ID 56BB02C2.7090004@daenzer.net (mailing list archive)
State New, archived
Headers show

Commit Message

Michel Dänzer Feb. 10, 2016, 9:28 a.m. UTC
I recently came to realize that nothing seems to be ensuring that BOs
which are unpinned in the amdgpu/radeon drivers get added to the TTM
LRU list. I thought that should be easy enough to fix, but I ran into
an issue that has me scratching my head.

See the attached patch for the current debugging state, and the
corresponding dmesg output below. The BO's lru member is considered
empty (i.e. it's not hooked up to the LRU list) in radeon_bo_unpin,
as expected. However, calling ttm_bo_add_to_lru in that case, the lru
member is suddenly not considered empty (i.e. it seems to be hooked up
to the LRU list, which would normally trigger the BUG_ON). And indeed,
the prev/next pointers are different between the two functions. But I
have no idea how they could be modified between those two points, or
why else they would be seeing different values...

*Any* ideas for what might be going on here would be much appreciated.


[    5.043993] &bo->tbo.lru=ffff88007c0d1918, bo->tbo.lru.prev=ffff88007c0d1918, bo->tbo.lru.next=ffff88007c0d1918
[    5.043996] ------------[ cut here ]------------
[    5.044004] WARNING: CPU: 3 PID: 117 at drivers/gpu/drm//ttm/ttm_bo.c:173 ttm_bo_add_to_lru+0x145/0x150 [ttm]()
[    5.044005] Modules linked in: hid_generic usbhid hid cpufreq_stats binfmt_misc cpufreq_powersave cpufreq_userspace cpufreq_conservative kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel nls_utf8 nls_cp437 vfat fat radeon(O) sha256_ssse3 sha256_generic hmac drbg ansi_cprng eeepc_wmi asus_wmi sparse_keymap rfkill evdev efi_pstore aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd edac_mce_amd r8169 psmouse pcspkr serio_raw snd_hda_codec_realtek snd_hda_codec_generic edac_core mii amdkfd efivars k10temp fam15h_power snd_hda_codec_hdmi amdgpu(O) ohci_pci ttm(O) snd_hda_intel xhci_pci snd_hda_codec drm_kms_helper i2c_piix4 snd_hda_core xhci_hcd drm snd_hwdep ehci_pci ohci_hcd ehci_hcd snd_pcm i2c_algo_bit sg usbcore snd_timer snd usb_common soundcore shpchp 8250_fintek
[    5.044037]  wmi acpi_cpufreq video tpm_tis tpm button processor efivarfs autofs4 ext4 crc16 mbcache jbd2 dm_mod sd_mod ahci libahci libata scsi_mod
[    5.044047] CPU: 3 PID: 117 Comm: kworker/u8:7 Tainted: G        W  O    4.4.0+ #193
[    5.044048] Hardware name: System manufacturer System Product Name/A88X-PRO, BIOS 1602 12/04/2014
[    5.044066] Workqueue: radeon-crtc radeon_unpin_work_func [radeon]
[    5.044068]  0000000000000000 0000000031d3b9d6 ffff88022407bd80 ffffffff813030f4
[    5.044070]  0000000000000000 ffff88022407bdb8 ffffffff8107a332 ffff88007c0d1868
[    5.044071]  ffff88007c0d1918 ffff88007c0d1800 ffff88007c0d1868 ffff880224cf1320
[    5.044073] Call Trace:
[    5.044078]  [<ffffffff813030f4>] dump_stack+0x44/0x60
[    5.044080]  [<ffffffff8107a332>] warn_slowpath_common+0x82/0xc0
[    5.044082]  [<ffffffff8107a47a>] warn_slowpath_null+0x1a/0x20
[    5.044085]  [<ffffffffa03260d5>] ttm_bo_add_to_lru+0x145/0x150 [ttm]
[    5.044096]  [<ffffffffa073e87b>] radeon_unpin_work_func+0x8b/0x1a0 [radeon]
[    5.044098]  [<ffffffff81092443>] process_one_work+0x1a3/0x3d0
[    5.044099]  [<ffffffff810926be>] worker_thread+0x4e/0x460
[    5.044101]  [<ffffffff81092670>] ? process_one_work+0x3d0/0x3d0
[    5.044102]  [<ffffffff81098848>] kthread+0xd8/0xf0
[    5.044104]  [<ffffffff81098770>] ? kthread_create_on_node+0x190/0x190
[    5.044107]  [<ffffffff815b5a4f>] ret_from_fork+0x3f/0x70
[    5.044108]  [<ffffffff81098770>] ? kthread_create_on_node+0x190/0x190
[    5.044109] ---[ end trace 64028355474e8bbe ]---
[    5.044110] [TTM] &bo->lru=ffff88007c0d1918, bo->lru.prev=ffff88007dcb6d18, bo->lru.next=ffff880220a748e8

Comments

Michel Dänzer Feb. 10, 2016, 9:57 a.m. UTC | #1
On 10.02.2016 18:28, Michel Dänzer wrote:
> 
> I recently came to realize that nothing seems to be ensuring that BOs
> which are unpinned in the amdgpu/radeon drivers get added to the TTM
> LRU list. I thought that should be easy enough to fix, but I ran into
> an issue that has me scratching my head.
> 
> See the attached patch for the current debugging state, and the
> corresponding dmesg output below. The BO's lru member is considered
> empty (i.e. it's not hooked up to the LRU list) in radeon_bo_unpin,
> as expected. However, calling ttm_bo_add_to_lru in that case, the lru
> member is suddenly not considered empty (i.e. it seems to be hooked up
> to the LRU list, which would normally trigger the BUG_ON). And indeed,
> the prev/next pointers are different between the two functions. But I
> have no idea how they could be modified between those two points, or
> why else they would be seeing different values...
> 
> *Any* ideas for what might be going on here would be much appreciated.

Okay, so it was just me being stupid and missing that ttm_bo_unreserve
already calls ttm_bo_add_to_lru. Thanks to olesalscheider for
enlightening me on IRC.
diff mbox

Patch

diff --git a/drivers/gpu/drm/radeon/radeon_object.c b/drivers/gpu/drm/radeon/radeon_object.c
index 84d4563..7f09ebf 100644
--- a/drivers/gpu/drm/radeon/radeon_object.c
+++ b/drivers/gpu/drm/radeon/radeon_object.c
@@ -396,6 +396,16 @@  int radeon_bo_unpin(struct radeon_bo *bo)
 	}
 	r = ttm_bo_validate(&bo->tbo, &bo->placement, false, false);
 	if (likely(r == 0)) {
+		struct ttm_bo_global *glob = bo->rdev->mman.bdev.glob;
+
+		spin_lock(&glob->lru_lock);
+		if (!WARN_ON_ONCE(!list_empty(&bo->tbo.lru))) {
+			pr_info_once("&bo->tbo.lru=%p, bo->tbo.lru.prev=%p, bo->tbo.lru.next=%p\n",
+				     &bo->tbo.lru, bo->tbo.lru.prev, bo->tbo.lru.next);
+			ttm_bo_add_to_lru(&bo->tbo);
+		}
+		spin_unlock(&glob->lru_lock);
+
 		if (bo->tbo.mem.mem_type == TTM_PL_VRAM)
 			bo->rdev->vram_pin_size -= radeon_bo_size(bo);
 		else
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 923079c..31353a9 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -170,7 +170,11 @@  void ttm_bo_add_to_lru(struct ttm_buffer_object *bo)
 
 	if (!(bo->mem.placement & TTM_PL_FLAG_NO_EVICT)) {
 
-		BUG_ON(!list_empty(&bo->lru));
+		if (WARN_ON_ONCE(!list_empty(&bo->lru))) {
+			pr_info_once("&bo->lru=%p, bo->lru.prev=%p, bo->lru.next=%p\n",
+				     &bo->lru, bo->lru.prev, bo->lru.next);
+			return;
+		}
 
 		man = &bdev->man[bo->mem.mem_type];
 		list_add_tail(&bo->lru, &man->lru);