ammdgpu fix for gfx1103 queue evict/restore crash

Message ID 20241121195233.10679-1-lamikr@gmail.com (mailing list archive)
State New, archived

Commit Message

Mika Laitio Nov. 21, 2024, 7:51 p.m. UTC
The AMD gfx1103/M780 iGPU eventually crashes while running PyTorch
ML/AI workloads on the ROCm SDK stack. After the crash, the Linux
desktop randomly either recovers once the app is killed, freezes,
or resets back to the login screen.

An easy way to trigger the problem is to build the ML/AI support for
the gfx1103 M780 iGPU with the rocm sdk builder and then run the
test application in a loop.

Additional trace messages helped to find out that the error always
happens in the same location: the kernel periodically calls
evict_process_queues_cpsch and restore_process_queues_cpsch in a loop,
and these in turn call MES to remove and restore the queues.

The crash requires a small but random number of these evict and
restore calls (usually around 10-50) before the error happens in the
kernel. On gfx1103 there seem to be 3 queues that are evicted and
restored, and the error always hits the second one in the list, the
one with doorbell 0x1002. In the trace below it is the REMOVE_QUEUE
request for that queue that MES fails to answer.

Adding delays does not help to avoid the crash, neither in the user
space PyTorch test application between calls (1 second) nor in the
kernel loop that removes/restores the queues one by one (mdelay(10)).

I tested multiple other GPUs and could not trigger a similar error on
gfx1010 (RX 5900), gfx1030 (RX 6800), gfx1035 (M680 iGPU) or gfx1102
(RX 7700S). Of these devices, only the gfx1102 uses the same code path
for calling the MES firmware.

Testing has mostly been done on 6.12-rc and 6.12 final kernels, but the
same problem can also be triggered at least on 6.8 and 6.11 kernels.
After the fix I have run the application in a loop multiple times for
over 1000 iterations without seeing the error again.

The original bug report and test case were made by jrl290 in rocm sdk
builder issue 141.

[  948.324174] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1202, queue: 2, caller: restore_process_queues_cpsch
[  948.334344] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1002, queue: 1, caller: restore_process_queues_cpsch
[  948.344499] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1000, queue: 0, caller: restore_process_queues_cpsch
[  952.380614] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes removed hardware queue from MES, doorbell=0x1202, queue: 2, caller: evict_process_queues_cpsch
[  952.391330] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes removed hardware queue from MES, doorbell=0x1002, queue: 1, caller: evict_process_queues_cpsch
[  952.401634] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes removed hardware queue from MES, doorbell=0x1000, queue: 0, caller: evict_process_queues_cpsch
[  952.414507] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1202, queue: 2, caller: restore_process_queues_cpsch
[  952.424618] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1002, queue: 1, caller: restore_process_queues_cpsch
[  952.434922] amdgpu 0000:c4:00.0: amdgpu: add_queue_mes added hardware queue to MES, doorbell=0x1000, queue: 0, caller: restore_process_queues_cpsch
[  952.446272] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes removed hardware queue from MES, doorbell=0x1202, queue: 2, caller: evict_process_queues_cpsch
[  954.460341] amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[  954.460356] amdgpu 0000:c4:00.0: amdgpu: remove_queue_mes failed to remove hardware queue from MES, doorbell=0x1002, queue: 1, caller: evict_process_queues_cpsch
[  954.460360] amdgpu 0000:c4:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
[  954.460366] amdgpu 0000:c4:00.0: amdgpu: Failed to evict queue 1
[  954.460368] amdgpu 0000:c4:00.0: amdgpu: Failed to evict process queues
[  954.460439] amdgpu 0000:c4:00.0: amdgpu: GPU reset begin!
[  954.460464] amdgpu 0000:c4:00.0: amdgpu: remove_all_queues_mes: Failed to remove queue 0 for dev 5257
[  954.460515] amdgpu 0000:c4:00.0: amdgpu: Dumping IP State
[  954.462637] amdgpu 0000:c4:00.0: amdgpu: Dumping IP State Completed
[  955.865591] amdgpu: process_termination_cpsch started
[  955.866432] amdgpu: process_termination_cpsch started
[  955.866445] amdgpu 0000:c4:00.0: amdgpu: Failed to remove queue 0
[  956.503043] amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[  956.503059] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[  958.507491] amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[  958.507507] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[  960.512077] amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[  960.512093] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[  960.785816] [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx

Signed-off-by: Mika Laitio <lamikr@gmail.com>
---
 .../drm/amd/amdkfd/kfd_device_queue_manager.c  | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)
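
The trace in the commit message can also be checked mechanically. The following is a minimal sketch, not part of the patch, that scans dmesg-style lines for the first failed queue removal and extracts its doorbell offset. The function name first_failed_removal is hypothetical, and the matched strings assume the author's extra trace messages are present in the running kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Return the index of the first line that reports a failed queue removal
 * and store its doorbell offset in *doorbell. Return -1 if no such line
 * exists. The message text matched here comes from the additional trace
 * messages shown in the commit message, not from a stock kernel.
 */
static long first_failed_removal(const char *const *lines, size_t count,
				 unsigned int *doorbell)
{
	size_t i;

	assert(lines && doorbell);

	for (i = 0; i < count; i++) {
		const char *field;

		if (!strstr(lines[i], "failed to remove hardware queue"))
			continue;

		/* The doorbell is printed as "doorbell=0x<hex>". */
		field = strstr(lines[i], "doorbell=0x");
		if (field && sscanf(field, "doorbell=0x%x", doorbell) == 1)
			return (long)i;
	}
	return -1;
}
```

Fed with the lines of several crash logs, a helper like this makes it easy to confirm whether the failing doorbell is the same in every run.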

Comments

kernel test robot Nov. 23, 2024, 8:31 a.m. UTC | #1
Hi Mika,

kernel test robot noticed the following build warnings:

[auto build test WARNING on drm-misc/drm-misc-next]
[also build test WARNING on drm-tip/drm-tip v6.12 next-20241122]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Mika-Laitio/ammdgpu-fix-for-gfx1103-queue-evict-restore-crash/20241122-035602
base:   git://anongit.freedesktop.org/drm/drm-misc drm-misc-next
patch link:    https://lore.kernel.org/r/20241121195233.10679-1-lamikr%40gmail.com
patch subject: [PATCH] ammdgpu fix for gfx1103 queue evict/restore crash
config: arm64-allmodconfig (https://download.01.org/0day-ci/archive/20241123/202411231603.PMbyCkko-lkp@intel.com/config)
compiler: clang version 20.0.0git (https://github.com/llvm/llvm-project 592c0fe55f6d9a811028b5f3507be91458ab2713)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241123/202411231603.PMbyCkko-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202411231603.PMbyCkko-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device_queue_manager.c:32:
   In file included from drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_priv.h:37:
   In file included from include/linux/kfifo.h:40:
   In file included from include/linux/dma-mapping.h:11:
   In file included from include/linux/scatterlist.h:8:
   In file included from include/linux/mm.h:2213:
   include/linux/vmstat.h:504:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     504 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     505 |                            item];
         |                            ~~~~
   include/linux/vmstat.h:511:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     511 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     512 |                            NR_VM_NUMA_EVENT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~~
   include/linux/vmstat.h:518:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
     518 |         return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
         |                               ~~~~~~~~~~~ ^ ~~~
   include/linux/vmstat.h:524:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     524 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     525 |                            NR_VM_NUMA_EVENT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~~
>> drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device_queue_manager.c:1354:1: warning: unused label 'out_unlock' [-Wunused-label]
    1354 | out_unlock:
         | ^~~~~~~~~~~
   5 warnings generated.


vim +/out_unlock +1354 drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device_queue_manager.c

  1292	
  1293	static int restore_process_queues_cpsch(struct device_queue_manager *dqm,
  1294						struct qcm_process_device *qpd)
  1295	{
  1296		struct queue *q;
  1297		struct device *dev = dqm->dev->adev->dev;
  1298		struct kfd_process_device *pdd;
  1299		uint64_t eviction_duration;
  1300		int retval = 0;
  1301	
  1302		// gfx1103 APU fails to remove the queue usually after 10-50 attempts
  1303		if (dqm->dev->adev->flags & AMD_IS_APU)
  1304			goto out;
  1305		pdd = qpd_to_pdd(qpd);
  1306	
  1307		dqm_lock(dqm);
  1308		if (WARN_ON_ONCE(!qpd->evicted)) /* already restored, do nothing */
  1309			goto out;
  1310		if (qpd->evicted > 1) { /* ref count still > 0, decrement & quit */
  1311			qpd->evicted--;
  1312			goto out;
  1313		}
  1314	
  1315		/* The debugger creates processes that temporarily have not acquired
  1316		 * all VMs for all devices and has no VMs itself.
  1317		 * Skip queue restore on process restore.
  1318		 */
  1319		if (!pdd->drm_priv)
  1320			goto vm_not_acquired;
  1321	
  1322		pr_debug_ratelimited("Restoring PASID 0x%x queues\n",
  1323				    pdd->process->pasid);
  1324	
  1325		/* Update PD Base in QPD */
  1326		qpd->page_table_base = amdgpu_amdkfd_gpuvm_get_process_page_dir(pdd->drm_priv);
  1327		pr_debug("Updated PD address to 0x%llx\n", qpd->page_table_base);
  1328	
  1329		/* activate all active queues on the qpd */
  1330		list_for_each_entry(q, &qpd->queues_list, list) {
  1331			q->properties.is_evicted = false;
  1332			if (!QUEUE_IS_ACTIVE(q->properties))
  1333				continue;
  1334	
  1335			q->properties.is_active = true;
  1336			increment_queue_count(dqm, &pdd->qpd, q);
  1337	
  1338			if (dqm->dev->kfd->shared_resources.enable_mes) {
  1339				retval = add_queue_mes(dqm, q, qpd);
  1340				if (retval) {
  1341					dev_err(dev, "Failed to restore queue %d\n",
  1342						q->properties.queue_id);
  1343					goto out;
  1344				}
  1345			}
  1346		}
  1347		if (!dqm->dev->kfd->shared_resources.enable_mes)
  1348			retval = execute_queues_cpsch(dqm,
  1349						      KFD_UNMAP_QUEUES_FILTER_DYNAMIC_QUEUES, 0, USE_DEFAULT_GRACE_PERIOD);
  1350		eviction_duration = get_jiffies_64() - pdd->last_evict_timestamp;
  1351		atomic64_add(eviction_duration, &pdd->evict_duration_counter);
  1352	vm_not_acquired:
  1353		qpd->evicted = 0;
> 1354	out_unlock:
  1355		dqm_unlock(dqm);
  1356	out:
  1357		return retval;
  1358	}
  1359
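
The unused-label warning points at more than a cosmetic problem. In the listing above, dqm_lock() is taken at line 1307, yet lines 1309, 1312 and 1343 still use "goto out", and the "out:" label now sits after dqm_unlock(). The sketch below reproduces that label layout with a mock lock to show the effect. All names (mock_dqm, restore_as_posted, restore_balanced) are hypothetical stand-ins, and the "evicted != 1" test is a simplification of the two early-exit checks in the real function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct device_queue_manager and its lock. */
struct mock_dqm {
	int lock_depth;	/* 0 = unlocked, 1 = held */
	bool is_apu;
};

static void mock_lock(struct mock_dqm *dqm)
{
	dqm->lock_depth++;
}

static void mock_unlock(struct mock_dqm *dqm)
{
	assert(dqm->lock_depth > 0);	/* unlocking an unheld lock is a bug */
	dqm->lock_depth--;
}

/* Label usage as in the quoted listing: "goto out" after the lock is taken. */
static int restore_as_posted(struct mock_dqm *dqm, int evicted)
{
	int retval = 0;

	if (dqm->is_apu)
		goto out;	/* fine: the lock is not held yet */

	mock_lock(dqm);
	if (evicted != 1)
		goto out;	/* skips mock_unlock(): the lock stays held */

	/* ... queue restore work would go here ... */
	mock_unlock(dqm);
out:
	return retval;
}

/* Same flow, but every exit taken under the lock goes through out_unlock. */
static int restore_balanced(struct mock_dqm *dqm, int evicted)
{
	int retval = 0;

	if (dqm->is_apu)
		goto out;

	mock_lock(dqm);
	if (evicted != 1)
		goto out_unlock;

	/* ... queue restore work would go here ... */
out_unlock:
	mock_unlock(dqm);
out:
	return retval;
}
```

With this layout, only exits taken before the lock may jump to "out". Converting the remaining "goto out" statements in restore_process_queues_cpsch to "goto out_unlock", as the patch already does in evict_process_queues_cpsch, would presumably both balance the lock and silence the warning.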
kernel test robot Nov. 23, 2024, 10:05 a.m. UTC | #2
Hi Mika,

kernel test robot noticed the following build warnings:

[auto build test WARNING on drm-misc/drm-misc-next]
[also build test WARNING on drm-tip/drm-tip v6.12 next-20241122]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Mika-Laitio/ammdgpu-fix-for-gfx1103-queue-evict-restore-crash/20241122-035602
base:   git://anongit.freedesktop.org/drm/drm-misc drm-misc-next
patch link:    https://lore.kernel.org/r/20241121195233.10679-1-lamikr%40gmail.com
patch subject: [PATCH] ammdgpu fix for gfx1103 queue evict/restore crash
config: powerpc-allmodconfig (https://download.01.org/0day-ci/archive/20241123/202411231721.vBNjHNvr-lkp@intel.com/config)
compiler: powerpc64-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241123/202411231721.vBNjHNvr-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202411231721.vBNjHNvr-lkp@intel.com/

All warnings (new ones prefixed by >>):

   drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device_queue_manager.c: In function 'restore_process_queues_cpsch':
>> drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device_queue_manager.c:1354:1: warning: label 'out_unlock' defined but not used [-Wunused-label]
    1354 | out_unlock:
         | ^~~~~~~~~~


vim +/out_unlock +1354 drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device_queue_manager.c

  1292	
  1293	static int restore_process_queues_cpsch(struct device_queue_manager *dqm,
  1294						struct qcm_process_device *qpd)
  1295	{
  1296		struct queue *q;
  1297		struct device *dev = dqm->dev->adev->dev;
  1298		struct kfd_process_device *pdd;
  1299		uint64_t eviction_duration;
  1300		int retval = 0;
  1301	
  1302		// gfx1103 APU fails to remove the queue usually after 10-50 attempts
  1303		if (dqm->dev->adev->flags & AMD_IS_APU)
  1304			goto out;
  1305		pdd = qpd_to_pdd(qpd);
  1306	
  1307		dqm_lock(dqm);
  1308		if (WARN_ON_ONCE(!qpd->evicted)) /* already restored, do nothing */
  1309			goto out;
  1310		if (qpd->evicted > 1) { /* ref count still > 0, decrement & quit */
  1311			qpd->evicted--;
  1312			goto out;
  1313		}
  1314	
  1315		/* The debugger creates processes that temporarily have not acquired
  1316		 * all VMs for all devices and has no VMs itself.
  1317		 * Skip queue restore on process restore.
  1318		 */
  1319		if (!pdd->drm_priv)
  1320			goto vm_not_acquired;
  1321	
  1322		pr_debug_ratelimited("Restoring PASID 0x%x queues\n",
  1323				    pdd->process->pasid);
  1324	
  1325		/* Update PD Base in QPD */
  1326		qpd->page_table_base = amdgpu_amdkfd_gpuvm_get_process_page_dir(pdd->drm_priv);
  1327		pr_debug("Updated PD address to 0x%llx\n", qpd->page_table_base);
  1328	
  1329		/* activate all active queues on the qpd */
  1330		list_for_each_entry(q, &qpd->queues_list, list) {
  1331			q->properties.is_evicted = false;
  1332			if (!QUEUE_IS_ACTIVE(q->properties))
  1333				continue;
  1334	
  1335			q->properties.is_active = true;
  1336			increment_queue_count(dqm, &pdd->qpd, q);
  1337	
  1338			if (dqm->dev->kfd->shared_resources.enable_mes) {
  1339				retval = add_queue_mes(dqm, q, qpd);
  1340				if (retval) {
  1341					dev_err(dev, "Failed to restore queue %d\n",
  1342						q->properties.queue_id);
  1343					goto out;
  1344				}
  1345			}
  1346		}
  1347		if (!dqm->dev->kfd->shared_resources.enable_mes)
  1348			retval = execute_queues_cpsch(dqm,
  1349						      KFD_UNMAP_QUEUES_FILTER_DYNAMIC_QUEUES, 0, USE_DEFAULT_GRACE_PERIOD);
  1350		eviction_duration = get_jiffies_64() - pdd->last_evict_timestamp;
  1351		atomic64_add(eviction_duration, &pdd->evict_duration_counter);
  1352	vm_not_acquired:
  1353		qpd->evicted = 0;
> 1354	out_unlock:
  1355		dqm_unlock(dqm);
  1356	out:
  1357		return retval;
  1358	}
  1359

Patch

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 648f40091aa3..027c8b4010c2 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -1156,9 +1156,12 @@  static int evict_process_queues_cpsch(struct device_queue_manager *dqm,
 	struct kfd_process_device *pdd;
 	int retval = 0;
 
+	// gfx1103 APU fails to remove the queue usually after 10-50 attempts
+	if (dqm->dev->adev->flags & AMD_IS_APU)
+		goto out;
 	dqm_lock(dqm);
 	if (qpd->evicted++ > 0) /* already evicted, do nothing */
-		goto out;
+		goto out_unlock;
 
 	pdd = qpd_to_pdd(qpd);
 
@@ -1167,7 +1170,7 @@  static int evict_process_queues_cpsch(struct device_queue_manager *dqm,
 	 * Skip queue eviction on process eviction.
 	 */
 	if (!pdd->drm_priv)
-		goto out;
+		goto out_unlock;
 
 	pr_debug_ratelimited("Evicting PASID 0x%x queues\n",
 			    pdd->process->pasid);
@@ -1188,7 +1191,7 @@  static int evict_process_queues_cpsch(struct device_queue_manager *dqm,
 			if (retval) {
 				dev_err(dev, "Failed to evict queue %d\n",
 					q->properties.queue_id);
-				goto out;
+				goto out_unlock;
 			}
 		}
 	}
@@ -1200,8 +1203,9 @@  static int evict_process_queues_cpsch(struct device_queue_manager *dqm,
 					      KFD_UNMAP_QUEUES_FILTER_DYNAMIC_QUEUES, 0,
 					      USE_DEFAULT_GRACE_PERIOD);
 
-out:
+out_unlock:
 	dqm_unlock(dqm);
+out:
 	return retval;
 }
 
@@ -1295,6 +1299,9 @@  static int restore_process_queues_cpsch(struct device_queue_manager *dqm,
 	uint64_t eviction_duration;
 	int retval = 0;
 
+	// gfx1103 APU fails to remove the queue usually after 10-50 attempts
+	if (dqm->dev->adev->flags & AMD_IS_APU)
+		goto out;
 	pdd = qpd_to_pdd(qpd);
 
 	dqm_lock(dqm);
@@ -1344,8 +1351,9 @@  static int restore_process_queues_cpsch(struct device_queue_manager *dqm,
 	atomic64_add(eviction_duration, &pdd->evict_duration_counter);
 vm_not_acquired:
 	qpd->evicted = 0;
-out:
+out_unlock:
 	dqm_unlock(dqm);
+out:
 	return retval;
 }
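
Both patched functions are built around the qpd->evicted counter that is visible in the diff context: evict increments it and only the first eviction removes queues, while restore decrements it and only the last restore adds them back. The following is a minimal sketch of that counting scheme, assuming nothing beyond what the context lines show. The struct and function names are hypothetical, and the hardware calls are replaced by plain counters.

```c
#include <assert.h>

/* Hypothetical stand-in for the fields of struct qcm_process_device used here. */
struct mock_qpd {
	unsigned int evicted;	/* nesting depth of outstanding evictions */
	int hw_evictions;	/* times the queues were actually removed */
	int hw_restores;	/* times the queues were actually re-added */
};

static void mock_evict(struct mock_qpd *qpd)
{
	if (qpd->evicted++ > 0)
		return;			/* already evicted, only count it */
	qpd->hw_evictions++;		/* first eviction touches the hardware */
}

static void mock_restore(struct mock_qpd *qpd)
{
	if (qpd->evicted == 0)
		return;			/* already restored, nothing to do */
	if (qpd->evicted > 1) {
		qpd->evicted--;		/* other evictions still outstanding */
		return;
	}
	qpd->hw_restores++;		/* last restore touches the hardware */
	qpd->evicted = 0;
}

/* Nested evictions collapse into a single hardware round trip. */
static void mock_nested_demo(void)
{
	struct mock_qpd qpd = { 0, 0, 0 };

	mock_evict(&qpd);
	mock_evict(&qpd);
	mock_restore(&qpd);
	assert(qpd.evicted == 1 && qpd.hw_restores == 0);
	mock_restore(&qpd);
	assert(qpd.evicted == 0);
	assert(qpd.hw_evictions == 1 && qpd.hw_restores == 1);
}
```

Because the patch returns before this counter is touched whenever AMD_IS_APU is set, the counter never changes on any APU, not only on gfx1103, and queues are never removed on eviction there. Whether that is acceptable for memory eviction on those devices is not addressed in the commit message.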