From patchwork Wed Nov 9 09:50:06 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Christian_K=C3=B6nig?= X-Patchwork-Id: 13037298 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 431EBC4332F for ; Wed, 9 Nov 2022 09:50:20 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 6C55810E12B; Wed, 9 Nov 2022 09:50:18 +0000 (UTC) Received: from mail-ed1-x52b.google.com (mail-ed1-x52b.google.com [IPv6:2a00:1450:4864:20::52b]) by gabe.freedesktop.org (Postfix) with ESMTPS id E208910E12B for ; Wed, 9 Nov 2022 09:50:14 +0000 (UTC) Received: by mail-ed1-x52b.google.com with SMTP id l11so26382181edb.4 for ; Wed, 09 Nov 2022 01:50:14 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:from:to:cc:subject:date:message-id:reply-to; bh=ZDbzL71qJGKEEWxhHhjVdFL1vfQxk8fIaF9WfWTQzrc=; b=pGS3SYmk3psyYPHIkNGOOz0PyD8tZnh8P1rg9rKQB/Pl+1DPjjC9gQrT7znnxVYjAx 3ELIzNt2mrWpxM3BA0gdPuGNKXhPXi+a2Df2R7E4/QAySlImveUEq3oRfBW6b2eueU2Q aCgFhXMG6XkV9JId0b0dcZnatQ66G9ai3W122cR6WAWnVwupYQtUXYKmwj398xiI1LAW bYFPdgnVWAfsUWWeeqBIZPQAji8emBrkKafPbdBpCyxVtWvTVmFuHWFSGPPKlwWPhm7x xkZY5CmI298LYxKL0i5Pw5QpHxR00kVIKwX5J4zZbufkUxYoqlOT/VYucb+/mF2H/Xc+ OitQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=ZDbzL71qJGKEEWxhHhjVdFL1vfQxk8fIaF9WfWTQzrc=; 
b=QSgOLQYDd3L24ASFpw2cXk7BuDzzanUIhNu97f4IvyGH5XtZYLoT36WQ36qD5U/hQf 7/kfYnlp6e+zXJwNTI8CIJkggQaOYMEtdi7VrJ/HzKIRwczACJU9A+PKa9hKvKFr7h52 aSiCqygjUQYyHOUEipRbpS00gxIlhYOaWZUK7uI+KIRjuC8wFSXIEHsDohnr05vXebrz oeW8mwpGXMw00+mMVVjBiUFK1tJd4bkFUgAZiI5453nNhfz6uMrtoaQmOeZk+Xr+M+47 IDHIXpCoROcp165dAf2zddb6kzfeWPFCMt7tZT7gH9Wbb9zfI5biX8EQxevPfIW0IsMt 4VtA== X-Gm-Message-State: ACrzQf0rH7lOgJwcAr0jT9RQHLzViLwK91XvycVt29E7ZA3v+hbr7SL+ hDNRaBvw2ZYC1Kl4E3ZbTIgwwPTm7zY= X-Google-Smtp-Source: AMsMyM4OmcnJ8Fa9WX2WRakTnwvferlD8GphTEUaaWfbcbrxjzgERyDG2XH9lkyWDWFm5lW7+nmR5w== X-Received: by 2002:a05:6402:1767:b0:461:f1c6:1f22 with SMTP id da7-20020a056402176700b00461f1c61f22mr1126435edb.95.1667987412901; Wed, 09 Nov 2022 01:50:12 -0800 (PST) Received: from able.fritz.box (p5b0ea229.dip0.t-ipconnect.de. [91.14.162.41]) by smtp.gmail.com with ESMTPSA id s12-20020a1709062ecc00b00780f24b797dsm5604543eji.108.2022.11.09.01.50.11 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 09 Nov 2022 01:50:12 -0800 (PST) From: " =?utf-8?q?Christian_K=C3=B6nig?= " X-Google-Original-From: =?utf-8?q?Christian_K=C3=B6nig?= To: Alexander.Deucher@amd.com, daniel.vetter@ffwll.ch, dri-devel@lists.freedesktop.org, Shaoyun.Liu@amd.com Subject: [PATCH 1/5] drm/amd/amdgpu revert "implement tdr advanced mode" Date: Wed, 9 Nov 2022 10:50:06 +0100 Message-Id: <20221109095010.141189-1-christian.koenig@amd.com> X-Mailer: git-send-email 2.34.1 MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: =?utf-8?q?Christian_K=C3=B6nig?= Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" This reverts commit e6c6338f393b74ac0b303d567bb918b44ae7ad75. This feature basically re-submits one job after another to figure out which one was the one causing a hang. 
This is obviously incompatible with gang-submit, which requires that multiple jobs run at the same time. It's also absolutely not helpful to crash the hardware multiple times if a clean recovery is desired. For testing and debugging environments we should rather disable recovery altogether to be able to inspect the state with a hw debugger. In addition, the sw implementation is clearly buggy and causes reference count issues for the hardware fence. Signed-off-by: Christian König Reviewed-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 103 --------------------- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +- drivers/gpu/drm/scheduler/sched_main.c | 58 ++---------- include/drm/gpu_scheduler.h | 3 - 4 files changed, 10 insertions(+), 156 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 5b9f992e4607..0da55fd97df8 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -5077,94 +5077,6 @@ static int amdgpu_device_suspend_display_audio(struct amdgpu_device *adev) return 0; } -static void amdgpu_device_recheck_guilty_jobs( - struct amdgpu_device *adev, struct list_head *device_list_handle, - struct amdgpu_reset_context *reset_context) -{ - int i, r = 0; - - for (i = 0; i < AMDGPU_MAX_RINGS; ++i) { - struct amdgpu_ring *ring = adev->rings[i]; - int ret = 0; - struct drm_sched_job *s_job; - - if (!ring || !ring->sched.thread) - continue; - - s_job = list_first_entry_or_null(&ring->sched.pending_list, - struct drm_sched_job, list); - if (s_job == NULL) - continue; - - /* clear job's guilty and depend the folowing step to decide the real one */ - drm_sched_reset_karma(s_job); - drm_sched_resubmit_jobs_ext(&ring->sched, 1); - - if (!s_job->s_fence->parent) { - DRM_WARN("Failed to get a HW fence for job!"); - continue; - } - - ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout); - if (ret == 0) { /* 
timeout */ - DRM_ERROR("Found the real bad job! ring:%s, job_id:%llx\n", - ring->sched.name, s_job->id); - - - amdgpu_fence_driver_isr_toggle(adev, true); - - /* Clear this failed job from fence array */ - amdgpu_fence_driver_clear_job_fences(ring); - - amdgpu_fence_driver_isr_toggle(adev, false); - - /* Since the job won't signal and we go for - * another resubmit drop this parent pointer - */ - dma_fence_put(s_job->s_fence->parent); - s_job->s_fence->parent = NULL; - - /* set guilty */ - drm_sched_increase_karma(s_job); - amdgpu_reset_prepare_hwcontext(adev, reset_context); -retry: - /* do hw reset */ - if (amdgpu_sriov_vf(adev)) { - amdgpu_virt_fini_data_exchange(adev); - r = amdgpu_device_reset_sriov(adev, false); - if (r) - adev->asic_reset_res = r; - } else { - clear_bit(AMDGPU_SKIP_HW_RESET, - &reset_context->flags); - r = amdgpu_do_asic_reset(device_list_handle, - reset_context); - if (r && r == -EAGAIN) - goto retry; - } - - /* - * add reset counter so that the following - * resubmitted job could flush vmid - */ - atomic_inc(&adev->gpu_reset_counter); - continue; - } - - /* got the hw fence, signal finished fence */ - atomic_dec(ring->sched.score); - dma_fence_get(&s_job->s_fence->finished); - dma_fence_signal(&s_job->s_fence->finished); - dma_fence_put(&s_job->s_fence->finished); - - /* remove node from list and free the job */ - spin_lock(&ring->sched.job_list_lock); - list_del_init(&s_job->list); - spin_unlock(&ring->sched.job_list_lock); - ring->sched.ops->free_job(s_job); - } -} - static inline void amdgpu_device_stop_pending_resets(struct amdgpu_device *adev) { struct amdgpu_ras *con = amdgpu_ras_get_context(adev); @@ -5185,7 +5097,6 @@ static inline void amdgpu_device_stop_pending_resets(struct amdgpu_device *adev) } - /** * amdgpu_device_gpu_recover - reset the asic and recover scheduler * @@ -5208,7 +5119,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, int i, r = 0; bool need_emergency_restart = false; bool audio_suspended = false; 
- int tmp_vram_lost_counter; bool gpu_reset_for_dev_remove = false; gpu_reset_for_dev_remove = @@ -5354,7 +5264,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, amdgpu_device_stop_pending_resets(tmp_adev); } - tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter)); /* Actual ASIC resets if needed.*/ /* Host driver will handle XGMI hive reset for SRIOV */ if (amdgpu_sriov_vf(adev)) { @@ -5379,18 +5288,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, /* Post ASIC reset for all devs .*/ list_for_each_entry(tmp_adev, device_list_handle, reset_list) { - /* - * Sometimes a later bad compute job can block a good gfx job as gfx - * and compute ring share internal GC HW mutually. We add an additional - * guilty jobs recheck step to find the real guilty job, it synchronously - * submits and pends for the first job being signaled. If it gets timeout, - * we identify it as a real guilty job. - */ - if (amdgpu_gpu_recovery == 2 && - !(tmp_vram_lost_counter < atomic_read(&adev->vram_lost_counter))) - amdgpu_device_recheck_guilty_jobs( - tmp_adev, device_list_handle, reset_context); - for (i = 0; i < AMDGPU_MAX_RINGS; ++i) { struct amdgpu_ring *ring = tmp_adev->rings[i]; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c index 8e97e95aca8c..a6820603214f 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c @@ -519,7 +519,7 @@ module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444); * DOC: gpu_recovery (int) * Set to enable GPU recovery mechanism (1 = enable, 0 = disable). The default is -1 (auto, disabled except SRIOV). 
*/ -MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (2 = advanced tdr mode, 1 = enable, 0 = disable, -1 = auto)"); +MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = auto)"); module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444); /** diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index 68317d3a7a27..e77e1fd16732 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -355,27 +355,6 @@ static void drm_sched_job_timedout(struct work_struct *work) } } - /** - * drm_sched_increase_karma - Update sched_entity guilty flag - * - * @bad: The job guilty of time out - * - * Increment on every hang caused by the 'bad' job. If this exceeds the hang - * limit of the scheduler then the respective sched entity is marked guilty and - * jobs from it will not be scheduled further - */ -void drm_sched_increase_karma(struct drm_sched_job *bad) -{ - drm_sched_increase_karma_ext(bad, 1); -} -EXPORT_SYMBOL(drm_sched_increase_karma); - -void drm_sched_reset_karma(struct drm_sched_job *bad) -{ - drm_sched_increase_karma_ext(bad, 0); -} -EXPORT_SYMBOL(drm_sched_reset_karma); - /** * drm_sched_stop - stop the scheduler * @@ -516,32 +495,15 @@ EXPORT_SYMBOL(drm_sched_start); * */ void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched) -{ - drm_sched_resubmit_jobs_ext(sched, INT_MAX); -} -EXPORT_SYMBOL(drm_sched_resubmit_jobs); - -/** - * drm_sched_resubmit_jobs_ext - helper to relunch certain number of jobs from mirror ring list - * - * @sched: scheduler instance - * @max: job numbers to relaunch - * - */ -void drm_sched_resubmit_jobs_ext(struct drm_gpu_scheduler *sched, int max) { struct drm_sched_job *s_job, *tmp; uint64_t guilty_context; bool found_guilty = false; struct dma_fence *fence; - int i = 0; list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) { struct drm_sched_fence *s_fence = s_job->s_fence; - if (i >= 
max) - break; - if (!found_guilty && atomic_read(&s_job->karma) > sched->hang_limit) { found_guilty = true; guilty_context = s_job->s_fence->scheduled.context; @@ -551,7 +513,6 @@ void drm_sched_resubmit_jobs_ext(struct drm_gpu_scheduler *sched, int max) dma_fence_set_error(&s_fence->finished, -ECANCELED); fence = sched->ops->run_job(s_job); - i++; if (IS_ERR_OR_NULL(fence)) { if (IS_ERR(fence)) @@ -567,7 +528,7 @@ void drm_sched_resubmit_jobs_ext(struct drm_gpu_scheduler *sched, int max) } } } -EXPORT_SYMBOL(drm_sched_resubmit_jobs_ext); +EXPORT_SYMBOL(drm_sched_resubmit_jobs); /** * drm_sched_job_init - init a scheduler job @@ -1082,13 +1043,15 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched) EXPORT_SYMBOL(drm_sched_fini); /** - * drm_sched_increase_karma_ext - Update sched_entity guilty flag + * drm_sched_increase_karma - Update sched_entity guilty flag * * @bad: The job guilty of time out - * @type: type for increase/reset karma * + * Increment on every hang caused by the 'bad' job. If this exceeds the hang + * limit of the scheduler then the respective sched entity is marked guilty and + * jobs from it will not be scheduled further */ -void drm_sched_increase_karma_ext(struct drm_sched_job *bad, int type) +void drm_sched_increase_karma(struct drm_sched_job *bad) { int i; struct drm_sched_entity *tmp; @@ -1100,10 +1063,7 @@ void drm_sched_increase_karma_ext(struct drm_sched_job *bad, int type) * corrupt but keep in mind that kernel jobs always considered good. 
*/ if (bad->s_priority != DRM_SCHED_PRIORITY_KERNEL) { - if (type == 0) - atomic_set(&bad->karma, 0); - else if (type == 1) - atomic_inc(&bad->karma); + atomic_inc(&bad->karma); for (i = DRM_SCHED_PRIORITY_MIN; i < DRM_SCHED_PRIORITY_KERNEL; i++) { @@ -1114,7 +1074,7 @@ void drm_sched_increase_karma_ext(struct drm_sched_job *bad, int type) if (bad->s_fence->scheduled.context == entity->fence_context) { if (entity->guilty) - atomic_set(entity->guilty, type); + atomic_set(entity->guilty, 1); break; } } @@ -1124,4 +1084,4 @@ void drm_sched_increase_karma_ext(struct drm_sched_job *bad, int type) } } } -EXPORT_SYMBOL(drm_sched_increase_karma_ext); +EXPORT_SYMBOL(drm_sched_increase_karma); diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h index 289a33e80639..156601fd7053 100644 --- a/include/drm/gpu_scheduler.h +++ b/include/drm/gpu_scheduler.h @@ -497,10 +497,7 @@ void drm_sched_wakeup(struct drm_gpu_scheduler *sched); void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad); void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery); void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched); -void drm_sched_resubmit_jobs_ext(struct drm_gpu_scheduler *sched, int max); void drm_sched_increase_karma(struct drm_sched_job *bad); -void drm_sched_reset_karma(struct drm_sched_job *bad); -void drm_sched_increase_karma_ext(struct drm_sched_job *bad, int type); bool drm_sched_dependency_optimized(struct dma_fence* fence, struct drm_sched_entity *entity); void drm_sched_fault(struct drm_gpu_scheduler *sched); From patchwork Wed Nov 9 09:50:07 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Christian_K=C3=B6nig?= X-Patchwork-Id: 13037300 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using 
TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id DAB9BC433FE for ; Wed, 9 Nov 2022 09:50:28 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 485CD10E2DC; Wed, 9 Nov 2022 09:50:27 +0000 (UTC) Received: from mail-ed1-x529.google.com (mail-ed1-x529.google.com [IPv6:2a00:1450:4864:20::529]) by gabe.freedesktop.org (Postfix) with ESMTPS id B5CFF10E12B for ; Wed, 9 Nov 2022 09:50:15 +0000 (UTC) Received: by mail-ed1-x529.google.com with SMTP id z18so26371023edb.9 for ; Wed, 09 Nov 2022 01:50:15 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=4PC8s8cF2OyJcBZV0kCPqZwAqYghNKvo1Uaxmwzw5J0=; b=XCkkRaruS2Vl7vvn4R7cj3IvkpILjmpZY6Y1sl10zxcK3cyD6U5TW2oAuqDYHIjsGu F98lLwRiwT/o6UT+Ip6ubGF2WogZsvsZ+e6BmGhrbf8hcgWJrPXbTrdBpIMsnEW/7IVg sQoLOW5CWy5ywhdhfwjT/dXd4hZt8XPagXAxRYfVANRVISp+dgrTO7Rio/4cPNZdsna9 FuqZCyFXMJL070JcAlO8uEuxrirshv7mkUT1lWAuILmEdiOx0HGZN/fggAEt8uqpBPXe m0nJlbHyuG4j4y7EdAeMEAypKBCkVCs2IRD3E5TrVi6UXLSnXYT03b9yGMTrvlTPJgqg kJFA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=4PC8s8cF2OyJcBZV0kCPqZwAqYghNKvo1Uaxmwzw5J0=; b=vOSDSF4upUUlwNo0SKWqWDONPiwkb8MYGx4QKfW5B1GbozWqSDfCBT3ijMec792sbL VGWkN+mfL5H+lA/PBoxJsrFn1CcejHyaOW2A1UTZHG7tcn9YJcpEQbOBdxNiFYWdIvQ4 c0+QlTdRtLH+BpK5iG1onj6NjBvbwl1o0BWTV9ca9nN5iCp87KZe1XLbk408nQGgq0TQ KpfFmvNvBeeAUKYv8WU2tJ7ylCsZyJVUcrw7hzCvXW/i0iXlSh32Bp+lzXvLkFZCXYCB vXDOIbde77BlxSW4KkGVdibgrmlYNc9iBTS842jcN6OQ4OohTQ975eI1t8xu0EkHhj/X C6Qw== X-Gm-Message-State: 
ACrzQf2K9VwqOE9b/lqcFTmjfeBThdYIjp6yReM9q0ObtUoDKhCgdd8u GqiOl5pDYDMZVMXJ3ZtlBO0= X-Google-Smtp-Source: AMsMyM5XYEjuXKph2BcWKOcfHOA5o7u6F6xUGkXQVwTjIHSa3CGyJ0wIgzreFlD4a7H4seqKPhW7Lg== X-Received: by 2002:a50:d6d1:0:b0:45f:9526:e35a with SMTP id l17-20020a50d6d1000000b0045f9526e35amr1124386edj.256.1667987414093; Wed, 09 Nov 2022 01:50:14 -0800 (PST) Received: from able.fritz.box (p5b0ea229.dip0.t-ipconnect.de. [91.14.162.41]) by smtp.gmail.com with ESMTPSA id s12-20020a1709062ecc00b00780f24b797dsm5604543eji.108.2022.11.09.01.50.12 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 09 Nov 2022 01:50:13 -0800 (PST) From: " =?utf-8?q?Christian_K=C3=B6nig?= " X-Google-Original-From: =?utf-8?q?Christian_K=C3=B6nig?= To: Alexander.Deucher@amd.com, daniel.vetter@ffwll.ch, dri-devel@lists.freedesktop.org, Shaoyun.Liu@amd.com Subject: [PATCH 2/5] drm/amdgpu: stop resubmitting jobs for GPU reset v2 Date: Wed, 9 Nov 2022 10:50:07 +0100 Message-Id: <20221109095010.141189-2-christian.koenig@amd.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20221109095010.141189-1-christian.koenig@amd.com> References: <20221109095010.141189-1-christian.koenig@amd.com> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: =?utf-8?q?Christian_K=C3=B6nig?= Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" Re-submitting IBs by the kernel has many problems because prerequisite state is not automatically re-created as well. In other words, neither binary semaphores nor things like ring buffer pointers are in the state they should be when the hardware starts to work on the IBs again. In addition, even after more than 5 years of developing this feature it is still not stable and we have massive problems getting the reference counts right. 
As discussed with userspace developers, this behavior is not helpful in the first place. For graphics and multimedia workloads it makes much more sense to either completely re-create the context or at least re-submit the IBs from userspace. For compute use cases re-submitting is also not very helpful since userspace must rely on the accuracy of the result. Because of this we stop this practice and instead just properly note that the fence submission was canceled. The only use cases for which we keep the re-submission for now are SRIOV and function level resets. v2: as suggested by Shaoyun, stop resubmitting jobs even for SRIOV Signed-off-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 0da55fd97df8..3a51c4c61833 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -5294,11 +5294,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, if (!ring || !ring->sched.thread) continue; - /* No point to resubmit jobs if we didn't HW reset*/ - if (!tmp_adev->asic_reset_res && !job_signaled) - drm_sched_resubmit_jobs(&ring->sched); - - drm_sched_start(&ring->sched, !tmp_adev->asic_reset_res); + drm_sched_start(&ring->sched, true); } if (adev->enable_mes && adev->ip_versions[GC_HWIP][0] != IP_VERSION(11, 0, 3)) From patchwork Wed Nov 9 09:50:08 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Christian_K=C3=B6nig?= X-Patchwork-Id: 13037302 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) 
with ESMTPS id C015AC4332F for ; Wed, 9 Nov 2022 09:50:42 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 8D37210E51E; Wed, 9 Nov 2022 09:50:41 +0000 (UTC) Received: from mail-ej1-x630.google.com (mail-ej1-x630.google.com [IPv6:2a00:1450:4864:20::630]) by gabe.freedesktop.org (Postfix) with ESMTPS id C340D10E12B for ; Wed, 9 Nov 2022 09:50:16 +0000 (UTC) Received: by mail-ej1-x630.google.com with SMTP id bj12so45100677ejb.13 for ; Wed, 09 Nov 2022 01:50:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=BDAM9OYZ6MLhEgMgixpdhoeMFz5qwIi9sbQ0Mdb6c/w=; b=kIFKkV2CPVIB5WWdPmk4P7335+YLMf/QdwPqC2kOgyCG+dczRftHMy/lOrfOY6dKV3 n80gzIHOufPqgEwUVBKXKG40BplrMnEj/jz+6/0VgAKRT8iovn+HViMAmWPCxes9aJo1 ppvVjRX6ItXGWT0q0jS0InjwKpY50qBlrH/m7uQlNt0KFjVfO2zVNBEgbjElHSQJg/nP 2KZtBNpYi62hx08Kg5/cEDNXusiQ0U6Td9Q/xiW7Yw7IMWuDKjg23IQYWu77BRglATE+ a6Vj57cn+INzw2/Fb1WDaWPaKUlcJM71Go1yW9u3KGENnFi6GxfdSsgzjJvyKG6MeO14 6T8g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=BDAM9OYZ6MLhEgMgixpdhoeMFz5qwIi9sbQ0Mdb6c/w=; b=YBcTzmQcAwisJJyeERGzCDEEoB2gW8UqY9a3XUYNT0F1gkIxHyhtLpS1GdQL8yOGbp 9f4BvWd2bfHmKdV5ejASZvIOxahBt7vZjFVTH0oj30BQq+pkg547eIGwc6Xs8NMo13+O g/ipBsBC6o05m8AqVvKXOT6EAeBoFZxKKFnxDiQWiZpctSdxkGD+J46fWRp7Xb5h415u uO76nBYG3WCELsr0lGIg8bwEkZxn2i65Y1DZG31lSob/Jswep+hvJsJDSG8z/GUTAfkd rGL1LTx7heqDvslwl6fkkHFoLJaNOsAY+r49nT2UoMpkKAuwdtO2T9AtT/ZfX6AAn24p 9a6Q== X-Gm-Message-State: ANoB5pnLjS84B8gjjkc8s98lXR8nOwciAwZOFkZS2UxbhcFPLwv9w0jV J0p7TxXcYJMn5RzSpytoOkc= X-Google-Smtp-Source: 
AA0mqf7M+fK1btSP5oIt562DYqL159DKWAFqMCo0IZjgYzZACdzfLFudYyhQcaeEBAHkgWx7sxdDTA== X-Received: by 2002:a17:907:3f28:b0:7ad:88f8:7644 with SMTP id hq40-20020a1709073f2800b007ad88f87644mr6566507ejc.738.1667987415171; Wed, 09 Nov 2022 01:50:15 -0800 (PST) Received: from able.fritz.box (p5b0ea229.dip0.t-ipconnect.de. [91.14.162.41]) by smtp.gmail.com with ESMTPSA id s12-20020a1709062ecc00b00780f24b797dsm5604543eji.108.2022.11.09.01.50.14 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 09 Nov 2022 01:50:14 -0800 (PST) From: " =?utf-8?q?Christian_K=C3=B6nig?= " X-Google-Original-From: =?utf-8?q?Christian_K=C3=B6nig?= To: Alexander.Deucher@amd.com, daniel.vetter@ffwll.ch, dri-devel@lists.freedesktop.org, Shaoyun.Liu@amd.com Subject: [PATCH 3/5] drm/amdgpu: stop resubmittting jobs in amdgpu_pci_resume Date: Wed, 9 Nov 2022 10:50:08 +0100 Message-Id: <20221109095010.141189-3-christian.koenig@amd.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20221109095010.141189-1-christian.koenig@amd.com> References: <20221109095010.141189-1-christian.koenig@amd.com> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: =?utf-8?q?Christian_K=C3=B6nig?= Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" As far as I can see this is not really recoverable since a PCI reset clears VRAM. 
Signed-off-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 3a51c4c61833..8564d4a8e3e8 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -5747,8 +5747,6 @@ void amdgpu_pci_resume(struct pci_dev *pdev) if (!ring || !ring->sched.thread) continue; - - drm_sched_resubmit_jobs(&ring->sched); drm_sched_start(&ring->sched, true); } From patchwork Wed Nov 9 09:50:09 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Christian_K=C3=B6nig?= X-Patchwork-Id: 13037299 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id EA101C4332F for ; Wed, 9 Nov 2022 09:50:25 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 2597A10E13A; Wed, 9 Nov 2022 09:50:24 +0000 (UTC) Received: from mail-ej1-x635.google.com (mail-ej1-x635.google.com [IPv6:2a00:1450:4864:20::635]) by gabe.freedesktop.org (Postfix) with ESMTPS id C153D10E12B for ; Wed, 9 Nov 2022 09:50:17 +0000 (UTC) Received: by mail-ej1-x635.google.com with SMTP id kt23so45146417ejc.7 for ; Wed, 09 Nov 2022 01:50:17 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=2YlHN1nH3duYo4rVIJwXoiPtGWnH1xvxqUpFQTi0sQE=; b=KerWE3PGAYde6ZWWKbXGu/arezDh+40mnKhFfwQQNeeJrewBdlMOt5bTJil3zY0uMZ 
2edS4r9NKASWtFdgkf2lEDx5/qcRElnk68yOSgPoiIuMfdPbAG2Wy85zclPGHnYr0Gdk mH+dRltXh1TPKRI6WciiR0oVWH17NW3N7alS83n5ihPpWsR14g8C1vku+dQTeMlf4IYL WKX+g2luXqoDnzM+Vx2aeeDcRoW+ppIp0bZYyrnkgKEZZlOwm8mO4Dx5C5NLMr81SpNq vUJic9rfSdP5w/1l3qqG0OYqM4lLISt5DFduI0QxJZNuvGzUreinMttJKDwDu8egXmSp cXKw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=2YlHN1nH3duYo4rVIJwXoiPtGWnH1xvxqUpFQTi0sQE=; b=GwBR/u86goHKoe/+tjrBrz0scJNqTbGPfCz0v87VJ/pdlHesUV0rsQgyEpJ2QTlK5O A9Jb71bbbq8zxy0p8zgSsZ+WaN9KKqFpuV9jSRP9ejZ9bmF4wwTYFlVyhaeCHOXwZKSw KovMeSL70d+uP99m5IcIfqZTv76T7Uw28WDPcRo3OYMRlhVizazX/B+akKCNqxNDmsJ3 sBGqqI738araFbAliqMWX5LHAmkmiplSoiOBkdoRY1iJm1AnIEC71BRgzXum8768IQBK NOG4ZhM0lFHp57vmi3XZ8Zw6iP1rQDMc8FoMinrBxsh6wMgNJV8IsIwE/ReSvjzMAc+f /0sw== X-Gm-Message-State: ACrzQf1cn3iQ5TqNKYI4BVIr2sdWhwXTtOxGIzG6TGZ0Kod9jOQBXurQ UbkcEZr0DYN06j1PpLupCDA= X-Google-Smtp-Source: AMsMyM6hue9dk2unfbic3n5A31AycSl25HLMqFYCsUzDMcha+w9+Kh0GIqXIy7S12AV5QIFo38eYUA== X-Received: by 2002:a17:907:6e9e:b0:78c:5533:4158 with SMTP id sh30-20020a1709076e9e00b0078c55334158mr53190964ejc.417.1667987416194; Wed, 09 Nov 2022 01:50:16 -0800 (PST) Received: from able.fritz.box (p5b0ea229.dip0.t-ipconnect.de. 
[91.14.162.41]) by smtp.gmail.com with ESMTPSA id s12-20020a1709062ecc00b00780f24b797dsm5604543eji.108.2022.11.09.01.50.15 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 09 Nov 2022 01:50:15 -0800 (PST) From: " =?utf-8?q?Christian_K=C3=B6nig?= " X-Google-Original-From: =?utf-8?q?Christian_K=C3=B6nig?= To: Alexander.Deucher@amd.com, daniel.vetter@ffwll.ch, dri-devel@lists.freedesktop.org, Shaoyun.Liu@amd.com Subject: [PATCH 4/5] drm/scheduler: cleanup define Date: Wed, 9 Nov 2022 10:50:09 +0100 Message-Id: <20221109095010.141189-4-christian.koenig@amd.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20221109095010.141189-1-christian.koenig@amd.com> References: <20221109095010.141189-1-christian.koenig@amd.com> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: =?utf-8?q?Christian_K=C3=B6nig?= Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" Remove the declaration of drm_sched_job_kickout(), a function that was never implemented. Signed-off-by: Christian König --- include/drm/gpu_scheduler.h | 1 - 1 file changed, 1 deletion(-) diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h index 156601fd7053..73a2327d6b00 100644 --- a/include/drm/gpu_scheduler.h +++ b/include/drm/gpu_scheduler.h @@ -501,7 +501,6 @@ void drm_sched_increase_karma(struct drm_sched_job *bad); bool drm_sched_dependency_optimized(struct dma_fence* fence, struct drm_sched_entity *entity); void drm_sched_fault(struct drm_gpu_scheduler *sched); -void drm_sched_job_kickout(struct drm_sched_job *s_job); void drm_sched_rq_add_entity(struct drm_sched_rq *rq, struct drm_sched_entity *entity); From patchwork Wed Nov 9 09:50:10 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Christian_K=C3=B6nig?= X-Patchwork-Id: 13037301 
Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 0E974C4332F for ; Wed, 9 Nov 2022 09:50:39 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id E8A7F10E51B; Wed, 9 Nov 2022 09:50:37 +0000 (UTC) Received: from mail-ej1-x62c.google.com (mail-ej1-x62c.google.com [IPv6:2a00:1450:4864:20::62c]) by gabe.freedesktop.org (Postfix) with ESMTPS id 1B42810E13A for ; Wed, 9 Nov 2022 09:50:19 +0000 (UTC) Received: by mail-ej1-x62c.google.com with SMTP id f27so45220189eje.1 for ; Wed, 09 Nov 2022 01:50:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=lJBFTRcNwwbuIBVh7uk8D1dXIHh8Bq+X8ggGiiL72aQ=; b=MeCdhokzMXMvFhVqJDSGh4c9x7xSboG9rtPcva94GfBrWYDdKNP/EWAsD4WAH0fsc4 lpjm7lx75MmeHsQLplXu1UOEfTu6rM8oz51h8v5pNqlkaxBL6wX6lONcNKFKa+vkxD3W EJeKSK8eDz86pBOqLiDLnVvU5tcYHeHpWJ5Zz2OuRAqiD5QGjk6KzM4PPlLiibYVLDGq VPJ5j6TZcU0eA9+XWFEny2j42AhzirDYeU/KI9WjAPGec+QW+jVFDnO73XjyioT/iT6t O4fhNcwrDPXs4Z+8j/U0a+rmAKmo89OrtpeKns0Gmq2IKuuebS4JE3yEk+3cBHEWNS2a +tWA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=lJBFTRcNwwbuIBVh7uk8D1dXIHh8Bq+X8ggGiiL72aQ=; b=awMk4THLepQCwWWtyxYWRp7H9oV/cNY61HcnsqOtaMr0NxEryGyTEE1tWpAGdpVdTy PSdGXSuMv79tH9/AYNKcHTMkK68LXJtqSg5gOY/F7xfL4QwliRUy2cImlC05RBhicBGB 
pzs7fSSnb/w6hafVGeFOB4JbB0Ri2LKLXrTs2a3n3QpvSIRJVpM8vkEdrqH/mcAbhZqU rLGZ1Tq0F7u8vfY9uYovdGY3Uf+uPbqdNCRNFWiqMvX1UDdEaN0ww6Z0aEzM5MUbOVIm J21k3aIDw5vf8hRB/ybyt3CyGtbo+napfs+T1/BIuW3oxCMAAS+s87kx9n99PAkFH6Za UbmQ== X-Gm-Message-State: ACrzQf2x0/pq+ol8RDUYy9Ool/P4tvUe9bfSfaHrRB4PvaSNQfJI0CWA CUVnzv9UEAVSUouKkupdUx4= X-Google-Smtp-Source: AMsMyM6bn+JDxdsvdz/Pguutl1+cyXvFPG0VoIq4nTtQ5F0GF7+rZKKfcQuCsNccmMmVzgis47TkEA== X-Received: by 2002:a17:907:77d5:b0:73f:40a9:62ff with SMTP id kz21-20020a17090777d500b0073f40a962ffmr1110246ejc.678.1667987417433; Wed, 09 Nov 2022 01:50:17 -0800 (PST) Received: from able.fritz.box (p5b0ea229.dip0.t-ipconnect.de. [91.14.162.41]) by smtp.gmail.com with ESMTPSA id s12-20020a1709062ecc00b00780f24b797dsm5604543eji.108.2022.11.09.01.50.16 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 09 Nov 2022 01:50:16 -0800 (PST) From: " =?utf-8?q?Christian_K=C3=B6nig?= " X-Google-Original-From: =?utf-8?q?Christian_K=C3=B6nig?= To: Alexander.Deucher@amd.com, daniel.vetter@ffwll.ch, dri-devel@lists.freedesktop.org, Shaoyun.Liu@amd.com Subject: [PATCH 5/5] drm/scheduler: deprecate drm_sched_resubmit_jobs Date: Wed, 9 Nov 2022 10:50:10 +0100 Message-Id: <20221109095010.141189-5-christian.koenig@amd.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20221109095010.141189-1-christian.koenig@amd.com> References: <20221109095010.141189-1-christian.koenig@amd.com> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: =?utf-8?q?Christian_K=C3=B6nig?= Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" This interface is not working as it should. 
Signed-off-by: Christian König --- drivers/gpu/drm/scheduler/sched_main.c | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index e77e1fd16732..9eacce8aae3f 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -489,10 +489,21 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery) EXPORT_SYMBOL(drm_sched_start); /** - * drm_sched_resubmit_jobs - helper to relaunch jobs from the pending list + * drm_sched_resubmit_jobs - Deprecated, don't use in new code! * * @sched: scheduler instance * + * Re-submitting jobs was a concept AMD came up with as a cheap way to implement + * recovery after a job timeout. + * + * This turned out not to work very well. First of all there are many + * problems with the dma_fence implementation and requirements. Either the + * implementation is risking deadlocks with core memory management or violating + * documented implementation details of the dma_fence object. + * + * Drivers can still save and restore their state for recovery operations, but + * we shouldn't make this a general scheduler feature around the dma_fence + * interface. */ void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched) {