From patchwork Wed Oct 26 15:35:53 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Christian_K=C3=B6nig?= X-Patchwork-Id: 13020817 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id ADD9FC38A2D for ; Wed, 26 Oct 2022 15:36:59 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 1CD9410E51C; Wed, 26 Oct 2022 15:36:53 +0000 (UTC) Received: from mail-ed1-x52d.google.com (mail-ed1-x52d.google.com [IPv6:2a00:1450:4864:20::52d]) by gabe.freedesktop.org (Postfix) with ESMTPS id E8BBC10E33F; Wed, 26 Oct 2022 15:36:00 +0000 (UTC) Received: by mail-ed1-x52d.google.com with SMTP id y12so21874517edc.9; Wed, 26 Oct 2022 08:36:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:from:to:cc:subject:date:message-id:reply-to; bh=c3647PSQ+0vwPlvlNxC9ZlJ7mFZIN+KyoVoHuNp8XOg=; b=BGgrYVNA6coWr0IKlzXepSLnk18pQh0FdKE0cHQvEEVjd4O74y3rtNnZNx6DDzLJZa e7wSBQW9ZTCjnnkl0KLDlP9KwhIeneLVBAlrFg7RMhwnso/Xq/KIle2Gq6iXxUZgfEVV irZjsDDkGNWbMKOw/f/PT2NLmVoGFjxZjwgoecSL2B/K9V2NO6ohGk27mdmYws1oZaLJ WPkxeBSE+OQtdLZqwYCHtHfR+KdaWkNd49QJElORuFXE9ZrXxth0wOP/Fv+iBNldJlr+ xLc4YL6XfLZPItmzAxXJiLPG/5DN56DhJ2rqo3sBgQtBtkJrQqdUzTxX31PtB8QMdbpA NN+g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=c3647PSQ+0vwPlvlNxC9ZlJ7mFZIN+KyoVoHuNp8XOg=; 
b=rXOBc5SvaXTwWdpvUzpv4gT8wMfYdvRoH9s7cXWN7DmAeoRQIguDon4//HDNMXE9QU 6Lqt4kIEljYk5+CeTYOZf5BwwAbjGuCyLCdDkt9beRgfASnsIF1gFdXtubaiBEV0mPms Nb1QWg/xBv3RKvv397eTL6YOvF3y9S6tJOlJ9ydy9I4p7yNBG3IIsjM8OlDv9N146xfZ 8fboD+fr0Lu729jRWR1hUBeexztwOKo9/SC4pZVXL4/ayxh/2mqY+rCQ2FxIF2dL3Jui Z82ssuzY7WIVYrgbB+4Dh0cp6FGvAclGMfJoYB6co18wmIUPDh/murD81J6P5HYAiLXI 5Azw== X-Gm-Message-State: ACrzQf0n970V9KRMRKLDVv5u/Y4fKeZjVzBMNJpHjiB/UT4kSVX6RWEu p/Hha7sOwoqIC1TbCOQE9kI= X-Google-Smtp-Source: AMsMyM52wHXFChyQUVxiCpzsXZQduYcsYgpf42c08n61jwsEJOFl/2s2mAShLlXJTmC9/e3BXbqs/A== X-Received: by 2002:a05:6402:4006:b0:461:f4e2:82e0 with SMTP id d6-20020a056402400600b00461f4e282e0mr12073319eda.68.1666798559293; Wed, 26 Oct 2022 08:35:59 -0700 (PDT) Received: from able.fritz.box (p5b0ea229.dip0.t-ipconnect.de. [91.14.162.41]) by smtp.gmail.com with ESMTPSA id la3-20020a170907780300b007abafe43c3bsm3066715ejc.86.2022.10.26.08.35.57 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 26 Oct 2022 08:35:58 -0700 (PDT) From: " =?utf-8?q?Christian_K=C3=B6nig?= " X-Google-Original-From: =?utf-8?q?Christian_K=C3=B6nig?= To: luben.tuikov@amd.com, vprosyak@amd.com, Alexander.Deucher@amd.com, daniel.vetter@ffwll.ch, amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org Subject: [PATCH 1/5] drm/amd/amdgpu revert "implement tdr advanced mode" Date: Wed, 26 Oct 2022 17:35:53 +0200 Message-Id: <20221026153557.63541-1-christian.koenig@amd.com> X-Mailer: git-send-email 2.25.1 MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: =?utf-8?q?Christian_K=C3=B6nig?= Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" This reverts commit e6c6338f393b74ac0b303d567bb918b44ae7ad75. This feature basically re-submits one job after another to figure out which one was the one causing a hang. 
This is obviously incompatible with gang-submit, which requires that multiple jobs run at the same time. It's also absolutely not helpful to crash the hardware multiple times if a clean recovery is desired.

For testing and debugging environments we should rather disable recovery altogether to be able to inspect the state with a hw debugger.

In addition, the sw implementation is clearly buggy and causes reference count issues for the hardware fence.

Signed-off-by: Christian König
Acked-by: Luben Tuikov
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 103 ---------------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   2 +-
 drivers/gpu/drm/scheduler/sched_main.c     |  58 ++----------
 include/drm/gpu_scheduler.h                |   3 -
 4 files changed, 10 insertions(+), 156 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 6f958603c8cc..d4584e577b51 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5070,94 +5070,6 @@ static int amdgpu_device_suspend_display_audio(struct amdgpu_device *adev)
 return 0; } -static void amdgpu_device_recheck_guilty_jobs( - struct amdgpu_device *adev, struct list_head *device_list_handle, - struct amdgpu_reset_context *reset_context) -{ - int i, r = 0; - - for (i = 0; i < AMDGPU_MAX_RINGS; ++i) { - struct amdgpu_ring *ring = adev->rings[i]; - int ret = 0; - struct drm_sched_job *s_job; - - if (!ring || !ring->sched.thread) - continue; - - s_job = list_first_entry_or_null(&ring->sched.pending_list, - struct drm_sched_job, list); - if (s_job == NULL) - continue; - - /* clear job's guilty and depend the folowing step to decide the real one */ - drm_sched_reset_karma(s_job); - drm_sched_resubmit_jobs_ext(&ring->sched, 1); - - if (!s_job->s_fence->parent) { - DRM_WARN("Failed to get a HW fence for job!"); - continue; - } - - ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout); - if (ret == 0) { /* timeout
*/ - DRM_ERROR("Found the real bad job! ring:%s, job_id:%llx\n", - ring->sched.name, s_job->id); - - - amdgpu_fence_driver_isr_toggle(adev, true); - - /* Clear this failed job from fence array */ - amdgpu_fence_driver_clear_job_fences(ring); - - amdgpu_fence_driver_isr_toggle(adev, false); - - /* Since the job won't signal and we go for - * another resubmit drop this parent pointer - */ - dma_fence_put(s_job->s_fence->parent); - s_job->s_fence->parent = NULL; - - /* set guilty */ - drm_sched_increase_karma(s_job); - amdgpu_reset_prepare_hwcontext(adev, reset_context); -retry: - /* do hw reset */ - if (amdgpu_sriov_vf(adev)) { - amdgpu_virt_fini_data_exchange(adev); - r = amdgpu_device_reset_sriov(adev, false); - if (r) - adev->asic_reset_res = r; - } else { - clear_bit(AMDGPU_SKIP_HW_RESET, - &reset_context->flags); - r = amdgpu_do_asic_reset(device_list_handle, - reset_context); - if (r && r == -EAGAIN) - goto retry; - } - - /* - * add reset counter so that the following - * resubmitted job could flush vmid - */ - atomic_inc(&adev->gpu_reset_counter); - continue; - } - - /* got the hw fence, signal finished fence */ - atomic_dec(ring->sched.score); - dma_fence_get(&s_job->s_fence->finished); - dma_fence_signal(&s_job->s_fence->finished); - dma_fence_put(&s_job->s_fence->finished); - - /* remove node from list and free the job */ - spin_lock(&ring->sched.job_list_lock); - list_del_init(&s_job->list); - spin_unlock(&ring->sched.job_list_lock); - ring->sched.ops->free_job(s_job); - } -} - static inline void amdgpu_device_stop_pending_resets(struct amdgpu_device *adev) { struct amdgpu_ras *con = amdgpu_ras_get_context(adev); @@ -5178,7 +5090,6 @@ static inline void amdgpu_device_stop_pending_resets(struct amdgpu_device *adev) } - /** * amdgpu_device_gpu_recover - reset the asic and recover scheduler * @@ -5201,7 +5112,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, int i, r = 0; bool need_emergency_restart = false; bool audio_suspended = false; - int 
tmp_vram_lost_counter; bool gpu_reset_for_dev_remove = false; gpu_reset_for_dev_remove = @@ -5347,7 +5257,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, amdgpu_device_stop_pending_resets(tmp_adev); } - tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter)); /* Actual ASIC resets if needed.*/ /* Host driver will handle XGMI hive reset for SRIOV */ if (amdgpu_sriov_vf(adev)) { @@ -5372,18 +5281,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, /* Post ASIC reset for all devs .*/ list_for_each_entry(tmp_adev, device_list_handle, reset_list) { - /* - * Sometimes a later bad compute job can block a good gfx job as gfx - * and compute ring share internal GC HW mutually. We add an additional - * guilty jobs recheck step to find the real guilty job, it synchronously - * submits and pends for the first job being signaled. If it gets timeout, - * we identify it as a real guilty job. - */ - if (amdgpu_gpu_recovery == 2 && - !(tmp_vram_lost_counter < atomic_read(&adev->vram_lost_counter))) - amdgpu_device_recheck_guilty_jobs( - tmp_adev, device_list_handle, reset_context); - for (i = 0; i < AMDGPU_MAX_RINGS; ++i) { struct amdgpu_ring *ring = tmp_adev->rings[i]; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c index 16f6a313335e..f177d8e5b665 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c @@ -519,7 +519,7 @@ module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444); * DOC: gpu_recovery (int) * Set to enable GPU recovery mechanism (1 = enable, 0 = disable). The default is -1 (auto, disabled except SRIOV). 
*/ -MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (2 = advanced tdr mode, 1 = enable, 0 = disable, -1 = auto)"); +MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = auto)"); module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444); /** diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index e0ab14e0fb6b..bb28f31bff6f 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -355,27 +355,6 @@ static void drm_sched_job_timedout(struct work_struct *work) } } - /** - * drm_sched_increase_karma - Update sched_entity guilty flag - * - * @bad: The job guilty of time out - * - * Increment on every hang caused by the 'bad' job. If this exceeds the hang - * limit of the scheduler then the respective sched entity is marked guilty and - * jobs from it will not be scheduled further - */ -void drm_sched_increase_karma(struct drm_sched_job *bad) -{ - drm_sched_increase_karma_ext(bad, 1); -} -EXPORT_SYMBOL(drm_sched_increase_karma); - -void drm_sched_reset_karma(struct drm_sched_job *bad) -{ - drm_sched_increase_karma_ext(bad, 0); -} -EXPORT_SYMBOL(drm_sched_reset_karma); - /** * drm_sched_stop - stop the scheduler * @@ -516,32 +495,15 @@ EXPORT_SYMBOL(drm_sched_start); * */ void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched) -{ - drm_sched_resubmit_jobs_ext(sched, INT_MAX); -} -EXPORT_SYMBOL(drm_sched_resubmit_jobs); - -/** - * drm_sched_resubmit_jobs_ext - helper to relunch certain number of jobs from mirror ring list - * - * @sched: scheduler instance - * @max: job numbers to relaunch - * - */ -void drm_sched_resubmit_jobs_ext(struct drm_gpu_scheduler *sched, int max) { struct drm_sched_job *s_job, *tmp; uint64_t guilty_context; bool found_guilty = false; struct dma_fence *fence; - int i = 0; list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) { struct drm_sched_fence *s_fence = s_job->s_fence; - if (i >= 
max) - break; - if (!found_guilty && atomic_read(&s_job->karma) > sched->hang_limit) { found_guilty = true; guilty_context = s_job->s_fence->scheduled.context; @@ -551,7 +513,6 @@ void drm_sched_resubmit_jobs_ext(struct drm_gpu_scheduler *sched, int max) dma_fence_set_error(&s_fence->finished, -ECANCELED); fence = sched->ops->run_job(s_job); - i++; if (IS_ERR_OR_NULL(fence)) { if (IS_ERR(fence)) @@ -567,7 +528,7 @@ void drm_sched_resubmit_jobs_ext(struct drm_gpu_scheduler *sched, int max) } } } -EXPORT_SYMBOL(drm_sched_resubmit_jobs_ext); +EXPORT_SYMBOL(drm_sched_resubmit_jobs); /** * drm_sched_job_init - init a scheduler job @@ -1081,13 +1042,15 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched) EXPORT_SYMBOL(drm_sched_fini); /** - * drm_sched_increase_karma_ext - Update sched_entity guilty flag + * drm_sched_increase_karma - Update sched_entity guilty flag * * @bad: The job guilty of time out - * @type: type for increase/reset karma * + * Increment on every hang caused by the 'bad' job. If this exceeds the hang + * limit of the scheduler then the respective sched entity is marked guilty and + * jobs from it will not be scheduled further */ -void drm_sched_increase_karma_ext(struct drm_sched_job *bad, int type) +void drm_sched_increase_karma(struct drm_sched_job *bad) { int i; struct drm_sched_entity *tmp; @@ -1099,10 +1062,7 @@ void drm_sched_increase_karma_ext(struct drm_sched_job *bad, int type) * corrupt but keep in mind that kernel jobs always considered good. 
*/ if (bad->s_priority != DRM_SCHED_PRIORITY_KERNEL) { - if (type == 0) - atomic_set(&bad->karma, 0); - else if (type == 1) - atomic_inc(&bad->karma); + atomic_inc(&bad->karma); for (i = DRM_SCHED_PRIORITY_MIN; i < DRM_SCHED_PRIORITY_KERNEL; i++) {
@@ -1113,7 +1073,7 @@ void drm_sched_increase_karma_ext(struct drm_sched_job *bad, int type)
 if (bad->s_fence->scheduled.context == entity->fence_context) { if (entity->guilty) - atomic_set(entity->guilty, type); + atomic_set(entity->guilty, 1); break; } }
@@ -1123,4 +1083,4 @@ void drm_sched_increase_karma_ext(struct drm_sched_job *bad, int type)
 } } } -EXPORT_SYMBOL(drm_sched_increase_karma_ext); +EXPORT_SYMBOL(drm_sched_increase_karma);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 0fca8f38bee4..c564be29c5ae 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -488,10 +488,7 @@ void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
 void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad); void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery); void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched); -void drm_sched_resubmit_jobs_ext(struct drm_gpu_scheduler *sched, int max); void drm_sched_increase_karma(struct drm_sched_job *bad); -void drm_sched_reset_karma(struct drm_sched_job *bad); -void drm_sched_increase_karma_ext(struct drm_sched_job *bad, int type); bool drm_sched_dependency_optimized(struct dma_fence* fence, struct drm_sched_entity *entity); void drm_sched_fault(struct drm_gpu_scheduler *sched);

From patchwork Wed Oct 26 15:35:54 2022
X-Patchwork-Submitter: Christian König
X-Patchwork-Id: 13020818
From: Christian König
To: luben.tuikov@amd.com, vprosyak@amd.com, Alexander.Deucher@amd.com, daniel.vetter@ffwll.ch, amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Subject: [PATCH 2/5] drm/amdgpu: stop resubmitting jobs for bare metal reset
Date: Wed, 26 Oct 2022 17:35:54 +0200
Message-Id: <20221026153557.63541-2-christian.koenig@amd.com>
In-Reply-To: <20221026153557.63541-1-christian.koenig@amd.com>
References: <20221026153557.63541-1-christian.koenig@amd.com>
Cc: Christian König

Re-submitting IBs by the kernel has many problems because prerequisite state is not automatically re-created as well. In other words, neither binary semaphores nor things like ring buffer pointers are in the state they should be in when the hardware starts to work on the IBs again.
In addition, even after more than 5 years of developing this feature it is still not stable and we have massive problems getting the reference counts right.

As discussed with user space developers, this behavior is not helpful in the first place. For graphics and multimedia workloads it makes much more sense to either completely re-create the context or at least re-submit the IBs from userspace.

For compute use cases re-submitting is also not very helpful since userspace must rely on the accuracy of the result.

Because of this we stop this practice and instead just properly note that the fence submission was canceled. The only use case for which we keep the re-submission for now is SRIOV and function level resets.

Signed-off-by: Christian König
Signed-off-by: Christian König
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index d4584e577b51..39e94feba1ac 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5288,7 +5288,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 continue; /* No point to resubmit jobs if we didn't HW reset*/ - if (!tmp_adev->asic_reset_res && !job_signaled) + if (!tmp_adev->asic_reset_res && !job_signaled && + amdgpu_sriov_vf(tmp_adev)) drm_sched_resubmit_jobs(&ring->sched); drm_sched_start(&ring->sched, !tmp_adev->asic_reset_res);

From patchwork Wed Oct 26 15:35:55 2022
X-Patchwork-Submitter: Christian König
X-Patchwork-Id: 13020822
From: Christian König
To: luben.tuikov@amd.com, vprosyak@amd.com, Alexander.Deucher@amd.com, daniel.vetter@ffwll.ch, amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Subject: [PATCH 3/5] drm/amdgpu: stop resubmitting jobs in amdgpu_pci_resume
Date: Wed, 26 Oct 2022 17:35:55 +0200
Message-Id: <20221026153557.63541-3-christian.koenig@amd.com>
In-Reply-To: <20221026153557.63541-1-christian.koenig@amd.com>
References: <20221026153557.63541-1-christian.koenig@amd.com>
Cc: Christian König

As far as I can see this is not really recoverable since a PCI reset clears VRAM.
Signed-off-by: Christian König
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 39e94feba1ac..b1827e804363 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5745,8 +5745,6 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
 if (!ring || !ring->sched.thread) continue; - - drm_sched_resubmit_jobs(&ring->sched); drm_sched_start(&ring->sched, true); }

From patchwork Wed Oct 26 15:35:56 2022
X-Patchwork-Submitter: Christian König
X-Patchwork-Id: 13020815
From: Christian König
To: luben.tuikov@amd.com, vprosyak@amd.com, Alexander.Deucher@amd.com, daniel.vetter@ffwll.ch, amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Subject: [PATCH 4/5] drm/scheduler: cleanup define
Date: Wed, 26 Oct 2022 17:35:56 +0200
Message-Id: <20221026153557.63541-4-christian.koenig@amd.com>
In-Reply-To: <20221026153557.63541-1-christian.koenig@amd.com>
References: <20221026153557.63541-1-christian.koenig@amd.com>
Cc: Christian König

Remove the declaration of a function that is not implemented.

Signed-off-by: Christian König
---
 include/drm/gpu_scheduler.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index c564be29c5ae..d646ff2fd557 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -492,7 +492,6 @@ void drm_sched_increase_karma(struct drm_sched_job *bad);
 bool drm_sched_dependency_optimized(struct dma_fence* fence, struct drm_sched_entity *entity); void drm_sched_fault(struct drm_gpu_scheduler *sched); -void drm_sched_job_kickout(struct drm_sched_job *s_job); void drm_sched_rq_add_entity(struct drm_sched_rq *rq, struct drm_sched_entity *entity);

From patchwork Wed Oct 26 15:35:57 2022
X-Patchwork-Submitter: Christian König
X-Patchwork-Id: 13020819
From: Christian König
To: luben.tuikov@amd.com, vprosyak@amd.com, Alexander.Deucher@amd.com, daniel.vetter@ffwll.ch, amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Subject: [PATCH 5/5] drm/scheduler: deprecate drm_sched_resubmit_jobs
Date: Wed, 26 Oct 2022 17:35:57 +0200
Message-Id: <20221026153557.63541-5-christian.koenig@amd.com>
In-Reply-To: <20221026153557.63541-1-christian.koenig@amd.com>
References: <20221026153557.63541-1-christian.koenig@amd.com>
Cc: Christian König

This interface is not working as it should.
Signed-off-by: Christian König
---
 drivers/gpu/drm/scheduler/sched_main.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index bb28f31bff6f..ecd4afab4adb 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -489,10 +489,21 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
 EXPORT_SYMBOL(drm_sched_start); /** - * drm_sched_resubmit_jobs - helper to relaunch jobs from the pending list + * drm_sched_resubmit_jobs - Deprecated, don't use in new code! * * @sched: scheduler instance * + * Re-submitting jobs was a concept AMD came up with as a cheap way to implement + * recovery after a job timeout. + * + * This turned out not to work very well. First of all, there are many + * problems with the dma_fence implementation and requirements. Either the + * implementation is risking deadlocks with core memory management or violating + * documented implementation details of the dma_fence object. + * + * Drivers can still save and restore their state for recovery operations, but + * we shouldn't make this a general scheduler feature around the dma_fence + * interface. */ void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched) {
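
For illustration only, the recovery decision that patch 2/5 changes can be modelled in plain userspace C. This is a minimal sketch and not kernel code: struct model_reset_state, struct model_job, enum model_action, model_pick_action() and model_cancel_jobs() are hypothetical names made up for this note. Only the three inputs (asic_reset_res, job_signaled and the SR-IOV check) are taken from the hunk in patch 2/5, and recording the cancellation as -ECANCELED is an assumption based on that patch's commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; these are not kernel types. */
struct model_reset_state {
	int asic_reset_res;	/* 0 if the ASIC reset succeeded */
	bool job_signaled;	/* timed-out job completed after all */
	bool is_sriov_vf;	/* stands in for amdgpu_sriov_vf() */
};

struct model_job {
	int fence_error;	/* 0 or a negative errno value */
};

enum model_action {
	MODEL_LEAVE,	/* no successful HW reset; not modelled further */
	MODEL_RESUBMIT,	/* SR-IOV: pending jobs are still re-submitted */
	MODEL_CANCEL,	/* bare metal: only note that jobs were canceled */
};

/* Mirrors the condition touched by patch 2/5. */
static enum model_action model_pick_action(const struct model_reset_state *st)
{
	assert(st != NULL);

	/* "No point to resubmit jobs if we didn't HW reset" */
	if (st->asic_reset_res || st->job_signaled)
		return MODEL_LEAVE;

	return st->is_sriov_vf ? MODEL_RESUBMIT : MODEL_CANCEL;
}

/* Assumption of this sketch: cancellation is recorded as -ECANCELED. */
static size_t model_cancel_jobs(struct model_job *jobs, size_t count)
{
	size_t i;

	for (i = 0; i < count; i++)
		jobs[i].fence_error = -ECANCELED;

	return count;
}
```

In this model a successful reset on bare metal yields MODEL_CANCEL, and the same state with is_sriov_vf set yields MODEL_RESUBMIT. What happens to the jobs in the MODEL_LEAVE case is deliberately left out, since the patches above do not change it.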