[1/2] drm/sched: document drm_sched_fini requirements v2

Message ID	20240927142755.103076-2-christian.koenig@amd.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <dri-devel-bounces@lists.freedesktop.org> From: " =?utf-8?q?Christian_K=C3=B6nig?= " <ckoenig.leichtzumerken@gmail.com> To: pstanner@redhat.com, dakr@kernel.org, ltuikov89@gmail.com, simona.vetter@ffwll.ch, dri-devel@lists.freedesktop.org Subject: [PATCH 1/2] drm/sched: document drm_sched_fini requirements v2 Date: Fri, 27 Sep 2024 16:27:54 +0200 Message-Id: <20240927142755.103076-2-christian.koenig@amd.com> In-Reply-To: <20240927142755.103076-1-christian.koenig@amd.com> References: <20240927142755.103076-1-christian.koenig@amd.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: list Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
Series	[1/2] drm/sched: document drm_sched_fini requirements v2 \| expand [1/2] drm/sched: document drm_sched_fini requirements v2 [2/2] drm/sched: clarify the documentation on drm_sched_entity_error

Message ID

20240927142755.103076-2-christian.koenig@amd.com (mailing list archive)

State

New, archived

Headers

From: " =?utf-8?q?Christian_K=C3=B6nig?= " <ckoenig.leichtzumerken@gmail.com>
To: pstanner@redhat.com, dakr@kernel.org, ltuikov89@gmail.com,
 simona.vetter@ffwll.ch, dri-devel@lists.freedesktop.org
Subject: [PATCH 1/2] drm/sched: document drm_sched_fini requirements v2
Date: Fri, 27 Sep 2024 16:27:54 +0200
Message-Id: <20240927142755.103076-2-christian.koenig@amd.com>
In-Reply-To: <20240927142755.103076-1-christian.koenig@amd.com>
References: <20240927142755.103076-1-christian.koenig@amd.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Precedence: list
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>

Series

[1/2] drm/sched: document drm_sched_fini requirements v2 | expand

Commit Message

Christian König Sept. 27, 2024, 2:27 p.m. UTC

Document the necessary steps which needs to be done before calling
drm_sched_fini().

Tearing down the scheduler with jobs still on the pending list can
lead to use after free issues. Add a warning if drivers try to
destroy a scheduler which still has work pushed to the HW.

When there are still entities with jobs the situation is even worse
since the dma_fences for those jobs can never signal we can just
choose between potentially locking up core memory management and
random memory corruption. When drivers really mess it up that well
let them run into a BUG() to prevent further damage.

v2: Use drm_WARN_ON() as suggested by Jani, add documentation what
drivers need to do before calling drm_sched_fini.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 31 ++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index f093616fe53c..5510f04788d1 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1324,7 +1324,16 @@  EXPORT_SYMBOL(drm_sched_init);
  *
  * @sched: scheduler instance
  *
- * Tears down and cleans up the scheduler.
+ * Before calling this function all entities which potentially used this
+ * scheduler instance should be forced idle using drm_sched_entity_flush() and
+ * detached from their scheduler using drm_entity_fini().
+ *
+ * Special care must be taken if drivers allocate scheduler instances
+ * dynamically. Since the dma_fence signaling doesn't guarantee any processing
+ * order of callbacks it is possible that the scheduler is still cleaning up
+ * when fences have already signaled. The easiest way to avoid that is to keep a
+ * reference from the job to the scheduler and tear down the scheduler from a
+ * work item after the last job was cleaned up.
  */
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
@@ -1333,17 +1342,35 @@  void drm_sched_fini(struct drm_gpu_scheduler *sched)
 
 	drm_sched_wqueue_stop(sched);
 
+	/*
+	 * Tearing down the scheduler wile there are still unprocessed jobs can
+	 * lead to use after free issues in the scheduler fence.
+	 */
+	drm_WARN_ON(sched, !list_empty(&sched->pending_list));
+
 	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
 		struct drm_sched_rq *rq = sched->sched_rq[i];
 
 		spin_lock(&rq->lock);
-		list_for_each_entry(s_entity, &rq->entities, list)
+		drm_WARN_ON(sched, !list_empty(&rq->entities));
+		list_for_each_entry(s_entity, &rq->entities, list) {
+			/*
+			 * The justification for this BUG_ON() is that tearing
+			 * down the scheduler while jobs are pending leaves
+			 * dma_fences unsignaled. Since we have dependencies
+			 * from the core memory management to eventually signal
+			 * dma_fences this can lead to a system wide stop
+			 * because of a locked up memory management.
+			 */
+			BUG_ON(spsc_queue_count(&s_entity->job_queue));
+
 			/*
 			 * Prevents reinsertion and marks job_queue as idle,
 			 * it will removed from rq in drm_sched_entity_fini
 			 * eventually
 			 */
 			s_entity->stopped = true;
+		}
 		spin_unlock(&rq->lock);
 		kfree(sched->sched_rq[i]);
 	}

[1/2] drm/sched: document drm_sched_fini requirements v2

Commit Message

Patch