
[1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

Message ID 20240918133956.26557-1-christian.koenig@amd.com (mailing list archive)
State New
Series [1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

Commit Message

Christian König Sept. 18, 2024, 1:39 p.m. UTC
Tearing down the scheduler with jobs still on the pending list can
lead to use after free issues. Add a warning if drivers try to
destroy a scheduler which still has work pushed to the HW.

When there are still entities with queued jobs the situation is even
worse, since the dma_fences for those jobs can never signal: we can
only choose between potentially locking up core memory management and
random memory corruption. When drivers really mess it up that badly,
let them run into a BUG_ON().

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)
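
For context, the teardown ordering these new checks assume from drivers
looks roughly like the sketch below. This is illustrative only; the
my_drv names are made up and are not part of the patch.

#include <drm/gpu_scheduler.h>

/* Hypothetical driver state, for illustration only. */
struct my_drv_device {
	struct drm_gpu_scheduler sched;
	struct drm_sched_entity entity;
};

static void my_drv_fini(struct my_drv_device *ddev)
{
	/*
	 * Flush and destroy every entity first so that no jobs remain
	 * queued on an entity (the BUG_ON check in the patch).
	 */
	drm_sched_entity_destroy(&ddev->entity);

	/*
	 * The driver must also make sure that all jobs already pushed to
	 * the HW have signaled (the WARN_ON check on pending_list) before
	 * the scheduler itself is torn down.
	 */
	drm_sched_fini(&ddev->sched);
}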

Comments

Jani Nikula Sept. 18, 2024, 2:41 p.m. UTC | #1
On Wed, 18 Sep 2024, "Christian König" <ckoenig.leichtzumerken@gmail.com> wrote:
> Tearing down the scheduler with jobs still on the pending list can
> lead to use after free issues. Add a warning if drivers try to
> destroy a scheduler which still has work pushed to the HW.
>
> When there are still entities with jobs the situation is even worse
> since the dma_fences for those jobs can never signal we can just
> choose between potentially locking up core memory management and
> random memory corruption. When drivers really mess it up that well
> let them run into a BUG_ON().
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index f093616fe53c..8a46fab5cdc8 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
>  
>  	drm_sched_wqueue_stop(sched);
>  
> +	/*
> +	 * Tearing down the scheduler while there are still unprocessed jobs can
> +	 * lead to use after free issues in the scheduler fence.
> +	 */
> +	WARN_ON(!list_empty(&sched->pending_list));

drm_WARN_ON(sched->dev, ...) would identify the device, which I presume
would be helpful in multi-GPU systems.
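
Something like the following, so the splat identifies which device it
came from (untested sketch, assuming the scheduler carries a pointer
that drm_WARN_ON() accepts):

	drm_WARN_ON(sched->dev, !list_empty(&sched->pending_list));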

BR,
Jani.

> +
>  	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>  		struct drm_sched_rq *rq = sched->sched_rq[i];
>  
>  		spin_lock(&rq->lock);
> -		list_for_each_entry(s_entity, &rq->entities, list)
> +		list_for_each_entry(s_entity, &rq->entities, list) {
> +			/*
> +			 * The justification for this BUG_ON() is that tearing
> +			 * down the scheduler while jobs are pending leaves
> +			 * dma_fences unsignaled. Since we have dependencies
> +			 * from the core memory management to eventually signal
> +			 * dma_fences this can trivially lead to a system wide
> +			 * stop because of a locked up memory management.
> +			 */
> +			BUG_ON(spsc_queue_count(&s_entity->job_queue));
> +
>  			/*
>  			 * Prevents reinsertion and marks job_queue as idle,
>  			 * it will be removed from rq in drm_sched_entity_fini
>  			 * eventually
>  			 */
>  			s_entity->stopped = true;
> +		}
>  		spin_unlock(&rq->lock);
>  		kfree(sched->sched_rq[i]);
>  	}
Christian König Sept. 18, 2024, 2:56 p.m. UTC | #2
Am 18.09.24 um 16:41 schrieb Jani Nikula:
> On Wed, 18 Sep 2024, "Christian König" <ckoenig.leichtzumerken@gmail.com> wrote:
>> Tearing down the scheduler with jobs still on the pending list can
>> lead to use after free issues. Add a warning if drivers try to
>> destroy a scheduler which still has work pushed to the HW.
>>
>> When there are still entities with jobs the situation is even worse
>> since the dma_fences for those jobs can never signal we can just
>> choose between potentially locking up core memory management and
>> random memory corruption. When drivers really mess it up that well
>> let them run into a BUG_ON().
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
>>   1 file changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index f093616fe53c..8a46fab5cdc8 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>   
>>   	drm_sched_wqueue_stop(sched);
>>   
>> +	/*
>> +	 * Tearing down the scheduler while there are still unprocessed jobs can
>> +	 * lead to use after free issues in the scheduler fence.
>> +	 */
>> +	WARN_ON(!list_empty(&sched->pending_list));
> drm_WARN_ON(sched->dev, ...) would identify the device, which I presume
> would be helpful in multi-GPU systems.

Good point, going to fix that.

Regards,
Christian.

>
> BR,
> Jani.
>
>> +
>>   	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>>   		struct drm_sched_rq *rq = sched->sched_rq[i];
>>   
>>   		spin_lock(&rq->lock);
>> -		list_for_each_entry(s_entity, &rq->entities, list)
>> +		list_for_each_entry(s_entity, &rq->entities, list) {
>> +			/*
>> +			 * The justification for this BUG_ON() is that tearing
>> +			 * down the scheduler while jobs are pending leaves
>> +			 * dma_fences unsignaled. Since we have dependencies
>> +			 * from the core memory management to eventually signal
>> +			 * dma_fences this can trivially lead to a system wide
>> +			 * stop because of a locked up memory management.
>> +			 */
>> +			BUG_ON(spsc_queue_count(&s_entity->job_queue));
>> +
>>   			/*
>>   			 * Prevents reinsertion and marks job_queue as idle,
>>   			 * it will be removed from rq in drm_sched_entity_fini
>>   			 * eventually
>>   			 */
>>   			s_entity->stopped = true;
>> +		}
>>   		spin_unlock(&rq->lock);
>>   		kfree(sched->sched_rq[i]);
>>   	}

Patch

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index f093616fe53c..8a46fab5cdc8 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1333,17 +1333,34 @@  void drm_sched_fini(struct drm_gpu_scheduler *sched)
 
 	drm_sched_wqueue_stop(sched);
 
+	/*
+	 * Tearing down the scheduler while there are still unprocessed jobs can
+	 * lead to use after free issues in the scheduler fence.
+	 */
+	WARN_ON(!list_empty(&sched->pending_list));
+
 	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
 		struct drm_sched_rq *rq = sched->sched_rq[i];
 
 		spin_lock(&rq->lock);
-		list_for_each_entry(s_entity, &rq->entities, list)
+		list_for_each_entry(s_entity, &rq->entities, list) {
+			/*
+			 * The justification for this BUG_ON() is that tearing
+			 * down the scheduler while jobs are pending leaves
+			 * dma_fences unsignaled. Since we have dependencies
+			 * from the core memory management to eventually signal
+			 * dma_fences this can trivially lead to a system wide
+			 * stop because of a locked up memory management.
+			 */
+			BUG_ON(spsc_queue_count(&s_entity->job_queue));
+
 			/*
 			 * Prevents reinsertion and marks job_queue as idle,
 			 * it will be removed from rq in drm_sched_entity_fini
 			 * eventually
 			 */
 			s_entity->stopped = true;
+		}
 		spin_unlock(&rq->lock);
 		kfree(sched->sched_rq[i]);
 	}