
drm/sched: Add error code parameter to drm_sched_start

Message ID 20240724184340.183606-1-vitaly.prosyak@amd.com (mailing list archive)
State New, archived
Series drm/sched: Add error code parameter to drm_sched_start

Commit Message

Vitaly Prosyak July 24, 2024, 6:43 p.m. UTC
From: Vitaly Prosyak <vitaly.prosyak@amd.com>

The current implementation of drm_sched_start uses a hardcoded -ECANCELED to dispose of a job when
the parent/hw fence is NULL. This results in drm_sched_job_done being called with -ECANCELED for
each job with a NULL parent in the pending list, making it difficult to distinguish which recovery
method was used: a queue reset or a full GPU reset. To improve this, we first try a soft recovery
for timed-out jobs and use the error code -ENODATA. If soft recovery fails, we proceed with a queue
reset, where the error code for the job is still -ENODATA. Finally, for a full GPU reset, we use
the error codes -ECANCELED or -ETIME. This patch adds an error code parameter to drm_sched_start,
allowing us to differentiate between queue reset and GPU reset failures. This enables user mode and
test applications to validate that the requested recovery operation behaved as expected. After a
successful queue reset, the only way to continue normal operation is to call drm_sched_job_done
with the specific error code -ENODATA.
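
For illustration, a minimal sketch of how a driver's timeout handler could pick the error code
under this scheme; all mydrv_* names are placeholders invented for this example and are not part
of this patch or of amdgpu:

#include <linux/errno.h>
#include <drm/gpu_scheduler.h>

/* Placeholder recovery helpers; a real driver provides its own. */
static bool mydrv_soft_recovery(struct drm_sched_job *job);
static bool mydrv_queue_reset(struct drm_sched_job *job);
static void mydrv_full_gpu_reset(struct drm_sched_job *job);

static enum drm_gpu_sched_stat
mydrv_job_timedout(struct drm_sched_job *sched_job)
{
	struct drm_gpu_scheduler *sched = sched_job->sched;

	drm_sched_stop(sched, sched_job);

	if (mydrv_soft_recovery(sched_job) || mydrv_queue_reset(sched_job)) {
		/* Soft recovery or queue reset: dispose of fence-less jobs with
		 * -ENODATA so user space can tell this apart from a GPU reset. */
		drm_sched_start_ex(sched, true, -ENODATA);
	} else {
		/* Full GPU reset: keep the historical -ECANCELED (or -ETIME). */
		mydrv_full_gpu_reset(sched_job);
		drm_sched_start_ex(sched, true, -ECANCELED);
	}

	return DRM_GPU_SCHED_STAT_NOMINAL;
}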

v1: Initial implementation by Jesse utilized amdgpu_device_lock_reset_domain and
    amdgpu_device_unlock_reset_domain to allow user mode to track the queue reset status
    and distinguish between queue reset and GPU reset.
v2: Christian suggested using the error codes -ENODATA for queue reset and -ECANCELED or
    -ETIME for GPU reset, returned to amdgpu_cs_wait_ioctl.
v3: To meet the requirements, we introduce a new function drm_sched_start_ex with an additional
    parameter to set dma_fence_set_error, allowing us to handle the specific error codes
    appropriately and dispose of bad jobs with the selected error code depending on whether
    it was a queue reset or GPU reset.

Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++---
 include/drm/gpu_scheduler.h            |  1 +
 2 files changed, 17 insertions(+), 3 deletions(-)

Comments

Alex Deucher July 24, 2024, 6:55 p.m. UTC | #1
On Wed, Jul 24, 2024 at 2:43 PM <vitaly.prosyak@amd.com> wrote:
>
> From: Vitaly Prosyak <vitaly.prosyak@amd.com>
>
> The current implementation of drm_sched_start uses a hardcoded -ECANCELED to dispose of a job when
> the parent/hw fence is NULL. This results in drm_sched_job_done being called with -ECANCELED for
> each job with a NULL parent in the pending list, making it difficult to distinguish between recovery
> methods, whether a queue reset or a full GPU reset was used. To improve this, we first try a soft
> recovery for timeout jobs and use the error code -ENODATA. If soft recovery fails, we proceed with
> a queue reset, where the error code would still be -ENODATA for the job. Finally, for a full GPU
> reset, we use error codes -ECANCELED or -ETIME. This patch adds an error code parameter to
> drm_sched_start, allowing us to differentiate between queue reset and GPU reset failures.
> This enables user mode and test applications to validate the expected correctness of the requested
> operation. After a successful queue reset, the only way to continue normal operation is to call
> drm_sched_job_done with the specific error code -ENODATA.
>
> v1: Initial implementation by Jesse utilized amdgpu_device_lock_reset_domain and
>     amdgpu_device_unlock_reset_domain to allow user mode to track the queue reset status
>     and distinguish between queue reset and GPU reset.
> v2: Christian suggested using the error codes -ENODATA for queue reset and -ECANCELED or
>     -ETIME for GPU reset, returned to amdgpu_cs_wait_ioctl.
> v3: To meet the requirements, we introduce a new function drm_sched_start_ex with an additional
>     parameter to set dma_fence_set_error, allowing us to handle the specific error codes
>     appropriately and dispose of bad jobs with the selected error code depending on whether
>     it was a queue reset or GPU reset.
>
> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++---
>  include/drm/gpu_scheduler.h            |  1 +
>  2 files changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 7e90c9f95611..5a534772335a 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -671,13 +671,14 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>  EXPORT_SYMBOL(drm_sched_stop);
>
>  /**
> - * drm_sched_start - recover jobs after a reset
> + * drm_sched_start_ex - recover jobs after a reset
>   *
>   * @sched: scheduler instance
>   * @full_recovery: proceed with complete sched restart
> + * @error : err code for set dma_fence_set_error
>   *
>   */
> -void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
> +void drm_sched_start_ex(struct drm_gpu_scheduler *sched, bool full_recovery, int error)

How about calling this drm_sched_start_with_recovery_error() or
similar?  drm_sched_start_ex() is not super clear.

Also, add some information to the function documentation above to
describe when and why you'd want to specify separate error codes.
Something like in your patch description.

Alex
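
One possible shape of the rename and the expanded kernel-doc Alex asks for; the name and wording
below are only an illustrative suggestion, not part of the submitted patch:

/**
 * drm_sched_start_with_recovery_error - recover jobs after a reset
 *
 * @sched: scheduler instance
 * @full_recovery: proceed with complete sched restart
 * @error: error code set via dma_fence_set_error() on every pending job
 *         whose hardware fence is missing. Pass -ENODATA after a soft
 *         recovery or queue reset and -ECANCELED/-ETIME after a full GPU
 *         reset so user space can tell the recovery methods apart.
 */
void drm_sched_start_with_recovery_error(struct drm_gpu_scheduler *sched,
					 bool full_recovery, int error);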

>  {
>         struct drm_sched_job *s_job, *tmp;
>         int r;
> @@ -704,7 +705,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>                                 DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n",
>                                           r);
>                 } else
> -                       drm_sched_job_done(s_job, -ECANCELED);
> +                       drm_sched_job_done(s_job, error);
>         }
>
>         if (full_recovery)
> @@ -712,6 +713,18 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>
>         drm_sched_wqueue_start(sched);
>  }
> +EXPORT_SYMBOL(drm_sched_start_ex);
> +/**
> + * drm_sched_start - recover jobs after a reset
> + *
> + * @sched: scheduler instance
> + * @full_recovery: proceed with complete sched restart
> + *
> + */
> +void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
> +{
> +       drm_sched_start_ex(sched, full_recovery, -ECANCELED);
> +}
>  EXPORT_SYMBOL(drm_sched_start);
>
>  /**
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 5acc64954a88..444fa6761590 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -580,6 +580,7 @@ void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
>  void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
>  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
>  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
> +void drm_sched_start_ex(struct drm_gpu_scheduler *sched, bool full_recovery, int error);
>  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
>  void drm_sched_increase_karma(struct drm_sched_job *bad);
>  void drm_sched_reset_karma(struct drm_sched_job *bad);
> --
> 2.25.1
>
Christian König July 25, 2024, 7:37 a.m. UTC | #2
Am 24.07.24 um 20:43 schrieb vitaly.prosyak@amd.com:
> From: Vitaly Prosyak <vitaly.prosyak@amd.com>
>
> The current implementation of drm_sched_start uses a hardcoded -ECANCELED to dispose of a job when
> the parent/hw fence is NULL. This results in drm_sched_job_done being called with -ECANCELED for
> each job with a NULL parent in the pending list, making it difficult to distinguish between recovery
> methods, whether a queue reset or a full GPU reset was used. To improve this, we first try a soft
> recovery for timeout jobs and use the error code -ENODATA. If soft recovery fails, we proceed with
> a queue reset, where the error code would still be -ENODATA for the job. Finally, for a full GPU
> reset, we use error codes -ECANCELED or -ETIME. This patch adds an error code parameter to
> drm_sched_start, allowing us to differentiate between queue reset and GPU reset failures.
> This enables user mode and test applications to validate the expected correctness of the requested
> operation. After a successful queue reset, the only way to continue normal operation is to call
> drm_sched_job_done with the specific error code -ENODATA.
>
> v1: Initial implementation by Jesse utilized amdgpu_device_lock_reset_domain and
>      amdgpu_device_unlock_reset_domain to allow user mode to track the queue reset status
>      and distinguish between queue reset and GPU reset.
> v2: Christian suggested using the error codes -ENODATA for queue reset and -ECANCELED or
>      -ETIME for GPU reset, returned to amdgpu_cs_wait_ioctl.
> v3: To meet the requirements, we introduce a new function drm_sched_start_ex with an additional
>      parameter to set dma_fence_set_error, allowing us to handle the specific error codes
>      appropriately and dispose of bad jobs with the selected error code depending on whether
>      it was a queue reset or GPU reset.

I've already sent out a patch to remove the full_recovery parameter from 
the function and have another one in the pipeline to add the errno as an 
optional parameter.

I'm currently just waiting for the feedback on those patches.

Regards,
Christian.
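
Purely as a speculative illustration of the direction described here (an assumption on this
editor's part; Christian's actual patches may look different), the interface could collapse into a
single entry point where the error code replaces the full_recovery flag:

void drm_sched_start(struct drm_gpu_scheduler *sched, int errno);

/* A caller would then pass the reason directly, e.g.: */
static void example_restart(struct drm_gpu_scheduler *sched, bool gpu_reset)
{
	drm_sched_start(sched, gpu_reset ? -ECANCELED : -ENODATA);
}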

>
> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
> ---
>   drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++---
>   include/drm/gpu_scheduler.h            |  1 +
>   2 files changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 7e90c9f95611..5a534772335a 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -671,13 +671,14 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>   EXPORT_SYMBOL(drm_sched_stop);
>   
>   /**
> - * drm_sched_start - recover jobs after a reset
> + * drm_sched_start_ex - recover jobs after a reset
>    *
>    * @sched: scheduler instance
>    * @full_recovery: proceed with complete sched restart
> + * @error : err code for set dma_fence_set_error
>    *
>    */
> -void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
> +void drm_sched_start_ex(struct drm_gpu_scheduler *sched, bool full_recovery, int error)
>   {
>   	struct drm_sched_job *s_job, *tmp;
>   	int r;
> @@ -704,7 +705,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>   				DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n",
>   					  r);
>   		} else
> -			drm_sched_job_done(s_job, -ECANCELED);
> +			drm_sched_job_done(s_job, error);
>   	}
>   
>   	if (full_recovery)
> @@ -712,6 +713,18 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>   
>   	drm_sched_wqueue_start(sched);
>   }
> +EXPORT_SYMBOL(drm_sched_start_ex);
> +/**
> + * drm_sched_start - recover jobs after a reset
> + *
> + * @sched: scheduler instance
> + * @full_recovery: proceed with complete sched restart
> + *
> + */
> +void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
> +{
> +	drm_sched_start_ex(sched, full_recovery, -ECANCELED);
> +}
>   EXPORT_SYMBOL(drm_sched_start);
>   
>   /**
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 5acc64954a88..444fa6761590 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -580,6 +580,7 @@ void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
>   void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
>   void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
>   void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
> +void drm_sched_start_ex(struct drm_gpu_scheduler *sched, bool full_recovery, int error);
>   void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
>   void drm_sched_increase_karma(struct drm_sched_job *bad);
>   void drm_sched_reset_karma(struct drm_sched_job *bad);
vitaly prosyak July 25, 2024, 9:57 a.m. UTC | #3
On 2024-07-25 03:37, Christian König wrote:
> Am 24.07.24 um 20:43 schrieb vitaly.prosyak@amd.com:
>> From: Vitaly Prosyak <vitaly.prosyak@amd.com>
>>
>> The current implementation of drm_sched_start uses a hardcoded -ECANCELED to dispose of a job when
>> the parent/hw fence is NULL. This results in drm_sched_job_done being called with -ECANCELED for
>> each job with a NULL parent in the pending list, making it difficult to distinguish between recovery
>> methods, whether a queue reset or a full GPU reset was used. To improve this, we first try a soft
>> recovery for timeout jobs and use the error code -ENODATA. If soft recovery fails, we proceed with
>> a queue reset, where the error code would still be -ENODATA for the job. Finally, for a full GPU
>> reset, we use error codes -ECANCELED or -ETIME. This patch adds an error code parameter to
>> drm_sched_start, allowing us to differentiate between queue reset and GPU reset failures.
>> This enables user mode and test applications to validate the expected correctness of the requested
>> operation. After a successful queue reset, the only way to continue normal operation is to call
>> drm_sched_job_done with the specific error code -ENODATA.
>>
>> v1: Initial implementation by Jesse utilized amdgpu_device_lock_reset_domain and
>>      amdgpu_device_unlock_reset_domain to allow user mode to track the queue reset status
>>      and distinguish between queue reset and GPU reset.
>> v2: Christian suggested using the error codes -ENODATA for queue reset and -ECANCELED or
>>      -ETIME for GPU reset, returned to amdgpu_cs_wait_ioctl.
>> v3: To meet the requirements, we introduce a new function drm_sched_start_ex with an additional
>>      parameter to set dma_fence_set_error, allowing us to handle the specific error codes
>>      appropriately and dispose of bad jobs with the selected error code depending on whether
>>      it was a queue reset or GPU reset.
> I've already send out a patch to remove the full reset parameter from 
> the function and have another one in the pipeline to add the errno as 
> optional parameter.
>
> I'm currently just waiting for the feedback on those patches.
Thank you for informing me. Given this, I will not pursue this change further and will await the outcome of your patches.
>
> Regards,
> Christian.
Thanks, Vitaly
>
>> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
>> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++---
>>   include/drm/gpu_scheduler.h            |  1 +
>>   2 files changed, 17 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index 7e90c9f95611..5a534772335a 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -671,13 +671,14 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>   EXPORT_SYMBOL(drm_sched_stop);
>>   
>>   /**
>> - * drm_sched_start - recover jobs after a reset
>> + * drm_sched_start_ex - recover jobs after a reset
>>    *
>>    * @sched: scheduler instance
>>    * @full_recovery: proceed with complete sched restart
>> + * @error : err code for set dma_fence_set_error
>>    *
>>    */
>> -void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>> +void drm_sched_start_ex(struct drm_gpu_scheduler *sched, bool full_recovery, int error)
>>   {
>>   	struct drm_sched_job *s_job, *tmp;
>>   	int r;
>> @@ -704,7 +705,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>   				DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n",
>>   					  r);
>>   		} else
>> -			drm_sched_job_done(s_job, -ECANCELED);
>> +			drm_sched_job_done(s_job, error);
>>   	}
>>   
>>   	if (full_recovery)
>> @@ -712,6 +713,18 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>   
>>   	drm_sched_wqueue_start(sched);
>>   }
>> +EXPORT_SYMBOL(drm_sched_start_ex);
>> +/**
>> + * drm_sched_start - recover jobs after a reset
>> + *
>> + * @sched: scheduler instance
>> + * @full_recovery: proceed with complete sched restart
>> + *
>> + */
>> +void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>> +{
>> +	drm_sched_start_ex(sched, full_recovery, -ECANCELED);
>> +}
>>   EXPORT_SYMBOL(drm_sched_start);
>>   
>>   /**
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index 5acc64954a88..444fa6761590 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -580,6 +580,7 @@ void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
>>   void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
>>   void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
>>   void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
>> +void drm_sched_start_ex(struct drm_gpu_scheduler *sched, bool full_recovery, int error);
>>   void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
>>   void drm_sched_increase_karma(struct drm_sched_job *bad);
>>   void drm_sched_reset_karma(struct drm_sched_job *bad);

Patch

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 7e90c9f95611..5a534772335a 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -671,13 +671,14 @@  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
 EXPORT_SYMBOL(drm_sched_stop);
 
 /**
- * drm_sched_start - recover jobs after a reset
+ * drm_sched_start_ex - recover jobs after a reset
  *
  * @sched: scheduler instance
  * @full_recovery: proceed with complete sched restart
+ * @error : err code for set dma_fence_set_error
  *
  */
-void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
+void drm_sched_start_ex(struct drm_gpu_scheduler *sched, bool full_recovery, int error)
 {
 	struct drm_sched_job *s_job, *tmp;
 	int r;
@@ -704,7 +705,7 @@  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
 				DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n",
 					  r);
 		} else
-			drm_sched_job_done(s_job, -ECANCELED);
+			drm_sched_job_done(s_job, error);
 	}
 
 	if (full_recovery)
@@ -712,6 +713,18 @@  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
 
 	drm_sched_wqueue_start(sched);
 }
+EXPORT_SYMBOL(drm_sched_start_ex);
+/**
+ * drm_sched_start - recover jobs after a reset
+ *
+ * @sched: scheduler instance
+ * @full_recovery: proceed with complete sched restart
+ *
+ */
+void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
+{
+	drm_sched_start_ex(sched, full_recovery, -ECANCELED);
+}
 EXPORT_SYMBOL(drm_sched_start);
 
 /**
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 5acc64954a88..444fa6761590 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -580,6 +580,7 @@  void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
 void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
 void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
 void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
+void drm_sched_start_ex(struct drm_gpu_scheduler *sched, bool full_recovery, int error);
 void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
 void drm_sched_increase_karma(struct drm_sched_job *bad);
 void drm_sched_reset_karma(struct drm_sched_job *bad);
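
As a rough illustration of how a test application might act on the distinction this patch exposes;
wait_for_job() below is a made-up placeholder for whatever mechanism the test uses to retrieve the
fence status (for example the CS wait ioctl), not an existing API:

#include <errno.h>
#include <stdio.h>

/* Placeholder: returns 0 on normal completion, or the negative error code
 * the scheduler attached to the job's fence. */
int wait_for_job(void *job);

static void report_recovery(void *job)
{
	int status = wait_for_job(job);

	switch (status) {
	case -ENODATA:
		printf("job disposed of by soft recovery or queue reset\n");
		break;
	case -ECANCELED:
	case -ETIME:
		printf("job disposed of by a full GPU reset\n");
		break;
	default:
		printf("job completed with status %d\n", status);
		break;
	}
}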