diff mbox series

[RFC,07/10] drm/sched: Add helper to set TDR timeout

Message ID 20230404002211.3611376-8-matthew.brost@intel.com (mailing list archive)
State New, archived
Headers show
Series Xe DRM scheduler and long running workload plans | expand

Commit Message

Matthew Brost April 4, 2023, 12:22 a.m. UTC
Add helper to set TDR timeout and restart the TDR with new timeout
value. This will be used in XE, new Intel GPU driver, to trigger the TDR
to cleanup drm_sched_entity that encounter errors.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++
 include/drm/gpu_scheduler.h            |  1 +
 2 files changed, 19 insertions(+)

Comments

Luben Tuikov May 4, 2023, 5:28 a.m. UTC | #1
On 2023-04-03 20:22, Matthew Brost wrote:
> Add helper to set TDR timeout and restart the TDR with new timeout
> value. This will be used in XE, new Intel GPU driver, to trigger the TDR
> to cleanup drm_sched_entity that encounter errors.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++
>  include/drm/gpu_scheduler.h            |  1 +
>  2 files changed, 19 insertions(+)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 4eac02d212c1..d61880315d8d 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -370,6 +370,24 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
>  		queue_delayed_work(sched->timeout_wq, &sched->work_tdr, sched->timeout);
>  }
>  
> +/**
> + * drm_sched_set_timeout - set timeout for reset worker
> + *
> + * @sched: scheduler instance to set and (re)-start the worker for
> + * @timeout: timeout period
> + *
> + * Set and (re)-start the timeout for the given scheduler.
> + */
> +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout)
> +{
> +	spin_lock(&sched->job_list_lock);
> +	sched->timeout = timeout;
> +	cancel_delayed_work(&sched->work_tdr);

I see that the comment says "(re-)start"(sic). Is the rest of the logic
stable in that we don't need to use _sync() version, and/or at least
inspect the return value of the one currently used?

Regards,
Luben

> +	drm_sched_start_timeout(sched);
> +	spin_unlock(&sched->job_list_lock);
> +}
> +EXPORT_SYMBOL(drm_sched_set_timeout);
> +
>  /**
>   * drm_sched_fault - immediately start timeout handler
>   *
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 18172ae63ab7..6258e324bd7c 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -593,6 +593,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>  				    struct drm_gpu_scheduler **sched_list,
>                                     unsigned int num_sched_list);
>  
> +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout);
>  void drm_sched_job_cleanup(struct drm_sched_job *job);
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
>  void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
Matthew Brost July 31, 2023, 1:09 a.m. UTC | #2
On Thu, May 04, 2023 at 01:28:12AM -0400, Luben Tuikov wrote:
> On 2023-04-03 20:22, Matthew Brost wrote:
> > Add helper to set TDR timeout and restart the TDR with new timeout
> > value. This will be used in XE, new Intel GPU driver, to trigger the TDR
> > to cleanup drm_sched_entity that encounter errors.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++
> >  include/drm/gpu_scheduler.h            |  1 +
> >  2 files changed, 19 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 4eac02d212c1..d61880315d8d 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -370,6 +370,24 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
> >  		queue_delayed_work(sched->timeout_wq, &sched->work_tdr, sched->timeout);
> >  }
> >  
> > +/**
> > + * drm_sched_set_timeout - set timeout for reset worker
> > + *
> > + * @sched: scheduler instance to set and (re)-start the worker for
> > + * @timeout: timeout period
> > + *
> > + * Set and (re)-start the timeout for the given scheduler.
> > + */
> > +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout)
> > +{
> > +	spin_lock(&sched->job_list_lock);
> > +	sched->timeout = timeout;
> > +	cancel_delayed_work(&sched->work_tdr);
> 
> I see that the comment says "(re-)start"(sic). Is the rest of the logic
> stable in that we don't need to use _sync() version, and/or at least
> inspect the return value of the one currently used?
> 

Sorry for the delayed response, just reviving this series now and seeing
this comment.

We don't care if the TDR is currently executing (at least in Xe which
makes use of this function), that is totally fine we only care to change
the future timeout values. I believe we actually call this from the TDR
in Xe to set the timeout value to zero so using a sync version would
deadlock. We do this as a mechanism to kill the drm_gpu_scheduler and
immediately timeout all remaining jobs. We also call this in a few other
places too with a value of zero for the same reason (kill the
drm_gpu_scheduler).

Matt

> Regards,
> Luben
> 
> > +	drm_sched_start_timeout(sched);
> > +	spin_unlock(&sched->job_list_lock);
> > +}
> > +EXPORT_SYMBOL(drm_sched_set_timeout);
> > +
> >  /**
> >   * drm_sched_fault - immediately start timeout handler
> >   *
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index 18172ae63ab7..6258e324bd7c 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -593,6 +593,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
> >  				    struct drm_gpu_scheduler **sched_list,
> >                                     unsigned int num_sched_list);
> >  
> > +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout);
> >  void drm_sched_job_cleanup(struct drm_sched_job *job);
> >  void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
> >  void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
>
Luben Tuikov Aug. 31, 2023, 7:52 p.m. UTC | #3
On 2023-07-30 21:09, Matthew Brost wrote:
> On Thu, May 04, 2023 at 01:28:12AM -0400, Luben Tuikov wrote:
>> On 2023-04-03 20:22, Matthew Brost wrote:
>>> Add helper to set TDR timeout and restart the TDR with new timeout
>>> value. This will be used in XE, new Intel GPU driver, to trigger the TDR
>>> to cleanup drm_sched_entity that encounter errors.
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>  drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++
>>>  include/drm/gpu_scheduler.h            |  1 +
>>>  2 files changed, 19 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 4eac02d212c1..d61880315d8d 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -370,6 +370,24 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
>>>  		queue_delayed_work(sched->timeout_wq, &sched->work_tdr, sched->timeout);
>>>  }
>>>  
>>> +/**
>>> + * drm_sched_set_timeout - set timeout for reset worker
>>> + *
>>> + * @sched: scheduler instance to set and (re)-start the worker for
>>> + * @timeout: timeout period
>>> + *
>>> + * Set and (re)-start the timeout for the given scheduler.
>>> + */
>>> +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout)
>>> +{
>>> +	spin_lock(&sched->job_list_lock);
>>> +	sched->timeout = timeout;
>>> +	cancel_delayed_work(&sched->work_tdr);
>>
>> I see that the comment says "(re-)start"(sic). Is the rest of the logic
>> stable in that we don't need to use _sync() version, and/or at least
>> inspect the return value of the one currently used?
>>
> 
> Sorry for the delayed response, just reviving this series now and seeing
> this comment.
> 
> We don't care if the TDR is currently executing (at least in Xe which
> makes use of this function), that is totally fine we only care to change
> the future timeout values. I believe we actually call this from the TDR
> in Xe to set the timeout value to zero so using a sync version would
> deadlock. We do this as a mechanism to kill the drm_gpu_scheduler and
> immediately timeout all remaining jobs. We also call this in a few other
> places too with a value of zero for the same reason (kill the
> drm_gpu_scheduler).

(Catching up chronologically after vacation...)

Okay, that's fine, but this shows a need for an interface/logic to simply
kill the DRM gpu scheduler. So perhaps we need to provide that kind
of functionality, as opposed to gaming the scheduler--setting the timeout to 0
to kill the scheduler. Perhaps that would be simpler...?
diff mbox series

Patch

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 4eac02d212c1..d61880315d8d 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -370,6 +370,24 @@  static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
 		queue_delayed_work(sched->timeout_wq, &sched->work_tdr, sched->timeout);
 }
 
+/**
+ * drm_sched_set_timeout - set timeout for reset worker
+ *
+ * @sched: scheduler instance to set and (re)-start the worker for
+ * @timeout: timeout period
+ *
+ * Set and (re)-start the timeout for the given scheduler.
+ */
+void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout)
+{
+	spin_lock(&sched->job_list_lock);
+	sched->timeout = timeout;
+	cancel_delayed_work(&sched->work_tdr);
+	drm_sched_start_timeout(sched);
+	spin_unlock(&sched->job_list_lock);
+}
+EXPORT_SYMBOL(drm_sched_set_timeout);
+
 /**
  * drm_sched_fault - immediately start timeout handler
  *
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 18172ae63ab7..6258e324bd7c 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -593,6 +593,7 @@  void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
 				    struct drm_gpu_scheduler **sched_list,
                                    unsigned int num_sched_list);
 
+void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout);
 void drm_sched_job_cleanup(struct drm_sched_job *job);
 void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
 void drm_sched_add_msg(struct drm_gpu_scheduler *sched,