[RFC,07/10] drm/sched: Add helper to set TDR timeout

Message ID	20230404002211.3611376-8-matthew.brost@intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <dri-devel-bounces@lists.freedesktop.org> From: Matthew Brost <matthew.brost@intel.com> To: dri-devel@lists.freedesktop.org, intel-xe@lists.freedesktop.org Subject: [RFC PATCH 07/10] drm/sched: Add helper to set TDR timeout Date: Mon, 3 Apr 2023 17:22:08 -0700 Message-Id: <20230404002211.3611376-8-matthew.brost@intel.com> In-Reply-To: <20230404002211.3611376-1-matthew.brost@intel.com> References: <20230404002211.3611376-1-matthew.brost@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: list Cc: robdclark@chromium.org, thomas.hellstrom@linux.intel.com, airlied@linux.ie, lina@asahilina.net, boris.brezillon@collabora.com, Matthew Brost <matthew.brost@intel.com>, christian.koenig@amd.com, faith.ekstrand@collabora.com Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
Series	Xe DRM scheduler and long running workload plans \| expand [RFC,00/10] Xe DRM scheduler and long running workload plans [RFC,01/10] drm/sched: Convert drm scheduler to use a work queue rather than kthread [RFC,02/10] drm/sched: Move schedule policy to scheduler / entity [RFC,03/10] drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy [RFC,04/10] drm/sched: Add generic scheduler message interface [RFC,05/10] drm/sched: Start run wq before TDR in drm_sched_start [RFC,06/10] drm/sched: Submit job before starting TDR [RFC,07/10] drm/sched: Add helper to set TDR timeout [RFC,08/10] dma-buf/dma-fence: Introduce long-running completion fences [RFC,09/10] drm/sched: Support long-running sched entities [RFC,10/10] drm/syncobj: Warn on long running dma-fences

Message ID

20230404002211.3611376-8-matthew.brost@intel.com (mailing list archive)

State

New, archived

Headers

From: Matthew Brost <matthew.brost@intel.com>
To: dri-devel@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org
Subject: [RFC PATCH 07/10] drm/sched: Add helper to set TDR timeout
Date: Mon,  3 Apr 2023 17:22:08 -0700
Message-Id: <20230404002211.3611376-8-matthew.brost@intel.com>
In-Reply-To: <20230404002211.3611376-1-matthew.brost@intel.com>
References: <20230404002211.3611376-1-matthew.brost@intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: list
Cc: robdclark@chromium.org, thomas.hellstrom@linux.intel.com,
 airlied@linux.ie,
 lina@asahilina.net, boris.brezillon@collabora.com,
 Matthew Brost <matthew.brost@intel.com>, christian.koenig@amd.com,
 faith.ekstrand@collabora.com
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>

Series

Xe DRM scheduler and long running workload plans | expand

Commit Message

Matthew Brost April 4, 2023, 12:22 a.m. UTC

Add helper to set TDR timeout and restart the TDR with new timeout
value. This will be used in XE, new Intel GPU driver, to trigger the TDR
to cleanup drm_sched_entity that encounter errors.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++
 include/drm/gpu_scheduler.h            |  1 +
 2 files changed, 19 insertions(+)

Comments

Luben Tuikov May 4, 2023, 5:28 a.m. UTC | #1

On 2023-04-03 20:22, Matthew Brost wrote:
> Add helper to set TDR timeout and restart the TDR with new timeout
> value. This will be used in XE, new Intel GPU driver, to trigger the TDR
> to cleanup drm_sched_entity that encounter errors.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++
>  include/drm/gpu_scheduler.h            |  1 +
>  2 files changed, 19 insertions(+)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 4eac02d212c1..d61880315d8d 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -370,6 +370,24 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
>  		queue_delayed_work(sched->timeout_wq, &sched->work_tdr, sched->timeout);
>  }
>  
> +/**
> + * drm_sched_set_timeout - set timeout for reset worker
> + *
> + * @sched: scheduler instance to set and (re)-start the worker for
> + * @timeout: timeout period
> + *
> + * Set and (re)-start the timeout for the given scheduler.
> + */
> +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout)
> +{
> +	spin_lock(&sched->job_list_lock);
> +	sched->timeout = timeout;
> +	cancel_delayed_work(&sched->work_tdr);

I see that the comment says "(re-)start"(sic). Is the rest of the logic
stable in that we don't need to use _sync() version, and/or at least
inspect the return value of the one currently used?

Regards,
Luben

> +	drm_sched_start_timeout(sched);
> +	spin_unlock(&sched->job_list_lock);
> +}
> +EXPORT_SYMBOL(drm_sched_set_timeout);
> +
>  /**
>   * drm_sched_fault - immediately start timeout handler
>   *
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 18172ae63ab7..6258e324bd7c 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -593,6 +593,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>  				    struct drm_gpu_scheduler **sched_list,
>                                     unsigned int num_sched_list);
>  
> +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout);
>  void drm_sched_job_cleanup(struct drm_sched_job *job);
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
>  void drm_sched_add_msg(struct drm_gpu_scheduler *sched,

Matthew Brost July 31, 2023, 1:09 a.m. UTC | #2

On Thu, May 04, 2023 at 01:28:12AM -0400, Luben Tuikov wrote:
> On 2023-04-03 20:22, Matthew Brost wrote:
> > Add helper to set TDR timeout and restart the TDR with new timeout
> > value. This will be used in XE, new Intel GPU driver, to trigger the TDR
> > to cleanup drm_sched_entity that encounter errors.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++
> >  include/drm/gpu_scheduler.h            |  1 +
> >  2 files changed, 19 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 4eac02d212c1..d61880315d8d 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -370,6 +370,24 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
> >  		queue_delayed_work(sched->timeout_wq, &sched->work_tdr, sched->timeout);
> >  }
> >  
> > +/**
> > + * drm_sched_set_timeout - set timeout for reset worker
> > + *
> > + * @sched: scheduler instance to set and (re)-start the worker for
> > + * @timeout: timeout period
> > + *
> > + * Set and (re)-start the timeout for the given scheduler.
> > + */
> > +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout)
> > +{
> > +	spin_lock(&sched->job_list_lock);
> > +	sched->timeout = timeout;
> > +	cancel_delayed_work(&sched->work_tdr);
> 
> I see that the comment says "(re-)start"(sic). Is the rest of the logic
> stable in that we don't need to use _sync() version, and/or at least
> inspect the return value of the one currently used?
> 

Sorry for the delayed response, just reviving this series now and seeing
this comment.

We don't care if the TDR is currently executing (at least in Xe which
makes use of this function), that is totally fine we only care to change
the future timeout values. I believe we actually call this from the TDR
in Xe to set the timeout value to zero so using a sync version would
deadlock. We do this as a mechanism to kill the drm_gpu_scheduler and
immediately timeout all remaining jobs. We also call this in a few other
places too with a value of zero for the same reason (kill the
drm_gpu_scheduler).

Matt

> Regards,
> Luben
> 
> > +	drm_sched_start_timeout(sched);
> > +	spin_unlock(&sched->job_list_lock);
> > +}
> > +EXPORT_SYMBOL(drm_sched_set_timeout);
> > +
> >  /**
> >   * drm_sched_fault - immediately start timeout handler
> >   *
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index 18172ae63ab7..6258e324bd7c 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -593,6 +593,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
> >  				    struct drm_gpu_scheduler **sched_list,
> >                                     unsigned int num_sched_list);
> >  
> > +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout);
> >  void drm_sched_job_cleanup(struct drm_sched_job *job);
> >  void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
> >  void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
>

Luben Tuikov Aug. 31, 2023, 7:52 p.m. UTC | #3

On 2023-07-30 21:09, Matthew Brost wrote:
> On Thu, May 04, 2023 at 01:28:12AM -0400, Luben Tuikov wrote:
>> On 2023-04-03 20:22, Matthew Brost wrote:
>>> Add helper to set TDR timeout and restart the TDR with new timeout
>>> value. This will be used in XE, new Intel GPU driver, to trigger the TDR
>>> to cleanup drm_sched_entity that encounter errors.
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>  drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++
>>>  include/drm/gpu_scheduler.h            |  1 +
>>>  2 files changed, 19 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 4eac02d212c1..d61880315d8d 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -370,6 +370,24 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
>>>  		queue_delayed_work(sched->timeout_wq, &sched->work_tdr, sched->timeout);
>>>  }
>>>  
>>> +/**
>>> + * drm_sched_set_timeout - set timeout for reset worker
>>> + *
>>> + * @sched: scheduler instance to set and (re)-start the worker for
>>> + * @timeout: timeout period
>>> + *
>>> + * Set and (re)-start the timeout for the given scheduler.
>>> + */
>>> +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout)
>>> +{
>>> +	spin_lock(&sched->job_list_lock);
>>> +	sched->timeout = timeout;
>>> +	cancel_delayed_work(&sched->work_tdr);
>>
>> I see that the comment says "(re-)start"(sic). Is the rest of the logic
>> stable in that we don't need to use _sync() version, and/or at least
>> inspect the return value of the one currently used?
>>
> 
> Sorry for the delayed response, just reviving this series now and seeing
> this comment.
> 
> We don't care if the TDR is currently executing (at least in Xe which
> makes use of this function), that is totally fine we only care to change
> the future timeout values. I believe we actually call this from the TDR
> in Xe to set the timeout value to zero so using a sync version would
> deadlock. We do this as a mechanism to kill the drm_gpu_scheduler and
> immediately timeout all remaining jobs. We also call this in a few other
> places too with a value of zero for the same reason (kill the
> drm_gpu_scheduler).

(Catching up chronologically after vacation...)

Okay, that's fine, but this shows a need for an interface/logic to simply
kill the DRM gpu scheduler. So perhaps we need to provide that kind
of functionality, as opposed to gaming the scheduler--setting the timeout to 0
to kill the scheduler. Perhaps that would be simpler...?

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 4eac02d212c1..d61880315d8d 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -370,6 +370,24 @@  static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
 		queue_delayed_work(sched->timeout_wq, &sched->work_tdr, sched->timeout);
 }
 
+/**
+ * drm_sched_set_timeout - set timeout for reset worker
+ *
+ * @sched: scheduler instance to set and (re)-start the worker for
+ * @timeout: timeout period
+ *
+ * Set and (re)-start the timeout for the given scheduler.
+ */
+void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout)
+{
+	spin_lock(&sched->job_list_lock);
+	sched->timeout = timeout;
+	cancel_delayed_work(&sched->work_tdr);
+	drm_sched_start_timeout(sched);
+	spin_unlock(&sched->job_list_lock);
+}
+EXPORT_SYMBOL(drm_sched_set_timeout);
+
 /**
  * drm_sched_fault - immediately start timeout handler
  *
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 18172ae63ab7..6258e324bd7c 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -593,6 +593,7 @@  void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
 				    struct drm_gpu_scheduler **sched_list,
                                    unsigned int num_sched_list);
 
+void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout);
 void drm_sched_job_cleanup(struct drm_sched_job *job);
 void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
 void drm_sched_add_msg(struct drm_gpu_scheduler *sched,

[RFC,07/10] drm/sched: Add helper to set TDR timeout

Commit Message

Comments

Patch