[v3,10/13] drm/sched: Add helper to set TDR timeout

Message ID 20230912021615.2086698-11-matthew.brost@intel.com (mailing list archive)
State New, archived
Series DRM scheduler changes for Xe

Commit Message

Matthew Brost Sept. 12, 2023, 2:16 a.m. UTC
Add a helper to set the TDR timeout and restart the TDR with the new
timeout value. This will be used in Xe, a new Intel GPU driver, to trigger
the TDR to clean up drm_sched_entity objects that encounter errors.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++
 include/drm/gpu_scheduler.h            |  1 +
 2 files changed, 19 insertions(+)
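
As a rough illustration of the intended use (a sketch only; the function below
is hypothetical driver-side code, not part of this patch), a driver could
re-arm the TDR with a one-jiffy timeout when an entity hits an error, so the
timeout handler runs almost immediately and can clean the entity up:

#include <drm/gpu_scheduler.h>

/* Hypothetical driver-side helper, not part of this patch: on an entity
 * error, re-arm the TDR with a one-jiffy timeout so the timeout handler
 * fires almost immediately and can clean up the failed drm_sched_entity.
 */
static void xe_trigger_tdr_for_cleanup(struct drm_gpu_scheduler *sched)
{
	drm_sched_set_timeout(sched, 1);
}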

Comments

Luben Tuikov Sept. 14, 2023, 2:38 a.m. UTC | #1
On 2023-09-11 22:16, Matthew Brost wrote:
> Add helper to set TDR timeout and restart the TDR with new timeout
> value. This will be used in XE, new Intel GPU driver, to trigger the TDR
> to cleanup drm_sched_entity that encounter errors.

Do you just want to trigger the cleanup or do you really want to
modify the timeout and requeue TDR delayed work (to be triggered
later at a different time)?

If the former, then might as well just add a function to queue that
right away. If the latter, then this would do, albeit with a few
notes as mentioned below.

> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++
>  include/drm/gpu_scheduler.h            |  1 +
>  2 files changed, 19 insertions(+)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 9dbfab7be2c6..689fb6686e01 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -426,6 +426,24 @@ static void drm_sched_start_timeout_unlocked(struct drm_gpu_scheduler *sched)
>  	spin_unlock(&sched->job_list_lock);
>  }
>  
> +/**
> + * drm_sched_set_timeout - set timeout for reset worker
> + *
> + * @sched: scheduler instance to set and (re)-start the worker for
> + * @timeout: timeout period
> + *
> + * Set and (re)-start the timeout for the given scheduler.
> + */
> +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout)
> +{

Well, I'd perhaps call this "drm_sched_set_tdr_timeout()", or something
to that effect, as "drm_sched_set_timeout()" doesn't make clear that it is
indeed a cleanup timeout. However, it's totally up to you. :-)

It appears that "long timeout" is the new job timeout, so it is possible
that a stuck job might be given old timeout + new timeout recovery time,
after this function is called.
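
(Concretely: a job that has been pending for almost the whole of an old
10-second timeout when the timeout is changed to 5 seconds has the TDR
re-armed for a full 5 seconds from that point, so it can sit for nearly
15 seconds in total before the handler runs; the numbers are only an
example.)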

> +	spin_lock(&sched->job_list_lock);
> +	sched->timeout = timeout;
> +	cancel_delayed_work(&sched->work_tdr);
> +	drm_sched_start_timeout(sched);
> +	spin_unlock(&sched->job_list_lock);
> +}
> +EXPORT_SYMBOL(drm_sched_set_timeout);

Well, drm_sched_start_timeout() (whose name also lacks description; perhaps it
should be "drm_sched_start_tdr_timeout()" or "...start_cleanup_timeout()") checks
that the timeout is not MAX_SCHEDULE_TIMEOUT and that the pending list is not
empty before it requeues the delayed TDR work item. So, while a remote possibility,
this new function may have the unintended consequence of canceling the TDR and
never restarting it.
I see it grabs the lock, however. Maybe wrap it in "if (sched->timeout != MAX_SCHEDULE_TIMEOUT)"?
How about using mod_delayed_work()?
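
For illustration only, the suggestion could look roughly like the sketch below
(untested; it assumes the sched->timeout_wq and sched->pending_list fields that
drm_sched_start_timeout() already uses):

void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout)
{
	spin_lock(&sched->job_list_lock);
	sched->timeout = timeout;
	if (timeout != MAX_SCHEDULE_TIMEOUT &&
	    !list_empty(&sched->pending_list))
		/* Re-arm (or arm) the TDR with the new timeout in one step. */
		mod_delayed_work(sched->timeout_wq, &sched->work_tdr, timeout);
	else
		/* Infinite timeout or nothing pending: ensure the TDR is off. */
		cancel_delayed_work(&sched->work_tdr);
	spin_unlock(&sched->job_list_lock);
}
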
Matthew Brost Sept. 14, 2023, 5:36 p.m. UTC | #2
On Wed, Sep 13, 2023 at 10:38:24PM -0400, Luben Tuikov wrote:
> On 2023-09-11 22:16, Matthew Brost wrote:
> > Add helper to set TDR timeout and restart the TDR with new timeout
> > value. This will be used in XE, new Intel GPU driver, to trigger the TDR
> > to cleanup drm_sched_entity that encounter errors.
> 
> Do you just want to trigger the cleanup or do you really want to
> modify the timeout and requeue TDR delayed work (to be triggered
> later at a different time)?
> 
> If the former, then might as well just add a function to queue that
> right away. If the latter, then this would do, albeit with a few

Let me change the function to queue it immediately as that is what is
needed.

Matt
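
For reference, an "immediately queue the TDR" variant could look roughly like
the sketch below; the function name and details are placeholders, not the
actual follow-up change:

void drm_sched_tdr_queue_now(struct drm_gpu_scheduler *sched)
{
	spin_lock(&sched->job_list_lock);
	/* Only meaningful if timeouts are enabled and a job is pending. */
	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
	    !list_empty(&sched->pending_list))
		mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
	spin_unlock(&sched->job_list_lock);
}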

> notes as mentioned below.
> 
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++
> >  include/drm/gpu_scheduler.h            |  1 +
> >  2 files changed, 19 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 9dbfab7be2c6..689fb6686e01 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -426,6 +426,24 @@ static void drm_sched_start_timeout_unlocked(struct drm_gpu_scheduler *sched)
> >  	spin_unlock(&sched->job_list_lock);
> >  }
> >  
> > +/**
> > + * drm_sched_set_timeout - set timeout for reset worker
> > + *
> > + * @sched: scheduler instance to set and (re)-start the worker for
> > + * @timeout: timeout period
> > + *
> > + * Set and (re)-start the timeout for the given scheduler.
> > + */
> > +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout)
> > +{
> 
> Well, I'd perhaps call this "drm_sched_set_tdr_timeout()", or something
> to that effect, as "drm_sched_set_timeout()" isn't clear that it is indeed
> a cleanup timeout. However, it's totally up to you. :-)
> 
> It appears that "long timeout" is the new job timeout, so it is possible
> that a stuck job might be given old timeout + new timeout recovery time,
> after this function is called.
> 
> > +	spin_lock(&sched->job_list_lock);
> > +	sched->timeout = timeout;
> > +	cancel_delayed_work(&sched->work_tdr);
> > +	drm_sched_start_timeout(sched);
> > +	spin_unlock(&sched->job_list_lock);
> > +}
> > +EXPORT_SYMBOL(drm_sched_set_timeout);
> 
> Well, drm_sched_start_timeout() (which also has a name lacking description, perhaps
> it should be "drm_sched_start_tdr_timeout()" or "...start_cleanup_timeout()"), anyway,
> so that function compares to MAX_SCHEDULE_TIMEOUT and pending list not being empty
> before it requeues delayed TDR work item. So, while a remote possibility, this new
> function may have the unintended consequence of canceling TDR, and never restarting it.
> I see it grabs the lock, however. Maybe wrap it in "if (sched->timeout != MAX_SCHEDULE_TIMEOUT)"?
> How about using mod_delayed_work()?
> -- 
> Regards,
> Luben
> 
> > +
> >  /**
> >   * drm_sched_fault - immediately start timeout handler
> >   *
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index 5d753ecb5d71..b7b818cd81b6 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -596,6 +596,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
> >  				    struct drm_gpu_scheduler **sched_list,
> >                                     unsigned int num_sched_list);
> >  
> > +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout);
> >  void drm_sched_job_cleanup(struct drm_sched_job *job);
> >  void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
> >  void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
>

Patch

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 9dbfab7be2c6..689fb6686e01 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -426,6 +426,24 @@  static void drm_sched_start_timeout_unlocked(struct drm_gpu_scheduler *sched)
 	spin_unlock(&sched->job_list_lock);
 }
 
+/**
+ * drm_sched_set_timeout - set timeout for reset worker
+ *
+ * @sched: scheduler instance to set and (re)-start the worker for
+ * @timeout: timeout period
+ *
+ * Set and (re)-start the timeout for the given scheduler.
+ */
+void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout)
+{
+	spin_lock(&sched->job_list_lock);
+	sched->timeout = timeout;
+	cancel_delayed_work(&sched->work_tdr);
+	drm_sched_start_timeout(sched);
+	spin_unlock(&sched->job_list_lock);
+}
+EXPORT_SYMBOL(drm_sched_set_timeout);
+
 /**
  * drm_sched_fault - immediately start timeout handler
  *
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 5d753ecb5d71..b7b818cd81b6 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -596,6 +596,7 @@  void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
 				    struct drm_gpu_scheduler **sched_list,
                                    unsigned int num_sched_list);
 
+void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout);
 void drm_sched_job_cleanup(struct drm_sched_job *job);
 void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
 void drm_sched_add_msg(struct drm_gpu_scheduler *sched,