
drm/sched: Only start TDR in drm_sched_job_begin on first job

Message ID 20240724234417.1912357-1-matthew.brost@intel.com (mailing list archive)
State New, archived
Series drm/sched: Only start TDR in drm_sched_job_begin on first job

Commit Message

Matthew Brost July 24, 2024, 11:44 p.m. UTC
Only start the TDR in drm_sched_job_begin() when the first job is added to
the pending list; if the pending list is already non-empty, the TDR has
already been started. Restarting the TDR is problematic because it extends
the TDR period for an already running job, so a continuous stream of jobs
can delay dma-fence signaling for a very long time.

Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
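
For context, with the change applied the locked section of
drm_sched_job_begin() reads as follows (this is just the hunk below spelled
out, not new code):

	spin_lock(&sched->job_list_lock);
	list_add_tail(&s_job->list, &sched->pending_list);
	if (list_is_singular(&sched->pending_list))
		drm_sched_start_timeout(sched);
	spin_unlock(&sched->job_list_lock);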

Comments

Christian König July 25, 2024, 7:42 a.m. UTC | #1
Am 25.07.24 um 01:44 schrieb Matthew Brost:
> Only start the TDR in drm_sched_job_begin() when the first job is added to
> the pending list; if the pending list is already non-empty, the TDR has
> already been started. Restarting the TDR is problematic because it extends
> the TDR period for an already running job, so a continuous stream of jobs
> can delay dma-fence signaling for a very long time.

Mhm, that should be unnecessary. drm_sched_start_timeout() should only 
start the timeout, but never re-start it.

Could be that this isn't working properly.

Regards,
Christian.

>
> Cc: Christian König <christian.koenig@amd.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/scheduler/sched_main.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 7e90c9f95611..feeeb9dbeb86 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -540,7 +540,8 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
>   
>   	spin_lock(&sched->job_list_lock);
>   	list_add_tail(&s_job->list, &sched->pending_list);
> -	drm_sched_start_timeout(sched);
> +	if (list_is_singular(&sched->pending_list))
> +		drm_sched_start_timeout(sched);
>   	spin_unlock(&sched->job_list_lock);
>   }
>
Matthew Brost July 25, 2024, 2:50 p.m. UTC | #2
On Thu, Jul 25, 2024 at 09:42:08AM +0200, Christian König wrote:
> Am 25.07.24 um 01:44 schrieb Matthew Brost:
> > Only start the TDR in drm_sched_job_begin() when the first job is added to
> > the pending list; if the pending list is already non-empty, the TDR has
> > already been started. Restarting the TDR is problematic because it extends
> > the TDR period for an already running job, so a continuous stream of jobs
> > can delay dma-fence signaling for a very long time.
> 
> Mhm, that should be unnecessary. drm_sched_start_timeout() should only start
> the timeout, but never re-start it.
> 

That function only checks that the pending list is not empty, so it does
indeed (re)arm the timeout every time it is called. That is the correct
behavior for some of the callers, e.g. drm_sched_tdr_queue_imm and
drm_sched_get_finished_job.
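
For reference, here is a simplified sketch of what drm_sched_start_timeout()
does, from my reading of sched_main.c around this version (trimmed, details
may differ slightly):

static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
{
	/* Only bails out if timeouts are disabled or nothing is pending;
	 * otherwise it (re)arms work_tdr, pushing the deadline out again
	 * for whatever job is already running.
	 */
	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
	    !list_empty(&sched->pending_list))
		mod_delayed_work(sched->timeout_wq, &sched->work_tdr,
				 sched->timeout);
}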

IMO best to fix this here.

Also FWIW on Xe I wrote a test which submitted a never-ending spinner,
then submitted a job every second on the same queue in a loop, and
observed that the spinner did not get canceled for a long time. After this
patch, the spinner correctly timed out after 5 seconds (our default TDR
period).
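
In rough pseudocode the test looks like this (the helper names are made up
for illustration, not the actual Xe test API):

	spinner = submit_spinner(queue);	/* batch that never completes on its own */
	for (i = 0; i < 60; i++) {
		submit_nop_job(queue);		/* one new job per second ... */
		sleep(1);			/* ... kept pushing the TDR deadline out */
	}
	/* Before this patch the spinner was still running at this point;
	 * with the patch applied the TDR killed it after ~5 seconds.
	 */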

Matt

> Could be that this isn't working properly.
> 
> Regards,
> Christian.
> 
> > 
> > Cc: Christian König <christian.koenig@amd.com>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/scheduler/sched_main.c | 3 ++-
> >   1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 7e90c9f95611..feeeb9dbeb86 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -540,7 +540,8 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
> >   	spin_lock(&sched->job_list_lock);
> >   	list_add_tail(&s_job->list, &sched->pending_list);
> > -	drm_sched_start_timeout(sched);
> > +	if (list_is_singular(&sched->pending_list))
> > +		drm_sched_start_timeout(sched);
> >   	spin_unlock(&sched->job_list_lock);
> >   }
>
Matthew Brost Aug. 14, 2024, 5:44 a.m. UTC | #3
On Thu, Jul 25, 2024 at 02:50:54PM +0000, Matthew Brost wrote:
> On Thu, Jul 25, 2024 at 09:42:08AM +0200, Christian König wrote:
> > Am 25.07.24 um 01:44 schrieb Matthew Brost:
> > > Only start the TDR in drm_sched_job_begin() when the first job is added to
> > > the pending list; if the pending list is already non-empty, the TDR has
> > > already been started. Restarting the TDR is problematic because it extends
> > > the TDR period for an already running job, so a continuous stream of jobs
> > > can delay dma-fence signaling for a very long time.
> > 
> > Mhm, that should be unnecessary. drm_sched_start_timeout() should only start
> > the timeout, but never re-start it.
> > 
> 
> That function only checks that the pending list is not empty, so it does
> indeed (re)arm the timeout every time it is called. That is the correct
> behavior for some of the callers, e.g. drm_sched_tdr_queue_imm and
> drm_sched_get_finished_job.
> 
> IMO best to fix this here.
> 
> Also FWIW on Xe I wrote a test which submitted a never-ending spinner,
> then submitted a job every second on the same queue in a loop, and
> observed that the spinner did not get canceled for a long time. After this
> patch, the spinner correctly timed out after 5 seconds (our default TDR
> period).
> 
> Matt

Ping Christian. Any response to the above?

Pretty clear problem, would like to resolve.

Matt

> 
> > Could be that this isn't working properly.
> > 
> > Regards,
> > Christian.
> > 
> > > 
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >   drivers/gpu/drm/scheduler/sched_main.c | 3 ++-
> > >   1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > index 7e90c9f95611..feeeb9dbeb86 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -540,7 +540,8 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
> > >   	spin_lock(&sched->job_list_lock);
> > >   	list_add_tail(&s_job->list, &sched->pending_list);
> > > -	drm_sched_start_timeout(sched);
> > > +	if (list_is_singular(&sched->pending_list))
> > > +		drm_sched_start_timeout(sched);
> > >   	spin_unlock(&sched->job_list_lock);
> > >   }
> >

Patch

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 7e90c9f95611..feeeb9dbeb86 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -540,7 +540,8 @@  static void drm_sched_job_begin(struct drm_sched_job *s_job)
 
 	spin_lock(&sched->job_list_lock);
 	list_add_tail(&s_job->list, &sched->pending_list);
-	drm_sched_start_timeout(sched);
+	if (list_is_singular(&sched->pending_list))
+		drm_sched_start_timeout(sched);
 	spin_unlock(&sched->job_list_lock);
 }