Message ID | 20240724234417.1912357-1-matthew.brost@intel.com (mailing list archive)
---|---
State | New, archived
Series | drm/sched: Only start TDR in drm_sched_job_begin on first job
On 25.07.24 01:44, Matthew Brost wrote:
> Only start the TDR in drm_sched_job_begin on the first job being added
> to the pending list; if the pending list is non-empty, the TDR has
> already been started. It is problematic to restart the TDR, as doing so
> extends the TDR period for an already running job, potentially leading
> to dma-fence signaling not occurring for a very long time with a
> continuous stream of jobs.

Mhm, that should be unnecessary. drm_sched_start_timeout() should only
start the timeout, but never re-start it.

Could be that this isn't working properly.

Regards,
Christian.

>
> Cc: Christian König <christian.koenig@amd.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 7e90c9f95611..feeeb9dbeb86 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -540,7 +540,8 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
>
>  	spin_lock(&sched->job_list_lock);
>  	list_add_tail(&s_job->list, &sched->pending_list);
> -	drm_sched_start_timeout(sched);
> +	if (list_is_singular(&sched->pending_list))
> +		drm_sched_start_timeout(sched);
>  	spin_unlock(&sched->job_list_lock);
>  }
>
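For context, drm_sched_start_timeout() around the time of this series looks roughly like the sketch below (simplified from drivers/gpu/drm/scheduler/sched_main.c; check the actual tree for the authoritative version). Its only guard is that the pending list is non-empty, and mod_delayed_work() re-arms an already queued work item, which is why a second call can push the TDR deadline out:

static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
{
	lockdep_assert_held(&sched->job_list_lock);

	/* The only guard: a finite timeout and a non-empty pending list. */
	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
	    !list_empty(&sched->pending_list))
		/* mod_delayed_work() re-arms a pending work item, moving
		 * the TDR deadline to now + sched->timeout on every call.
		 */
		mod_delayed_work(sched->timeout_wq, &sched->work_tdr,
				 sched->timeout);
}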
On Thu, Jul 25, 2024 at 09:42:08AM +0200, Christian König wrote:
> On 25.07.24 01:44, Matthew Brost wrote:
> > Only start the TDR in drm_sched_job_begin on the first job being added
> > to the pending list; if the pending list is non-empty, the TDR has
> > already been started. It is problematic to restart the TDR, as doing so
> > extends the TDR period for an already running job, potentially leading
> > to dma-fence signaling not occurring for a very long time with a
> > continuous stream of jobs.
>
> Mhm, that should be unnecessary. drm_sched_start_timeout() should only
> start the timeout, but never re-start it.
>

That function only checks that the pending list is not empty, so it does
indeed restart the timeout. That is the correct behavior for some of the
callers, e.g. drm_sched_tdr_queue_imm and drm_sched_get_finished_job.

IMO it is best to fix this here.

Also FWIW, on Xe I wrote a test which submitted a never-ending spinner,
then submitted a job every second on the same queue in a loop, and
observed that the spinner did not get canceled for a long time. After
this patch, the spinner correctly timed out after 5 seconds (our default
TDR period).

Matt

> Could be that this isn't working properly.
>
> Regards,
> Christian.
>
> >
> > Cc: Christian König <christian.koenig@amd.com>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 7e90c9f95611..feeeb9dbeb86 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -540,7 +540,8 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
> >
> >  	spin_lock(&sched->job_list_lock);
> >  	list_add_tail(&s_job->list, &sched->pending_list);
> > -	drm_sched_start_timeout(sched);
> > +	if (list_is_singular(&sched->pending_list))
> > +		drm_sched_start_timeout(sched);
> >  	spin_unlock(&sched->job_list_lock);
> >  }
> >
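To make the observed behavior concrete, here is a small self-contained userspace simulation (not kernel code; the 5 second TDR period and 1 second submission cadence mirror the test described above) of how re-arming the timeout on every submission starves the hung job's timeout:

#include <stdio.h>

#define TDR_PERIOD 5 /* seconds; the default TDR period mentioned above */

int main(void)
{
	/* The hung "spinner" job is submitted at t=0 and arms the TDR. */
	int deadline = TDR_PERIOD;

	/* A new job arrives every second; with the old code each arrival
	 * re-arms the TDR to now + TDR_PERIOD, so the deadline recedes.
	 */
	for (int now = 1; now <= 10; now++) {
		deadline = now + TDR_PERIOD;
		printf("t=%2ds: new job, TDR deadline pushed to t=%ds\n",
		       now, deadline);
	}

	/* The deadline keeps moving as long as jobs arrive, so the t=0
	 * spinner is never detected. With the fix, the TDR is armed once
	 * at t=0 and fires at t=5s regardless of later submissions.
	 */
	return 0;
}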
On Thu, Jul 25, 2024 at 02:50:54PM +0000, Matthew Brost wrote:
> On Thu, Jul 25, 2024 at 09:42:08AM +0200, Christian König wrote:
> > On 25.07.24 01:44, Matthew Brost wrote:
> > > Only start the TDR in drm_sched_job_begin on the first job being
> > > added to the pending list; if the pending list is non-empty, the
> > > TDR has already been started. It is problematic to restart the TDR,
> > > as doing so extends the TDR period for an already running job,
> > > potentially leading to dma-fence signaling not occurring for a very
> > > long time with a continuous stream of jobs.
> >
> > Mhm, that should be unnecessary. drm_sched_start_timeout() should
> > only start the timeout, but never re-start it.
> >
>
> That function only checks that the pending list is not empty, so it
> does indeed restart the timeout. That is the correct behavior for some
> of the callers, e.g. drm_sched_tdr_queue_imm and
> drm_sched_get_finished_job.
>
> IMO it is best to fix this here.
>
> Also FWIW, on Xe I wrote a test which submitted a never-ending spinner,
> then submitted a job every second on the same queue in a loop, and
> observed that the spinner did not get canceled for a long time. After
> this patch, the spinner correctly timed out after 5 seconds (our
> default TDR period).
>
> Matt

Ping Christian. Any response to the above? Pretty clear problem, would
like to resolve it.

Matt

> > Could be that this isn't working properly.
> >
> > Regards,
> > Christian.
> >
> > >
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/scheduler/sched_main.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > index 7e90c9f95611..feeeb9dbeb86 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -540,7 +540,8 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
> > >
> > >  	spin_lock(&sched->job_list_lock);
> > >  	list_add_tail(&s_job->list, &sched->pending_list);
> > > -	drm_sched_start_timeout(sched);
> > > +	if (list_is_singular(&sched->pending_list))
> > > +		drm_sched_start_timeout(sched);
> > >  	spin_unlock(&sched->job_list_lock);
> > >  }
> > >
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 7e90c9f95611..feeeb9dbeb86 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -540,7 +540,8 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
 
 	spin_lock(&sched->job_list_lock);
 	list_add_tail(&s_job->list, &sched->pending_list);
-	drm_sched_start_timeout(sched);
+	if (list_is_singular(&sched->pending_list))
+		drm_sched_start_timeout(sched);
 	spin_unlock(&sched->job_list_lock);
 }
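The guard relies on list_is_singular(), which is true only when the list contains exactly one entry; its implementation in include/linux/list.h is essentially:

/* From include/linux/list.h: true iff the list has exactly one entry. */
static inline int list_is_singular(const struct list_head *head)
{
	return !list_empty(head) && (head->next == head->prev);
}

Checked after list_add_tail() under job_list_lock, this holds only when the job just added is the sole pending job, i.e. no TDR can already be armed; it is equivalent to testing list_empty() before the add, but keeps the add first.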
Only start the TDR in drm_sched_job_begin on the first job being added to the pending list; if the pending list is non-empty, the TDR has already been started. It is problematic to restart the TDR, as doing so extends the TDR period for an already running job, potentially leading to dma-fence signaling not occurring for a very long time with a continuous stream of jobs.

Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)