Message ID | 20230919050155.2647172-8-matthew.brost@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | DRM scheduler changes for Xe | expand |
Hi, On 2023-09-19 01:01, Matthew Brost wrote: > If the TDR is set to a very small value it can fire before the > submission is started in the function drm_sched_start. The submission is > expected to running when the TDR fires, fix this ordering so this > expectation is always met. > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > --- > drivers/gpu/drm/scheduler/sched_main.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > index 09ef07b9e9d5..a5cc9b6c2faa 100644 > --- a/drivers/gpu/drm/scheduler/sched_main.c > +++ b/drivers/gpu/drm/scheduler/sched_main.c > @@ -684,10 +684,10 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery) > drm_sched_job_done(s_job, -ECANCELED); > } > > + drm_sched_submit_start(sched); > + > if (full_recovery) > drm_sched_start_timeout_unlocked(sched); > - > - drm_sched_submit_start(sched); > } > EXPORT_SYMBOL(drm_sched_start); No. A timeout timer should be started before we submit anything down to the hardware. See Message-ID: <ed3aca10-8a9f-4698-92f4-21558fa6cfe3@amd.com>, and Message-ID: <8e5eab14-9e55-42c9-b6ea-02fcc591266d@amd.com>. You shouldn't start TDR at an arbitrarily late time after job submission to the hardware. To close this, the timer is started before jobs are submitted to the hardware. One possibility is to increase the timeout timer value.
On 2023-09-29 17:53, Luben Tuikov wrote: > Hi, > > On 2023-09-19 01:01, Matthew Brost wrote: >> If the TDR is set to a very small value it can fire before the >> submission is started in the function drm_sched_start. The submission is >> expected to running when the TDR fires, fix this ordering so this >> expectation is always met. >> >> Signed-off-by: Matthew Brost <matthew.brost@intel.com> >> --- >> drivers/gpu/drm/scheduler/sched_main.c | 4 ++-- >> 1 file changed, 2 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c >> index 09ef07b9e9d5..a5cc9b6c2faa 100644 >> --- a/drivers/gpu/drm/scheduler/sched_main.c >> +++ b/drivers/gpu/drm/scheduler/sched_main.c >> @@ -684,10 +684,10 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery) >> drm_sched_job_done(s_job, -ECANCELED); >> } >> >> + drm_sched_submit_start(sched); >> + >> if (full_recovery) >> drm_sched_start_timeout_unlocked(sched); >> - >> - drm_sched_submit_start(sched); >> } >> EXPORT_SYMBOL(drm_sched_start); > > No. > > A timeout timer should be started before we submit anything down to the hardware. > See Message-ID: <ed3aca10-8a9f-4698-92f4-21558fa6cfe3@amd.com>, > and Message-ID: <8e5eab14-9e55-42c9-b6ea-02fcc591266d@amd.com>. > > You shouldn't start TDR at an arbitrarily late time after job > submission to the hardware. To close this, the timer is started > before jobs are submitted to the hardware. > > One possibility is to increase the timeout timer value. If we went with this general change as we see here and in the subsequent patch--starting the TDR _after_ submitting jobs for execution to the hardware--this is what generally happens, 1. submit one or many jobs for execution; 2. one or many jobs may execute, complete, hang, etc.; 3. at some arbitrary time in the future, start TDR. Which means that the timeout doesn't necessarily track the time allotted for a job to finish executing in the hardware. It ends up larger than intended.
On Sat, Sep 30, 2023 at 03:48:07PM -0400, Luben Tuikov wrote: > On 2023-09-29 17:53, Luben Tuikov wrote: > > Hi, > > > > On 2023-09-19 01:01, Matthew Brost wrote: > >> If the TDR is set to a very small value it can fire before the > >> submission is started in the function drm_sched_start. The submission is > >> expected to running when the TDR fires, fix this ordering so this > >> expectation is always met. > >> > >> Signed-off-by: Matthew Brost <matthew.brost@intel.com> > >> --- > >> drivers/gpu/drm/scheduler/sched_main.c | 4 ++-- > >> 1 file changed, 2 insertions(+), 2 deletions(-) > >> > >> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > >> index 09ef07b9e9d5..a5cc9b6c2faa 100644 > >> --- a/drivers/gpu/drm/scheduler/sched_main.c > >> +++ b/drivers/gpu/drm/scheduler/sched_main.c > >> @@ -684,10 +684,10 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery) > >> drm_sched_job_done(s_job, -ECANCELED); > >> } > >> > >> + drm_sched_submit_start(sched); > >> + > >> if (full_recovery) > >> drm_sched_start_timeout_unlocked(sched); > >> - > >> - drm_sched_submit_start(sched); > >> } > >> EXPORT_SYMBOL(drm_sched_start); > > > > No. > > I don't think we will ever agree on this but I pulled out this patch and the next in Xe. It seems to work without these changes, I believe understand why and think it should actually work without this change. If for some reason it didn't work, I know how I can work around this in the Xe submission backend. With this, I will drop these in the next rev. But more on why I disagree below... > > A timeout timer should be started before we submit anything down to the hardware. > > See Message-ID: <ed3aca10-8a9f-4698-92f4-21558fa6cfe3@amd.com>, > > and Message-ID: <8e5eab14-9e55-42c9-b6ea-02fcc591266d@amd.com>. > > > > You shouldn't start TDR at an arbitrarily late time after job > > submission to the hardware. To close this, the timer is started > > before jobs are submitted to the hardware. > > > > One possibility is to increase the timeout timer value. No matter what the timeout value is there will always be a race of TDR firing before run_job() is called. > > If we went with this general change as we see here and in the subsequent patch--starting > the TDR _after_ submitting jobs for execution to the hardware--this is what generally happens, > 1. submit one or many jobs for execution; > 2. one or many jobs may execute, complete, hang, etc.; > 3. at some arbitrary time in the future, start TDR. > Which means that the timeout doesn't necessarily track the time allotted for a job to finish > executing in the hardware. It ends up larger than intended. Yes, conversely it can be smaller the way it is coded now. Kinda just a matter of opinion on which one to prefer. Matt > -- > Regards, > Luben >
On 2023-10-04 23:11, Matthew Brost wrote: > On Sat, Sep 30, 2023 at 03:48:07PM -0400, Luben Tuikov wrote: >> On 2023-09-29 17:53, Luben Tuikov wrote: >>> Hi, >>> >>> On 2023-09-19 01:01, Matthew Brost wrote: >>>> If the TDR is set to a very small value it can fire before the >>>> submission is started in the function drm_sched_start. The submission is >>>> expected to running when the TDR fires, fix this ordering so this >>>> expectation is always met. >>>> >>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> >>>> --- >>>> drivers/gpu/drm/scheduler/sched_main.c | 4 ++-- >>>> 1 file changed, 2 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c >>>> index 09ef07b9e9d5..a5cc9b6c2faa 100644 >>>> --- a/drivers/gpu/drm/scheduler/sched_main.c >>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c >>>> @@ -684,10 +684,10 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery) >>>> drm_sched_job_done(s_job, -ECANCELED); >>>> } >>>> >>>> + drm_sched_submit_start(sched); >>>> + >>>> if (full_recovery) >>>> drm_sched_start_timeout_unlocked(sched); >>>> - >>>> - drm_sched_submit_start(sched); >>>> } >>>> EXPORT_SYMBOL(drm_sched_start); >>> >>> No. >>> > > I don't think we will ever agree on this but I pulled out this patch and > the next in Xe. It seems to work without these changes, I believe > understand why and think it should actually work without this change. If > for some reason it didn't work, I know how I can work around this in the > Xe submission backend. > > With this, I will drop these in the next rev. > > But more on why I disagree below... > >>> A timeout timer should be started before we submit anything down to the hardware. >>> See Message-ID: <ed3aca10-8a9f-4698-92f4-21558fa6cfe3@amd.com>, >>> and Message-ID: <8e5eab14-9e55-42c9-b6ea-02fcc591266d@amd.com>. >>> >>> You shouldn't start TDR at an arbitrarily late time after job >>> submission to the hardware. To close this, the timer is started >>> before jobs are submitted to the hardware. >>> >>> One possibility is to increase the timeout timer value. > > No matter what the timeout value is there will always be a race of TDR > firing before run_job() is called. It's not a "race". In all software and firmware I've seen, a timeout timer is started _before_ a command is submitted to firmware or hardware, respectively. > >> >> If we went with this general change as we see here and in the subsequent patch--starting >> the TDR _after_ submitting jobs for execution to the hardware--this is what generally happens, >> 1. submit one or many jobs for execution; >> 2. one or many jobs may execute, complete, hang, etc.; >> 3. at some arbitrary time in the future, start TDR. >> Which means that the timeout doesn't necessarily track the time allotted for a job to finish >> executing in the hardware. It ends up larger than intended. > > Yes, conversely it can be smaller the way it is coded now. Kinda just a > matter of opinion on which one to prefer. It should be large enough to contain the command/task/job making it to the hardware. We want to make sure there's no runaway job, _for_ the amount of time allotted to each job.
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index 09ef07b9e9d5..a5cc9b6c2faa 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -684,10 +684,10 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery) drm_sched_job_done(s_job, -ECANCELED); } + drm_sched_submit_start(sched); + if (full_recovery) drm_sched_start_timeout_unlocked(sched); - - drm_sched_submit_start(sched); } EXPORT_SYMBOL(drm_sched_start);
If the TDR is set to a very small value it can fire before the submission is started in the function drm_sched_start. The submission is expected to running when the TDR fires, fix this ordering so this expectation is always met. Signed-off-by: Matthew Brost <matthew.brost@intel.com> --- drivers/gpu/drm/scheduler/sched_main.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)