[2/2] drm/sched: fix timeout handling

Message ID 20181008113617.1295-2-christian.koenig@amd.com (mailing list archive)
State New, archived
Headers show
Series [1/2] drm/sched: add drm_sched_start_timeout helper | expand

Commit Message

Christian König Oct. 8, 2018, 11:36 a.m. UTC
We need to make sure that we don't race between job completion and
timeout.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

Comments

Nayan Deshmukh Oct. 8, 2018, 5:40 p.m. UTC | #1
Hi Christian,

I am a bit confused by this patch. It would help if you could
explain what these 3 functions are supposed to do. From my
understanding, this is what they are supposed to do:

1. drm_sched_job_timedout: This function is called when a job has been
executing for more than "timeout" seconds. It needs to identify the bad
job and call the timedout_job function, which notifies the driver of
that job so that the GPU can be recovered to its original state.

2. drm_sched_hw_job_reset: stop the scheduler if it contains the bad
job. ATM this function first removes all the callbacks and goes through
the ring_mirror_list afterward to find the bad job. I think it should
do it in the opposite order, or my understanding of this function is
not fully correct.

3. drm_sched_job_recovery: recover jobs after a reset. It resubmits
the jobs to the hardware queue, avoiding the guilty job.

Some other questions and suggestions:

1. We can avoid the race condition altogether if we shift the
canceling and rescheduling of the timeout work item to
drm_sched_process_job().

2. How does removing and re-adding the callbacks help with the race
condition? Moreover, the hardware might execute some jobs while we are
in this function, leading to more races.

3. What is the order in which the above 3 functions should be executed
by the hardware? I think the answer to this question might clear a lot
of my doubts.

Regards,
Nayan

On Mon, Oct 8, 2018 at 8:36 PM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> We need to make sure that we don't race between job completion and
> timeout.
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 28 +++++++++++++++++++++++++++-
>  1 file changed, 27 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index bd7d11c47202..ad3c57c9fd21 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -248,14 +248,40 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
>  static void drm_sched_job_timedout(struct work_struct *work)
>  {
>         struct drm_gpu_scheduler *sched;
> +       struct drm_sched_fence *fence;
>         struct drm_sched_job *job;
> +       int r;
>
>         sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> +
> +       spin_lock(&sched->job_list_lock);
> +       list_for_each_entry_reverse(job, &sched->ring_mirror_list, node) {
> +               fence = job->s_fence;
> +               if (!dma_fence_remove_callback(fence->parent, &fence->cb))
> +                       goto already_signaled;
> +       }
> +
>         job = list_first_entry_or_null(&sched->ring_mirror_list,
>                                        struct drm_sched_job, node);
> +       spin_unlock(&sched->job_list_lock);
>
>         if (job)
> -               job->sched->ops->timedout_job(job);
> +               sched->ops->timedout_job(job);
> +
> +       spin_lock(&sched->job_list_lock);
> +       list_for_each_entry(job, &sched->ring_mirror_list, node) {
> +               fence = job->s_fence;
> +               if (!fence->parent || !list_empty(&fence->cb.node))
> +                       continue;
> +
> +               r = dma_fence_add_callback(fence->parent, &fence->cb,
> +                                          drm_sched_process_job);
> +               if (r)
> +already_signaled:
> +                       drm_sched_process_job(fence->parent, &fence->cb);
> +
> +       }
> +       spin_unlock(&sched->job_list_lock);
>  }
>
>  /**
> --
> 2.14.1
>
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
Christian König Oct. 8, 2018, 5:55 p.m. UTC | #2
Am 08.10.2018 um 19:40 schrieb Nayan Deshmukh:
> Hi Christian,
>
> I am a bit confused with this patch. It will be better if you can
> explain what these 3 functions are supposed to do. From my
> understanding this is what they are supposed to do:
>
> 1. drm_sched_job_timedout: This function is called when a job has been
> executing for more than "timeout" seconds. This function needs to
> identify the wrong job and call the timedout_job function which will
> notify the driver of that job and the gpu should be recovered to its
> original state.
>
> 2. drm_sched_hw_job_reset: stop the scheduler if it contains the bad
> job. ATM this function first removes all the callbacks and goes through
> the ring_mirror_list afterward to find the bad job. I think it should
> do it in the opposite order. Or my understanding of this function is
> not fully correct.

That function actually has a couple of bugs itself, but that is
irrelevant to the current patch.

> 3. drm_sched_job_recovery: recover jobs after a reset. It resubmits
> the job to the hardware queue avoiding the guilty job.

That function is completely unrelated and only called after recovering 
from a hard reset.

>
> Some other questions and suggestions:
>
> 1. We can avoid the race condition altogether if we shift the
> canceling and rescheduling of the timedout work item to the
> drm_sched_process_job().

That won't work. drm_sched_process_job() is called from interrupt 
context and can't sync with a timeout worker.

> 2. How does removing and adding the callback help with the race
> condition? Moreover, the hardware might execute some jobs while we are
> in this function leading to more races.

We need to make sure that the underlying hardware fence doesn't signal
and trigger new processing while we are about to call the driver's
timeout function to reset the hardware.

This is done by removing the callbacks in reverse order (i.e. newest to
oldest).

If we find that we can't remove the callback because the hardware has
actually continued and the fence has already signaled, we add back the
callback again in normal order (i.e. oldest to newest) starting from the
job which was already signaled.
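The remove-in-reverse, re-add-in-order idea can be sketched in plain userspace C. Everything below is a simplified stand-in for the kernel's dma_fence/scheduler API, not the real thing; note that this mock deliberately does not re-process the job whose callback already fired, which matches the correction agreed on later in the thread:

```c
#include <stdbool.h>

#define NJOBS 4

/* Simplified stand-in for a scheduler job and its hardware fence. */
struct mock_job {
	bool fence_signaled;	/* hardware fence completed */
	bool cb_armed;		/* completion callback installed */
	bool processed;		/* completion handler ran */
};

/* Stand-in for dma_fence_remove_callback(): fails once the fence has
 * signaled, because the completion callback already fired. */
static bool mock_remove_cb(struct mock_job *j)
{
	if (j->fence_signaled)
		return false;
	j->cb_armed = false;
	return true;
}

/* Stand-in for dma_fence_add_callback(): nonzero means the fence
 * signaled meanwhile and the caller must process the job itself. */
static int mock_add_cb(struct mock_job *j)
{
	if (j->fence_signaled)
		return -1;
	j->cb_armed = true;
	return 0;
}

/* Stand-in for drm_sched_process_job(). */
static void mock_process(struct mock_job *j)
{
	j->processed = true;
}

/* Two-pass timeout handling as described above; jobs[0] is the oldest.
 * Returns the index of the first already-signaled job, or -1. */
static int mock_timedout(struct mock_job *jobs, int n)
{
	int i, stop = -1;

	/* Pass 1: newest to oldest, quiesce the completion callbacks so a
	 * late fence signal cannot race with the timeout handling. */
	for (i = n - 1; i >= 0; i--) {
		if (!mock_remove_cb(&jobs[i])) {
			stop = i;	/* hardware actually made progress */
			break;
		}
	}

	if (stop < 0) {
		/* Fully quiesced: the driver's ->timedout_job() callback
		 * for the oldest job would be invoked here. */
	}

	/* Pass 2: oldest to newest, re-arm the callbacks removed in pass 1
	 * (jobs older than the signaled one never lost theirs and are
	 * skipped); process any job whose fence signaled in between. */
	for (i = stop + 1; i < n; i++) {
		if (jobs[i].cb_armed)
			continue;
		if (mock_add_cb(&jobs[i]))
			mock_process(&jobs[i]);
	}

	return stop;
}
```

If the oldest job's fence has signaled, pass 1 stops there, the timeout is skipped, and pass 2 only re-arms the newer jobs; if nothing signaled, all callbacks are detached for the duration of the driver's timeout handling and re-armed afterwards.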

>
> 3. What is the order in which the above 3 functions should be executed
> by the hardware? I think the answer to this question might clear a lot
> of my doubts.

If we can quickly recover from a problem, only
drm_sched_job_timedout() should be executed.

The other two are for hard resets where we need to stop multiple 
scheduler instances and get them running again.

That the karma handling is mixed into that is rather unfortunate and 
actually quite buggy as well.

I should probably also clean that up.

Regards,
Christian.

>
> Regards,
> Nayan
>
> On Mon, Oct 8, 2018 at 8:36 PM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> We need to make sure that we don't race between job completion and
>> timeout.
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 28 +++++++++++++++++++++++++++-
>>   1 file changed, 27 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index bd7d11c47202..ad3c57c9fd21 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -248,14 +248,40 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
>>   static void drm_sched_job_timedout(struct work_struct *work)
>>   {
>>          struct drm_gpu_scheduler *sched;
>> +       struct drm_sched_fence *fence;
>>          struct drm_sched_job *job;
>> +       int r;
>>
>>          sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>> +
>> +       spin_lock(&sched->job_list_lock);
>> +       list_for_each_entry_reverse(job, &sched->ring_mirror_list, node) {
>> +               fence = job->s_fence;
>> +               if (!dma_fence_remove_callback(fence->parent, &fence->cb))
>> +                       goto already_signaled;
>> +       }
>> +
>>          job = list_first_entry_or_null(&sched->ring_mirror_list,
>>                                         struct drm_sched_job, node);
>> +       spin_unlock(&sched->job_list_lock);
>>
>>          if (job)
>> -               job->sched->ops->timedout_job(job);
>> +               sched->ops->timedout_job(job);
>> +
>> +       spin_lock(&sched->job_list_lock);
>> +       list_for_each_entry(job, &sched->ring_mirror_list, node) {
>> +               fence = job->s_fence;
>> +               if (!fence->parent || !list_empty(&fence->cb.node))
>> +                       continue;
>> +
>> +               r = dma_fence_add_callback(fence->parent, &fence->cb,
>> +                                          drm_sched_process_job);
>> +               if (r)
>> +already_signaled:
>> +                       drm_sched_process_job(fence->parent, &fence->cb);
>> +
>> +       }
>> +       spin_unlock(&sched->job_list_lock);
>>   }
>>
>>   /**
>> --
>> 2.14.1
>>
Nayan Deshmukh Oct. 8, 2018, 6:01 p.m. UTC | #3
Thanks for all the explanations. Looks like this part of the scheduler
has a lot of bugs.

On Tue, Oct 9, 2018 at 2:55 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 08.10.2018 um 19:40 schrieb Nayan Deshmukh:
> > Hi Christian,
> >
> > I am a bit confused with this patch. It will be better if you can
> > explain what these 3 functions are supposed to do. From my
> > understanding this is what they are supposed to do:
> >
> > 1. drm_sched_job_timedout: This function is called when a job has been
> > executing for more than "timeout" seconds. This function needs to
> > identify the wrong job and call the timedout_job function which will
> > notify the driver of that job and the gpu should be recovered to its
> > original state.
> >
> > 2. drm_sched_hw_job_reset: stop the scheduler if it contains the bad
> > job. ATM this function first removes all the callbacks and goes through
> > the ring_mirror_list afterward to find the bad job. I think it should
> > do it in the opposite order. Or my understanding of this function is
> > not fully correct.
>
> That function actually has a couple of bugs itself, but it is irrelevant
> for the current patch.
>
> > 3. drm_sched_job_recovery: recover jobs after a reset. It resubmits
> > the job to the hardware queue avoiding the guilty job.
>
> That function is completely unrelated and only called after recovering
> from a hard reset.
>
> >
> > Some other questions and suggestions:
> >
> > 1. We can avoid the race condition altogether if we shift the
> > canceling and rescheduling of the timedout work item to the
> > drm_sched_process_job().
>
> That won't work. drm_sched_process_job() is called from interrupt
> context and can't sync with a timeout worker.
>
> > 2. How does removing and adding the callback help with the race
> > condition? Moreover, the hardware might execute some jobs while we are
> > in this function leading to more races.
>
> We need to make sure that the underlying hardware fence doesn't signal
> and trigger new processing while we are about to call the driver's
> timeout function to reset the hardware.
>
> This is done by removing the callbacks in reverse order (e.g. newest to
> oldest).
>
> If we find that we can't remove the callback because the hardware has
> actually continued and the fence has already signaled, we add back the
> callback again in normal order (i.e. oldest to newest) starting from the
> job which was already signaled.
Why do we need to call drm_sched_process_job() again for the last signaled job?

Regards,
Nayan
>
> >
> > 3. What is the order in which the above 3 functions should be executed
> > by the hardware? I think the answer to this question might clear a lot
> > of my doubts.
>
> If we can quickly recover from a problem, only
> drm_sched_job_timedout() should be executed.
>
> The other two are for hard resets where we need to stop multiple
> scheduler instances and get them running again.
>
> That the karma handling is mixed into that is rather unfortunate and
> actually quite buggy as well.
>
> I should probably also clean that up.
>
> Regards,
> Christian.
>
> >
> > Regards,
> > Nayan
> >
> > On Mon, Oct 8, 2018 at 8:36 PM Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> >> We need to make sure that we don't race between job completion and
> >> timeout.
> >>
> >> Signed-off-by: Christian König <christian.koenig@amd.com>
> >> ---
> >>   drivers/gpu/drm/scheduler/sched_main.c | 28 +++++++++++++++++++++++++++-
> >>   1 file changed, 27 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> >> index bd7d11c47202..ad3c57c9fd21 100644
> >> --- a/drivers/gpu/drm/scheduler/sched_main.c
> >> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> >> @@ -248,14 +248,40 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
> >>   static void drm_sched_job_timedout(struct work_struct *work)
> >>   {
> >>          struct drm_gpu_scheduler *sched;
> >> +       struct drm_sched_fence *fence;
> >>          struct drm_sched_job *job;
> >> +       int r;
> >>
> >>          sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> >> +
> >> +       spin_lock(&sched->job_list_lock);
> >> +       list_for_each_entry_reverse(job, &sched->ring_mirror_list, node) {
> >> +               fence = job->s_fence;
> >> +               if (!dma_fence_remove_callback(fence->parent, &fence->cb))
> >> +                       goto already_signaled;
> >> +       }
> >> +
> >>          job = list_first_entry_or_null(&sched->ring_mirror_list,
> >>                                         struct drm_sched_job, node);
> >> +       spin_unlock(&sched->job_list_lock);
> >>
> >>          if (job)
> >> -               job->sched->ops->timedout_job(job);
> >> +               sched->ops->timedout_job(job);
> >> +
> >> +       spin_lock(&sched->job_list_lock);
> >> +       list_for_each_entry(job, &sched->ring_mirror_list, node) {
> >> +               fence = job->s_fence;
> >> +               if (!fence->parent || !list_empty(&fence->cb.node))
> >> +                       continue;
> >> +
> >> +               r = dma_fence_add_callback(fence->parent, &fence->cb,
> >> +                                          drm_sched_process_job);
> >> +               if (r)
> >> +already_signaled:
> >> +                       drm_sched_process_job(fence->parent, &fence->cb);
> >> +
> >> +       }
> >> +       spin_unlock(&sched->job_list_lock);
> >>   }
> >>
> >>   /**
> >> --
> >> 2.14.1
> >>
>
Christian König Oct. 8, 2018, 6:04 p.m. UTC | #4
Am 08.10.2018 um 20:01 schrieb Nayan Deshmukh:
> Thanks for all the explanations. Looks like this part of the scheduler
> has a lot of bugs.
>
> On Tue, Oct 9, 2018 at 2:55 AM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Am 08.10.2018 um 19:40 schrieb Nayan Deshmukh:
>>> Hi Christian,
>>>
>>> I am a bit confused with this patch. It will be better if you can
>>> explain what these 3 functions are supposed to do. From my
>>> understanding this is what they are supposed to do:
>>>
>>> 1. drm_sched_job_timedout: This function is called when a job has been
>>> executing for more than "timeout" seconds. This function needs to
>>> identify the wrong job and call the timedout_job function which will
>>> notify the driver of that job and the gpu should be recovered to its
>>> original state.
>>>
>>> 2. drm_sched_hw_job_reset: stop the scheduler if it contains the bad
>>> job. ATM this function first removes all the callbacks and goes through
>>> the ring_mirror_list afterward to find the bad job. I think it should
>>> do it in the opposite order. Or my understanding of this function is
>>> not fully correct.
>> That function actually has a couple of bugs itself, but it is irrelevant
>> for the current patch.
>>
>>> 3. drm_sched_job_recovery: recover jobs after a reset. It resubmits
>>> the job to the hardware queue avoiding the guilty job.
>> That function is completely unrelated and only called after recovering
>> from a hard reset.
>>
>>> Some other questions and suggestions:
>>>
>>> 1. We can avoid the race condition altogether if we shift the
>>> canceling and rescheduling of the timedout work item to the
>>> drm_sched_process_job().
>> That won't work. drm_sched_process_job() is called from interrupt
>> context and can't sync with a timeout worker.
>>
>>> 2. How does removing and adding the callback help with the race
>>> condition? Moreover, the hardware might execute some jobs while we are
>>> in this function leading to more races.
>> We need to make sure that the underlying hardware fence doesn't signal
>> and trigger new processing while we are about to call the driver's
>> timeout function to reset the hardware.
>>
>> This is done by removing the callbacks in reverse order (e.g. newest to
>> oldest).
>>
>> If we find that we can't remove the callback because the hardware has
>> actually continued and the fence has already signaled, we add back the
>> callback again in normal order (i.e. oldest to newest) starting from the
>> job which was already signaled.
> Why do we need to call drm_sched_process_job() again for the last signaled job?

Oh, good point, that is indeed incorrect.

The label should be after the call to drm_sched_process_job(), since on
that path we couldn't remove the callback for the current job.

Going to fix that,
Christian.
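Concretely, moving the label to after the drm_sched_process_job() call means the goto path resumes the forward loop without running the completion handler a second time for the job whose callback could not be removed. A compilable userspace mock of that corrected control flow (all helpers are simplified stand-ins for the fence API, not the real kernel code):

```c
#include <stdbool.h>

#define NJOBS 4

struct mjob {
	bool signaled;	/* hardware fence completed */
	bool cb_armed;	/* completion callback installed */
	int processed;	/* how many times the handler ran */
};

/* dma_fence_remove_callback() stand-in: fails once the fence signaled,
 * i.e. the completion handler already ran for this job. */
static bool remove_cb(struct mjob *j)
{
	if (j->signaled)
		return false;
	j->cb_armed = false;
	return true;
}

/* dma_fence_add_callback() stand-in: nonzero means the fence signaled
 * meanwhile and the caller has to run the handler itself. */
static int add_cb(struct mjob *j)
{
	if (j->signaled)
		return -1;
	j->cb_armed = true;
	return 0;
}

static void process(struct mjob *j)
{
	j->processed++;
}

/* Control flow of drm_sched_job_timedout() with the label moved to
 * after the process() call: jumping into the forward loop skips the
 * duplicate processing and simply continues with the next job. */
static void timedout_fixed(struct mjob *jobs, int n)
{
	int i;

	/* Newest to oldest: detach callbacks until one removal fails. */
	for (i = n - 1; i >= 0; i--) {
		if (!remove_cb(&jobs[i]))
			goto already_signaled;	/* handler already ran */
	}

	/* ->timedout_job() for the oldest job would run here. */

	/* Oldest to newest: re-arm detached callbacks; process jobs
	 * whose fence signaled in between. */
	for (i = 0; i < n; i++) {
		if (jobs[i].cb_armed)
			continue;
		if (add_cb(&jobs[i]))
			process(&jobs[i]);
already_signaled:	;
	}
}
```

Jumping into the loop body is legal C here; on the goto path the loop picks up at i + 1, so older jobs keep their callbacks and the already-signaled job is not processed again.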

>
> Regards,
> Nayan
>>> 3. What is the order in which the above 3 functions should be executed
>>> by the hardware? I think the answer to this question might clear a lot
>>> of my doubts.
>> If we can quickly recover from a problem, only
>> drm_sched_job_timedout() should be executed.
>>
>> The other two are for hard resets where we need to stop multiple
>> scheduler instances and get them running again.
>>
>> That the karma handling is mixed into that is rather unfortunate and
>> actually quite buggy as well.
>>
>> I should probably also clean that up.
>>
>> Regards,
>> Christian.
>>
>>> Regards,
>>> Nayan
>>>
>>> On Mon, Oct 8, 2018 at 8:36 PM Christian König
>>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>> We need to make sure that we don't race between job completion and
>>>> timeout.
>>>>
>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/scheduler/sched_main.c | 28 +++++++++++++++++++++++++++-
>>>>    1 file changed, 27 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index bd7d11c47202..ad3c57c9fd21 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -248,14 +248,40 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
>>>>    static void drm_sched_job_timedout(struct work_struct *work)
>>>>    {
>>>>           struct drm_gpu_scheduler *sched;
>>>> +       struct drm_sched_fence *fence;
>>>>           struct drm_sched_job *job;
>>>> +       int r;
>>>>
>>>>           sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>>>> +
>>>> +       spin_lock(&sched->job_list_lock);
>>>> +       list_for_each_entry_reverse(job, &sched->ring_mirror_list, node) {
>>>> +               fence = job->s_fence;
>>>> +               if (!dma_fence_remove_callback(fence->parent, &fence->cb))
>>>> +                       goto already_signaled;
>>>> +       }
>>>> +
>>>>           job = list_first_entry_or_null(&sched->ring_mirror_list,
>>>>                                          struct drm_sched_job, node);
>>>> +       spin_unlock(&sched->job_list_lock);
>>>>
>>>>           if (job)
>>>> -               job->sched->ops->timedout_job(job);
>>>> +               sched->ops->timedout_job(job);
>>>> +
>>>> +       spin_lock(&sched->job_list_lock);
>>>> +       list_for_each_entry(job, &sched->ring_mirror_list, node) {
>>>> +               fence = job->s_fence;
>>>> +               if (!fence->parent || !list_empty(&fence->cb.node))
>>>> +                       continue;
>>>> +
>>>> +               r = dma_fence_add_callback(fence->parent, &fence->cb,
>>>> +                                          drm_sched_process_job);
>>>> +               if (r)
>>>> +already_signaled:
>>>> +                       drm_sched_process_job(fence->parent, &fence->cb);
>>>> +
>>>> +       }
>>>> +       spin_unlock(&sched->job_list_lock);
>>>>    }
>>>>
>>>>    /**
>>>> --
>>>> 2.14.1
>>>>
Patch

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index bd7d11c47202..ad3c57c9fd21 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -248,14 +248,40 @@  static void drm_sched_job_begin(struct drm_sched_job *s_job)
 static void drm_sched_job_timedout(struct work_struct *work)
 {
 	struct drm_gpu_scheduler *sched;
+	struct drm_sched_fence *fence;
 	struct drm_sched_job *job;
+	int r;
 
 	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
+
+	spin_lock(&sched->job_list_lock);
+	list_for_each_entry_reverse(job, &sched->ring_mirror_list, node) {
+		fence = job->s_fence;
+		if (!dma_fence_remove_callback(fence->parent, &fence->cb))
+			goto already_signaled;
+	}
+
 	job = list_first_entry_or_null(&sched->ring_mirror_list,
 				       struct drm_sched_job, node);
+	spin_unlock(&sched->job_list_lock);
 
 	if (job)
-		job->sched->ops->timedout_job(job);
+		sched->ops->timedout_job(job);
+
+	spin_lock(&sched->job_list_lock);
+	list_for_each_entry(job, &sched->ring_mirror_list, node) {
+		fence = job->s_fence;
+		if (!fence->parent || !list_empty(&fence->cb.node))
+			continue;
+
+		r = dma_fence_add_callback(fence->parent, &fence->cb,
+					   drm_sched_process_job);
+		if (r)
+already_signaled:
+			drm_sched_process_job(fence->parent, &fence->cb);
+
+	}
+	spin_unlock(&sched->job_list_lock);
 }
 
 /**