Message ID: 1571947050-26276-1-git-send-email-andrey.grodzovsky@amd.com (mailing list archive)
State:      New, archived
Series:     [1/2] drm/sched: Set error to s_fence if HW job submission failed.
On 24.10.19 21:57, Andrey Grodzovsky wrote:
> Problem:
> When run_job fails and HW fence returned is NULL we still signal
> the s_fence to avoid hangs but the user has no way of knowing if
> the actual HW job was ran and finished.
>
> Fix:
> Allow .run_job implementations to return ERR_PTR in the fence pointer
> returned and then set this error for s_fence->finished fence so whoever
> wait on this fence can inspect the signaled fence for an error.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++---
>  1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 9a0ee74..f39b97e 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -479,6 +479,7 @@ void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched)
>  	struct drm_sched_job *s_job, *tmp;
>  	uint64_t guilty_context;
>  	bool found_guilty = false;
> +	struct dma_fence *fence;
>  
>  	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>  		struct drm_sched_fence *s_fence = s_job->s_fence;
> @@ -492,7 +493,16 @@ void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched)
>  			dma_fence_set_error(&s_fence->finished, -ECANCELED);
>  
>  		dma_fence_put(s_job->s_fence->parent);
> -		s_job->s_fence->parent = sched->ops->run_job(s_job);
> +		fence = sched->ops->run_job(s_job);
> +
> +		if (IS_ERR_OR_NULL(fence)) {
> +			s_job->s_fence->parent = NULL;
> +			dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
> +		} else {
> +			s_job->s_fence->parent = fence;
> +		}
> +
> +

Maybe time for a drm_sched_run_job() function which does that handling?
And why don't we need to install the callback here?

Apart from that looks good to me,
Christian.

>  	}
>  }
>  EXPORT_SYMBOL(drm_sched_resubmit_jobs);
> @@ -720,7 +730,7 @@ static int drm_sched_main(void *param)
>  		fence = sched->ops->run_job(sched_job);
>  		drm_sched_fence_scheduled(s_fence);
>  
> -		if (fence) {
> +		if (!IS_ERR_OR_NULL(fence)) {
>  			s_fence->parent = dma_fence_get(fence);
>  			r = dma_fence_add_callback(fence, &sched_job->cb,
>  						   drm_sched_process_job);
> @@ -730,8 +740,11 @@ static int drm_sched_main(void *param)
>  				DRM_ERROR("fence add callback failed (%d)\n",
>  					  r);
>  			dma_fence_put(fence);
> -		} else
> +		} else {
> +
> +			dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
>  			drm_sched_process_job(NULL, &sched_job->cb);
> +		}
>  
>  		wake_up(&sched->job_scheduled);
>  	}
On 10/25/19 4:44 AM, Christian König wrote:
> On 24.10.19 21:57, Andrey Grodzovsky wrote:
>> [...]
>>  		dma_fence_put(s_job->s_fence->parent);
>> -		s_job->s_fence->parent = sched->ops->run_job(s_job);
>> +		fence = sched->ops->run_job(s_job);
>> +
>> +		if (IS_ERR_OR_NULL(fence)) {
>> +			s_job->s_fence->parent = NULL;
>> +			dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
>> +		} else {
>> +			s_job->s_fence->parent = fence;
>> +		}
>
> Maybe time for a drm_sched_run_job() function which does that
> handling? And why don't we need to install the callback here?

What code do you want to put in drm_sched_run_job ?

We reinstall the callback later in drm_sched_start; drm_sched_resubmit_jobs
is conditional on whether the guilty fence did signal by this time or not,
hence the split of the logic into drm_sched_start and drm_sched_resubmit_jobs.

Andrey

> Apart from that looks good to me,
> Christian.
>
> [...]
On 25.10.19 16:57, Grodzovsky, Andrey wrote:
> On 10/25/19 4:44 AM, Christian König wrote:
>> [...]
>> Maybe time for a drm_sched_run_job() function which does that
>> handling? And why don't we need to install the callback here?
>
> What code do you want to put in drm_sched_run_job ?
>
> We reinstall the callback later in drm_sched_start,
> drm_sched_resubmit_jobs is conditional on whether the guilty fence did
> signal by this time or not and so the split of the logic into
> drm_sched_start and drm_sched_resubmit_jobs.

Ah, yes of course. In this case the patch is Reviewed-by: Christian
König <christian.koenig@amd.com>.

Regards,
Christian.

> Andrey
>
> [...]
On 10/25/19 11:55 AM, Koenig, Christian wrote:
> On 25.10.19 16:57, Grodzovsky, Andrey wrote:
>> [...]
>> We reinstall the callback later in drm_sched_start,
>> drm_sched_resubmit_jobs is conditional on whether the guilty fence did
>> signal by this time or not and so the split of the logic into
>> drm_sched_start and drm_sched_resubmit_jobs.
>
> Ah, yes of course. In this case the patch is Reviewed-by: Christian
> König <christian.koenig@amd.com>.
>
> Regards,
> Christian.

Thanks, there is also a 2/2 patch for amdgpu, please take a look.

Andrey

> [...]
On 25.10.19 17:56, Grodzovsky, Andrey wrote:
> On 10/25/19 11:55 AM, Koenig, Christian wrote:
>> [...]
>> Ah, yes of course. In this case the patch is Reviewed-by: Christian
>> König <christian.koenig@amd.com>.
>
> Thanks, there is also a 2/2 patch for amdgpu, please take a look.

Seen that, feel free to add my rb as well.

Christian.

> [...]
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 9a0ee74..f39b97e 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -479,6 +479,7 @@ void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched)
 	struct drm_sched_job *s_job, *tmp;
 	uint64_t guilty_context;
 	bool found_guilty = false;
+	struct dma_fence *fence;
 
 	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
 		struct drm_sched_fence *s_fence = s_job->s_fence;
@@ -492,7 +493,16 @@ void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched)
 			dma_fence_set_error(&s_fence->finished, -ECANCELED);
 
 		dma_fence_put(s_job->s_fence->parent);
-		s_job->s_fence->parent = sched->ops->run_job(s_job);
+		fence = sched->ops->run_job(s_job);
+
+		if (IS_ERR_OR_NULL(fence)) {
+			s_job->s_fence->parent = NULL;
+			dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
+		} else {
+			s_job->s_fence->parent = fence;
+		}
+
+
 	}
 }
 EXPORT_SYMBOL(drm_sched_resubmit_jobs);
@@ -720,7 +730,7 @@ static int drm_sched_main(void *param)
 		fence = sched->ops->run_job(sched_job);
 		drm_sched_fence_scheduled(s_fence);
 
-		if (fence) {
+		if (!IS_ERR_OR_NULL(fence)) {
 			s_fence->parent = dma_fence_get(fence);
 			r = dma_fence_add_callback(fence, &sched_job->cb,
 						   drm_sched_process_job);
@@ -730,8 +740,11 @@ static int drm_sched_main(void *param)
 				DRM_ERROR("fence add callback failed (%d)\n",
 					  r);
 			dma_fence_put(fence);
-		} else
+		} else {
+
+			dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
 			drm_sched_process_job(NULL, &sched_job->cb);
+		}
 
 		wake_up(&sched->job_scheduled);
 	}
Problem:
When run_job fails and the HW fence returned is NULL we still signal
the s_fence to avoid hangs, but the user has no way of knowing if
the actual HW job was run and finished.

Fix:
Allow .run_job implementations to return ERR_PTR in the fence pointer
returned, and then set this error for the s_fence->finished fence so
whoever waits on this fence can inspect the signaled fence for an error.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)