
[drm-misc-next] drm/nouveau: sched: avoid job races between entities

Message ID 20230811010632.2473-1-dakr@redhat.com (mailing list archive)
State New, archived
Series [drm-misc-next] drm/nouveau: sched: avoid job races between entities

Commit Message

Danilo Krummrich Aug. 11, 2023, 1:06 a.m. UTC
If a sched job depends on a dma-fence from a job from the same GPU
scheduler instance, but a different scheduler entity, the GPU scheduler
only waits for the particular job to be scheduled, rather than for the
job to fully complete. This is because the GPU scheduler assumes there
is one scheduler instance per ring. However, the current implementation,
in order to avoid an arbitrary number of kthreads, has a single
scheduler instance while scheduler entities represent rings.

As a workaround, set the DRM_SCHED_FENCE_DONT_PIPELINE flag for all
out-fences in order to force the scheduler to wait for full job
completion for dependent jobs from different entities and the same
scheduler instance.

There is some work in progress [1] to address the issues of firmware
schedulers; once it is in-tree the scheduler topology in Nouveau should
be re-worked accordingly.

[1] https://lore.kernel.org/dri-devel/20230801205103.627779-1-matthew.brost@intel.com/

Signed-off-by: Danilo Krummrich <dakr@redhat.com>
---
 drivers/gpu/drm/nouveau/nouveau_sched.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)


base-commit: 68132cc6d1bcbc78ade524c6c6c226de42139f0e
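
For context, the scheduler-side shortcut this flag disables lives in drm_sched_entity_add_dependency_cb() in drivers/gpu/drm/scheduler/sched_entity.c. The following is a simplified, paraphrased sketch of that check, not the verbatim upstream code; details may differ between kernel versions:

	/* Paraphrased sketch of drm_sched_entity_add_dependency_cb(): if the
	 * dependency fence comes from the same drm_gpu_scheduler instance and
	 * DRM_SCHED_FENCE_DONT_PIPELINE is not set, the scheduler swaps the
	 * dependency for the *scheduled* fence instead of the finished one.
	 */
	s_fence = to_drm_sched_fence(fence);
	if (!fence->error && s_fence && s_fence->sched == sched &&
	    !test_bit(DRM_SCHED_FENCE_DONT_PIPELINE, &fence->flags)) {
		/* Same scheduler instance: assume a different ring and only
		 * wait for the dependency to be scheduled, not completed.
		 */
		fence = dma_fence_get(&s_fence->scheduled);
		dma_fence_put(entity->dependency);
		entity->dependency = fence;
	}

With Nouveau's single scheduler instance this shortcut would also trigger for dependencies between different entities (rings), which is why the patch below opts every out-fence out of it by setting DRM_SCHED_FENCE_DONT_PIPELINE on the done_fence.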

Comments

Faith Ekstrand Aug. 11, 2023, 5:53 p.m. UTC | #1
On Thu, Aug 10, 2023 at 8:06 PM Danilo Krummrich <dakr@redhat.com> wrote:

> If a sched job depends on a dma-fence from a job from the same GPU
> scheduler instance, but a different scheduler entity, the GPU scheduler
> does only wait for the particular job to be scheduled, rather than for
> the job to fully complete. This is due to the GPU scheduler assuming
> that there is a scheduler instance per ring. However, the current
> implementation, in order to avoid arbitrary amounts of kthreads, has a
> single scheduler instance while scheduler entities represent rings.
>
> As a workaround, set the DRM_SCHED_FENCE_DONT_PIPELINE for all
> out-fences in order to force the scheduler to wait for full job
> completion for dependent jobs from different entities and same scheduler
> instance.
>
> There is some work in progress [1] to address the issues of firmware
> schedulers; once it is in-tree the scheduler topology in Nouveau should
> be re-worked accordingly.
>
> [1] https://lore.kernel.org/dri-devel/20230801205103.627779-1-matthew.brost@intel.com/
>
> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
> ---
>  drivers/gpu/drm/nouveau/nouveau_sched.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
>
> diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
> index 3424a1bf6af3..88217185e0f3 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_sched.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
> @@ -292,6 +292,28 @@ nouveau_job_submit(struct nouveau_job *job)
>         if (job->sync)
>                 done_fence = dma_fence_get(job->done_fence);
>
> +       /* If a sched job depends on a dma-fence from a job from the same GPU
> +        * scheduler instance, but a different scheduler entity, the GPU
> +        * scheduler does only wait for the particular job to be scheduled,

s/does only wait/only waits/

Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>

> +        * rather than for the job to fully complete. This is due to the GPU
> +        * scheduler assuming that there is a scheduler instance per ring.
> +        * However, the current implementation, in order to avoid arbitrary
> +        * amounts of kthreads, has a single scheduler instance while scheduler
> +        * entities represent rings.
> +        *
> +        * As a workaround, set the DRM_SCHED_FENCE_DONT_PIPELINE for all
> +        * out-fences in order to force the scheduler to wait for full job
> +        * completion for dependent jobs from different entities and same
> +        * scheduler instance.
> +        *
> +        * There is some work in progress [1] to address the issues of firmware
> +        * schedulers; once it is in-tree the scheduler topology in Nouveau
> +        * should be re-worked accordingly.
> +        *
> +        * [1] https://lore.kernel.org/dri-devel/20230801205103.627779-1-matthew.brost@intel.com/
> +        */
> +       set_bit(DRM_SCHED_FENCE_DONT_PIPELINE, &job->done_fence->flags);
> +
>         if (job->ops->armed_submit)
>                 job->ops->armed_submit(job);
>
>
> base-commit: 68132cc6d1bcbc78ade524c6c6c226de42139f0e
> --
> 2.41.0
>
>

Patch

diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
index 3424a1bf6af3..88217185e0f3 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
@@ -292,6 +292,28 @@  nouveau_job_submit(struct nouveau_job *job)
 	if (job->sync)
 		done_fence = dma_fence_get(job->done_fence);
 
+	/* If a sched job depends on a dma-fence from a job from the same GPU
+	 * scheduler instance, but a different scheduler entity, the GPU
+	 * scheduler does only wait for the particular job to be scheduled,
+	 * rather than for the job to fully complete. This is due to the GPU
+	 * scheduler assuming that there is a scheduler instance per ring.
+	 * However, the current implementation, in order to avoid arbitrary
+	 * amounts of kthreads, has a single scheduler instance while scheduler
+	 * entities represent rings.
+	 *
+	 * As a workaround, set the DRM_SCHED_FENCE_DONT_PIPELINE for all
+	 * out-fences in order to force the scheduler to wait for full job
+	 * completion for dependent jobs from different entities and same
+	 * scheduler instance.
+	 *
+	 * There is some work in progress [1] to address the issues of firmware
+	 * schedulers; once it is in-tree the scheduler topology in Nouveau
+	 * should be re-worked accordingly.
+	 *
+	 * [1] https://lore.kernel.org/dri-devel/20230801205103.627779-1-matthew.brost@intel.com/
+	 */
+	set_bit(DRM_SCHED_FENCE_DONT_PIPELINE, &job->done_fence->flags);
+
 	if (job->ops->armed_submit)
 		job->ops->armed_submit(job);
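
For illustration, the cross-entity case this guards against looks roughly as follows. This is a hypothetical driver-side sketch: the job_a/job_b names are made up and the ->base / ->done_fence layout follows struct nouveau_job as I read it; only drm_sched_job_add_dependency() and the DRM_SCHED_FENCE_DONT_PIPELINE bit are the real kernel interfaces, and error handling is omitted.

	/* Hypothetical sketch: two entities (rings) sharing one
	 * drm_gpu_scheduler instance.
	 */
	struct dma_fence *out = dma_fence_get(job_a->done_fence);

	/* job_b runs on a different entity but the same scheduler instance
	 * and depends on job_a's out-fence.
	 */
	drm_sched_job_add_dependency(&job_b->base, out);

	/* Without DRM_SCHED_FENCE_DONT_PIPELINE set on the out-fence, the
	 * scheduler would treat this as a same-ring dependency and only wait
	 * for job_a to be scheduled. With the bit set in nouveau_job_submit()
	 * above, job_b is not pushed to its ring before job_a has fully
	 * completed.
	 */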