
drm/sched: Add native dependency support to drm_sched

Message ID 7ced7c0a101cb2467c34b69d2b686c429f64d8c2.camel@imgtec.com (mailing list archive)
State New, archived
Series drm/sched: Add native dependency support to drm_sched

Commit Message

Donald Robson June 8, 2023, 1:23 p.m. UTC
This patch adds support for 'native' dependencies to the DRM scheduler.
In drivers that use a firmware-based scheduler there are performance
gains to be had by allowing waits to happen in the firmware, as this
reduces the latency between signalling and job submission.  Dependencies
that can be awaited by the firmware scheduler are termed 'native
dependencies'.  In the new PowerVR driver we delegate the waits to the
firmware, but it is still necessary to expose these fences within the
DRM scheduler.  This is because when a job is cancelled,
drm_sched_entity_kill() registers a callback with all the dependencies
to ensure the job finished fence is not signalled before all the job
dependencies are met.  If the native dependencies were not exposed, the
core could no longer guarantee that, and it might signal the fence too
early, leading to potential invalid accesses if other things depend on
the job finished fence.

All dependencies are handled in the same way up to the point where the
dependencies for a job are checked.  At this stage, the DRM scheduler
will now allow job submission to proceed once it encounters the first
native dependency in the list, the dependencies having been sorted
beforehand in drm_sched_job_arm() so that native ones appear last.  The
list is sorted during drm_sched_job_arm() because the scheduler isn't
known until this point, and determining whether a dependency is native
is done via a new drm_gpu_scheduler backend operation.

Native fences are just simple counters that get incremented every time
some specific execution point is reached, like when a GPU job is done.
The firmware is in charge of waiting but also updating fences, so it can
easily unblock any waiters it has internally. The CPU also has access to
these counters, so it can also check for progress.

TODO:

When operating normally the CPU is not supposed to update the counters
itself, but there is one specific situation where this is needed: when
a GPU hang has occurred and some contexts were declared faulty because
they had unfinished or blocked jobs pending.  In that situation, when
we reset the GPU, we evict the faulty contexts so they can't submit
jobs anymore and cancel the jobs that were in flight at the time of
reset.  That is not enough, though, because jobs on other, non-faulty
contexts might have native dependencies on jobs that never completed on
the faulty context.

If we were to ask the firmware to wait on those native fences, it would
block indefinitely, because no one would ever update the counter. So, in
that case, and that case only, we want the CPU to force-update the
counter and set it to the last issued sequence number.  We do not
currently have a helper for this and we welcome any suggestions for how
best to implement it.

Signed-off-by: Donald Robson <donald.robson@imgtec.com>
Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Boris Brezillon <boris.brezillon@collabora.com>
Cc: Frank Binns <frank.binns@imgtec.com>
Cc: Sarah Walker <sarah.walker@imgtec.com>
---
 drivers/gpu/drm/scheduler/sched_entity.c | 60 +++++++++++++--
 drivers/gpu/drm/scheduler/sched_main.c   | 96 ++++++++++++++++++++++++
 include/drm/gpu_scheduler.h              | 11 +++
 3 files changed, 161 insertions(+), 6 deletions(-)

Comments

Donald Robson June 9, 2023, 9:29 a.m. UTC | #1
For context, the native dependency support is used in these as-yet
unsubmitted files:
https://gitlab.freedesktop.org/sarah-walker-imgtec/powervr/-/blob/dev/v3/drivers/gpu/drm/imagination/pvr_job.c
https://gitlab.freedesktop.org/sarah-walker-imgtec/powervr/-/blob/dev/v3/drivers/gpu/drm/imagination/pvr_queue.c

Thanks,
Donald

On Thu, 2023-06-08 at 14:23 +0100, Donald Robson wrote:
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 15d04a0ec623..2685805a5e05 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -155,13 +155,14 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
>  {
>  	struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
>  						 finish_cb);
> +	unsigned long idx;
>  	int r;
>  
>  	dma_fence_put(f);
>  
>  	/* Wait for all dependencies to avoid data corruptions */
> -	while (!xa_empty(&job->dependencies)) {
> -		f = xa_erase(&job->dependencies, job->last_dependency++);
> +	xa_for_each(&job->dependencies, idx, f) {
> +		xa_erase(&job->dependencies, idx);
>  		r = dma_fence_add_callback(f, &job->finish_cb,
>  					   drm_sched_entity_kill_jobs_cb);
>  		if (!r)
> @@ -390,12 +391,59 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>  	return false;
>  }
>  
> +/**
> + * dep_is_native - indicates that native dependencies are supported and that the
> + * dependency at @index is marked.
> + * @job: Scheduler job.
> + * @index: Index into the @job->dependencies xarray.
> + *
> + * Must only be used after calling drm_sched_job_arm().
> + *
> + * Returns true if both these conditions are met.
> + */
> +static bool dep_is_native(struct drm_sched_job *job, unsigned long index)
> +{
> +	return job->sched->ops->dependency_is_native &&
> +	       xa_get_mark(&job->dependencies, index, XA_MARK_0);
> +}
> +
>  static struct dma_fence *
> -drm_sched_job_dependency(struct drm_sched_job *job,
> -			 struct drm_sched_entity *entity)
> +drm_sched_job_dependency(struct drm_sched_job *job, struct drm_sched_entity *entity)
>  {
> -	if (!xa_empty(&job->dependencies))
> -		return xa_erase(&job->dependencies, job->last_dependency++);
> +	struct dma_fence *fence;
> +	unsigned long dep_index;
> +
> +	if (!dep_is_native(job, job->last_dependency)) {
> +		fence = xa_erase(&job->dependencies, job->last_dependency++);
> +		if (fence)
> +			return fence;
> +	}
> +
> +	xa_for_each_start(&job->dependencies, dep_index, fence,
> +			  job->last_dependency) {
> +		/*
> +		 * Encountered first native dependency. Since these were
> +		 * previously sorted to the end of the array in
> +		 * drm_sched_sort_native_deps(), all remaining entries
> +		 * will be native too, so we can just iterate through
> +		 * them.
> +		 *
> +		 * Native entries cannot be erased, as they need to be
> +		 * accessed by the driver's native scheduler.
> +		 *
> +		 * If the native fence is a drm_sched_fence object, we
> +		 * ensure the job has been submitted so drm_sched_fence
> +		 * ::parent points to a valid dma_fence object.
> +		 */
> +		struct drm_sched_fence *s_fence = to_drm_sched_fence(fence);
> +		struct dma_fence *scheduled_fence =
> +			s_fence ? dma_fence_get_rcu(&s_fence->scheduled) : NULL;
> +
> +		job->last_dependency = dep_index + 1;
> +
> +		if (scheduled_fence)
> +			return scheduled_fence;
> +	}
>  
>  	if (job->sched->ops->prepare_job)
>  		return job->sched->ops->prepare_job(job, entity);
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 214364fccb71..08dcc33ec690 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -643,6 +643,92 @@ int drm_sched_job_init(struct drm_sched_job *job,
>  }
>  EXPORT_SYMBOL(drm_sched_job_init);
>  
> +/**
> + * drm_sched_sort_native_deps - relocates all native dependencies to the
> + * tail end of @job->dependencies.
> + * @job: target scheduler job.
> + *
> + * Starts by marking all of the native dependencies, then, in a quick-sort
> + * like manner it swaps entries using a head and tail index counter. Only
> + * a single partition is required, as there are only two values.
> + */
> +static void drm_sched_sort_native_deps(struct drm_sched_job *job)
> +{
> +	struct dma_fence *entry, *head = NULL, *tail = NULL;
> +	unsigned long h = 0, t = 0, native_dep_count = 0;
> +	XA_STATE(xas_head, &job->dependencies, 0);
> +	XA_STATE(xas_tail, &job->dependencies, 0);
> +	bool already_sorted = true;
> +
> +	if (!job->sched->ops->dependency_is_native)
> +		/* Driver doesn't support native deps. */
> +		return;
> +
> +	/* Mark all the native dependencies as we walk xas_tail to the end. */
> +	xa_lock(&job->dependencies);
> +	xas_for_each(&xas_tail, entry, ULONG_MAX) {
> +		/* Keep track of the index. */
> +		t++;
> +
> +		if (job->sched->ops->dependency_is_native(entry)) {
> +			xas_set_mark(&xas_tail, XA_MARK_0);
> +			native_dep_count++;
> +		} else if (native_dep_count) {
> +			/*
> +			 * As a native dep has been encountered before, we can
> +			 * infer the array is not already sorted.
> +			 */
> +			already_sorted = false;
> +		}
> +	}
> +	xa_unlock(&job->dependencies);
> +
> +	if (already_sorted)
> +		return;
> +
> +	/* xas_tail and t are now at the end of the array. */
> +	xa_lock(&job->dependencies);
> +	while (h < t) {
> +		if (!head) {
> +			/* Find a marked entry. */
> +			if (xas_get_mark(&xas_head, XA_MARK_0)) {
> +				head = xas_load(&xas_head);
> +			} else {
> +				xas_next(&xas_head);
> +				h++;
> +			}
> +		}
> +		if (!tail) {
> +			/* Find an unmarked entry. */
> +			if (xas_get_mark(&xas_tail, XA_MARK_0)) {
> +				xas_prev(&xas_tail);
> +				t--;
> +			} else {
> +				tail = xas_load(&xas_tail);
> +			}
> +		}
> +		if (head && tail) {
> +			/*
> +			 * Swap!
> +			 * These stores should never allocate, since they both
> +			 * already exist, hence they never fail.
> +			 */
> +			xas_store(&xas_head, tail);
> +			xas_store(&xas_tail, head);
> +
> +			/* Also swap the mark. */
> +			xas_clear_mark(&xas_head, XA_MARK_0);
> +			xas_set_mark(&xas_tail, XA_MARK_0);
> +
> +			head = NULL;
> +			tail = NULL;
> +			h++;
> +			t--;
> +		}
> +	}
> +	xa_unlock(&job->dependencies);
> +}
> +
>  /**
>   * drm_sched_job_arm - arm a scheduler job for execution
>   * @job: scheduler job to arm
> @@ -669,6 +755,7 @@ void drm_sched_job_arm(struct drm_sched_job *job)
>  	job->s_priority = entity->rq - sched->sched_rq;
>  	job->id = atomic64_inc_return(&sched->job_id_count);
>  
> +	drm_sched_sort_native_deps(job);
>  	drm_sched_fence_init(job->s_fence, job->entity);
>  }
>  EXPORT_SYMBOL(drm_sched_job_arm);
> @@ -1045,6 +1132,15 @@ static int drm_sched_main(void *param)
>  		trace_drm_run_job(sched_job, entity);
>  		fence = sched->ops->run_job(sched_job);
>  		complete_all(&entity->entity_idle);
> +
> +		/* We need to set the parent before signaling the scheduled
> +		 * fence if we want native dependency to work properly. If we
> +		 * don't, the driver might try to access the parent before
> +		 * it's set.
> +		 */
> +		if (!IS_ERR_OR_NULL(fence))
> +			drm_sched_fence_set_parent(s_fence, fence);
> +
>  		drm_sched_fence_scheduled(s_fence);
>  
>  		if (!IS_ERR_OR_NULL(fence)) {
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 898608f87b96..dca6be35e517 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -455,6 +455,17 @@ struct drm_sched_backend_ops {
>           * and it's time to clean it up.
>  	 */
>  	void (*free_job)(struct drm_sched_job *sched_job);
> +
> +	/**
> +	 * @dependency_is_native: When arming a job for this scheduler, this
> +	 * function will be called to determine whether to treat it as a
> +	 * native dependency. A native dependency is awaited and cleaned up
> +	 * when the job is cancelled, but responsibility is otherwise delegated
> +	 * to a native scheduler in the calling driver code.
> +	 *
> +	 * Optional - implies support for native dependencies.
> +	 */
> +	bool (*dependency_is_native)(struct dma_fence *fence);
>  };
>  
>  /**
Boris Brezillon June 12, 2023, 1:05 p.m. UTC | #2
Hi Donald,

On Thu, 8 Jun 2023 13:23:26 +0000
Donald Robson <Donald.Robson@imgtec.com> wrote:

>  /**
>   * drm_sched_job_arm - arm a scheduler job for execution
>   * @job: scheduler job to arm
> @@ -669,6 +755,7 @@ void drm_sched_job_arm(struct drm_sched_job *job)
>  	job->s_priority = entity->rq - sched->sched_rq;
>  	job->id = atomic64_inc_return(&sched->job_id_count);
>  
> +	drm_sched_sort_native_deps(job);

If we get [1] accepted, we no longer need to sort the array. We can
just skip native dependencies as we iterate over the array in
drm_sched_job_dependency() with something like:

       f = xa_load(&job->dependencies, job->last_dependency);
       while (f) {
               struct drm_sched_fence *s_fence;
               struct dma_fence *scheduled_fence;

               job->last_dependency++;

               /* Not a native dependency, return the fence directly. */
               if (!job->sched->ops->dependency_is_native ||
                   !job->sched->ops->dependency_is_native(f))
                       return dma_fence_get(f);

               /*
                * If the native fence is a drm_sched_fence object, we
                * ensure the job has been submitted so drm_sched_fence
                * ::parent points to a valid dma_fence object.
                */
               s_fence = to_drm_sched_fence(f);
               scheduled_fence = s_fence ?
				 dma_fence_get_rcu(&s_fence->scheduled) :
                                 NULL;

               if (scheduled_fence)
                       return scheduled_fence;

               /* Otherwise we skip the native fence and check the next fence. */
               f = xa_load(&job->dependencies, job->last_dependency);
        }

And, in the driver, when you get to submit the job, you can gather
the native deps with a simple xa_for_each() loop:

	xa_for_each(&job->dependencies, index, f) {
		/* If the fence is not signaled, it must be a native fence,
		 * because drm_sched_entity waited for all non-native ones.
		 */
		if (!dma_fence_is_signaled(f))
			// DO SOMETHING
	}

>  	drm_sched_fence_init(job->s_fence, job->entity);
>  }

Regards,

Boris

[1]https://patchwork.freedesktop.org/patch/541956/
Christian König June 12, 2023, 1:16 p.m. UTC | #3
Am 08.06.23 um 15:23 schrieb Donald Robson:
> This patch adds support for 'native' dependencies to DRM scheduler.  In
> drivers that use a firmware based scheduler there are performance gains
> to be had by allowing waits to happen in the firmware, as this reduces
> the latency between signalling and job submission.

Well, if I'm not completely mistaken this patch here is superfluous,
since we already use that functionality.

This strongly sounds like the HW dependencies we have in amdgpu. See
AMDGPU_CHUNK_ID_SCHEDULED_DEPENDENCIES.

Basically, instead of giving a hard dependency that requires something
to finish before the current submission starts, userspace can also give
a soft dependency and only require that the other submission be
scheduled.

This way you can then wait in the firmware for certain operations of
the previous submission to complete by comparing memory or registers.

You don't necessarily need to give control over this to userspace; if
your kernel driver can determine a fw-assisted wait by itself, that
should also work fine.

Regards,
Christian.

Boris Brezillon June 12, 2023, 2:59 p.m. UTC | #4
Hi Christian,

On Mon, 12 Jun 2023 15:16:03 +0200
Christian König <christian.koenig@amd.com> wrote:

> Am 08.06.23 um 15:23 schrieb Donald Robson:
> > This patch adds support for 'native' dependencies to DRM scheduler.  In
> > drivers that use a firmware based scheduler there are performance gains
> > to be had by allowing waits to happen in the firmware, as this reduces
> > the latency between signalling and job submission.  
> 
> Well, if I'm not completely mistaken this patch here is superfluous 
> since we already use that functionality.
> 
> This strongly sounds like the HW dependencies we have in amdgpu. See 
> AMDGPU_CHUNK_ID_SCHEDULED_DEPENDENCIES.

I'll look at it in more details. Thanks for the pointer.

> 
> Basically userspace can instead of giving a hard dependency to finish 
> something before the current submission starts also give a soft 
> dependency and only let the other submission be scheduled.
> 
> This way you can then wait for the firmware for certain operations of 
> the previous submission to complete by comparing memory or registers.
> 
> You don't necessarily need to give control over this to userspace, if 
> your kernel driver can determine a fw assisted wait by itself that 
> should also work fine.

That's what we did initially. We had a separate 'native_deps' xarray in
pvr_job that we managed ourselves, and that worked fine, except for the
kill_entity() stuff. If you don't wait for those native fences, you
potentially signal the job finished fence earlier than it should be.

Just to make sure we're on the same page, the native fences we
have here are really dma_fences that can be waited upon FW side:
they're not exposed to userspace, the GPU can't access the memory
region containing the counter (it's visible to the FW VM only, and
a kernel side CPU mapping), and we do make sure they signal in finite
time thanks to the job timeout. Feels a bit different compared to
USER_FENCEs most GPUs have nowadays, on which you don't have this
isolation guarantee, and which, AFAIU, are currently used to do some
advanced userspace driven scheduling between queues belonging to the
same context. My understanding, after discussing it with Daniel a few
weeks back, was that exposing USER_FENCEs as dma_fences was risky,
especially if they're used to do inter-context synchronization,
but the FW-visible-only ones were okay to expose as dma_fences. Maybe I
misunderstood what he suggested.

I'm done with this digression; back to the original topic: we can of
course wait for all those native fences before calling
drm_sched_entity_destroy(), but it's a bit weird to do a partial wait
in the driver while the entity is still active (pretty sure that's racy
anyway) and then delegate the rest to the core.

If we decide we don't care about waiting for native fences when
killing jobs in the kill_entity() path, because we assume the dma_resv
is covering us, that's fine, but then I don't really see why
drm_sched_entity_kill() should wait at all, because this 'should wait,
but maybe not for all your deps' behavior is quite confusing.

Regards,

Boris
Boris Brezillon June 12, 2023, 4:25 p.m. UTC | #5
On Mon, 12 Jun 2023 16:59:02 +0200
Boris Brezillon <boris.brezillon@collabora.com> wrote:

> Hi Christian,
> 
> On Mon, 12 Jun 2023 15:16:03 +0200
> Christian König <christian.koenig@amd.com> wrote:
> 
> > Am 08.06.23 um 15:23 schrieb Donald Robson:  
> > > This patch adds support for 'native' dependencies to DRM scheduler.  In
> > > drivers that use a firmware based scheduler there are performance gains
> > > to be had by allowing waits to happen in the firmware, as this reduces
> > > the latency between signalling and job submission.    
> > 
> > Well, if I'm not completely mistaken this patch here is superfluous 
> > since we already use that functionality.
> > 
> > This strongly sounds like the HW dependencies we have in amdgpu. See 
> > AMDGPU_CHUNK_ID_SCHEDULED_DEPENDENCIES.  
> 
> I'll look at it in more detail. Thanks for the pointer.

I had a quick look, and it looks pretty similar, indeed.

> 
> > 
> > Basically, instead of giving a hard dependency that something must finish
> > before the current submission starts, userspace can also give a soft
> > dependency and only require that the other submission be scheduled.
> > 
> > This way you can then wait for the firmware for certain operations of 
> > the previous submission to complete by comparing memory or registers.
> > 
> > You don't necessarily need to give userspace control over this; if
> > your kernel driver can determine a fw-assisted wait by itself, that
> > should also work fine.
> 
> That's what we did initially. We had a separate 'native_deps' xarray in
> pvr_job that we were managing ourselves, and that worked fine, except
> for the kill_entity() stuff. If you don't wait for those
> 'native-fences', you're potentially signaling the job finished fence
> earlier than it should.

Hm, I think we could get drm_sched_entity_kill_jobs_cb() to do the
right thing here without teaching drm_sched about native deps. If we
upgrade scheduled fences to their finished fences in
drm_sched_entity_kill_jobs_cb(), this should work:

static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
                                          struct dma_fence_cb *cb)
{
        struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
                                                 finish_cb);
        unsigned long index;
        int r;

        dma_fence_put(f);

        /* Wait for all dependencies to avoid data corruptions */
        xa_for_each(&job->dependencies, index, f) {
                struct drm_sched_fence *s_fence = to_drm_sched_fence(f);

                /* Make sure we wait for the finished fence here, so we can
                 * guarantee that any job we depend on that is still accessing
                 * resources is done before we signal this job finished fence
                 * and unblock further accesses on these resources.
                 */
                if (s_fence && f == &s_fence->scheduled)
                        f = &s_fence->finished;

                xa_erase(&job->dependencies, index);
                r = dma_fence_add_callback(f, &job->finish_cb,
                                           drm_sched_entity_kill_jobs_cb);
                if (!r)
                        return;

                dma_fence_put(f);
        }

        INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);
        schedule_work(&job->work);
}

Then, for native fences, we just have to add the scheduled fence to the deps
array, as you do (and as we did in our first version), and we should be good.

Patch

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 15d04a0ec623..2685805a5e05 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -155,13 +155,14 @@  static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
 {
 	struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
 						 finish_cb);
+	unsigned long idx;
 	int r;
 
 	dma_fence_put(f);
 
 	/* Wait for all dependencies to avoid data corruptions */
-	while (!xa_empty(&job->dependencies)) {
-		f = xa_erase(&job->dependencies, job->last_dependency++);
+	xa_for_each(&job->dependencies, idx, f) {
+		xa_erase(&job->dependencies, idx);
 		r = dma_fence_add_callback(f, &job->finish_cb,
 					   drm_sched_entity_kill_jobs_cb);
 		if (!r)
@@ -390,12 +391,59 @@  static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
 	return false;
 }
 
+/**
+ * dep_is_native - indicates that native dependencies are supported and that the
+ * dependency at @index is marked.
+ * @job: Scheduler job.
+ * @index: Index into the @job->dependencies xarray.
+ *
+ * Must only be used after calling drm_sched_job_arm().
+ *
+ * Returns true if both these conditions are met.
+ */
+static bool dep_is_native(struct drm_sched_job *job, unsigned long index)
+{
+	return job->sched->ops->dependency_is_native &&
+	       xa_get_mark(&job->dependencies, index, XA_MARK_0);
+}
+
 static struct dma_fence *
-drm_sched_job_dependency(struct drm_sched_job *job,
-			 struct drm_sched_entity *entity)
+drm_sched_job_dependency(struct drm_sched_job *job, struct drm_sched_entity *entity)
 {
-	if (!xa_empty(&job->dependencies))
-		return xa_erase(&job->dependencies, job->last_dependency++);
+	struct dma_fence *fence;
+	unsigned long dep_index;
+
+	if (!dep_is_native(job, job->last_dependency)) {
+		fence = xa_erase(&job->dependencies, job->last_dependency++);
+		if (fence)
+			return fence;
+	}
+
+	xa_for_each_start(&job->dependencies, dep_index, fence,
+			  job->last_dependency) {
+		/*
+		 * Encountered first native dependency. Since these were
+		 * previously sorted to the end of the array in
+		 * drm_sched_sort_native_deps(), all remaining entries
+		 * will be native too, so we can just iterate through
+		 * them.
+		 *
+		 * Native entries cannot be erased, as they need to be
+		 * accessed by the driver's native scheduler.
+		 *
+		 * If the native fence is a drm_sched_fence object, we
+		 * ensure the job has been submitted, so that
+		 * drm_sched_fence::parent points to a valid dma_fence.
+		 */
+		struct drm_sched_fence *s_fence = to_drm_sched_fence(fence);
+		struct dma_fence *scheduled_fence =
+			s_fence ? dma_fence_get_rcu(&s_fence->scheduled) : NULL;
+
+		job->last_dependency = dep_index + 1;
+
+		if (scheduled_fence)
+			return scheduled_fence;
+	}
 
 	if (job->sched->ops->prepare_job)
 		return job->sched->ops->prepare_job(job, entity);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 214364fccb71..08dcc33ec690 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -643,6 +643,92 @@  int drm_sched_job_init(struct drm_sched_job *job,
 }
 EXPORT_SYMBOL(drm_sched_job_init);
 
+/**
+ * drm_sched_sort_native_deps - relocates all native dependencies to the
+ * tail end of @job->dependencies.
+ * @job: target scheduler job.
+ *
+ * Starts by marking all of the native dependencies, then, in a quick-sort-
+ * like manner, swaps entries using head and tail index counters. Only a
+ * single partition pass is required, as there are only two kinds of entry.
+ */
+static void drm_sched_sort_native_deps(struct drm_sched_job *job)
+{
+	struct dma_fence *entry, *head = NULL, *tail = NULL;
+	unsigned long h = 0, t = 0, native_dep_count = 0;
+	XA_STATE(xas_head, &job->dependencies, 0);
+	XA_STATE(xas_tail, &job->dependencies, 0);
+	bool already_sorted = true;
+
+	if (!job->sched->ops->dependency_is_native)
+		/* Driver doesn't support native deps. */
+		return;
+
+	/* Mark all the native dependencies as we walk xas_tail to the end. */
+	xa_lock(&job->dependencies);
+	xas_for_each(&xas_tail, entry, ULONG_MAX) {
+		/* Keep track of the index. */
+		t++;
+
+		if (job->sched->ops->dependency_is_native(entry)) {
+			xas_set_mark(&xas_tail, XA_MARK_0);
+			native_dep_count++;
+		} else if (native_dep_count) {
+			/*
+			 * As a native dep has been encountered before, we can
+			 * infer the array is not already sorted.
+			 */
+			already_sorted = false;
+		}
+	}
+	xa_unlock(&job->dependencies);
+
+	if (already_sorted)
+		return;
+
+	/* xas_tail and t are now at the end of the array. */
+	xa_lock(&job->dependencies);
+	while (h < t) {
+		if (!head) {
+			/* Find a marked entry. */
+			if (xas_get_mark(&xas_head, XA_MARK_0)) {
+				head = xas_load(&xas_head);
+			} else {
+				xas_next(&xas_head);
+				h++;
+			}
+		}
+		if (!tail) {
+			/* Find an unmarked entry. */
+			if (xas_get_mark(&xas_tail, XA_MARK_0)) {
+				xas_prev(&xas_tail);
+				t--;
+			} else {
+				tail = xas_load(&xas_tail);
+			}
+		}
+		if (head && tail) {
+			/*
+			 * Swap!
+			 * These stores never allocate, since both entries
+			 * already exist, hence they cannot fail.
+			 */
+			xas_store(&xas_head, tail);
+			xas_store(&xas_tail, head);
+
+			/* Also swap the mark. */
+			xas_clear_mark(&xas_head, XA_MARK_0);
+			xas_set_mark(&xas_tail, XA_MARK_0);
+
+			head = NULL;
+			tail = NULL;
+			h++;
+			t--;
+		}
+	}
+	xa_unlock(&job->dependencies);
+}
+
 /**
  * drm_sched_job_arm - arm a scheduler job for execution
  * @job: scheduler job to arm
@@ -669,6 +755,7 @@  void drm_sched_job_arm(struct drm_sched_job *job)
 	job->s_priority = entity->rq - sched->sched_rq;
 	job->id = atomic64_inc_return(&sched->job_id_count);
 
+	drm_sched_sort_native_deps(job);
 	drm_sched_fence_init(job->s_fence, job->entity);
 }
 EXPORT_SYMBOL(drm_sched_job_arm);
@@ -1045,6 +1132,15 @@  static int drm_sched_main(void *param)
 		trace_drm_run_job(sched_job, entity);
 		fence = sched->ops->run_job(sched_job);
 		complete_all(&entity->entity_idle);
+
+		/* We need to set the parent before signaling the scheduled
+		 * fence if we want native dependency to work properly. If we
+		 * don't, the driver might try to access the parent before
+		 * it's set.
+		 */
+		if (!IS_ERR_OR_NULL(fence))
+			drm_sched_fence_set_parent(s_fence, fence);
+
 		drm_sched_fence_scheduled(s_fence);
 
 		if (!IS_ERR_OR_NULL(fence)) {
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 898608f87b96..dca6be35e517 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -455,6 +455,17 @@  struct drm_sched_backend_ops {
          * and it's time to clean it up.
 	 */
 	void (*free_job)(struct drm_sched_job *sched_job);
+
+	/**
+	 * @dependency_is_native: When arming a job for this scheduler, this
+	 * hook is called for each of the job's dependencies to determine
+	 * whether it is a native dependency. A native dependency is still
+	 * awaited and cleaned up when the job is cancelled, but waiting is
+	 * otherwise delegated to the driver's native scheduler.
+	 *
+	 * Optional - implies support for native dependencies.
+	 */
+	bool (*dependency_is_native)(struct dma_fence *fence);
 };
 
 /**