[05/41] drm/i915: Restructure priority inheritance

Message ID	20210125140136.10494-5-chris@chris-wilson.co.uk (mailing list archive)
State	New, archived
From: Chris Wilson <chris@chris-wilson.co.uk>
To: intel-gfx@lists.freedesktop.org
Cc: thomas.hellstrom@intel.com, Chris Wilson <chris@chris-wilson.co.uk>
Date: Mon, 25 Jan 2021 14:01:00 +0000
Subject: [Intel-gfx] [PATCH 05/41] drm/i915: Restructure priority inheritance
In-Reply-To: <20210125140136.10494-1-chris@chris-wilson.co.uk>

Chris Wilson Jan. 25, 2021, 2:01 p.m. UTC

In anticipation of wanting to be able to call pi from underneath an
engine's active.lock, rework the priority inheritance to primarily work
along an engine's priority queue, delegating any other engine that the
chain may traverse to a worker. This reduces the global spinlock from
governing the multi-engine priority inheritance depth-first search, to a
smaller lock on each engine around a single list on that engine.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c    |   2 +
 drivers/gpu/drm/i915/gt/intel_engine_types.h |   3 +
 drivers/gpu/drm/i915/i915_scheduler.c        | 346 ++++++++++++-------
 drivers/gpu/drm/i915/i915_scheduler.h        |   2 +
 drivers/gpu/drm/i915/i915_scheduler_types.h  |  19 +-
 5 files changed, 234 insertions(+), 138 deletions(-)
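
To picture the delegation described in the commit message, here is a minimal userspace sketch of a lock-free pending list, written with C11 atomics rather than the kernel's xchg()/try_cmpxchg(). The names pending, pending_push, pending_take_all and drain are hypothetical and not part of the patch. The sketch assumes that whoever pushes onto an empty list is the one responsible for kicking the worker.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a request queued for a cross-engine bump. */
struct node {
	struct node *next;
	int prio;
};

/* Hypothetical stand-in for the per-engine pending list. */
struct pending {
	_Atomic(struct node *) head;
};

/*
 * Push n onto the front of the list. Returns true if the list was empty
 * beforehand, which is the caller's cue to schedule the worker once.
 */
static bool pending_push(struct pending *p, struct node *n)
{
	struct node *first = atomic_load(&p->head);

	do {
		n->next = first; /* first is refreshed on every failed CAS */
	} while (!atomic_compare_exchange_weak(&p->head, &first, n));

	return first == NULL;
}

/* Detach the whole list in one atomic exchange; the worker owns the result. */
static struct node *pending_take_all(struct pending *p)
{
	return atomic_exchange(&p->head, (struct node *)NULL);
}

/* Walk a detached list: returns its length and records the highest prio. */
static int drain(struct node *n, int *max_prio)
{
	int count = 0;

	assert(max_prio != NULL);
	for (; n; n = n->next) {
		if (n->prio > *max_prio)
			*max_prio = n->prio;
		count++;
	}
	return count;
}
```

A caller would schedule the worker only when pending_push() returns true. The worker then calls pending_take_all() and walks the detached list without holding any lock.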

Tvrtko Ursulin Jan. 26, 2021, 11:12 a.m. UTC | #1

On 25/01/2021 14:01, Chris Wilson wrote:
> In anticipation of wanting to be able to call pi from underneath an
> engine's active.lock, rework the priority inheritance to primarily work
> along an engine's priority queue, delegating any other engine that the
> chain may traverse to a worker. This reduces the global spinlock from
> governing the multi-engine priority inheritance depth-first search, to a
> smaller lock on each engine around a single list on that engine.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>   drivers/gpu/drm/i915/gt/intel_engine_cs.c    |   2 +
>   drivers/gpu/drm/i915/gt/intel_engine_types.h |   3 +
>   drivers/gpu/drm/i915/i915_scheduler.c        | 346 ++++++++++++-------
>   drivers/gpu/drm/i915/i915_scheduler.h        |   2 +
>   drivers/gpu/drm/i915/i915_scheduler_types.h  |  19 +-
>   5 files changed, 234 insertions(+), 138 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index 7e580d3ac58f..3bfd3853c0e9 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -576,6 +576,8 @@ void intel_engine_init_execlists(struct intel_engine_cs *engine)
>   
>   	execlists->queue_priority_hint = INT_MIN;
>   	execlists->queue = RB_ROOT_CACHED;
> +
> +	i915_sched_init_ipi(&execlists->ipi);
>   }
>   
>   static void cleanup_status_page(struct intel_engine_cs *engine)
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> index 27cb3dc0233b..9105b7769635 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> @@ -20,6 +20,7 @@
>   #include "i915_gem.h"
>   #include "i915_pmu.h"
>   #include "i915_priolist_types.h"
> +#include "i915_scheduler_types.h"
>   #include "i915_selftest.h"
>   #include "intel_breadcrumbs_types.h"
>   #include "intel_sseu.h"
> @@ -257,6 +258,8 @@ struct intel_engine_execlists {
>   	struct rb_root_cached queue;
>   	struct rb_root_cached virtual;
>   
> +	struct i915_sched_ipi ipi;
> +
>   	/**
>   	 * @csb_write: control register for Context Switch buffer
>   	 *
> diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
> index 96fe1e22dad7..0ecf71a6afd4 100644
> --- a/drivers/gpu/drm/i915/i915_scheduler.c
> +++ b/drivers/gpu/drm/i915/i915_scheduler.c
> @@ -17,8 +17,6 @@ static struct i915_global_scheduler {
>   	struct kmem_cache *slab_priorities;
>   } global;
>   
> -static DEFINE_SPINLOCK(schedule_lock);
> -
>   static struct i915_sched_node *node_get(struct i915_sched_node *node)
>   {
>   	i915_request_get(container_of(node, struct i915_request, sched));
> @@ -30,17 +28,116 @@ static void node_put(struct i915_sched_node *node)
>   	i915_request_put(container_of(node, struct i915_request, sched));
>   }
>   
> +static inline int rq_prio(const struct i915_request *rq)
> +{
> +	return READ_ONCE(rq->sched.attr.priority);
> +}
> +
> +static int ipi_get_prio(struct i915_request *rq)
> +{
> +	if (READ_ONCE(rq->sched.ipi_priority) == I915_PRIORITY_INVALID)
> +		return I915_PRIORITY_INVALID;
> +
> +	return xchg(&rq->sched.ipi_priority, I915_PRIORITY_INVALID);
> +}
> +
> +static void ipi_schedule(struct work_struct *wrk)
> +{
> +	struct i915_sched_ipi *ipi = container_of(wrk, typeof(*ipi), work);
> +	struct i915_request *rq = xchg(&ipi->list, NULL);
> +
> +	do {
> +		struct i915_request *rn = xchg(&rq->sched.ipi_link, NULL);
> +		int prio;
> +
> +		prio = ipi_get_prio(rq);
> +
> +		/*
> +		 * For cross-engine scheduling to work we rely on one of two
> +		 * things:
> +		 *
> +		 * a) The requests are using dma-fence fences and so will not
> +		 * be scheduled until the previous engine is completed, and
> +		 * so we cannot cross back onto the original engine and end up
> +		 * queuing an earlier request after the first (due to the
> +		 * interrupted DFS).
> +		 *
> +		 * b) The requests are using semaphores and so may be already
> +		 * be in flight, in which case if we cross back onto the same
> +		 * engine, we will already have put the interrupted DFS into
> +		 * the priolist, and the continuation will now be queued
> +		 * afterwards [out-of-order]. However, since we are using
> +		 * semaphores in this case, we also perform yield on semaphore
> +		 * waits and so will reorder the requests back into the correct
> +		 * sequence. This occurrence (of promoting a request chain
> +		 * that crosses the engines using semaphores back unto itself)
> +		 * should be unlikely enough that it probably does not matter...
> +		 */
> +		local_bh_disable();
> +		i915_request_set_priority(rq, prio);
> +		local_bh_enable();

Is it that important, and wouldn't the priority order be restored eventually
due to timeslicing?

> +
> +		i915_request_put(rq);
> +		rq = ptr_mask_bits(rn, 1);
> +	} while (rq);
> +}
> +
> +void i915_sched_init_ipi(struct i915_sched_ipi *ipi)
> +{
> +	INIT_WORK(&ipi->work, ipi_schedule);
> +	ipi->list = NULL;
> +}
> +
> +static void __ipi_add(struct i915_request *rq)
> +{
> +#define STUB ((struct i915_request *)1)
> +	struct intel_engine_cs *engine = READ_ONCE(rq->engine);
> +	struct i915_request *first;
> +
> +	if (!i915_request_get_rcu(rq))
> +		return;
> +
> +	if (__i915_request_is_complete(rq) ||
> +	    cmpxchg(&rq->sched.ipi_link, NULL, STUB)) { /* already queued */
> +		i915_request_put(rq);
> +		return;
> +	}
> +
> +	first = READ_ONCE(engine->execlists.ipi.list);
> +	do
> +		rq->sched.ipi_link = ptr_pack_bits(first, 1, 1);
> +	while (!try_cmpxchg(&engine->execlists.ipi.list, &first, rq));
> +
> +	if (!first)
> +		queue_work(system_unbound_wq, &engine->execlists.ipi.work);
> +}
> +
> +/*
> + * Virtual engines complicate acquiring the engine timeline lock,
> + * as their rq->engine pointer is not stable until under that
> + * engine lock. The simple ploy we use is to take the lock then
> + * check that the rq still belongs to the newly locked engine.
> + */
> +#define lock_engine_irqsave(rq, flags) ({ \
> +	struct i915_request * const rq__ = (rq); \
> +	struct intel_engine_cs *engine__ = READ_ONCE(rq__->engine); \
> +\
> +	spin_lock_irqsave(&engine__->active.lock, (flags)); \
> +	while (engine__ != READ_ONCE((rq__)->engine)) { \
> +		spin_unlock(&engine__->active.lock); \
> +		engine__ = READ_ONCE(rq__->engine); \
> +		spin_lock(&engine__->active.lock); \
> +	} \
> +\
> +	engine__; \
> +})
> +
>   static const struct i915_request *
>   node_to_request(const struct i915_sched_node *node)
>   {
>   	return container_of(node, const struct i915_request, sched);
>   }
>   
> -static inline bool node_started(const struct i915_sched_node *node)
> -{
> -	return i915_request_started(node_to_request(node));
> -}
> -
>   static inline bool node_signaled(const struct i915_sched_node *node)
>   {
>   	return i915_request_completed(node_to_request(node));
> @@ -137,42 +234,6 @@ void __i915_priolist_free(struct i915_priolist *p)
>   	kmem_cache_free(global.slab_priorities, p);
>   }
>   
> -struct sched_cache {
> -	struct list_head *priolist;
> -};
> -
> -static struct intel_engine_cs *
> -sched_lock_engine(const struct i915_sched_node *node,
> -		  struct intel_engine_cs *locked,
> -		  struct sched_cache *cache)
> -{
> -	const struct i915_request *rq = node_to_request(node);
> -	struct intel_engine_cs *engine;
> -
> -	GEM_BUG_ON(!locked);
> -
> -	/*
> -	 * Virtual engines complicate acquiring the engine timeline lock,
> -	 * as their rq->engine pointer is not stable until under that
> -	 * engine lock. The simple ploy we use is to take the lock then
> -	 * check that the rq still belongs to the newly locked engine.
> -	 */
> -	while (locked != (engine = READ_ONCE(rq->engine))) {
> -		spin_unlock(&locked->active.lock);
> -		memset(cache, 0, sizeof(*cache));
> -		spin_lock(&engine->active.lock);
> -		locked = engine;
> -	}
> -
> -	GEM_BUG_ON(locked != engine);
> -	return locked;
> -}
> -
> -static inline int rq_prio(const struct i915_request *rq)
> -{
> -	return rq->sched.attr.priority;
> -}
> -
>   static inline bool need_preempt(int prio, int active)
>   {
>   	/*
> @@ -198,19 +259,17 @@ static void kick_submission(struct intel_engine_cs *engine,
>   	if (prio <= engine->execlists.queue_priority_hint)
>   		return;
>   
> -	rcu_read_lock();
> -
>   	/* Nothing currently active? We're overdue for a submission! */
>   	inflight = execlists_active(&engine->execlists);
>   	if (!inflight)
> -		goto unlock;
> +		return;
>   
>   	/*
>   	 * If we are already the currently executing context, don't
>   	 * bother evaluating if we should preempt ourselves.
>   	 */
>   	if (inflight->context == rq->context)
> -		goto unlock;
> +		return;
>   
>   	ENGINE_TRACE(engine,
>   		     "bumping queue-priority-hint:%d for rq:%llx:%lld, inflight:%llx:%lld prio %d\n",
> @@ -222,30 +281,28 @@ static void kick_submission(struct intel_engine_cs *engine,
>   	engine->execlists.queue_priority_hint = prio;
>   	if (need_preempt(prio, rq_prio(inflight)))
>   		tasklet_hi_schedule(&engine->execlists.tasklet);
> -
> -unlock:
> -	rcu_read_unlock();
>   }
>   
> -static void __i915_schedule(struct i915_sched_node *node, int prio)
> +static void ipi_priority(struct i915_request *rq, int prio)
>   {
> -	struct intel_engine_cs *engine;
> -	struct i915_dependency *dep, *p;
> -	struct i915_dependency stack;
> -	struct sched_cache cache;
> +	int old = READ_ONCE(rq->sched.ipi_priority);
> +
> +	do {
> +		if (prio <= old)
> +			return;
> +	} while (!try_cmpxchg(&rq->sched.ipi_priority, &old, prio));
> +
> +	__ipi_add(rq);
> +}
> +
> +static void __i915_request_set_priority(struct i915_request *rq, int prio)
> +{
> +	struct intel_engine_cs *engine = rq->engine;
> +	struct i915_request *rn;
> +	struct list_head *plist;
>   	LIST_HEAD(dfs);
>   
> -	/* Needed in order to use the temporary link inside i915_dependency */
> -	lockdep_assert_held(&schedule_lock);
> -	GEM_BUG_ON(prio == I915_PRIORITY_INVALID);
> -
> -	if (node_signaled(node))
> -		return;
> -
> -	prio = max(prio, node->attr.priority);
> -
> -	stack.signaler = node;
> -	list_add(&stack.dfs_link, &dfs);
> +	list_add(&rq->sched.dfs, &dfs);
>   
>   	/*
>   	 * Recursively bump all dependent priorities to match the new request.
> @@ -265,66 +322,41 @@ static void __i915_schedule(struct i915_sched_node *node, int prio)
>   	 * end result is a topological list of requests in reverse order, the
>   	 * last element in the list is the request we must execute first.
>   	 */
> -	list_for_each_entry(dep, &dfs, dfs_link) {
> -		struct i915_sched_node *node = dep->signaler;
> +	list_for_each_entry(rq, &dfs, sched.dfs) {
> +		struct i915_dependency *p;
>   
> -		/* If we are already flying, we know we have no signalers */
> -		if (node_started(node))
> -			continue;
> +		/* Also release any children on this engine that are ready */
> +		GEM_BUG_ON(rq->engine != engine);
>   
> -		/*
> -		 * Within an engine, there can be no cycle, but we may
> -		 * refer to the same dependency chain multiple times
> -		 * (redundant dependencies are not eliminated) and across
> -		 * engines.
> -		 */
> -		list_for_each_entry(p, &node->signalers_list, signal_link) {
> -			GEM_BUG_ON(p == dep); /* no cycles! */
> +		for_each_signaler(p, rq) {
> +			struct i915_request *s =
> +				container_of(p->signaler, typeof(*s), sched);
>   
> -			if (node_signaled(p->signaler))
> +			GEM_BUG_ON(s == rq);
> +
> +			if (rq_prio(s) >= prio)
>   				continue;
>   
> -			if (prio > READ_ONCE(p->signaler->attr.priority))
> -				list_move_tail(&p->dfs_link, &dfs);
> +			if (__i915_request_is_complete(s))
> +				continue;
> +
> +			if (s->engine != rq->engine) {
> +				ipi_priority(s, prio);
> +				continue;
> +			}
> +
> +			list_move_tail(&s->sched.dfs, &dfs);
>   		}
>   	}
>   
> -	/*
> -	 * If we didn't need to bump any existing priorities, and we haven't
> -	 * yet submitted this request (i.e. there is no potential race with
> -	 * execlists_submit_request()), we can set our own priority and skip
> -	 * acquiring the engine locks.
> -	 */
> -	if (node->attr.priority == I915_PRIORITY_INVALID) {
> -		GEM_BUG_ON(!list_empty(&node->link));
> -		node->attr.priority = prio;
> +	plist = i915_sched_lookup_priolist(engine, prio);
>   
> -		if (stack.dfs_link.next == stack.dfs_link.prev)
> -			return;
> +	/* Fifo and depth-first replacement ensure our deps execute first */
> +	list_for_each_entry_safe_reverse(rq, rn, &dfs, sched.dfs) {
> +		GEM_BUG_ON(rq->engine != engine);
>   
> -		__list_del_entry(&stack.dfs_link);
> -	}
> -
> -	memset(&cache, 0, sizeof(cache));
> -	engine = node_to_request(node)->engine;
> -	spin_lock(&engine->active.lock);
> -
> -	/* Fifo and depth-first replacement ensure our deps execute before us */
> -	engine = sched_lock_engine(node, engine, &cache);
> -	list_for_each_entry_safe_reverse(dep, p, &dfs, dfs_link) {
> -		INIT_LIST_HEAD(&dep->dfs_link);
> -
> -		node = dep->signaler;
> -		engine = sched_lock_engine(node, engine, &cache);
> -		lockdep_assert_held(&engine->active.lock);
> -
> -		/* Recheck after acquiring the engine->timeline.lock */
> -		if (prio <= node->attr.priority || node_signaled(node))
> -			continue;
> -
> -		GEM_BUG_ON(node_to_request(node)->engine != engine);
> -
> -		WRITE_ONCE(node->attr.priority, prio);
> +		INIT_LIST_HEAD(&rq->sched.dfs);
> +		WRITE_ONCE(rq->sched.attr.priority, prio);
>   
>   		/*
>   		 * Once the request is ready, it will be placed into the
> @@ -334,32 +366,75 @@ static void __i915_schedule(struct i915_sched_node *node, int prio)
>   		 * any preemption required, be dealt with upon submission.
>   		 * See engine->submit_request()
>   		 */
> -		if (list_empty(&node->link))
> +		if (!i915_request_is_ready(rq))
>   			continue;
>   
> -		if (i915_request_in_priority_queue(node_to_request(node))) {
> -			if (!cache.priolist)
> -				cache.priolist =
> -					i915_sched_lookup_priolist(engine,
> -								   prio);
> -			list_move_tail(&node->link, cache.priolist);
> -		}
> +		if (i915_request_in_priority_queue(rq))
> +			list_move_tail(&rq->sched.link, plist);
>   
> -		/* Defer (tasklet) submission until after all of our updates. */
> -		kick_submission(engine, node_to_request(node), prio);
> +		/* Defer (tasklet) submission until after all updates. */
> +		kick_submission(engine, rq, prio);
>   	}
> -
> -	spin_unlock(&engine->active.lock);
>   }
>   
>   void i915_request_set_priority(struct i915_request *rq, int prio)
>   {
> -	if (!intel_engine_has_scheduler(rq->engine))
> +	struct intel_engine_cs *engine;
> +	unsigned long flags;
> +
> +	if (prio <= rq_prio(rq))
>   		return;
>   
> -	spin_lock_irq(&schedule_lock);
> -	__i915_schedule(&rq->sched, prio);
> -	spin_unlock_irq(&schedule_lock);
> +	/*
> +	 * If we are setting the priority before being submitted, see if we
> +	 * can quickly adjust our own priority in-situ and avoid taking
> +	 * the contended engine->active.lock. If we need priority inheritance,
> +	 * take the slow route.
> +	 */
> +	if (rq_prio(rq) == I915_PRIORITY_INVALID) {
> +		struct i915_dependency *p;
> +
> +		rcu_read_lock();
> +		for_each_signaler(p, rq) {
> +			struct i915_request *s =
> +				container_of(p->signaler, typeof(*s), sched);
> +
> +			if (rq_prio(s) >= prio)
> +				continue;
> +
> +			if (__i915_request_is_complete(s))
> +				continue;
> +
> +			break;
> +		}
> +		rcu_read_unlock();

Exit this loop with a first lower priority incomplete signaler. What 
does the block below then do? Feels like it needs a comment.

> +
> +		if (&p->signal_link == &rq->sched.signalers_list &&
> +		    cmpxchg(&rq->sched.attr.priority,
> +			    I915_PRIORITY_INVALID,
> +			    prio) == I915_PRIORITY_INVALID)
> +			return;
> +	}
> +
> +	engine = lock_engine_irqsave(rq, flags);
> +	if (prio <= rq_prio(rq))
> +		goto unlock;
> +
> +	if (__i915_request_is_complete(rq))
> +		goto unlock;
> +
> +	if (!intel_engine_has_scheduler(engine)) {
> +		rq->sched.attr.priority = prio;
> +		goto unlock;
> +	}
> +
> +	rcu_read_lock();
> +	__i915_request_set_priority(rq, prio);
> +	rcu_read_unlock();
> +	GEM_BUG_ON(rq_prio(rq) != prio);
> +
> +unlock:
> +	spin_unlock_irqrestore(&engine->active.lock, flags);
>   }
>   
>   void i915_sched_node_init(struct i915_sched_node *node)
> @@ -369,6 +444,9 @@ void i915_sched_node_init(struct i915_sched_node *node)
>   	INIT_LIST_HEAD(&node->signalers_list);
>   	INIT_LIST_HEAD(&node->waiters_list);
>   	INIT_LIST_HEAD(&node->link);
> +	INIT_LIST_HEAD(&node->dfs);
> +
> +	node->ipi_link = NULL;
>   
>   	i915_sched_node_reinit(node);
>   }
> @@ -379,6 +457,9 @@ void i915_sched_node_reinit(struct i915_sched_node *node)
>   	node->semaphores = 0;
>   	node->flags = 0;
>   
> +	GEM_BUG_ON(node->ipi_link);
> +	node->ipi_priority = I915_PRIORITY_INVALID;
> +
>   	GEM_BUG_ON(!list_empty(&node->signalers_list));
>   	GEM_BUG_ON(!list_empty(&node->waiters_list));
>   	GEM_BUG_ON(!list_empty(&node->link));
> @@ -414,7 +495,6 @@ bool __i915_sched_node_add_dependency(struct i915_sched_node *node,
>   	spin_lock(&signal->lock);
>   
>   	if (!node_signaled(signal)) {
> -		INIT_LIST_HEAD(&dep->dfs_link);
>   		dep->signaler = signal;
>   		dep->waiter = node_get(node);
>   		dep->flags = flags;
> diff --git a/drivers/gpu/drm/i915/i915_scheduler.h b/drivers/gpu/drm/i915/i915_scheduler.h
> index a045be784c67..5be7f90e7896 100644
> --- a/drivers/gpu/drm/i915/i915_scheduler.h
> +++ b/drivers/gpu/drm/i915/i915_scheduler.h
> @@ -35,6 +35,8 @@ int i915_sched_node_add_dependency(struct i915_sched_node *node,
>   
>   void i915_sched_node_retire(struct i915_sched_node *node);
>   
> +void i915_sched_init_ipi(struct i915_sched_ipi *ipi);
> +
>   void i915_request_set_priority(struct i915_request *request, int prio);
>   
>   struct list_head *
> diff --git a/drivers/gpu/drm/i915/i915_scheduler_types.h b/drivers/gpu/drm/i915/i915_scheduler_types.h
> index 623bf41fcf35..5a84d59134ee 100644
> --- a/drivers/gpu/drm/i915/i915_scheduler_types.h
> +++ b/drivers/gpu/drm/i915/i915_scheduler_types.h
> @@ -8,8 +8,8 @@
>   #define _I915_SCHEDULER_TYPES_H_
>   
>   #include <linux/list.h>
> +#include <linux/workqueue.h>
>   
> -#include "gt/intel_engine_types.h"
>   #include "i915_priolist_types.h"
>   
>   struct drm_i915_private;
> @@ -61,13 +61,23 @@ struct i915_sched_attr {
>    */
>   struct i915_sched_node {
>   	spinlock_t lock; /* protect the lists */
> +
>   	struct list_head signalers_list; /* those before us, we depend upon */
>   	struct list_head waiters_list; /* those after us, they depend upon us */
> -	struct list_head link;
> +	struct list_head link; /* guarded by engine->active.lock */
> +	struct list_head dfs; /* guarded by engine->active.lock */
>   	struct i915_sched_attr attr;
> -	unsigned int flags;
> +	unsigned long flags;
>   #define I915_SCHED_HAS_EXTERNAL_CHAIN	BIT(0)
> -	intel_engine_mask_t semaphores;
> +	unsigned long semaphores;
> +
> +	struct i915_request *ipi_link;
> +	int ipi_priority;
> +};
> +
> +struct i915_sched_ipi {
> +	struct i915_request *list;
> +	struct work_struct work;
>   };
>   
>   struct i915_dependency {
> @@ -75,7 +85,6 @@ struct i915_dependency {
>   	struct i915_sched_node *waiter;
>   	struct list_head signal_link;
>   	struct list_head wait_link;
> -	struct list_head dfs_link;
>   	struct rcu_head rcu;
>   	unsigned long flags;
>   #define I915_DEPENDENCY_ALLOC		BIT(0)
> 

Regards,

Tvrtko

Chris Wilson Jan. 26, 2021, 11:30 a.m. UTC | #2

Quoting Tvrtko Ursulin (2021-01-26 11:12:53)
> 
> 
> On 25/01/2021 14:01, Chris Wilson wrote:
> > +static void ipi_schedule(struct work_struct *wrk)
> > +{
> > +     struct i915_sched_ipi *ipi = container_of(wrk, typeof(*ipi), work);
> > +     struct i915_request *rq = xchg(&ipi->list, NULL);
> > +
> > +     do {
> > +             struct i915_request *rn = xchg(&rq->sched.ipi_link, NULL);
> > +             int prio;
> > +
> > +             prio = ipi_get_prio(rq);
> > +
> > +             /*
> > +              * For cross-engine scheduling to work we rely on one of two
> > +              * things:
> > +              *
> > +              * a) The requests are using dma-fence fences and so will not
> > +              * be scheduled until the previous engine is completed, and
> > +              * so we cannot cross back onto the original engine and end up
> > +              * queuing an earlier request after the first (due to the
> > +              * interrupted DFS).
> > +              *
> > +              * b) The requests are using semaphores and so may be already
> > +              * be in flight, in which case if we cross back onto the same
> > +              * engine, we will already have put the interrupted DFS into
> > +              * the priolist, and the continuation will now be queued
> > +              * afterwards [out-of-order]. However, since we are using
> > +              * semaphores in this case, we also perform yield on semaphore
> > +              * waits and so will reorder the requests back into the correct
> > +              * sequence. This occurrence (of promoting a request chain
> > +              * that crosses the engines using semaphores back unto itself)
> > +              * should be unlikely enough that it probably does not matter...
> > +              */
> > +             local_bh_disable();
> > +             i915_request_set_priority(rq, prio);
> > +             local_bh_enable();
> 
> Is it that important, and wouldn't the priority order be restored eventually
> due to timeslicing?

There would be a window in which we executed userspace code
out-of-order. That's enough to scare me! However, for our PI dependency
chains it should not matter as the only time we do submit out-of-order,
we are stuck on _our_ semaphore that cannot be resolved until the
requests are back in-order.

I've tried to trick this into causing problems with the
i915_selftest/igt_schedule_cycle and gem_exec_schedule/noreorder.
Fortunately for my sanity, neither test has caught any problems.

This is the handwaving part of removing the global lock.

> > +     /*
> > +      * If we are setting the priority before being submitted, see if we
> > +      * can quickly adjust our own priority in-situ and avoid taking
> > +      * the contended engine->active.lock. If we need priority inheritance,
> > +      * take the slow route.
> > +      */
> > +     if (rq_prio(rq) == I915_PRIORITY_INVALID) {
> > +             struct i915_dependency *p;
> > +
> > +             rcu_read_lock();
> > +             for_each_signaler(p, rq) {
> > +                     struct i915_request *s =
> > +                             container_of(p->signaler, typeof(*s), sched);
> > +
> > +                     if (rq_prio(s) >= prio)
> > +                             continue;
> > +
> > +                     if (__i915_request_is_complete(s))
> > +                             continue;
> > +
> > +                     break;
> > +             }
> > +             rcu_read_unlock();
> 
> Exit this loop with a first lower priority incomplete signaler. What 
> does the block below then do? Feels like it needs a comment.

I thought I had sufficiently explained that in the comment above.

/* Update priority in place if no PI required */
> > +             if (&p->signal_link == &rq->sched.signalers_list &&
> > +                 cmpxchg(&rq->sched.attr.priority,
> > +                         I915_PRIORITY_INVALID,
> > +                         prio) == I915_PRIORITY_INVALID)
> > +                     return;

It could do a few more tricks to change the priority in-place a second
time, but I did not think that would be frequent enough to matter.
Whereas we always adjust the priority from INVALID once before
submission, and avoiding taking the lock then does make a difference to
the profiles.
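
A minimal userspace sketch of that lockless first assignment, using C11 atomics in place of the kernel's cmpxchg(). PRIO_INVALID, struct req and try_set_initial_prio() are hypothetical names for illustration only, and needs_pi stands in for the outcome of the signaler walk:

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define PRIO_INVALID INT_MIN /* hypothetical sentinel: priority not yet assigned */

/* Hypothetical stand-in for a request carrying a scheduling priority. */
struct req {
	_Atomic int prio;
};

/*
 * Try to assign the first priority without taking any lock. This succeeds
 * only if no signaler needs a bump (needs_pi is false) and the priority is
 * still unassigned. On failure the caller falls back to the locked path.
 */
static bool try_set_initial_prio(struct req *rq, int prio, bool needs_pi)
{
	int expected = PRIO_INVALID;

	assert(prio != PRIO_INVALID);

	if (needs_pi)
		return false;

	/* Only one racing caller can move the value away from the sentinel. */
	return atomic_compare_exchange_strong(&rq->prio, &expected, prio);
}
```

The second and later priority changes always fail the compare-and-exchange here, which corresponds to taking the slow route under the engine lock.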
-Chris

Tvrtko Ursulin Jan. 26, 2021, 11:40 a.m. UTC | #3

On 26/01/2021 11:30, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2021-01-26 11:12:53)
>>
>>
>> On 25/01/2021 14:01, Chris Wilson wrote:
>>> +static void ipi_schedule(struct work_struct *wrk)
>>> +{
>>> +     struct i915_sched_ipi *ipi = container_of(wrk, typeof(*ipi), work);
>>> +     struct i915_request *rq = xchg(&ipi->list, NULL);
>>> +
>>> +     do {
>>> +             struct i915_request *rn = xchg(&rq->sched.ipi_link, NULL);
>>> +             int prio;
>>> +
>>> +             prio = ipi_get_prio(rq);
>>> +
>>> +             /*
>>> +              * For cross-engine scheduling to work we rely on one of two
>>> +              * things:
>>> +              *
>>> +              * a) The requests are using dma-fence fences and so will not
>>> +              * be scheduled until the previous engine is completed, and
>>> +              * so we cannot cross back onto the original engine and end up
>>> +              * queuing an earlier request after the first (due to the
>>> +              * interrupted DFS).
>>> +              *
>>> +              * b) The requests are using semaphores and so may be already
>>> +              * be in flight, in which case if we cross back onto the same
>>> +              * engine, we will already have put the interrupted DFS into
>>> +              * the priolist, and the continuation will now be queued
>>> +              * afterwards [out-of-order]. However, since we are using
>>> +              * semaphores in this case, we also perform yield on semaphore
>>> +              * waits and so will reorder the requests back into the correct
>>> +              * sequence. This occurrence (of promoting a request chain
>>> +              * that crosses the engines using semaphores back unto itself)
>>> +              * should be unlikely enough that it probably does not matter...
>>> +              */
>>> +             local_bh_disable();
>>> +             i915_request_set_priority(rq, prio);
>>> +             local_bh_enable();
>>
>> Is it that important, and wouldn't the priority order be restored eventually
>> due to timeslicing?
> 
> There would be a window in which we executed userspace code
> out-of-order. That's enough to scare me! However, for our PI dependency
> chains it should not matter as the only time we do submit out-of-order,
> we are stuck on _our_ semaphore that cannot be resolved until the
> requests are back in-order.

Out of order how? Within a single timeline?! I thought that could happen only
with an incomplete view of priority inheritance, which in my mind could only
cause deadlocks (if there is no timeslicing). But really, truly out of order?

> I've tried to trick this into causing problems with the
> i915_selftest/igt_schedule_cycle and gem_exec_schedule/noreorder.
> Fortunately for my sanity, neither test has caught any problems.
> 
> This is the handwaving part of removing the global lock.
> 
>>> +     /*
>>> +      * If we are setting the priority before being submitted, see if we
>>> +      * can quickly adjust our own priority in-situ and avoid taking
>>> +      * the contended engine->active.lock. If we need priority inheritance,
>>> +      * take the slow route.
>>> +      */
>>> +     if (rq_prio(rq) == I915_PRIORITY_INVALID) {
>>> +             struct i915_dependency *p;
>>> +
>>> +             rcu_read_lock();
>>> +             for_each_signaler(p, rq) {
>>> +                     struct i915_request *s =
>>> +                             container_of(p->signaler, typeof(*s), sched);
>>> +
>>> +                     if (rq_prio(s) >= prio)
>>> +                             continue;
>>> +
>>> +                     if (__i915_request_is_complete(s))
>>> +                             continue;
>>> +
>>> +                     break;
>>> +             }
>>> +             rcu_read_unlock();
>>
>> Exit this loop with a first lower priority incomplete signaler. What
>> does the block below then do? Feels like it needs a comment.
> 
> I thought I had sufficiently explained that in the comment above.
> 
> /* Update priority in place if no PI required */
>>> +             if (&p->signal_link == &rq->sched.signalers_list &&
>>> +                 cmpxchg(&rq->sched.attr.priority,
>>> +                         I915_PRIORITY_INVALID,
>>> +                         prio) == I915_PRIORITY_INVALID)
>>> +                     return;
> 
> It could do a few more tricks to change the priority in-place a second
> time, but I did not think that would be frequent enough to matter.
> Whereas we always adjust the priority from INVALID once before
> submission, and avoiding taking the lock then does make a difference to
> the profiles.

To start with, if p is NULL or un-initialized (can be, no?) then 
relationship of &p->signal_link to &rq->sched.signalers_list escapes me.

Regards,

Tvrtko

Chris Wilson Jan. 26, 2021, 11:55 a.m. UTC | #4

Quoting Tvrtko Ursulin (2021-01-26 11:40:24)
> 
> On 26/01/2021 11:30, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2021-01-26 11:12:53)
> >>
> >>
> >> On 25/01/2021 14:01, Chris Wilson wrote:
> >>> +static void ipi_schedule(struct work_struct *wrk)
> >>> +{
> >>> +     struct i915_sched_ipi *ipi = container_of(wrk, typeof(*ipi), work);
> >>> +     struct i915_request *rq = xchg(&ipi->list, NULL);
> >>> +
> >>> +     do {
> >>> +             struct i915_request *rn = xchg(&rq->sched.ipi_link, NULL);
> >>> +             int prio;
> >>> +
> >>> +             prio = ipi_get_prio(rq);
> >>> +
> >>> +             /*
> >>> +              * For cross-engine scheduling to work we rely on one of two
> >>> +              * things:
> >>> +              *
> >>> +              * a) The requests are using dma-fence fences and so will not
> >>> +              * be scheduled until the previous engine is completed, and
> >>> +              * so we cannot cross back onto the original engine and end up
> >>> +              * queuing an earlier request after the first (due to the
> >>> +              * interrupted DFS).
> >>> +              *
> >>> +              * b) The requests are using semaphores and so may be already
> >>> +              * be in flight, in which case if we cross back onto the same
> >>> +              * engine, we will already have put the interrupted DFS into
> >>> +              * the priolist, and the continuation will now be queued
> >>> +              * afterwards [out-of-order]. However, since we are using
> >>> +              * semaphores in this case, we also perform yield on semaphore
> >>> +              * waits and so will reorder the requests back into the correct
> >>> +              * sequence. This occurrence (of promoting a request chain
> >>> +              * that crosses the engines using semaphores back unto itself)
> >>> +              * should be unlikely enough that it probably does not matter...
> >>> +              */
> >>> +             local_bh_disable();
> >>> +             i915_request_set_priority(rq, prio);
> >>> +             local_bh_enable();
> >>
> >> Is it that important and wouldn't the priority order restore eventually
> >> due timeslicing?
> > 
> > There would be a window in which we executed userspace code
> > out-of-order. That's enough to scare me! However, for our PI dependency
> > chains it should not matter as the only time we do submit out-of-order,
> > we are stuck on _our_ semaphore that cannot be resolved until the
> > requests are back in-order.
> 
> Out of order how? Within a single timeline?! I thought only with an
> incomplete view of priority inheritance, which in my mind could only
> cause deadlocks (if no timeslicing). But really really out of order?

Fences between timelines. Let's say we have 3 requests, A,B,C all with
sequential fencing (C depends on B depends on A), but B is on a
different engine to (A, C) and we are using semaphores to submit early.
If we bump the priority of C, we see it crosses the engine to B, and send
an ipi_priority, but set C to be higher priority than A. So we now
schedule C before A!

However, since C depends on B which depends on A, C is stuck on its
semaphore from B, and B is waiting for A. As soon as A is set to the
same priority as C (after a couple of ipi_priority()), we rerun the
scheduler, see that C has a semaphore-yield (or that its timeslice
eventually expired) and so run A before C, and order is restored.
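
The A, B, C scenario above can be sketched as a toy model. This is only an
illustration under loud assumptions: toy_rq, toy_blocked() and toy_pick() are
hypothetical names, and the "prefer an unblocked request at equal priority"
rule is a crude stand-in for semaphore-yield, not the driver's scheduler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified request: not the i915 structure. */
struct toy_rq {
	const char *name;
	int engine;
	int prio;
	bool done;
	const struct toy_rq *signaler;	/* NULL if there is no dependency */
};

/* A request spins on its semaphore while its signaler is incomplete. */
static bool toy_blocked(const struct toy_rq *rq)
{
	return rq->signaler && !rq->signaler->done;
}

/*
 * Pick what one engine would run next: the highest-priority pending
 * request, preferring an unblocked request over a blocked one when the
 * priorities are equal (the assumed stand-in for semaphore-yield).
 */
static const struct toy_rq *toy_pick(struct toy_rq **q, size_t n, int engine)
{
	const struct toy_rq *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		const struct toy_rq *rq = q[i];

		if (rq->engine != engine || rq->done)
			continue;

		if (!best || rq->prio > best->prio ||
		    (rq->prio == best->prio &&
		     toy_blocked(best) && !toy_blocked(rq)))
			best = rq;
	}

	return best;
}
```

With A and C on engine 0 and B on engine 1, bumping only C makes toy_pick()
return C for engine 0; once the deferred bumps have reached B and then A, it
returns A again, which is the "order is restored" step.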

> > I've tried to trick this into causing problems with the
> > i915_selftest/igt_schedule_cycle and gem_exec_schedule/noreorder.
> > Fortunately for my sanity, neither test have caught any problems.
> > 
> > This is the handwaving part of removing the global lock.
> > 
> >>> +     /*
> >>> +      * If we are setting the priority before being submitted, see if we
> >>> +      * can quickly adjust our own priority in-situ and avoid taking
> >>> +      * the contended engine->active.lock. If we need priority inheritance,
> >>> +      * take the slow route.
> >>> +      */
> >>> +     if (rq_prio(rq) == I915_PRIORITY_INVALID) {
> >>> +             struct i915_dependency *p;
> >>> +
> >>> +             rcu_read_lock();
> >>> +             for_each_signaler(p, rq) {
> >>> +                     struct i915_request *s =
> >>> +                             container_of(p->signaler, typeof(*s), sched);
> >>> +
> >>> +                     if (rq_prio(s) >= prio)
> >>> +                             continue;
> >>> +
> >>> +                     if (__i915_request_is_complete(s))
> >>> +                             continue;
> >>> +
> >>> +                     break;
> >>> +             }
> >>> +             rcu_read_unlock();
> >>
> >> Exit this loop with a first lower priority incomplete signaler. What
> >> does the block below then do? Feels like it needs a comment.
> > 
> > I thought I had sufficiently explained that in the comment above.
> > 
> > /* Update priority in place if no PI required */
> >>> +             if (&p->signal_link == &rq->sched.signalers_list &&
> >>> +                 cmpxchg(&rq->sched.attr.priority,
> >>> +                         I915_PRIORITY_INVALID,
> >>> +                         prio) == I915_PRIORITY_INVALID)
> >>> +                     return;
> > 
> > It could do a few more tricks to change the priority in-place a second
> > time, but I did not think that would be frequent enough to matter.
> > Whereas we always adjust the priority from INVALID once before
> > submission, and avoiding taking the lock then does make a difference to
> > the profiles.
> 
> To start with, if p is NULL or un-initialized (can be, no?) then 
> relationship of &p->signal_link to &rq->sched.signalers_list escapes me.

p is constrained to be a member of the signalers_list or its head.
-Chris

Tvrtko Ursulin Jan. 26, 2021, 1:15 p.m. UTC | #5

On 26/01/2021 11:55, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2021-01-26 11:40:24)
>>
>> On 26/01/2021 11:30, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2021-01-26 11:12:53)
>>>>
>>>>
>>>> On 25/01/2021 14:01, Chris Wilson wrote:
>>>>> +static void ipi_schedule(struct work_struct *wrk)
>>>>> +{
>>>>> +     struct i915_sched_ipi *ipi = container_of(wrk, typeof(*ipi), work);
>>>>> +     struct i915_request *rq = xchg(&ipi->list, NULL);
>>>>> +
>>>>> +     do {
>>>>> +             struct i915_request *rn = xchg(&rq->sched.ipi_link, NULL);
>>>>> +             int prio;
>>>>> +
>>>>> +             prio = ipi_get_prio(rq);
>>>>> +
>>>>> +             /*
>>>>> +              * For cross-engine scheduling to work we rely on one of two
>>>>> +              * things:
>>>>> +              *
>>>>> +              * a) The requests are using dma-fence fences and so will not
>>>>> +              * be scheduled until the previous engine is completed, and
>>>>> +              * so we cannot cross back onto the original engine and end up
>>>>> +              * queuing an earlier request after the first (due to the
>>>>> +              * interrupted DFS).
>>>>> +              *
>>>>> +              * b) The requests are using semaphores and so may be already
>>>>> +              * be in flight, in which case if we cross back onto the same
>>>>> +              * engine, we will already have put the interrupted DFS into
>>>>> +              * the priolist, and the continuation will now be queued
>>>>> +              * afterwards [out-of-order]. However, since we are using
>>>>> +              * semaphores in this case, we also perform yield on semaphore
>>>>> +              * waits and so will reorder the requests back into the correct
>>>>> +              * sequence. This occurrence (of promoting a request chain
>>>>> +              * that crosses the engines using semaphores back unto itself)
>>>>> +              * should be unlikely enough that it probably does not matter...
>>>>> +              */
>>>>> +             local_bh_disable();
>>>>> +             i915_request_set_priority(rq, prio);
>>>>> +             local_bh_enable();
>>>>
>>>> Is it that important and wouldn't the priority order restore eventually
>>>> due timeslicing?
>>>
>>> There would be a window in which we executed userspace code
>>> out-of-order. That's enough to scare me! However, for our PI dependency
>>> chains it should not matter as the only time we do submit out-of-order,
>>> we are stuck on _our_ semaphore that cannot be resolved until the
>>> requests are back in-order.
>>
>> Out of order how? Within a single timeline?! I thought only with an
>> incomplete view of priority inheritance, which in my mind could only
>> cause deadlocks (if no timeslicing). But really really out of order?
> 
> Fences between timelines. Let's say we have 3 requests, A,B,C all with
> sequential fencing (C depends on B depends on A), but B is on a
> different engine to (A, C) and we are using semaphores to submit early.
> If we bump the priority of C, we see it crosses the engine to B, and send
> an ipi_priority, but set C to be higher priority than A. So we now
> schedule C before A!

Yeah so different timelines, I think that's not a huge problem to start 
with. Only if things were non-preemptable.

> However, since C depends on B which depends on A, C is stuck on its
> semaphore from B, and B is waiting for A. As soon as A is set to the
> same priority as C (after a couple of ipi_priority()), we rerun the
> scheduler, see that C has a semaphore-yield (or that its timeslice
> eventually expired) and so run A before C, and order is restored.
> 
>>> I've tried to trick this into causing problems with the
>>> i915_selftest/igt_schedule_cycle and gem_exec_schedule/noreorder.
>>> Fortunately for my sanity, neither test have caught any problems.
>>>
>>> This is the handwaving part of removing the global lock.
>>>
>>>>> +     /*
>>>>> +      * If we are setting the priority before being submitted, see if we
>>>>> +      * can quickly adjust our own priority in-situ and avoid taking
>>>>> +      * the contended engine->active.lock. If we need priority inheritance,
>>>>> +      * take the slow route.
>>>>> +      */
>>>>> +     if (rq_prio(rq) == I915_PRIORITY_INVALID) {
>>>>> +             struct i915_dependency *p;
>>>>> +
>>>>> +             rcu_read_lock();
>>>>> +             for_each_signaler(p, rq) {
>>>>> +                     struct i915_request *s =
>>>>> +                             container_of(p->signaler, typeof(*s), sched);
>>>>> +
>>>>> +                     if (rq_prio(s) >= prio)
>>>>> +                             continue;
>>>>> +
>>>>> +                     if (__i915_request_is_complete(s))
>>>>> +                             continue;
>>>>> +
>>>>> +                     break;
>>>>> +             }
>>>>> +             rcu_read_unlock();
>>>>
>>>> Exit this loop with a first lower priority incomplete signaler. What
>>>> does the block below then do? Feels like it needs a comment.
>>>
>>> I thought I had sufficiently explained that in the comment above.
>>>
>>> /* Update priority in place if no PI required */
>>>>> +             if (&p->signal_link == &rq->sched.signalers_list &&
>>>>> +                 cmpxchg(&rq->sched.attr.priority,
>>>>> +                         I915_PRIORITY_INVALID,
>>>>> +                         prio) == I915_PRIORITY_INVALID)
>>>>> +                     return;
>>>
>>> It could do a few more tricks to change the priority in-place a second
>>> time, but I did not think that would be frequent enough to matter.
>>> Whereas we always adjust the priority from INVALID once before
>>> submission, and avoiding taking the lock then does make a difference to
>>> the profiles.
>>
>> To start with, if p is NULL or un-initialized (can be, no?) then
>> relationship of &p->signal_link to &rq->sched.signalers_list escapes me.
> 
> p is constrained to be a member of the signalers_list or its head.

Is it defined that list_for_each_entry exits with pos set? It is in the
implementation, but I don't know why it would have to be. Could you
change this to some form of list_empty or a descriptively named helper
for clarity?

Regards,

Tvrtko

Chris Wilson Jan. 26, 2021, 1:24 p.m. UTC | #6

Quoting Tvrtko Ursulin (2021-01-26 13:15:29)
> 
> On 26/01/2021 11:55, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2021-01-26 11:40:24)
> >>
> >> On 26/01/2021 11:30, Chris Wilson wrote:
> >>> Quoting Tvrtko Ursulin (2021-01-26 11:12:53)
> >>>>
> >>>>
> >>>> On 25/01/2021 14:01, Chris Wilson wrote:
> >>>>> +static void ipi_schedule(struct work_struct *wrk)
> >>>>> +{
> >>>>> +     struct i915_sched_ipi *ipi = container_of(wrk, typeof(*ipi), work);
> >>>>> +     struct i915_request *rq = xchg(&ipi->list, NULL);
> >>>>> +
> >>>>> +     do {
> >>>>> +             struct i915_request *rn = xchg(&rq->sched.ipi_link, NULL);
> >>>>> +             int prio;
> >>>>> +
> >>>>> +             prio = ipi_get_prio(rq);
> >>>>> +
> >>>>> +             /*
> >>>>> +              * For cross-engine scheduling to work we rely on one of two
> >>>>> +              * things:
> >>>>> +              *
> >>>>> +              * a) The requests are using dma-fence fences and so will not
> >>>>> +              * be scheduled until the previous engine is completed, and
> >>>>> +              * so we cannot cross back onto the original engine and end up
> >>>>> +              * queuing an earlier request after the first (due to the
> >>>>> +              * interrupted DFS).
> >>>>> +              *
> >>>>> +              * b) The requests are using semaphores and so may be already
> >>>>> +              * be in flight, in which case if we cross back onto the same
> >>>>> +              * engine, we will already have put the interrupted DFS into
> >>>>> +              * the priolist, and the continuation will now be queued
> >>>>> +              * afterwards [out-of-order]. However, since we are using
> >>>>> +              * semaphores in this case, we also perform yield on semaphore
> >>>>> +              * waits and so will reorder the requests back into the correct
> >>>>> +              * sequence. This occurrence (of promoting a request chain
> >>>>> +              * that crosses the engines using semaphores back unto itself)
> >>>>> +              * should be unlikely enough that it probably does not matter...
> >>>>> +              */
> >>>>> +             local_bh_disable();
> >>>>> +             i915_request_set_priority(rq, prio);
> >>>>> +             local_bh_enable();
> >>>>
> >>>> Is it that important and wouldn't the priority order restore eventually
> >>>> due timeslicing?
> >>>
> >>> There would be a window in which we executed userspace code
> >>> out-of-order. That's enough to scare me! However, for our PI dependency
> >>> chains it should not matter as the only time we do submit out-of-order,
> >>> we are stuck on _our_ semaphore that cannot be resolved until the
> >>> requests are back in-order.
> >>
> >> Out of order how? Within a single timeline?! I thought only with an
> >> incomplete view of priority inheritance, which in my mind could only
> >> cause deadlocks (if no timeslicing). But really really out of order?
> > 
> > Fences between timelines. Let's say we have 3 requests, A,B,C all with
> > sequential fencing (C depends on B depends on A), but B is on a
> > different engine to (A, C) and we are using semaphores to submit early.
> > If we bump the priority of C, we see it crosses the engine to B, and send
> > an ipi_priority, but set C to be higher priority than A. So we now
> > schedule C before A!
> 
> Yeah so different timelines, I think that's not a huge problem to start 
> with. Only if things were non-preemptable.

And for the special case where it may occur, it's inside a preemptible
section (under our control).

> > However, since C depends on B which depends on A, C is stuck on its
> > semaphore from B, and B is waiting for A. As soon as A is set to the
> > same priority as C (after a couple of ipi_priority()), we rerun the
> > > scheduler, see that C has a semaphore-yield (or that its timeslice
> > > eventually expired) and so run A before C, and order is restored.
> > 
> >>> I've tried to trick this into causing problems with the
> >>> i915_selftest/igt_schedule_cycle and gem_exec_schedule/noreorder.
> >>> Fortunately for my sanity, neither test have caught any problems.
> >>>
> >>> This is the handwaving part of removing the global lock.
> >>>
> >>>>> +     /*
> >>>>> +      * If we are setting the priority before being submitted, see if we
> >>>>> +      * can quickly adjust our own priority in-situ and avoid taking
> >>>>> +      * the contended engine->active.lock. If we need priority inheritance,
> >>>>> +      * take the slow route.
> >>>>> +      */
> >>>>> +     if (rq_prio(rq) == I915_PRIORITY_INVALID) {
> >>>>> +             struct i915_dependency *p;
> >>>>> +
> >>>>> +             rcu_read_lock();
> >>>>> +             for_each_signaler(p, rq) {
> >>>>> +                     struct i915_request *s =
> >>>>> +                             container_of(p->signaler, typeof(*s), sched);
> >>>>> +
> >>>>> +                     if (rq_prio(s) >= prio)
> >>>>> +                             continue;
> >>>>> +
> >>>>> +                     if (__i915_request_is_complete(s))
> >>>>> +                             continue;
> >>>>> +
> >>>>> +                     break;
> >>>>> +             }
> >>>>> +             rcu_read_unlock();
> >>>>
> >>>> Exit this loop with a first lower priority incomplete signaler. What
> >>>> does the block below then do? Feels like it needs a comment.
> >>>
> >>> I thought I had sufficiently explained that in the comment above.
> >>>
> >>> /* Update priority in place if no PI required */
> >>>>> +             if (&p->signal_link == &rq->sched.signalers_list &&
> >>>>> +                 cmpxchg(&rq->sched.attr.priority,
> >>>>> +                         I915_PRIORITY_INVALID,
> >>>>> +                         prio) == I915_PRIORITY_INVALID)
> >>>>> +                     return;
> >>>
> >>> It could do a few more tricks to change the priority in-place a second
> >>> time, but I did not think that would be frequent enough to matter.
> >>> Whereas we always adjust the priority from INVALID once before
> >>> submission, and avoiding taking the lock then does make a difference to
> >>> the profiles.
> >>
> >> To start with, if p is NULL or un-initialized (can be, no?) then
> >> relationship of &p->signal_link to &rq->sched.signalers_list escapes me.
> > 
> > p is constrained to be a member of the signalers_list or its head.
> 
> > Is it defined that list_for_each_entry exits with pos set? It is in the
> > implementation, but I don't know why it would have to be. Could you
> > change this to some form of list_empty or a descriptively named helper
> > for clarity?

It is as defined as the macro gets.

There's a list_entry_is_head(). That sounds new.

commit e130816164e244b692921de49771eeb28205152d
Author: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Date:   Thu Oct 15 20:11:31 2020 -0700

    include/linux/list.h: add a macro to test if entry is pointing to the head

-Chris

Chris Wilson Jan. 26, 2021, 1:45 p.m. UTC | #7

Quoting Chris Wilson (2021-01-26 13:24:07)
> Quoting Tvrtko Ursulin (2021-01-26 13:15:29)
> > 
> > On 26/01/2021 11:55, Chris Wilson wrote:
> > > Quoting Tvrtko Ursulin (2021-01-26 11:40:24)
> > >>
> > >> On 26/01/2021 11:30, Chris Wilson wrote:
> > >>> Quoting Tvrtko Ursulin (2021-01-26 11:12:53)
> > >>>>
> > >>>>
> > >>>> On 25/01/2021 14:01, Chris Wilson wrote:
> > >>>>> +static void ipi_schedule(struct work_struct *wrk)
> > >>>>> +{
> > >>>>> +     struct i915_sched_ipi *ipi = container_of(wrk, typeof(*ipi), work);
> > >>>>> +     struct i915_request *rq = xchg(&ipi->list, NULL);
> > >>>>> +
> > >>>>> +     do {
> > >>>>> +             struct i915_request *rn = xchg(&rq->sched.ipi_link, NULL);
> > >>>>> +             int prio;
> > >>>>> +
> > >>>>> +             prio = ipi_get_prio(rq);
> > >>>>> +
> > >>>>> +             /*
> > >>>>> +              * For cross-engine scheduling to work we rely on one of two
> > >>>>> +              * things:
> > >>>>> +              *
> > >>>>> +              * a) The requests are using dma-fence fences and so will not
> > >>>>> +              * be scheduled until the previous engine is completed, and
> > >>>>> +              * so we cannot cross back onto the original engine and end up
> > >>>>> +              * queuing an earlier request after the first (due to the
> > >>>>> +              * interrupted DFS).
> > >>>>> +              *
> > >>>>> +              * b) The requests are using semaphores and so may be already
> > >>>>> +              * be in flight, in which case if we cross back onto the same
> > >>>>> +              * engine, we will already have put the interrupted DFS into
> > >>>>> +              * the priolist, and the continuation will now be queued
> > >>>>> +              * afterwards [out-of-order]. However, since we are using
> > >>>>> +              * semaphores in this case, we also perform yield on semaphore
> > >>>>> +              * waits and so will reorder the requests back into the correct
> > >>>>> +              * sequence. This occurrence (of promoting a request chain
> > >>>>> +              * that crosses the engines using semaphores back unto itself)
> > >>>>> +              * should be unlikely enough that it probably does not matter...
> > >>>>> +              */
> > >>>>> +             local_bh_disable();
> > >>>>> +             i915_request_set_priority(rq, prio);
> > >>>>> +             local_bh_enable();
> > >>>>
> > >>>> Is it that important and wouldn't the priority order restore eventually
> > >>>> due timeslicing?
> > >>>
> > >>> There would be a window in which we executed userspace code
> > >>> out-of-order. That's enough to scare me! However, for our PI dependency
> > >>> chains it should not matter as the only time we do submit out-of-order,
> > >>> we are stuck on _our_ semaphore that cannot be resolved until the
> > >>> requests are back in-order.
> > >>
> > >> Out of order how? Within a single timeline?! I thought only with an
> > >> incomplete view of priority inheritance, which in my mind could only
> > >> cause deadlocks (if no timeslicing). But really really out of order?
> > > 
> > > Fences between timelines. Let's say we have 3 requests, A,B,C all with
> > > sequential fencing (C depends on B depends on A), but B is on a
> > > different engine to (A, C) and we are using semaphores to submit early.
> > > If we bump the priority of C, we see it crosses the engine to B, and send
> > > an ipi_priority, but set C to be higher priority than A. So we now
> > > schedule C before A!
> > 
> > Yeah so different timelines, I think that's not a huge problem to start 
> > with. Only if things were non-preemptable.
> 
> And for the special case where it may occur, it's inside a preemptible
> section (under our control).
> 
> > > However, since C depends on B which depends on A, C is stuck on its
> > > semaphore from B, and B is waiting for A. As soon as A is set to the
> > > same priority as C (after a couple of ipi_priority()), we rerun the
> > > > scheduler, see that C has a semaphore-yield (or that its timeslice
> > > > eventually expired) and so run A before C, and order is restored.
> > > 
> > >>> I've tried to trick this into causing problems with the
> > >>> i915_selftest/igt_schedule_cycle and gem_exec_schedule/noreorder.
> > >>> Fortunately for my sanity, neither test have caught any problems.
> > >>>
> > >>> This is the handwaving part of removing the global lock.
> > >>>
> > >>>>> +     /*
> > >>>>> +      * If we are setting the priority before being submitted, see if we
> > >>>>> +      * can quickly adjust our own priority in-situ and avoid taking
> > >>>>> +      * the contended engine->active.lock. If we need priority inheritance,
> > >>>>> +      * take the slow route.
> > >>>>> +      */
> > >>>>> +     if (rq_prio(rq) == I915_PRIORITY_INVALID) {
> > >>>>> +             struct i915_dependency *p;
> > >>>>> +
> > >>>>> +             rcu_read_lock();
> > >>>>> +             for_each_signaler(p, rq) {
> > >>>>> +                     struct i915_request *s =
> > >>>>> +                             container_of(p->signaler, typeof(*s), sched);
> > >>>>> +
> > >>>>> +                     if (rq_prio(s) >= prio)
> > >>>>> +                             continue;
> > >>>>> +
> > >>>>> +                     if (__i915_request_is_complete(s))
> > >>>>> +                             continue;
> > >>>>> +
> > >>>>> +                     break;
> > >>>>> +             }
> > >>>>> +             rcu_read_unlock();
> > >>>>
> > >>>> Exit this loop with a first lower priority incomplete signaler. What
> > >>>> does the block below then do? Feels like it needs a comment.
> > >>>
> > >>> I thought I had sufficiently explained that in the comment above.
> > >>>
> > >>> /* Update priority in place if no PI required */
> > >>>>> +             if (&p->signal_link == &rq->sched.signalers_list &&
> > >>>>> +                 cmpxchg(&rq->sched.attr.priority,
> > >>>>> +                         I915_PRIORITY_INVALID,
> > >>>>> +                         prio) == I915_PRIORITY_INVALID)
> > >>>>> +                     return;
> > >>>
> > >>> It could do a few more tricks to change the priority in-place a second
> > >>> time, but I did not think that would be frequent enough to matter.
> > >>> Whereas we always adjust the priority from INVALID once before
> > >>> submission, and avoiding taking the lock then does make a difference to
> > >>> the profiles.
> > >>
> > >> To start with, if p is NULL or un-initialized (can be, no?) then
> > >> relationship of &p->signal_link to &rq->sched.signalers_list escapes me.
> > > 
> > > p is constrained to be a member of the signalers_list or its head.
> > 
> > > Is it defined that list_for_each_entry exits with pos set? It is in the
> > > implementation, but I don't know why it would have to be. Could you
> > > change this to some form of list_empty or a descriptively named helper
> > > for clarity?
> 
> It is as defined as the macro gets.
> 
> There's a list_entry_is_head(). That sounds new.
> 
> commit e130816164e244b692921de49771eeb28205152d
> Author: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> Date:   Thu Oct 15 20:11:31 2020 -0700
> 
>     include/linux/list.h: add a macro to test if entry is pointing to the head


#define all_dependencies_checked(p, rq) \
        list_entry_is_head(p, &(rq)->sched.signalers_list, signal_link)

/* Update priority in place if no PI required */
if (all_dependencies_checked(p, rq) &&
    cmpxchg(&rq->sched.attr.priority,
	    I915_PRIORITY_INVALID,
	    prio) == I915_PRIORITY_INVALID)
	return;

-Chris
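
For readers following the list-head question, the fast path discussed above
can be sketched in userspace C. Everything here is an assumption made for
illustration: the toy_* names are hypothetical, the list macros are minimal
stand-ins for the kernel's list.h (with the entry type passed explicitly
instead of using typeof), and C11 atomic_compare_exchange_strong() stands in
for the kernel's cmpxchg(). The point it shows is that a loop which runs to
completion leaves the cursor pointing at the entry that "contains" the list
head, which is what a list_entry_is_head()-style check tests.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal userspace stand-ins for the kernel list helpers. */
struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define toy_entry_is_head(pos, head, member) \
	(&(pos)->member == (head))
#define toy_for_each_entry(pos, type, head, member) \
	for ((pos) = container_of((head)->next, type, member); \
	     !toy_entry_is_head(pos, head, member); \
	     (pos) = container_of((pos)->member.next, type, member))

static void toy_list_init(struct list_head *h)
{
	h->next = h->prev = h;
}

static void toy_list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Hypothetical stand-in for I915_PRIORITY_INVALID. */
#define TOY_PRIO_INVALID INT_MIN

struct toy_dep {			/* stand-in for a dependency link */
	struct list_head signal_link;
	int signaler_prio;
	bool signaler_complete;
};

struct toy_request {
	struct list_head signalers_list;
	atomic_int priority;
};

/* Returns true if the priority was updated in place (no PI required). */
static bool toy_set_priority_fast(struct toy_request *rq, int prio)
{
	struct toy_dep *p;
	int expected = TOY_PRIO_INVALID;

	toy_for_each_entry(p, struct toy_dep, &rq->signalers_list, signal_link) {
		if (p->signaler_prio >= prio)
			continue;	/* signaler already high enough */

		if (p->signaler_complete)
			continue;	/* nothing left to boost */

		break;			/* this signaler needs PI */
	}

	/* Broke out early: a signaler needs boosting, take the slow path. */
	if (!toy_entry_is_head(p, &rq->signalers_list, signal_link))
		return false;

	/* Only the first transition away from INVALID succeeds. */
	return atomic_compare_exchange_strong(&rq->priority, &expected, prio);
}
```

A request whose signalers are all complete (or already at the target
priority) gets its priority set without any lock; a second call, or a call
with an incomplete lower-priority signaler, returns false so that the caller
can fall back to the locked path.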
