
[RFC,1/6] drm/i915: Individual request cancellation

Message ID: 20210312154622.1767865-2-tvrtko.ursulin@linux.intel.com (mailing list archive)
State: New, archived
Series: Default request/fence expiry + watchdog

Commit Message

Tvrtko Ursulin March 12, 2021, 3:46 p.m. UTC
From: Chris Wilson <chris@chris-wilson.co.uk>

Currently, we cancel outstanding requests within a context when the
context is closed. We may also want to cancel individual requests using
the same graceful preemption mechanism.

v2 (Tvrtko):
 * Cancel waiters carefully considering no timeline lock and RCU.
 * Fixed selftests.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
 .../drm/i915/gt/intel_execlists_submission.c  |   9 +-
 drivers/gpu/drm/i915/i915_request.c           |  77 ++++-
 drivers/gpu/drm/i915/i915_request.h           |   4 +-
 drivers/gpu/drm/i915/selftests/i915_request.c | 275 ++++++++++++++++++
 5 files changed, 360 insertions(+), 6 deletions(-)

Comments

Tvrtko Ursulin March 15, 2021, 5:37 p.m. UTC | #1
On 12/03/2021 15:46, Tvrtko Ursulin wrote:
> From: Chris Wilson <chris@chris-wilson.co.uk>
> 
> Currently, we cancel outstanding requests within a context when the
> context is closed. We may also want to cancel individual requests using
> the same graceful preemption mechanism.
> 
> v2 (Tvrtko):
>   * Cancel waiters carefully considering no timeline lock and RCU.
>   * Fixed selftests.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

[snip]

> +void i915_request_cancel(struct i915_request *rq, int error)
> +{
> +	if (!i915_request_set_error_once(rq, error))
> +		return;
> +
> +	set_bit(I915_FENCE_FLAG_SENTINEL, &rq->fence.flags);
> +
> +	if (i915_sw_fence_signaled(&rq->submit)) {
> +		struct i915_dependency *p;
> +
> +restart:
> +		rcu_read_lock();
> +		for_each_waiter(p, rq) {
> +			struct i915_request *w =
> +				container_of(p->waiter, typeof(*w), sched);
> +
> +			if (__i915_request_is_complete(w) ||
> +			    fatal_error(w->fence.error))
> +				continue;
> +
> +			w = i915_request_get(w);
> +			rcu_read_unlock();
> +			/* Recursion bound by the number of engines */
> +			i915_request_cancel(w, error);
> +			i915_request_put(w);
> +
> +			/* Restart after having to drop rcu lock. */
> +			goto restart;
> +		}

So I need to fix this error propagation to waiters in order to avoid the 
potential stack overflow caught in CI shards (gem_ctx_ringsize).

Or alternatively we decide not to propagate fence errors at all. I am not 
sure the consequences of either option are particularly better or worse. 
Things will break either way; which userspace even inspects fences for 
unexpected errors?!

So rendering corruption, more or less. Whether it can cause a further 
stream of GPU hangs I am not sure; only if there is an inter-engine data 
dependency involving data more complex than images/textures.
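
For illustration, one way to bound the stack usage (a rough, untested 
sketch, not what this patch does) would be to propagate the error only to 
the direct waiters instead of recursing through the whole dependency tree:

/*
 * Illustrative sketch only: mark the direct waiters with the fatal error
 * but do not recurse into their own waiter trees, so stack usage no
 * longer depends on the depth of the dependency chain.
 */
static void __cancel_propagate_error(struct i915_request *rq, int error)
{
	struct i915_dependency *p;

restart:
	rcu_read_lock();
	for_each_waiter(p, rq) {
		struct i915_request *w =
			container_of(p->waiter, typeof(*w), sched);

		if (__i915_request_is_complete(w) ||
		    fatal_error(w->fence.error))
			continue;

		w = i915_request_get(w);
		rcu_read_unlock();

		/* Mark the waiter, without cancelling its own waiters. */
		i915_request_set_error_once(w, error);
		i915_request_put(w);

		/*
		 * Restart the scan after dropping the RCU lock; already
		 * marked waiters are skipped by the fatal_error() check
		 * above, so the loop terminates.
		 */
		goto restart;
	}
	rcu_read_unlock();
}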

Regards,

Tvrtko

> +		rcu_read_unlock();
> +	}
> +
> +	__cancel_request(rq);
> +}
Daniel Vetter March 16, 2021, 10:02 a.m. UTC | #2
On Mon, Mar 15, 2021 at 05:37:27PM +0000, Tvrtko Ursulin wrote:
> 
> On 12/03/2021 15:46, Tvrtko Ursulin wrote:
> > From: Chris Wilson <chris@chris-wilson.co.uk>
> > 
> > Currently, we cancel outstanding requests within a context when the
> > context is closed. We may also want to cancel individual requests using
> > the same graceful preemption mechanism.
> > 
> > v2 (Tvrtko):
> >   * Cancel waiters carefully considering no timeline lock and RCU.
> >   * Fixed selftests.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> [snip]
> 
> > +void i915_request_cancel(struct i915_request *rq, int error)
> > +{
> > +	if (!i915_request_set_error_once(rq, error))
> > +		return;
> > +
> > +	set_bit(I915_FENCE_FLAG_SENTINEL, &rq->fence.flags);
> > +
> > +	if (i915_sw_fence_signaled(&rq->submit)) {
> > +		struct i915_dependency *p;
> > +
> > +restart:
> > +		rcu_read_lock();
> > +		for_each_waiter(p, rq) {
> > +			struct i915_request *w =
> > +				container_of(p->waiter, typeof(*w), sched);
> > +
> > +			if (__i915_request_is_complete(w) ||
> > +			    fatal_error(w->fence.error))
> > +				continue;
> > +
> > +			w = i915_request_get(w);
> > +			rcu_read_unlock();
> > +			/* Recursion bound by the number of engines */
> > +			i915_request_cancel(w, error);
> > +			i915_request_put(w);
> > +
> > +			/* Restart after having to drop rcu lock. */
> > +			goto restart;
> > +		}
> 
> So I need to fix this error propagation to waiters in order to avoid the
> potential stack overflow caught in CI shards (gem_ctx_ringsize).
> 
> Or alternatively we decide not to propagate fence errors at all. I am not
> sure the consequences of either option are particularly better or worse.
> Things will break either way; which userspace even inspects fences for
> unexpected errors?!

fence error propagation is one of these "sounds like a good idea" things
that turned into a can of worms. See the recent revert Jason submitted; I
replied there with a more in-depth discussion.

So I'd say if we don't need this internally somehow for scheduler state,
remove it. Maybe even remove the entire scaffolding we have for the
forwarding.
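
For reference, dropping the propagation would shrink the cancellation path
added in this patch to roughly the following (untested sketch, just the
patch's function minus the waiter loop):

void i915_request_cancel(struct i915_request *rq, int error)
{
	/* Record the fatal error on the fence, once. */
	if (!i915_request_set_error_once(rq, error))
		return;

	set_bit(I915_FENCE_FLAG_SENTINEL, &rq->fence.flags);

	/* Preempt (or, failing that, reset) the engine if the request is executing. */
	__cancel_request(rq);
}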

Maybe best if you sync with Jason here; we need to stuff Jason's patch
into -fixes since there's a pretty bad regression going on. I think Jason
also said there's a pile of IGTs to remove once we give up on fence error
propagation.

> So rendering corruption, more or less. Whether it can cause a further
> stream of GPU hangs I am not sure; only if there is an inter-engine data
> dependency involving data more complex than images/textures.

Yup. Also, at least on modern-ish hw our userspace goes with
non-recoverable contexts anyway, because everything needs to be
reconstructed. Vulkan is even more brutal: it just hands you back
VK_ERROR_DEVICE_LOST and everything is gone (textures, data, all API
objects, really everything AFAIUI). Trying to continue is something only
old userspace does, because it fully emits the entire context state at the
start of each batch anyway.
-Daniel

Patch

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index 0b062fad1837..e2fb3ae2aaf3 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -314,6 +314,7 @@  int intel_engine_pulse(struct intel_engine_cs *engine)
 		mutex_unlock(&ce->timeline->mutex);
 	}
 
+	intel_engine_flush_scheduler(engine);
 	intel_engine_pm_put(engine);
 	return err;
 }
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 85ff5fe861b4..4c2acb5a6c0a 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -421,6 +421,11 @@  static void reset_active(struct i915_request *rq,
 	ce->lrc.lrca = lrc_update_regs(ce, engine, head);
 }
 
+static bool bad_request(const struct i915_request *rq)
+{
+	return rq->fence.error && i915_request_started(rq);
+}
+
 static struct intel_engine_cs *
 __execlists_schedule_in(struct i915_request *rq)
 {
@@ -433,7 +438,7 @@  __execlists_schedule_in(struct i915_request *rq)
 		     !intel_engine_has_heartbeat(engine)))
 		intel_context_set_banned(ce);
 
-	if (unlikely(intel_context_is_banned(ce)))
+	if (unlikely(intel_context_is_banned(ce) || bad_request(rq)))
 		reset_active(rq, engine);
 
 	if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
@@ -1112,7 +1117,7 @@  static unsigned long active_preempt_timeout(struct intel_engine_cs *engine,
 		return 0;
 
 	/* Force a fast reset for terminated contexts (ignoring sysfs!) */
-	if (unlikely(intel_context_is_banned(rq->context)))
+	if (unlikely(intel_context_is_banned(rq->context) || bad_request(rq)))
 		return 1;
 
 	return READ_ONCE(engine->props.preempt_timeout_ms);
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index e7b4c4bc41a6..fb9c5bb1fe41 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -33,7 +33,10 @@ 
 #include "gem/i915_gem_context.h"
 #include "gt/intel_breadcrumbs.h"
 #include "gt/intel_context.h"
+#include "gt/intel_engine.h"
+#include "gt/intel_engine_heartbeat.h"
 #include "gt/intel_gpu_commands.h"
+#include "gt/intel_reset.h"
 #include "gt/intel_ring.h"
 #include "gt/intel_rps.h"
 
@@ -429,20 +432,22 @@  void __i915_request_skip(struct i915_request *rq)
 	rq->infix = rq->postfix;
 }
 
-void i915_request_set_error_once(struct i915_request *rq, int error)
+bool i915_request_set_error_once(struct i915_request *rq, int error)
 {
 	int old;
 
 	GEM_BUG_ON(!IS_ERR_VALUE((long)error));
 
 	if (i915_request_signaled(rq))
-		return;
+		return false;
 
 	old = READ_ONCE(rq->fence.error);
 	do {
 		if (fatal_error(old))
-			return;
+			return false;
 	} while (!try_cmpxchg(&rq->fence.error, &old, error));
+
+	return true;
 }
 
 struct i915_request *i915_request_mark_eio(struct i915_request *rq)
@@ -609,6 +614,72 @@  void i915_request_unsubmit(struct i915_request *request)
 	spin_unlock_irqrestore(&se->lock, flags);
 }
 
+static struct intel_engine_cs *active_engine(struct i915_request *rq)
+{
+	struct intel_engine_cs *engine, *locked;
+
+	locked = READ_ONCE(rq->engine);
+	spin_lock_irq(&locked->sched.lock);
+	while (unlikely(locked != (engine = READ_ONCE(rq->engine)))) {
+		spin_unlock(&locked->sched.lock);
+		locked = engine;
+		spin_lock(&locked->sched.lock);
+	}
+
+	engine = NULL;
+	if (i915_request_is_active(rq) && !__i915_request_is_complete(rq))
+		engine = locked;
+
+	spin_unlock_irq(&locked->sched.lock);
+
+	return engine;
+}
+
+static void __cancel_request(struct i915_request *rq)
+{
+	struct intel_engine_cs *engine = active_engine(rq);
+
+	if (engine && intel_engine_pulse(engine))
+		intel_gt_handle_error(engine->gt, engine->mask, 0,
+				      "request cancellation by %s",
+				      current->comm);
+}
+
+void i915_request_cancel(struct i915_request *rq, int error)
+{
+	if (!i915_request_set_error_once(rq, error))
+		return;
+
+	set_bit(I915_FENCE_FLAG_SENTINEL, &rq->fence.flags);
+
+	if (i915_sw_fence_signaled(&rq->submit)) {
+		struct i915_dependency *p;
+
+restart:
+		rcu_read_lock();
+		for_each_waiter(p, rq) {
+			struct i915_request *w =
+				container_of(p->waiter, typeof(*w), sched);
+
+			if (__i915_request_is_complete(w) ||
+			    fatal_error(w->fence.error))
+				continue;
+
+			w = i915_request_get(w);
+			rcu_read_unlock();
+			/* Recursion bound by the number of engines */
+			i915_request_cancel(w, error);
+			i915_request_put(w);
+
+			/* Restart after having to drop rcu lock. */
+			goto restart;
+		}
+		rcu_read_unlock();
+	}
+
+	__cancel_request(rq);
+}
+
 static int __i915_sw_fence_call
 submit_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
 {
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index dd10a6db3d21..64869a313b3e 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -312,7 +312,7 @@  struct i915_request * __must_check
 i915_request_create(struct intel_context *ce);
 
 void __i915_request_skip(struct i915_request *rq);
-void i915_request_set_error_once(struct i915_request *rq, int error);
+bool i915_request_set_error_once(struct i915_request *rq, int error);
 struct i915_request *i915_request_mark_eio(struct i915_request *rq);
 
 struct i915_request *__i915_request_commit(struct i915_request *request);
@@ -368,6 +368,8 @@  void i915_request_submit(struct i915_request *request);
 void __i915_request_unsubmit(struct i915_request *request);
 void i915_request_unsubmit(struct i915_request *request);
 
+void i915_request_cancel(struct i915_request *rq, int error);
+
 long i915_request_wait(struct i915_request *rq,
 		       unsigned int flags,
 		       long timeout)
diff --git a/drivers/gpu/drm/i915/selftests/i915_request.c b/drivers/gpu/drm/i915/selftests/i915_request.c
index 8035ea7565ed..e63609ec5b97 100644
--- a/drivers/gpu/drm/i915/selftests/i915_request.c
+++ b/drivers/gpu/drm/i915/selftests/i915_request.c
@@ -608,6 +608,280 @@  static int live_nop_request(void *arg)
 	return err;
 }
 
+static int __cancel_inactive(struct intel_engine_cs *engine)
+{
+	struct intel_context *ce;
+	struct igt_spinner spin;
+	struct i915_request *rq;
+	int err = 0;
+
+	if (igt_spinner_init(&spin, engine->gt))
+		return -ENOMEM;
+
+	ce = intel_context_create(engine);
+	if (IS_ERR(ce)) {
+		err = PTR_ERR(ce);
+		goto out_spin;
+	}
+
+	rq = igt_spinner_create_request(&spin, ce, MI_ARB_CHECK);
+	if (IS_ERR(rq)) {
+		err = PTR_ERR(rq);
+		goto out_ce;
+	}
+
+	pr_debug("%s: Cancelling inactive request\n", engine->name);
+	i915_request_cancel(rq, -EINTR);
+	i915_request_get(rq);
+	i915_request_add(rq);
+
+	if (i915_request_wait(rq, 0, HZ / 5) < 0) {
+		struct drm_printer p = drm_info_printer(engine->i915->drm.dev);
+
+		pr_err("%s: Failed to cancel inactive request\n", engine->name);
+		intel_engine_dump(engine, &p, "%s\n", engine->name);
+		err = -ETIME;
+		goto out_rq;
+	}
+
+	if (rq->fence.error != -EINTR) {
+		pr_err("%s: fence not cancelled (%u)\n",
+		       engine->name, rq->fence.error);
+		err = -EINVAL;
+	}
+
+out_rq:
+	i915_request_put(rq);
+out_ce:
+	intel_context_put(ce);
+out_spin:
+	igt_spinner_fini(&spin);
+	if (err)
+		pr_err("%s: __cancel_inactive error %d\n", engine->name, err);
+	return err;
+}
+
+static int __cancel_active(struct intel_engine_cs *engine)
+{
+	struct intel_context *ce;
+	struct igt_spinner spin;
+	struct i915_request *rq;
+	int err = 0;
+
+	if (igt_spinner_init(&spin, engine->gt))
+		return -ENOMEM;
+
+	ce = intel_context_create(engine);
+	if (IS_ERR(ce)) {
+		err = PTR_ERR(ce);
+		goto out_spin;
+	}
+
+	rq = igt_spinner_create_request(&spin, ce, MI_ARB_CHECK);
+	if (IS_ERR(rq)) {
+		err = PTR_ERR(rq);
+		goto out_ce;
+	}
+
+	pr_debug("%s: Cancelling active request\n", engine->name);
+	i915_request_get(rq);
+	i915_request_add(rq);
+	if (!igt_wait_for_spinner(&spin, rq)) {
+		struct drm_printer p = drm_info_printer(engine->i915->drm.dev);
+
+		pr_err("Failed to start spinner on %s\n", engine->name);
+		intel_engine_dump(engine, &p, "%s\n", engine->name);
+		err = -ETIME;
+		goto out_rq;
+	}
+	i915_request_cancel(rq, -EINTR);
+
+	if (i915_request_wait(rq, 0, HZ / 5) < 0) {
+		struct drm_printer p = drm_info_printer(engine->i915->drm.dev);
+
+		pr_err("%s: Failed to cancel active request\n", engine->name);
+		intel_engine_dump(engine, &p, "%s\n", engine->name);
+		err = -ETIME;
+		goto out_rq;
+	}
+
+	if (rq->fence.error != -EINTR) {
+		pr_err("%s: fence not cancelled (%u)\n",
+		       engine->name, rq->fence.error);
+		err = -EINVAL;
+	}
+
+out_rq:
+	i915_request_put(rq);
+out_ce:
+	intel_context_put(ce);
+out_spin:
+	igt_spinner_fini(&spin);
+	if (err)
+		pr_err("%s: __cancel_active error %d\n", engine->name, err);
+	return err;
+}
+
+static int __cancel_active_chain(struct intel_engine_cs *engine)
+{
+	struct intel_context *ce;
+	struct igt_spinner spin;
+	struct i915_request *rq[2];
+	int err = 0;
+
+	if (igt_spinner_init(&spin, engine->gt))
+		return -ENOMEM;
+
+	ce = intel_context_create(engine);
+	if (IS_ERR(ce)) {
+		err = PTR_ERR(ce);
+		goto out_spin;
+	}
+
+	rq[0] = igt_spinner_create_request(&spin, ce, MI_ARB_CHECK);
+	if (IS_ERR(rq[0])) {
+		err = PTR_ERR(rq[0]);
+		goto out_ce;
+	}
+	i915_request_get(rq[0]);
+	i915_request_add(rq[0]);
+
+	rq[1] = intel_context_create_request(ce);
+	if (IS_ERR(rq[1])) {
+		err = PTR_ERR(rq[1]);
+		goto out_spinner;
+	}
+	i915_request_get(rq[1]);
+	i915_request_add(rq[1]);
+
+	pr_debug("%s: Cancelling active chain\n", engine->name);
+	intel_engine_flush_scheduler(engine);
+	i915_request_cancel(rq[0], -EINTR);
+	igt_spinner_end(&spin);
+
+	if (i915_request_wait(rq[1], 0, HZ / 5) < 0) {
+		struct drm_printer p = drm_info_printer(engine->i915->drm.dev);
+
+		pr_err("%s: Failed to cancel chained request\n", engine->name);
+		intel_engine_dump(engine, &p, "%s\n", engine->name);
+		err = -ETIME;
+		goto out_waiter;
+	}
+
+	if (rq[0]->fence.error != -EINTR) {
+		pr_err("%s: first fence not cancelled (%u)\n",
+		       engine->name, rq[0]->fence.error);
+		err = -EINVAL;
+	}
+
+	if (rq[1]->fence.error != -EINTR) {
+		pr_err("%s: second fence not cancelled (%u)\n",
+		       engine->name, rq[1]->fence.error);
+		err = -EINVAL;
+	}
+
+out_waiter:
+	i915_request_put(rq[1]);
+out_spinner:
+	i915_request_put(rq[0]);
+out_ce:
+	intel_context_put(ce);
+out_spin:
+	igt_spinner_fini(&spin);
+	if (err)
+		pr_err("%s: __cancel_active_chain error %d\n",
+		       engine->name, err);
+	return err;
+}
+
+static int __cancel_completed(struct intel_engine_cs *engine)
+{
+	struct intel_context *ce;
+	struct igt_spinner spin;
+	struct i915_request *rq;
+	int err = 0;
+
+	if (igt_spinner_init(&spin, engine->gt))
+		return -ENOMEM;
+
+	ce = intel_context_create(engine);
+	if (IS_ERR(ce)) {
+		err = PTR_ERR(ce);
+		goto out_spin;
+	}
+
+	rq = igt_spinner_create_request(&spin, ce, MI_ARB_CHECK);
+	if (IS_ERR(rq)) {
+		err = PTR_ERR(rq);
+		goto out_ce;
+	}
+	igt_spinner_end(&spin);
+	i915_request_get(rq);
+	i915_request_add(rq);
+
+	if (i915_request_wait(rq, 0, HZ / 5) < 0) {
+		err = -ETIME;
+		goto out_rq;
+	}
+
+	pr_debug("%s: Cancelling completed request\n", engine->name);
+	i915_request_cancel(rq, -EINTR);
+	if (rq->fence.error) {
+		pr_err("%s: fence not cancelled (%u)\n",
+		       engine->name, rq->fence.error);
+		err = -EINVAL;
+	}
+
+out_rq:
+	i915_request_put(rq);
+out_ce:
+	intel_context_put(ce);
+out_spin:
+	igt_spinner_fini(&spin);
+	if (err)
+		pr_err("%s: __cancel_completed error %d\n", engine->name, err);
+	return err;
+}
+
+static int live_cancel_request(void *arg)
+{
+	struct drm_i915_private *i915 = arg;
+	struct intel_engine_cs *engine;
+
+	/*
+	 * Check cancellation of requests. We expect to be able to immediately
+	 * cancel active requests, even if they are currently on the GPU.
+	 */
+
+	for_each_uabi_engine(engine, i915) {
+		struct igt_live_test t;
+		int err, err2;
+
+		if (!intel_engine_has_preemption(engine))
+			continue;
+
+		err = igt_live_test_begin(&t, i915, __func__, engine->name);
+		if (err)
+			return err;
+
+		err = __cancel_inactive(engine);
+		if (err == 0)
+			err = __cancel_active(engine);
+		if (err == 0)
+			err = __cancel_active_chain(engine);
+		if (err == 0)
+			err = __cancel_completed(engine);
+
+		err2 = igt_live_test_end(&t);
+		if (err)
+			return err;
+		if (err2)
+			return err2;
+	}
+
+	return 0;
+}
+
 static struct i915_vma *empty_batch(struct drm_i915_private *i915)
 {
 	struct drm_i915_gem_object *obj;
@@ -1485,6 +1759,7 @@  int i915_request_live_selftests(struct drm_i915_private *i915)
 		SUBTEST(live_sequential_engines),
 		SUBTEST(live_parallel_engines),
 		SUBTEST(live_empty_request),
+		SUBTEST(live_cancel_request),
 		SUBTEST(live_breadcrumbs_smoketest),
 	};