
[13/14] drm/i915/guc: Workaround the missing user interrupt after preemption

Message ID 20171019183619.6235-14-michal.winiarski@intel.com (mailing list archive)
State New, archived

Commit Message

Michał Winiarski Oct. 19, 2017, 6:36 p.m. UTC
With GuC, we're scheduling the tasklet on USER_INTERRUPT - since some work
has completed, perhaps we're able to submit some more. We're doing a
similar thing for preemption - after preemption has completed, it's time
to schedule the tasklet and submit more work (since the engine is now
idle). Unfortunately, we can hit scenarios where the preemption is
done, but the interrupt is nowhere to be seen. To work around the
problem, let's use a delayed work that kicks the tasklet if preemption
is done, and requeues itself otherwise.

Testcase: igt/gem_exec_whisper/*-priority
Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Jeff McGee <jeff.mcgee@intel.com>
Cc: John Harrison <john.c.harrison@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_guc_submission.c | 36 +++++++++++++++++++++++++++++-
 drivers/gpu/drm/i915/intel_guc.h           |  1 +
 2 files changed, 36 insertions(+), 1 deletion(-)

Comments

Chris Wilson Oct. 19, 2017, 7:44 p.m. UTC | #1
Quoting Michał Winiarski (2017-10-19 19:36:18)
> With GuC, we're scheduling tasklet on USER_INTERRUPT - since some work
> has completed, perhaps we're able to submit some more work. We're doing
> similar thing for preemption - after preemption has completed, it's time
> to schedule the tasklet and submit more work (since the engine is now
> idle). Unfortunately, we can hit the scenarios where the preemption is
> done, but the interrupt is nowhere to be seen. To work around the
> problem, let's use a delayed work that's kicking the tasklet if
> preemption is done, and queueing itself otherwise.

I'm not buying it yet. Missing USER_INTERRUPT would not be guc specific
(surely?) and so we would be getting similar missed-breadcrumb warnings
on skl+. We dedicate a lot of CI towards detecting those...

The first alternative that springs to mind is bad ordering between ggtt
write and MI_USER_INTERRUPT, but that too is common to execlists, and
also results in missed-breadcrumb warnings (which are notable by their
absence). Though having said that, the HWSP ordering issue with
intel_iommu is worrying me that we have similar issues with breadcrumbs
+ intel_iommu. Is iommu a factor?

In short, if this is a problem here, it should be a problem everywhere.
Right?
-Chris
Chris Wilson Oct. 20, 2017, 4:08 p.m. UTC | #2
Quoting Michał Winiarski (2017-10-19 19:36:18)
> +#define GUC_LOST_IRQ_WORK_DELAY_MS 100
> +static void guc_lost_user_interrupt(struct work_struct *work)
> +{
> +       struct guc_preempt_work *preempt_work =
> +               container_of(to_delayed_work(work), typeof(*preempt_work),
> +                            lost_irq_work);
> +       struct intel_engine_cs *engine = preempt_work->engine;
> +       struct intel_guc *guc = &engine->i915->guc;

Since you switched preempt_work to be stored within intel_guc,
you can use container_of to get to the guc and cut out the dance.
Here and inject_preempt_context.
-Chris
Chris Wilson Oct. 20, 2017, 5:22 p.m. UTC | #3
Quoting Chris Wilson (2017-10-19 20:44:41)
> Quoting Michał Winiarski (2017-10-19 19:36:18)
> > With GuC, we're scheduling tasklet on USER_INTERRUPT - since some work
> > has completed, perhaps we're able to submit some more work. We're doing
> > similar thing for preemption - after preemption has completed, it's time
> > to schedule the tasklet and submit more work (since the engine is now
> > idle). Unfortunately, we can hit the scenarios where the preemption is
> > done, but the interrupt is nowhere to be seen. To work around the
> > problem, let's use a delayed work that's kicking the tasklet if
> > preemption is done, and queueing itself otherwise.
> 
> I'm not buying it yet.

I think I know what it is. I think it is as simple as us turning off the
irq when there's no more requests being signaled on the engine.
-Chris
Daniele Ceraolo Spurio Oct. 20, 2017, 5:42 p.m. UTC | #4
On 20/10/17 10:22, Chris Wilson wrote:
> Quoting Chris Wilson (2017-10-19 20:44:41)
>> Quoting Michał Winiarski (2017-10-19 19:36:18)
>>> With GuC, we're scheduling tasklet on USER_INTERRUPT - since some work
>>> has completed, perhaps we're able to submit some more work. We're doing
>>> similar thing for preemption - after preemption has completed, it's time
>>> to schedule the tasklet and submit more work (since the engine is now
>>> idle). Unfortunately, we can hit the scenarios where the preemption is
>>> done, but the interrupt is nowhere to be seen. To work around the
>>> problem, let's use a delayed work that's kicking the tasklet if
>>> preemption is done, and queueing itself otherwise.
>>
>> I'm not buying it yet.
> 
> I think I know what it is. I think it is as simple as us turning off the
> irq when there's no more requests being signaled on the engine.
> -Chris

I can confirm this, I was just about to reply with the same findings ;)
By hackily adding USER_INTERRUPT to irq_keep_mask we've been able to run 
exec_whisper@fds-priority several times in a row with no problems.

Daniele

Patch

diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
index a11ed4deff4b..dbb03b5481d2 100644
--- a/drivers/gpu/drm/i915/i915_guc_submission.c
+++ b/drivers/gpu/drm/i915/i915_guc_submission.c
@@ -574,6 +574,26 @@  static void flush_ggtt_writes(struct i915_vma *vma)
 		POSTING_READ_FW(GUC_STATUS);
 }
 
+#define GUC_LOST_IRQ_WORK_DELAY_MS 100
+static void guc_lost_user_interrupt(struct work_struct *work)
+{
+	struct guc_preempt_work *preempt_work =
+		container_of(to_delayed_work(work), typeof(*preempt_work),
+			     lost_irq_work);
+	struct intel_engine_cs *engine = preempt_work->engine;
+	struct intel_guc *guc = &engine->i915->guc;
+	struct guc_shared_ctx_data *data = guc->shared_data_vaddr;
+	struct guc_ctx_report *report = &data->preempt_ctx_report[engine->guc_id];
+
+	if (report->report_return_status == INTEL_GUC_REPORT_STATUS_COMPLETE)
+		tasklet_schedule(&engine->execlists.irq_tasklet);
+	else
+		queue_delayed_work(guc->preempt_wq,
+				   &preempt_work->lost_irq_work,
+				   msecs_to_jiffies(GUC_LOST_IRQ_WORK_DELAY_MS));
+
+}
+
 #define GUC_PREEMPT_FINISHED 0x1
 #define GUC_PREEMPT_BREADCRUMB_DWORDS 0x8
 static void inject_preempt_context(struct work_struct *work)
@@ -629,7 +649,13 @@  static void inject_preempt_context(struct work_struct *work)
 	if (WARN_ON(intel_guc_send(guc, data, ARRAY_SIZE(data)))) {
 		WRITE_ONCE(engine->execlists.preempt, false);
 		tasklet_schedule(&engine->execlists.irq_tasklet);
+
+		return;
 	}
+
+	queue_delayed_work(engine->i915->guc.preempt_wq,
+			   &preempt_work->lost_irq_work,
+			   msecs_to_jiffies(GUC_LOST_IRQ_WORK_DELAY_MS));
 }
 
 /*
@@ -647,6 +673,10 @@  static void wait_for_guc_preempt_report(struct intel_engine_cs *engine)
 	struct guc_shared_ctx_data *data = guc->shared_data_vaddr;
 	struct guc_ctx_report *report = &data->preempt_ctx_report[engine->guc_id];
 
+	/* If we landed here, it means that we didn't lose an interrupt, and
+	 * we can safely cancel the worker */
+	cancel_delayed_work(&guc->preempt_work[engine->id].lost_irq_work);
+
 	WARN_ON(wait_for_atomic(report->report_return_status ==
 				INTEL_GUC_REPORT_STATUS_COMPLETE,
 				GUC_PREEMPT_POSTPROCESS_DELAY_MS));
@@ -1229,6 +1259,8 @@  static int guc_preempt_work_create(struct intel_guc *guc)
 	for_each_engine(engine, dev_priv, id) {
 		guc->preempt_work[id].engine = engine;
 		INIT_WORK(&guc->preempt_work[id].work, inject_preempt_context);
+		INIT_DELAYED_WORK(&guc->preempt_work[id].lost_irq_work,
+				  guc_lost_user_interrupt);
 	}
 
 	return 0;
@@ -1240,8 +1272,10 @@  static void guc_preempt_work_destroy(struct intel_guc *guc)
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
 
-	for_each_engine(engine, dev_priv, id)
+	for_each_engine(engine, dev_priv, id) {
+		cancel_delayed_work_sync(&guc->preempt_work[id].lost_irq_work);
 		cancel_work_sync(&guc->preempt_work[id].work);
+	}
 
 	destroy_workqueue(guc->preempt_wq);
 }
diff --git a/drivers/gpu/drm/i915/intel_guc.h b/drivers/gpu/drm/i915/intel_guc.h
index 7273a6be7dc1..0c9338b5c4b8 100644
--- a/drivers/gpu/drm/i915/intel_guc.h
+++ b/drivers/gpu/drm/i915/intel_guc.h
@@ -49,6 +49,7 @@  enum i915_guc_client_id {
 struct guc_preempt_work {
 	struct intel_engine_cs *engine;
 	struct work_struct work;
+	struct delayed_work lost_irq_work;
 };
 
 struct intel_guc {