diff mbox

[v3,05/16] drm/i915: track ring progression using seqnos

Message ID 1365089568-20457-6-git-send-email-mika.kuoppala@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Mika Kuoppala April 4, 2013, 3:32 p.m. UTC
Instead of relying in acthd, track ring seqno progression
to detect if ring has hung.

v2: put hangcheck stuff inside struct (Chris Wilson)

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |    2 --
 drivers/gpu/drm/i915/i915_irq.c         |   30 +++++++++++++-----------------
 drivers/gpu/drm/i915/intel_ringbuffer.h |    6 ++++++
 3 files changed, 19 insertions(+), 19 deletions(-)

Comments

Ben Widawsky April 20, 2013, 6:43 p.m. UTC | #1
On Thu, Apr 04, 2013 at 06:32:37PM +0300, Mika Kuoppala wrote:
> Instead of relying in acthd, track ring seqno progression
> to detect if ring has hung.
> 
> v2: put hangcheck stuff inside struct (Chris Wilson)
> 
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>

This patch really scares me. I think we don't want to ever stop using
ACTHD. AFAICT, ACTHD is aways something we want to track in regards to
the GPU not making progress. (unless we're worried about cycles in the
CS, but there, seqno would equally not help us track progress).

The two danger cases are:
an infinite loop in a shader
never called MI_BATCH_BUFFER_END

And for both, *I think* ACTHD won't change.

I believe what the patch really wants is to stop using instdone, but
continue to use ACTHD as a fallback.

I think if you can keep ACTHD around, I'd be willing to r-b this.

Discussion moved to IRC.


[snip]
Ben Widawsky April 21, 2013, 9:07 p.m. UTC | #2
On Sat, Apr 20, 2013 at 11:43:51AM -0700, Ben Widawsky wrote:
> On Thu, Apr 04, 2013 at 06:32:37PM +0300, Mika Kuoppala wrote:
> > Instead of relying in acthd, track ring seqno progression
> > to detect if ring has hung.
> > 
> > v2: put hangcheck stuff inside struct (Chris Wilson)
> > 
> > Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> 
> This patch really scares me. I think we don't want to ever stop using
> ACTHD. AFAICT, ACTHD is aways something we want to track in regards to
> the GPU not making progress. (unless we're worried about cycles in the
> CS, but there, seqno would equally not help us track progress).
> 
> The two danger cases are:
> an infinite loop in a shader
> never called MI_BATCH_BUFFER_END
> 
> And for both, *I think* ACTHD won't change.

I'm wrong. I was confusing HEAD with ACTHD. ACTHD is supposed to show
the address of either the batch, or the ring. HEAD is something we could
use instead though. Just a thought.

> 
> I believe what the patch really wants is to stop using instdone, but
> continue to use ACTHD as a fallback.
> 
> I think if you can keep ACTHD around, I'd be willing to r-b this.
> 
> Discussion moved to IRC.
> 
> 
> [snip]
> -- 
> Ben Widawsky, Intel Open Source Technology Center
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Mika Kuoppala April 22, 2013, 1:36 p.m. UTC | #3
Ben Widawsky <ben@bwidawsk.net> writes:

> On Sat, Apr 20, 2013 at 11:43:51AM -0700, Ben Widawsky wrote:
>> On Thu, Apr 04, 2013 at 06:32:37PM +0300, Mika Kuoppala wrote:
>> > Instead of relying in acthd, track ring seqno progression
>> > to detect if ring has hung.
>> > 
>> > v2: put hangcheck stuff inside struct (Chris Wilson)
>> > 
>> > Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
>> 
>> This patch really scares me. I think we don't want to ever stop using
>> ACTHD. AFAICT, ACTHD is aways something we want to track in regards to
>> the GPU not making progress. (unless we're worried about cycles in the
>> CS, but there, seqno would equally not help us track progress).
>> 
>> The two danger cases are:
>> an infinite loop in a shader
>> never called MI_BATCH_BUFFER_END
>> 
>> And for both, *I think* ACTHD won't change.
>
> I'm wrong. I was confusing HEAD with ACTHD. ACTHD is supposed to show
> the address of either the batch, or the ring. HEAD is something we could
> use instead though. Just a thought.

Yes, i confirmed this with my test case. ACTHD will change inside the 
ring and also inside the batchbuffer.

The reason I chose seqnos is because that is the least hw (gen)
dependant way to measure the if progress is being made.
And such most close of what userspace sees as progress.
If batch doesn't get retired for whatever reason, reset will happen.

This way we can detect hangs, looped batchbuffers and infinite,
or more precisely too long loops, inside shaders.

I will need to update the commit message to reflect
the fact that with this patch, the batch needs to complete
in 3 second time window.

-Mika


>> 
>> I believe what the patch really wants is to stop using instdone, but
>> continue to use ACTHD as a fallback.
>> 
>> I think if you can keep ACTHD around, I'd be willing to r-b this.
>> 
>> Discussion moved to IRC.
>> 
>> 
>> [snip]
>> -- 
>> Ben Widawsky, Intel Open Source Technology Center
>> _______________________________________________
>> Intel-gfx mailing list
>> Intel-gfx@lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
>
> -- 
> Ben Widawsky, Intel Open Source Technology Center
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 3c85c1b..0404116 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -816,8 +816,6 @@  struct i915_gpu_error {
 #define DRM_I915_HANGCHECK_JIFFIES msecs_to_jiffies(DRM_I915_HANGCHECK_PERIOD)
 	struct timer_list hangcheck_timer;
 	int hangcheck_count;
-	uint32_t last_acthd[I915_NUM_RINGS];
-	uint32_t prev_instdone[I915_NUM_INSTDONE_REG];
 
 	/* For reset and error_state handling. */
 	spinlock_t lock;
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 45c83fb..0363871 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1956,22 +1956,19 @@  void i915_hangcheck_elapsed(unsigned long data)
 {
 	struct drm_device *dev = (struct drm_device *)data;
 	drm_i915_private_t *dev_priv = dev->dev_private;
-	uint32_t acthd[I915_NUM_RINGS], instdone[I915_NUM_INSTDONE_REG];
 	struct intel_ring_buffer *ring;
 	bool err = false, idle;
 	int i;
+	u32 seqno[I915_NUM_RINGS];
+	bool work_done;
 
 	if (!i915_enable_hangcheck)
 		return;
 
-	memset(acthd, 0, sizeof(acthd));
 	idle = true;
 	for_each_ring(ring, dev_priv, i) {
-		u32 seqno;
-
-		seqno = ring->get_seqno(ring, false);
-		idle &= i915_hangcheck_ring_idle(ring, seqno, &err);
-	    acthd[i] = intel_ring_get_active_head(ring);
+		seqno[i] = ring->get_seqno(ring, false);
+		idle &= i915_hangcheck_ring_idle(ring, seqno[i], &err);
 	}
 
 	/* If all work is done then ACTHD clearly hasn't advanced. */
@@ -1987,20 +1984,19 @@  void i915_hangcheck_elapsed(unsigned long data)
 		return;
 	}
 
-	i915_get_extra_instdone(dev, instdone);
-	if (memcmp(dev_priv->gpu_error.last_acthd, acthd,
-		   sizeof(acthd)) == 0 &&
-	    memcmp(dev_priv->gpu_error.prev_instdone, instdone,
-		   sizeof(instdone)) == 0) {
+	work_done = false;
+	for_each_ring(ring, dev_priv, i) {
+		if (ring->hangcheck.seqno != seqno[i]) {
+			work_done = true;
+			ring->hangcheck.seqno = seqno[i];
+		}
+	}
+
+	if (!work_done) {
 		if (i915_hangcheck_hung(dev))
 			return;
 	} else {
 		dev_priv->gpu_error.hangcheck_count = 0;
-
-		memcpy(dev_priv->gpu_error.last_acthd, acthd,
-		       sizeof(acthd));
-		memcpy(dev_priv->gpu_error.prev_instdone, instdone,
-		       sizeof(instdone));
 	}
 
 repeat:
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index d66208c..844381e 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -37,6 +37,10 @@  struct  intel_hw_status_page {
 #define I915_READ_SYNC_0(ring) I915_READ(RING_SYNC_0((ring)->mmio_base))
 #define I915_READ_SYNC_1(ring) I915_READ(RING_SYNC_1((ring)->mmio_base))
 
+struct intel_ring_hangcheck {
+	u32 seqno;
+};
+
 struct  intel_ring_buffer {
 	const char	*name;
 	enum intel_ring_id {
@@ -137,6 +141,8 @@  struct  intel_ring_buffer {
 	struct i915_hw_context *default_context;
 	struct drm_i915_gem_object *last_context_obj;
 
+	struct intel_ring_hangcheck hangcheck;
+
 	void *private;
 };