diff mbox

[2/2] drm/i915: Capture current context on error

Message ID 1361669410-1569-2-git-send-email-ben@bwidawsk.net (mailing list archive)
State New, archived
Headers show

Commit Message

Ben Widawsky Feb. 24, 2013, 1:30 a.m. UTC
On error, this represents the state of the currently running context at
the time it was loaded.

Unfortunately, since we're hung and can't switch out the context this
may not tell us too much about the most current state of the context,
but does give clues about what has happened since loading.

Thanks to recent doc updates, we have a little more confidence regarding
what is actually in this memory, and perhaps it will help us gain more
insight into certain bugs. AFAICT, the most interesting info is in the
first page. To save space, we only capture the first page. In the
future, we might want to dump more.

Sample of the relevant part of error state:
--- HW Context = 0x01b20000
00000000 :  00000000 1100105f 00002028 ffff0880
00000010 :  0000209c feff4040 000020c0 efdf0080
00000020 :  00002178 00000001 0000217c 00145855
00000030 :  00002310 00000000 00002314 00000000
00000040 :  00002318 00000000 0000231c 00000000

References: https://bugs.freedesktop.org/show_bug.cgi?id=55845
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_debugfs.c | 16 ++++++++++++++++
 drivers/gpu/drm/i915/i915_drv.h     |  1 +
 drivers/gpu/drm/i915/i915_irq.c     |  8 ++++++++
 3 files changed, 25 insertions(+)

Comments

Chris Wilson Feb. 24, 2013, 9:34 a.m. UTC | #1
On Sat, Feb 23, 2013 at 05:30:10PM -0800, Ben Widawsky wrote:
> On error, this represents the state of the currently running context at
> the time it was loaded.
> 
> Unfortunately, since we're hung and can't switch out the context this
> may not tell us too much about the most current state of the context,
> but does give clues about what has happened since loading.
> 
> Thanks to recent doc updates, we have a little more confidence regarding
> what is actually in this memory, and perhaps it will help us gain more
> insight into certain bugs. AFAICT, the most interesting info is in the
> first page. To save space, we only capture the first page. In the
> future, we might want to dump more.
> 
> Sample of the relevant part of error state:
> --- HW Context = 0x01b20000
> 00000000 :  00000000 1100105f 00002028 ffff0880
> 00000010 :  0000209c feff4040 000020c0 efdf0080
> 00000020 :  00002178 00000001 0000217c 00145855
> 00000030 :  00002310 00000000 00002314 00000000
> 00000040 :  00002318 00000000 0000231c 00000000

Presentation looks reasonable, except it will confuse
intel_error_decode as it will match "%x : %x". How about
"[%03x] %08x %08x %08x %08x"?

> 
> References: https://bugs.freedesktop.org/show_bug.cgi?id=55845
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> ---

> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index e95337c..ab88620 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -209,6 +209,7 @@ struct drm_i915_error_state {
>  	u32 pgtbl_er;
>  	u32 ier;
>  	u32 ccid;
> +	struct drm_i915_error_object *ctx_obj;

Put it next to the other pointers; lest we want to start digging holes.

>  	u32 derrmr;
>  	u32 forcewake;
>  	bool waiting[I915_NUM_RINGS];
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index ebaf558..7f7d241 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1321,6 +1321,14 @@ static void i915_capture_error_state(struct drm_device *dev)
>  	error->pgtbl_er = I915_READ(PGTBL_ER);
>  	error->ccid = I915_READ(CCID);
>  
> +	if (error->ccid && !dev_priv->hw_contexts_disabled) {
> +		list_for_each_entry(obj, &dev_priv->mm.active_list, mm_list)

I am doubtful that the active list will hold the object in all cases, as
we only put the context obj onto the active list when switching away.
I'd check the gtt_list to be on the safe side. And ignore what we think
of hw_context_disabled - if the CCID randomly points to one of our
objects, lets attach it.
-Chris
Ben Widawsky Feb. 24, 2013, 9:24 p.m. UTC | #2
On Sun, Feb 24, 2013 at 09:34:30AM +0000, Chris Wilson wrote:
> On Sat, Feb 23, 2013 at 05:30:10PM -0800, Ben Widawsky wrote:
> > On error, this represents the state of the currently running context at
> > the time it was loaded.
> > 
> > Unfortunately, since we're hung and can't switch out the context this
> > may not tell us too much about the most current state of the context,
> > but does give clues about what has happened since loading.
> > 
> > Thanks to recent doc updates, we have a little more confidence regarding
> > what is actually in this memory, and perhaps it will help us gain more
> > insight into certain bugs. AFAICT, the most interesting info is in the
> > first page. To save space, we only capture the first page. In the
> > future, we might want to dump more.
> > 
> > Sample of the relevant part of error state:
> > --- HW Context = 0x01b20000
> > 00000000 :  00000000 1100105f 00002028 ffff0880
> > 00000010 :  0000209c feff4040 000020c0 efdf0080
> > 00000020 :  00002178 00000001 0000217c 00145855
> > 00000030 :  00002310 00000000 00002314 00000000
> > 00000040 :  00002318 00000000 0000231c 00000000
> 
> Presentation looks reasonable, except it will confuse
> intel_error_decode as it will match "%x : %x". How about
> "[%03x] %08x %08x %08x %08x"?

Will %4x be okay? I suspect 1 page may not be sufficient at some point.
Also, it's probably worth modifying intel_error_decode since I suspect
over time we'll want to dump more objects.

> 
> > 
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=55845
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> > ---
> 
> > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> > index e95337c..ab88620 100644
> > --- a/drivers/gpu/drm/i915/i915_drv.h
> > +++ b/drivers/gpu/drm/i915/i915_drv.h
> > @@ -209,6 +209,7 @@ struct drm_i915_error_state {
> >  	u32 pgtbl_er;
> >  	u32 ier;
> >  	u32 ccid;
> > +	struct drm_i915_error_object *ctx_obj;
> 
> Put it next to the other pointers; lest we want to start digging holes.

On that request, I am going to make 1 per ring then. We don't have any
non-ring error objects currently. I've also always been in favor of
having the code more ring independent but that was removed because of
Daniel.

> 
> >  	u32 derrmr;
> >  	u32 forcewake;
> >  	bool waiting[I915_NUM_RINGS];
> > diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> > index ebaf558..7f7d241 100644
> > --- a/drivers/gpu/drm/i915/i915_irq.c
> > +++ b/drivers/gpu/drm/i915/i915_irq.c
> > @@ -1321,6 +1321,14 @@ static void i915_capture_error_state(struct drm_device *dev)
> >  	error->pgtbl_er = I915_READ(PGTBL_ER);
> >  	error->ccid = I915_READ(CCID);
> >  
> > +	if (error->ccid && !dev_priv->hw_contexts_disabled) {
> > +		list_for_each_entry(obj, &dev_priv->mm.active_list, mm_list)
> 
> I am doubtful that the active list will hold the object in all cases, as
> we only put the context obj onto the active list when switching away.
> I'd check the gtt_list to be on the safe side.

Hmm, that shouldn't have been how the code worked when I wrote it, and
I'm too lazy to look if it's changed. The context should be made active
on the switch to it. If we use rc6, and don't do that, that's really bad
because rc6 will automatically save/restore the context (though as long
as it remained pinned we're safe).  In any case, gtt_list is fine with
me, though I still believe if it's not on the active list, it's an
error.

> And ignore what we think of hw_context_disabled - if the CCID randomly
> points to one of our objects, lets attach it.

I can switch to HAS_HW_CONTEXTS for this. I'm really hoping we never see
CCID != 0 when not using contexts though.

> -Chris

> -- 
> Chris Wilson, Intel Open Source Technology Centre
Chris Wilson Feb. 24, 2013, 9:40 p.m. UTC | #3
On Sun, Feb 24, 2013 at 01:24:59PM -0800, Ben Widawsky wrote:
> On Sun, Feb 24, 2013 at 09:34:30AM +0000, Chris Wilson wrote:
> > On Sat, Feb 23, 2013 at 05:30:10PM -0800, Ben Widawsky wrote:
> > > On error, this represents the state of the currently running context at
> > > the time it was loaded.
> > > 
> > > Unfortunately, since we're hung and can't switch out the context this
> > > may not tell us too much about the most current state of the context,
> > > but does give clues about what has happened since loading.
> > > 
> > > Thanks to recent doc updates, we have a little more confidence regarding
> > > what is actually in this memory, and perhaps it will help us gain more
> > > insight into certain bugs. AFAICT, the most interesting info is in the
> > > first page. To save space, we only capture the first page. In the
> > > future, we might want to dump more.
> > > 
> > > Sample of the relevant part of error state:
> > > --- HW Context = 0x01b20000
> > > 00000000 :  00000000 1100105f 00002028 ffff0880
> > > 00000010 :  0000209c feff4040 000020c0 efdf0080
> > > 00000020 :  00002178 00000001 0000217c 00145855
> > > 00000030 :  00002310 00000000 00002314 00000000
> > > 00000040 :  00002318 00000000 0000231c 00000000
> > 
> > Presentation looks reasonable, except it will confuse
> > intel_error_decode as it will match "%x : %x". How about
> > "[%03x] %08x %08x %08x %08x"?
> 
> Will %4x be okay? I suspect 1 page may not be sufficient at some point.
> Also, it's probably worth modifying intel_error_decode since I suspect
> over time we'll want to dump more objects.

%04x is fine. I was simply trimming a few zeroes for brevity. Using []
was the key to making it distinct enough for intel_error_decode was
still being comprehensible for us.
 
> > > References: https://bugs.freedesktop.org/show_bug.cgi?id=55845
> > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> > > ---
> > 
> > > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> > > index e95337c..ab88620 100644
> > > --- a/drivers/gpu/drm/i915/i915_drv.h
> > > +++ b/drivers/gpu/drm/i915/i915_drv.h
> > > @@ -209,6 +209,7 @@ struct drm_i915_error_state {
> > >  	u32 pgtbl_er;
> > >  	u32 ier;
> > >  	u32 ccid;
> > > +	struct drm_i915_error_object *ctx_obj;
> > 
> > Put it next to the other pointers; lest we want to start digging holes.
> 
> On that request, I am going to make 1 per ring then. We don't have any
> non-ring error objects currently. I've also always been in favor of
> having the code more ring independent but that was removed because of
> Daniel.

Yes. I also think of contexts as being a far more generic constructs than
just the current automagic hw save/restore.

> > >  	u32 derrmr;
> > >  	u32 forcewake;
> > >  	bool waiting[I915_NUM_RINGS];
> > > diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> > > index ebaf558..7f7d241 100644
> > > --- a/drivers/gpu/drm/i915/i915_irq.c
> > > +++ b/drivers/gpu/drm/i915/i915_irq.c
> > > @@ -1321,6 +1321,14 @@ static void i915_capture_error_state(struct drm_device *dev)
> > >  	error->pgtbl_er = I915_READ(PGTBL_ER);
> > >  	error->ccid = I915_READ(CCID);
> > >  
> > > +	if (error->ccid && !dev_priv->hw_contexts_disabled) {
> > > +		list_for_each_entry(obj, &dev_priv->mm.active_list, mm_list)
> > 
> > I am doubtful that the active list will hold the object in all cases, as
> > we only put the context obj onto the active list when switching away.
> > I'd check the gtt_list to be on the safe side.
> 
> Hmm, that shouldn't have been how the code worked when I wrote it, and
> I'm too lazy to look if it's changed. The context should be made active
> on the switch to it. If we use rc6, and don't do that, that's really bad
> because rc6 will automatically save/restore the context (though as long
> as it remained pinned we're safe).  In any case, gtt_list is fine with
> me, though I still believe if it's not on the active list, it's an
> error.

Hmm, the code we ended up with only did move_to_active() when switching
away in order for it to remain pinned until after the MI_SET_CONTEXT had
been processed. Before then it is kept alive through explicit pinning
and referencing from the ring.
 
> > And ignore what we think of hw_context_disabled - if the CCID randomly
> > points to one of our objects, lets attach it.
> 
> I can switch to HAS_HW_CONTEXTS for this. I'm really hoping we never see
> CCID != 0 when not using contexts though.

Complete agree. I also dread the day we see random CCID values. But
stanger things have happened.
-Chris
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 7c65ab8..b71d0ac 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -774,6 +774,22 @@  static int i915_error_state(struct seq_file *m, void *unused)
 		}
 	}
 
+	if (error->ctx_obj) {
+		struct drm_i915_error_object *obj = error->ctx_obj;
+		seq_printf(m, "--- HW Context = 0x%08x\n",
+			   obj->gtt_offset);
+		offset = 0;
+		for (elt = 0; elt < PAGE_SIZE/16; elt+=4) {
+			seq_printf(m, "%08x :  %08x %08x %08x %08x\n",
+				   offset,
+				   obj->pages[0][elt],
+				   obj->pages[0][elt+1],
+				   obj->pages[0][elt+2],
+				   obj->pages[0][elt+3]);
+				offset += 16;
+		}
+	}
+
 	if (error->overlay)
 		intel_overlay_print_error_state(m, error->overlay);
 
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index e95337c..ab88620 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -209,6 +209,7 @@  struct drm_i915_error_state {
 	u32 pgtbl_er;
 	u32 ier;
 	u32 ccid;
+	struct drm_i915_error_object *ctx_obj;
 	u32 derrmr;
 	u32 forcewake;
 	bool waiting[I915_NUM_RINGS];
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index ebaf558..7f7d241 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1321,6 +1321,14 @@  static void i915_capture_error_state(struct drm_device *dev)
 	error->pgtbl_er = I915_READ(PGTBL_ER);
 	error->ccid = I915_READ(CCID);
 
+	if (error->ccid && !dev_priv->hw_contexts_disabled) {
+		list_for_each_entry(obj, &dev_priv->mm.active_list, mm_list)
+			if ((error->ccid & PAGE_MASK) == obj->gtt_offset)
+				error->ctx_obj =
+					i915_error_object_create_sized(dev_priv,
+								       obj, 1);
+	}
+
 	if (HAS_PCH_SPLIT(dev))
 		error->ier = I915_READ(DEIER) | I915_READ(GTIER);
 	else if (IS_VALLEYVIEW(dev))