diff mbox

[07/10] drm/i915: print Gen5+ CPU poison interrupts

Message ID 1360352121-3989-8-git-send-email-przanoni@gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Paulo Zanoni Feb. 8, 2013, 7:35 p.m. UTC
From: Paulo Zanoni <paulo.r.zanoni@intel.com>

On ILK/SNB all we need to do is to enable the "poison" bit, but on
IVB/HSW we need to enable the CPU error interrupt register, which is
responsible not only for poison interrupts, but also other things.
This includes the "unclaimed register" interrupt, so on the IVB irq
handler we now need to: (i) check whether the interrupt was triggered by an
unclaimed register and (ii) mask the error interrupt bit so we don't
risk generating "unclaimed register" interrupts form inside the
interrupt handler.

Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c |   49 ++++++++++++++++++++++++++++++++++++---
 drivers/gpu/drm/i915/i915_reg.h |    5 ++--
 2 files changed, 49 insertions(+), 5 deletions(-)

Comments

Jesse Barnes Feb. 8, 2013, 7:42 p.m. UTC | #1
On Fri,  8 Feb 2013 17:35:18 -0200
Paulo Zanoni <przanoni@gmail.com> wrote:

> From: Paulo Zanoni <paulo.r.zanoni@intel.com>
> 
> On ILK/SNB all we need to do is to enable the "poison" bit, but on
> IVB/HSW we need to enable the CPU error interrupt register, which is
> responsible not only for poison interrupts, but also other things.
> This includes the "unclaimed register" interrupt, so on the IVB irq
> handler we now need to: (i) check whether the interrupt was triggered by an
> unclaimed register and (ii) mask the error interrupt bit so we don't
> risk generating "unclaimed register" interrupts form inside the
> interrupt handler.
> 
> Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
> ---

OTOH there's nothing the user can do about it... so we might do a
WARN_ONCE or something here instead.  But even then, I'm not sure
there's much *we* can do about these, as they indicate a corruption in
the communication between the CPU and PCH.
Paulo Zanoni Feb. 8, 2013, 7:54 p.m. UTC | #2
Hi

2013/2/8 Jesse Barnes <jbarnes@virtuousgeek.org>:
> On Fri,  8 Feb 2013 17:35:18 -0200
> Paulo Zanoni <przanoni@gmail.com> wrote:
>
>> From: Paulo Zanoni <paulo.r.zanoni@intel.com>
>>
>> On ILK/SNB all we need to do is to enable the "poison" bit, but on
>> IVB/HSW we need to enable the CPU error interrupt register, which is
>> responsible not only for poison interrupts, but also other things.
>> This includes the "unclaimed register" interrupt, so on the IVB irq
>> handler we now need to: (i) check whether the interrupt was triggered by an
>> unclaimed register and (ii) mask the error interrupt bit so we don't
>> risk generating "unclaimed register" interrupts form inside the
>> interrupt handler.
>>
>> Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
>> ---
>
> OTOH there's nothing the user can do about it... so we might do a
> WARN_ONCE or something here instead.

Well, so far I haven't seen the message. If we conclude it happens
*too much*, then we can use WARN_ONCE.

> But even then, I'm not sure
> there's much *we* can do about these, as they indicate a corruption in
> the communication between the CPU and PCH.

At least if we get the message we may be able to understand and/or
reproduce the problems. So far we don't even know whether the problem
is happening or not... And when there's a display bug, we don't know
if it's caused by "poison".

>
> --
> Jesse Barnes, Intel Open Source Technology Center
Jesse Barnes Feb. 8, 2013, 8:01 p.m. UTC | #3
On Fri, 8 Feb 2013 17:54:23 -0200
Paulo Zanoni <przanoni@gmail.com> wrote:

> Hi
> 
> 2013/2/8 Jesse Barnes <jbarnes@virtuousgeek.org>:
> > On Fri,  8 Feb 2013 17:35:18 -0200
> > Paulo Zanoni <przanoni@gmail.com> wrote:
> >
> >> From: Paulo Zanoni <paulo.r.zanoni@intel.com>
> >>
> >> On ILK/SNB all we need to do is to enable the "poison" bit, but on
> >> IVB/HSW we need to enable the CPU error interrupt register, which is
> >> responsible not only for poison interrupts, but also other things.
> >> This includes the "unclaimed register" interrupt, so on the IVB irq
> >> handler we now need to: (i) check whether the interrupt was triggered by an
> >> unclaimed register and (ii) mask the error interrupt bit so we don't
> >> risk generating "unclaimed register" interrupts form inside the
> >> interrupt handler.
> >>
> >> Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
> >> ---
> >
> > OTOH there's nothing the user can do about it... so we might do a
> > WARN_ONCE or something here instead.
> 
> Well, so far I haven't seen the message. If we conclude it happens
> *too much*, then we can use WARN_ONCE.
> 
> > But even then, I'm not sure
> > there's much *we* can do about these, as they indicate a corruption in
> > the communication between the CPU and PCH.
> 
> At least if we get the message we may be able to understand and/or
> reproduce the problems. So far we don't even know whether the problem
> is happening or not... And when there's a display bug, we don't know
> if it's caused by "poison".

Ok I guess the DRM_ERROR won't hurt if/until we see reports.  Then we
can dig in and see if keeping the message makes sense or not.
Ben Widawsky Feb. 9, 2013, 5:30 p.m. UTC | #4
On Fri, 8 Feb 2013 11:42:39 -0800
Jesse Barnes <jbarnes@virtuousgeek.org> wrote:

> On Fri,  8 Feb 2013 17:35:18 -0200
> Paulo Zanoni <przanoni@gmail.com> wrote:
> 
> > From: Paulo Zanoni <paulo.r.zanoni@intel.com>
> > 
> > On ILK/SNB all we need to do is to enable the "poison" bit, but on
> > IVB/HSW we need to enable the CPU error interrupt register, which is
> > responsible not only for poison interrupts, but also other things.
> > This includes the "unclaimed register" interrupt, so on the IVB irq
> > handler we now need to: (i) check whether the interrupt was
> > triggered by an unclaimed register and (ii) mask the error
> > interrupt bit so we don't risk generating "unclaimed register"
> > interrupts form inside the interrupt handler.
> > 
> > Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
> > ---
> 
> OTOH there's nothing the user can do about it... so we might do a
> WARN_ONCE or something here instead.  But even then, I'm not sure
> there's much *we* can do about these, as they indicate a corruption in
> the communication between the CPU and PCH.
> 

I agree with Jesse. I wouldn't bother with these. Even a WARN_ONCE
isn't helpful since the backtrace wouldn't really be meaningful.

If OTOH, you wanted to save away this information into error state; I
could get behind that.
Paulo Zanoni Feb. 14, 2013, 8:35 p.m. UTC | #5
Hi

2013/2/9 Ben Widawsky <ben@bwidawsk.net>:
> On Fri, 8 Feb 2013 11:42:39 -0800
> Jesse Barnes <jbarnes@virtuousgeek.org> wrote:
>
>> On Fri,  8 Feb 2013 17:35:18 -0200
>> Paulo Zanoni <przanoni@gmail.com> wrote:
>>
>> > From: Paulo Zanoni <paulo.r.zanoni@intel.com>
>> >
>> > On ILK/SNB all we need to do is to enable the "poison" bit, but on
>> > IVB/HSW we need to enable the CPU error interrupt register, which is
>> > responsible not only for poison interrupts, but also other things.
>> > This includes the "unclaimed register" interrupt, so on the IVB irq
>> > handler we now need to: (i) check whether the interrupt was
>> > triggered by an unclaimed register and (ii) mask the error
>> > interrupt bit so we don't risk generating "unclaimed register"
>> > interrupts form inside the interrupt handler.
>> >
>> > Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
>> > ---
>>
>> OTOH there's nothing the user can do about it... so we might do a
>> WARN_ONCE or something here instead.  But even then, I'm not sure
>> there's much *we* can do about these, as they indicate a corruption in
>> the communication between the CPU and PCH.
>>
>
> I agree with Jesse. I wouldn't bother with these. Even a WARN_ONCE
> isn't helpful since the backtrace wouldn't really be meaningful.

Why isn't it helpful? Right now we don't even know whether this
problem happens or not, we're completely "blind" to a possible problem
that may be affecting us in some specific cases and we don't even
know. Knowing that it happens and how often it happens is IMHO
certainly better than closing our eyes and pretending it doesn't
exist.

>
> If OTOH, you wanted to save away this information into error state; I
> could get behind that.
Ben Widawsky Feb. 15, 2013, 4:05 a.m. UTC | #6
On Thu, 14 Feb 2013 18:35:32 -0200
Paulo Zanoni <przanoni@gmail.com> wrote:

> Hi
> 
> 2013/2/9 Ben Widawsky <ben@bwidawsk.net>:
> > On Fri, 8 Feb 2013 11:42:39 -0800
> > Jesse Barnes <jbarnes@virtuousgeek.org> wrote:
> >
> >> On Fri,  8 Feb 2013 17:35:18 -0200
> >> Paulo Zanoni <przanoni@gmail.com> wrote:
> >>
> >> > From: Paulo Zanoni <paulo.r.zanoni@intel.com>
> >> >
> >> > On ILK/SNB all we need to do is to enable the "poison" bit, but
> >> > on IVB/HSW we need to enable the CPU error interrupt register,
> >> > which is responsible not only for poison interrupts, but also
> >> > other things. This includes the "unclaimed register" interrupt,
> >> > so on the IVB irq handler we now need to: (i) check whether the
> >> > interrupt was triggered by an unclaimed register and (ii) mask
> >> > the error interrupt bit so we don't risk generating "unclaimed
> >> > register" interrupts form inside the interrupt handler.
> >> >
> >> > Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
> >> > ---
> >>
> >> OTOH there's nothing the user can do about it... so we might do a
> >> WARN_ONCE or something here instead.  But even then, I'm not sure
> >> there's much *we* can do about these, as they indicate a
> >> corruption in the communication between the CPU and PCH.
> >>
> >
> > I agree with Jesse. I wouldn't bother with these. Even a WARN_ONCE
> > isn't helpful since the backtrace wouldn't really be meaningful.
> 
> Why isn't it helpful? Right now we don't even know whether this
> problem happens or not, we're completely "blind" to a possible problem
> that may be affecting us in some specific cases and we don't even
> know. Knowing that it happens and how often it happens is IMHO
> certainly better than closing our eyes and pretending it doesn't
> exist.
> 

I suppose you're right. I'm strongly of the opinion that we won't
ever see this error because the system will crap out before we'd be
able to get that info - of course I cannot prove that, and I don't
know enough about what exactly poison means. I just think it sucks that
we have yet another gen specific thing which has TBD value. I certainly
won't nak it, and of course if it proves useful, I'll be most
apologetic.

As for the WARN being unhelpful, it's the same problem again. You're
getting the notifications via interrupt, so a backtrace is useless on
IVB/HSW. Perhaps it makes sense on ILK/SNB. Reinventing a "do this
once" macro isn't worthwhile either, so I guess WARN_ON with the
assumption that we ignore the backtrace is fine on IVB/HSW.

> >
> > If OTOH, you wanted to save away this information into error state;
> > I could get behind that.
> 
> 
> 
>
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 10aec0e..703a08a 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -665,6 +665,17 @@  static void ibx_irq_handler(struct drm_device *dev, u32 pch_iir)
 		DRM_DEBUG_DRIVER("PCH transcoder A underrun interrupt\n");
 }
 
+static void err_int_handler(struct drm_device *dev)
+{
+	drm_i915_private_t *dev_priv = (drm_i915_private_t *) dev->dev_private;
+	u32 err_int = I915_READ(GEN7_ERR_INT);
+
+	if (err_int & ERR_INT_POISON)
+		DRM_ERROR("Poison interrupt\n");
+
+	I915_WRITE(GEN7_ERR_INT, err_int);
+}
+
 static void serr_int_handler(struct drm_device *dev)
 {
 	drm_i915_private_t *dev_priv = (drm_i915_private_t *) dev->dev_private;
@@ -715,16 +726,33 @@  static irqreturn_t ivybridge_irq_handler(int irq, void *arg)
 {
 	struct drm_device *dev = (struct drm_device *) arg;
 	drm_i915_private_t *dev_priv = (drm_i915_private_t *) dev->dev_private;
-	u32 de_iir, gt_iir, de_ier, pm_iir, sde_ier;
+	u32 de_iir, gt_iir, de_ier, pm_iir, sde_ier, de_imr;
 	irqreturn_t ret = IRQ_NONE;
 	int i;
 
 	atomic_inc(&dev_priv->irq_received);
 
+	/* We get interrupts on unclaimed registers, so check for this before we
+	 * do any I915_{READ,WRITE}. */
+	if (IS_HASWELL(dev) &&
+	    (I915_READ_NOTRACE(FPGA_DBG) & FPGA_DBG_RM_NOCLAIM)) {
+		DRM_ERROR("Unclaimed register before interrupt\n");
+		I915_WRITE_NOTRACE(FPGA_DBG, FPGA_DBG_RM_NOCLAIM);
+	}
+
 	/* disable master interrupt before clearing iir  */
 	de_ier = I915_READ(DEIER);
 	I915_WRITE(DEIER, de_ier & ~DE_MASTER_IRQ_CONTROL);
 
+	/* On Haswell, also mask ERR_INT because we don't want to risk
+	 * generating "unclaimed register" interrupts from inside the interrupt
+	 * handler. */
+	de_imr = I915_READ(DEIMR);
+	if (IS_HASWELL(dev)) {
+		I915_WRITE(DEIMR, de_imr | DE_ERR_INT_IVB);
+		POSTING_READ(DEIMR);
+	}
+
 	sde_ier = I915_READ(SDEIER);
 	I915_WRITE(SDEIER, 0);
 	POSTING_READ(SDEIER);
@@ -738,6 +766,9 @@  static irqreturn_t ivybridge_irq_handler(int irq, void *arg)
 
 	de_iir = I915_READ(DEIIR);
 	if (de_iir) {
+		if (de_iir & DE_ERR_INT_IVB)
+			err_int_handler(dev);
+
 		if (de_iir & DE_AUX_CHANNEL_A_IVB)
 			dp_aux_irq_handler(dev);
 
@@ -775,6 +806,11 @@  static irqreturn_t ivybridge_irq_handler(int irq, void *arg)
 		ret = IRQ_HANDLED;
 	}
 
+	if (IS_HASWELL(dev)) {
+		I915_WRITE(DEIMR, de_imr);
+		POSTING_READ(DEIMR);
+	}
+
 	I915_WRITE(DEIER, de_ier);
 	POSTING_READ(DEIER);
 	I915_WRITE(SDEIER, sde_ier);
@@ -837,6 +873,9 @@  static irqreturn_t ironlake_irq_handler(int irq, void *arg)
 	if (de_iir & DE_PIPEB_VBLANK)
 		drm_handle_vblank(dev, 1);
 
+	if (de_iir & DE_POISON)
+		DRM_ERROR("Poison interrupt\n");
+
 	if (de_iir & DE_PLANEA_FLIP_DONE) {
 		intel_prepare_page_flip(dev, 0);
 		intel_finish_page_flip_plane(dev, 0);
@@ -1985,7 +2024,7 @@  static int ironlake_irq_postinstall(struct drm_device *dev)
 	/* enable kind of interrupts always enabled */
 	u32 display_mask = DE_MASTER_IRQ_CONTROL | DE_GSE | DE_PCH_EVENT |
 			   DE_PLANEA_FLIP_DONE | DE_PLANEB_FLIP_DONE |
-			   DE_AUX_CHANNEL_A;
+			   DE_AUX_CHANNEL_A | DE_POISON;
 	u32 render_irqs;
 
 	dev_priv->irq_mask = ~display_mask;
@@ -2035,12 +2074,14 @@  static int ivybridge_irq_postinstall(struct drm_device *dev)
 		DE_PLANEC_FLIP_DONE_IVB |
 		DE_PLANEB_FLIP_DONE_IVB |
 		DE_PLANEA_FLIP_DONE_IVB |
-		DE_AUX_CHANNEL_A_IVB;
+		DE_AUX_CHANNEL_A_IVB |
+		DE_ERR_INT_IVB;
 	u32 render_irqs;
 
 	dev_priv->irq_mask = ~display_mask;
 
 	/* should always can generate irq */
+	I915_WRITE(GEN7_ERR_INT, I915_READ(GEN7_ERR_INT));
 	I915_WRITE(DEIIR, I915_READ(DEIIR));
 	I915_WRITE(DEIMR, dev_priv->irq_mask);
 	I915_WRITE(DEIER,
@@ -2191,6 +2232,8 @@  static void ironlake_irq_uninstall(struct drm_device *dev)
 	I915_WRITE(DEIMR, 0xffffffff);
 	I915_WRITE(DEIER, 0x0);
 	I915_WRITE(DEIIR, I915_READ(DEIIR));
+	if (IS_GEN7(dev))
+		I915_WRITE(GEN7_ERR_INT, I915_READ(GEN7_ERR_INT));
 
 	I915_WRITE(GTIMR, 0xffffffff);
 	I915_WRITE(GTIER, 0x0);
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 9cd59f7..f22e27d 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -519,7 +519,8 @@ 
 
 #define ERROR_GEN6	0x040a0
 #define GEN7_ERR_INT	0x44040
-#define   ERR_INT_MMIO_UNCLAIMED (1<<13)
+#define   ERR_INT_POISON		(1<<31)
+#define   ERR_INT_MMIO_UNCLAIMED	(1<<13)
 
 #define FPGA_DBG		0x42300
 #define FPGA_DBG_RM_NOCLAIM	(1<<31)
@@ -3378,7 +3379,7 @@ 
 #define DE_PIPEA_FIFO_UNDERRUN  (1 << 0)
 
 /* More Ivybridge lolz */
-#define DE_ERR_DEBUG_IVB		(1<<30)
+#define DE_ERR_INT_IVB			(1<<30)
 #define DE_GSE_IVB			(1<<29)
 #define DE_PCH_EVENT_IVB		(1<<28)
 #define DE_DP_A_HOTPLUG_IVB		(1<<27)