diff mbox

[2/2] drm/i915: shut up gen8+ SDE irq dmesg noise

Message ID 1445590572-23631-2-git-send-email-daniel.vetter@ffwll.ch (mailing list archive)
State New, archived
Headers show

Commit Message

Daniel Vetter Oct. 23, 2015, 8:56 a.m. UTC
We get tons of cases where the master interrupt handler apparently set
a bit, with the SDEIIR agreeing. No idea what's going on there, but
it's consistent on gen8+, no one seems to care about it and it's
making CI results flaky.

Shut it up.

No idea what's going on here, but we've had fun with PCH interrupts
before:

commit 44498aea293b37af1d463acd9658cdce1ecdf427
Author: Paulo Zanoni <paulo.r.zanoni@intel.com>
Date:   Fri Feb 22 17:05:28 2013 -0300

    drm/i915: also disable south interrupts when handling them

Cc: Paulo Zanoni <paulo.r.zanoni@intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

Comments

Chris Wilson Oct. 23, 2015, 9:03 a.m. UTC | #1
On Fri, Oct 23, 2015 at 10:56:12AM +0200, Daniel Vetter wrote:
> We get tons of cases where the master interrupt handler apparently set
> a bit, with the SDEIIR agreeing. No idea what's going on there, but
> it's consistent on gen8+, no one seems to care about it and it's
> making CI results flaky.

Just delete the message and delete them all. There isn't anything we can
do and if anybody actually cared (which apparently they didn't in the
first place), they could just trace the mmio.
-Chris
Daniel Vetter Oct. 23, 2015, 9:14 a.m. UTC | #2
On Fri, Oct 23, 2015 at 10:03:50AM +0100, Chris Wilson wrote:
> On Fri, Oct 23, 2015 at 10:56:12AM +0200, Daniel Vetter wrote:
> > We get tons of cases where the master interrupt handler apparently set
> > a bit, with the SDEIIR agreeing. No idea what's going on there, but
> > it's consistent on gen8+, no one seems to care about it and it's
> > making CI results flaky.
> 
> Just delete the message and delete them all. There isn't anything we can
> do and if anybody actually cared (which apparently they didn't in the
> first place), they could just trace the mmio.

The non-SDE ones don't fire and I think are useful. And in case we have
another report from users that gmbus is not reliably working with their
touchpad (this is how we discovered the original pch irq issues) I think
finding some breadcrumbs in dmesg would be useful. The SDE one happens
rarely enough that I don't think it should be a performance issue, ever.

Hence why I decided to keep them.
-Daniel
Jani Nikula Oct. 23, 2015, 9:21 a.m. UTC | #3
On Fri, 23 Oct 2015, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Fri, Oct 23, 2015 at 10:56:12AM +0200, Daniel Vetter wrote:
>> We get tons of cases where the master interrupt handler apparently set
>> a bit, with the SDEIIR agreeing. No idea what's going on there, but
>> it's consistent on gen8+, no one seems to care about it and it's
>> making CI results flaky.
>
> Just delete the message and delete them all. There isn't anything we can
> do and if anybody actually cared (which apparently they didn't in the
> first place), they could just trace the mmio.

Except this one is a regression, introduced by a bisected commit, and
suspiciously the errors pop up during aux transfers.

https://bugs.freedesktop.org/show_bug.cgi?id=92084

BR,
Jani.
Daniel Vetter Oct. 23, 2015, 9:23 a.m. UTC | #4
On Fri, Oct 23, 2015 at 12:21:37PM +0300, Jani Nikula wrote:
> On Fri, 23 Oct 2015, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > On Fri, Oct 23, 2015 at 10:56:12AM +0200, Daniel Vetter wrote:
> >> We get tons of cases where the master interrupt handler apparently set
> >> a bit, with the SDEIIR agreeing. No idea what's going on there, but
> >> it's consistent on gen8+, no one seems to care about it and it's
> >> making CI results flaky.
> >
> > Just delete the message and delete them all. There isn't anything we can
> > do and if anybody actually cared (which apparently they didn't in the
> > first place), they could just trace the mmio.
> 
> Except this one is a regression, introduced by a bisected commit, and
> suspiciously the errors pop up during aux transfers.

dp aux is a red herring very likely, since it's just the source of a _lot_
of sde interrupts.

> https://bugs.freedesktop.org/show_bug.cgi?id=92084

No one demonstrated any bad side-effects of this, let's shut it up (but
keep the breadcrumb in debug logs in case) and move on to other bugs. We
have enough.
-Daniel
Ville Syrjälä Oct. 23, 2015, 1:33 p.m. UTC | #5
On Fri, Oct 23, 2015 at 11:23:12AM +0200, Daniel Vetter wrote:
> On Fri, Oct 23, 2015 at 12:21:37PM +0300, Jani Nikula wrote:
> > On Fri, 23 Oct 2015, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > > On Fri, Oct 23, 2015 at 10:56:12AM +0200, Daniel Vetter wrote:
> > >> We get tons of cases where the master interrupt handler apparently set
> > >> a bit, with the SDEIIR agreeing. No idea what's going on there, but
> > >> it's consistent on gen8+, no one seems to care about it and it's
> > >> making CI results flaky.
> > >
> > > Just delete the message and delete them all. There isn't anything we can
> > > do and if anybody actually cared (which apparently they didn't in the
> > > first place), they could just trace the mmio.
> > 
> > Except this one is a regression, introduced by a bisected commit, and
> > suspiciously the errors pop up during aux transfers.
> 
> dp aux is a red herring very likely, since it's just the source of a _lot_
> of sde interrupts.
> 
> > https://bugs.freedesktop.org/show_bug.cgi?id=92084
> 
> No one demonstrated any bad side-effects of this, let's shut it up (but
> keep the breadcrumb in debug logs in case) and move on to other bugs. We
> have enough.

I was still waiting for an answer to my latest idea how to avoid the
error. Would have been a very simple thing to test for anyone with the
hardware.
Chris Wilson Oct. 23, 2015, 1:40 p.m. UTC | #6
On Fri, Oct 23, 2015 at 04:33:52PM +0300, Ville Syrjälä wrote:
> On Fri, Oct 23, 2015 at 11:23:12AM +0200, Daniel Vetter wrote:
> > On Fri, Oct 23, 2015 at 12:21:37PM +0300, Jani Nikula wrote:
> > > On Fri, 23 Oct 2015, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > > > On Fri, Oct 23, 2015 at 10:56:12AM +0200, Daniel Vetter wrote:
> > > >> We get tons of cases where the master interrupt handler apparently set
> > > >> a bit, with the SDEIIR agreeing. No idea what's going on there, but
> > > >> it's consistent on gen8+, no one seems to care about it and it's
> > > >> making CI results flaky.
> > > >
> > > > Just delete the message and delete them all. There isn't anything we can
> > > > do and if anybody actually cared (which apparently they didn't in the
> > > > first place), they could just trace the mmio.
> > > 
> > > Except this one is a regression, introduced by a bisected commit, and
> > > suspiciously the errors pop up during aux transfers.
> > 
> > dp aux is a red herring very likely, since it's just the source of a _lot_
> > of sde interrupts.
> > 
> > > https://bugs.freedesktop.org/show_bug.cgi?id=92084
> > 
> > No one demonstrated any bad side-effects of this, let's shut it up (but
> > keep the breadcrumb in debug logs in case) and move on to other bugs. We
> > have enough.
> 
> I was still waiting for an answer to my latest idea how to avoid the
> error. Would have been a very simple thing to test for anyone with the
> hardware.

Presumably,

 --- a/drivers/gpu/drm/i915/i915_irq.c
 +++ b/drivers/gpu/drm/i915/i915_irq.c
 @@ -2345,6 +2345,7 @@ static irqreturn_t gen8_irq_handler(int irq, void *arg)
                 u32 pch_iir = I915_READ(SDEIIR);
                 if (pch_iir) {
                         I915_WRITE(SDEIIR, pch_iir);
 +                       POSTING_READ(SDEIIR);
                         ret = IRQ_HANDLED;

?
-Chris
Ville Syrjälä Oct. 23, 2015, 1:47 p.m. UTC | #7
On Fri, Oct 23, 2015 at 02:40:31PM +0100, Chris Wilson wrote:
> On Fri, Oct 23, 2015 at 04:33:52PM +0300, Ville Syrjälä wrote:
> > On Fri, Oct 23, 2015 at 11:23:12AM +0200, Daniel Vetter wrote:
> > > On Fri, Oct 23, 2015 at 12:21:37PM +0300, Jani Nikula wrote:
> > > > On Fri, 23 Oct 2015, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > > > > On Fri, Oct 23, 2015 at 10:56:12AM +0200, Daniel Vetter wrote:
> > > > >> We get tons of cases where the master interrupt handler apparently set
> > > > >> a bit, with the SDEIIR agreeing. No idea what's going on there, but
> > > > >> it's consistent on gen8+, no one seems to care about it and it's
> > > > >> making CI results flaky.
> > > > >
> > > > > Just delete the message and delete them all. There isn't anything we can
> > > > > do and if anybody actually cared (which apparently they didn't in the
> > > > > first place), they could just trace the mmio.
> > > > 
> > > > Except this one is a regression, introduced by a bisected commit, and
> > > > suspiciously the errors pop up during aux transfers.
> > > 
> > > dp aux is a red herring very likely, since it's just the source of a _lot_
> > > of sde interrupts.
> > > 
> > > > https://bugs.freedesktop.org/show_bug.cgi?id=92084
> > > 
> > > No one demonstrated any bad side-effects of this, let's shut it up (but
> > > keep the breadcrumb in debug logs in case) and move on to other bugs. We
> > > have enough.
> > 
> > I was still waiting for an answer to my latest idea how to avoid the
> > error. Would have been a very simple thing to test for anyone with the
> > hardware.
> 
> Presumably,
> 
>  --- a/drivers/gpu/drm/i915/i915_irq.c
>  +++ b/drivers/gpu/drm/i915/i915_irq.c
>  @@ -2345,6 +2345,7 @@ static irqreturn_t gen8_irq_handler(int irq, void *arg)
>                  u32 pch_iir = I915_READ(SDEIIR);
>                  if (pch_iir) {
>                          I915_WRITE(SDEIIR, pch_iir);
>  +                       POSTING_READ(SDEIIR);
>                          ret = IRQ_HANDLED;
> 
> ?

No, we already tried that and it didn't help.

What I was thinking is we'd just do a nop write to the
PCH_PORT_HOTPLUG register like so:

dig_hotplug_reg = I915_READ(PCH_PORT_HOTPLUG);
if (!hotplug_trigger)
	dig_hotplug_reg &= ~(*_HOTPLUG_STATUS_MASK);
I915_WRITE(PCH_PORT_HOTPLUG, dig_hotplug_reg
if (!hotplug_trigger)
	return;
Dave Gordon Oct. 23, 2015, 2:21 p.m. UTC | #8
On 23/10/15 10:14, Daniel Vetter wrote:
> On Fri, Oct 23, 2015 at 10:03:50AM +0100, Chris Wilson wrote:
>> On Fri, Oct 23, 2015 at 10:56:12AM +0200, Daniel Vetter wrote:
>>> We get tons of cases where the master interrupt handler apparently set
>>> a bit, with the SDEIIR agreeing. No idea what's going on there, but
>>> it's consistent on gen8+, no one seems to care about it and it's
>>> making CI results flaky.
>>
>> Just delete the message and delete them all. There isn't anything we can
>> do and if anybody actually cared (which apparently they didn't in the
>> first place), they could just trace the mmio.
>
> The non-SDE ones don't fire and I think are useful. And in case we have
> another report from users that gmbus is not reliably working with their
> touchpad (this is how we discovered the original pch irq issues) I think
> finding some breadcrumbs in dmesg would be useful. The SDE one happens
> rarely enough that I don't think it should be a performance issue, ever.
>
> Hence why I decided to keep them.
> -Daniel

I used to get the "master interrupt lied" message quite frequently. 
Since it was not clear whether it really meant that the master had an 
extra bit set or that one (any) of the detail registers was missing an 
interrupt bit, I tried servicing the interrupt anyway. But it didn't 
help with any of the issues I was seeing at the time, and the message no 
longer occurs (on SKL) with more recent kernels.

.Dave.
Daniel Vetter Oct. 23, 2015, 2:42 p.m. UTC | #9
On Fri, Oct 23, 2015 at 3:33 PM, Ville Syrjälä
<ville.syrjala@linux.intel.com> wrote:
> On Fri, Oct 23, 2015 at 11:23:12AM +0200, Daniel Vetter wrote:
>> On Fri, Oct 23, 2015 at 12:21:37PM +0300, Jani Nikula wrote:
>> > On Fri, 23 Oct 2015, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>> > > On Fri, Oct 23, 2015 at 10:56:12AM +0200, Daniel Vetter wrote:
>> > >> We get tons of cases where the master interrupt handler apparently set
>> > >> a bit, with the SDEIIR agreeing. No idea what's going on there, but
>> > >> it's consistent on gen8+, no one seems to care about it and it's
>> > >> making CI results flaky.
>> > >
>> > > Just delete the message and delete them all. There isn't anything we can
>> > > do and if anybody actually cared (which apparently they didn't in the
>> > > first place), they could just trace the mmio.
>> >
>> > Except this one is a regression, introduced by a bisected commit, and
>> > suspiciously the errors pop up during aux transfers.
>>
>> dp aux is a red herring very likely, since it's just the source of a _lot_
>> of sde interrupts.
>>
>> > https://bugs.freedesktop.org/show_bug.cgi?id=92084
>>
>> No one demonstrated any bad side-effects of this, let's shut it up (but
>> keep the breadcrumb in debug logs in case) and move on to other bugs. We
>> have enough.
>
> I was still waiting for an answer to my latest idea how to avoid the
> error. Would have been a very simple thing to test for anyone with the
> hardware.

We can still do that ofc. But this was a regression, trivial to fix
(we didn't have the message before at all), creating noise in our CI
and 16 months old. Our rule is that an interim solution/revert should
be applied latest within 1 week, this was more than overdue.

Yep, QA is back and I'll make sure everyone knows ;-)
-Daniel
Ville Syrjälä Oct. 23, 2015, 3:22 p.m. UTC | #10
On Fri, Oct 23, 2015 at 04:42:19PM +0200, Daniel Vetter wrote:
> On Fri, Oct 23, 2015 at 3:33 PM, Ville Syrjälä
> <ville.syrjala@linux.intel.com> wrote:
> > On Fri, Oct 23, 2015 at 11:23:12AM +0200, Daniel Vetter wrote:
> >> On Fri, Oct 23, 2015 at 12:21:37PM +0300, Jani Nikula wrote:
> >> > On Fri, 23 Oct 2015, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> >> > > On Fri, Oct 23, 2015 at 10:56:12AM +0200, Daniel Vetter wrote:
> >> > >> We get tons of cases where the master interrupt handler apparently set
> >> > >> a bit, with the SDEIIR agreeing. No idea what's going on there, but
> >> > >> it's consistent on gen8+, no one seems to care about it and it's
> >> > >> making CI results flaky.
> >> > >
> >> > > Just delete the message and delete them all. There isn't anything we can
> >> > > do and if anybody actually cared (which apparently they didn't in the
> >> > > first place), they could just trace the mmio.
> >> >
> >> > Except this one is a regression, introduced by a bisected commit, and
> >> > suspiciously the errors pop up during aux transfers.
> >>
> >> dp aux is a red herring very likely, since it's just the source of a _lot_
> >> of sde interrupts.
> >>
> >> > https://bugs.freedesktop.org/show_bug.cgi?id=92084
> >>
> >> No one demonstrated any bad side-effects of this, let's shut it up (but
> >> keep the breadcrumb in debug logs in case) and move on to other bugs. We
> >> have enough.
> >
> > I was still waiting for an answer to my latest idea how to avoid the
> > error. Would have been a very simple thing to test for anyone with the
> > hardware.
> 
> We can still do that ofc. But this was a regression, trivial to fix
> (we didn't have the message before at all),

The regressing patch change didn't add the message, so there was a clear
change in behaviour, and now it's papered over.

> creating noise in our CI
> and 16 months old. Our rule is that an interim solution/revert should
> be applied latest within 1 week, this was more than overdue.
> 
> Yep, QA is back and I'll make sure everyone knows ;-)
> -Daniel
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
Daniel Vetter Oct. 27, 2015, 2:31 a.m. UTC | #11
On Fri, Oct 23, 2015 at 8:22 AM, Ville Syrjälä
<ville.syrjala@linux.intel.com> wrote:
> The regressing patch change didn't add the message, so there was a clear
> change in behaviour, and now it's papered over.

It did move around the DRM_ERROR for all the others and also added the
SDE one for consistency. At least that's how I read that patch - I
could't find the SDE DRM_ERROR in the old code. Did I miss something?
-Daniel
Jani Nikula Oct. 27, 2015, 8:10 a.m. UTC | #12
On Tue, 27 Oct 2015, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Fri, Oct 23, 2015 at 8:22 AM, Ville Syrjälä
> <ville.syrjala@linux.intel.com> wrote:
>> The regressing patch change didn't add the message, so there was a clear
>> change in behaviour, and now it's papered over.
>
> It did move around the DRM_ERROR for all the others and also added the
> SDE one for consistency. At least that's how I read that patch - I
> could't find the SDE DRM_ERROR in the old code. Did I miss something?

Yes. We tried and failed to point out that this is a bisected regression
with a bug report. The bad commit is *NOT* when the error message was
added or moved. The first bad commit is

commit aaf5ec2e51ab1d9c5e962b4728a1107ed3ff7a3e
Author: Sonika Jindal <sonika.jindal@intel.com>
Date:   Wed Jul 8 17:07:47 2015 +0530

    drm/i915: Handle HPD when it has actually occurred

which triggers printing of the error message. This is all mentioned in
the bug, along with a few attempts at remedying the situation.


BR,
Jani.
Daniel Vetter Oct. 28, 2015, 12:22 a.m. UTC | #13
On Tue, Oct 27, 2015 at 1:10 AM, Jani Nikula
<jani.nikula@linux.intel.com> wrote:
> On Tue, 27 Oct 2015, Daniel Vetter <daniel@ffwll.ch> wrote:
>> On Fri, Oct 23, 2015 at 8:22 AM, Ville Syrjälä
>> <ville.syrjala@linux.intel.com> wrote:
>>> The regressing patch change didn't add the message, so there was a clear
>>> change in behaviour, and now it's papered over.
>>
>> It did move around the DRM_ERROR for all the others and also added the
>> SDE one for consistency. At least that's how I read that patch - I
>> could't find the SDE DRM_ERROR in the old code. Did I miss something?
>
> Yes. We tried and failed to point out that this is a bisected regression
> with a bug report. The bad commit is *NOT* when the error message was
> added or moved. The first bad commit is
>
> commit aaf5ec2e51ab1d9c5e962b4728a1107ed3ff7a3e
> Author: Sonika Jindal <sonika.jindal@intel.com>
> Date:   Wed Jul 8 17:07:47 2015 +0530
>
>     drm/i915: Handle HPD when it has actually occurred
>
> which triggers printing of the error message. This is all mentioned in
> the bug, along with a few attempts at remedying the situation.

Oops, I pasted the wrong commit into the commit message :( Sorry for
all the confusion and me not noticing the real bisect result.
-Daniel
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 9caf22caed89..a614e89fc934 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2353,9 +2353,13 @@  static irqreturn_t gen8_irq_handler(int irq, void *arg)
 				spt_irq_handler(dev, pch_iir);
 			else
 				cpt_irq_handler(dev, pch_iir);
-		} else
-			DRM_ERROR("The master control interrupt lied (SDE)!\n");
-
+		} else {
+			/*
+			 * Like on previous PCH there seems to be something
+			 * fishy going on with forwarding PCH interrupts.
+			 */
+			DRM_DEBUG_DRIVER("The master control interrupt lied (SDE)!\n");
+		}
 	}
 
 	I915_WRITE_FW(GEN8_MASTER_IRQ, GEN8_MASTER_IRQ_CONTROL);