Message ID | 1445590572-23631-2-git-send-email-daniel.vetter@ffwll.ch (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, Oct 23, 2015 at 10:56:12AM +0200, Daniel Vetter wrote: > We get tons of cases where the master interrupt handler apparently set > a bit, with the SDEIIR agreeing. No idea what's going on there, but > it's consistent on gen8+, no one seems to care about it and it's > making CI results flaky. Just delete the message and delete them all. There isn't anything we can do and if anybody actually cared (which apparently they didn't in the first place), they could just trace the mmio. -Chris
On Fri, Oct 23, 2015 at 10:03:50AM +0100, Chris Wilson wrote: > On Fri, Oct 23, 2015 at 10:56:12AM +0200, Daniel Vetter wrote: > > We get tons of cases where the master interrupt handler apparently set > > a bit, with the SDEIIR agreeing. No idea what's going on there, but > > it's consistent on gen8+, no one seems to care about it and it's > > making CI results flaky. > > Just delete the message and delete them all. There isn't anything we can > do and if anybody actually cared (which apparently they didn't in the > first place), they could just trace the mmio. The non-SDE ones don't fire and I think are useful. And in case we have another report from users that gmbus is not reliably working with their touchpad (this is how we discovered the original pch irq issues) I think finding some breadcrumbs in dmesg would be useful. The SDE one happens rarely enough that I don't think it should be a performance issue, ever. Hence why I decided to keep them. -Daniel
On Fri, 23 Oct 2015, Chris Wilson <chris@chris-wilson.co.uk> wrote: > On Fri, Oct 23, 2015 at 10:56:12AM +0200, Daniel Vetter wrote: >> We get tons of cases where the master interrupt handler apparently set >> a bit, with the SDEIIR agreeing. No idea what's going on there, but >> it's consistent on gen8+, no one seems to care about it and it's >> making CI results flaky. > > Just delete the message and delete them all. There isn't anything we can > do and if anybody actually cared (which apparently they didn't in the > first place), they could just trace the mmio. Except this one is a regression, introduced by a bisected commit, and suspiciously the errors pop up during aux transfers. https://bugs.freedesktop.org/show_bug.cgi?id=92084 BR, Jani.
On Fri, Oct 23, 2015 at 12:21:37PM +0300, Jani Nikula wrote: > On Fri, 23 Oct 2015, Chris Wilson <chris@chris-wilson.co.uk> wrote: > > On Fri, Oct 23, 2015 at 10:56:12AM +0200, Daniel Vetter wrote: > >> We get tons of cases where the master interrupt handler apparently set > >> a bit, with the SDEIIR agreeing. No idea what's going on there, but > >> it's consistent on gen8+, no one seems to care about it and it's > >> making CI results flaky. > > > > Just delete the message and delete them all. There isn't anything we can > > do and if anybody actually cared (which apparently they didn't in the > > first place), they could just trace the mmio. > > Except this one is a regression, introduced by a bisected commit, and > suspiciously the errors pop up during aux transfers. dp aux is a red herring very likely, since it's just the source of a _lot_ of sde interrupts. > https://bugs.freedesktop.org/show_bug.cgi?id=92084 No one demonstrated any bad side-effects of this, let's shut it up (but keep the breadcrumb in debug logs in case) and move on to other bugs. We have enough. -Daniel
On Fri, Oct 23, 2015 at 11:23:12AM +0200, Daniel Vetter wrote: > On Fri, Oct 23, 2015 at 12:21:37PM +0300, Jani Nikula wrote: > > On Fri, 23 Oct 2015, Chris Wilson <chris@chris-wilson.co.uk> wrote: > > > On Fri, Oct 23, 2015 at 10:56:12AM +0200, Daniel Vetter wrote: > > >> We get tons of cases where the master interrupt handler apparently set > > >> a bit, with the SDEIIR agreeing. No idea what's going on there, but > > >> it's consistent on gen8+, no one seems to care about it and it's > > >> making CI results flaky. > > > > > > Just delete the message and delete them all. There isn't anything we can > > > do and if anybody actually cared (which apparently they didn't in the > > > first place), they could just trace the mmio. > > > > Except this one is a regression, introduced by a bisected commit, and > > suspiciously the errors pop up during aux transfers. > > dp aux is a red herring very likely, since it's just the source of a _lot_ > of sde interrupts. > > > https://bugs.freedesktop.org/show_bug.cgi?id=92084 > > No one demonstrated any bad side-effects of this, let's shut it up (but > keep the breadcrumb in debug logs in case) and move on to other bugs. We > have enough. I was still waiting for an answer to my latest idea how to avoid the error. Would have been a very simple thing to test for anyone with the hardware.
On Fri, Oct 23, 2015 at 04:33:52PM +0300, Ville Syrjälä wrote: > On Fri, Oct 23, 2015 at 11:23:12AM +0200, Daniel Vetter wrote: > > On Fri, Oct 23, 2015 at 12:21:37PM +0300, Jani Nikula wrote: > > > On Fri, 23 Oct 2015, Chris Wilson <chris@chris-wilson.co.uk> wrote: > > > > On Fri, Oct 23, 2015 at 10:56:12AM +0200, Daniel Vetter wrote: > > > >> We get tons of cases where the master interrupt handler apparently set > > > >> a bit, with the SDEIIR agreeing. No idea what's going on there, but > > > >> it's consistent on gen8+, no one seems to care about it and it's > > > >> making CI results flaky. > > > > > > > > Just delete the message and delete them all. There isn't anything we can > > > > do and if anybody actually cared (which apparently they didn't in the > > > > first place), they could just trace the mmio. > > > > > > Except this one is a regression, introduced by a bisected commit, and > > > suspiciously the errors pop up during aux transfers. > > > > dp aux is a red herring very likely, since it's just the source of a _lot_ > > of sde interrupts. > > > > > https://bugs.freedesktop.org/show_bug.cgi?id=92084 > > > > No one demonstrated any bad side-effects of this, let's shut it up (but > > keep the breadcrumb in debug logs in case) and move on to other bugs. We > > have enough. > > I was still waiting for an answer to my latest idea how to avoid the > error. Would have been a very simple thing to test for anyone with the > hardware. Presumably, --- a/drivers/gpu/drm/i915/i915_irq.c +++ b/drivers/gpu/drm/i915/i915_irq.c @@ -2345,6 +2345,7 @@ static irqreturn_t gen8_irq_handler(int irq, void *arg) u32 pch_iir = I915_READ(SDEIIR); if (pch_iir) { I915_WRITE(SDEIIR, pch_iir); + POSTING_READ(SDEIIR); ret = IRQ_HANDLED; ? -Chris
On Fri, Oct 23, 2015 at 02:40:31PM +0100, Chris Wilson wrote: > On Fri, Oct 23, 2015 at 04:33:52PM +0300, Ville Syrjälä wrote: > > On Fri, Oct 23, 2015 at 11:23:12AM +0200, Daniel Vetter wrote: > > > On Fri, Oct 23, 2015 at 12:21:37PM +0300, Jani Nikula wrote: > > > > On Fri, 23 Oct 2015, Chris Wilson <chris@chris-wilson.co.uk> wrote: > > > > > On Fri, Oct 23, 2015 at 10:56:12AM +0200, Daniel Vetter wrote: > > > > >> We get tons of cases where the master interrupt handler apparently set > > > > >> a bit, with the SDEIIR agreeing. No idea what's going on there, but > > > > >> it's consistent on gen8+, no one seems to care about it and it's > > > > >> making CI results flaky. > > > > > > > > > > Just delete the message and delete them all. There isn't anything we can > > > > > do and if anybody actually cared (which apparently they didn't in the > > > > > first place), they could just trace the mmio. > > > > > > > > Except this one is a regression, introduced by a bisected commit, and > > > > suspiciously the errors pop up during aux transfers. > > > > > > dp aux is a red herring very likely, since it's just the source of a _lot_ > > > of sde interrupts. > > > > > > > https://bugs.freedesktop.org/show_bug.cgi?id=92084 > > > > > > No one demonstrated any bad side-effects of this, let's shut it up (but > > > keep the breadcrumb in debug logs in case) and move on to other bugs. We > > > have enough. > > > > I was still waiting for an answer to my latest idea how to avoid the > > error. Would have been a very simple thing to test for anyone with the > > hardware. > > Presumably, > > --- a/drivers/gpu/drm/i915/i915_irq.c > +++ b/drivers/gpu/drm/i915/i915_irq.c > @@ -2345,6 +2345,7 @@ static irqreturn_t gen8_irq_handler(int irq, void *arg) > u32 pch_iir = I915_READ(SDEIIR); > if (pch_iir) { > I915_WRITE(SDEIIR, pch_iir); > + POSTING_READ(SDEIIR); > ret = IRQ_HANDLED; > > ? No, we already tried that and it didn't help. What I was thinking is we'd just do a nop write to the PCH_PORT_HOTPLUG register like so: dig_hotplug_reg = I915_READ(PCH_PORT_HOTPLUG); if (!hotplug_trigger) dig_hotplug_reg &= ~(*_HOTPLUG_STATUS_MASK); I915_WRITE(PCH_PORT_HOTPLUG, dig_hotplug_reg if (!hotplug_trigger) return;
On 23/10/15 10:14, Daniel Vetter wrote: > On Fri, Oct 23, 2015 at 10:03:50AM +0100, Chris Wilson wrote: >> On Fri, Oct 23, 2015 at 10:56:12AM +0200, Daniel Vetter wrote: >>> We get tons of cases where the master interrupt handler apparently set >>> a bit, with the SDEIIR agreeing. No idea what's going on there, but >>> it's consistent on gen8+, no one seems to care about it and it's >>> making CI results flaky. >> >> Just delete the message and delete them all. There isn't anything we can >> do and if anybody actually cared (which apparently they didn't in the >> first place), they could just trace the mmio. > > The non-SDE ones don't fire and I think are useful. And in case we have > another report from users that gmbus is not reliably working with their > touchpad (this is how we discovered the original pch irq issues) I think > finding some breadcrumbs in dmesg would be useful. The SDE one happens > rarely enough that I don't think it should be a performance issue, ever. > > Hence why I decided to keep them. > -Daniel I used to get the "master interrupt lied" message quite frequently. Since it was not clear whether it really meant that the master had an extra bit set or that one (any) of the detail registers was missing an interrupt bit, I tried servicing the interrupt anyway. But it didn't help with any of the issues I was seeing at the time, and the message no longer occurs (on SKL) with more recent kernels. .Dave.
On Fri, Oct 23, 2015 at 3:33 PM, Ville Syrjälä <ville.syrjala@linux.intel.com> wrote: > On Fri, Oct 23, 2015 at 11:23:12AM +0200, Daniel Vetter wrote: >> On Fri, Oct 23, 2015 at 12:21:37PM +0300, Jani Nikula wrote: >> > On Fri, 23 Oct 2015, Chris Wilson <chris@chris-wilson.co.uk> wrote: >> > > On Fri, Oct 23, 2015 at 10:56:12AM +0200, Daniel Vetter wrote: >> > >> We get tons of cases where the master interrupt handler apparently set >> > >> a bit, with the SDEIIR agreeing. No idea what's going on there, but >> > >> it's consistent on gen8+, no one seems to care about it and it's >> > >> making CI results flaky. >> > > >> > > Just delete the message and delete them all. There isn't anything we can >> > > do and if anybody actually cared (which apparently they didn't in the >> > > first place), they could just trace the mmio. >> > >> > Except this one is a regression, introduced by a bisected commit, and >> > suspiciously the errors pop up during aux transfers. >> >> dp aux is a red herring very likely, since it's just the source of a _lot_ >> of sde interrupts. >> >> > https://bugs.freedesktop.org/show_bug.cgi?id=92084 >> >> No one demonstrated any bad side-effects of this, let's shut it up (but >> keep the breadcrumb in debug logs in case) and move on to other bugs. We >> have enough. > > I was still waiting for an answer to my latest idea how to avoid the > error. Would have been a very simple thing to test for anyone with the > hardware. We can still do that ofc. But this was a regression, trivial to fix (we didn't have the message before at all), creating noise in our CI and 16 months old. Our rule is that an interim solution/revert should be applied latest within 1 week, this was more than overdue. Yep, QA is back and I'll make sure everyone knows ;-) -Daniel
On Fri, Oct 23, 2015 at 04:42:19PM +0200, Daniel Vetter wrote: > On Fri, Oct 23, 2015 at 3:33 PM, Ville Syrjälä > <ville.syrjala@linux.intel.com> wrote: > > On Fri, Oct 23, 2015 at 11:23:12AM +0200, Daniel Vetter wrote: > >> On Fri, Oct 23, 2015 at 12:21:37PM +0300, Jani Nikula wrote: > >> > On Fri, 23 Oct 2015, Chris Wilson <chris@chris-wilson.co.uk> wrote: > >> > > On Fri, Oct 23, 2015 at 10:56:12AM +0200, Daniel Vetter wrote: > >> > >> We get tons of cases where the master interrupt handler apparently set > >> > >> a bit, with the SDEIIR agreeing. No idea what's going on there, but > >> > >> it's consistent on gen8+, no one seems to care about it and it's > >> > >> making CI results flaky. > >> > > > >> > > Just delete the message and delete them all. There isn't anything we can > >> > > do and if anybody actually cared (which apparently they didn't in the > >> > > first place), they could just trace the mmio. > >> > > >> > Except this one is a regression, introduced by a bisected commit, and > >> > suspiciously the errors pop up during aux transfers. > >> > >> dp aux is a red herring very likely, since it's just the source of a _lot_ > >> of sde interrupts. > >> > >> > https://bugs.freedesktop.org/show_bug.cgi?id=92084 > >> > >> No one demonstrated any bad side-effects of this, let's shut it up (but > >> keep the breadcrumb in debug logs in case) and move on to other bugs. We > >> have enough. > > > > I was still waiting for an answer to my latest idea how to avoid the > > error. Would have been a very simple thing to test for anyone with the > > hardware. > > We can still do that ofc. But this was a regression, trivial to fix > (we didn't have the message before at all), The regressing patch change didn't add the message, so there was a clear change in behaviour, and now it's papered over. > creating noise in our CI > and 16 months old. Our rule is that an interim solution/revert should > be applied latest within 1 week, this was more than overdue. > > Yep, QA is back and I'll make sure everyone knows ;-) > -Daniel > -- > Daniel Vetter > Software Engineer, Intel Corporation > +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Fri, Oct 23, 2015 at 8:22 AM, Ville Syrjälä <ville.syrjala@linux.intel.com> wrote: > The regressing patch change didn't add the message, so there was a clear > change in behaviour, and now it's papered over. It did move around the DRM_ERROR for all the others and also added the SDE one for consistency. At least that's how I read that patch - I could't find the SDE DRM_ERROR in the old code. Did I miss something? -Daniel
On Tue, 27 Oct 2015, Daniel Vetter <daniel@ffwll.ch> wrote: > On Fri, Oct 23, 2015 at 8:22 AM, Ville Syrjälä > <ville.syrjala@linux.intel.com> wrote: >> The regressing patch change didn't add the message, so there was a clear >> change in behaviour, and now it's papered over. > > It did move around the DRM_ERROR for all the others and also added the > SDE one for consistency. At least that's how I read that patch - I > could't find the SDE DRM_ERROR in the old code. Did I miss something? Yes. We tried and failed to point out that this is a bisected regression with a bug report. The bad commit is *NOT* when the error message was added or moved. The first bad commit is commit aaf5ec2e51ab1d9c5e962b4728a1107ed3ff7a3e Author: Sonika Jindal <sonika.jindal@intel.com> Date: Wed Jul 8 17:07:47 2015 +0530 drm/i915: Handle HPD when it has actually occurred which triggers printing of the error message. This is all mentioned in the bug, along with a few attempts at remedying the situation. BR, Jani.
On Tue, Oct 27, 2015 at 1:10 AM, Jani Nikula <jani.nikula@linux.intel.com> wrote: > On Tue, 27 Oct 2015, Daniel Vetter <daniel@ffwll.ch> wrote: >> On Fri, Oct 23, 2015 at 8:22 AM, Ville Syrjälä >> <ville.syrjala@linux.intel.com> wrote: >>> The regressing patch change didn't add the message, so there was a clear >>> change in behaviour, and now it's papered over. >> >> It did move around the DRM_ERROR for all the others and also added the >> SDE one for consistency. At least that's how I read that patch - I >> could't find the SDE DRM_ERROR in the old code. Did I miss something? > > Yes. We tried and failed to point out that this is a bisected regression > with a bug report. The bad commit is *NOT* when the error message was > added or moved. The first bad commit is > > commit aaf5ec2e51ab1d9c5e962b4728a1107ed3ff7a3e > Author: Sonika Jindal <sonika.jindal@intel.com> > Date: Wed Jul 8 17:07:47 2015 +0530 > > drm/i915: Handle HPD when it has actually occurred > > which triggers printing of the error message. This is all mentioned in > the bug, along with a few attempts at remedying the situation. Oops, I pasted the wrong commit into the commit message :( Sorry for all the confusion and me not noticing the real bisect result. -Daniel
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c index 9caf22caed89..a614e89fc934 100644 --- a/drivers/gpu/drm/i915/i915_irq.c +++ b/drivers/gpu/drm/i915/i915_irq.c @@ -2353,9 +2353,13 @@ static irqreturn_t gen8_irq_handler(int irq, void *arg) spt_irq_handler(dev, pch_iir); else cpt_irq_handler(dev, pch_iir); - } else - DRM_ERROR("The master control interrupt lied (SDE)!\n"); - + } else { + /* + * Like on previous PCH there seems to be something + * fishy going on with forwarding PCH interrupts. + */ + DRM_DEBUG_DRIVER("The master control interrupt lied (SDE)!\n"); + } } I915_WRITE_FW(GEN8_MASTER_IRQ, GEN8_MASTER_IRQ_CONTROL);
We get tons of cases where the master interrupt handler apparently set a bit, with the SDEIIR agreeing. No idea what's going on there, but it's consistent on gen8+, no one seems to care about it and it's making CI results flaky. Shut it up. No idea what's going on here, but we've had fun with PCH interrupts before: commit 44498aea293b37af1d463acd9658cdce1ecdf427 Author: Paulo Zanoni <paulo.r.zanoni@intel.com> Date: Fri Feb 22 17:05:28 2013 -0300 drm/i915: also disable south interrupts when handling them Cc: Paulo Zanoni <paulo.r.zanoni@intel.com> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> --- drivers/gpu/drm/i915/i915_irq.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-)