diff mbox

drm/i915: Reinstate error level message for non-simulated gpu hangs

Message ID 1412118259-4860-1-git-send-email-daniel.vetter@ffwll.ch (mailing list archive)
State New, archived
Headers show

Commit Message

Daniel Vetter Sept. 30, 2014, 11:04 p.m. UTC
This seems to have been accidentally lost in

commit be62acb4cce1389a28296852737e3917d9cc5b25
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date:   Fri Aug 30 16:19:28 2013 +0300

    drm/i915: ban badly behaving contexts

Without this real gpu hangs only log output at info level, which gets
filtered away by piglit's testrunner.

Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Kenneth Graunke <kenneth@whitecape.org>
Reported-by: Kenneth Graunke <kenneth@whitecape.org>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>

--

Note, totally untested since it's a bit late here ;-)
-Daniel
---
 drivers/gpu/drm/i915/i915_drv.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Chris Wilson Oct. 1, 2014, 6:28 a.m. UTC | #1
On Wed, Oct 01, 2014 at 01:04:19AM +0200, Daniel Vetter wrote:
> This seems to have been accidentally lost in
> 
> commit be62acb4cce1389a28296852737e3917d9cc5b25
> Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Date:   Fri Aug 30 16:19:28 2013 +0300
> 
>     drm/i915: ban badly behaving contexts
> 
> Without this real gpu hangs only log output at info level, which gets
> filtered away by piglit's testrunner.

A successful GPU hang is not an error. Might be a warn or a notice, but
it certainly isn't a driver error.
-Chris
Daniel Vetter Oct. 1, 2014, 8:13 a.m. UTC | #2
On Wed, Oct 01, 2014 at 07:28:39AM +0100, Chris Wilson wrote:
> On Wed, Oct 01, 2014 at 01:04:19AM +0200, Daniel Vetter wrote:
> > This seems to have been accidentally lost in
> > 
> > commit be62acb4cce1389a28296852737e3917d9cc5b25
> > Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > Date:   Fri Aug 30 16:19:28 2013 +0300
> > 
> >     drm/i915: ban badly behaving contexts
> > 
> > Without this real gpu hangs only log output at info level, which gets
> > filtered away by piglit's testrunner.
> 
> A successful GPU hang is not an error. Might be a warn or a notice, but
> it certainly isn't a driver error.

Well not of the kernel driver, but might very well be a bug in the
userspace driver. With this piglit marks tests that hung the gpu as
dmesg-fail, without this they might even pass. Ken raised this on irc and
I agree that it's a must-have feature for developers that their testsuite
can tell them when stuff broke. Provding this some other way is a lot more
work and imo should be done in a separate patch, this here is just the
minimal fix for this regression.
-Daniel
Chris Wilson Oct. 1, 2014, 8:19 a.m. UTC | #3
On Wed, Oct 01, 2014 at 10:13:00AM +0200, Daniel Vetter wrote:
> On Wed, Oct 01, 2014 at 07:28:39AM +0100, Chris Wilson wrote:
> > On Wed, Oct 01, 2014 at 01:04:19AM +0200, Daniel Vetter wrote:
> > > This seems to have been accidentally lost in
> > > 
> > > commit be62acb4cce1389a28296852737e3917d9cc5b25
> > > Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > > Date:   Fri Aug 30 16:19:28 2013 +0300
> > > 
> > >     drm/i915: ban badly behaving contexts
> > > 
> > > Without this real gpu hangs only log output at info level, which gets
> > > filtered away by piglit's testrunner.
> > 
> > A successful GPU hang is not an error. Might be a warn or a notice, but
> > it certainly isn't a driver error.
> 
> Well not of the kernel driver, but might very well be a bug in the
> userspace driver. With this piglit marks tests that hung the gpu as
> dmesg-fail, without this they might even pass. Ken raised this on irc and
> I agree that it's a must-have feature for developers that their testsuite
> can tell them when stuff broke. Provding this some other way is a lot more
> work and imo should be done in a separate patch, this here is just the
> minimal fix for this regression.

I strongly disagree that we should be working around self-imposed
limitations of the test suite by making users believe their kernel is
broken.
-Chris
Daniel Vetter Oct. 1, 2014, 8:29 a.m. UTC | #4
On Wed, Oct 01, 2014 at 09:19:50AM +0100, Chris Wilson wrote:
> On Wed, Oct 01, 2014 at 10:13:00AM +0200, Daniel Vetter wrote:
> > On Wed, Oct 01, 2014 at 07:28:39AM +0100, Chris Wilson wrote:
> > > On Wed, Oct 01, 2014 at 01:04:19AM +0200, Daniel Vetter wrote:
> > > > This seems to have been accidentally lost in
> > > > 
> > > > commit be62acb4cce1389a28296852737e3917d9cc5b25
> > > > Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > > > Date:   Fri Aug 30 16:19:28 2013 +0300
> > > > 
> > > >     drm/i915: ban badly behaving contexts
> > > > 
> > > > Without this real gpu hangs only log output at info level, which gets
> > > > filtered away by piglit's testrunner.
> > > 
> > > A successful GPU hang is not an error. Might be a warn or a notice, but
> > > it certainly isn't a driver error.
> > 
> > Well not of the kernel driver, but might very well be a bug in the
> > userspace driver. With this piglit marks tests that hung the gpu as
> > dmesg-fail, without this they might even pass. Ken raised this on irc and
> > I agree that it's a must-have feature for developers that their testsuite
> > can tell them when stuff broke. Provding this some other way is a lot more
> > work and imo should be done in a separate patch, this here is just the
> > minimal fix for this regression.
> 
> I strongly disagree that we should be working around self-imposed
> limitations of the test suite by making users believe their kernel is
> broken.

So what else should piglit do then?
-Daniel
Kenneth Graunke Oct. 1, 2014, 8:52 a.m. UTC | #5
On Wednesday, October 01, 2014 10:29:07 AM Daniel Vetter wrote:
> On Wed, Oct 01, 2014 at 09:19:50AM +0100, Chris Wilson wrote:
> > On Wed, Oct 01, 2014 at 10:13:00AM +0200, Daniel Vetter wrote:
> > > On Wed, Oct 01, 2014 at 07:28:39AM +0100, Chris Wilson wrote:
> > > > On Wed, Oct 01, 2014 at 01:04:19AM +0200, Daniel Vetter wrote:
> > > > > This seems to have been accidentally lost in
> > > > > 
> > > > > commit be62acb4cce1389a28296852737e3917d9cc5b25
> > > > > Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > > > > Date:   Fri Aug 30 16:19:28 2013 +0300
> > > > > 
> > > > >     drm/i915: ban badly behaving contexts
> > > > > 
> > > > > Without this real gpu hangs only log output at info level, which gets
> > > > > filtered away by piglit's testrunner.
> > > > 
> > > > A successful GPU hang is not an error. Might be a warn or a notice, but
> > > > it certainly isn't a driver error.
> > > 
> > > Well not of the kernel driver, but might very well be a bug in the
> > > userspace driver. With this piglit marks tests that hung the gpu as
> > > dmesg-fail, without this they might even pass. Ken raised this on irc and
> > > I agree that it's a must-have feature for developers that their testsuite
> > > can tell them when stuff broke. Provding this some other way is a lot more
> > > work and imo should be done in a separate patch, this here is just the
> > > minimal fix for this regression.
> > 
> > I strongly disagree that we should be working around self-imposed
> > limitations of the test suite by making users believe their kernel is
> > broken.
> 
> So what else should piglit do then?
> -Daniel

Your GPU hanging is clearly more severe than "info" - it may impact your system stability, and likely represents a bug somewhere in the graphics drivers (whether kernel or userspace).  I think we all agree on that.

Piglit runs "dmesg --level emerg,alert,crit,err,warn,notice", which covers everything except "info" and "debug".  So anything other than info/debug would be just fine.

--Ken
Chris Wilson Oct. 1, 2014, 9:10 a.m. UTC | #6
On Wed, Oct 01, 2014 at 01:52:20AM -0700, Kenneth Graunke wrote:
> On Wednesday, October 01, 2014 10:29:07 AM Daniel Vetter wrote:
> > On Wed, Oct 01, 2014 at 09:19:50AM +0100, Chris Wilson wrote:
> > > On Wed, Oct 01, 2014 at 10:13:00AM +0200, Daniel Vetter wrote:
> > > > On Wed, Oct 01, 2014 at 07:28:39AM +0100, Chris Wilson wrote:
> > > > > On Wed, Oct 01, 2014 at 01:04:19AM +0200, Daniel Vetter wrote:
> > > > > > This seems to have been accidentally lost in
> > > > > > 
> > > > > > commit be62acb4cce1389a28296852737e3917d9cc5b25
> > > > > > Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > > > > > Date:   Fri Aug 30 16:19:28 2013 +0300
> > > > > > 
> > > > > >     drm/i915: ban badly behaving contexts
> > > > > > 
> > > > > > Without this real gpu hangs only log output at info level, which gets
> > > > > > filtered away by piglit's testrunner.
> > > > > 
> > > > > A successful GPU hang is not an error. Might be a warn or a notice, but
> > > > > it certainly isn't a driver error.
> > > > 
> > > > Well not of the kernel driver, but might very well be a bug in the
> > > > userspace driver. With this piglit marks tests that hung the gpu as
> > > > dmesg-fail, without this they might even pass. Ken raised this on irc and
> > > > I agree that it's a must-have feature for developers that their testsuite
> > > > can tell them when stuff broke. Provding this some other way is a lot more
> > > > work and imo should be done in a separate patch, this here is just the
> > > > minimal fix for this regression.
> > > 
> > > I strongly disagree that we should be working around self-imposed
> > > limitations of the test suite by making users believe their kernel is
> > > broken.
> > 
> > So what else should piglit do then?
> > -Daniel
> 
> Your GPU hanging is clearly more severe than "info" - it may impact your system stability, and likely represents a bug somewhere in the graphics drivers (whether kernel or userspace).  I think we all agree on that.
> 
> Piglit runs "dmesg --level emerg,alert,crit,err,warn,notice", which covers everything except "info" and "debug".  So anything other than info/debug would be just fine.

If we are happy with KERN_NOTICE (normal, but significant condition), that
is what I would suggest. Actually, we should make the GPU hang detection
itself the notice (to aide regular users). But that probably runs into
complications with the simulated hangs with igt causing a WARN test
failure - but again, I'd rather our interface with the user (and
userfacing bug reporting tools) be correct.
-Chris
Chris Wilson Oct. 1, 2014, 9:37 a.m. UTC | #7
On Wed, Oct 01, 2014 at 11:28:47AM +0200, Daniel Vetter wrote:
> This seems to have been accidentally lost in
> 
> commit be62acb4cce1389a28296852737e3917d9cc5b25
> Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Date:   Fri Aug 30 16:19:28 2013 +0300
> 
>     drm/i915: ban badly behaving contexts
> 
> Without this real gpu hangs only log output at info level, which gets
> filtered away by piglit's testrunner.
> 
> v2: Tune down to notice level. Note that we need to add drm/i915 so
> that at least the automatic igt dmesg filtering still picks it up.
> 
> v3: git add and lack of coffee don't mix well.
> 
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Kenneth Graunke <kenneth@whitecape.org>
> Reported-by: Kenneth Graunke <kenneth@whitecape.org>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>

Hmm, in my igt hang tests, I don't use dev_priv->gpu_error.stop_rings as
it is itself incompatible with the tests. This now causes those to fail
with a WARN.
-Chris
Daniel Vetter Oct. 1, 2014, 9:41 a.m. UTC | #8
On Wed, Oct 01, 2014 at 10:37:38AM +0100, Chris Wilson wrote:
> On Wed, Oct 01, 2014 at 11:28:47AM +0200, Daniel Vetter wrote:
> > This seems to have been accidentally lost in
> > 
> > commit be62acb4cce1389a28296852737e3917d9cc5b25
> > Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > Date:   Fri Aug 30 16:19:28 2013 +0300
> > 
> >     drm/i915: ban badly behaving contexts
> > 
> > Without this real gpu hangs only log output at info level, which gets
> > filtered away by piglit's testrunner.
> > 
> > v2: Tune down to notice level. Note that we need to add drm/i915 so
> > that at least the automatic igt dmesg filtering still picks it up.
> > 
> > v3: git add and lack of coffee don't mix well.
> > 
> > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Kenneth Graunke <kenneth@whitecape.org>
> > Reported-by: Kenneth Graunke <kenneth@whitecape.org>
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> 
> Hmm, in my igt hang tests, I don't use dev_priv->gpu_error.stop_rings as
> it is itself incompatible with the tests. This now causes those to fail
> with a WARN.

In Mika's reset stats test we've set stop_rings after execbuf so that we
could submit fancy stuff like endless loops with batch-chaining while
still shutting up the kernel's output for real hangs.

The nice benefit is that looking at stop_rings then gives you an easy way
to double-check from the test that it all worked out since it's getting
auto-cleared.

In any case we have a fairly great mess here :(
-Daniel
Chris Wilson Oct. 1, 2014, 9:50 a.m. UTC | #9
On Wed, Oct 01, 2014 at 11:41:31AM +0200, Daniel Vetter wrote:
> On Wed, Oct 01, 2014 at 10:37:38AM +0100, Chris Wilson wrote:
> > On Wed, Oct 01, 2014 at 11:28:47AM +0200, Daniel Vetter wrote:
> > > This seems to have been accidentally lost in
> > > 
> > > commit be62acb4cce1389a28296852737e3917d9cc5b25
> > > Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > > Date:   Fri Aug 30 16:19:28 2013 +0300
> > > 
> > >     drm/i915: ban badly behaving contexts
> > > 
> > > Without this real gpu hangs only log output at info level, which gets
> > > filtered away by piglit's testrunner.
> > > 
> > > v2: Tune down to notice level. Note that we need to add drm/i915 so
> > > that at least the automatic igt dmesg filtering still picks it up.
> > > 
> > > v3: git add and lack of coffee don't mix well.
> > > 
> > > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > Cc: Kenneth Graunke <kenneth@whitecape.org>
> > > Reported-by: Kenneth Graunke <kenneth@whitecape.org>
> > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > 
> > Hmm, in my igt hang tests, I don't use dev_priv->gpu_error.stop_rings as
> > it is itself incompatible with the tests. This now causes those to fail
> > with a WARN.
> 
> In Mika's reset stats test we've set stop_rings after execbuf so that we
> could submit fancy stuff like endless loops with batch-chaining while
> still shutting up the kernel's output for real hangs.

That doesn't work for me when I try hanging from one context and running
normal coherency tests in another...
 
> The nice benefit is that looking at stop_rings then gives you an easy way
> to double-check from the test that it all worked out since it's getting
> auto-cleared.

A single global value when using multiple concurrent rendering contexts
is not so nice. Plus the interface is a pita.
-Chris
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 6948877c881c..1870759a5ed8 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -842,6 +842,8 @@  int i915_reset(struct drm_device *dev)
 				 "error for simulated gpu hangs\n");
 			ret = 0;
 		}
+	} else {
+		DRM_ERROR("Reset chip after gpu hang\n");
 	}
 
 	if (ret) {