Message ID | 1412118259-4860-1-git-send-email-daniel.vetter@ffwll.ch (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Wed, Oct 01, 2014 at 01:04:19AM +0200, Daniel Vetter wrote: > This seems to have been accidentally lost in > > commit be62acb4cce1389a28296852737e3917d9cc5b25 > Author: Mika Kuoppala <mika.kuoppala@linux.intel.com> > Date: Fri Aug 30 16:19:28 2013 +0300 > > drm/i915: ban badly behaving contexts > > Without this real gpu hangs only log output at info level, which gets > filtered away by piglit's testrunner. A successful GPU hang is not an error. Might be a warn or a notice, but it certainly isn't a driver error. -Chris
On Wed, Oct 01, 2014 at 07:28:39AM +0100, Chris Wilson wrote: > On Wed, Oct 01, 2014 at 01:04:19AM +0200, Daniel Vetter wrote: > > This seems to have been accidentally lost in > > > > commit be62acb4cce1389a28296852737e3917d9cc5b25 > > Author: Mika Kuoppala <mika.kuoppala@linux.intel.com> > > Date: Fri Aug 30 16:19:28 2013 +0300 > > > > drm/i915: ban badly behaving contexts > > > > Without this real gpu hangs only log output at info level, which gets > > filtered away by piglit's testrunner. > > A successful GPU hang is not an error. Might be a warn or a notice, but > it certainly isn't a driver error. Well not of the kernel driver, but might very well be a bug in the userspace driver. With this piglit marks tests that hung the gpu as dmesg-fail, without this they might even pass. Ken raised this on irc and I agree that it's a must-have feature for developers that their testsuite can tell them when stuff broke. Provding this some other way is a lot more work and imo should be done in a separate patch, this here is just the minimal fix for this regression. -Daniel
On Wed, Oct 01, 2014 at 10:13:00AM +0200, Daniel Vetter wrote: > On Wed, Oct 01, 2014 at 07:28:39AM +0100, Chris Wilson wrote: > > On Wed, Oct 01, 2014 at 01:04:19AM +0200, Daniel Vetter wrote: > > > This seems to have been accidentally lost in > > > > > > commit be62acb4cce1389a28296852737e3917d9cc5b25 > > > Author: Mika Kuoppala <mika.kuoppala@linux.intel.com> > > > Date: Fri Aug 30 16:19:28 2013 +0300 > > > > > > drm/i915: ban badly behaving contexts > > > > > > Without this real gpu hangs only log output at info level, which gets > > > filtered away by piglit's testrunner. > > > > A successful GPU hang is not an error. Might be a warn or a notice, but > > it certainly isn't a driver error. > > Well not of the kernel driver, but might very well be a bug in the > userspace driver. With this piglit marks tests that hung the gpu as > dmesg-fail, without this they might even pass. Ken raised this on irc and > I agree that it's a must-have feature for developers that their testsuite > can tell them when stuff broke. Provding this some other way is a lot more > work and imo should be done in a separate patch, this here is just the > minimal fix for this regression. I strongly disagree that we should be working around self-imposed limitations of the test suite by making users believe their kernel is broken. -Chris
On Wed, Oct 01, 2014 at 09:19:50AM +0100, Chris Wilson wrote: > On Wed, Oct 01, 2014 at 10:13:00AM +0200, Daniel Vetter wrote: > > On Wed, Oct 01, 2014 at 07:28:39AM +0100, Chris Wilson wrote: > > > On Wed, Oct 01, 2014 at 01:04:19AM +0200, Daniel Vetter wrote: > > > > This seems to have been accidentally lost in > > > > > > > > commit be62acb4cce1389a28296852737e3917d9cc5b25 > > > > Author: Mika Kuoppala <mika.kuoppala@linux.intel.com> > > > > Date: Fri Aug 30 16:19:28 2013 +0300 > > > > > > > > drm/i915: ban badly behaving contexts > > > > > > > > Without this real gpu hangs only log output at info level, which gets > > > > filtered away by piglit's testrunner. > > > > > > A successful GPU hang is not an error. Might be a warn or a notice, but > > > it certainly isn't a driver error. > > > > Well not of the kernel driver, but might very well be a bug in the > > userspace driver. With this piglit marks tests that hung the gpu as > > dmesg-fail, without this they might even pass. Ken raised this on irc and > > I agree that it's a must-have feature for developers that their testsuite > > can tell them when stuff broke. Provding this some other way is a lot more > > work and imo should be done in a separate patch, this here is just the > > minimal fix for this regression. > > I strongly disagree that we should be working around self-imposed > limitations of the test suite by making users believe their kernel is > broken. So what else should piglit do then? -Daniel
On Wednesday, October 01, 2014 10:29:07 AM Daniel Vetter wrote: > On Wed, Oct 01, 2014 at 09:19:50AM +0100, Chris Wilson wrote: > > On Wed, Oct 01, 2014 at 10:13:00AM +0200, Daniel Vetter wrote: > > > On Wed, Oct 01, 2014 at 07:28:39AM +0100, Chris Wilson wrote: > > > > On Wed, Oct 01, 2014 at 01:04:19AM +0200, Daniel Vetter wrote: > > > > > This seems to have been accidentally lost in > > > > > > > > > > commit be62acb4cce1389a28296852737e3917d9cc5b25 > > > > > Author: Mika Kuoppala <mika.kuoppala@linux.intel.com> > > > > > Date: Fri Aug 30 16:19:28 2013 +0300 > > > > > > > > > > drm/i915: ban badly behaving contexts > > > > > > > > > > Without this real gpu hangs only log output at info level, which gets > > > > > filtered away by piglit's testrunner. > > > > > > > > A successful GPU hang is not an error. Might be a warn or a notice, but > > > > it certainly isn't a driver error. > > > > > > Well not of the kernel driver, but might very well be a bug in the > > > userspace driver. With this piglit marks tests that hung the gpu as > > > dmesg-fail, without this they might even pass. Ken raised this on irc and > > > I agree that it's a must-have feature for developers that their testsuite > > > can tell them when stuff broke. Provding this some other way is a lot more > > > work and imo should be done in a separate patch, this here is just the > > > minimal fix for this regression. > > > > I strongly disagree that we should be working around self-imposed > > limitations of the test suite by making users believe their kernel is > > broken. > > So what else should piglit do then? > -Daniel Your GPU hanging is clearly more severe than "info" - it may impact your system stability, and likely represents a bug somewhere in the graphics drivers (whether kernel or userspace). I think we all agree on that. Piglit runs "dmesg --level emerg,alert,crit,err,warn,notice", which covers everything except "info" and "debug". So anything other than info/debug would be just fine. --Ken
On Wed, Oct 01, 2014 at 01:52:20AM -0700, Kenneth Graunke wrote: > On Wednesday, October 01, 2014 10:29:07 AM Daniel Vetter wrote: > > On Wed, Oct 01, 2014 at 09:19:50AM +0100, Chris Wilson wrote: > > > On Wed, Oct 01, 2014 at 10:13:00AM +0200, Daniel Vetter wrote: > > > > On Wed, Oct 01, 2014 at 07:28:39AM +0100, Chris Wilson wrote: > > > > > On Wed, Oct 01, 2014 at 01:04:19AM +0200, Daniel Vetter wrote: > > > > > > This seems to have been accidentally lost in > > > > > > > > > > > > commit be62acb4cce1389a28296852737e3917d9cc5b25 > > > > > > Author: Mika Kuoppala <mika.kuoppala@linux.intel.com> > > > > > > Date: Fri Aug 30 16:19:28 2013 +0300 > > > > > > > > > > > > drm/i915: ban badly behaving contexts > > > > > > > > > > > > Without this real gpu hangs only log output at info level, which gets > > > > > > filtered away by piglit's testrunner. > > > > > > > > > > A successful GPU hang is not an error. Might be a warn or a notice, but > > > > > it certainly isn't a driver error. > > > > > > > > Well not of the kernel driver, but might very well be a bug in the > > > > userspace driver. With this piglit marks tests that hung the gpu as > > > > dmesg-fail, without this they might even pass. Ken raised this on irc and > > > > I agree that it's a must-have feature for developers that their testsuite > > > > can tell them when stuff broke. Provding this some other way is a lot more > > > > work and imo should be done in a separate patch, this here is just the > > > > minimal fix for this regression. > > > > > > I strongly disagree that we should be working around self-imposed > > > limitations of the test suite by making users believe their kernel is > > > broken. > > > > So what else should piglit do then? > > -Daniel > > Your GPU hanging is clearly more severe than "info" - it may impact your system stability, and likely represents a bug somewhere in the graphics drivers (whether kernel or userspace). I think we all agree on that. > > Piglit runs "dmesg --level emerg,alert,crit,err,warn,notice", which covers everything except "info" and "debug". So anything other than info/debug would be just fine. If we are happy with KERN_NOTICE (normal, but significant condition), that is what I would suggest. Actually, we should make the GPU hang detection itself the notice (to aide regular users). But that probably runs into complications with the simulated hangs with igt causing a WARN test failure - but again, I'd rather our interface with the user (and userfacing bug reporting tools) be correct. -Chris
On Wed, Oct 01, 2014 at 11:28:47AM +0200, Daniel Vetter wrote: > This seems to have been accidentally lost in > > commit be62acb4cce1389a28296852737e3917d9cc5b25 > Author: Mika Kuoppala <mika.kuoppala@linux.intel.com> > Date: Fri Aug 30 16:19:28 2013 +0300 > > drm/i915: ban badly behaving contexts > > Without this real gpu hangs only log output at info level, which gets > filtered away by piglit's testrunner. > > v2: Tune down to notice level. Note that we need to add drm/i915 so > that at least the automatic igt dmesg filtering still picks it up. > > v3: git add and lack of coffee don't mix well. > > Cc: Mika Kuoppala <mika.kuoppala@intel.com> > Cc: Chris Wilson <chris@chris-wilson.co.uk> > Cc: Kenneth Graunke <kenneth@whitecape.org> > Reported-by: Kenneth Graunke <kenneth@whitecape.org> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Hmm, in my igt hang tests, I don't use dev_priv->gpu_error.stop_rings as it is itself incompatible with the tests. This now causes those to fail with a WARN. -Chris
On Wed, Oct 01, 2014 at 10:37:38AM +0100, Chris Wilson wrote: > On Wed, Oct 01, 2014 at 11:28:47AM +0200, Daniel Vetter wrote: > > This seems to have been accidentally lost in > > > > commit be62acb4cce1389a28296852737e3917d9cc5b25 > > Author: Mika Kuoppala <mika.kuoppala@linux.intel.com> > > Date: Fri Aug 30 16:19:28 2013 +0300 > > > > drm/i915: ban badly behaving contexts > > > > Without this real gpu hangs only log output at info level, which gets > > filtered away by piglit's testrunner. > > > > v2: Tune down to notice level. Note that we need to add drm/i915 so > > that at least the automatic igt dmesg filtering still picks it up. > > > > v3: git add and lack of coffee don't mix well. > > > > Cc: Mika Kuoppala <mika.kuoppala@intel.com> > > Cc: Chris Wilson <chris@chris-wilson.co.uk> > > Cc: Kenneth Graunke <kenneth@whitecape.org> > > Reported-by: Kenneth Graunke <kenneth@whitecape.org> > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> > > Hmm, in my igt hang tests, I don't use dev_priv->gpu_error.stop_rings as > it is itself incompatible with the tests. This now causes those to fail > with a WARN. In Mika's reset stats test we've set stop_rings after execbuf so that we could submit fancy stuff like endless loops with batch-chaining while still shutting up the kernel's output for real hangs. The nice benefit is that looking at stop_rings then gives you an easy way to double-check from the test that it all worked out since it's getting auto-cleared. In any case we have a fairly great mess here :( -Daniel
On Wed, Oct 01, 2014 at 11:41:31AM +0200, Daniel Vetter wrote: > On Wed, Oct 01, 2014 at 10:37:38AM +0100, Chris Wilson wrote: > > On Wed, Oct 01, 2014 at 11:28:47AM +0200, Daniel Vetter wrote: > > > This seems to have been accidentally lost in > > > > > > commit be62acb4cce1389a28296852737e3917d9cc5b25 > > > Author: Mika Kuoppala <mika.kuoppala@linux.intel.com> > > > Date: Fri Aug 30 16:19:28 2013 +0300 > > > > > > drm/i915: ban badly behaving contexts > > > > > > Without this real gpu hangs only log output at info level, which gets > > > filtered away by piglit's testrunner. > > > > > > v2: Tune down to notice level. Note that we need to add drm/i915 so > > > that at least the automatic igt dmesg filtering still picks it up. > > > > > > v3: git add and lack of coffee don't mix well. > > > > > > Cc: Mika Kuoppala <mika.kuoppala@intel.com> > > > Cc: Chris Wilson <chris@chris-wilson.co.uk> > > > Cc: Kenneth Graunke <kenneth@whitecape.org> > > > Reported-by: Kenneth Graunke <kenneth@whitecape.org> > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> > > > > Hmm, in my igt hang tests, I don't use dev_priv->gpu_error.stop_rings as > > it is itself incompatible with the tests. This now causes those to fail > > with a WARN. > > In Mika's reset stats test we've set stop_rings after execbuf so that we > could submit fancy stuff like endless loops with batch-chaining while > still shutting up the kernel's output for real hangs. That doesn't work for me when I try hanging from one context and running normal coherency tests in another... > The nice benefit is that looking at stop_rings then gives you an easy way > to double-check from the test that it all worked out since it's getting > auto-cleared. A single global value when using multiple concurrent rendering contexts is not so nice. Plus the interface is a pita. -Chris
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c index 6948877c881c..1870759a5ed8 100644 --- a/drivers/gpu/drm/i915/i915_drv.c +++ b/drivers/gpu/drm/i915/i915_drv.c @@ -842,6 +842,8 @@ int i915_reset(struct drm_device *dev) "error for simulated gpu hangs\n"); ret = 0; } + } else { + DRM_ERROR("Reset chip after gpu hang\n"); } if (ret) {
This seems to have been accidentally lost in commit be62acb4cce1389a28296852737e3917d9cc5b25 Author: Mika Kuoppala <mika.kuoppala@linux.intel.com> Date: Fri Aug 30 16:19:28 2013 +0300 drm/i915: ban badly behaving contexts Without this real gpu hangs only log output at info level, which gets filtered away by piglit's testrunner. Cc: Mika Kuoppala <mika.kuoppala@intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Kenneth Graunke <kenneth@whitecape.org> Reported-by: Kenneth Graunke <kenneth@whitecape.org> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> -- Note, totally untested since it's a bit late here ;-) -Daniel --- drivers/gpu/drm/i915/i915_drv.c | 2 ++ 1 file changed, 2 insertions(+)