diff mbox series

[2/2] drm/i915/selftests: Improve error detection of reset failure

Message ID 20190312111146.10662-2-chris@chris-wilson.co.uk (mailing list archive)
State New, archived
Headers show
Series [1/2] drm/i915: Consolidate reset-request debug message | expand

Commit Message

Chris Wilson March 12, 2019, 11:11 a.m. UTC
Use a timedwait to promptly detect if the recovery after reset fails and
provide a meaningful debug dump.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 .../gpu/drm/i915/selftests/intel_hangcheck.c   | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

Comments

Mika Kuoppala March 12, 2019, 12:33 p.m. UTC | #1
Chris Wilson <chris@chris-wilson.co.uk> writes:

> Use a timedwait to promptly detect if the recovery after reset fails and
> provide a meaningful debug dump.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  .../gpu/drm/i915/selftests/intel_hangcheck.c   | 18 +++++++++++++++++-
>  1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> index 10658ad05305..b5e35b2a925f 100644
> --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> @@ -983,7 +983,23 @@ static int __igt_reset_engines(struct drm_i915_private *i915,
>  			count++;
>  
>  			if (rq) {
> -				i915_request_wait(rq, 0, MAX_SCHEDULE_TIMEOUT);
> +				if (i915_request_wait(rq, 0, HZ / 5) < 0) {
> +					struct drm_printer p =
> +						drm_info_printer(i915->drm.dev);
> +
> +					pr_err("i915_reset_engine(%s:%s):"
> +					       " failed to complete request after reset\n",
> +					       engine->name, test_name);
> +					intel_engine_dump(engine, &p,
> +							  "%s\n", engine->name);
> +					i915_request_put(rq);
> +
> +					GEM_TRACE_DUMP();

Would this be useful also in above, where we wait until request
is running?


> +					i915_gem_set_wedged(i915);
> +					err = -EIO;

Moving the *rq out from loop and then doing if(rq) i915_request_put(rq);
out loop after exit, might be easier on error handling, shrug.

Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

> +					break;
> +				}
> +
>  				i915_request_put(rq);
>  			}
>  
> -- 
> 2.20.1
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Chris Wilson March 12, 2019, 12:43 p.m. UTC | #2
Quoting Mika Kuoppala (2019-03-12 12:33:37)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > Use a timedwait to promptly detect if the recovery after reset fails and
> > provide a meaningful debug dump.
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > ---
> >  .../gpu/drm/i915/selftests/intel_hangcheck.c   | 18 +++++++++++++++++-
> >  1 file changed, 17 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> > index 10658ad05305..b5e35b2a925f 100644
> > --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> > +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> > @@ -983,7 +983,23 @@ static int __igt_reset_engines(struct drm_i915_private *i915,
> >                       count++;
> >  
> >                       if (rq) {
> > -                             i915_request_wait(rq, 0, MAX_SCHEDULE_TIMEOUT);
> > +                             if (i915_request_wait(rq, 0, HZ / 5) < 0) {
> > +                                     struct drm_printer p =
> > +                                             drm_info_printer(i915->drm.dev);
> > +
> > +                                     pr_err("i915_reset_engine(%s:%s):"
> > +                                            " failed to complete request after reset\n",
> > +                                            engine->name, test_name);
> > +                                     intel_engine_dump(engine, &p,
> > +                                                       "%s\n", engine->name);
> > +                                     i915_request_put(rq);
> > +
> > +                                     GEM_TRACE_DUMP();
> 
> Would this be useful also in above, where we wait until request
> is running?

Hard to tell until we hit it. The big problem here is the trace is very
noisy; the important bits are in the engine dump which provide the
needle. And the trace may or may not have the information you are
looking for.

Generally, it's the post reset recovery that fails; once we are happy
the reset is ok, the next request is also likely to work. Or it may not,
hence the test!

In the case I was debugging, it wasn't so much a failure in reset
handling, but this test had the precise timing required to fool
unwind_incomplete_requests.

> > +                                     i915_gem_set_wedged(i915);
> > +                                     err = -EIO;
> 
> Moving the *rq out from loop and then doing if(rq) i915_request_put(rq);
> out loop after exit, might be easier on error handling, shrug.

Or more convoluted. Didn't feel particularly happy.
-Chris
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
index 10658ad05305..b5e35b2a925f 100644
--- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
@@ -983,7 +983,23 @@  static int __igt_reset_engines(struct drm_i915_private *i915,
 			count++;
 
 			if (rq) {
-				i915_request_wait(rq, 0, MAX_SCHEDULE_TIMEOUT);
+				if (i915_request_wait(rq, 0, HZ / 5) < 0) {
+					struct drm_printer p =
+						drm_info_printer(i915->drm.dev);
+
+					pr_err("i915_reset_engine(%s:%s):"
+					       " failed to complete request after reset\n",
+					       engine->name, test_name);
+					intel_engine_dump(engine, &p,
+							  "%s\n", engine->name);
+					i915_request_put(rq);
+
+					GEM_TRACE_DUMP();
+					i915_gem_set_wedged(i915);
+					err = -EIO;
+					break;
+				}
+
 				i915_request_put(rq);
 			}