diff mbox

drm/i915: Remove instructions to file a bug report.

Message ID 1480726985-12762-1-git-send-email-mattst88@gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Matt Turner Dec. 3, 2016, 1:03 a.m. UTC
From these instructions, users assume that /sys/class/drm/card0/error
contains all the information a developer needs to diagnose and fix a GPU
hang.

In fact it doesn't, and we have no tools for solving them (other than
stabbing in the dark). Most of the time the error state itself isn't
even useful because it just shows a hang on a PIPE_CONTROL or similar.

Until a time when the error state contains enough information to
actually solve a hang, stop telling users to file unsolvable bugs, and
instead rely on users who know where and how to file a good bug report
to find their own way there.

Signed-off-by: Matt Turner <mattst88@gmail.com>
---
Maybe now's a good time to discuss what *would* be useful to put in the
error state for debugging hangs. The currently executing shader program
would be a great place to start.

 drivers/gpu/drm/i915/i915_gpu_error.c | 11 -----------
 1 file changed, 11 deletions(-)

Comments

Matt Turner Dec. 3, 2016, 1:26 a.m. UTC | #1
On Fri, Dec 2, 2016 at 5:03 PM, Matt Turner <mattst88@gmail.com> wrote:
> From these instructions, users assume that /sys/class/drm/card0/error
> contains all the information a developer needs to diagnose and fix a GPU
> hang.
>
> In fact it doesn't, and we have no tools for solving them (other than
> stabbing in the dark). Most of the time the error state itself isn't
> even useful because it just shows a hang on a PIPE_CONTROL or similar.
>
> Until a time when the error state contains enough information to
> actually solve a hang, stop telling users to file unsolvable bugs, and
> instead rely on users who know where and how to file a good bug report
> to find their own way there.
>
> Signed-off-by: Matt Turner <mattst88@gmail.com>
> ---
> Maybe now's a good time to discuss what *would* be useful to put in the
> error state for debugging hangs. The currently executing shader program
> would be a great place to start.

Looks like Ben had a patch:

https://cgit.freedesktop.org/~bwidawsk/drm-intel/commit/?h=extra_error_objs&id=711da2570cd3302d0cb9f2489a101e4a7c966a6c
Chris Wilson Dec. 3, 2016, 8:57 a.m. UTC | #2
On Fri, Dec 02, 2016 at 05:03:05PM -0800, Matt Turner wrote:
> From these instructions, users assume that /sys/class/drm/card0/error
> contains all the information a developer needs to diagnose and fix a GPU
> hang.
> 
> In fact it doesn't, and we have no tools for solving them (other than
> stabbing in the dark). Most of the time the error state itself isn't
> even useful because it just shows a hang on a PIPE_CONTROL or similar.
> 
> Until a time when the error state contains enough information to
> actually solve a hang, stop telling users to file unsolvable bugs, and
> instead rely on users who know where and how to file a good bug report
> to find their own way there.
> 
> Signed-off-by: Matt Turner <mattst88@gmail.com>

Nak. Though having stale bug reports is a pain, we've recently adopted
the policy of stopping the request after a certain period, those bug
reports are still vital. They don't just represent bugs in mesa.

> ---
> Maybe now's a good time to discuss what *would* be useful to put in the
> error state for debugging hangs. The currently executing shader program
> would be a great place to start.

Now? That is the conversation we've being trying to have for several
years. The contents of the error state are currently about sufficient to
spot kernel bugs, triage the culprit and the general class of bug.

Capturing all state for a request is unfeasible (because we can't copy
the gigabytes of memory required). Copying a selected set of aux bo is
one option. And since those bo are under user control and do not have to
be executed, you can even store aub data in them or whatnot.

Even if you make attaching the debug information conditional, I would
still keep the error message unconditional.
-Chris
Jani Nikula Dec. 3, 2016, 9:52 a.m. UTC | #3
On Sat, 03 Dec 2016, Matt Turner <mattst88@gmail.com> wrote:
> From these instructions, users assume that /sys/class/drm/card0/error
> contains all the information a developer needs to diagnose and fix a GPU
> hang.
>
> In fact it doesn't, and we have no tools for solving them (other than
> stabbing in the dark). Most of the time the error state itself isn't
> even useful because it just shows a hang on a PIPE_CONTROL or similar.
>
> Until a time when the error state contains enough information to
> actually solve a hang, stop telling users to file unsolvable bugs, and
> instead rely on users who know where and how to file a good bug report
> to find their own way there.
>
> Signed-off-by: Matt Turner <mattst88@gmail.com>
> ---
> Maybe now's a good time to discuss what *would* be useful to put in the
> error state for debugging hangs. The currently executing shader program
> would be a great place to start.

I'm wondering why we're getting this patch now, and my guess is that
it's because we have been reassigning the related bugs to Mesa more
actively lately. Is that the case?

IIUC the bug reports are useful for us when it's a kernel bug, but less
useful for you when it's a Mesa bug. And you'd rather have fewer
incoming bugs that you think are unsolvable with the information at
hand.

Sounds like a bug workflow issue between drm/i915 and Mesa to be ironed
out.


BR,
Jani.


>
>  drivers/gpu/drm/i915/i915_gpu_error.c | 11 -----------
>  1 file changed, 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 334f15d..8ddca7b 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -1431,7 +1431,6 @@ void i915_capture_error_state(struct drm_i915_private *dev_priv,
>  			      u32 engine_mask,
>  			      const char *error_msg)
>  {
> -	static bool warned;
>  	struct drm_i915_error_state *error;
>  	unsigned long flags;
>  
> @@ -1475,16 +1474,6 @@ void i915_capture_error_state(struct drm_i915_private *dev_priv,
>  		i915_error_state_free(&error->ref);
>  		return;
>  	}
> -
> -	if (!warned) {
> -		DRM_INFO("GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.\n");
> -		DRM_INFO("Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel\n");
> -		DRM_INFO("drm/i915 developers can then reassign to the right component if it's not a kernel issue.\n");
> -		DRM_INFO("The gpu crash dump is required to analyze gpu hangs, so please always attach it.\n");
> -		DRM_INFO("GPU crash dump saved to /sys/class/drm/card%d/error\n",
> -			 dev_priv->drm.primary->index);
> -		warned = true;
> -	}
>  }
>  
>  void i915_error_state_get(struct drm_device *dev,
Matt Turner Dec. 6, 2016, 12:55 a.m. UTC | #4
On Sat, Dec 3, 2016 at 1:52 AM, Jani Nikula <jani.nikula@linux.intel.com> wrote:
> On Sat, 03 Dec 2016, Matt Turner <mattst88@gmail.com> wrote:
>> From these instructions, users assume that /sys/class/drm/card0/error
>> contains all the information a developer needs to diagnose and fix a GPU
>> hang.
>>
>> In fact it doesn't, and we have no tools for solving them (other than
>> stabbing in the dark). Most of the time the error state itself isn't
>> even useful because it just shows a hang on a PIPE_CONTROL or similar.
>>
>> Until a time when the error state contains enough information to
>> actually solve a hang, stop telling users to file unsolvable bugs, and
>> instead rely on users who know where and how to file a good bug report
>> to find their own way there.
>>
>> Signed-off-by: Matt Turner <mattst88@gmail.com>
>> ---
>> Maybe now's a good time to discuss what *would* be useful to put in the
>> error state for debugging hangs. The currently executing shader program
>> would be a great place to start.
>
> I'm wondering why we're getting this patch now, and my guess is that
> it's because we have been reassigning the related bugs to Mesa more
> actively lately. Is that the case?

No, it's simply because I spent a week going through Bugzilla and
realized how incomplete an unactionable the majority of GPU hang
reports are.

Asking users to report bugs, but not telling them what actually
constitutes a bug report, is a recipe for a lot of wasted developer
time.

I suspect we could improve the usefulness of the reports by directing
users to a webpage that gave a few suggestions (tell us what you were
doing when the hang occurred would be an obvious one) about filing a
bug and then provided a link to Bugzilla. Or even configured Bugzilla
to have a default template that requested various bits of information.

> IIUC the bug reports are useful for us when it's a kernel bug, but less
> useful for you when it's a Mesa bug. And you'd rather have fewer
> incoming bugs that you think are unsolvable with the information at
> hand.
>
> Sounds like a bug workflow issue between drm/i915 and Mesa to be ironed
> out.

Indeed. I'd rather have the information provided in a bug report to
actually solve it. I hope having access to the shader program will
make many more reports useful.

I am also happy to see that there's now a sunset to the GPU hang message.
Daniel Vetter Dec. 7, 2016, 4:09 p.m. UTC | #5
On Mon, Dec 05, 2016 at 04:55:47PM -0800, Matt Turner wrote:
> On Sat, Dec 3, 2016 at 1:52 AM, Jani Nikula <jani.nikula@linux.intel.com> wrote:
> > On Sat, 03 Dec 2016, Matt Turner <mattst88@gmail.com> wrote:
> >> From these instructions, users assume that /sys/class/drm/card0/error
> >> contains all the information a developer needs to diagnose and fix a GPU
> >> hang.
> >>
> >> In fact it doesn't, and we have no tools for solving them (other than
> >> stabbing in the dark). Most of the time the error state itself isn't
> >> even useful because it just shows a hang on a PIPE_CONTROL or similar.
> >>
> >> Until a time when the error state contains enough information to
> >> actually solve a hang, stop telling users to file unsolvable bugs, and
> >> instead rely on users who know where and how to file a good bug report
> >> to find their own way there.
> >>
> >> Signed-off-by: Matt Turner <mattst88@gmail.com>
> >> ---
> >> Maybe now's a good time to discuss what *would* be useful to put in the
> >> error state for debugging hangs. The currently executing shader program
> >> would be a great place to start.
> >
> > I'm wondering why we're getting this patch now, and my guess is that
> > it's because we have been reassigning the related bugs to Mesa more
> > actively lately. Is that the case?
> 
> No, it's simply because I spent a week going through Bugzilla and
> realized how incomplete an unactionable the majority of GPU hang
> reports are.
> 
> Asking users to report bugs, but not telling them what actually
> constitutes a bug report, is a recipe for a lot of wasted developer
> time.
> 
> I suspect we could improve the usefulness of the reports by directing
> users to a webpage that gave a few suggestions (tell us what you were
> doing when the hang occurred would be an obvious one) about filing a
> bug and then provided a link to Bugzilla. Or even configured Bugzilla
> to have a default template that requested various bits of information.

I think dumping at least some of the aux buffers should make this tons
more useful for mesa, since it would indicate stuff like "we always die on
resolves on skl gt4" or stuff like that. Thus far error states have been
mostly used by kernel folks to debug kernel issues, which is why none of
that additional stuff gets dumped.

But a bare-bones parser to hunt for indirect state base addresses and fish
out the aux stuff shouldn't be that hard, and might make this fully
useful.

Like Chris said the goal is to at least be able to triage and classify
bugs, and I'm perfectly fine with merging additional code to the dumper to
get there for mesa folks. We z-compress the state, so size isn't really an
issue. And Ben has commit rights, so shouldn't be a problem to get this
all merged.

> > IIUC the bug reports are useful for us when it's a kernel bug, but less
> > useful for you when it's a Mesa bug. And you'd rather have fewer
> > incoming bugs that you think are unsolvable with the information at
> > hand.
> >
> > Sounds like a bug workflow issue between drm/i915 and Mesa to be ironed
> > out.
> 
> Indeed. I'd rather have the information provided in a bug report to
> actually solve it. I hope having access to the shader program will
> make many more reports useful.
> 
> I am also happy to see that there's now a sunset to the GPU hang message.

The other option is that mesa folks don't want error states that we triage
to mesa. We could definitely update the process in that area.
-Daniel
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 334f15d..8ddca7b 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1431,7 +1431,6 @@  void i915_capture_error_state(struct drm_i915_private *dev_priv,
 			      u32 engine_mask,
 			      const char *error_msg)
 {
-	static bool warned;
 	struct drm_i915_error_state *error;
 	unsigned long flags;
 
@@ -1475,16 +1474,6 @@  void i915_capture_error_state(struct drm_i915_private *dev_priv,
 		i915_error_state_free(&error->ref);
 		return;
 	}
-
-	if (!warned) {
-		DRM_INFO("GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.\n");
-		DRM_INFO("Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel\n");
-		DRM_INFO("drm/i915 developers can then reassign to the right component if it's not a kernel issue.\n");
-		DRM_INFO("The gpu crash dump is required to analyze gpu hangs, so please always attach it.\n");
-		DRM_INFO("GPU crash dump saved to /sys/class/drm/card%d/error\n",
-			 dev_priv->drm.primary->index);
-		warned = true;
-	}
 }
 
 void i915_error_state_get(struct drm_device *dev,