diff mbox series

[RFC,2/2] drm/i915: Use user engine names in error state ecode

Message ID 20201104122043.876567-2-tvrtko.ursulin@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series [RFC,1/2] drm/i915: Improve record of hung engines in error state | expand

Commit Message

Tvrtko Ursulin Nov. 4, 2020, 12:20 p.m. UTC
From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Instead of printing out the internal engine mask, which can change between
kernel versions making it difficult to map to actual engines, list user
friendly engine names in the ecode string. For example:

  [drm] GPU HANG: ecode 9:vcs1:a77ffefe, in gem_exec_captur [929]

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 38 ++++++++++++++++-----------
 drivers/gpu/drm/i915/i915_gpu_error.h |  2 +-
 2 files changed, 24 insertions(+), 16 deletions(-)

Comments

Chris Wilson Nov. 4, 2020, 12:33 p.m. UTC | #1
Quoting Tvrtko Ursulin (2020-11-04 12:20:43)
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> Instead of printing out the internal engine mask, which can change between
> kernel versions making it difficult to map to actual engines, list user
> friendly engine names in the ecode string. For example:

Nah. It's a nonsense number, just exists for quick and futile discrimination.
Trying to interpret it is pointless.

There's very little value to be gained from it, it should just serve as a
tale-tell that we have captured an error state. The action and impact of
the reset should be separately recorded.
-Chris
Tvrtko Ursulin Nov. 4, 2020, 1:06 p.m. UTC | #2
On 04/11/2020 12:33, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2020-11-04 12:20:43)
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> Instead of printing out the internal engine mask, which can change between
>> kernel versions making it difficult to map to actual engines, list user
>> friendly engine names in the ecode string. For example:
> 
> Nah. It's a nonsense number, just exists for quick and futile discrimination.
> Trying to interpret it is pointless.
> 
> There's very little value to be gained from it, it should just serve as a
> tale-tell that we have captured an error state. The action and impact of
> the reset should be separately recorded.

My problem with the nonsense number is that we have it, but that is is 
unstable and people are interpreting it.

How about a bitmask of uabi classes instead? As you can see I really 
want something from the ABI-land, or not at all. Classes might be just 
the thing for the purpose of a signature.

Regards,

Tvrtko
Chris Wilson Nov. 4, 2020, 1:21 p.m. UTC | #3
Quoting Tvrtko Ursulin (2020-11-04 13:06:43)
> 
> On 04/11/2020 12:33, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2020-11-04 12:20:43)
> >> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>
> >> Instead of printing out the internal engine mask, which can change between
> >> kernel versions making it difficult to map to actual engines, list user
> >> friendly engine names in the ecode string. For example:
> > 
> > Nah. It's a nonsense number, just exists for quick and futile discrimination.
> > Trying to interpret it is pointless.
> > 
> > There's very little value to be gained from it, it should just serve as a
> > tale-tell that we have captured an error state. The action and impact of
> > the reset should be separately recorded.
> 
> My problem with the nonsense number is that we have it, but that is is 
> unstable and people are interpreting it.
> 
> How about a bitmask of uabi classes instead? As you can see I really 
> want something from the ABI-land, or not at all. Classes might be just 
> the thing for the purpose of a signature.

You can probably tell I've been pushing for the not-at-all :)

I've personally not found it helpful, it's too simplistic and unstable
even for repeated GL hangs. The concept of having a hash that can
summarise the hang is definitely a good idea, but the input to that hash
is flawed.

Given that we record the reset action, and the context that was
impacted, I wonder how much we need to say here. Just announce a new
error state has been captured?
-Chris
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 857db66cc4a3..ce876e3f3ec5 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1659,32 +1659,40 @@  static u32 generate_ecode(const struct intel_engine_coredump *ee)
 static const char *error_msg(struct i915_gpu_coredump *error)
 {
 	struct intel_engine_coredump *first = NULL;
+	int space = sizeof(error->error_msg) - 1, len;
 	struct intel_gt_coredump *gt;
-	intel_engine_mask_t engines;
-	int len;
+	char *p = error->error_msg;
+
+	len = scnprintf(p, space, "GPU HANG: ecode %d:",
+			INTEL_GEN(error->i915));
+	p += len;
+	space -= len;
 
-	engines = 0;
 	for (gt = error->gt; gt; gt = gt->next) {
 		struct intel_engine_coredump *cs;
 
 		for (cs = gt->engine; cs; cs = cs->next) {
-			if (cs->hung) {
-				engines |= cs->engine->mask;
-				if (!first)
-					first = cs;
-			}
+			if (!cs->hung)
+				continue;
+
+			len = scnprintf(p, space, "%s%s",
+					first ? "," : "",
+					cs->engine->name);
+			p += len;
+			space -= len;
+
+			if (!first)
+				first = cs;
 		}
 	}
 
-	len = scnprintf(error->error_msg, sizeof(error->error_msg),
-			"GPU HANG: ecode %d:%x:%08x",
-			INTEL_GEN(error->i915), engines,
-			generate_ecode(first));
+	len = scnprintf(p, space, ":%08x", generate_ecode(first));
+	p += len;
+	space -= len;
+
 	if (first && first->context.pid) {
 		/* Just show the first executing process, more is confusing */
-		len += scnprintf(error->error_msg + len,
-				 sizeof(error->error_msg) - len,
-				 ", in %s [%d]",
+		len += scnprintf(p, space, ", in %s [%d]",
 				 first->context.comm, first->context.pid);
 	}
 
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
index 3a7ca90a3436..6b8573ddbe67 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.h
+++ b/drivers/gpu/drm/i915/i915_gpu_error.h
@@ -168,7 +168,7 @@  struct i915_gpu_coredump {
 
 	struct intel_gt_coredump *gt;
 
-	char error_msg[128];
+	char error_msg[256];
 	bool simulated;
 	bool wakelock;
 	bool suspended;