Message ID | 20201104122043.876567-2-tvrtko.ursulin@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC,1/2] drm/i915: Improve record of hung engines in error state | expand |
Quoting Tvrtko Ursulin (2020-11-04 12:20:43) > From: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > > Instead of printing out the internal engine mask, which can change between > kernel versions making it difficult to map to actual engines, list user > friendly engine names in the ecode string. For example: Nah. It's a nonsense number, just exists for quick and futile discrimination. Trying to interpret it is pointless. There's very little value to be gained from it, it should just serve as a tale-tell that we have captured an error state. The action and impact of the reset should be separately recorded. -Chris
On 04/11/2020 12:33, Chris Wilson wrote: > Quoting Tvrtko Ursulin (2020-11-04 12:20:43) >> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com> >> >> Instead of printing out the internal engine mask, which can change between >> kernel versions making it difficult to map to actual engines, list user >> friendly engine names in the ecode string. For example: > > Nah. It's a nonsense number, just exists for quick and futile discrimination. > Trying to interpret it is pointless. > > There's very little value to be gained from it, it should just serve as a > tale-tell that we have captured an error state. The action and impact of > the reset should be separately recorded. My problem with the nonsense number is that we have it, but that is is unstable and people are interpreting it. How about a bitmask of uabi classes instead? As you can see I really want something from the ABI-land, or not at all. Classes might be just the thing for the purpose of a signature. Regards, Tvrtko
Quoting Tvrtko Ursulin (2020-11-04 13:06:43) > > On 04/11/2020 12:33, Chris Wilson wrote: > > Quoting Tvrtko Ursulin (2020-11-04 12:20:43) > >> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > >> > >> Instead of printing out the internal engine mask, which can change between > >> kernel versions making it difficult to map to actual engines, list user > >> friendly engine names in the ecode string. For example: > > > > Nah. It's a nonsense number, just exists for quick and futile discrimination. > > Trying to interpret it is pointless. > > > > There's very little value to be gained from it, it should just serve as a > > tale-tell that we have captured an error state. The action and impact of > > the reset should be separately recorded. > > My problem with the nonsense number is that we have it, but that is is > unstable and people are interpreting it. > > How about a bitmask of uabi classes instead? As you can see I really > want something from the ABI-land, or not at all. Classes might be just > the thing for the purpose of a signature. You can probably tell I've been pushing for the not-at-all :) I've personally not found it helpful, it's too simplistic and unstable even for repeated GL hangs. The concept of having a hash that can summarise the hang is definitely a good idea, but the input to that hash is flawed. Given that we record the reset action, and the context that was impacted, I wonder how much we need to say here. Just announce a new error state has been captured? -Chris
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index 857db66cc4a3..ce876e3f3ec5 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -1659,32 +1659,40 @@ static u32 generate_ecode(const struct intel_engine_coredump *ee) static const char *error_msg(struct i915_gpu_coredump *error) { struct intel_engine_coredump *first = NULL; + int space = sizeof(error->error_msg) - 1, len; struct intel_gt_coredump *gt; - intel_engine_mask_t engines; - int len; + char *p = error->error_msg; + + len = scnprintf(p, space, "GPU HANG: ecode %d:", + INTEL_GEN(error->i915)); + p += len; + space -= len; - engines = 0; for (gt = error->gt; gt; gt = gt->next) { struct intel_engine_coredump *cs; for (cs = gt->engine; cs; cs = cs->next) { - if (cs->hung) { - engines |= cs->engine->mask; - if (!first) - first = cs; - } + if (!cs->hung) + continue; + + len = scnprintf(p, space, "%s%s", + first ? "," : "", + cs->engine->name); + p += len; + space -= len; + + if (!first) + first = cs; } } - len = scnprintf(error->error_msg, sizeof(error->error_msg), - "GPU HANG: ecode %d:%x:%08x", - INTEL_GEN(error->i915), engines, - generate_ecode(first)); + len = scnprintf(p, space, ":%08x", generate_ecode(first)); + p += len; + space -= len; + if (first && first->context.pid) { /* Just show the first executing process, more is confusing */ - len += scnprintf(error->error_msg + len, - sizeof(error->error_msg) - len, - ", in %s [%d]", + len += scnprintf(p, space, ", in %s [%d]", first->context.comm, first->context.pid); } diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h index 3a7ca90a3436..6b8573ddbe67 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.h +++ b/drivers/gpu/drm/i915/i915_gpu_error.h @@ -168,7 +168,7 @@ struct i915_gpu_coredump { struct intel_gt_coredump *gt; - char error_msg[128]; + char error_msg[256]; bool simulated; bool wakelock; bool suspended;