Message ID | 20180528171618.10436-1-michal.wajdeczko@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Quoting Michal Wajdeczko (2018-05-28 18:16:18) > SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host and > those events are now handled by intel_guc_to_host_event_handler_mmio(). > > We should not try to read it on MMIO action error as 1) we may be using > different set of registers for GuC MMIO communication, and 2) GuC may > use CTB mechanism for sending events to host. Ok. > While here, upgrade error message to DRM_ERROR. Does the error help? What do you want to convey to the user? For error handling, we want to propagate the result back anyway for the caller has to decide what to do next. -Chris
On Tue, 29 May 2018 16:54:12 +0200, Chris Wilson <chris@chris-wilson.co.uk> wrote: > Quoting Michal Wajdeczko (2018-05-28 18:16:18) >> SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host and >> those events are now handled by intel_guc_to_host_event_handler_mmio(). >> >> We should not try to read it on MMIO action error as 1) we may be using >> different set of registers for GuC MMIO communication, and 2) GuC may >> use CTB mechanism for sending events to host. > > Ok. > >> While here, upgrade error message to DRM_ERROR. > > Does the error help? What do you want to convey to the user? For error > handling, we want to propagate the result back anyway for the caller has > to decide what to do next. We are propagating error code to the caller, but since any error from the GuC is unexpected, we should rather always log it and don't rely on the caller or drm debug for that. Note that in case of CTB we also log received errors using DRM_ERROR (see intel_guc_send_ct). Michal
Quoting Michal Wajdeczko (2018-05-29 16:10:44) > On Tue, 29 May 2018 16:54:12 +0200, Chris Wilson > <chris@chris-wilson.co.uk> wrote: > > > Quoting Michal Wajdeczko (2018-05-28 18:16:18) > >> SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host and > >> those events are now handled by intel_guc_to_host_event_handler_mmio(). > >> > >> We should not try to read it on MMIO action error as 1) we may be using > >> different set of registers for GuC MMIO communication, and 2) GuC may > >> use CTB mechanism for sending events to host. > > > > Ok. > > > >> While here, upgrade error message to DRM_ERROR. > > > > Does the error help? What do you want to convey to the user? For error > > handling, we want to propagate the result back anyway for the caller has > > to decide what to do next. > > We are propagating error code to the caller, but since any error from the > GuC is unexpected, we should rather always log it and don't rely on the > caller or drm debug for that. Note that in case of CTB we also log received > errors using DRM_ERROR (see intel_guc_send_ct). But whose error? Ours or the hw? We expect hw errors, or should ;) But mostly from the pov of the message, is this the right information to flag as the error or does the caller have better context? -Chris
On Tue, 29 May 2018 17:17:02 +0200, Chris Wilson <chris@chris-wilson.co.uk> wrote: > Quoting Michal Wajdeczko (2018-05-29 16:10:44) >> On Tue, 29 May 2018 16:54:12 +0200, Chris Wilson >> <chris@chris-wilson.co.uk> wrote: >> >> > Quoting Michal Wajdeczko (2018-05-28 18:16:18) >> >> SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host >> and >> >> those events are now handled by >> intel_guc_to_host_event_handler_mmio(). >> >> >> >> We should not try to read it on MMIO action error as 1) we may be >> using >> >> different set of registers for GuC MMIO communication, and 2) GuC may >> >> use CTB mechanism for sending events to host. >> > >> > Ok. >> > >> >> While here, upgrade error message to DRM_ERROR. >> > >> > Does the error help? What do you want to convey to the user? For error >> > handling, we want to propagate the result back anyway for the caller >> has >> > to decide what to do next. >> >> We are propagating error code to the caller, but since any error from >> the >> GuC is unexpected, we should rather always log it and don't rely on the >> caller or drm debug for that. Note that in case of CTB we also log >> received >> errors using DRM_ERROR (see intel_guc_send_ct). > > But whose error? Ours or the hw? We expect hw errors, or should ;) well, it can be any i915/FW/HW - hard to tell without other full logs.. > > But mostly from the pov of the message, is this the right information to > flag as the error or does the caller have better context? Only caller can easily provide additional info related for failed command (such as index/address that was rejected by FW) that could help diagnose the problem, but in case FW/HW errors it does not matter. At this point, we can only identify request/action ID that has failed. But that's better than nothing. /Michal
Quoting Michal Wajdeczko (2018-05-28 18:16:18) > SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host and > those events are now handled by intel_guc_to_host_event_handler_mmio(). > > We should not try to read it on MMIO action error as 1) we may be using > different set of registers for GuC MMIO communication, and 2) GuC may > use CTB mechanism for sending events to host. > > While here, upgrade error message to DRM_ERROR. > > Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> > Cc: Michel Thierry <michel.thierry@intel.com> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> > Cc: Chris Wilson <chris@chris-wilson.co.uk> I'm still not totally sold on having the DRM_ERROR here improves debugging; it doesn't do anything to improve error handling, but Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> nevertheless. -Chris
Quoting Chris Wilson (2018-05-31 19:13:32) > Quoting Michal Wajdeczko (2018-05-28 18:16:18) > > SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host and > > those events are now handled by intel_guc_to_host_event_handler_mmio(). > > > > We should not try to read it on MMIO action error as 1) we may be using > > different set of registers for GuC MMIO communication, and 2) GuC may > > use CTB mechanism for sending events to host. > > > > While here, upgrade error message to DRM_ERROR. > > > > Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> > > Cc: Michel Thierry <michel.thierry@intel.com> > > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> > > Cc: Chris Wilson <chris@chris-wilson.co.uk> > > I'm still not totally sold on having the DRM_ERROR here improves > debugging; it doesn't do anything to improve error handling, but > > Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> > > nevertheless. And pushed. Thank you for the patch, -Chris
Quoting Chris Wilson (2018-05-29 15:54:12) > Quoting Michal Wajdeczko (2018-05-28 18:16:18) > > SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host and > > those events are now handled by intel_guc_to_host_event_handler_mmio(). > > > > We should not try to read it on MMIO action error as 1) we may be using > > different set of registers for GuC MMIO communication, and 2) GuC may > > use CTB mechanism for sending events to host. > > Ok. > > > While here, upgrade error message to DRM_ERROR. > > Does the error help? What do you want to convey to the user? For error > handling, we want to propagate the result back anyway for the caller has > to decide what to do next. Good news! We see the error in BAT, [ 542.138479] i915: unknown parameter 'enable_guc_loading' ignored [ 542.138483] i915: unknown parameter 'enable_guc_submission' ignored [ 542.138485] Setting dangerous option enable_guc - tainting kernel [ 542.138488] Setting dangerous option live_selftests - tainting kernel [ 542.173291] [drm:intel_guc_send_mmio [i915]] *ERROR* MMIO: GuC action 0x10 failed with error -5 0xf000f000 [ 542.367055] i915: probe of 0000:00:02.0 failed with error -25 And Michał reminded me this wasn't the first time... commit feb06c151fade9ecaa3dd410d792cce26e8b10de Author: Michał Winiarski <michal.winiarski@intel.com> Date: Mon Mar 19 10:53:47 2018 +0100 drm/i915/guc: Demote GuC error messages We're using those functions in selftests, and the callers are expected to do the error handling anyways. Let's demote all GuC actions and doorbell creation to DEBUG_DRIVER. So do we kindly ask Michał to resubmit his fix? -Chris
On Thu, 12 Jul 2018 17:31:14 +0200, Chris Wilson <chris@chris-wilson.co.uk> wrote: > Quoting Chris Wilson (2018-05-29 15:54:12) >> Quoting Michal Wajdeczko (2018-05-28 18:16:18) >> > SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host >> and >> > those events are now handled by >> intel_guc_to_host_event_handler_mmio(). >> > >> > We should not try to read it on MMIO action error as 1) we may be >> using >> > different set of registers for GuC MMIO communication, and 2) GuC may >> > use CTB mechanism for sending events to host. >> >> Ok. >> >> > While here, upgrade error message to DRM_ERROR. >> >> Does the error help? What do you want to convey to the user? For error >> handling, we want to propagate the result back anyway for the caller has >> to decide what to do next. > > Good news! We see the error in BAT, > > [ 542.138479] i915: unknown parameter 'enable_guc_loading' ignored > [ 542.138483] i915: unknown parameter 'enable_guc_submission' ignored > [ 542.138485] Setting dangerous option enable_guc - tainting kernel > [ 542.138488] Setting dangerous option live_selftests - tainting kernel > [ 542.173291] [drm:intel_guc_send_mmio [i915]] *ERROR* MMIO: GuC action > 0x10 failed with error -5 0xf000f000 > [ 542.367055] i915: probe of 0000:00:02.0 failed with error -25 > > And Michał reminded me this wasn't the first time... > > commit feb06c151fade9ecaa3dd410d792cce26e8b10de > Author: Michał Winiarski <michal.winiarski@intel.com> > Date: Mon Mar 19 10:53:47 2018 +0100 > > drm/i915/guc: Demote GuC error messages > We're using those functions in selftests, and the callers are expected > to do the error handling anyways. Let's demote all GuC actions and > doorbell creation to DEBUG_DRIVER. > > So do we kindly ask Michał to resubmit his fix? There are more places where DRM_ERROR is used after detection GuC error (see intel_guc_send_ct as example, more to show up shortly) I would rather prefer to add GUC_ERROR macro that could be tweaked under SELFTEST config and runtime flags to demote unwanted errors: #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST) #define GUC_ERROR(...) \ do { \ if (unlikely(i915_selftest.mock || i915_selftest.live)) \ DRM_DEBUG_DRIVER(__VA_ARGS__); \ else \ DRM_ERROR(__VA_ARGS__); } while (0); #else #define GUC_ERROR(...) DRM_ERROR(__VA_ARGS__) #endif /Michal
diff --git a/drivers/gpu/drm/i915/intel_guc.c b/drivers/gpu/drm/i915/intel_guc.c index 116f4cc..e28a996 100644 --- a/drivers/gpu/drm/i915/intel_guc.c +++ b/drivers/gpu/drm/i915/intel_guc.c @@ -346,10 +346,8 @@ int intel_guc_send_mmio(struct intel_guc *guc, const u32 *action, u32 len, ret = -EIO; if (ret) { - DRM_DEBUG_DRIVER("INTEL_GUC_SEND: Action 0x%X failed;" - " ret=%d status=0x%08X response=0x%08X\n", - action[0], ret, status, - I915_READ(SOFT_SCRATCH(15))); + DRM_ERROR("MMIO: GuC action %#x failed with error %d %#x\n", + action[0], ret, status); goto out; }
SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host and those events are now handled by intel_guc_to_host_event_handler_mmio(). We should not try to read it on MMIO action error as 1) we may be using different set of registers for GuC MMIO communication, and 2) GuC may use CTB mechanism for sending events to host. While here, upgrade error message to DRM_ERROR. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Michel Thierry <michel.thierry@intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> --- drivers/gpu/drm/i915/intel_guc.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-)