diff mbox

drm/i915/guc: Don't read SOFT_SCRATCH(15) on MMIO error

Message ID 20180528171618.10436-1-michal.wajdeczko@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Michal Wajdeczko May 28, 2018, 5:16 p.m. UTC
SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host and
those events are now handled by intel_guc_to_host_event_handler_mmio().

We should not try to read it on MMIO action error as 1) we may be using
different set of registers for GuC MMIO communication, and 2) GuC may
use CTB mechanism for sending events to host.

While here, upgrade error message to DRM_ERROR.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/intel_guc.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

Comments

Chris Wilson May 29, 2018, 2:54 p.m. UTC | #1
Quoting Michal Wajdeczko (2018-05-28 18:16:18)
> SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host and
> those events are now handled by intel_guc_to_host_event_handler_mmio().
> 
> We should not try to read it on MMIO action error as 1) we may be using
> different set of registers for GuC MMIO communication, and 2) GuC may
> use CTB mechanism for sending events to host.

Ok.
 
> While here, upgrade error message to DRM_ERROR.

Does the error help? What do you want to convey to the user? For error
handling, we want to propagate the result back anyway for the caller has
to decide what to do next.
-Chris
Michal Wajdeczko May 29, 2018, 3:10 p.m. UTC | #2
On Tue, 29 May 2018 16:54:12 +0200, Chris Wilson  
<chris@chris-wilson.co.uk> wrote:

> Quoting Michal Wajdeczko (2018-05-28 18:16:18)
>> SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host and
>> those events are now handled by intel_guc_to_host_event_handler_mmio().
>>
>> We should not try to read it on MMIO action error as 1) we may be using
>> different set of registers for GuC MMIO communication, and 2) GuC may
>> use CTB mechanism for sending events to host.
>
> Ok.
>
>> While here, upgrade error message to DRM_ERROR.
>
> Does the error help? What do you want to convey to the user? For error
> handling, we want to propagate the result back anyway for the caller has
> to decide what to do next.

We are propagating error code to the caller, but since any error from the
GuC is unexpected, we should rather always log it and don't rely on the
caller or drm debug for that. Note that in case of CTB we also log received
errors using DRM_ERROR (see intel_guc_send_ct).

Michal
Chris Wilson May 29, 2018, 3:17 p.m. UTC | #3
Quoting Michal Wajdeczko (2018-05-29 16:10:44)
> On Tue, 29 May 2018 16:54:12 +0200, Chris Wilson  
> <chris@chris-wilson.co.uk> wrote:
> 
> > Quoting Michal Wajdeczko (2018-05-28 18:16:18)
> >> SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host and
> >> those events are now handled by intel_guc_to_host_event_handler_mmio().
> >>
> >> We should not try to read it on MMIO action error as 1) we may be using
> >> different set of registers for GuC MMIO communication, and 2) GuC may
> >> use CTB mechanism for sending events to host.
> >
> > Ok.
> >
> >> While here, upgrade error message to DRM_ERROR.
> >
> > Does the error help? What do you want to convey to the user? For error
> > handling, we want to propagate the result back anyway for the caller has
> > to decide what to do next.
> 
> We are propagating error code to the caller, but since any error from the
> GuC is unexpected, we should rather always log it and don't rely on the
> caller or drm debug for that. Note that in case of CTB we also log received
> errors using DRM_ERROR (see intel_guc_send_ct).

But whose error? Ours or the hw? We expect hw errors, or should ;)

But mostly from the pov of the message, is this the right information to
flag as the error or does the caller have better context?
-Chris
Michal Wajdeczko May 29, 2018, 3:30 p.m. UTC | #4
On Tue, 29 May 2018 17:17:02 +0200, Chris Wilson  
<chris@chris-wilson.co.uk> wrote:

> Quoting Michal Wajdeczko (2018-05-29 16:10:44)
>> On Tue, 29 May 2018 16:54:12 +0200, Chris Wilson
>> <chris@chris-wilson.co.uk> wrote:
>>
>> > Quoting Michal Wajdeczko (2018-05-28 18:16:18)
>> >> SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host  
>> and
>> >> those events are now handled by  
>> intel_guc_to_host_event_handler_mmio().
>> >>
>> >> We should not try to read it on MMIO action error as 1) we may be  
>> using
>> >> different set of registers for GuC MMIO communication, and 2) GuC may
>> >> use CTB mechanism for sending events to host.
>> >
>> > Ok.
>> >
>> >> While here, upgrade error message to DRM_ERROR.
>> >
>> > Does the error help? What do you want to convey to the user? For error
>> > handling, we want to propagate the result back anyway for the caller  
>> has
>> > to decide what to do next.
>>
>> We are propagating error code to the caller, but since any error from  
>> the
>> GuC is unexpected, we should rather always log it and don't rely on the
>> caller or drm debug for that. Note that in case of CTB we also log  
>> received
>> errors using DRM_ERROR (see intel_guc_send_ct).
>
> But whose error? Ours or the hw? We expect hw errors, or should ;)

well, it can be any i915/FW/HW - hard to tell without other full logs..

>
> But mostly from the pov of the message, is this the right information to
> flag as the error or does the caller have better context?

Only caller can easily provide additional info related for failed command
(such as index/address that was rejected by FW) that could help diagnose
the problem, but in case FW/HW errors it does not matter.

At this point, we can only identify request/action ID that has failed.
But that's better than nothing.

/Michal
Chris Wilson May 31, 2018, 6:13 p.m. UTC | #5
Quoting Michal Wajdeczko (2018-05-28 18:16:18)
> SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host and
> those events are now handled by intel_guc_to_host_event_handler_mmio().
> 
> We should not try to read it on MMIO action error as 1) we may be using
> different set of registers for GuC MMIO communication, and 2) GuC may
> use CTB mechanism for sending events to host.
> 
> While here, upgrade error message to DRM_ERROR.
> 
> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
> Cc: Michel Thierry <michel.thierry@intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>

I'm still not totally sold on having the DRM_ERROR here improves
debugging; it doesn't do anything to improve error handling, but

Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>

nevertheless.
-Chris
Chris Wilson May 31, 2018, 6:23 p.m. UTC | #6
Quoting Chris Wilson (2018-05-31 19:13:32)
> Quoting Michal Wajdeczko (2018-05-28 18:16:18)
> > SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host and
> > those events are now handled by intel_guc_to_host_event_handler_mmio().
> > 
> > We should not try to read it on MMIO action error as 1) we may be using
> > different set of registers for GuC MMIO communication, and 2) GuC may
> > use CTB mechanism for sending events to host.
> > 
> > While here, upgrade error message to DRM_ERROR.
> > 
> > Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
> > Cc: Michel Thierry <michel.thierry@intel.com>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> 
> I'm still not totally sold on having the DRM_ERROR here improves
> debugging; it doesn't do anything to improve error handling, but
> 
> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
> 
> nevertheless.

And pushed. Thank you for the patch,
-Chris
Chris Wilson July 12, 2018, 3:31 p.m. UTC | #7
Quoting Chris Wilson (2018-05-29 15:54:12)
> Quoting Michal Wajdeczko (2018-05-28 18:16:18)
> > SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host and
> > those events are now handled by intel_guc_to_host_event_handler_mmio().
> > 
> > We should not try to read it on MMIO action error as 1) we may be using
> > different set of registers for GuC MMIO communication, and 2) GuC may
> > use CTB mechanism for sending events to host.
> 
> Ok.
>  
> > While here, upgrade error message to DRM_ERROR.
> 
> Does the error help? What do you want to convey to the user? For error
> handling, we want to propagate the result back anyway for the caller has
> to decide what to do next.

Good news! We see the error in BAT,

[  542.138479] i915: unknown parameter 'enable_guc_loading' ignored
[  542.138483] i915: unknown parameter 'enable_guc_submission' ignored
[  542.138485] Setting dangerous option enable_guc - tainting kernel
[  542.138488] Setting dangerous option live_selftests - tainting kernel
[  542.173291] [drm:intel_guc_send_mmio [i915]] *ERROR* MMIO: GuC action 0x10 failed with error -5 0xf000f000
[  542.367055] i915: probe of 0000:00:02.0 failed with error -25

And Michał reminded me this wasn't the first time...

commit feb06c151fade9ecaa3dd410d792cce26e8b10de
Author: Michał Winiarski <michal.winiarski@intel.com>
Date:   Mon Mar 19 10:53:47 2018 +0100

    drm/i915/guc: Demote GuC error messages
    
    We're using those functions in selftests, and the callers are expected
    to do the error handling anyways. Let's demote all GuC actions and
    doorbell creation to DEBUG_DRIVER.

So do we kindly ask Michał to resubmit his fix?
-Chris
Michal Wajdeczko July 12, 2018, 5:29 p.m. UTC | #8
On Thu, 12 Jul 2018 17:31:14 +0200, Chris Wilson  
<chris@chris-wilson.co.uk> wrote:

> Quoting Chris Wilson (2018-05-29 15:54:12)
>> Quoting Michal Wajdeczko (2018-05-28 18:16:18)
>> > SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host  
>> and
>> > those events are now handled by  
>> intel_guc_to_host_event_handler_mmio().
>> >
>> > We should not try to read it on MMIO action error as 1) we may be  
>> using
>> > different set of registers for GuC MMIO communication, and 2) GuC may
>> > use CTB mechanism for sending events to host.
>>
>> Ok.
>>
>> > While here, upgrade error message to DRM_ERROR.
>>
>> Does the error help? What do you want to convey to the user? For error
>> handling, we want to propagate the result back anyway for the caller has
>> to decide what to do next.
>
> Good news! We see the error in BAT,
>
> [  542.138479] i915: unknown parameter 'enable_guc_loading' ignored
> [  542.138483] i915: unknown parameter 'enable_guc_submission' ignored
> [  542.138485] Setting dangerous option enable_guc - tainting kernel
> [  542.138488] Setting dangerous option live_selftests - tainting kernel
> [  542.173291] [drm:intel_guc_send_mmio [i915]] *ERROR* MMIO: GuC action  
> 0x10 failed with error -5 0xf000f000
> [  542.367055] i915: probe of 0000:00:02.0 failed with error -25
>
> And Michał reminded me this wasn't the first time...
>
> commit feb06c151fade9ecaa3dd410d792cce26e8b10de
> Author: Michał Winiarski <michal.winiarski@intel.com>
> Date:   Mon Mar 19 10:53:47 2018 +0100
>
>     drm/i915/guc: Demote GuC error messages
>    We're using those functions in selftests, and the callers are expected
>     to do the error handling anyways. Let's demote all GuC actions and
>     doorbell creation to DEBUG_DRIVER.
>
> So do we kindly ask Michał to resubmit his fix?

There are more places where DRM_ERROR is used after detection GuC error
(see intel_guc_send_ct as example, more to show up shortly)

I would rather prefer to add GUC_ERROR macro that could be tweaked
under SELFTEST config and runtime flags to demote unwanted errors:

#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
#define GUC_ERROR(...) \
	do { \
		if (unlikely(i915_selftest.mock || i915_selftest.live)) \
			DRM_DEBUG_DRIVER(__VA_ARGS__); \
		else \
			DRM_ERROR(__VA_ARGS__);
	} while (0);
#else
#define GUC_ERROR(...) DRM_ERROR(__VA_ARGS__)
#endif

/Michal
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/intel_guc.c b/drivers/gpu/drm/i915/intel_guc.c
index 116f4cc..e28a996 100644
--- a/drivers/gpu/drm/i915/intel_guc.c
+++ b/drivers/gpu/drm/i915/intel_guc.c
@@ -346,10 +346,8 @@  int intel_guc_send_mmio(struct intel_guc *guc, const u32 *action, u32 len,
 		ret = -EIO;
 
 	if (ret) {
-		DRM_DEBUG_DRIVER("INTEL_GUC_SEND: Action 0x%X failed;"
-				 " ret=%d status=0x%08X response=0x%08X\n",
-				 action[0], ret, status,
-				 I915_READ(SOFT_SCRATCH(15)));
+		DRM_ERROR("MMIO: GuC action %#x failed with error %d %#x\n",
+			  action[0], ret, status);
 		goto out;
 	}