drm/i915/uc: Fini hw even if GuC is not running

Message ID	20190813162628.18531-1-fernando.pacheco@intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <intel-gfx-bounces@lists.freedesktop.org> From: Fernando Pacheco <fernando.pacheco@intel.com> To: intel-gfx@lists.freedesktop.org Date: Tue, 13 Aug 2019 09:26:28 -0700 Message-Id: <20190813162628.18531-1-fernando.pacheco@intel.com> MIME-Version: 1.0 Subject: [Intel-gfx] [PATCH] drm/i915/uc: Fini hw even if GuC is not running Precedence: list Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>
Series	drm/i915/uc: Fini hw even if GuC is not running \| expand drm/i915/uc: Fini hw even if GuC is not running

Message ID

20190813162628.18531-1-fernando.pacheco@intel.com (mailing list archive)

State

New, archived

Headers

From: Fernando Pacheco <fernando.pacheco@intel.com>
To: intel-gfx@lists.freedesktop.org
Date: Tue, 13 Aug 2019 09:26:28 -0700
Message-Id: <20190813162628.18531-1-fernando.pacheco@intel.com>
MIME-Version: 1.0
Subject: [Intel-gfx] [PATCH] drm/i915/uc: Fini hw even if GuC is not running
Precedence: list
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Errors-To: intel-gfx-bounces@lists.freedesktop.org
Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

Series

drm/i915/uc: Fini hw even if GuC is not running | expand

Commit Message

Fernando Pacheco Aug. 13, 2019, 4:26 p.m. UTC

We should not be skipping uc_fini_hw on finding GuC
is no longer running. There is plenty of hw and internal
state that can be cleaned up without having to communicate
with GuC.

Signed-off-by: Fernando Pacheco <fernando.pacheco@intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_uc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Daniele Ceraolo Spurio Aug. 13, 2019, 8:16 p.m. UTC | #1

On 8/13/19 9:26 AM, Fernando Pacheco wrote:
> We should not be skipping uc_fini_hw on finding GuC
> is no longer running. There is plenty of hw and internal
> state that can be cleaned up without having to communicate
> with GuC.
> 

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110943

> Signed-off-by: Fernando Pacheco <fernando.pacheco@intel.com>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_uc.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> index 0dc2b0cf4604..c698cddc14dc 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> @@ -521,7 +521,7 @@ void intel_uc_fini_hw(struct intel_uc *uc)
>   {
>   	struct intel_guc *guc = &uc->guc;
>   
> -	if (!intel_guc_is_running(guc))
> +	if (!intel_uc_supports_guc(uc))

Both calls below handle the case where GuC is already not running so 
we're safe, but now that we wedge when we hit a guc error we can also do 
something like:

	-EIO load error -> uc_fini_hw() -> wedge
and then
	unload -> uc_fini_hw()

it should be all be handled safely (the fault injection test is passing 
where we've run it), but we end up with "GuC communication disabled" 
multiple times in the logs:

https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13999/fi-skl-guc/igt@i915_module_load@reload-with-fault-injection.html

<6> [237.818905] [drm] GuC communication enabled
....
<6> [237.822739] i915 0000:00:02.0: [drm:__i915_inject_load_error 
[i915]] Injecting failure -5 at checkpoint 44 [i915_gem_init:1503]
<6> [237.840808] [drm] GuC communication disabled
...
<6> [238.160935] [drm] GuC communication disabled

Maybe return early from guc_disable_communication if the communication 
was already disabled?

Daniele

>   		return;
>   
>   	if (intel_uc_supports_guc_submission(uc))
>

Michal Wajdeczko Aug. 13, 2019, 8:18 p.m. UTC | #2

On Tue, 13 Aug 2019 18:26:28 +0200, Fernando Pacheco  
<fernando.pacheco@intel.com> wrote:

> We should not be skipping uc_fini_hw on finding GuC
> is no longer running. There is plenty of hw and internal
> state that can be cleaned up without having to communicate
> with GuC.
>
> Signed-off-by: Fernando Pacheco <fernando.pacheco@intel.com>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_uc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c  
> b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> index 0dc2b0cf4604..c698cddc14dc 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> @@ -521,7 +521,7 @@ void intel_uc_fini_hw(struct intel_uc *uc)
>  {
>  	struct intel_guc *guc = &uc->guc;
> -	if (!intel_guc_is_running(guc))
> +	if (!intel_uc_supports_guc(uc))

there is a huge difference between is_running vs supports_guc
and choosing supports_guc is optimist approach as we can fail
to fetch guc fw and abort early, so maybe

	if (!intel_uc_fw_is_available(&guc->fw))

would be closer to reality (assuming we don't fail on wopcm
(hmm, maybe we should force fw state to FAIL in such case?)

>  		return;
> 	if (intel_uc_supports_guc_submission(uc))

Fernando Pacheco Aug. 14, 2019, 2:20 p.m. UTC | #3

On 8/13/19 1:16 PM, Daniele Ceraolo Spurio wrote:
>
>
> On 8/13/19 9:26 AM, Fernando Pacheco wrote:
>> We should not be skipping uc_fini_hw on finding GuC
>> is no longer running. There is plenty of hw and internal
>> state that can be cleaned up without having to communicate
>> with GuC.
>>
>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110943
>
>> Signed-off-by: Fernando Pacheco <fernando.pacheco@intel.com>
>> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/uc/intel_uc.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
>> index 0dc2b0cf4604..c698cddc14dc 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
>> @@ -521,7 +521,7 @@ void intel_uc_fini_hw(struct intel_uc *uc)
>>   {
>>       struct intel_guc *guc = &uc->guc;
>>   -    if (!intel_guc_is_running(guc))
>> +    if (!intel_uc_supports_guc(uc))
>
> Both calls below handle the case where GuC is already not running so we're safe, but now that we wedge when we hit a guc error we can also do something like:
>
>     -EIO load error -> uc_fini_hw() -> wedge
> and then
>     unload -> uc_fini_hw()
>
> it should be all be handled safely (the fault injection test is passing where we've run it), but we end up with "GuC communication disabled" multiple times in the logs:
>
> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13999/fi-skl-guc/igt@i915_module_load@reload-with-fault-injection.html
>
> <6> [237.818905] [drm] GuC communication enabled
> ....
> <6> [237.822739] i915 0000:00:02.0: [drm:__i915_inject_load_error [i915]] Injecting failure -5 at checkpoint 44 [i915_gem_init:1503]
> <6> [237.840808] [drm] GuC communication disabled
> ...
> <6> [238.160935] [drm] GuC communication disabled
>
> Maybe return early from guc_disable_communication if the communication was already disabled?

As discussed offline, an early return might also require changes to the stop communication phase.
I'll work on this separately as for now the extra disable only results in the extra logging.

Thanks,
Fernando

>
>
> Daniele
>
>>           return;
>>         if (intel_uc_supports_guc_submission(uc))
>>

Fernando Pacheco Aug. 14, 2019, 2:33 p.m. UTC | #4

On 8/13/19 1:18 PM, Michal Wajdeczko wrote:
> On Tue, 13 Aug 2019 18:26:28 +0200, Fernando Pacheco <fernando.pacheco@intel.com> wrote:
>
>> We should not be skipping uc_fini_hw on finding GuC
>> is no longer running. There is plenty of hw and internal
>> state that can be cleaned up without having to communicate
>> with GuC.
>>
>> Signed-off-by: Fernando Pacheco <fernando.pacheco@intel.com>
>> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
>> ---
>>  drivers/gpu/drm/i915/gt/uc/intel_uc.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
>> index 0dc2b0cf4604..c698cddc14dc 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
>> @@ -521,7 +521,7 @@ void intel_uc_fini_hw(struct intel_uc *uc)
>>  {
>>      struct intel_guc *guc = &uc->guc;
>> -    if (!intel_guc_is_running(guc))
>> +    if (!intel_uc_supports_guc(uc))
>
> there is a huge difference between is_running vs supports_guc
> and choosing supports_guc is optimist approach as we can fail
> to fetch guc fw and abort early, so maybe
>
>     if (!intel_uc_fw_is_available(&guc->fw))

This is the better check. Thanks!

>
> would be closer to reality (assuming we don't fail on wopcm
> (hmm, maybe we should force fw state to FAIL in such case?)

That would make sense to me. The fail on wopcm directly
affects the state of the fw because we abort the upload
as a result.

Thanks,
Fernando

>
>
>>          return;
>>     if (intel_uc_supports_guc_submission(uc))

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
index 0dc2b0cf4604..c698cddc14dc 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
@@ -521,7 +521,7 @@  void intel_uc_fini_hw(struct intel_uc *uc)
 {
 	struct intel_guc *guc = &uc->guc;
 
-	if (!intel_guc_is_running(guc))
+	if (!intel_uc_supports_guc(uc))
 		return;
 
 	if (intel_uc_supports_guc_submission(uc))