diff mbox

[3/5] drm/i915/guc: Use correct error code for GuC initialization failure

Message ID 20180131173241.19704-3-michal.wajdeczko@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Michal Wajdeczko Jan. 31, 2018, 5:32 p.m. UTC
Since commit 6ca9a2beb5 ("drm/i915: Unwind i915_gem_init() failure")
we believed that we correctly handle all errors encountered during
GuC initialization, including special one that indicates request to
run driver with disabled GPU submission (-EIO).

Unfortunately since commit 121981fafe ("drm/i915/guc: Combine
enable_guc_loading|submission modparams") we stopped using that
error code to avoid unwanted fallback to execlist submission mode.

In result any GuC initialization failure was treated as non-recoverable
error leading to driver load abort, so we could not even read related
GuC error log to investigate cause of the problem.

Fix that by always returning -EIO on uC hardware related failure.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michal Winiarski <michal.winiarski@intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Sagar Arun Kamble <sagar.a.kamble@intel.com>
---
 drivers/gpu/drm/i915/intel_uc.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

Comments

sagar.a.kamble@intel.com Feb. 1, 2018, 11:48 a.m. UTC | #1
On 1/31/2018 11:02 PM, Michal Wajdeczko wrote:
> Since commit 6ca9a2beb5 ("drm/i915: Unwind i915_gem_init() failure")
> we believed that we correctly handle all errors encountered during
> GuC initialization, including special one that indicates request to
> run driver with disabled GPU submission (-EIO).
>
> Unfortunately since commit 121981fafe ("drm/i915/guc: Combine
> enable_guc_loading|submission modparams") we stopped using that
> error code to avoid unwanted fallback to execlist submission mode.
>
> In result any GuC initialization failure was treated as non-recoverable
> error leading to driver load abort, so we could not even read related
> GuC error log to investigate cause of the problem.
>
> Fix that by always returning -EIO on uC hardware related failure.
>
> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michal Winiarski <michal.winiarski@intel.com>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Cc: Sagar Arun Kamble <sagar.a.kamble@intel.com>
> ---
>   drivers/gpu/drm/i915/intel_uc.c | 5 +----
>   1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_uc.c b/drivers/gpu/drm/i915/intel_uc.c
> index 3a13cbb..1547e16 100644
> --- a/drivers/gpu/drm/i915/intel_uc.c
> +++ b/drivers/gpu/drm/i915/intel_uc.c
> @@ -432,11 +432,8 @@ int intel_uc_init_hw(struct drm_i915_private *dev_priv)
>   	 * Note that there is no fallback as either user explicitly asked for
>   	 * the GuC or driver default option was to run with the GuC enabled.
>   	 */
> -	if (GEM_WARN_ON(ret == -EIO))
> -		ret = -EINVAL;
> -
>   	dev_err(dev_priv->drm.dev, "GuC initialization failed %d\n", ret);
> -	return ret;
> +	return -EIO; /* We want to disable GPU submission but keep KMS alive */
>   }
We'll need to avoid uc_fini_hw/ in unload path since GPU will be marked 
wedged after this?
>   
>   void intel_uc_fini_hw(struct drm_i915_private *dev_priv)
Chris Wilson Feb. 1, 2018, 12:35 p.m. UTC | #2
Quoting Sagar Arun Kamble (2018-02-01 11:48:54)
> 
> 
> On 1/31/2018 11:02 PM, Michal Wajdeczko wrote:
> > Since commit 6ca9a2beb5 ("drm/i915: Unwind i915_gem_init() failure")
> > we believed that we correctly handle all errors encountered during
> > GuC initialization, including special one that indicates request to
> > run driver with disabled GPU submission (-EIO).
> >
> > Unfortunately since commit 121981fafe ("drm/i915/guc: Combine
> > enable_guc_loading|submission modparams") we stopped using that
> > error code to avoid unwanted fallback to execlist submission mode.
> >
> > In result any GuC initialization failure was treated as non-recoverable
> > error leading to driver load abort, so we could not even read related
> > GuC error log to investigate cause of the problem.
> >
> > Fix that by always returning -EIO on uC hardware related failure.
> >
> > Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Michal Winiarski <michal.winiarski@intel.com>
> > Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> > Cc: Sagar Arun Kamble <sagar.a.kamble@intel.com>
> > ---
> >   drivers/gpu/drm/i915/intel_uc.c | 5 +----
> >   1 file changed, 1 insertion(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/intel_uc.c b/drivers/gpu/drm/i915/intel_uc.c
> > index 3a13cbb..1547e16 100644
> > --- a/drivers/gpu/drm/i915/intel_uc.c
> > +++ b/drivers/gpu/drm/i915/intel_uc.c
> > @@ -432,11 +432,8 @@ int intel_uc_init_hw(struct drm_i915_private *dev_priv)
> >        * Note that there is no fallback as either user explicitly asked for
> >        * the GuC or driver default option was to run with the GuC enabled.
> >        */
> > -     if (GEM_WARN_ON(ret == -EIO))
> > -             ret = -EINVAL;
> > -
> >       dev_err(dev_priv->drm.dev, "GuC initialization failed %d\n", ret);
> > -     return ret;
> > +     return -EIO; /* We want to disable GPU submission but keep KMS alive */
> >   }
> We'll need to avoid uc_fini_hw/ in unload path since GPU will be marked 
> wedged after this?

We should be avoiding uc_fini_hw in unload because we haven't completed
uc_init_hw. Such failure should already be handled?
-Chris
Michal Wajdeczko Feb. 1, 2018, 2:23 p.m. UTC | #3
On Thu, 01 Feb 2018 13:35:15 +0100, Chris Wilson  
<chris@chris-wilson.co.uk> wrote:

> Quoting Sagar Arun Kamble (2018-02-01 11:48:54)
>>
>>
>> On 1/31/2018 11:02 PM, Michal Wajdeczko wrote:
>> > Since commit 6ca9a2beb5 ("drm/i915: Unwind i915_gem_init() failure")
>> > we believed that we correctly handle all errors encountered during
>> > GuC initialization, including special one that indicates request to
>> > run driver with disabled GPU submission (-EIO).
>> >
>> > Unfortunately since commit 121981fafe ("drm/i915/guc: Combine
>> > enable_guc_loading|submission modparams") we stopped using that
>> > error code to avoid unwanted fallback to execlist submission mode.
>> >
>> > In result any GuC initialization failure was treated as  
>> non-recoverable
>> > error leading to driver load abort, so we could not even read related
>> > GuC error log to investigate cause of the problem.
>> >
>> > Fix that by always returning -EIO on uC hardware related failure.
>> >
>> > Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
>> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> > Cc: Michal Winiarski <michal.winiarski@intel.com>
>> > Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>> > Cc: Sagar Arun Kamble <sagar.a.kamble@intel.com>
>> > ---
>> >   drivers/gpu/drm/i915/intel_uc.c | 5 +----
>> >   1 file changed, 1 insertion(+), 4 deletions(-)
>> >
>> > diff --git a/drivers/gpu/drm/i915/intel_uc.c  
>> b/drivers/gpu/drm/i915/intel_uc.c
>> > index 3a13cbb..1547e16 100644
>> > --- a/drivers/gpu/drm/i915/intel_uc.c
>> > +++ b/drivers/gpu/drm/i915/intel_uc.c
>> > @@ -432,11 +432,8 @@ int intel_uc_init_hw(struct drm_i915_private  
>> *dev_priv)
>> >        * Note that there is no fallback as either user explicitly  
>> asked for
>> >        * the GuC or driver default option was to run with the GuC  
>> enabled.
>> >        */
>> > -     if (GEM_WARN_ON(ret == -EIO))
>> > -             ret = -EINVAL;
>> > -
>> >       dev_err(dev_priv->drm.dev, "GuC initialization failed %d\n",  
>> ret);
>> > -     return ret;
>> > +     return -EIO; /* We want to disable GPU submission but keep KMS  
>> alive */
>> >   }
>> We'll need to avoid uc_fini_hw/ in unload path since GPU will be marked
>> wedged after this?
>
> We should be avoiding uc_fini_hw in unload because we haven't completed
> uc_init_hw. Such failure should already be handled?

It's even worst. In case of uc_init_hw() failure we always call uc_fini()
and uc_fini_misc() regardless of EIO, and then in case of running
in wedged more we will try to call them again in unload path ...

While it is easy to postpone uc_fini[_misc] calls to unload stage,
it is tricky to avoid calling uc_fini_hw there without adding extra check
(earlier there was discussion that we should not have any conditions
in _fini functions)

Any ideas how to solve that ?

Michal
sagar.a.kamble@intel.com Feb. 1, 2018, 3 p.m. UTC | #4
On 2/1/2018 7:53 PM, Michal Wajdeczko wrote:
> On Thu, 01 Feb 2018 13:35:15 +0100, Chris Wilson 
> <chris@chris-wilson.co.uk> wrote:
>
>> Quoting Sagar Arun Kamble (2018-02-01 11:48:54)
>>>
>>>
>>> On 1/31/2018 11:02 PM, Michal Wajdeczko wrote:
>>> > Since commit 6ca9a2beb5 ("drm/i915: Unwind i915_gem_init() failure")
>>> > we believed that we correctly handle all errors encountered during
>>> > GuC initialization, including special one that indicates request to
>>> > run driver with disabled GPU submission (-EIO).
>>> >
>>> > Unfortunately since commit 121981fafe ("drm/i915/guc: Combine
>>> > enable_guc_loading|submission modparams") we stopped using that
>>> > error code to avoid unwanted fallback to execlist submission mode.
>>> >
>>> > In result any GuC initialization failure was treated as 
>>> non-recoverable
>>> > error leading to driver load abort, so we could not even read related
>>> > GuC error log to investigate cause of the problem.
>>> >
>>> > Fix that by always returning -EIO on uC hardware related failure.
>>> >
>>> > Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
>>> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>> > Cc: Michal Winiarski <michal.winiarski@intel.com>
>>> > Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>>> > Cc: Sagar Arun Kamble <sagar.a.kamble@intel.com>
>>> > ---
>>> >   drivers/gpu/drm/i915/intel_uc.c | 5 +----
>>> >   1 file changed, 1 insertion(+), 4 deletions(-)
>>> >
>>> > diff --git a/drivers/gpu/drm/i915/intel_uc.c 
>>> b/drivers/gpu/drm/i915/intel_uc.c
>>> > index 3a13cbb..1547e16 100644
>>> > --- a/drivers/gpu/drm/i915/intel_uc.c
>>> > +++ b/drivers/gpu/drm/i915/intel_uc.c
>>> > @@ -432,11 +432,8 @@ int intel_uc_init_hw(struct drm_i915_private 
>>> *dev_priv)
>>> >        * Note that there is no fallback as either user explicitly 
>>> asked for
>>> >        * the GuC or driver default option was to run with the GuC 
>>> enabled.
>>> >        */
>>> > -     if (GEM_WARN_ON(ret == -EIO))
>>> > -             ret = -EINVAL;
>>> > -
>>> >       dev_err(dev_priv->drm.dev, "GuC initialization failed %d\n", 
>>> ret);
>>> > -     return ret;
>>> > +     return -EIO; /* We want to disable GPU submission but keep 
>>> KMS alive */
>>> >   }
>>> We'll need to avoid uc_fini_hw/ in unload path since GPU will be marked
>>> wedged after this?
>>
>> We should be avoiding uc_fini_hw in unload because we haven't completed
>> uc_init_hw. Such failure should already be handled?
>
> It's even worst. In case of uc_init_hw() failure we always call uc_fini()
> and uc_fini_misc() regardless of EIO, and then in case of running
> in wedged more we will try to call them again in unload path ...
>
> While it is easy to postpone uc_fini[_misc] calls to unload stage,
> it is tricky to avoid calling uc_fini_hw there without adding extra check
> (earlier there was discussion that we should not have any conditions
> in _fini functions)
>
> Any ideas how to solve that ?
>
Earlier we could rely on status of enable_guc_loading/submission. Now I 
think we need to
have status per uc, uc_misc, uc_hw as many internal steps can't rely on 
like pointer check.
> Michal
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/intel_uc.c b/drivers/gpu/drm/i915/intel_uc.c
index 3a13cbb..1547e16 100644
--- a/drivers/gpu/drm/i915/intel_uc.c
+++ b/drivers/gpu/drm/i915/intel_uc.c
@@ -432,11 +432,8 @@  int intel_uc_init_hw(struct drm_i915_private *dev_priv)
 	 * Note that there is no fallback as either user explicitly asked for
 	 * the GuC or driver default option was to run with the GuC enabled.
 	 */
-	if (GEM_WARN_ON(ret == -EIO))
-		ret = -EINVAL;
-
 	dev_err(dev_priv->drm.dev, "GuC initialization failed %d\n", ret);
-	return ret;
+	return -EIO; /* We want to disable GPU submission but keep KMS alive */
 }
 
 void intel_uc_fini_hw(struct drm_i915_private *dev_priv)