Message ID | 20180131173241.19704-3-michal.wajdeczko@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 1/31/2018 11:02 PM, Michal Wajdeczko wrote: > Since commit 6ca9a2beb5 ("drm/i915: Unwind i915_gem_init() failure") > we believed that we correctly handle all errors encountered during > GuC initialization, including special one that indicates request to > run driver with disabled GPU submission (-EIO). > > Unfortunately since commit 121981fafe ("drm/i915/guc: Combine > enable_guc_loading|submission modparams") we stopped using that > error code to avoid unwanted fallback to execlist submission mode. > > In result any GuC initialization failure was treated as non-recoverable > error leading to driver load abort, so we could not even read related > GuC error log to investigate cause of the problem. > > Fix that by always returning -EIO on uC hardware related failure. > > Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> > Cc: Chris Wilson <chris@chris-wilson.co.uk> > Cc: Michal Winiarski <michal.winiarski@intel.com> > Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> > Cc: Sagar Arun Kamble <sagar.a.kamble@intel.com> > --- > drivers/gpu/drm/i915/intel_uc.c | 5 +---- > 1 file changed, 1 insertion(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/i915/intel_uc.c b/drivers/gpu/drm/i915/intel_uc.c > index 3a13cbb..1547e16 100644 > --- a/drivers/gpu/drm/i915/intel_uc.c > +++ b/drivers/gpu/drm/i915/intel_uc.c > @@ -432,11 +432,8 @@ int intel_uc_init_hw(struct drm_i915_private *dev_priv) > * Note that there is no fallback as either user explicitly asked for > * the GuC or driver default option was to run with the GuC enabled. > */ > - if (GEM_WARN_ON(ret == -EIO)) > - ret = -EINVAL; > - > dev_err(dev_priv->drm.dev, "GuC initialization failed %d\n", ret); > - return ret; > + return -EIO; /* We want to disable GPU submission but keep KMS alive */ > } We'll need to avoid uc_fini_hw/ in unload path since GPU will be marked wedged after this? > > void intel_uc_fini_hw(struct drm_i915_private *dev_priv)
Quoting Sagar Arun Kamble (2018-02-01 11:48:54) > > > On 1/31/2018 11:02 PM, Michal Wajdeczko wrote: > > Since commit 6ca9a2beb5 ("drm/i915: Unwind i915_gem_init() failure") > > we believed that we correctly handle all errors encountered during > > GuC initialization, including special one that indicates request to > > run driver with disabled GPU submission (-EIO). > > > > Unfortunately since commit 121981fafe ("drm/i915/guc: Combine > > enable_guc_loading|submission modparams") we stopped using that > > error code to avoid unwanted fallback to execlist submission mode. > > > > In result any GuC initialization failure was treated as non-recoverable > > error leading to driver load abort, so we could not even read related > > GuC error log to investigate cause of the problem. > > > > Fix that by always returning -EIO on uC hardware related failure. > > > > Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> > > Cc: Chris Wilson <chris@chris-wilson.co.uk> > > Cc: Michal Winiarski <michal.winiarski@intel.com> > > Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> > > Cc: Sagar Arun Kamble <sagar.a.kamble@intel.com> > > --- > > drivers/gpu/drm/i915/intel_uc.c | 5 +---- > > 1 file changed, 1 insertion(+), 4 deletions(-) > > > > diff --git a/drivers/gpu/drm/i915/intel_uc.c b/drivers/gpu/drm/i915/intel_uc.c > > index 3a13cbb..1547e16 100644 > > --- a/drivers/gpu/drm/i915/intel_uc.c > > +++ b/drivers/gpu/drm/i915/intel_uc.c > > @@ -432,11 +432,8 @@ int intel_uc_init_hw(struct drm_i915_private *dev_priv) > > * Note that there is no fallback as either user explicitly asked for > > * the GuC or driver default option was to run with the GuC enabled. > > */ > > - if (GEM_WARN_ON(ret == -EIO)) > > - ret = -EINVAL; > > - > > dev_err(dev_priv->drm.dev, "GuC initialization failed %d\n", ret); > > - return ret; > > + return -EIO; /* We want to disable GPU submission but keep KMS alive */ > > } > We'll need to avoid uc_fini_hw/ in unload path since GPU will be marked > wedged after this? We should be avoiding uc_fini_hw in unload because we haven't completed uc_init_hw. Such failure should already be handled? -Chris
On Thu, 01 Feb 2018 13:35:15 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote: > Quoting Sagar Arun Kamble (2018-02-01 11:48:54) >> >> >> On 1/31/2018 11:02 PM, Michal Wajdeczko wrote: >> > Since commit 6ca9a2beb5 ("drm/i915: Unwind i915_gem_init() failure") >> > we believed that we correctly handle all errors encountered during >> > GuC initialization, including special one that indicates request to >> > run driver with disabled GPU submission (-EIO). >> > >> > Unfortunately since commit 121981fafe ("drm/i915/guc: Combine >> > enable_guc_loading|submission modparams") we stopped using that >> > error code to avoid unwanted fallback to execlist submission mode. >> > >> > In result any GuC initialization failure was treated as >> non-recoverable >> > error leading to driver load abort, so we could not even read related >> > GuC error log to investigate cause of the problem. >> > >> > Fix that by always returning -EIO on uC hardware related failure. >> > >> > Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> >> > Cc: Chris Wilson <chris@chris-wilson.co.uk> >> > Cc: Michal Winiarski <michal.winiarski@intel.com> >> > Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> >> > Cc: Sagar Arun Kamble <sagar.a.kamble@intel.com> >> > --- >> > drivers/gpu/drm/i915/intel_uc.c | 5 +---- >> > 1 file changed, 1 insertion(+), 4 deletions(-) >> > >> > diff --git a/drivers/gpu/drm/i915/intel_uc.c >> b/drivers/gpu/drm/i915/intel_uc.c >> > index 3a13cbb..1547e16 100644 >> > --- a/drivers/gpu/drm/i915/intel_uc.c >> > +++ b/drivers/gpu/drm/i915/intel_uc.c >> > @@ -432,11 +432,8 @@ int intel_uc_init_hw(struct drm_i915_private >> *dev_priv) >> > * Note that there is no fallback as either user explicitly >> asked for >> > * the GuC or driver default option was to run with the GuC >> enabled. >> > */ >> > - if (GEM_WARN_ON(ret == -EIO)) >> > - ret = -EINVAL; >> > - >> > dev_err(dev_priv->drm.dev, "GuC initialization failed %d\n", >> ret); >> > - return ret; >> > + return -EIO; /* We want to disable GPU submission but keep KMS >> alive */ >> > } >> We'll need to avoid uc_fini_hw/ in unload path since GPU will be marked >> wedged after this? > > We should be avoiding uc_fini_hw in unload because we haven't completed > uc_init_hw. Such failure should already be handled? It's even worst. In case of uc_init_hw() failure we always call uc_fini() and uc_fini_misc() regardless of EIO, and then in case of running in wedged more we will try to call them again in unload path ... While it is easy to postpone uc_fini[_misc] calls to unload stage, it is tricky to avoid calling uc_fini_hw there without adding extra check (earlier there was discussion that we should not have any conditions in _fini functions) Any ideas how to solve that ? Michal
On 2/1/2018 7:53 PM, Michal Wajdeczko wrote: > On Thu, 01 Feb 2018 13:35:15 +0100, Chris Wilson > <chris@chris-wilson.co.uk> wrote: > >> Quoting Sagar Arun Kamble (2018-02-01 11:48:54) >>> >>> >>> On 1/31/2018 11:02 PM, Michal Wajdeczko wrote: >>> > Since commit 6ca9a2beb5 ("drm/i915: Unwind i915_gem_init() failure") >>> > we believed that we correctly handle all errors encountered during >>> > GuC initialization, including special one that indicates request to >>> > run driver with disabled GPU submission (-EIO). >>> > >>> > Unfortunately since commit 121981fafe ("drm/i915/guc: Combine >>> > enable_guc_loading|submission modparams") we stopped using that >>> > error code to avoid unwanted fallback to execlist submission mode. >>> > >>> > In result any GuC initialization failure was treated as >>> non-recoverable >>> > error leading to driver load abort, so we could not even read related >>> > GuC error log to investigate cause of the problem. >>> > >>> > Fix that by always returning -EIO on uC hardware related failure. >>> > >>> > Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> >>> > Cc: Chris Wilson <chris@chris-wilson.co.uk> >>> > Cc: Michal Winiarski <michal.winiarski@intel.com> >>> > Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> >>> > Cc: Sagar Arun Kamble <sagar.a.kamble@intel.com> >>> > --- >>> > drivers/gpu/drm/i915/intel_uc.c | 5 +---- >>> > 1 file changed, 1 insertion(+), 4 deletions(-) >>> > >>> > diff --git a/drivers/gpu/drm/i915/intel_uc.c >>> b/drivers/gpu/drm/i915/intel_uc.c >>> > index 3a13cbb..1547e16 100644 >>> > --- a/drivers/gpu/drm/i915/intel_uc.c >>> > +++ b/drivers/gpu/drm/i915/intel_uc.c >>> > @@ -432,11 +432,8 @@ int intel_uc_init_hw(struct drm_i915_private >>> *dev_priv) >>> > * Note that there is no fallback as either user explicitly >>> asked for >>> > * the GuC or driver default option was to run with the GuC >>> enabled. >>> > */ >>> > - if (GEM_WARN_ON(ret == -EIO)) >>> > - ret = -EINVAL; >>> > - >>> > dev_err(dev_priv->drm.dev, "GuC initialization failed %d\n", >>> ret); >>> > - return ret; >>> > + return -EIO; /* We want to disable GPU submission but keep >>> KMS alive */ >>> > } >>> We'll need to avoid uc_fini_hw/ in unload path since GPU will be marked >>> wedged after this? >> >> We should be avoiding uc_fini_hw in unload because we haven't completed >> uc_init_hw. Such failure should already be handled? > > It's even worst. In case of uc_init_hw() failure we always call uc_fini() > and uc_fini_misc() regardless of EIO, and then in case of running > in wedged more we will try to call them again in unload path ... > > While it is easy to postpone uc_fini[_misc] calls to unload stage, > it is tricky to avoid calling uc_fini_hw there without adding extra check > (earlier there was discussion that we should not have any conditions > in _fini functions) > > Any ideas how to solve that ? > Earlier we could rely on status of enable_guc_loading/submission. Now I think we need to have status per uc, uc_misc, uc_hw as many internal steps can't rely on like pointer check. > Michal
diff --git a/drivers/gpu/drm/i915/intel_uc.c b/drivers/gpu/drm/i915/intel_uc.c index 3a13cbb..1547e16 100644 --- a/drivers/gpu/drm/i915/intel_uc.c +++ b/drivers/gpu/drm/i915/intel_uc.c @@ -432,11 +432,8 @@ int intel_uc_init_hw(struct drm_i915_private *dev_priv) * Note that there is no fallback as either user explicitly asked for * the GuC or driver default option was to run with the GuC enabled. */ - if (GEM_WARN_ON(ret == -EIO)) - ret = -EINVAL; - dev_err(dev_priv->drm.dev, "GuC initialization failed %d\n", ret); - return ret; + return -EIO; /* We want to disable GPU submission but keep KMS alive */ } void intel_uc_fini_hw(struct drm_i915_private *dev_priv)
Since commit 6ca9a2beb5 ("drm/i915: Unwind i915_gem_init() failure") we believed that we correctly handle all errors encountered during GuC initialization, including special one that indicates request to run driver with disabled GPU submission (-EIO). Unfortunately since commit 121981fafe ("drm/i915/guc: Combine enable_guc_loading|submission modparams") we stopped using that error code to avoid unwanted fallback to execlist submission mode. In result any GuC initialization failure was treated as non-recoverable error leading to driver load abort, so we could not even read related GuC error log to investigate cause of the problem. Fix that by always returning -EIO on uC hardware related failure. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Michal Winiarski <michal.winiarski@intel.com> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Sagar Arun Kamble <sagar.a.kamble@intel.com> --- drivers/gpu/drm/i915/intel_uc.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-)