Message ID | 20171208011720.5553-1-chris@chris-wilson.co.uk (mailing list archive) |
---|---|
State | New, archived |
On Fri, Dec 08, 2017 at 01:17:20AM +0000, Chris Wilson wrote:
> Since Michal introduced new errors other than -EIO during
> i915_gem_init(), we need to actually unwind on the error path as we have
> to abort the module load (and we expect to do so cleanly!).
>
> As we now teardown key state and then mark the driver as wedged (on
> EIO), we have to be careful to not allow ourselves to resume and
> unwedge, thus attempting to use the uninitialised driver.
>
> v2: Try not to free driver state for the suppressed EIO
>
> References: 8620eb1dbbf2 ("drm/i915/uc: Don't use -EIO to report missing firmware")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Sagar Arun Kamble <sagar.a.kamble@intel.com>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/i915_gem.c | 82 +++++++++++++++++++++++++++++++++--------
>  1 file changed, 67 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index c7b5db78fbb4..ee243e1ef706 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c

[SNIP]

> +err_ggtt:
> +err_unlock:

So... Just unlock? :>

Does what it says on the tin (fixing WARN_ON galore on guc load failure):

Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>

-Michał

> +	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
> +	mutex_unlock(&dev_priv->drm.struct_mutex);
> +
> +	if (ret != -EIO)
> +		i915_gem_cleanup_userptr(dev_priv);
> +
>  	if (ret == -EIO) {
> -		/* Allow engine initialisation to fail by marking the GPU as
> +		/*
> +		 * Allow engine initialisation to fail by marking the GPU as
>  		 * wedged. But we only want to do this where the GPU is angry,
>  		 * for all other failure, such as an allocation failure, bail.
>  		 */
> @@ -5199,9 +5252,8 @@ int i915_gem_init(struct drm_i915_private *dev_priv)
>  		}
>  		ret = 0;
>  	}
> -	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
> -	mutex_unlock(&dev_priv->drm.struct_mutex);
>
> +	i915_gem_drain_freed_objects(dev_priv);
>  	return ret;
>  }
>
> --
> 2.15.1
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx

Quoting Michał Winiarski (2017-12-08 22:32:40)
> On Fri, Dec 08, 2017 at 01:17:20AM +0000, Chris Wilson wrote:
> > Since Michal introduced new errors other than -EIO during
> > i915_gem_init(), we need to actually unwind on the error path as we have
> > to abort the module load (and we expect to do so cleanly!).
> >
> > As we now teardown key state and then mark the driver as wedged (on
> > EIO), we have to be careful to not allow ourselves to resume and
> > unwedge, thus attempting to use the uninitialised driver.
> >
> > v2: Try not to free driver state for the suppressed EIO
> >
> > References: 8620eb1dbbf2 ("drm/i915/uc: Don't use -EIO to report missing firmware")
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > Cc: Sagar Arun Kamble <sagar.a.kamble@intel.com>
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > ---
> >  drivers/gpu/drm/i915/i915_gem.c | 82 +++++++++++++++++++++++++++++++++--------
> >  1 file changed, 67 insertions(+), 15 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > index c7b5db78fbb4..ee243e1ef706 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.c
> > +++ b/drivers/gpu/drm/i915/i915_gem.c
>
> [SNIP]
>
> > +err_ggtt:
> > +err_unlock:
>
> So... Just unlock? :>

Nothing to see here, please move along. I was caught by surprise that we
didn't have an immediate cleanup for err_ggtt.

>
> Does what it says on the tin (fixing WARN_ON galore on guc load failure):
>
> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>

But not in an elegant way, giving it the w/e to see if someone comes up
with a better way.

As a note to self, if we also have an

	if (i915_inject_load_failure())
		return -ENODEV;

then we will automatically exercise both failure methods. I say
automatically, except basic-reload-inject uses a hard-coded max number
of passes. Fantastic.
-Chris
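
Editorial sketch of the note to self above: with two injection checkpoints, one returning -ENODEV and one returning -EIO, stepping the injection target exercises both failure methods in turn. This is a minimal userspace illustration, not the i915 implementation. The names `inject_target`, `inject_count`, `inject_failure()`, `fake_load()` and `count_failing_passes()` are hypothetical, and the real `i915_inject_load_failure()` may behave differently.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins; not the i915 module parameter or helper. */
static int inject_target;	/* which checkpoint should fail; 0 disables */
static int inject_count;	/* checkpoints visited during this load */

/* Returns true only at the checkpoint selected by inject_target. */
static bool inject_failure(void)
{
	if (inject_target == 0)
		return false;
	return ++inject_count == inject_target;
}

/* A pretend module load with two checkpoints, one per failure method. */
int fake_load(int target)
{
	assert(target >= 0);
	inject_target = target;
	inject_count = 0;

	if (inject_failure())
		return -ENODEV;	/* abort the load and unwind fully */

	if (inject_failure())
		return -EIO;	/* in the patch: wedge the GPU, keep KMS */

	return 0;
}

/* Step the target upwards until a load succeeds, as a reload test might. */
int count_failing_passes(void)
{
	int passes = 0;

	while (fake_load(passes + 1) != 0)
		passes++;
	return passes;
}
```

A test loop written this way stops when the load first succeeds, so it does not need a hard-coded maximum number of passes. Whether basic-reload-inject can be restructured like this is an assumption.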

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index c7b5db78fbb4..ee243e1ef706 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3235,7 +3235,12 @@ bool i915_gem_unset_wedged(struct drm_i915_private *i915)
 	if (!test_bit(I915_WEDGED, &i915->gpu_error.flags))
 		return true;

-	/* Before unwedging, make sure that all pending operations
+	/* Never successfully initialised, so can not unwedge? */
+	if (!i915->kernel_context)
+		return false;
+
+	/*
+	 * Before unwedging, make sure that all pending operations
 	 * are flushed and errored out - we may have requests waiting upon
 	 * third party fences. We marked all inflight requests as EIO, and
 	 * every execbuf since returned EIO, for consistency we want all
@@ -4853,7 +4858,8 @@ void i915_gem_resume(struct drm_i915_private *i915)
 	i915_gem_restore_gtt_mappings(i915);
 	i915_gem_restore_fences(i915);

-	/* As we didn't flush the kernel context before suspend, we cannot
+	/*
+	 * As we didn't flush the kernel context before suspend, we cannot
 	 * guarantee that the context image is complete. So let's just reset
 	 * it and start again.
 	 */
@@ -4874,8 +4880,10 @@ void i915_gem_resume(struct drm_i915_private *i915)
 	return;

 err_wedged:
-	DRM_ERROR("failed to re-initialize GPU, declaring wedged!\n");
-	i915_gem_set_wedged(i915);
+	if (!i915_terminally_wedged(&i915->gpu_error)) {
+		DRM_ERROR("failed to re-initialize GPU, declaring wedged!\n");
+		i915_gem_set_wedged(i915);
+	}
 	goto out_unlock;
 }

@@ -5158,22 +5166,28 @@ int i915_gem_init(struct drm_i915_private *dev_priv)
 	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);

 	ret = i915_gem_init_ggtt(dev_priv);
-	if (ret)
-		goto out_unlock;
+	if (ret) {
+		GEM_BUG_ON(ret == -EIO);
+		goto err_unlock;
+	}

 	ret = i915_gem_contexts_init(dev_priv);
-	if (ret)
-		goto out_unlock;
+	if (ret) {
+		GEM_BUG_ON(ret == -EIO);
+		goto err_ggtt;
+	}

 	ret = intel_engines_init(dev_priv);
-	if (ret)
-		goto out_unlock;
+	if (ret) {
+		GEM_BUG_ON(ret == -EIO);
+		goto err_context;
+	}

 	intel_init_gt_powersave(dev_priv);

 	ret = i915_gem_init_hw(dev_priv);
 	if (ret)
-		goto out_unlock;
+		goto err_pm;

 	/*
 	 * Despite its name intel_init_clock_gating applies both display
@@ -5187,9 +5201,48 @@ int i915_gem_init(struct drm_i915_private *dev_priv)
 	intel_init_clock_gating(dev_priv);

 	ret = __intel_engines_record_defaults(dev_priv);
-out_unlock:
+	if (ret)
+		goto err_init_hw;
+
+	if (i915_inject_load_failure()) {
+		ret = -EIO;
+		goto err_init_hw;
+	}
+
+	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
+	mutex_unlock(&dev_priv->drm.struct_mutex);
+
+	return 0;
+
+	/*
+	 * Unwinding is complicated by that we want to handle -EIO to mean
+	 * disable GPU submission but keep KMS alive. We want to mark the
+	 * HW as irrevisibly wedged, but keep enough state around that the
+	 * driver doesn't explode during runtime.
+	 */
+err_init_hw:
+	i915_gem_wait_for_idle(dev_priv, I915_WAIT_LOCKED);
+	i915_gem_contexts_lost(dev_priv);
+	intel_uc_fini_hw(dev_priv);
+err_pm:
+	if (ret != -EIO) {
+		intel_cleanup_gt_powersave(dev_priv);
+		i915_gem_cleanup_engines(dev_priv);
+	}
+err_context:
+	if (ret != -EIO)
+		i915_gem_contexts_fini(dev_priv);
+err_ggtt:
+err_unlock:
+	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
+	mutex_unlock(&dev_priv->drm.struct_mutex);
+
+	if (ret != -EIO)
+		i915_gem_cleanup_userptr(dev_priv);
+
 	if (ret == -EIO) {
-		/* Allow engine initialisation to fail by marking the GPU as
+		/*
+		 * Allow engine initialisation to fail by marking the GPU as
 		 * wedged. But we only want to do this where the GPU is angry,
 		 * for all other failure, such as an allocation failure, bail.
 		 */
@@ -5199,9 +5252,8 @@ int i915_gem_init(struct drm_i915_private *dev_priv)
 		}
 		ret = 0;
 	}
-	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
-	mutex_unlock(&dev_priv->drm.struct_mutex);

+	i915_gem_drain_freed_objects(dev_priv);
 	return ret;
 }
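
Editorial sketch: the error path in i915_gem_init() combines a goto unwind ladder with a special case where -EIO keeps state alive, marks the GPU as wedged, and reports success. The standalone C below illustrates only that control flow, under simplifying assumptions. `struct fake_dev`, `stage()` and `fake_gem_init()` are hypothetical names and are not part of the driver. Unlike the patch, the sketch lets any stage return -EIO and gives err_ggtt a teardown step.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for driver state; not the real drm_i915_private. */
struct fake_dev {
	bool ggtt, contexts, engines;	/* which stages are currently set up */
	bool wedged;			/* GPU submission disabled */
	int fail_stage;			/* stage (1..3) that fails, 0 = none */
	int fail_err;			/* negative errno returned by that stage */
};

/* Run one init stage: fail if selected, otherwise mark it as set up. */
static int stage(struct fake_dev *d, int n, bool *flag)
{
	if (d->fail_stage == n)
		return d->fail_err;
	*flag = true;
	return 0;
}

int fake_gem_init(struct fake_dev *d)
{
	int ret;

	assert(d != NULL);

	ret = stage(d, 1, &d->ggtt);
	if (ret)
		goto err_unlock;

	ret = stage(d, 2, &d->contexts);
	if (ret)
		goto err_ggtt;

	ret = stage(d, 3, &d->engines);
	if (ret)
		goto err_context;

	return 0;

	/* Unwind in reverse order, but keep state when suppressing -EIO. */
err_context:
	if (ret != -EIO)
		d->contexts = false;
err_ggtt:
	if (ret != -EIO)
		d->ggtt = false;
err_unlock:
	if (ret == -EIO) {
		/* Disable submission, keep the rest alive, report success. */
		d->wedged = true;
		ret = 0;
	}
	return ret;
}
```

With `fail_err = -ENOMEM` every completed stage is torn down and the error is returned. With `fail_err = -EIO` the completed stages stay set up, `wedged` is set, and the function returns 0, which is the case the `kernel_context` check in i915_gem_unset_wedged() guards against on resume.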