diff mbox

drm/i915: Prevent machine hang from Broxton's vtd w/a and error capture

Message ID 20171206141903.1925-1-chris@chris-wilson.co.uk (mailing list archive)
State New, archived
Headers show

Commit Message

Chris Wilson Dec. 6, 2017, 2:19 p.m. UTC
Since capturing the error state requires fiddling around with the GGTT
to read arbitrary buffers and is itself run under stop_machine(), it
deadlocks the machine (effectively a hard hang) when run in conjunction
with Broxton's VTd workaround to serialize GGTT access.

Fixes: 0ef34ad6222a ("drm/i915: Serialize GTT/Aperture accesses on BXT")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Jon Bloomfield <jon.bloomfield@intel.com>
Cc: John Harrison <john.C.Harrison@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Daniel Vetter Dec. 6, 2017, 2:43 p.m. UTC | #1
On Wed, Dec 06, 2017 at 02:19:03PM +0000, Chris Wilson wrote:
> Since capturing the error state requires fiddling around with the GGTT
> to read arbitrary buffers and is itself run under stop_machine(), it
> deadlocks the machine (effectively a hard hang) when run in conjunction
> with Broxton's VTd workaround to serialize GGTT access.
> 
> Fixes: 0ef34ad6222a ("drm/i915: Serialize GTT/Aperture accesses on BXT")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Jon Bloomfield <jon.bloomfield@intel.com>
> Cc: John Harrison <john.C.Harrison@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gpu_error.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 48418fb81066..e6c7e8e53815 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -1813,6 +1813,10 @@ void i915_capture_error_state(struct drm_i915_private *dev_priv,
>  	if (!i915_modparams.error_capture)
>  		return;
>  
> +	/* Prevent recursively calling stop_machine() and deadlocking. */
> +	if (intel_ggtt_update_needs_vtd_wa(dev_priv))
> +		return;

I'd put this closer to the stop machine, at the head of
i915_capture_gpu_state(). If the bogus debug output annoys then we could
switch that to an PTR_ERR return value I guess. But I guess this here is
ok too, so either way:

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

> +
>  	if (READ_ONCE(dev_priv->gpu_error.first_error))
>  		return;
>  
> -- 
> 2.15.1
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Chris Wilson Dec. 6, 2017, 2:48 p.m. UTC | #2
Quoting Daniel Vetter (2017-12-06 14:43:39)
> On Wed, Dec 06, 2017 at 02:19:03PM +0000, Chris Wilson wrote:
> > Since capturing the error state requires fiddling around with the GGTT
> > to read arbitrary buffers and is itself run under stop_machine(), it
> > deadlocks the machine (effectively a hard hang) when run in conjunction
> > with Broxton's VTd workaround to serialize GGTT access.
> > 
> > Fixes: 0ef34ad6222a ("drm/i915: Serialize GTT/Aperture accesses on BXT")
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Jon Bloomfield <jon.bloomfield@intel.com>
> > Cc: John Harrison <john.C.Harrison@intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > ---
> >  drivers/gpu/drm/i915/i915_gpu_error.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> > index 48418fb81066..e6c7e8e53815 100644
> > --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> > +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> > @@ -1813,6 +1813,10 @@ void i915_capture_error_state(struct drm_i915_private *dev_priv,
> >       if (!i915_modparams.error_capture)
> >               return;
> >  
> > +     /* Prevent recursively calling stop_machine() and deadlocking. */
> > +     if (intel_ggtt_update_needs_vtd_wa(dev_priv))
> > +             return;
> 
> I'd put this closer to the stop machine, at the head of
> i915_capture_gpu_state(). If the bogus debug output annoys then we could
> switch that to an PTR_ERR return value I guess. But I guess this here is
> ok too, so either way:

I was considering doing some of the capture, skipping the buffers, but
nowadays those buffers tend to the crux of triaging. My only real concern
is how to explain to the user that the error state cannot exist, for 
which we could go and add -ENODEV to sysfs/debugfs just to be clear.
-Chris
Daniel Vetter Dec. 6, 2017, 2:51 p.m. UTC | #3
On Wed, Dec 06, 2017 at 02:48:36PM +0000, Chris Wilson wrote:
> Quoting Daniel Vetter (2017-12-06 14:43:39)
> > On Wed, Dec 06, 2017 at 02:19:03PM +0000, Chris Wilson wrote:
> > > Since capturing the error state requires fiddling around with the GGTT
> > > to read arbitrary buffers and is itself run under stop_machine(), it
> > > deadlocks the machine (effectively a hard hang) when run in conjunction
> > > with Broxton's VTd workaround to serialize GGTT access.
> > > 
> > > Fixes: 0ef34ad6222a ("drm/i915: Serialize GTT/Aperture accesses on BXT")
> > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > Cc: Jon Bloomfield <jon.bloomfield@intel.com>
> > > Cc: John Harrison <john.C.Harrison@intel.com>
> > > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/i915_gpu_error.c | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> > > index 48418fb81066..e6c7e8e53815 100644
> > > --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> > > +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> > > @@ -1813,6 +1813,10 @@ void i915_capture_error_state(struct drm_i915_private *dev_priv,
> > >       if (!i915_modparams.error_capture)
> > >               return;
> > >  
> > > +     /* Prevent recursively calling stop_machine() and deadlocking. */
> > > +     if (intel_ggtt_update_needs_vtd_wa(dev_priv))
> > > +             return;
> > 
> > I'd put this closer to the stop machine, at the head of
> > i915_capture_gpu_state(). If the bogus debug output annoys then we could
> > switch that to an PTR_ERR return value I guess. But I guess this here is
> > ok too, so either way:
> 
> I was considering doing some of the capture, skipping the buffers, but
> nowadays those buffers tend to the crux of triaging. My only real concern
> is how to explain to the user that the error state cannot exist, for 
> which we could go and add -ENODEV to sysfs/debugfs just to be clear.

Fancy idea: store ther PTR_ERR in ->first.error and return that? Would
address both my bikeshed and your suggestion.
-Daniel
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 48418fb81066..e6c7e8e53815 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1813,6 +1813,10 @@  void i915_capture_error_state(struct drm_i915_private *dev_priv,
 	if (!i915_modparams.error_capture)
 		return;
 
+	/* Prevent recursively calling stop_machine() and deadlocking. */
+	if (intel_ggtt_update_needs_vtd_wa(dev_priv))
+		return;
+
 	if (READ_ONCE(dev_priv->gpu_error.first_error))
 		return;