diff mbox

drm/i915: Report an error when i915.reset prevents a reset

Message ID 1434624128-13379-1-git-send-email-chris@chris-wilson.co.uk (mailing list archive)
State New, archived
Headers show

Commit Message

Chris Wilson June 18, 2015, 10:42 a.m. UTC
If the user disables the GPU reset using the i915.reset parameter and
one occurs, report that we failed to reset the GPU. If we return early,
as we currently do, then we leave all state intact (with a hung GPU)
and clients block forever waiting for their requests to complete.

Testcase: igt/gem_eio
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_dma.c     | 1 -
 drivers/gpu/drm/i915/i915_drv.c     | 3 ---
 drivers/gpu/drm/i915/intel_uncore.c | 3 +++
 3 files changed, 3 insertions(+), 4 deletions(-)

Comments

Daniel Vetter June 22, 2015, 1:44 p.m. UTC | #1
On Thu, Jun 18, 2015 at 11:42:08AM +0100, Chris Wilson wrote:
> If the user disables the GPU reset using the i915.reset parameter and
> one occurs, report that we failed to reset the GPU. If we return early,
> as we currently do, then we leave all state intact (with a hung GPU)
> and clients block forever waiting for their requests to complete.
> 
> Testcase: igt/gem_eio
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/i915_dma.c     | 1 -
>  drivers/gpu/drm/i915/i915_drv.c     | 3 ---
>  drivers/gpu/drm/i915/intel_uncore.c | 3 +++
>  3 files changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> index 88795d2f1819..c5349fa3fcce 100644
> --- a/drivers/gpu/drm/i915/i915_dma.c
> +++ b/drivers/gpu/drm/i915/i915_dma.c
> @@ -165,7 +165,6 @@ static int i915_getparam(struct drm_device *dev, void *data,
>  		break;
>  	case I915_PARAM_HAS_GPU_RESET:
>  		value = i915.enable_hangcheck &&
> -			i915.reset &&
>  			intel_has_gpu_reset(dev);
>  		break;
>  	default:
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index 78ef0bb53c36..25ffe8afe744 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -863,9 +863,6 @@ int i915_reset(struct drm_device *dev)
>  	bool simulated;
>  	int ret;
>  
> -	if (!i915.reset)
> -		return 0;
> -
>  	intel_reset_gt_powersave(dev);
>  
>  	mutex_lock(&dev->struct_mutex);
> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index 4a86cf007aa0..f8e75def1a1d 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -1457,6 +1457,9 @@ static int gen6_do_reset(struct drm_device *dev)
>  
>  static int (*intel_get_gpu_reset(struct drm_device *dev))(struct drm_device *)
>  {
> +	if (!i915.reset)
> +		return NULL;

Maybe a special reset function which always returns -EIO and prints
something to illuminate the situation to dmesg? Otherwise even more wtf
material in bugzilla ...
-Daniel

> +
>  	if (INTEL_INFO(dev)->gen >= 6)
>  		return gen6_do_reset;
>  	else if (IS_GEN5(dev))
> -- 
> 2.1.4
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Chris Wilson June 22, 2015, 1:53 p.m. UTC | #2
On Mon, Jun 22, 2015 at 03:44:51PM +0200, Daniel Vetter wrote:
> On Thu, Jun 18, 2015 at 11:42:08AM +0100, Chris Wilson wrote:
> > If the user disables the GPU reset using the i915.reset parameter and
> > one occurs, report that we failed to reset the GPU. If we return early,
> > as we currently do, then we leave all state intact (with a hung GPU)
> > and clients block forever waiting for their requests to complete.
> > 
> > Testcase: igt/gem_eio
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > ---
> >  drivers/gpu/drm/i915/i915_dma.c     | 1 -
> >  drivers/gpu/drm/i915/i915_drv.c     | 3 ---
> >  drivers/gpu/drm/i915/intel_uncore.c | 3 +++
> >  3 files changed, 3 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> > index 88795d2f1819..c5349fa3fcce 100644
> > --- a/drivers/gpu/drm/i915/i915_dma.c
> > +++ b/drivers/gpu/drm/i915/i915_dma.c
> > @@ -165,7 +165,6 @@ static int i915_getparam(struct drm_device *dev, void *data,
> >  		break;
> >  	case I915_PARAM_HAS_GPU_RESET:
> >  		value = i915.enable_hangcheck &&
> > -			i915.reset &&
> >  			intel_has_gpu_reset(dev);
> >  		break;
> >  	default:
> > diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> > index 78ef0bb53c36..25ffe8afe744 100644
> > --- a/drivers/gpu/drm/i915/i915_drv.c
> > +++ b/drivers/gpu/drm/i915/i915_drv.c
> > @@ -863,9 +863,6 @@ int i915_reset(struct drm_device *dev)
> >  	bool simulated;
> >  	int ret;
> >  
> > -	if (!i915.reset)
> > -		return 0;
> > -
> >  	intel_reset_gt_powersave(dev);
> >  
> >  	mutex_lock(&dev->struct_mutex);
> > diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> > index 4a86cf007aa0..f8e75def1a1d 100644
> > --- a/drivers/gpu/drm/i915/intel_uncore.c
> > +++ b/drivers/gpu/drm/i915/intel_uncore.c
> > @@ -1457,6 +1457,9 @@ static int gen6_do_reset(struct drm_device *dev)
> >  
> >  static int (*intel_get_gpu_reset(struct drm_device *dev))(struct drm_device *)
> >  {
> > +	if (!i915.reset)
> > +		return NULL;
> 
> Maybe a special reset function which always returns -EIO and prints
> something to illuminate the situation to dmesg? Otherwise even more wtf
> material in bugzilla ...

It does report a failure (different to reset() itself failing), and reset
should be yet another do not touch module option.
-Chris
Daniel Vetter June 22, 2015, 2:48 p.m. UTC | #3
On Mon, Jun 22, 2015 at 02:53:39PM +0100, Chris Wilson wrote:
> On Mon, Jun 22, 2015 at 03:44:51PM +0200, Daniel Vetter wrote:
> > On Thu, Jun 18, 2015 at 11:42:08AM +0100, Chris Wilson wrote:
> > > If the user disables the GPU reset using the i915.reset parameter and
> > > one occurs, report that we failed to reset the GPU. If we return early,
> > > as we currently do, then we leave all state intact (with a hung GPU)
> > > and clients block forever waiting for their requests to complete.
> > > 
> > > Testcase: igt/gem_eio
> > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > ---
> > >  drivers/gpu/drm/i915/i915_dma.c     | 1 -
> > >  drivers/gpu/drm/i915/i915_drv.c     | 3 ---
> > >  drivers/gpu/drm/i915/intel_uncore.c | 3 +++
> > >  3 files changed, 3 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> > > index 88795d2f1819..c5349fa3fcce 100644
> > > --- a/drivers/gpu/drm/i915/i915_dma.c
> > > +++ b/drivers/gpu/drm/i915/i915_dma.c
> > > @@ -165,7 +165,6 @@ static int i915_getparam(struct drm_device *dev, void *data,
> > >  		break;
> > >  	case I915_PARAM_HAS_GPU_RESET:
> > >  		value = i915.enable_hangcheck &&
> > > -			i915.reset &&
> > >  			intel_has_gpu_reset(dev);
> > >  		break;
> > >  	default:
> > > diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> > > index 78ef0bb53c36..25ffe8afe744 100644
> > > --- a/drivers/gpu/drm/i915/i915_drv.c
> > > +++ b/drivers/gpu/drm/i915/i915_drv.c
> > > @@ -863,9 +863,6 @@ int i915_reset(struct drm_device *dev)
> > >  	bool simulated;
> > >  	int ret;
> > >  
> > > -	if (!i915.reset)
> > > -		return 0;
> > > -
> > >  	intel_reset_gt_powersave(dev);
> > >  
> > >  	mutex_lock(&dev->struct_mutex);
> > > diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> > > index 4a86cf007aa0..f8e75def1a1d 100644
> > > --- a/drivers/gpu/drm/i915/intel_uncore.c
> > > +++ b/drivers/gpu/drm/i915/intel_uncore.c
> > > @@ -1457,6 +1457,9 @@ static int gen6_do_reset(struct drm_device *dev)
> > >  
> > >  static int (*intel_get_gpu_reset(struct drm_device *dev))(struct drm_device *)
> > >  {
> > > +	if (!i915.reset)
> > > +		return NULL;
> > 
> > Maybe a special reset function which always returns -EIO and prints
> > something to illuminate the situation to dmesg? Otherwise even more wtf
> > material in bugzilla ...
> 
> It does report a failure (different to reset() itself failing), and reset
> should be yet another do not touch module option.

The only one I could spot is conditional upon the hang being a simulated
one. But just marking i915.reset as unsafe works for me too, I'll amend
your patch.
-Daniel
Chris Wilson June 22, 2015, 2:53 p.m. UTC | #4
On Mon, Jun 22, 2015 at 04:48:26PM +0200, Daniel Vetter wrote:
> On Mon, Jun 22, 2015 at 02:53:39PM +0100, Chris Wilson wrote:
> > On Mon, Jun 22, 2015 at 03:44:51PM +0200, Daniel Vetter wrote:
> > > On Thu, Jun 18, 2015 at 11:42:08AM +0100, Chris Wilson wrote:
> > > > If the user disables the GPU reset using the i915.reset parameter and
> > > > one occurs, report that we failed to reset the GPU. If we return early,
> > > > as we currently do, then we leave all state intact (with a hung GPU)
> > > > and clients block forever waiting for their requests to complete.
> > > > 
> > > > Testcase: igt/gem_eio
> > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > > ---
> > > >  drivers/gpu/drm/i915/i915_dma.c     | 1 -
> > > >  drivers/gpu/drm/i915/i915_drv.c     | 3 ---
> > > >  drivers/gpu/drm/i915/intel_uncore.c | 3 +++
> > > >  3 files changed, 3 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> > > > index 88795d2f1819..c5349fa3fcce 100644
> > > > --- a/drivers/gpu/drm/i915/i915_dma.c
> > > > +++ b/drivers/gpu/drm/i915/i915_dma.c
> > > > @@ -165,7 +165,6 @@ static int i915_getparam(struct drm_device *dev, void *data,
> > > >  		break;
> > > >  	case I915_PARAM_HAS_GPU_RESET:
> > > >  		value = i915.enable_hangcheck &&
> > > > -			i915.reset &&
> > > >  			intel_has_gpu_reset(dev);
> > > >  		break;
> > > >  	default:
> > > > diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> > > > index 78ef0bb53c36..25ffe8afe744 100644
> > > > --- a/drivers/gpu/drm/i915/i915_drv.c
> > > > +++ b/drivers/gpu/drm/i915/i915_drv.c
> > > > @@ -863,9 +863,6 @@ int i915_reset(struct drm_device *dev)
> > > >  	bool simulated;
> > > >  	int ret;
> > > >  
> > > > -	if (!i915.reset)
> > > > -		return 0;
> > > > -
> > > >  	intel_reset_gt_powersave(dev);
> > > >  
> > > >  	mutex_lock(&dev->struct_mutex);
> > > > diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> > > > index 4a86cf007aa0..f8e75def1a1d 100644
> > > > --- a/drivers/gpu/drm/i915/intel_uncore.c
> > > > +++ b/drivers/gpu/drm/i915/intel_uncore.c
> > > > @@ -1457,6 +1457,9 @@ static int gen6_do_reset(struct drm_device *dev)
> > > >  
> > > >  static int (*intel_get_gpu_reset(struct drm_device *dev))(struct drm_device *)
> > > >  {
> > > > +	if (!i915.reset)
> > > > +		return NULL;
> > > 
> > > Maybe a special reset function which always returns -EIO and prints
> > > something to illuminate the situation to dmesg? Otherwise even more wtf
> > > material in bugzilla ...
> > 
> > It does report a failure (different to reset() itself failing), and reset
> > should be yet another do not touch module option.
> 
> The only one I could spot is conditional upon the hang being a simulated
> one. But just marking i915.reset as unsafe works for me too, I'll amend
> your patch.

	if (ret) {
		DRM_ERROR("Failed to reset chip: %i\n", ret);
		goto error;
	}

ENODEV for anything > gen2 is a giveaway.
-Chris
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 88795d2f1819..c5349fa3fcce 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -165,7 +165,6 @@  static int i915_getparam(struct drm_device *dev, void *data,
 		break;
 	case I915_PARAM_HAS_GPU_RESET:
 		value = i915.enable_hangcheck &&
-			i915.reset &&
 			intel_has_gpu_reset(dev);
 		break;
 	default:
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 78ef0bb53c36..25ffe8afe744 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -863,9 +863,6 @@  int i915_reset(struct drm_device *dev)
 	bool simulated;
 	int ret;
 
-	if (!i915.reset)
-		return 0;
-
 	intel_reset_gt_powersave(dev);
 
 	mutex_lock(&dev->struct_mutex);
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 4a86cf007aa0..f8e75def1a1d 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1457,6 +1457,9 @@  static int gen6_do_reset(struct drm_device *dev)
 
 static int (*intel_get_gpu_reset(struct drm_device *dev))(struct drm_device *)
 {
+	if (!i915.reset)
+		return NULL;
+
 	if (INTEL_INFO(dev)->gen >= 6)
 		return gen6_do_reset;
 	else if (IS_GEN5(dev))