diff mbox series

[RFC] drm/i915/debugfs: Only wedge if we have reset available

Message ID 20191002124820.1862-1-janusz.krzysztofik@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series [RFC] drm/i915/debugfs: Only wedge if we have reset available | expand

Commit Message

Janusz Krzysztofik Oct. 2, 2019, 12:48 p.m. UTC
If we process DROP_RESET_ACTIVE and cancel all outstanding requests by
forcing a GPU reset on a hardware with reset capabilities disabled or
not supported, we certainly end up with a terminally wedged GPU,
impossible to recover.  That's probably not what we want.

Before setting the GPU wedged, verify if we have GPU reset available
and fail with -EBUSY if not.

Suggested-by: Petri Latvala <petri.latvala@intel.com>
Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
Cc: Michał Wajdeczko <michal.wajdeczko@intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Piotr Piórkowski <piotr.piorkowski@intel.com>
Cc: Tomasz Lis <tomasz.lis@intel.com>
Cc: Petri Latvala <petri.latvala@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Martin Peres <martin.peres@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

Comments

Tvrtko Ursulin Oct. 2, 2019, 3:45 p.m. UTC | #1
On 02/10/2019 13:48, Janusz Krzysztofik wrote:
> If we process DROP_RESET_ACTIVE and cancel all outstanding requests by
> forcing a GPU reset on a hardware with reset capabilities disabled or
> not supported, we certainly end up with a terminally wedged GPU,
> impossible to recover.  That's probably not what we want.

I forgot the whole background story here I'm afraid. Is the concern here 
the IGT exit handler calling DROP_RESET_ACTIVE? If so with this patch it 
will fail with -EBUSY, which could be fine, but what happens from the 
perspective of next test which gets to run? It won't find a wedged GPU, 
but will encounter a possibly nondeterministic amount of GPU work 
scheduled before it, no?

Regards,

Tvrtko

> Before setting the GPU wedged, verify if we have GPU reset available
> and fail with -EBUSY if not.
> 
> Suggested-by: Petri Latvala <petri.latvala@intel.com>
> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
> Cc: Michał Wajdeczko <michal.wajdeczko@intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> Cc: Piotr Piórkowski <piotr.piorkowski@intel.com>
> Cc: Tomasz Lis <tomasz.lis@intel.com>
> Cc: Petri Latvala <petri.latvala@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Martin Peres <martin.peres@linux.intel.com>
> ---
>   drivers/gpu/drm/i915/i915_debugfs.c | 11 ++++++++++-
>   1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> index fec9fb7cc384..0774ca6e2a05 100644
> --- a/drivers/gpu/drm/i915/i915_debugfs.c
> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> @@ -3627,8 +3627,17 @@ i915_drop_caches_set(void *data, u64 val)
>   
>   	if (val & DROP_RESET_ACTIVE &&
>   	    wait_for(intel_engines_are_idle(&i915->gt),
> -		     I915_IDLE_ENGINES_TIMEOUT))
> +		     I915_IDLE_ENGINES_TIMEOUT)) {
> +		/*
> +		 * Only wedge if reset is supported and not disabled, otherwise
> +		 * we certainly end up with the GPU terminally wedged.  Inform
> +		 * userspace about the problem instead.
> +		 */
> +		if (!intel_has_gpu_reset(&i915->gt))
> +			return -EBUSY;
> +
>   		intel_gt_set_wedged(&i915->gt);
> +	}
>   
>   	/* No need to check and wait for gpu resets, only libdrm auto-restarts
>   	 * on ioctls on -EAGAIN. */
>
Chris Wilson Oct. 2, 2019, 3:59 p.m. UTC | #2
Quoting Tvrtko Ursulin (2019-10-02 16:45:18)
> 
> On 02/10/2019 13:48, Janusz Krzysztofik wrote:
> > If we process DROP_RESET_ACTIVE and cancel all outstanding requests by
> > forcing a GPU reset on a hardware with reset capabilities disabled or
> > not supported, we certainly end up with a terminally wedged GPU,
> > impossible to recover.  That's probably not what we want.
> 
> I forgot the whole background story here I'm afraid. Is the concern here 
> the IGT exit handler calling DROP_RESET_ACTIVE? If so with this patch it 
> will fail with -EBUSY, which could be fine, but what happens from the 
> perspective of next test which gets to run? It won't find a wedged GPU, 
> but will encounter a possibly nondeterministic amount of GPU work 
> scheduled before it, no?

Yes, that is the conundrum. If the test left work outstanding, and in a
few cases, we explicitly rely on the reset here to cancel persistent
(unbound nonpreemptible spinners) work, then it will cause the next
test, where drm_driver_open(DRM_INTEL) calls gem_quiescent_gpu(),
to wait until eventually it is wedged. There's a good chance that next
test will then fail because it doesn't handle the wedged gpu.

The alternative would be to wedge here, taint and reboot. Then
hopefully resume testing at the next test with vanilla state.
-Chris
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index fec9fb7cc384..0774ca6e2a05 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -3627,8 +3627,17 @@  i915_drop_caches_set(void *data, u64 val)
 
 	if (val & DROP_RESET_ACTIVE &&
 	    wait_for(intel_engines_are_idle(&i915->gt),
-		     I915_IDLE_ENGINES_TIMEOUT))
+		     I915_IDLE_ENGINES_TIMEOUT)) {
+		/*
+		 * Only wedge if reset is supported and not disabled, otherwise
+		 * we certainly end up with the GPU terminally wedged.  Inform
+		 * userspace about the problem instead.
+		 */
+		if (!intel_has_gpu_reset(&i915->gt))
+			return -EBUSY;
+
 		intel_gt_set_wedged(&i915->gt);
+	}
 
 	/* No need to check and wait for gpu resets, only libdrm auto-restarts
 	 * on ioctls on -EAGAIN. */