diff mbox series

[1/2] drm/i915/gt: obey "reset" module parameter

Message ID CA+GA0_srCED0nX7XkiuOBxsxPy8xskG0Z-Lu9bWnD=tknnJNww@mail.gmail.com (mailing list archive)
State New, archived
Headers show
Series [1/2] drm/i915/gt: obey "reset" module parameter | expand

Commit Message

Marcin Ślusarz Aug. 18, 2020, 11:36 a.m. UTC
From: Marcin Ślusarz <marcin.slusarz@intel.com>

For some reason intel_gt_reset attempts to reset the GPU twice.
On one code path (do_reset) "reset" parameter is obeyed, but is
not on the other one (__intel_gt_set_wedged).

Fix this.

I noticed this because I stumbled on a bug which completely locks
up a machine on reset (preventing me from saving the error state)
and i915.reset=0 wasn't working as expected.

Signed-off-by: Marcin Ślusarz <marcin.slusarz@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_reset.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Chris Wilson Aug. 18, 2020, 11:58 a.m. UTC | #1
Quoting Marcin Ślusarz (2020-08-18 12:36:07)
> From: Marcin Ślusarz <marcin.slusarz@intel.com>
> 
> For some reason intel_gt_reset attempts to reset the GPU twice.
> On one code path (do_reset) "reset" parameter is obeyed, but is
> not on the other one (__intel_gt_set_wedged).

It's not that simple, we do want to force __intel_gt_set_wedged() to
cancel whatever is running on the GPU as it is used for more than just
failing resets (e.g. around control boundaries) regardless of what the
user may want.

I'm loathe to add a parameter just to enable unsafe behaviour, but that
may be the compromise.
-Chris
Rodrigo Vivi Aug. 18, 2020, 5:49 p.m. UTC | #2
On Tue, Aug 18, 2020 at 12:58:00PM +0100, Chris Wilson wrote:
> Quoting Marcin Ślusarz (2020-08-18 12:36:07)
> > From: Marcin Ślusarz <marcin.slusarz@intel.com>
> > 
> > For some reason intel_gt_reset attempts to reset the GPU twice.
> > On one code path (do_reset) "reset" parameter is obeyed, but is
> > not on the other one (__intel_gt_set_wedged).
> 
> It's not that simple, we do want to force __intel_gt_set_wedged() to
> cancel whatever is running on the GPU as it is used for more than just
> failing resets (e.g. around control boundaries) regardless of what the
> user may want.
> 
> I'm loathe to add a parameter just to enable unsafe behaviour, but that
> may be the compromise.

we probably need this compromise for these cases Marcin faced...

what about moving this to intel_get_gpu_reset()?


@bool intel_has_gpu_reset(const struct intel_gt *gt)
-		if (!gt->i915->params.reset)
-			return NULL;

@ static reset_func intel_get_gpu_reset(const struct intel_gt *gt)
+		if (!gt->i915->params.reset)
+			return NULL;

> -Chris
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Chris Wilson Aug. 18, 2020, 6:12 p.m. UTC | #3
Quoting Rodrigo Vivi (2020-08-18 18:49:19)
> On Tue, Aug 18, 2020 at 12:58:00PM +0100, Chris Wilson wrote:
> > Quoting Marcin Ślusarz (2020-08-18 12:36:07)
> > > From: Marcin Ślusarz <marcin.slusarz@intel.com>
> > > 
> > > For some reason intel_gt_reset attempts to reset the GPU twice.
> > > On one code path (do_reset) "reset" parameter is obeyed, but is
> > > not on the other one (__intel_gt_set_wedged).
> > 
> > It's not that simple, we do want to force __intel_gt_set_wedged() to
> > cancel whatever is running on the GPU as it is used for more than just
> > failing resets (e.g. around control boundaries) regardless of what the
> > user may want.
> > 
> > I'm loathe to add a parameter just to enable unsafe behaviour, but that
> > may be the compromise.
> 
> we probably need this compromise for these cases Marcin faced...

You can always say those who risk unsafe parameters are always capable
of patching the kernel to break it.

> what about moving this to intel_get_gpu_reset()?

When it was there, we didn't have the reason why, so we ended up
duplicating the tests anyway to suppress the error messages for CI.

And it breaks the control boundary cases where we have to reset the GPU,
or when we need the wedge to undeadlock modesetting which will also
lockup the machine. In short, we should remove the parameter; we'll
still end up having to bisect through the GPU features [atomic ops, it's
always atomic ops] to find which one is killing the machine.
-Chris
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c
b/drivers/gpu/drm/i915/gt/intel_reset.c
index 39070b514e65..f4823ca2d71f 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -816,7 +816,8 @@  static void __intel_gt_set_wedged(struct intel_gt *gt)
     awake = reset_prepare(gt);

     /* Even if the GPU reset fails, it should still stop the engines */
-    if (!INTEL_INFO(gt->i915)->gpu_reset_clobbers_display)
+    if (!INTEL_INFO(gt->i915)->gpu_reset_clobbers_display &&
+            i915_modparams.reset)
         __intel_gt_reset(gt, ALL_ENGINES);

     for_each_engine(engine, gt, id)