[1/2] drm/i915/gt: obey "reset" module parameter

Message ID	CA+GA0_srCED0nX7XkiuOBxsxPy8xskG0Z-Lu9bWnD=tknnJNww@mail.gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=x/xo=B4=lists.freedesktop.org=intel-gfx-bounces@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 6C25B20578 MIME-Version: 1.0 From: =?utf-8?q?Marcin_=C5=9Alusarz?= <marcin.slusarz@gmail.com> Date: Tue, 18 Aug 2020 13:36:07 +0200 Message-ID: <CA+GA0_srCED0nX7XkiuOBxsxPy8xskG0Z-Lu9bWnD=tknnJNww@mail.gmail.com> To: intel-gfx@lists.freedesktop.org Subject: [Intel-gfx] [PATCH 1/2] drm/i915/gt: obey "reset" module parameter Precedence: list Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>
Series	[1/2] drm/i915/gt: obey "reset" module parameter \| expand [1/2] drm/i915/gt: obey "reset" module parameter [2/2] drm/i915/gt: raise logging level of "gpu reset disabled" message

Message ID

CA+GA0_srCED0nX7XkiuOBxsxPy8xskG0Z-Lu9bWnD=tknnJNww@mail.gmail.com (mailing list archive)

State

New, archived

Headers

DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 6C25B20578
MIME-Version: 1.0
From: =?utf-8?q?Marcin_=C5=9Alusarz?= <marcin.slusarz@gmail.com>
Date: Tue, 18 Aug 2020 13:36:07 +0200
Message-ID: 
 <CA+GA0_srCED0nX7XkiuOBxsxPy8xskG0Z-Lu9bWnD=tknnJNww@mail.gmail.com>
To: intel-gfx@lists.freedesktop.org
Subject: [Intel-gfx] [PATCH 1/2] drm/i915/gt: obey "reset" module parameter
Precedence: list
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Errors-To: intel-gfx-bounces@lists.freedesktop.org
Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

Series

[1/2] drm/i915/gt: obey "reset" module parameter | expand

Commit Message

Marcin Ślusarz Aug. 18, 2020, 11:36 a.m. UTC

From: Marcin Ślusarz <marcin.slusarz@intel.com>

For some reason intel_gt_reset attempts to reset the GPU twice.
On one code path (do_reset) "reset" parameter is obeyed, but is
not on the other one (__intel_gt_set_wedged).

Fix this.

I noticed this because I stumbled on a bug which completely locks
up a machine on reset (preventing me from saving the error state)
and i915.reset=0 wasn't working as expected.

Signed-off-by: Marcin Ślusarz <marcin.slusarz@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_reset.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Chris Wilson Aug. 18, 2020, 11:58 a.m. UTC | #1

Quoting Marcin Ślusarz (2020-08-18 12:36:07)
> From: Marcin Ślusarz <marcin.slusarz@intel.com>
> 
> For some reason intel_gt_reset attempts to reset the GPU twice.
> On one code path (do_reset) "reset" parameter is obeyed, but is
> not on the other one (__intel_gt_set_wedged).

It's not that simple, we do want to force __intel_gt_set_wedged() to
cancel whatever is running on the GPU as it is used for more than just
failing resets (e.g. around control boundaries) regardless of what the
user may want.

I'm loathe to add a parameter just to enable unsafe behaviour, but that
may be the compromise.
-Chris

Rodrigo Vivi Aug. 18, 2020, 5:49 p.m. UTC | #2

On Tue, Aug 18, 2020 at 12:58:00PM +0100, Chris Wilson wrote:
> Quoting Marcin Ślusarz (2020-08-18 12:36:07)
> > From: Marcin Ślusarz <marcin.slusarz@intel.com>
> > 
> > For some reason intel_gt_reset attempts to reset the GPU twice.
> > On one code path (do_reset) "reset" parameter is obeyed, but is
> > not on the other one (__intel_gt_set_wedged).
> 
> It's not that simple, we do want to force __intel_gt_set_wedged() to
> cancel whatever is running on the GPU as it is used for more than just
> failing resets (e.g. around control boundaries) regardless of what the
> user may want.
> 
> I'm loathe to add a parameter just to enable unsafe behaviour, but that
> may be the compromise.

we probably need this compromise for these cases Marcin faced...

what about moving this to intel_get_gpu_reset()?


@bool intel_has_gpu_reset(const struct intel_gt *gt)
-		if (!gt->i915->params.reset)
-			return NULL;

@ static reset_func intel_get_gpu_reset(const struct intel_gt *gt)
+		if (!gt->i915->params.reset)
+			return NULL;

> -Chris
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx

Chris Wilson Aug. 18, 2020, 6:12 p.m. UTC | #3

Quoting Rodrigo Vivi (2020-08-18 18:49:19)
> On Tue, Aug 18, 2020 at 12:58:00PM +0100, Chris Wilson wrote:
> > Quoting Marcin Ślusarz (2020-08-18 12:36:07)
> > > From: Marcin Ślusarz <marcin.slusarz@intel.com>
> > > 
> > > For some reason intel_gt_reset attempts to reset the GPU twice.
> > > On one code path (do_reset) "reset" parameter is obeyed, but is
> > > not on the other one (__intel_gt_set_wedged).
> > 
> > It's not that simple, we do want to force __intel_gt_set_wedged() to
> > cancel whatever is running on the GPU as it is used for more than just
> > failing resets (e.g. around control boundaries) regardless of what the
> > user may want.
> > 
> > I'm loathe to add a parameter just to enable unsafe behaviour, but that
> > may be the compromise.
> 
> we probably need this compromise for these cases Marcin faced...

You can always say those who risk unsafe parameters are always capable
of patching the kernel to break it.

> what about moving this to intel_get_gpu_reset()?

When it was there, we didn't have the reason why, so we ended up
duplicating the tests anyway to suppress the error messages for CI.

And it breaks the control boundary cases where we have to reset the GPU,
or when we need the wedge to undeadlock modesetting which will also
lockup the machine. In short, we should remove the parameter; we'll
still end up having to bisect through the GPU features [atomic ops, it's
always atomic ops] to find which one is killing the machine.
-Chris

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c
b/drivers/gpu/drm/i915/gt/intel_reset.c
index 39070b514e65..f4823ca2d71f 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -816,7 +816,8 @@  static void __intel_gt_set_wedged(struct intel_gt *gt)
     awake = reset_prepare(gt);

     /* Even if the GPU reset fails, it should still stop the engines */
-    if (!INTEL_INFO(gt->i915)->gpu_reset_clobbers_display)
+    if (!INTEL_INFO(gt->i915)->gpu_reset_clobbers_display &&
+            i915_modparams.reset)
         __intel_gt_reset(gt, ALL_ENGINES);

     for_each_engine(engine, gt, id)

[1/2] drm/i915/gt: obey "reset" module parameter

Commit Message

Comments

Patch