Message ID | CA+GA0_srCED0nX7XkiuOBxsxPy8xskG0Z-Lu9bWnD=tknnJNww@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/2] drm/i915/gt: obey "reset" module parameter | expand |
Quoting Marcin Ślusarz (2020-08-18 12:36:07) > From: Marcin Ślusarz <marcin.slusarz@intel.com> > > For some reason intel_gt_reset attempts to reset the GPU twice. > On one code path (do_reset) "reset" parameter is obeyed, but is > not on the other one (__intel_gt_set_wedged). It's not that simple, we do want to force __intel_gt_set_wedged() to cancel whatever is running on the GPU as it is used for more than just failing resets (e.g. around control boundaries) regardless of what the user may want. I'm loathe to add a parameter just to enable unsafe behaviour, but that may be the compromise. -Chris
On Tue, Aug 18, 2020 at 12:58:00PM +0100, Chris Wilson wrote: > Quoting Marcin Ślusarz (2020-08-18 12:36:07) > > From: Marcin Ślusarz <marcin.slusarz@intel.com> > > > > For some reason intel_gt_reset attempts to reset the GPU twice. > > On one code path (do_reset) "reset" parameter is obeyed, but is > > not on the other one (__intel_gt_set_wedged). > > It's not that simple, we do want to force __intel_gt_set_wedged() to > cancel whatever is running on the GPU as it is used for more than just > failing resets (e.g. around control boundaries) regardless of what the > user may want. > > I'm loathe to add a parameter just to enable unsafe behaviour, but that > may be the compromise. we probably need this compromise for these cases Marcin faced... what about moving this to intel_get_gpu_reset()? @bool intel_has_gpu_reset(const struct intel_gt *gt) - if (!gt->i915->params.reset) - return NULL; @ static reset_func intel_get_gpu_reset(const struct intel_gt *gt) + if (!gt->i915->params.reset) + return NULL; > -Chris > _______________________________________________ > Intel-gfx mailing list > Intel-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Quoting Rodrigo Vivi (2020-08-18 18:49:19) > On Tue, Aug 18, 2020 at 12:58:00PM +0100, Chris Wilson wrote: > > Quoting Marcin Ślusarz (2020-08-18 12:36:07) > > > From: Marcin Ślusarz <marcin.slusarz@intel.com> > > > > > > For some reason intel_gt_reset attempts to reset the GPU twice. > > > On one code path (do_reset) "reset" parameter is obeyed, but is > > > not on the other one (__intel_gt_set_wedged). > > > > It's not that simple, we do want to force __intel_gt_set_wedged() to > > cancel whatever is running on the GPU as it is used for more than just > > failing resets (e.g. around control boundaries) regardless of what the > > user may want. > > > > I'm loathe to add a parameter just to enable unsafe behaviour, but that > > may be the compromise. > > we probably need this compromise for these cases Marcin faced... You can always say those who risk unsafe parameters are always capable of patching the kernel to break it. > what about moving this to intel_get_gpu_reset()? When it was there, we didn't have the reason why, so we ended up duplicating the tests anyway to suppress the error messages for CI. And it breaks the control boundary cases where we have to reset the GPU, or when we need the wedge to undeadlock modesetting which will also lockup the machine. In short, we should remove the parameter; we'll still end up having to bisect through the GPU features [atomic ops, it's always atomic ops] to find which one is killing the machine. -Chris
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c index 39070b514e65..f4823ca2d71f 100644 --- a/drivers/gpu/drm/i915/gt/intel_reset.c +++ b/drivers/gpu/drm/i915/gt/intel_reset.c @@ -816,7 +816,8 @@ static void __intel_gt_set_wedged(struct intel_gt *gt) awake = reset_prepare(gt); /* Even if the GPU reset fails, it should still stop the engines */ - if (!INTEL_INFO(gt->i915)->gpu_reset_clobbers_display) + if (!INTEL_INFO(gt->i915)->gpu_reset_clobbers_display && + i915_modparams.reset) __intel_gt_reset(gt, ALL_ENGINES); for_each_engine(engine, gt, id)