Message ID | 20231009233856.1932887-1-jonathan.cavitt@intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | drm/i915/gt: Temporarily force MTL into uncached mode |
On Mon, Oct 09, 2023 at 04:38:56PM -0700, Jonathan Cavitt wrote:
> FIXME: CAT errors are cropping up on MTL. This removes them,
> but the real root cause must still be diagnosed.

Do you have a link to specific IGT test(s) that illustrate the CAT
errors so that we can ensure that they now appear fixed in CI?

Matt

>
> Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_gt.c     | 6 +++++-
>  drivers/gpu/drm/i915/gt/intel_lrc.c    | 5 ++++-
>  drivers/gpu/drm/i915/gt/uc/intel_guc.c | 5 ++++-
>  3 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
> index ed32bf5b15464..b52c8eb0b033f 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt.c
> @@ -1026,8 +1026,12 @@ enum i915_map_type intel_gt_coherent_map_type(struct intel_gt *gt,
>  	/*
>  	 * Wa_22016122933: always return I915_MAP_WC for Media
>  	 * version 13.0 when the object is on the Media GT
> +	 *
> +	 * FIXME: CAT errors are cropping up on MTL. This removes them,
> +	 * but the real root cause must still be diagnosed.
>  	 */
> -	if (i915_gem_object_is_lmem(obj) || intel_gt_needs_wa_22016122933(gt))
> +	if (i915_gem_object_is_lmem(obj) || intel_gt_needs_wa_22016122933(gt) ||
> +	    IS_METEORLAKE(gt->i915))
>  		return I915_MAP_WC;
>  	if (HAS_LLC(gt->i915) || always_coherent)
>  		return I915_MAP_WB;
> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> index eaf66d9031665..8aaa4df84cb3e 100644
> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> @@ -1124,8 +1124,11 @@ __lrc_alloc_state(struct intel_context *ce, struct intel_engine_cs *engine)
>  		 * Wa_22016122933: For Media version 13.0, all Media GT shared
>  		 * memory needs to be mapped as WC on CPU side and UC (PAT
>  		 * index 2) on GPU side.
> +		 *
> +		 * FIXME: CAT errors are cropping up on MTL. This removes them,
> +		 * but the real root cause must still be diagnosed.
>  		 */
> -		if (intel_gt_needs_wa_22016122933(engine->gt))
> +		if (intel_gt_needs_wa_22016122933(engine->gt) || IS_METEORLAKE(engine->i915))
>  			i915_gem_object_set_cache_coherency(obj, I915_CACHE_NONE);
>  	}
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.c b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> index 27df41c53b890..e3a7d61506188 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> @@ -774,8 +774,11 @@ struct i915_vma *intel_guc_allocate_vma(struct intel_guc *guc, u32 size)
>  	 * Wa_22016122933: For Media version 13.0, all Media GT shared
>  	 * memory needs to be mapped as WC on CPU side and UC (PAT
>  	 * index 2) on GPU side.
> +	 *
> +	 * FIXME: CAT errors are cropping up on MTL. This removes them,
> +	 * but the real root cause must still be diagnosed.
>  	 */
> -	if (intel_gt_needs_wa_22016122933(gt))
> +	if (intel_gt_needs_wa_22016122933(gt) || IS_METEORLAKE(gt->i915))
>  		i915_gem_object_set_cache_coherency(obj, I915_CACHE_NONE);
>
>  	vma = i915_vma_instance(obj, &gt->ggtt->vm, NULL);
> --
> 2.25.1
>

Hi Matt,

On Tue, Oct 10, 2023 at 06:58:27AM -0700, Matt Roper wrote:
> On Mon, Oct 09, 2023 at 04:38:56PM -0700, Jonathan Cavitt wrote:
> > FIXME: CAT errors are cropping up on MTL. This removes them,
> > but the real root cause must still be diagnosed.
>
> Do you have a link to specific IGT test(s) that illustrate the CAT
> errors so that we can ensure that they now appear fixed in CI?

this one:

https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124599v1/bat-mtlp-8/igt@i915_selftest@live@hugepages.html

Andi

On Tue, Oct 10, 2023 at 05:11:54PM +0200, Andi Shyti wrote:
> Hi Matt,
>
> On Tue, Oct 10, 2023 at 06:58:27AM -0700, Matt Roper wrote:
> > On Mon, Oct 09, 2023 at 04:38:56PM -0700, Jonathan Cavitt wrote:
> > > FIXME: CAT errors are cropping up on MTL. This removes them,
> > > but the real root cause must still be diagnosed.
> >
> > Do you have a link to specific IGT test(s) that illustrate the CAT
> > errors so that we can ensure that they now appear fixed in CI?
>
> this one:
>
> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124599v1/bat-mtlp-8/igt@i915_selftest@live@hugepages.html
>
> Andi

Wait, now I'm confused. That's a failure caused by a different patch
series (one that we won't be moving forward with). The live@hugepages
test is always passing on drm-tip today:
https://intel-gfx-ci.01.org/tree/drm-tip/igt@i915_selftest@live@hugepages.html

Is there a test that's giving CAT errors on drm-tip itself (even
sporadically) that we can monitor to see the impact of Jonathan's patch
here?

Matt

Hi Matt,

> > > > FIXME: CAT errors are cropping up on MTL. This removes them,
> > > > but the real root cause must still be diagnosed.
> > >
> > > Do you have a link to specific IGT test(s) that illustrate the CAT
> > > errors so that we can ensure that they now appear fixed in CI?
> >
> > this one:
> >
> > https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124599v1/bat-mtlp-8/igt@i915_selftest@live@hugepages.html
> >
> > Andi
>
> Wait, now I'm confused. That's a failure caused by a different patch
> series (one that we won't be moving forward with). The live@hugepages
> test is always passing on drm-tip today:
> https://intel-gfx-ci.01.org/tree/drm-tip/igt@i915_selftest@live@hugepages.html

yes, true, but that patch allows us to move forward with the
testing and hit the CAT error.

(it was the most reachable link I found :))

> Is there a test that's giving CAT errors on drm-tip itself (even
> sporadically) that we can monitor to see the impact of Jonathan's patch
> here?

Otherwise this one:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13667/re-mtlp-3/igt@gem_exec_fence@parallel.html#dmesg-warnings11

Andi

On 10/10/2023 17:17, Andi Shyti wrote:
> Hi Matt,
>
>>>>> FIXME: CAT errors are cropping up on MTL. This removes them,
>>>>> but the real root cause must still be diagnosed.
>>>>
>>>> Do you have a link to specific IGT test(s) that illustrate the CAT
>>>> errors so that we can ensure that they now appear fixed in CI?
>>>
>>> this one:
>>>
>>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124599v1/bat-mtlp-8/igt@i915_selftest@live@hugepages.html
>>>
>>> Andi
>>
>> Wait, now I'm confused. That's a failure caused by a different patch
>> series (one that we won't be moving forward with). The live@hugepages
>> test is always passing on drm-tip today:
>> https://intel-gfx-ci.01.org/tree/drm-tip/igt@i915_selftest@live@hugepages.html
>
> yes, true, but that patch allows us to move forward with the
> testing and hit the CAT error.
>
> (it was the most reachable link I found :))
>
>> Is there a test that's giving CAT errors on drm-tip itself (even
>> sporadically) that we can monitor to see the impact of Jonathan's patch
>> here?
>
> Otherwise this one:
>
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13667/re-mtlp-3/igt@gem_exec_fence@parallel.html#dmesg-warnings11

Parachuting in on a tangent - please do not mix CAT and CT errors. CAT,
for me at least, associates with CATastrophic faults reported over the
CT channel, like GuC page faulting IIRC.

For CT errors, maybe the GuC folks can shed some light on what they mean.

Regards,

Tvrtko

On Tue, Oct 10, 2023 at 06:17:27PM +0200, Andi Shyti wrote:
> Hi Matt,
>
> > > > > FIXME: CAT errors are cropping up on MTL. This removes them,
> > > > > but the real root cause must still be diagnosed.
> > > >
> > > > Do you have a link to specific IGT test(s) that illustrate the CAT
> > > > errors so that we can ensure that they now appear fixed in CI?
> > >
> > > this one:
> > >
> > > https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124599v1/bat-mtlp-8/igt@i915_selftest@live@hugepages.html
> > >
> > > Andi
> >
> > Wait, now I'm confused. That's a failure caused by a different patch
> > series (one that we won't be moving forward with). The live@hugepages
> > test is always passing on drm-tip today:
> > https://intel-gfx-ci.01.org/tree/drm-tip/igt@i915_selftest@live@hugepages.html
>
> yes, true, but that patch allows us to move forward with the
> testing and hit the CAT error.
>
> (it was the most reachable link I found :))
>
> > Is there a test that's giving CAT errors on drm-tip itself (even
> > sporadically) that we can monitor to see the impact of Jonathan's patch
> > here?
>
> Otherwise this one:
>
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13667/re-mtlp-3/igt@gem_exec_fence@parallel.html#dmesg-warnings11

Okay, looks like this is a pretty sporadic failure:

https://intel-gfx-ci.01.org/tree/drm-tip/igt@gem_exec_fence@parallel@rcs0.html

so we'll need to monitor this for quite a while to make sure it's truly
gone.

Assuming you've done enough local test cycles to confirm that this
definitely avoids the CAT errors,

Acked-by: Matt Roper <matthew.d.roper@intel.com>

as a short-term mitigation while we debug further. We still need to
continue searching for a proper fix and/or drive this through the
hardware team and get them to document this as a new official workaround
for some kind of cache coherency problem.

BTW, it would also be good to have a patch that adds explicit handling
for GuC action 0x6000 (GUC_ACTION_GUC2HOST_NOTIFY_MEMORY_CAT_ERROR) so
that we'll at least have more meaningful error output if/when this is
encountered in the future.

Matt

>
> Andi

On Tue, Oct 10, 2023 at 05:42:28PM +0100, Tvrtko Ursulin wrote:
>
> On 10/10/2023 17:17, Andi Shyti wrote:
> > Hi Matt,
> >
> > > > > > FIXME: CAT errors are cropping up on MTL. This removes them,
> > > > > > but the real root cause must still be diagnosed.
> > > > >
> > > > > Do you have a link to specific IGT test(s) that illustrate the CAT
> > > > > errors so that we can ensure that they now appear fixed in CI?
> > > >
> > > > this one:
> > > >
> > > > https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124599v1/bat-mtlp-8/igt@i915_selftest@live@hugepages.html
> > > >
> > > > Andi
> > >
> > > Wait, now I'm confused. That's a failure caused by a different patch
> > > series (one that we won't be moving forward with). The live@hugepages
> > > test is always passing on drm-tip today:
> > > https://intel-gfx-ci.01.org/tree/drm-tip/igt@i915_selftest@live@hugepages.html
> >
> > yes, true, but that patch allows us to move forward with the
> > testing and hit the CAT error.
> >
> > (it was the most reachable link I found :))
> >
> > > Is there a test that's giving CAT errors on drm-tip itself (even
> > > sporadically) that we can monitor to see the impact of Jonathan's patch
> > > here?
> >
> > Otherwise this one:
> >
> > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13667/re-mtlp-3/igt@gem_exec_fence@parallel.html#dmesg-warnings11
>
> Parachuting in on a tangent - please do not mix CAT and CT errors. CAT,
> for me at least, associates with CATastrophic faults reported over the
> CT channel, like GuC page faulting IIRC.
>
> For CT errors, maybe the GuC folks can shed some light on what they mean.

0x6000 is GUC_ACTION_GUC2HOST_NOTIFY_MEMORY_CAT_ERROR so this actually
is a CAT error, delivered via the CT channel.

Matt

>
> Regards,
>
> Tvrtko

On 10/10/2023 09:44, Matt Roper wrote:
> On Tue, Oct 10, 2023 at 05:42:28PM +0100, Tvrtko Ursulin wrote:
>> On 10/10/2023 17:17, Andi Shyti wrote:
>>> Hi Matt,
>>>
>>>>>>> FIXME: CAT errors are cropping up on MTL. This removes them,
>>>>>>> but the real root cause must still be diagnosed.
>>>>>> Do you have a link to specific IGT test(s) that illustrate the CAT
>>>>>> errors so that we can ensure that they now appear fixed in CI?
>>>>> this one:
>>>>>
>>>>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124599v1/bat-mtlp-8/igt@i915_selftest@live@hugepages.html
>>>>>
>>>>> Andi
>>>> Wait, now I'm confused. That's a failure caused by a different patch
>>>> series (one that we won't be moving forward with). The live@hugepages
>>>> test is always passing on drm-tip today:
>>>> https://intel-gfx-ci.01.org/tree/drm-tip/igt@i915_selftest@live@hugepages.html
>>> yes, true, but that patch allows us to move forward with the
>>> testing and hit the CAT error.
>>>
>>> (it was the most reachable link I found :))
>>>
>>>> Is there a test that's giving CAT errors on drm-tip itself (even
>>>> sporadically) that we can monitor to see the impact of Jonathan's patch
>>>> here?
>>> Otherwise this one:
>>>
>>> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13667/re-mtlp-3/igt@gem_exec_fence@parallel.html#dmesg-warnings11
>> Parachuting in on a tangent - please do not mix CAT and CT errors. CAT,
>> for me at least, associates with CATastrophic faults reported over the
>> CT channel, like GuC page faulting IIRC.
>>
>> For CT errors, maybe the GuC folks can shed some light on what they mean.
> 0x6000 is GUC_ACTION_GUC2HOST_NOTIFY_MEMORY_CAT_ERROR so this actually
> is a CAT error, delivered via the CT channel.

The history is that catastrophic memory errors (CAT is an abbreviation,
not an acronym) are never meant to happen in the upstream driver because
we map all invalid addresses to a scratch page and silently hide such
accesses. Hence there has been push back on adding support for an error
channel which is officially impossible to hit.

The problem is that we keep hitting it due to hardware and/or software
bugs.

Because there is no official support for handling this notification, the
CT layer reports it as an unexpected notification and barfs. As far as
the CT layer is concerned, it is a corrupted packet from GuC. And thus
the error reporting looks totally weird for what is just an illegal
address access from some random part of the GPU.

And note that it is very unlikely that GuC itself caused the page fault.
It is much more plausible that it is coming from an engine/EU/batch
buffer instruction. Although, as noted, the fundamental cause is believed
to be broken page table updates due to cache coherency issues.

John.

>
>
> Matt
>
>> Regards,
>>
>> Tvrtko

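
To make the reporting problem described above concrete, here is a minimal standalone sketch of a notification dispatcher. It is not the i915 CT code: `ct_dispatch`, `enum ct_result`, the message strings, and the single-word payload (a context id) are all hypothetical and exist only for illustration. The only value taken from this thread is that action 0x6000 is GUC_ACTION_GUC2HOST_NOTIFY_MEMORY_CAT_ERROR. Without the explicit `case`, the notification would fall into the `default` branch and be reported as an unexpected action, which is roughly the confusing output discussed here.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* From the thread: action 0x6000 is the memory CAT error notification. */
#define GUC_ACTION_GUC2HOST_NOTIFY_MEMORY_CAT_ERROR 0x6000u

/* Hypothetical result codes for this sketch, not the driver's return values. */
enum ct_result {
	CT_HANDLED = 0,
	CT_UNEXPECTED = -1,
};

/*
 * Hypothetical dispatcher: formats a human-readable report into msg.
 * The payload layout (one word holding a context id) is an assumption
 * made for this sketch only.
 */
static enum ct_result ct_dispatch(uint32_t action, const uint32_t *payload,
				  size_t len, char *msg, size_t msg_size)
{
	switch (action) {
	case GUC_ACTION_GUC2HOST_NOTIFY_MEMORY_CAT_ERROR:
		/* Explicit handling: name the error instead of "unexpected". */
		if (payload && len >= 1)
			snprintf(msg, msg_size,
				 "GPU memory CAT error, context id %u",
				 (unsigned int)payload[0]);
		else
			snprintf(msg, msg_size,
				 "GPU memory CAT error, no context id supplied");
		return CT_HANDLED;
	default:
		/* Unknown actions are indistinguishable from a corrupted packet. */
		snprintf(msg, msg_size, "unexpected G2H action 0x%04x",
			 (unsigned int)action);
		return CT_UNEXPECTED;
	}
}
```

Calling `ct_dispatch(0x6000, payload, 1, msg, sizeof(msg))` returns `CT_HANDLED` and fills `msg` with a CAT-specific report, while any action without a `case` returns `CT_UNEXPECTED`. A real patch would have to follow the driver's actual handler conventions and the documented payload format.
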
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
index ed32bf5b15464..b52c8eb0b033f 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -1026,8 +1026,12 @@ enum i915_map_type intel_gt_coherent_map_type(struct intel_gt *gt,
 	/*
 	 * Wa_22016122933: always return I915_MAP_WC for Media
 	 * version 13.0 when the object is on the Media GT
+	 *
+	 * FIXME: CAT errors are cropping up on MTL. This removes them,
+	 * but the real root cause must still be diagnosed.
 	 */
-	if (i915_gem_object_is_lmem(obj) || intel_gt_needs_wa_22016122933(gt))
+	if (i915_gem_object_is_lmem(obj) || intel_gt_needs_wa_22016122933(gt) ||
+	    IS_METEORLAKE(gt->i915))
 		return I915_MAP_WC;
 	if (HAS_LLC(gt->i915) || always_coherent)
 		return I915_MAP_WB;
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index eaf66d9031665..8aaa4df84cb3e 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1124,8 +1124,11 @@ __lrc_alloc_state(struct intel_context *ce, struct intel_engine_cs *engine)
 		 * Wa_22016122933: For Media version 13.0, all Media GT shared
 		 * memory needs to be mapped as WC on CPU side and UC (PAT
 		 * index 2) on GPU side.
+		 *
+		 * FIXME: CAT errors are cropping up on MTL. This removes them,
+		 * but the real root cause must still be diagnosed.
 		 */
-		if (intel_gt_needs_wa_22016122933(engine->gt))
+		if (intel_gt_needs_wa_22016122933(engine->gt) || IS_METEORLAKE(engine->i915))
 			i915_gem_object_set_cache_coherency(obj, I915_CACHE_NONE);
 	}
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.c b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
index 27df41c53b890..e3a7d61506188 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
@@ -774,8 +774,11 @@ struct i915_vma *intel_guc_allocate_vma(struct intel_guc *guc, u32 size)
 	 * Wa_22016122933: For Media version 13.0, all Media GT shared
 	 * memory needs to be mapped as WC on CPU side and UC (PAT
 	 * index 2) on GPU side.
+	 *
+	 * FIXME: CAT errors are cropping up on MTL. This removes them,
+	 * but the real root cause must still be diagnosed.
 	 */
-	if (intel_gt_needs_wa_22016122933(gt))
+	if (intel_gt_needs_wa_22016122933(gt) || IS_METEORLAKE(gt->i915))
 		i915_gem_object_set_cache_coherency(obj, I915_CACHE_NONE);
 
 	vma = i915_vma_instance(obj, &gt->ggtt->vm, NULL);

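
For readers who want to see the effect of the first hunk in isolation, the following is a small standalone model of the map-type decision after the patch. It is only a sketch: `struct gt_model`, its boolean fields, and `coherent_map_type` are hypothetical stand-ins for the driver's `struct intel_gt`, `i915_gem_object_is_lmem()`, `intel_gt_needs_wa_22016122933()`, `IS_METEORLAKE()` and `HAS_LLC()`. The final `return MAP_WC` is an assumption about code that lies outside the quoted hunk.

```c
#include <stdbool.h>

/* Simplified stand-ins for I915_MAP_WB and I915_MAP_WC. */
enum map_type {
	MAP_WB,
	MAP_WC,
};

/* Hypothetical flattened view of the conditions the hunk tests. */
struct gt_model {
	bool object_in_lmem;        /* models i915_gem_object_is_lmem(obj) */
	bool needs_wa_22016122933;  /* models intel_gt_needs_wa_22016122933(gt) */
	bool is_meteorlake;         /* models IS_METEORLAKE(gt->i915) */
	bool has_llc;               /* models HAS_LLC(gt->i915) */
};

static enum map_type coherent_map_type(const struct gt_model *gt,
				       bool always_coherent)
{
	/* After the patch, every Meteor Lake GT takes the WC path first. */
	if (gt->object_in_lmem || gt->needs_wa_22016122933 || gt->is_meteorlake)
		return MAP_WC;
	if (gt->has_llc || always_coherent)
		return MAP_WB;
	/* Assumption: the fallthrough is not visible in the quoted hunk. */
	return MAP_WC;
}
```

In this model, a Meteor Lake GT returns `MAP_WC` even when `has_llc` or `always_coherent` is set, because the new check comes before the write-back branch. That ordering is what forces the platform onto the uncached path.
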
FIXME: CAT errors are cropping up on MTL. This removes them,
but the real root cause must still be diagnosed.

Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_gt.c     | 6 +++++-
 drivers/gpu/drm/i915/gt/intel_lrc.c    | 5 ++++-
 drivers/gpu/drm/i915/gt/uc/intel_guc.c | 5 ++++-
 3 files changed, 13 insertions(+), 3 deletions(-)