diff mbox series

[RFC] i915/gt: Reapply workarounds in case the previous attempt failed.

Message ID aqoql4ri3vpe4larpkz4p6hxy76agq6pmn6gunt5xv56hxdbye@72ilwk7rpiu5 (mailing list archive)
State New
Headers show
Series [RFC] i915/gt: Reapply workarounds in case the previous attempt failed. | expand

Commit Message

Sebastian Brzezinka Dec. 5, 2024, 3:47 p.m. UTC
`wa_verify`sporadically detects lost workaround on application; this
is unusual behavior since wa are applied at `intel_gt_init_hw` and
verified right away by `intel_gt_verify_workarounds`, and  `wa_verify`
doesn't fail on initialization as one might suspect would happen.

One approach that may be somewhat beneficial is to reapply workarounds
in the event of failure, or even get rid of verify on application,
since it's redundant to `intel_gt_verify_workarounds`.

This patch aims to resolve: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12668

Signed-off-by: Sebastian Brzezinka <sebastian.brzezinka@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_workarounds.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

Comments

Rodrigo Vivi Dec. 6, 2024, 3:38 p.m. UTC | #1
On Thu, Dec 05, 2024 at 03:47:35PM +0000, Sebastian Brzezinka wrote:
> `wa_verify`sporadically detects lost workaround on application; this
> is unusual behavior since wa are applied at `intel_gt_init_hw` and
> verified right away by `intel_gt_verify_workarounds`, and  `wa_verify`
> doesn't fail on initialization as one might suspect would happen.
> 
> One approach that may be somewhat beneficial is to reapply workarounds
> in the event of failure, or even get rid of verify on application,
> since it's redundant to `intel_gt_verify_workarounds`.
> 
> This patch aims to resolve: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12668

It should be:

Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12668

> 
> Signed-off-by: Sebastian Brzezinka <sebastian.brzezinka@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_workarounds.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c b/drivers/gpu/drm/i915/gt/intel_workarounds.c
> index 570c91878189..4ee623448223 100644
> --- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
> +++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
> @@ -1761,6 +1761,17 @@ static void wa_list_apply(const struct i915_wa_list *wal)
>  				intel_gt_mcr_read_any_fw(gt, wa->mcr_reg) :
>  				intel_uncore_read_fw(uncore, wa->reg);
>  
> +			if ((val ^ wa->set) & wa->read) { 
> +				if (wa->is_mcr)
> +					intel_gt_mcr_multicast_write_fw(gt, wa->mcr_reg, val);
> +				else
> +					intel_uncore_write_fw(uncore, wa->reg, val);
> +			}

instead of duplicating the code you should extract that to an aux function
to write it...

> +
> +			val = wa->is_mcr ?
> +				intel_gt_mcr_read_any_fw(gt, wa->mcr_reg) :
> +				intel_uncore_read_fw(uncore, wa->reg);

and another one to read it...

> +
>  			wa_verify(gt, wa, val, wal->name, "application");

However my biggest concern with this patch is the brute force solution
and only on CONFIG_DRM_I915_DEBUG_GEM case...

and as duplication because I see that the second write attempt is
already happening above if (val != old || !wa->clr)

So, something is not quite right in here and this deserves another alternative..


>  		}
>  	}
> -- 
> 2.34.1
>
Andi Shyti Dec. 12, 2024, 2:51 p.m. UTC | #2
Hi Rodrigo,

On Fri, Dec 06, 2024 at 10:38:24AM -0500, Rodrigo Vivi wrote:
> On Thu, Dec 05, 2024 at 03:47:35PM +0000, Sebastian Brzezinka wrote:
> > `wa_verify`sporadically detects lost workaround on application; this
> > is unusual behavior since wa are applied at `intel_gt_init_hw` and
> > verified right away by `intel_gt_verify_workarounds`, and  `wa_verify`
> > doesn't fail on initialization as one might suspect would happen.
> > 
> > One approach that may be somewhat beneficial is to reapply workarounds
> > in the event of failure, or even get rid of verify on application,
> > since it's redundant to `intel_gt_verify_workarounds`.
> > 
> > This patch aims to resolve: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12668
> 
> It should be:
> 
> Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12668

aapart from the formatting issues this was suggested by me. We
have observed some sporadic vailures in applying the specific
workaround added by Ville (now cc'ed to the thread) in commit
0ddae025ab6c ("drm/i915: Disable compression tricks on JSL").

Because it's sporadic, we could give it one more chance and try
to re-apply it.

> > 
> > Signed-off-by: Sebastian Brzezinka <sebastian.brzezinka@intel.com>
> > ---
> >  drivers/gpu/drm/i915/gt/intel_workarounds.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c b/drivers/gpu/drm/i915/gt/intel_workarounds.c
> > index 570c91878189..4ee623448223 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
> > @@ -1761,6 +1761,17 @@ static void wa_list_apply(const struct i915_wa_list *wal)
> >  				intel_gt_mcr_read_any_fw(gt, wa->mcr_reg) :
> >  				intel_uncore_read_fw(uncore, wa->reg);
> >  
> > +			if ((val ^ wa->set) & wa->read) { 
> > +				if (wa->is_mcr)
> > +					intel_gt_mcr_multicast_write_fw(gt, wa->mcr_reg, val);
> > +				else
> > +					intel_uncore_write_fw(uncore, wa->reg, val);
> > +			}
> 
> instead of duplicating the code you should extract that to an aux function
> to write it...

a for loop can decrease the amount of duplicated code.

> > +
> > +			val = wa->is_mcr ?
> > +				intel_gt_mcr_read_any_fw(gt, wa->mcr_reg) :
> > +				intel_uncore_read_fw(uncore, wa->reg);
> 
> and another one to read it...

this, indeed it's just reading, but we are trying to re-write. If
we wrote the unwanted value, we will keep reading the unwanted
value.

> > +
> >  			wa_verify(gt, wa, val, wal->name, "application");
> 
> However my biggest concern with this patch is the brute force solution
> and only on CONFIG_DRM_I915_DEBUG_GEM case...

this is a good point, indeed, I don't understand why the
confirmation should be within the DEBUG section.

Andi

> and as duplication because I see that the second write attempt is
> already happening above if (val != old || !wa->clr)
> 
> So, something is not quite right in here and this deserves another alternative..
> 
> 
> >  		}
> >  	}
> > -- 
> > 2.34.1
> >
Matt Roper Dec. 16, 2024, 9:27 p.m. UTC | #3
On Thu, Dec 12, 2024 at 03:51:12PM +0100, Andi Shyti wrote:
> Hi Rodrigo,
> 
> On Fri, Dec 06, 2024 at 10:38:24AM -0500, Rodrigo Vivi wrote:
> > On Thu, Dec 05, 2024 at 03:47:35PM +0000, Sebastian Brzezinka wrote:
> > > `wa_verify`sporadically detects lost workaround on application; this
> > > is unusual behavior since wa are applied at `intel_gt_init_hw` and
> > > verified right away by `intel_gt_verify_workarounds`, and  `wa_verify`
> > > doesn't fail on initialization as one might suspect would happen.
> > > 
> > > One approach that may be somewhat beneficial is to reapply workarounds
> > > in the event of failure, or even get rid of verify on application,
> > > since it's redundant to `intel_gt_verify_workarounds`.
> > > 
> > > This patch aims to resolve: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12668
> > 
> > It should be:
> > 
> > Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12668
> 
> aapart from the formatting issues this was suggested by me. We
> have observed some sporadic vailures in applying the specific
> workaround added by Ville (now cc'ed to the thread) in commit
> 0ddae025ab6c ("drm/i915: Disable compression tricks on JSL").
> 
> Because it's sporadic, we could give it one more chance and try
> to re-apply it.

That sounds like it's just papering over the issue without really
figuring out what's truly going on.

Looking at the current implementation, it looks like at least one
possible problem is that it was implemented in rcs_engine_wa_init, but
the CACHE_MODE_0 register itself is part of the LRC (according to bspec
18907).  So we want to move it to icl_ctx_workarounds_init() instead to
make sure it gets recorded in the golden context image.  Our
initialization and reset handling for workarounds touching registers in
the context are different from those that aren't.

BTW, I'm a bit surprised to see us needing to implement this workaround
in the kernel at all.  CACHE_MODE_0 is a register that's under userspace
control (according to bspec 14181), so we usually expect the userspace
drivers to own implementing any workarounds dealing with the registers
they control.  Indeed, it looks like Mesa's Iris driver already has an
implementation of this workaround in iris_state.c:

   if (devinfo->disable_ccs_repack) {
      iris_emit_reg(batch, GENX(CACHE_MODE_0), reg) {
         reg.DisableRepackingforCompression = true;
         reg.DisableRepackingforCompressionMask = true;
      }
   }

and that workaround was added back in mid-2019 so it should be in all
recent Mesa releases.


Matt

> 
> > > 
> > > Signed-off-by: Sebastian Brzezinka <sebastian.brzezinka@intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/gt/intel_workarounds.c | 11 +++++++++++
> > >  1 file changed, 11 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c b/drivers/gpu/drm/i915/gt/intel_workarounds.c
> > > index 570c91878189..4ee623448223 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
> > > @@ -1761,6 +1761,17 @@ static void wa_list_apply(const struct i915_wa_list *wal)
> > >  				intel_gt_mcr_read_any_fw(gt, wa->mcr_reg) :
> > >  				intel_uncore_read_fw(uncore, wa->reg);
> > >  
> > > +			if ((val ^ wa->set) & wa->read) { 
> > > +				if (wa->is_mcr)
> > > +					intel_gt_mcr_multicast_write_fw(gt, wa->mcr_reg, val);
> > > +				else
> > > +					intel_uncore_write_fw(uncore, wa->reg, val);
> > > +			}
> > 
> > instead of duplicating the code you should extract that to an aux function
> > to write it...
> 
> a for loop can decrease the amount of duplicated code.
> 
> > > +
> > > +			val = wa->is_mcr ?
> > > +				intel_gt_mcr_read_any_fw(gt, wa->mcr_reg) :
> > > +				intel_uncore_read_fw(uncore, wa->reg);
> > 
> > and another one to read it...
> 
> this, indeed it's just reading, but we are trying to re-write. If
> we wrote the unwanted value, we will keep reading the unwanted
> value.
> 
> > > +
> > >  			wa_verify(gt, wa, val, wal->name, "application");
> > 
> > However my biggest concern with this patch is the brute force solution
> > and only on CONFIG_DRM_I915_DEBUG_GEM case...
> 
> this is a good point, indeed, I don't understand why the
> confirmation should be within the DEBUG section.
> 
> Andi
> 
> > and as duplication because I see that the second write attempt is
> > already happening above if (val != old || !wa->clr)
> > 
> > So, something is not quite right in here and this deserves another alternative..
> > 
> > 
> > >  		}
> > >  	}
> > > -- 
> > > 2.34.1
> > >
Andi Shyti Dec. 18, 2024, 2:48 p.m. UTC | #4
Hi Matt,

On Mon, Dec 16, 2024 at 01:27:51PM -0800, Matt Roper wrote:
> On Thu, Dec 12, 2024 at 03:51:12PM +0100, Andi Shyti wrote:
> > On Fri, Dec 06, 2024 at 10:38:24AM -0500, Rodrigo Vivi wrote:
> > > On Thu, Dec 05, 2024 at 03:47:35PM +0000, Sebastian Brzezinka wrote:
> > > > `wa_verify`sporadically detects lost workaround on application; this
> > > > is unusual behavior since wa are applied at `intel_gt_init_hw` and
> > > > verified right away by `intel_gt_verify_workarounds`, and  `wa_verify`
> > > > doesn't fail on initialization as one might suspect would happen.
> > > > 
> > > > One approach that may be somewhat beneficial is to reapply workarounds
> > > > in the event of failure, or even get rid of verify on application,
> > > > since it's redundant to `intel_gt_verify_workarounds`.
> > > > 
> > > > This patch aims to resolve: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12668
> > > 
> > > It should be:
> > > 
> > > Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12668
> > 
> > aapart from the formatting issues this was suggested by me. We
> > have observed some sporadic vailures in applying the specific
> > workaround added by Ville (now cc'ed to the thread) in commit
> > 0ddae025ab6c ("drm/i915: Disable compression tricks on JSL").
> > 
> > Because it's sporadic, we could give it one more chance and try
> > to re-apply it.
> 
> That sounds like it's just papering over the issue without really
> figuring out what's truly going on.

Independently from your next comments and the implementation, the
sporadic errors we've seen have an extremely low rate and we
weren't able to make sense to them or fully test them.

We had a short chat with Ville (who implemented the workaround)
amd he suggested to leave it as it is, while I suggested to give
it another chance.

> Looking at the current implementation, it looks like at least one
> possible problem is that it was implemented in rcs_engine_wa_init, but
> the CACHE_MODE_0 register itself is part of the LRC (according to bspec
> 18907).  So we want to move it to icl_ctx_workarounds_init() instead to
> make sure it gets recorded in the golden context image.  Our
> initialization and reset handling for workarounds touching registers in
> the context are different from those that aren't.
> BTW, I'm a bit surprised to see us needing to implement this workaround
> in the kernel at all.  CACHE_MODE_0 is a register that's under userspace
> control (according to bspec 14181), so we usually expect the userspace
> drivers to own implementing any workarounds dealing with the registers
> they control.  Indeed, it looks like Mesa's Iris driver already has an
> implementation of this workaround in iris_state.c:
> 
>    if (devinfo->disable_ccs_repack) {
>       iris_emit_reg(batch, GENX(CACHE_MODE_0), reg) {
>          reg.DisableRepackingforCompression = true;
>          reg.DisableRepackingforCompressionMask = true;
>       }
>    }
> 
> and that workaround was added back in mid-2019 so it should be in all
> recent Mesa releases.

Ville? Any comment here?

Andi
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index 570c91878189..4ee623448223 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -1761,6 +1761,17 @@  static void wa_list_apply(const struct i915_wa_list *wal)
 				intel_gt_mcr_read_any_fw(gt, wa->mcr_reg) :
 				intel_uncore_read_fw(uncore, wa->reg);
 
+			if ((val ^ wa->set) & wa->read) { 
+				if (wa->is_mcr)
+					intel_gt_mcr_multicast_write_fw(gt, wa->mcr_reg, val);
+				else
+					intel_uncore_write_fw(uncore, wa->reg, val);
+			}
+
+			val = wa->is_mcr ?
+				intel_gt_mcr_read_any_fw(gt, wa->mcr_reg) :
+				intel_uncore_read_fw(uncore, wa->reg);
+
 			wa_verify(gt, wa, val, wal->name, "application");
 		}
 	}