diff mbox series

[2/3] drm/i915/gt: Fix context workarounds with non-masked regs

Message ID 20230622182731.3765039-2-lucas.demarchi@intel.com (mailing list archive)
State New, archived
Headers show
Series [1/3] drm/i915/gt: Move wal_get_fw_for_rmw() | expand

Commit Message

Lucas De Marchi June 22, 2023, 6:27 p.m. UTC
Most of the context workarounds tweak masked registers, but not all. For
masked registers, when writing the value it's sufficient to just write
the wa->set_bits since that will take care of both the clr and set bits
as well as not overwriting other bits.

However there are some workarounds, the registers are non-masked. Up
until now the driver was simply emitting a MI_LOAD_REGISTER_IMM with the
set_bits to program the register via the GPU in the WA bb. This has the
side effect of overwriting the content of the register outside of bits
that should be set and also doesn't handle the bits that should be
cleared.

Kenneth reported that on DG2, mesa was seeing a weird behavior due to
the kernel programming of L3SQCREG5 in dg2_ctx_gt_tuning_init(). With
the GPU idle, that register could be read via intel_reg as 0x00e001ff,
but during a 3D workload it would change to 0x0000007f. So the
programming of that tuning was affecting more than the bits in
L3_PWM_TIMER_INIT_VAL_MASK. Matt Roper noticed the lack of rmw for the
context workarounds due to the use of MI_LOAD_REGISTER_IMM.

So, for registers that are not masked, read its value via mmio, modify
and then set it in the buffer to be written by the GPU. This should take
care in a simple way of programming just the bits required by the
tuning/workaround. If in future there are registers that involved that
can't be read by the CPU, a more complex approach may be required like
a) issuing additional instructions to read and modify; or b) scan the
golden context and patch it in place before saving it; or something
else. But for now this should suffice.

Scanning the context workarounds for all platforms, these are the
impacted ones with the respective registers

	mtl: DRAW_WATERMARK
	mtl/dg2: XEHP_L3SQCREG5, XEHP_FF_MODE2
	gen12: GEN12_FF_MODE2

ICL has some non-masked registers in the context workarounds:
GEN8_L3CNTLREG, IVB_FBC_RT_BASE and VB_FBC_RT_BASE_UPPER, but there
shouldn't be an impact. The first is already being manually read and the
other 2 are intentionally overwriting the entire register.

Cc: Kenneth Graunke <kenneth@whitecape.org>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: stable@vger.kernel.org # v5.7+
Link: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23783#note_1968971
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_workarounds.c | 27 ++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

Comments

Kenneth Graunke June 22, 2023, 11:37 p.m. UTC | #1
On Thursday, June 22, 2023 11:27:30 AM PDT Lucas De Marchi wrote:
> Most of the context workarounds tweak masked registers, but not all. For
> masked registers, when writing the value it's sufficient to just write
> the wa->set_bits since that will take care of both the clr and set bits
> as well as not overwriting other bits.
> 
> However there are some workarounds, the registers are non-masked. Up
> until now the driver was simply emitting a MI_LOAD_REGISTER_IMM with the
> set_bits to program the register via the GPU in the WA bb. This has the
> side effect of overwriting the content of the register outside of bits
> that should be set and also doesn't handle the bits that should be
> cleared.
> 
> Kenneth reported that on DG2, mesa was seeing a weird behavior due to
> the kernel programming of L3SQCREG5 in dg2_ctx_gt_tuning_init(). With
> the GPU idle, that register could be read via intel_reg as 0x00e001ff,
> but during a 3D workload it would change to 0x0000007f. So the
> programming of that tuning was affecting more than the bits in
> L3_PWM_TIMER_INIT_VAL_MASK. Matt Roper noticed the lack of rmw for the
> context workarounds due to the use of MI_LOAD_REGISTER_IMM.
> 
> So, for registers that are not masked, read its value via mmio, modify
> and then set it in the buffer to be written by the GPU. This should take
> care in a simple way of programming just the bits required by the
> tuning/workaround. If in future there are registers that involved that
> can't be read by the CPU, a more complex approach may be required like
> a) issuing additional instructions to read and modify; or b) scan the
> golden context and patch it in place before saving it; or something
> else. But for now this should suffice.
> 
> Scanning the context workarounds for all platforms, these are the
> impacted ones with the respective registers
> 
> 	mtl: DRAW_WATERMARK
> 	mtl/dg2: XEHP_L3SQCREG5, XEHP_FF_MODE2
> 	gen12: GEN12_FF_MODE2

Speaking of GEN12_FF_MODE2...there's a big scary comment above that
workaround write which says that register "will return the wrong value
when read."  I think with this patch, we'll start doing a RMW cycle for
the register, which could mix in some of this "wrong value".  The
comment mentions that the intention is to write the whole register,
as the default value is 0 for all fields.

Maybe what we want to do is change gen12_ctx_gt_tuning_init to do

    wa_write(wal, GEN12_FF_MODE2, FF_MODE2_TDS_TIMER_128);

so it has a clear mask of ~0 instead of FF_MODE2_TDS_TIMER_MASK, and
then in this patch update your condition below from

+		if (wa->masked_reg || wa->set == U32_MAX) {

to

+		if (wa->masked_reg || wa->set == U32_MAX || wa->clear == U32_MAX) {

because if we're clearing all bits then we don't care about doing a
read-modify-write either.

--Ken
Lucas De Marchi June 23, 2023, 3:49 p.m. UTC | #2
On Thu, Jun 22, 2023 at 04:37:21PM -0700, Kenneth Graunke wrote:
>On Thursday, June 22, 2023 11:27:30 AM PDT Lucas De Marchi wrote:
>> Most of the context workarounds tweak masked registers, but not all. For
>> masked registers, when writing the value it's sufficient to just write
>> the wa->set_bits since that will take care of both the clr and set bits
>> as well as not overwriting other bits.
>>
>> However there are some workarounds, the registers are non-masked. Up
>> until now the driver was simply emitting a MI_LOAD_REGISTER_IMM with the
>> set_bits to program the register via the GPU in the WA bb. This has the
>> side effect of overwriting the content of the register outside of bits
>> that should be set and also doesn't handle the bits that should be
>> cleared.
>>
>> Kenneth reported that on DG2, mesa was seeing a weird behavior due to
>> the kernel programming of L3SQCREG5 in dg2_ctx_gt_tuning_init(). With
>> the GPU idle, that register could be read via intel_reg as 0x00e001ff,
>> but during a 3D workload it would change to 0x0000007f. So the
>> programming of that tuning was affecting more than the bits in
>> L3_PWM_TIMER_INIT_VAL_MASK. Matt Roper noticed the lack of rmw for the
>> context workarounds due to the use of MI_LOAD_REGISTER_IMM.
>>
>> So, for registers that are not masked, read its value via mmio, modify
>> and then set it in the buffer to be written by the GPU. This should take
>> care in a simple way of programming just the bits required by the
>> tuning/workaround. If in future there are registers that involved that
>> can't be read by the CPU, a more complex approach may be required like
>> a) issuing additional instructions to read and modify; or b) scan the
>> golden context and patch it in place before saving it; or something
>> else. But for now this should suffice.
>>
>> Scanning the context workarounds for all platforms, these are the
>> impacted ones with the respective registers
>>
>> 	mtl: DRAW_WATERMARK
>> 	mtl/dg2: XEHP_L3SQCREG5, XEHP_FF_MODE2
>> 	gen12: GEN12_FF_MODE2
>
>Speaking of GEN12_FF_MODE2...there's a big scary comment above that
>workaround write which says that register "will return the wrong value
>when read."  I think with this patch, we'll start doing a RMW cycle for
>the register, which could mix in some of this "wrong value".  The
>comment mentions that the intention is to write the whole register,
>as the default value is 0 for all fields.

Good point. That also means we don't need to backport this patch to
stable kernel to any gen12, since overwritting the other bits is
actually the intended behavior.

>
>Maybe what we want to do is change gen12_ctx_gt_tuning_init to do
>
>    wa_write(wal, GEN12_FF_MODE2, FF_MODE2_TDS_TIMER_128);
>
>so it has a clear mask of ~0 instead of FF_MODE2_TDS_TIMER_MASK, and

In order to ignore read back when verifying, we would still need to use
wa_add(), but changing the mask. We don't have a wa_write() that ends up
with { .clr = ~0, .read_mask = 0 }.

	wa_add(wal,
	       GEN12_FF_MODE2,
	       ~0, FF_MODE2_TDS_TIMER_128,
	       0, false);


>then in this patch update your condition below from
>
>+		if (wa->masked_reg || wa->set == U32_MAX) {
>
>to
>
>+		if (wa->masked_reg || wa->set == U32_MAX || wa->clear == U32_MAX) {

yeah... and maybe also warn if wa->read is 0, which means it's one
of the registers we can't/shouldn't read from the CPU.

>
>because if we're clearing all bits then we don't care about doing a
>read-modify-write either.

thanks
Lucas De Marchi

>
>--Ken
Kenneth Graunke June 23, 2023, 7:48 p.m. UTC | #3
On Friday, June 23, 2023 8:49:05 AM PDT Lucas De Marchi wrote:
> On Thu, Jun 22, 2023 at 04:37:21PM -0700, Kenneth Graunke wrote:
> >On Thursday, June 22, 2023 11:27:30 AM PDT Lucas De Marchi wrote:
> >> Most of the context workarounds tweak masked registers, but not all. For
> >> masked registers, when writing the value it's sufficient to just write
> >> the wa->set_bits since that will take care of both the clr and set bits
> >> as well as not overwriting other bits.
> >>
> >> However there are some workarounds, the registers are non-masked. Up
> >> until now the driver was simply emitting a MI_LOAD_REGISTER_IMM with the
> >> set_bits to program the register via the GPU in the WA bb. This has the
> >> side effect of overwriting the content of the register outside of bits
> >> that should be set and also doesn't handle the bits that should be
> >> cleared.
> >>
> >> Kenneth reported that on DG2, mesa was seeing a weird behavior due to
> >> the kernel programming of L3SQCREG5 in dg2_ctx_gt_tuning_init(). With
> >> the GPU idle, that register could be read via intel_reg as 0x00e001ff,
> >> but during a 3D workload it would change to 0x0000007f. So the
> >> programming of that tuning was affecting more than the bits in
> >> L3_PWM_TIMER_INIT_VAL_MASK. Matt Roper noticed the lack of rmw for the
> >> context workarounds due to the use of MI_LOAD_REGISTER_IMM.
> >>
> >> So, for registers that are not masked, read its value via mmio, modify
> >> and then set it in the buffer to be written by the GPU. This should take
> >> care in a simple way of programming just the bits required by the
> >> tuning/workaround. If in future there are registers that involved that
> >> can't be read by the CPU, a more complex approach may be required like
> >> a) issuing additional instructions to read and modify; or b) scan the
> >> golden context and patch it in place before saving it; or something
> >> else. But for now this should suffice.
> >>
> >> Scanning the context workarounds for all platforms, these are the
> >> impacted ones with the respective registers
> >>
> >> 	mtl: DRAW_WATERMARK
> >> 	mtl/dg2: XEHP_L3SQCREG5, XEHP_FF_MODE2
> >> 	gen12: GEN12_FF_MODE2
> >
> >Speaking of GEN12_FF_MODE2...there's a big scary comment above that
> >workaround write which says that register "will return the wrong value
> >when read."  I think with this patch, we'll start doing a RMW cycle for
> >the register, which could mix in some of this "wrong value".  The
> >comment mentions that the intention is to write the whole register,
> >as the default value is 0 for all fields.
> 
> Good point. That also means we don't need to backport this patch to
> stable kernel to any gen12, since overwritting the other bits is
> actually the intended behavior.
> 
> >
> >Maybe what we want to do is change gen12_ctx_gt_tuning_init to do
> >
> >    wa_write(wal, GEN12_FF_MODE2, FF_MODE2_TDS_TIMER_128);
> >
> >so it has a clear mask of ~0 instead of FF_MODE2_TDS_TIMER_MASK, and
> 
> In order to ignore read back when verifying, we would still need to use
> wa_add(), but changing the mask. We don't have a wa_write() that ends up
> with { .clr = ~0, .read_mask = 0 }.
> 
> 	wa_add(wal,
> 	       GEN12_FF_MODE2,
> 	       ~0, FF_MODE2_TDS_TIMER_128,
> 	       0, false);

Good point!  Though, I just noticed another bug here:

gen12_ctx_workarounds_init sets FF_MODE2_GS_TIMER_224 to avoid hangs
in the HS/DS unit, after gen12_ctx_gt_tuning_init set TDS_TIMER_128
for performance.  One of those is going to clobber the other; we're
likely losing the TDS tuning today.  Combining those workarounds into
one place seems like an easy way to fix that.

> >then in this patch update your condition below from
> >
> >+		if (wa->masked_reg || wa->set == U32_MAX) {
> >
> >to
> >
> >+		if (wa->masked_reg || wa->set == U32_MAX || wa->clear == U32_MAX) {
> 
> yeah... and maybe also warn if wa->read is 0, which means it's one
> of the registers we can't/shouldn't read from the CPU.
> 
> >
> >because if we're clearing all bits then we don't care about doing a
> >read-modify-write either.
> 
> thanks
> Lucas De Marchi
> 
> >
> >--Ken
> 
> 
>
Lucas De Marchi June 23, 2023, 9:05 p.m. UTC | #4
On Fri, Jun 23, 2023 at 12:48:13PM -0700, Kenneth Graunke wrote:
>On Friday, June 23, 2023 8:49:05 AM PDT Lucas De Marchi wrote:
>> On Thu, Jun 22, 2023 at 04:37:21PM -0700, Kenneth Graunke wrote:
>> >On Thursday, June 22, 2023 11:27:30 AM PDT Lucas De Marchi wrote:
>> >> Most of the context workarounds tweak masked registers, but not all. For
>> >> masked registers, when writing the value it's sufficient to just write
>> >> the wa->set_bits since that will take care of both the clr and set bits
>> >> as well as not overwriting other bits.
>> >>
>> >> However there are some workarounds, the registers are non-masked. Up
>> >> until now the driver was simply emitting a MI_LOAD_REGISTER_IMM with the
>> >> set_bits to program the register via the GPU in the WA bb. This has the
>> >> side effect of overwriting the content of the register outside of bits
>> >> that should be set and also doesn't handle the bits that should be
>> >> cleared.
>> >>
>> >> Kenneth reported that on DG2, mesa was seeing a weird behavior due to
>> >> the kernel programming of L3SQCREG5 in dg2_ctx_gt_tuning_init(). With
>> >> the GPU idle, that register could be read via intel_reg as 0x00e001ff,
>> >> but during a 3D workload it would change to 0x0000007f. So the
>> >> programming of that tuning was affecting more than the bits in
>> >> L3_PWM_TIMER_INIT_VAL_MASK. Matt Roper noticed the lack of rmw for the
>> >> context workarounds due to the use of MI_LOAD_REGISTER_IMM.
>> >>
>> >> So, for registers that are not masked, read its value via mmio, modify
>> >> and then set it in the buffer to be written by the GPU. This should take
>> >> care in a simple way of programming just the bits required by the
>> >> tuning/workaround. If in future there are registers that involved that
>> >> can't be read by the CPU, a more complex approach may be required like
>> >> a) issuing additional instructions to read and modify; or b) scan the
>> >> golden context and patch it in place before saving it; or something
>> >> else. But for now this should suffice.
>> >>
>> >> Scanning the context workarounds for all platforms, these are the
>> >> impacted ones with the respective registers
>> >>
>> >> 	mtl: DRAW_WATERMARK
>> >> 	mtl/dg2: XEHP_L3SQCREG5, XEHP_FF_MODE2
>> >> 	gen12: GEN12_FF_MODE2
>> >
>> >Speaking of GEN12_FF_MODE2...there's a big scary comment above that
>> >workaround write which says that register "will return the wrong value
>> >when read."  I think with this patch, we'll start doing a RMW cycle for
>> >the register, which could mix in some of this "wrong value".  The
>> >comment mentions that the intention is to write the whole register,
>> >as the default value is 0 for all fields.
>>
>> Good point. That also means we don't need to backport this patch to
>> stable kernel to any gen12, since overwritting the other bits is
>> actually the intended behavior.
>>
>> >
>> >Maybe what we want to do is change gen12_ctx_gt_tuning_init to do
>> >
>> >    wa_write(wal, GEN12_FF_MODE2, FF_MODE2_TDS_TIMER_128);
>> >
>> >so it has a clear mask of ~0 instead of FF_MODE2_TDS_TIMER_MASK, and
>>
>> In order to ignore read back when verifying, we would still need to use
>> wa_add(), but changing the mask. We don't have a wa_write() that ends up
>> with { .clr = ~0, .read_mask = 0 }.
>>
>> 	wa_add(wal,
>> 	       GEN12_FF_MODE2,
>> 	       ~0, FF_MODE2_TDS_TIMER_128,
>> 	       0, false);
>
>Good point!  Though, I just noticed another bug here:
>
>gen12_ctx_workarounds_init sets FF_MODE2_GS_TIMER_224 to avoid hangs
>in the HS/DS unit, after gen12_ctx_gt_tuning_init set TDS_TIMER_128
>for performance.  One of those is going to clobber the other; we're
>likely losing the TDS tuning today.  Combining those workarounds into

we are not losing it today. As long as the wa list is the same, we do detect collisions when
adding workarounds and they are coallesced before applying. However,
indeed if we change this to make clear be ~0, then they will collide and
we will see a warning.

Applying them together in a single operation would indeed solve it
with a side-effect of moving this back to the workarounds. Either that
or

a) we handle the read_back == 0 && clear == U32_MAX specially when
    adding WAs. If that is true, then the check for collisions can
    be adjusted to allow that.

b) we give up on this approach and proceed with one of

	1) scan the ctx wa list. If it has any non-masked register,
	   we submit a job to read it from the GPU side. MCR will
	   make this harder as the steering from the GPU side is
	   different than the CPU

	2) emit additional commands to read and modify the register from
	   the GPU side

	3) find the register in the golden context and patch it in place




>one place seems like an easy way to fix that.

I'm leaning towards this option in the hope we don't have have
another GEN12_FF_MODE2 in future.

Matt, we've been pushing towards separating the tuning from the WAs, but
here we'd go the other way. Anything against doing this for now?

thanks
Lucas De Marchi

>
>> >then in this patch update your condition below from
>> >
>> >+		if (wa->masked_reg || wa->set == U32_MAX) {
>> >
>> >to
>> >
>> >+		if (wa->masked_reg || wa->set == U32_MAX || wa->clear == U32_MAX) {
>>
>> yeah... and maybe also warn if wa->read is 0, which means it's one
>> of the registers we can't/shouldn't read from the CPU.
>>
>> >
>> >because if we're clearing all bits then we don't care about doing a
>> >read-modify-write either.
>>
>> thanks
>> Lucas De Marchi
>>
>> >
>> >--Ken
>>
>>
>>
>
Matt Roper June 23, 2023, 9:56 p.m. UTC | #5
On Fri, Jun 23, 2023 at 02:05:20PM -0700, Lucas De Marchi wrote:
> On Fri, Jun 23, 2023 at 12:48:13PM -0700, Kenneth Graunke wrote:
> > On Friday, June 23, 2023 8:49:05 AM PDT Lucas De Marchi wrote:
> > > On Thu, Jun 22, 2023 at 04:37:21PM -0700, Kenneth Graunke wrote:
> > > >On Thursday, June 22, 2023 11:27:30 AM PDT Lucas De Marchi wrote:
> > > >> Most of the context workarounds tweak masked registers, but not all. For
> > > >> masked registers, when writing the value it's sufficient to just write
> > > >> the wa->set_bits since that will take care of both the clr and set bits
> > > >> as well as not overwriting other bits.
> > > >>
> > > >> However there are some workarounds, the registers are non-masked. Up
> > > >> until now the driver was simply emitting a MI_LOAD_REGISTER_IMM with the
> > > >> set_bits to program the register via the GPU in the WA bb. This has the
> > > >> side effect of overwriting the content of the register outside of bits
> > > >> that should be set and also doesn't handle the bits that should be
> > > >> cleared.
> > > >>
> > > >> Kenneth reported that on DG2, mesa was seeing a weird behavior due to
> > > >> the kernel programming of L3SQCREG5 in dg2_ctx_gt_tuning_init(). With
> > > >> the GPU idle, that register could be read via intel_reg as 0x00e001ff,
> > > >> but during a 3D workload it would change to 0x0000007f. So the
> > > >> programming of that tuning was affecting more than the bits in
> > > >> L3_PWM_TIMER_INIT_VAL_MASK. Matt Roper noticed the lack of rmw for the
> > > >> context workarounds due to the use of MI_LOAD_REGISTER_IMM.
> > > >>
> > > >> So, for registers that are not masked, read its value via mmio, modify
> > > >> and then set it in the buffer to be written by the GPU. This should take
> > > >> care in a simple way of programming just the bits required by the
> > > >> tuning/workaround. If in future there are registers that involved that
> > > >> can't be read by the CPU, a more complex approach may be required like
> > > >> a) issuing additional instructions to read and modify; or b) scan the
> > > >> golden context and patch it in place before saving it; or something
> > > >> else. But for now this should suffice.
> > > >>
> > > >> Scanning the context workarounds for all platforms, these are the
> > > >> impacted ones with the respective registers
> > > >>
> > > >> 	mtl: DRAW_WATERMARK
> > > >> 	mtl/dg2: XEHP_L3SQCREG5, XEHP_FF_MODE2
> > > >> 	gen12: GEN12_FF_MODE2
> > > >
> > > >Speaking of GEN12_FF_MODE2...there's a big scary comment above that
> > > >workaround write which says that register "will return the wrong value
> > > >when read."  I think with this patch, we'll start doing a RMW cycle for
> > > >the register, which could mix in some of this "wrong value".  The
> > > >comment mentions that the intention is to write the whole register,
> > > >as the default value is 0 for all fields.
> > > 
> > > Good point. That also means we don't need to backport this patch to
> > > stable kernel to any gen12, since overwritting the other bits is
> > > actually the intended behavior.
> > > 
> > > >
> > > >Maybe what we want to do is change gen12_ctx_gt_tuning_init to do
> > > >
> > > >    wa_write(wal, GEN12_FF_MODE2, FF_MODE2_TDS_TIMER_128);
> > > >
> > > >so it has a clear mask of ~0 instead of FF_MODE2_TDS_TIMER_MASK, and
> > > 
> > > In order to ignore read back when verifying, we would still need to use
> > > wa_add(), but changing the mask. We don't have a wa_write() that ends up
> > > with { .clr = ~0, .read_mask = 0 }.
> > > 
> > > 	wa_add(wal,
> > > 	       GEN12_FF_MODE2,
> > > 	       ~0, FF_MODE2_TDS_TIMER_128,
> > > 	       0, false);
> > 
> > Good point!  Though, I just noticed another bug here:
> > 
> > gen12_ctx_workarounds_init sets FF_MODE2_GS_TIMER_224 to avoid hangs
> > in the HS/DS unit, after gen12_ctx_gt_tuning_init set TDS_TIMER_128
> > for performance.  One of those is going to clobber the other; we're
> > likely losing the TDS tuning today.  Combining those workarounds into
> 
> we are not losing it today. As long as the wa list is the same, we do detect collisions when
> adding workarounds and they are coallesced before applying. However,
> indeed if we change this to make clear be ~0, then they will collide and
> we will see a warning.
> 
> Applying them together in a single operation would indeed solve it
> with a side-effect of moving this back to the workarounds. Either that
> or
> 
> a) we handle the read_back == 0 && clear == U32_MAX specially when
>    adding WAs. If that is true, then the check for collisions can
>    be adjusted to allow that.
> 
> b) we give up on this approach and proceed with one of
> 
> 	1) scan the ctx wa list. If it has any non-masked register,
> 	   we submit a job to read it from the GPU side. MCR will
> 	   make this harder as the steering from the GPU side is
> 	   different than the CPU
> 
> 	2) emit additional commands to read and modify the register from
> 	   the GPU side
> 
> 	3) find the register in the golden context and patch it in place
> 
> 
> 
> 
> > one place seems like an easy way to fix that.
> 
> I'm leaning towards this option in the hope we don't have have
> another GEN12_FF_MODE2 in future.
> 
> Matt, we've been pushing towards separating the tuning from the WAs, but
> here we'd go the other way. Anything against doing this for now?

That's probably fine as long as we leave a comment behind in the tuning
section explaining why that specific setting is found in a different
spot.


Matt


> 
> thanks
> Lucas De Marchi
> 
> > 
> > > >then in this patch update your condition below from
> > > >
> > > >+		if (wa->masked_reg || wa->set == U32_MAX) {
> > > >
> > > >to
> > > >
> > > >+		if (wa->masked_reg || wa->set == U32_MAX || wa->clear == U32_MAX) {
> > > 
> > > yeah... and maybe also warn if wa->read is 0, which means it's one
> > > of the registers we can't/shouldn't read from the CPU.
> > > 
> > > >
> > > >because if we're clearing all bits then we don't care about doing a
> > > >read-modify-write either.
> > > 
> > > thanks
> > > Lucas De Marchi
> > > 
> > > >
> > > >--Ken
> > > 
> > > 
> > > 
> > 
> 
>
Lucas De Marchi June 24, 2023, 6:12 p.m. UTC | #6
On Fri, Jun 23, 2023 at 02:56:46PM -0700, Matt Roper wrote:
>On Fri, Jun 23, 2023 at 02:05:20PM -0700, Lucas De Marchi wrote:
>> On Fri, Jun 23, 2023 at 12:48:13PM -0700, Kenneth Graunke wrote:
>> > On Friday, June 23, 2023 8:49:05 AM PDT Lucas De Marchi wrote:
>> > > On Thu, Jun 22, 2023 at 04:37:21PM -0700, Kenneth Graunke wrote:
>> > > >On Thursday, June 22, 2023 11:27:30 AM PDT Lucas De Marchi wrote:
>> > > >> Most of the context workarounds tweak masked registers, but not all. For
>> > > >> masked registers, when writing the value it's sufficient to just write
>> > > >> the wa->set_bits since that will take care of both the clr and set bits
>> > > >> as well as not overwriting other bits.
>> > > >>
>> > > >> However there are some workarounds, the registers are non-masked. Up
>> > > >> until now the driver was simply emitting a MI_LOAD_REGISTER_IMM with the
>> > > >> set_bits to program the register via the GPU in the WA bb. This has the
>> > > >> side effect of overwriting the content of the register outside of bits
>> > > >> that should be set and also doesn't handle the bits that should be
>> > > >> cleared.
>> > > >>
>> > > >> Kenneth reported that on DG2, mesa was seeing a weird behavior due to
>> > > >> the kernel programming of L3SQCREG5 in dg2_ctx_gt_tuning_init(). With
>> > > >> the GPU idle, that register could be read via intel_reg as 0x00e001ff,
>> > > >> but during a 3D workload it would change to 0x0000007f. So the
>> > > >> programming of that tuning was affecting more than the bits in
>> > > >> L3_PWM_TIMER_INIT_VAL_MASK. Matt Roper noticed the lack of rmw for the
>> > > >> context workarounds due to the use of MI_LOAD_REGISTER_IMM.
>> > > >>
>> > > >> So, for registers that are not masked, read its value via mmio, modify
>> > > >> and then set it in the buffer to be written by the GPU. This should take
>> > > >> care in a simple way of programming just the bits required by the
>> > > >> tuning/workaround. If in future there are registers that involved that
>> > > >> can't be read by the CPU, a more complex approach may be required like
>> > > >> a) issuing additional instructions to read and modify; or b) scan the
>> > > >> golden context and patch it in place before saving it; or something
>> > > >> else. But for now this should suffice.
>> > > >>
>> > > >> Scanning the context workarounds for all platforms, these are the
>> > > >> impacted ones with the respective registers
>> > > >>
>> > > >> 	mtl: DRAW_WATERMARK
>> > > >> 	mtl/dg2: XEHP_L3SQCREG5, XEHP_FF_MODE2
>> > > >> 	gen12: GEN12_FF_MODE2
>> > > >
>> > > >Speaking of GEN12_FF_MODE2...there's a big scary comment above that
>> > > >workaround write which says that register "will return the wrong value
>> > > >when read."  I think with this patch, we'll start doing a RMW cycle for
>> > > >the register, which could mix in some of this "wrong value".  The
>> > > >comment mentions that the intention is to write the whole register,
>> > > >as the default value is 0 for all fields.
>> > >
>> > > Good point. That also means we don't need to backport this patch to
>> > > stable kernel to any gen12, since overwritting the other bits is
>> > > actually the intended behavior.
>> > >
>> > > >
>> > > >Maybe what we want to do is change gen12_ctx_gt_tuning_init to do
>> > > >
>> > > >    wa_write(wal, GEN12_FF_MODE2, FF_MODE2_TDS_TIMER_128);
>> > > >
>> > > >so it has a clear mask of ~0 instead of FF_MODE2_TDS_TIMER_MASK, and
>> > >
>> > > In order to ignore read back when verifying, we would still need to use
>> > > wa_add(), but changing the mask. We don't have a wa_write() that ends up
>> > > with { .clr = ~0, .read_mask = 0 }.
>> > >
>> > > 	wa_add(wal,
>> > > 	       GEN12_FF_MODE2,
>> > > 	       ~0, FF_MODE2_TDS_TIMER_128,
>> > > 	       0, false);
>> >
>> > Good point!  Though, I just noticed another bug here:
>> >
>> > gen12_ctx_workarounds_init sets FF_MODE2_GS_TIMER_224 to avoid hangs
>> > in the HS/DS unit, after gen12_ctx_gt_tuning_init set TDS_TIMER_128
>> > for performance.  One of those is going to clobber the other; we're
>> > likely losing the TDS tuning today.  Combining those workarounds into
>>
>> we are not losing it today. As long as the wa list is the same, we do detect collisions when
>> adding workarounds and they are coallesced before applying. However,
>> indeed if we change this to make clear be ~0, then they will collide and
>> we will see a warning.
>>
>> Applying them together in a single operation would indeed solve it
>> with a side-effect of moving this back to the workarounds. Either that
>> or
>>
>> a) we handle the read_back == 0 && clear == U32_MAX specially when
>>    adding WAs. If that is true, then the check for collisions can
>>    be adjusted to allow that.
>>
>> b) we give up on this approach and proceed with one of
>>
>> 	1) scan the ctx wa list. If it has any non-masked register,
>> 	   we submit a job to read it from the GPU side. MCR will
>> 	   make this harder as the steering from the GPU side is
>> 	   different than the CPU
>>
>> 	2) emit additional commands to read and modify the register from
>> 	   the GPU side
>>
>> 	3) find the register in the golden context and patch it in place
>>
>>
>>
>>
>> > one place seems like an easy way to fix that.
>>
>> I'm leaning towards this option in the hope we don't have have
>> another GEN12_FF_MODE2 in future.
>>
>> Matt, we've been pushing towards separating the tuning from the WAs, but
>> here we'd go the other way. Anything against doing this for now?
>
>That's probably fine as long as we leave a comment behind in the tuning
>section explaining why that specific setting is found in a different
>spot.

alright, just submitted a new version with a few more changes.

thanks
Lucas De Marchi
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index 0578fc2c9e60..a013f245a790 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -1003,6 +1003,9 @@  void intel_engine_init_ctx_wa(struct intel_engine_cs *engine)
 int intel_engine_emit_ctx_wa(struct i915_request *rq)
 {
 	struct i915_wa_list *wal = &rq->engine->ctx_wa_list;
+	struct intel_uncore *uncore = rq->engine->uncore;
+	enum forcewake_domains fw;
+	unsigned long flags;
 	struct i915_wa *wa;
 	unsigned int i;
 	u32 *cs;
@@ -1019,13 +1022,35 @@  int intel_engine_emit_ctx_wa(struct i915_request *rq)
 	if (IS_ERR(cs))
 		return PTR_ERR(cs);
 
+	fw = wal_get_fw_for_rmw(uncore, wal);
+
+	intel_gt_mcr_lock(wal->gt, &flags);
+	spin_lock(&uncore->lock);
+	intel_uncore_forcewake_get__locked(uncore, fw);
+
 	*cs++ = MI_LOAD_REGISTER_IMM(wal->count);
 	for (i = 0, wa = wal->list; i < wal->count; i++, wa++) {
+		u32 val;
+
+		if (wa->masked_reg || wa->set == U32_MAX) {
+			val = wa->set;
+		} else {
+			val = wa->is_mcr ?
+				intel_gt_mcr_read_any_fw(wal->gt, wa->mcr_reg) :
+				intel_uncore_read_fw(uncore, wa->reg);
+			val &= ~wa->clr;
+			val |= wa->set;
+		}
+
 		*cs++ = i915_mmio_reg_offset(wa->reg);
-		*cs++ = wa->set;
+		*cs++ = val;
 	}
 	*cs++ = MI_NOOP;
 
+	intel_uncore_forcewake_put__locked(uncore, fw);
+	spin_unlock(&uncore->lock);
+	intel_gt_mcr_unlock(wal->gt, flags);
+
 	intel_ring_advance(rq, cs);
 
 	ret = rq->engine->emit_flush(rq, EMIT_BARRIER);