[RFC] drm/i915/tgl: Advanced preparser support for GPU relocs

Message ID 20190823020909.6029-1-daniele.ceraolospurio@intel.com
State New

Commit Message

Daniele Ceraolo Spurio Aug. 23, 2019, 2:09 a.m. UTC
TGL has an improved CS pre-parser that can now pre-fetch commands across
batch boundaries. This improves performance when lots of small batches
are used, but has an impact on self-modifying code. If we want to modify
the content of a batch from another ring/batch, we need to either
guarantee that the memory location is updated before the pre-parser gets
to it or we need to turn the pre-parser off around the modification.
In i915, we use self-modifying code only for GPU relocations.

The pre-parser fetches across memory synchronization commands as well,
so the only way to guarantee that the writes land before the parser gets
to it is to have more instructions between the sync and the destination
than the parser FIFO depth, which is not an optimal solution.
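
The FIFO-depth argument above can be pictured with a toy model. This is
only a sketch under stated assumptions: FIFO_DEPTH, the command encoding
and every name below are invented for illustration and do not describe
the real hardware. The store and the memory sync are collapsed into one
TOY_STORE_SYNC command, and the parser snapshots commands ahead of
execution without honouring that sync.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy values only: the real parser depth is not stated in this thread. */
#define FIFO_DEPTH 4
#define TOY_MAX_CMDS 64

enum toy_op { TOY_NOOP, TOY_STORE_SYNC, TOY_PAYLOAD };

struct toy_cmd {
	enum toy_op op;
	size_t dst;	/* TOY_STORE_SYNC: index of the command to overwrite */
	int value;	/* TOY_STORE_SYNC: new payload; TOY_PAYLOAD: payload */
};

/*
 * Execute 'mem' front to back. With the pre-parser on, commands up to
 * FIFO_DEPTH slots ahead of the current one are snapshotted before the
 * current one runs, so later writes to those slots are not observed.
 * Returns the payload value that was actually executed at 'target'.
 */
static int toy_run(struct toy_cmd *mem, size_t n, size_t target,
		   bool preparser_on)
{
	struct toy_cmd fetched[TOY_MAX_CMDS];
	size_t ahead = preparser_on ? FIFO_DEPTH : 0;
	size_t fetch_head = 0;
	int seen = -1;

	assert(n <= TOY_MAX_CMDS && target < n);

	for (size_t pc = 0; pc < n; pc++) {
		/* Snapshot memory for every slot inside the fetch window. */
		while (fetch_head < n && fetch_head <= pc + ahead) {
			fetched[fetch_head] = mem[fetch_head];
			fetch_head++;
		}

		/* Execute the snapshot, not the current memory contents. */
		if (fetched[pc].op == TOY_STORE_SYNC) {
			assert(fetched[pc].dst < n);
			mem[fetched[pc].dst].op = TOY_PAYLOAD;
			mem[fetched[pc].dst].value = fetched[pc].value;
		}
		if (pc == target)
			seen = fetched[pc].value;
	}
	return seen;
}
```

In this model, a store at slot 0 that targets slot 2 is not seen while
the parser is on, because slot 2 was already fetched. It is seen with
the parser off, or when the target is pushed beyond the fetch window
(slot 6 with a depth of 4), which is the padding approach described
above as not optimal.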

The parser can be disabled either globally (from GFX_MODE) or at the
context level, using a new flag in the ARB_CHECK command. When
re-enabled, the parser turns on at the next arbitration point (ARB_CHECK
is not an arbitration point when the parser flag is set). The command is
not privileged, so the status can be changed by user batches as well.

To cope with this new HW feature, this patch turns off the parser when
GPU relocs are emitted and conditionally turns it back on after the
emission, before the user batch is started. The original status of the
parser, which is stored in RING_INSTPM, is used to decide whether to
re-enable the parser or not. This ensures that we don't turn the parser
back on if userspace had decided to disable it.
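
The ordering described above can be sketched roughly as follows. This is
not the code from the patch: the SK_* tokens and sk_emit_relocs() are
hypothetical names made up for this illustration, and the previous
parser state is passed in as a plain boolean rather than being derived
from RING_INSTPM as the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command tokens for this sketch; not real i915 encodings. */
enum {
	SK_PARSER_OFF  = 0x1,
	SK_PARSER_ON   = 0x2,
	SK_RELOC_WRITE = 0x3,
	SK_BATCH_START = 0x4,
};

/*
 * Fill 'cs' with: parser off, 'nrelocs' relocation writes, an optional
 * parser on (only if it was on beforehand), then the user batch start.
 * Returns the number of tokens written.
 */
static size_t sk_emit_relocs(uint32_t *cs, size_t max, size_t nrelocs,
			     bool parser_was_on)
{
	size_t n = 0;

	/* Worst case: off + relocs + on + batch start. */
	assert(nrelocs + 3 <= max);

	cs[n++] = SK_PARSER_OFF;
	for (size_t i = 0; i < nrelocs; i++)
		cs[n++] = SK_RELOC_WRITE;

	/* Respect a userspace decision to keep the parser disabled. */
	if (parser_was_on)
		cs[n++] = SK_PARSER_ON;

	cs[n++] = SK_BATCH_START;
	return n;
}
```

With two relocs and the parser previously on, the sketch produces
off, write, write, on, start; with the parser previously off, the "on"
token is simply omitted, so the user's choice survives the relocation
pass.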

Note that with this patch, the parser defaults to on for all contexts,
which might regress legacy userspace applications that use self-modifying
code. However, if we turn it off by default, e.g. with an ARB_CHECK as
first cmd on the ring, we will regress performance, because even legacy
parsing capabilities are disabled in this scenario.

Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 45 ++++++++++++--
 drivers/gpu/drm/i915/gt/intel_engine.h        |  9 +++
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  4 ++
 drivers/gpu/drm/i915/gt/intel_gpu_commands.h  | 17 +++++-
 drivers/gpu/drm/i915/gt/intel_lrc.c           | 60 +++++++++++++++++++
 drivers/gpu/drm/i915/gt/intel_timeline.c      | 13 +++-
 drivers/gpu/drm/i915/i915_reg.h               |  1 +
 7 files changed, 141 insertions(+), 8 deletions(-)

Comments

Chris Wilson Aug. 23, 2019, 7:27 a.m. UTC | #1
Quoting Daniele Ceraolo Spurio (2019-08-23 03:09:09)
> TGL has an improved CS pre-parser that can now pre-fetch commands across
> batch boundaries. This improves performances when lots of small batches
> are used, but has an impact on self-modifying code. If we want to modify
> the content of a batch from another ring/batch, we need to either
> guarantee that the memory location is updated before the pre-parser gets
> to it or we need to turn the pre-parser off around the modification.
> In i915, we use self-modifying code only for GPU relocations.
> 
> The pre-parser fetches across memory synchronization commands as well,
> so the only way to guarantee that the writes land before the parser gets
> to it is to have more instructions between the sync and the destination
> than the parser FIFO depth, which is not an optimal solution.

Well, our ABI is that memory is coherent before the breadcrumb of *each*
batch. That is a fundamental requirement for our signaling to userspace.
Please tell me that there is a context flag to turn this off, or else
we need to emit 32x flushes or whatever it takes.
-Chris
Chris Wilson Aug. 23, 2019, 2:26 p.m. UTC | #2
Quoting Chris Wilson (2019-08-23 08:27:25)
> Quoting Daniele Ceraolo Spurio (2019-08-23 03:09:09)
> > TGL has an improved CS pre-parser that can now pre-fetch commands across
> > batch boundaries. This improves performances when lots of small batches
> > are used, but has an impact on self-modifying code. If we want to modify
> > the content of a batch from another ring/batch, we need to either
> > guarantee that the memory location is updated before the pre-parser gets
> > to it or we need to turn the pre-parser off around the modification.
> > In i915, we use self-modifying code only for GPU relocations.
> > 
> > The pre-parser fetches across memory synchronization commands as well,
> > so the only way to guarantee that the writes land before the parser gets
> > to it is to have more instructions between the sync and the destination
> > than the parser FIFO depth, which is not an optimal solution.
> 
> Well, our ABI is that memory is coherent before the breadcrumb of *each*
> batch. That is a fundamental requirement for our signaling to userspace.
> Please tell me that there is a context flag to turn this off, or we else
> we need to emit 32x flushes or whatever it takes.

So looking at what you are doing, it seems entirely possible that we can
switch off the preparser for the breadcrumb -- is that enough to make
that final signal coherent and provide the barrier required for the
invalidation at the start of the next? (You might even only enable the
preparser around userspace batches.) Or I hope they have an extra flush
bit for correct serialisation.
-Chris
Daniele Ceraolo Spurio Aug. 23, 2019, 3:05 p.m. UTC | #3
On 8/23/19 7:26 AM, Chris Wilson wrote:
> Quoting Chris Wilson (2019-08-23 08:27:25)
>> Quoting Daniele Ceraolo Spurio (2019-08-23 03:09:09)
>>> TGL has an improved CS pre-parser that can now pre-fetch commands across
>>> batch boundaries. This improves performances when lots of small batches
>>> are used, but has an impact on self-modifying code. If we want to modify
>>> the content of a batch from another ring/batch, we need to either
>>> guarantee that the memory location is updated before the pre-parser gets
>>> to it or we need to turn the pre-parser off around the modification.
>>> In i915, we use self-modifying code only for GPU relocations.
>>>
>>> The pre-parser fetches across memory synchronization commands as well,
>>> so the only way to guarantee that the writes land before the parser gets
>>> to it is to have more instructions between the sync and the destination
>>> than the parser FIFO depth, which is not an optimal solution.
>>
>> Well, our ABI is that memory is coherent before the breadcrumb of *each*
>> batch. That is a fundamental requirement for our signaling to userspace.
>> Please tell me that there is a context flag to turn this off, or we else
>> we need to emit 32x flushes or whatever it takes.
> 
Are you referring to the specific case where we have a request modifying 
an object that is then used as a batch in the next request? Because 
coherency of objects that are not executed as batches is not impacted.

> So looking at what you are doing, it seems entirely possible that we can
> switch off the preparser for the breadcrumb -- is that enough to make
> that final signal coherent and provide the barrier required for the
> invalidation at the start of the next? (You might even only enable the
> preparser around userspace batches.) Or I hope they have an extra flush
> bit for correct serialisation.

The instructions I got from the HW team on how to handle
self-modifying code say that the pre-parser must be disabled before the
write is emitted and re-enabled afterward, so I'm not sure if having it
off just around the breadcrumb is enough; we might need an extra
BBSTART in the breadcrumb to flush the parser status. Should we just
keep the parser off by default and have the userspace app opt in (via
an ARB_CHECK in the batch) if they know they can handle it?

Daniele

> -Chris
>
Chris Wilson Aug. 23, 2019, 3:10 p.m. UTC | #4
Quoting Daniele Ceraolo Spurio (2019-08-23 16:05:45)
> 
> 
> On 8/23/19 7:26 AM, Chris Wilson wrote:
> > Quoting Chris Wilson (2019-08-23 08:27:25)
> >> Quoting Daniele Ceraolo Spurio (2019-08-23 03:09:09)
> >>> TGL has an improved CS pre-parser that can now pre-fetch commands across
> >>> batch boundaries. This improves performances when lots of small batches
> >>> are used, but has an impact on self-modifying code. If we want to modify
> >>> the content of a batch from another ring/batch, we need to either
> >>> guarantee that the memory location is updated before the pre-parser gets
> >>> to it or we need to turn the pre-parser off around the modification.
> >>> In i915, we use self-modifying code only for GPU relocations.
> >>>
> >>> The pre-parser fetches across memory synchronization commands as well,
> >>> so the only way to guarantee that the writes land before the parser gets
> >>> to it is to have more instructions between the sync and the destination
> >>> than the parser FIFO depth, which is not an optimal solution.
> >>
> >> Well, our ABI is that memory is coherent before the breadcrumb of *each*
> >> batch. That is a fundamental requirement for our signaling to userspace.
> >> Please tell me that there is a context flag to turn this off, or we else
> >> we need to emit 32x flushes or whatever it takes.
> > 
> Are you referring to the specific case where we have a request modifying 
> an object that is then used as a batch in the next request? Because 
> coherency of objects that are not executed as batches is not impacted.

"Fetches across memory sync" sounds like a major ABI break. The batches
are a hard serialisation barrier, with memory coherency guaranteed prior
to the signaling at the end of one batch and clear caches guaranteed at
the start of the next.

There are mutterings about a weaker mode, but the above is our existing
contract. There is nothing special about the relocation code; it is
assuming our contract holds.
-Chris
Chris Wilson Aug. 23, 2019, 3:28 p.m. UTC | #5
Quoting Chris Wilson (2019-08-23 16:10:48)
> Quoting Daniele Ceraolo Spurio (2019-08-23 16:05:45)
> > 
> > 
> > On 8/23/19 7:26 AM, Chris Wilson wrote:
> > > Quoting Chris Wilson (2019-08-23 08:27:25)
> > >> Quoting Daniele Ceraolo Spurio (2019-08-23 03:09:09)
> > >>> TGL has an improved CS pre-parser that can now pre-fetch commands across
> > >>> batch boundaries. This improves performances when lots of small batches
> > >>> are used, but has an impact on self-modifying code. If we want to modify
> > >>> the content of a batch from another ring/batch, we need to either
> > >>> guarantee that the memory location is updated before the pre-parser gets
> > >>> to it or we need to turn the pre-parser off around the modification.
> > >>> In i915, we use self-modifying code only for GPU relocations.
> > >>>
> > >>> The pre-parser fetches across memory synchronization commands as well,
> > >>> so the only way to guarantee that the writes land before the parser gets
> > >>> to it is to have more instructions between the sync and the destination
> > >>> than the parser FIFO depth, which is not an optimal solution.
> > >>
> > >> Well, our ABI is that memory is coherent before the breadcrumb of *each*
> > >> batch. That is a fundamental requirement for our signaling to userspace.
> > >> Please tell me that there is a context flag to turn this off, or we else
> > >> we need to emit 32x flushes or whatever it takes.
> > > 
> > Are you referring to the specific case where we have a request modifying 
> > an object that is then used as a batch in the next request? Because 
> > coherency of objects that are not executed as batches is not impacted.
> 
> "Fetches across memory sync" sounds like a major ABI break. The batches
> are a hard serialisation barrier, with memory coherency guaranteed prior
> to the signaling at the end of one batch and clear caches guaranteed at
> the start of the next.

We have relocs, oa and sseu all using self-modifying code. I expect we
will have PTE modifications and much more done via the GPU in the near
future. All rely on the CS_STALL doing exactly what it says on the tin.
-Chris
Daniele Ceraolo Spurio Aug. 23, 2019, 3:39 p.m. UTC | #6
On 8/23/19 8:28 AM, Chris Wilson wrote:
> Quoting Chris Wilson (2019-08-23 16:10:48)
>> Quoting Daniele Ceraolo Spurio (2019-08-23 16:05:45)
>>>
>>>
>>> On 8/23/19 7:26 AM, Chris Wilson wrote:
>>>> Quoting Chris Wilson (2019-08-23 08:27:25)
>>>>> Quoting Daniele Ceraolo Spurio (2019-08-23 03:09:09)
>>>>>> TGL has an improved CS pre-parser that can now pre-fetch commands across
>>>>>> batch boundaries. This improves performances when lots of small batches
>>>>>> are used, but has an impact on self-modifying code. If we want to modify
>>>>>> the content of a batch from another ring/batch, we need to either
>>>>>> guarantee that the memory location is updated before the pre-parser gets
>>>>>> to it or we need to turn the pre-parser off around the modification.
>>>>>> In i915, we use self-modifying code only for GPU relocations.
>>>>>>
>>>>>> The pre-parser fetches across memory synchronization commands as well,
>>>>>> so the only way to guarantee that the writes land before the parser gets
>>>>>> to it is to have more instructions between the sync and the destination
>>>>>> than the parser FIFO depth, which is not an optimal solution.
>>>>>
>>>>> Well, our ABI is that memory is coherent before the breadcrumb of *each*
>>>>> batch. That is a fundamental requirement for our signaling to userspace.
>>>>> Please tell me that there is a context flag to turn this off, or we else
>>>>> we need to emit 32x flushes or whatever it takes.
>>>>
>>> Are you referring to the specific case where we have a request modifying
>>> an object that is then used as a batch in the next request? Because
>>> coherency of objects that are not executed as batches is not impacted.
>>
>> "Fetches across memory sync" sounds like a major ABI break. The batches
>> are a hard serialisation barrier, with memory coherency guaranteed prior
>> to the signaling at the end of one batch and clear caches guaranteed at
>> the start of the next.
> 
> We have relocs, oa and sseu all using self-modifying code. I expect we
> will have PTE modifications and much more done via the GPU in the near
> future. All rely on the CS_STALL doing exactly what it says on the tin.
> -Chris
> 

I guess the easiest solution is then to keep the parser off outside of 
user batches. We can default to off and then restore what the user has 
programmed before the BBSTART. It's not a breach of contract if we say 
that if you opt-in to the parser then you need to make sure your batches 
are not self-modifying, right?

BTW the CS_STALL does not guarantee on pre-gen12 gens that 
self-modifying code works within the same batch/ring because the 
pre-parser is already pre-fetching across memory sync points, it just 
stops at the next arb point.

Daniele
Chris Wilson Aug. 23, 2019, 3:52 p.m. UTC | #7
Quoting Daniele Ceraolo Spurio (2019-08-23 16:39:14)
> 
> 
> On 8/23/19 8:28 AM, Chris Wilson wrote:
> > Quoting Chris Wilson (2019-08-23 16:10:48)
> >> Quoting Daniele Ceraolo Spurio (2019-08-23 16:05:45)
> >>>
> >>>
> >>> On 8/23/19 7:26 AM, Chris Wilson wrote:
> >>>> Quoting Chris Wilson (2019-08-23 08:27:25)
> >>>>> Quoting Daniele Ceraolo Spurio (2019-08-23 03:09:09)
> >>>>>> TGL has an improved CS pre-parser that can now pre-fetch commands across
> >>>>>> batch boundaries. This improves performances when lots of small batches
> >>>>>> are used, but has an impact on self-modifying code. If we want to modify
> >>>>>> the content of a batch from another ring/batch, we need to either
> >>>>>> guarantee that the memory location is updated before the pre-parser gets
> >>>>>> to it or we need to turn the pre-parser off around the modification.
> >>>>>> In i915, we use self-modifying code only for GPU relocations.
> >>>>>>
> >>>>>> The pre-parser fetches across memory synchronization commands as well,
> >>>>>> so the only way to guarantee that the writes land before the parser gets
> >>>>>> to it is to have more instructions between the sync and the destination
> >>>>>> than the parser FIFO depth, which is not an optimal solution.
> >>>>>
> >>>>> Well, our ABI is that memory is coherent before the breadcrumb of *each*
> >>>>> batch. That is a fundamental requirement for our signaling to userspace.
> >>>>> Please tell me that there is a context flag to turn this off, or we else
> >>>>> we need to emit 32x flushes or whatever it takes.
> >>>>
> >>> Are you referring to the specific case where we have a request modifying
> >>> an object that is then used as a batch in the next request? Because
> >>> coherency of objects that are not executed as batches is not impacted.
> >>
> >> "Fetches across memory sync" sounds like a major ABI break. The batches
> >> are a hard serialisation barrier, with memory coherency guaranteed prior
> >> to the signaling at the end of one batch and clear caches guaranteed at
> >> the start of the next.
> > 
> > We have relocs, oa and sseu all using self-modifying code. I expect we
> > will have PTE modifications and much more done via the GPU in the near
> > future. All rely on the CS_STALL doing exactly what it says on the tin.
> > -Chris
> > 
> 
> I guess the easiest solution is then to keep the parser off outside of 
> user batches. We can default to off and then restore what the user has 
> programmed before the BBSTART. It's not a breach of contract if we say 
> that if you opt-in to the parser then you need to make sure your batches 
> are not self-modifying, right?

Is it just the MI_ARB_ONOFF bits, and is that still a privileged
command? i.e. can userspace change mode by itself, or is it a
context-param?

> BTW the CS_STALL does not guarantee on pre-gen12 gens that 
> self-modifying code works within the same batch/ring because the 
> pre-parser is already pre-fetching across memory sync points, it just 
> stops at the next arb point.

Ok, we still uphold our contract if they can't execute any code in the
window where they would see someone else's data.
-Chris
Daniele Ceraolo Spurio Aug. 23, 2019, 3:56 p.m. UTC | #8
On 8/23/19 8:52 AM, Chris Wilson wrote:
> Quoting Daniele Ceraolo Spurio (2019-08-23 16:39:14)
>>
>>
>> On 8/23/19 8:28 AM, Chris Wilson wrote:
>>> Quoting Chris Wilson (2019-08-23 16:10:48)
>>>> Quoting Daniele Ceraolo Spurio (2019-08-23 16:05:45)
>>>>>
>>>>>
>>>>> On 8/23/19 7:26 AM, Chris Wilson wrote:
>>>>>> Quoting Chris Wilson (2019-08-23 08:27:25)
>>>>>>> Quoting Daniele Ceraolo Spurio (2019-08-23 03:09:09)
>>>>>>>> TGL has an improved CS pre-parser that can now pre-fetch commands across
>>>>>>>> batch boundaries. This improves performances when lots of small batches
>>>>>>>> are used, but has an impact on self-modifying code. If we want to modify
>>>>>>>> the content of a batch from another ring/batch, we need to either
>>>>>>>> guarantee that the memory location is updated before the pre-parser gets
>>>>>>>> to it or we need to turn the pre-parser off around the modification.
>>>>>>>> In i915, we use self-modifying code only for GPU relocations.
>>>>>>>>
>>>>>>>> The pre-parser fetches across memory synchronization commands as well,
>>>>>>>> so the only way to guarantee that the writes land before the parser gets
>>>>>>>> to it is to have more instructions between the sync and the destination
>>>>>>>> than the parser FIFO depth, which is not an optimal solution.
>>>>>>>
>>>>>>> Well, our ABI is that memory is coherent before the breadcrumb of *each*
>>>>>>> batch. That is a fundamental requirement for our signaling to userspace.
>>>>>>> Please tell me that there is a context flag to turn this off, or we else
>>>>>>> we need to emit 32x flushes or whatever it takes.
>>>>>>
>>>>> Are you referring to the specific case where we have a request modifying
>>>>> an object that is then used as a batch in the next request? Because
>>>>> coherency of objects that are not executed as batches is not impacted.
>>>>
>>>> "Fetches across memory sync" sounds like a major ABI break. The batches
>>>> are a hard serialisation barrier, with memory coherency guaranteed prior
>>>> to the signaling at the end of one batch and clear caches guaranteed at
>>>> the start of the next.
>>>
>>> We have relocs, oa and sseu all using self-modifying code. I expect we
>>> will have PTE modifications and much more done via the GPU in the near
>>> future. All rely on the CS_STALL doing exactly what it says on the tin.
>>> -Chris
>>>
>>
>> I guess the easiest solution is then to keep the parser off outside of
>> user batches. We can default to off and then restore what the user has
>> programmed before the BBSTART. It's not a breach of contract if we say
>> that if you opt-in to the parser then you need to make sure your batches
>> are not self-modifying, right?
> 
> Is it just the MI_ARB_ONOFF bits, and is that still a privileged
> command? i.e. can userspace change mode by itself, or it is a
> context-param?

It's the ARB_CHECK, not the ARB_ONOFF, so yes, it is not privileged and 
userspace can modify it itself. It would've been easier if it was a 
context param :)

Daniele

> 
>> BTW the CS_STALL does not guarantee on pre-gen12 gens that
>> self-modifying code works within the same batch/ring because the
>> pre-parser is already pre-fetching across memory sync points, it just
>> stops at the next arb point.
> 
> Ok, we still uphold our contract if they can't execute any code in the
> window where they would see someone else's data.
> -Chris
>
Chris Wilson Aug. 23, 2019, 4:31 p.m. UTC | #9
Quoting Daniele Ceraolo Spurio (2019-08-23 16:56:54)
> 
> 
> On 8/23/19 8:52 AM, Chris Wilson wrote:
> > Quoting Daniele Ceraolo Spurio (2019-08-23 16:39:14)
> >>
> >>
> >> On 8/23/19 8:28 AM, Chris Wilson wrote:
> >>> Quoting Chris Wilson (2019-08-23 16:10:48)
> >>>> Quoting Daniele Ceraolo Spurio (2019-08-23 16:05:45)
> >>>>>
> >>>>>
> >>>>> On 8/23/19 7:26 AM, Chris Wilson wrote:
> >>>>>> Quoting Chris Wilson (2019-08-23 08:27:25)
> >>>>>>> Quoting Daniele Ceraolo Spurio (2019-08-23 03:09:09)
> >>>>>>>> TGL has an improved CS pre-parser that can now pre-fetch commands across
> >>>>>>>> batch boundaries. This improves performances when lots of small batches
> >>>>>>>> are used, but has an impact on self-modifying code. If we want to modify
> >>>>>>>> the content of a batch from another ring/batch, we need to either
> >>>>>>>> guarantee that the memory location is updated before the pre-parser gets
> >>>>>>>> to it or we need to turn the pre-parser off around the modification.
> >>>>>>>> In i915, we use self-modifying code only for GPU relocations.
> >>>>>>>>
> >>>>>>>> The pre-parser fetches across memory synchronization commands as well,
> >>>>>>>> so the only way to guarantee that the writes land before the parser gets
> >>>>>>>> to it is to have more instructions between the sync and the destination
> >>>>>>>> than the parser FIFO depth, which is not an optimal solution.
> >>>>>>>
> >>>>>>> Well, our ABI is that memory is coherent before the breadcrumb of *each*
> >>>>>>> batch. That is a fundamental requirement for our signaling to userspace.
> >>>>>>> Please tell me that there is a context flag to turn this off, or we else
> >>>>>>> we need to emit 32x flushes or whatever it takes.
> >>>>>>
> >>>>> Are you referring to the specific case where we have a request modifying
> >>>>> an object that is then used as a batch in the next request? Because
> >>>>> coherency of objects that are not executed as batches is not impacted.
> >>>>
> >>>> "Fetches across memory sync" sounds like a major ABI break. The batches
> >>>> are a hard serialisation barrier, with memory coherency guaranteed prior
> >>>> to the signaling at the end of one batch and clear caches guaranteed at
> >>>> the start of the next.
> >>>
> >>> We have relocs, oa and sseu all using self-modifying code. I expect we
> >>> will have PTE modifications and much more done via the GPU in the near
> >>> future. All rely on the CS_STALL doing exactly what it says on the tin.
> >>> -Chris
> >>>
> >>
> >> I guess the easiest solution is then to keep the parser off outside of
> >> user batches. We can default to off and then restore what the user has
> >> programmed before the BBSTART. It's not a breach of contract if we say
> >> that if you opt-in to the parser then you need to make sure your batches
> >> are not self-modifying, right?
> > 
> > Is it just the MI_ARB_ONOFF bits, and is that still a privileged
> > command? i.e. can userspace change mode by itself, or it is a
> > context-param?
> 
> It's the ARB_CHECK, not the ARB_ONOFF, so yes, it is not privileged and 
> userspace can modify it itself. It would've been easier if it was a 
> context param :)

Does it go across a context switch? That might be an easy solution for
our internal requests (already true for oa/sseu where we use one context
to modify another). I do worry though if there might be leakage
across our flush-invalidate barriers between userspace batches.
-Chris
Daniele Ceraolo Spurio Aug. 23, 2019, 5:01 p.m. UTC | #10
On 8/23/19 9:31 AM, Chris Wilson wrote:
> Quoting Daniele Ceraolo Spurio (2019-08-23 16:56:54)
>>
>>
>> On 8/23/19 8:52 AM, Chris Wilson wrote:
>>> Quoting Daniele Ceraolo Spurio (2019-08-23 16:39:14)
>>>>
>>>>
>>>> On 8/23/19 8:28 AM, Chris Wilson wrote:
>>>>> Quoting Chris Wilson (2019-08-23 16:10:48)
>>>>>> Quoting Daniele Ceraolo Spurio (2019-08-23 16:05:45)
>>>>>>>
>>>>>>>
>>>>>>> On 8/23/19 7:26 AM, Chris Wilson wrote:
>>>>>>>> Quoting Chris Wilson (2019-08-23 08:27:25)
>>>>>>>>> Quoting Daniele Ceraolo Spurio (2019-08-23 03:09:09)
>>>>>>>>>> TGL has an improved CS pre-parser that can now pre-fetch commands across
>>>>>>>>>> batch boundaries. This improves performances when lots of small batches
>>>>>>>>>> are used, but has an impact on self-modifying code. If we want to modify
>>>>>>>>>> the content of a batch from another ring/batch, we need to either
>>>>>>>>>> guarantee that the memory location is updated before the pre-parser gets
>>>>>>>>>> to it or we need to turn the pre-parser off around the modification.
>>>>>>>>>> In i915, we use self-modifying code only for GPU relocations.
>>>>>>>>>>
>>>>>>>>>> The pre-parser fetches across memory synchronization commands as well,
>>>>>>>>>> so the only way to guarantee that the writes land before the parser gets
>>>>>>>>>> to it is to have more instructions between the sync and the destination
>>>>>>>>>> than the parser FIFO depth, which is not an optimal solution.
>>>>>>>>>
>>>>>>>>> Well, our ABI is that memory is coherent before the breadcrumb of *each*
>>>>>>>>> batch. That is a fundamental requirement for our signaling to userspace.
>>>>>>>>> Please tell me that there is a context flag to turn this off, or we else
>>>>>>>>> we need to emit 32x flushes or whatever it takes.
>>>>>>>>
>>>>>>> Are you referring to the specific case where we have a request modifying
>>>>>>> an object that is then used as a batch in the next request? Because
>>>>>>> coherency of objects that are not executed as batches is not impacted.
>>>>>>
>>>>>> "Fetches across memory sync" sounds like a major ABI break. The batches
>>>>>> are a hard serialisation barrier, with memory coherency guaranteed prior
>>>>>> to the signaling at the end of one batch and clear caches guaranteed at
>>>>>> the start of the next.
>>>>>
>>>>> We have relocs, oa and sseu all using self-modifying code. I expect we
>>>>> will have PTE modifications and much more done via the GPU in the near
>>>>> future. All rely on the CS_STALL doing exactly what it says on the tin.
>>>>> -Chris
>>>>>
>>>>
>>>> I guess the easiest solution is then to keep the parser off outside of
>>>> user batches. We can default to off and then restore what the user has
>>>> programmed before the BBSTART. It's not a breach of contract if we say
>>>> that if you opt-in to the parser then you need to make sure your batches
>>>> are not self-modifying, right?
>>>
>>> Is it just the MI_ARB_ONOFF bits, and is that still a privileged
>>> command? i.e. can userspace change mode by itself, or it is a
>>> context-param?
>>
>> It's the ARB_CHECK, not the ARB_ONOFF, so yes, it is not privileged and
>> userspace can modify it itself. It would've been easier if it was a
>> context param :)
> 
> Does it go across a context switch? That might be an easy solution for
> our internal requests (already true for oa/sseu where we use one context
> to modify another). I do worry though if there might be leakage
> across our flush-invalidate barriers between userspace batches.

The pre-fetching? No, AFAIK that's confined to the context, so moving
the relocs to another context would work; the status of the parser is 
also ctx save/restored. What leakage case are you worried about? The 
memory synchronization between contexts is unchanged, a context can only 
mess up its own instructions. AFAIU the only thing that's possible is 
that if a batch modifies the next batch in the same context then the CS 
won't see the update.

Daniele

> -Chris
>
Chris Wilson Aug. 23, 2019, 5:28 p.m. UTC | #11
Quoting Daniele Ceraolo Spurio (2019-08-23 18:01:03)
> 
> 
> On 8/23/19 9:31 AM, Chris Wilson wrote:
> > Quoting Daniele Ceraolo Spurio (2019-08-23 16:56:54)
> >>
> >>
> >> On 8/23/19 8:52 AM, Chris Wilson wrote:
> >>> Quoting Daniele Ceraolo Spurio (2019-08-23 16:39:14)
> >>>>
> >>>>
> >>>> On 8/23/19 8:28 AM, Chris Wilson wrote:
> >>>>> Quoting Chris Wilson (2019-08-23 16:10:48)
> >>>>>> Quoting Daniele Ceraolo Spurio (2019-08-23 16:05:45)
> >>>>>>>
> >>>>>>>
> >>>>>>> On 8/23/19 7:26 AM, Chris Wilson wrote:
> >>>>>>>> Quoting Chris Wilson (2019-08-23 08:27:25)
> >>>>>>>>> Quoting Daniele Ceraolo Spurio (2019-08-23 03:09:09)
> >>>>>>>>>> TGL has an improved CS pre-parser that can now pre-fetch commands across
> >>>>>>>>>> batch boundaries. This improves performances when lots of small batches
> >>>>>>>>>> are used, but has an impact on self-modifying code. If we want to modify
> >>>>>>>>>> the content of a batch from another ring/batch, we need to either
> >>>>>>>>>> guarantee that the memory location is updated before the pre-parser gets
> >>>>>>>>>> to it or we need to turn the pre-parser off around the modification.
> >>>>>>>>>> In i915, we use self-modifying code only for GPU relocations.
> >>>>>>>>>>
> >>>>>>>>>> The pre-parser fetches across memory synchronization commands as well,
> >>>>>>>>>> so the only way to guarantee that the writes land before the parser gets
> >>>>>>>>>> to it is to have more instructions between the sync and the destination
> >>>>>>>>>> than the parser FIFO depth, which is not an optimal solution.
> >>>>>>>>>
> >>>>>>>>> Well, our ABI is that memory is coherent before the breadcrumb of *each*
> >>>>>>>>> batch. That is a fundamental requirement for our signaling to userspace.
> >>>>>>>>> Please tell me that there is a context flag to turn this off, or we else
> >>>>>>>>> we need to emit 32x flushes or whatever it takes.
> >>>>>>>>
> >>>>>>> Are you referring to the specific case where we have a request modifying
> >>>>>>> an object that is then used as a batch in the next request? Because
> >>>>>>> coherency of objects that are not executed as batches is not impacted.
> >>>>>>
> >>>>>> "Fetches across memory sync" sounds like a major ABI break. The batches
> >>>>>> are a hard serialisation barrier, with memory coherency guaranteed prior
> >>>>>> to the signaling at the end of one batch and clear caches guaranteed at
> >>>>>> the start of the next.
> >>>>>
> >>>>> We have relocs, oa and sseu all using self-modifying code. I expect we
> >>>>> will have PTE modifications and much more done via the GPU in the near
> >>>>> future. All rely on the CS_STALL doing exactly what it says on the tin.
> >>>>> -Chris
> >>>>>
> >>>>
> >>>> I guess the easiest solution is then to keep the parser off outside of
> >>>> user batches. We can default to off and then restore what the user has
> >>>> programmed before the BBSTART. It's not a breach of contract if we say
> >>>> that if you opt-in to the parser then you need to make sure your batches
> >>>> are not self-modifying, right?
> >>>
> >>> Is it just the MI_ARB_ONOFF bits, and is that still a privileged
> >>> command? i.e. can userspace change mode by itself, or is it a
> >>> context-param?
> >>
> >> It's the ARB_CHECK, not the ARB_ONOFF, so yes, it is not privileged and
> >> userspace can modify it itself. It would've been easier if it was a
> >> context param :)
> > 
> > Does it go across a context switch? That might be an easy solution for
> > our internal requests (already true for oa/sseu where we use one context
> > to modify another). I do worry though if there might be leakage
> > across our flush-invalidate barriers between userspace batches.
> 
> The pre-fetching? no, AFAIK that's confined to the context, so moving 
> the relocs to another context would work; the status of the parser is 
> also ctx save/restored. What leakage case are you worried about? The 
> memory synchronization between contexts is unchanged, a context can only 
> mess up its own instructions. AFAIU the only thing that's possible is 
> that if a batch modifies the next batch in the same context then the CS 
> won't see the update.

Our mission is to ensure that one user cannot mess with another. So if
the context boundary is solid, and the user cannot interfere with the
TLBs (or at least is not able to manipulate an invalid lookup into fetching
other data), then the flush-invalidate should be solid between batches. 
(One always expects the worst though, and we already know the horrors of
speculative fetches from the CPU filling the cache with stale data and
now it seems like the GPU may be learning similar evil tricks...)

I'm still unnerved though about the prospect of a soft bottom-of-pipe
sync. But that's just standard fare for pipecontrols.

Ok, it should be easy enough to create a context on demand for the
"rare" GPU relocs and I never have to worry about this again. :)
-Chris
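
For readers following the thread, here is a minimal sketch of the save,
disable and conditional re-enable flow that the commit message describes.
It is a CPU-side model only, not driver code. The bit values are copied
from the patch below; MI_INSTR() is reproduced as I recall it from the i915
headers, and struct ctx_model plus the model_*() helpers are hypothetical
names invented for illustration. The masked-write behaviour (the disable
bit only takes effect when the mask bit is set) is an assumption based on
how the patch uses the two bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the patch; MI_INSTR() assumed to match the i915 header. */
#define MI_INSTR(opcode, flags)             (((uint32_t)(opcode) << 23) | (flags))
#define MI_ARB_CHECK                        MI_INSTR(0x05, 0)
#define MI_ARB_CHECK_PREPARSER_DISABLE      (1u << 0)
#define MI_ARB_CHECK_PREPARSER_DISABLE_MASK (1u << 8)
#define PREFETCH_DISABLE_STATUS             (1u << 12) /* RING_INSTPM, gen12+ */

/* Hypothetical model: the per-context parser state as a single flag. */
struct ctx_model {
	bool preparser_disabled;
};

/* Assumed semantics: the value bit is only latched when the mask bit is set. */
static void model_arb_check(struct ctx_model *ctx, uint32_t cmd)
{
	assert((cmd >> 23) == 0x05); /* must be an MI_ARB_CHECK */

	if (cmd & MI_ARB_CHECK_PREPARSER_DISABLE_MASK)
		ctx->preparser_disabled =
			(cmd & MI_ARB_CHECK_PREPARSER_DISABLE) != 0;
}

/* Models what the SRM of RING_INSTPM would store for this context. */
static uint32_t model_instpm(const struct ctx_model *ctx)
{
	return ctx->preparser_disabled ? PREFETCH_DISABLE_STATUS : 0;
}

/* Save the status, turn the parser off, then re-enable only if it was on. */
static void model_gpu_relocs(struct ctx_model *ctx)
{
	uint32_t saved = model_instpm(ctx);

	model_arb_check(ctx, MI_ARB_CHECK |
			     MI_ARB_CHECK_PREPARSER_DISABLE_MASK |
			     MI_ARB_CHECK_PREPARSER_DISABLE);

	/* ... the relocation writes would be emitted here ... */

	/* Stands in for the conditional batch end in the re-enable batch. */
	if (!(saved & PREFETCH_DISABLE_STATUS))
		model_arb_check(ctx, MI_ARB_CHECK |
				     MI_ARB_CHECK_PREPARSER_DISABLE_MASK);
}
```

In this model a context that starts with the parser on gets it back after
model_gpu_relocs(), while a context where userspace had already disabled it
stays disabled, which is the behaviour the commit message asks for.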

Patch

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index b5f6937369ea..45e84f28276c 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -254,6 +254,7 @@  struct i915_execbuffer {
 
 		struct i915_request *rq;
 		u32 *rq_cmd;
+		unsigned int size;
 		unsigned int rq_size;
 	} reloc_cache;
 
@@ -904,6 +905,7 @@  static void reloc_cache_init(struct reloc_cache *cache,
 	cache->needs_unfenced = INTEL_INFO(i915)->unfenced_needs_alignment;
 	cache->node.allocated = false;
 	cache->rq = NULL;
+	cache->size = PAGE_SIZE;
 	cache->rq_size = 0;
 }
 
@@ -928,7 +930,8 @@  static inline struct i915_ggtt *cache_to_ggtt(struct reloc_cache *cache)
 
 static void reloc_gpu_flush(struct reloc_cache *cache)
 {
-	GEM_BUG_ON(cache->rq_size >= cache->rq->batch->obj->base.size / sizeof(u32));
+	GEM_BUG_ON(cache->rq_size >= cache->size / sizeof(u32));
+
 	cache->rq_cmd[cache->rq_size] = MI_BATCH_BUFFER_END;
 
 	__i915_gem_object_flush_map(cache->rq->batch->obj, 0, cache->rq_size);
@@ -1142,10 +1145,11 @@  static int __reloc_gpu_alloc(struct i915_execbuffer *eb,
 	struct intel_engine_pool_node *pool;
 	struct i915_request *rq;
 	struct i915_vma *batch;
+	u32 reserved_size = eb->engine->emit_preparser_enable_size_dw * sizeof(u32);
 	u32 *cmd;
 	int err;
 
-	pool = intel_engine_pool_get(&eb->engine->pool, PAGE_SIZE);
+	pool = intel_engine_pool_get(&eb->engine->pool, cache->size);
 	if (IS_ERR(pool))
 		return PTR_ERR(pool);
 
@@ -1158,6 +1162,9 @@  static int __reloc_gpu_alloc(struct i915_execbuffer *eb,
 		goto out_pool;
 	}
 
+	/* we reserve a portion of the batch for the pre-parser re-enabling */
+	cache->size -= reserved_size;
+
 	batch = i915_vma_instance(pool->obj, vma->vm, NULL);
 	if (IS_ERR(batch)) {
 		err = PTR_ERR(batch);
@@ -1182,9 +1189,39 @@  static int __reloc_gpu_alloc(struct i915_execbuffer *eb,
 	if (err)
 		goto err_request;
 
+	if (eb->engine->emit_preparser_disable) {
+		err = eb->engine->emit_preparser_disable(rq,
+							 cmd + cache->size / sizeof(u32));
+		if (err)
+			goto skip_request;
+	}
+
 	err = eb->engine->emit_bb_start(rq,
-					batch->node.start, PAGE_SIZE,
+					batch->node.start, cache->size,
 					cache->gen > 5 ? 0 : I915_DISPATCH_SECURE);
+
+	/*
+	 * Nothing we can do to fix this if we fail to re-enable, so print an
+	 * error and keep going. The context will still be functional, but
+	 * performance might be reduced.
+	 * We attempt this even if emit_bb_start failed to try and at least
+	 * restore the parser status.
+	 */
+	if (eb->engine->emit_preparser_disable) {
+		int ret;
+		eb->engine->emit_flush(rq, EMIT_FLUSH);
+
+		ret = eb->engine->emit_bb_start(rq,
+			batch->node.start + cache->size, reserved_size,
+			cache->gen > 5 ? 0 : I915_DISPATCH_SECURE);
+		if (ret)
+			DRM_ERROR("Failed to re-enable pre-parser for "
+				  "ctx=%u on engine%u:%u\n",
+				  rq->gem_context->hw_id,
+				  eb->engine->uabi_class,
+				  eb->engine->uabi_instance);
+	}
+
 	if (err)
 		goto skip_request;
 
@@ -1226,7 +1263,7 @@  static u32 *reloc_gpu(struct i915_execbuffer *eb,
 	struct reloc_cache *cache = &eb->reloc_cache;
 	u32 *cmd;
 
-	if (cache->rq_size > PAGE_SIZE/sizeof(u32) - (len + 1))
+	if (cache->rq_size > cache->size/sizeof(u32) - (len + 1))
 		reloc_gpu_flush(cache);
 
 	if (unlikely(!cache->rq)) {
diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
index d3c6993f4f46..1c9e38586af8 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine.h
@@ -185,9 +185,18 @@  intel_write_status_page(struct intel_engine_cs *engine, int reg, u32 value)
 #define I915_GEM_HWS_PREEMPT_ADDR	(I915_GEM_HWS_PREEMPT * sizeof(u32))
 #define I915_GEM_HWS_SEQNO		0x40
 #define I915_GEM_HWS_SEQNO_ADDR		(I915_GEM_HWS_SEQNO * sizeof(u32))
+/*
+ * note: we treat the cacheline starting from I915_GEM_HWS_SEQNO the same as a
+ * per-ctx HWSP. see layout below.
+ */
 #define I915_GEM_HWS_SCRATCH		0x80
 #define I915_GEM_HWS_SCRATCH_ADDR	(I915_GEM_HWS_SCRATCH * sizeof(u32))
 
+/* offset in per-context hwsp, 1 cacheline in size */
+#define I915_GEM_CTX_HWS_SEQNO		0x0
+#define I915_GEM_CTX_HWS_PREFETCH_MASK	0x14
+#define I915_GEM_CTX_HWS_PREFETCH_VAL	0x15
+
 #define I915_HWS_CSB_BUF0_INDEX		0x10
 #define I915_HWS_CSB_WRITE_INDEX	0x1f
 #define CNL_HWS_CSB_WRITE_INDEX		0x2f
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index a82cea95c2f2..bf1a2d7b3a4f 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -435,6 +435,10 @@  struct intel_engine_cs {
 						 u32 *cs);
 	unsigned int	emit_fini_breadcrumb_dw;
 
+	int		(*emit_preparser_disable)(struct i915_request *rq,
+						  u32 *batch);
+	unsigned int	emit_preparser_enable_size_dw;
+
 	/* Pass the request to the hardware queue (e.g. directly into
 	 * the legacy ringbuffer or to the end of an execlist).
 	 *
diff --git a/drivers/gpu/drm/i915/gt/intel_gpu_commands.h b/drivers/gpu/drm/i915/gt/intel_gpu_commands.h
index 86e00a2db8a4..d9e82a4de854 100644
--- a/drivers/gpu/drm/i915/gt/intel_gpu_commands.h
+++ b/drivers/gpu/drm/i915/gt/intel_gpu_commands.h
@@ -53,6 +53,17 @@ 
 #define   MI_ARB_ENABLE			(1<<0)
 #define   MI_ARB_DISABLE		(0<<0)
 #define MI_BATCH_BUFFER_END	MI_INSTR(0x0a, 0)
+#define MI_CONDITIONAL_BATCH_BUFFER_END MI_INSTR(0x36, 0)
+#define   MI_CBBEND_GLOBAL_GTT		(1<<22)
+#define   MI_CBBEND_COMPARE_SEMAPHORE	(1<<21) /* Gen12+*/
+#define   MI_CBBEND_COMPARE_MASK	(1<<19) /* Gen9+*/
+#define   MI_CBBEND_END_CURRENT_LEVEL	(1<<18) /* Gen12+*/
+#define   MI_CBBEND_MAD_GT_IDD		(0 << 12) /* Gen12+*/
+#define   MI_CBBEND_MAD_GTE_IDD		(1 << 12) /* Gen12+*/
+#define   MI_CBBEND_MAD_LT_IDD		(2 << 12) /* Gen12+*/
+#define   MI_CBBEND_MAD_LTE_IDD		(3 << 12) /* Gen12+*/
+#define   MI_CBBEND_MAD_EQ_IDD		(4 << 12) /* Gen12+*/
+#define   MI_CBBEND_MAD_NE_IDD		(5 << 12) /* Gen12+*/
 #define MI_SUSPEND_FLUSH	MI_INSTR(0x0b, 0)
 #define   MI_SUSPEND_FLUSH_EN	(1<<0)
 #define MI_SET_APPID		MI_INSTR(0x0e, 0)
@@ -136,6 +147,7 @@ 
 #define MI_STORE_REGISTER_MEM        MI_INSTR(0x24, 1)
 #define MI_STORE_REGISTER_MEM_GEN8   MI_INSTR(0x24, 2)
 #define   MI_SRM_LRM_GLOBAL_GTT		(1<<22)
+#define   MI_SRM_ADD_CS_MMIO_START	(1<<19) /* gen11+ */
 #define MI_FLUSH_DW		MI_INSTR(0x26, 1) /* for GEN6 */
 #define   MI_FLUSH_DW_STORE_INDEX	(1<<21)
 #define   MI_INVALIDATE_TLB		(1<<18)
@@ -158,6 +170,9 @@ 
 #define MI_BATCH_BUFFER_START_GEN8	MI_INSTR(0x31, 1)
 #define   MI_BATCH_RESOURCE_STREAMER (1<<10)
 
+#define MI_ARB_CHECK            MI_INSTR(0x05, 0)
+#define   MI_ARB_CHECK_PREPARSER_DISABLE BIT(0)
+#define   MI_ARB_CHECK_PREPARSER_DISABLE_MASK BIT(8)
 /*
  * 3D instructions used by the kernel
  */
@@ -239,7 +254,6 @@ 
  * Commands used only by the command parser
  */
 #define MI_SET_PREDICATE        MI_INSTR(0x01, 0)
-#define MI_ARB_CHECK            MI_INSTR(0x05, 0)
 #define MI_RS_CONTROL           MI_INSTR(0x06, 0)
 #define MI_URB_ATOMIC_ALLOC     MI_INSTR(0x09, 0)
 #define MI_PREDICATE            MI_INSTR(0x0C, 0)
@@ -255,7 +269,6 @@ 
 #define MI_RS_STORE_DATA_IMM    MI_INSTR(0x2B, 0)
 #define MI_LOAD_URB_MEM         MI_INSTR(0x2C, 0)
 #define MI_STORE_URB_MEM        MI_INSTR(0x2D, 0)
-#define MI_CONDITIONAL_BATCH_BUFFER_END MI_INSTR(0x36, 0)
 
 #define PIPELINE_SELECT                ((0x3<<29)|(0x1<<27)|(0x1<<24)|(0x4<<16))
 #define GFX_OP_3DSTATE_VF_STATISTICS   ((0x3<<29)|(0x1<<27)|(0x0<<24)|(0xB<<16))
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index d42584439f51..2e560bbb3bdf 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2947,6 +2947,59 @@  static u32 *gen11_emit_fini_breadcrumb_rcs(struct i915_request *request,
 	return gen8_emit_fini_breadcrumb_footer(request, cs);
 }
 
+static u32 *__gen12_emit_preparser_enable(u32 *batch, u32 target_addr)
+{
+	/* return early if the pre-parser was disabled when we started */
+	*batch++ = MI_CONDITIONAL_BATCH_BUFFER_END | 2 |
+			MI_CBBEND_GLOBAL_GTT |
+			MI_CBBEND_COMPARE_SEMAPHORE |
+			MI_CBBEND_END_CURRENT_LEVEL |
+			MI_CBBEND_MAD_NE_IDD; /* return early if EQ */
+	*batch++ = PREFETCH_DISABLE_STATUS;
+	*batch++ = target_addr;
+	*batch++ = 0;
+
+	/* turn the parser back on */
+	*batch++ = MI_ARB_CHECK | MI_ARB_CHECK_PREPARSER_DISABLE_MASK;
+	*batch++ = MI_BATCH_BUFFER_END;
+
+	return batch;
+}
+
+static int gen12_emit_preparser_disable(struct i915_request *rq, u32 *batch)
+{
+	/* 2 dwords, mask and value (in that order) */
+	u32 target_offset = rq->hw_context->timeline->hwsp_offset +
+				I915_GEM_CTX_HWS_PREFETCH_MASK * sizeof(u32);
+	u32 *cs;
+
+	cs = intel_ring_begin(rq, 6);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	/* save the current status of the pre-parser ... */
+	*cs++ = MI_STORE_REGISTER_MEM_GEN8 |
+			MI_SRM_LRM_GLOBAL_GTT |
+			MI_SRM_ADD_CS_MMIO_START;
+	*cs++ = i915_mmio_reg_offset(RING_INSTPM(0));
+	*cs++ = target_offset + sizeof(u32); /* 2nd dword */
+	*cs++ = 0;
+
+	/* ... and turn it off */
+	*cs++ = MI_ARB_CHECK |
+		MI_ARB_CHECK_PREPARSER_DISABLE_MASK |
+		MI_ARB_CHECK_PREPARSER_DISABLE;
+
+	*cs++ = MI_NOOP;
+
+	intel_ring_advance(rq, cs);
+
+	/* now prepare the batch that will re-enable the parser */
+	__gen12_emit_preparser_enable(batch, target_offset);
+
+	return 0;
+}
+
 static void execlists_park(struct intel_engine_cs *engine)
 {
 	del_timer(&engine->execlists.timer);
@@ -3017,6 +3070,13 @@  logical_ring_default_vfuncs(struct intel_engine_cs *engine)
 		engine->emit_bb_start = gen8_emit_bb_start;
 	else
 		engine->emit_bb_start = gen9_emit_bb_start;
+
+	if (INTEL_GEN(engine->i915) >= 12) {
+		u32 tmp[16];
+		engine->emit_preparser_disable = gen12_emit_preparser_disable;
+		engine->emit_preparser_enable_size_dw =
+			__gen12_emit_preparser_enable(tmp, 0) - tmp;
+	}
 }
 
 static inline void
diff --git a/drivers/gpu/drm/i915/gt/intel_timeline.c b/drivers/gpu/drm/i915/gt/intel_timeline.c
index 02fbe11b671b..89e1fd0b96ab 100644
--- a/drivers/gpu/drm/i915/gt/intel_timeline.c
+++ b/drivers/gpu/drm/i915/gt/intel_timeline.c
@@ -209,6 +209,7 @@  int intel_timeline_init(struct intel_timeline *timeline,
 			struct i915_vma *hwsp)
 {
 	void *vaddr;
+	u32 *hwsp_map;
 
 	kref_init(&timeline->kref);
 	atomic_set(&timeline->pin_count, 0);
@@ -244,8 +245,16 @@  int intel_timeline_init(struct intel_timeline *timeline,
 			return PTR_ERR(vaddr);
 	}
 
-	timeline->hwsp_seqno =
-		memset(vaddr + timeline->hwsp_offset, 0, CACHELINE_BYTES);
+	hwsp_map = memset(vaddr + timeline->hwsp_offset, 0, CACHELINE_BYTES);
+	timeline->hwsp_seqno = hwsp_map;
+
+	/*
+	 * For checking the status of the pre-fetcher, we need to save the value
+	 * of the INSTPM register and then evaluate it against the mask during
+	 * a BBEND. This requires the mask and the register value to be in 2
+	 * consecutive dwords in memory. We use 2 dwords in the HWSP for this.
+	 */
+	hwsp_map[I915_GEM_CTX_HWS_PREFETCH_MASK] = PREFETCH_DISABLE_STATUS;
 
 	timeline->hwsp_ggtt = i915_vma_get(hwsp);
 	GEM_BUG_ON(timeline->hwsp_offset >= hwsp->size);
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index a092b34c269d..504a513cffee 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -2549,6 +2549,7 @@  static inline bool i915_mmio_reg_valid(i915_reg_t reg)
 #define RING_DMA_FADD(base)	_MMIO((base) + 0x78)
 #define RING_DMA_FADD_UDW(base)	_MMIO((base) + 0x60) /* gen8+ */
 #define RING_INSTPM(base)	_MMIO((base) + 0xc0)
+#define  PREFETCH_DISABLE_STATUS REG_BIT(12) /* gen12+ */
 #define RING_MI_MODE(base)	_MMIO((base) + 0x9c)
 #define INSTPS		_MMIO(0x2070) /* 965+ only */
 #define GEN4_INSTDONE1	_MMIO(0x207c) /* 965+ only, aka INSTDONE_2 on SNB */