[02/17] drm/i915/ringbuffer: Brute force context restore

Message ID	20180610194325.13467-3-chris@chris-wilson.co.uk (mailing list archive)
State	New, archived
Headers	Return-Path: <intel-gfx-bounces@lists.freedesktop.org>
	From: Chris Wilson <chris@chris-wilson.co.uk>
	To: intel-gfx@lists.freedesktop.org
	Date: Sun, 10 Jun 2018 20:43:10 +0100
	Message-Id: <20180610194325.13467-3-chris@chris-wilson.co.uk>
	In-Reply-To: <20180610194325.13467-1-chris@chris-wilson.co.uk>
	References: <20180610194325.13467-1-chris@chris-wilson.co.uk>
	Subject: [Intel-gfx] [PATCH 02/17] drm/i915/ringbuffer: Brute force context restore
	Precedence: list
	MIME-Version: 1.0
	Content-Type: text/plain; charset="utf-8"
	Content-Transfer-Encoding: base64
	Errors-To: intel-gfx-bounces@lists.freedesktop.org
	Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

Chris Wilson June 10, 2018, 7:43 p.m. UTC

An issue encountered with switching mm on gen7 is that the GPU likes to
hang (with the VS unit busy) when told to force restore the current
context. We can simply workaround this by substituting the
MI_FORCE_RESTORE flag with a round-trip through the kernel_context,
forcing the context to be saved and restored; thereby reloading the
PP_DIR registers and updating the modified page directory!

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Matthew Auld <matthew.william.auld@gmail.com>
---
 drivers/gpu/drm/i915/intel_ringbuffer.c | 31 ++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

Tvrtko Ursulin June 11, 2018, 10 a.m. UTC | #1

On 10/06/2018 20:43, Chris Wilson wrote:
> An issue encountered with switching mm on gen7 is that the GPU likes to
> hang (with the VS unit busy) when told to force restore the current
> context. We can simply workaround this by substituting the
> MI_FORCE_RESTORE flag with a round-trip through the kernel_context,
> forcing the context to be saved and restored; thereby reloading the
> PP_DIR registers and updating the modified page directory!
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Matthew Auld <matthew.william.auld@gmail.com>
> ---
>   drivers/gpu/drm/i915/intel_ringbuffer.c | 31 ++++++++++++++++++++++---
>   1 file changed, 28 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> index 65811e2fa7da..6bfa6030198d 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> @@ -1458,6 +1458,7 @@ static inline int mi_set_context(struct i915_request *rq, u32 flags)
>   		(HAS_LEGACY_SEMAPHORES(i915) && IS_GEN7(i915)) ?
>   		INTEL_INFO(i915)->num_rings - 1 :
>   		0;
> +	bool force_restore = false;
>   	int len;
>   	u32 *cs;
>   
> @@ -1471,6 +1472,12 @@ static inline int mi_set_context(struct i915_request *rq, u32 flags)
>   	len = 4;
>   	if (IS_GEN7(i915))
>   		len += 2 + (num_rings ? 4*num_rings + 6 : 0);
> +	if (flags & MI_FORCE_RESTORE) {
> +		GEM_BUG_ON(flags & MI_RESTORE_INHIBIT);
> +		flags &= ~MI_FORCE_RESTORE;
> +		force_restore = true;
> +		len += 2;
> +	}
>   
>   	cs = intel_ring_begin(rq, len);
>   	if (IS_ERR(cs))
> @@ -1495,6 +1502,21 @@ static inline int mi_set_context(struct i915_request *rq, u32 flags)
>   		}
>   	}
>   
> +	if (force_restore) {
> +		/*
> +		 * The HW doesn't handle being told to restore the current
> +		 * context very well. Quite often it likes goes to go off and
> +		 * sulk, especially when it is meant to be reloading PP_DIR.
> +		 * A very simple fix to force the reload is to simply switch
> +		 * away from the current context and back again.
> +		 */
> +		*cs++ = MI_SET_CONTEXT;
> +		*cs++ = i915_ggtt_offset(to_intel_context(i915->kernel_context,
> +							  engine)->state) |
> +			MI_MM_SPACE_GTT |
> +			MI_RESTORE_INHIBIT;
> +	}
> +
>   	*cs++ = MI_NOOP;
>   	*cs++ = MI_SET_CONTEXT;
>   	*cs++ = i915_ggtt_offset(rq->hw_context->state) | flags;
> @@ -1585,11 +1607,14 @@ static int switch_context(struct i915_request *rq)
>   
>   		to_mm->pd_dirty_rings &= ~intel_engine_flag(engine);
>   		engine->legacy_active_ppgtt = to_mm;
> -		hw_flags = MI_FORCE_RESTORE;
> +
> +		if (to_ctx == from_ctx) {

Can the contexts be the same here, when the parent condition is
"if (to_mm != from_mm || to_mm ...)"?

> +			hw_flags = MI_FORCE_RESTORE;
> +			from_ctx = NULL;

Now on the error path we can end up with engine->legacy_active_context
== NULL, even though the switch commands have already been written to
the ring. Is that OK?

> +		}
>   	}
>   
> -	if (rq->hw_context->state &&
> -	    (to_ctx != from_ctx || hw_flags & MI_FORCE_RESTORE)) {
> +	if (rq->hw_context->state && to_ctx != from_ctx) {
>   		GEM_BUG_ON(engine->id != RCS);
>   
>   		/*
> 

Regards,

Tvrtko

Chris Wilson June 11, 2018, 10:04 a.m. UTC | #2

Quoting Tvrtko Ursulin (2018-06-11 11:00:31)
> 
> On 10/06/2018 20:43, Chris Wilson wrote:
> > @@ -1585,11 +1607,14 @@ static int switch_context(struct i915_request *rq)
> >   
> >               to_mm->pd_dirty_rings &= ~intel_engine_flag(engine);
> >               engine->legacy_active_ppgtt = to_mm;
> > -             hw_flags = MI_FORCE_RESTORE;
> > +
> > +             if (to_ctx == from_ctx) {
> 
> Contexts can be the same here , when the parent condition is "if (to_mm 
> != from_mm || to_mm ...)" ?
> 
> > +                     hw_flags = MI_FORCE_RESTORE;
> > +                     from_ctx = NULL;
> 
> Now on the error path we can end up with engine->legacy_active_context 
> == NULL, but commands to switch have been put to the ring. Is that OK?

On the error path, we unwind the commands in the ring and will overwrite
them with the next request (see the err_unwind: label in
i915_request_alloc()).

So leaving legacy_active_context = NULL here just ensures we emit a
context switch next time, no matter what. That is no bad thing as the HW
ignores redundant MI_SET_CONTEXT (and requires the FORCE_RESTORE flag to
force a redundant reload).
-Chris
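
The unwind behaviour described in this reply can be sketched with a toy
model. This is not the i915 code: toy_ring, toy_engine, toy_emit(),
toy_switch() and the TOY_SET_CONTEXT value are hypothetical names made up
for illustration, and the error path that writes from_ctx back into the
engine is an assumption based on the outcome discussed above. The sketch
only shows that a failed request rewinds the ring tail, and that a cached
context of NULL makes the next request emit a switch unconditionally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_DWORDS     64
#define TOY_SET_CONTEXT 0x1u	/* placeholder opcode, not a hardware value */

/* Toy stand-in for a ring; not the driver's ring structure. */
struct toy_ring {
	uint32_t cmds[RING_DWORDS];
	size_t tail;			/* next dword to be written */
};

struct toy_engine {
	struct toy_ring ring;
	const void *active_ctx;		/* plays the role of legacy_active_context */
};

/* Append one dword; returns -1 when the toy ring is full. */
static int toy_emit(struct toy_ring *ring, uint32_t dw)
{
	if (ring->tail >= RING_DWORDS)
		return -1;
	ring->cmds[ring->tail++] = dw;
	return 0;
}

/*
 * Emit a context switch if the cached context differs from the target.
 * same_ctx_reload mirrors the "from_ctx = NULL" step in the patch;
 * inject_error simulates a later failure while building the request.
 */
static int toy_switch(struct toy_engine *e, const void *to_ctx,
		      int same_ctx_reload, int inject_error)
{
	const void *from_ctx = e->active_ctx;
	size_t start = e->ring.tail;

	if (same_ctx_reload && to_ctx == from_ctx)
		from_ctx = NULL;	/* forget the previous context */

	if (to_ctx != from_ctx) {
		if (toy_emit(&e->ring, TOY_SET_CONTEXT))
			goto err;
		e->active_ctx = to_ctx;
	}

	if (inject_error)
		goto err;

	return 0;

err:
	e->ring.tail = start;		/* unwind: next request overwrites these dwords */
	e->active_ctx = from_ctx;	/* may now be NULL */
	return -1;
}
```

After a failed toy_switch() with same_ctx_reload set, the tail is back
where it started and active_ctx is NULL, so the following call emits
TOY_SET_CONTEXT even though the target context has not changed.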

Mika Kuoppala June 11, 2018, 10:07 a.m. UTC | #3

Chris Wilson <chris@chris-wilson.co.uk> writes:

> An issue encountered with switching mm on gen7 is that the GPU likes to
> hang (with the VS unit busy) when told to force restore the current
> context. We can simply workaround this by substituting the
> MI_FORCE_RESTORE flag with a round-trip through the kernel_context,
> forcing the context to be saved and restored; thereby reloading the
> PP_DIR registers and updating the modified page directory!
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Matthew Auld <matthew.william.auld@gmail.com>
> ---
>  drivers/gpu/drm/i915/intel_ringbuffer.c | 31 ++++++++++++++++++++++---
>  1 file changed, 28 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> index 65811e2fa7da..6bfa6030198d 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> @@ -1458,6 +1458,7 @@ static inline int mi_set_context(struct i915_request *rq, u32 flags)
>  		(HAS_LEGACY_SEMAPHORES(i915) && IS_GEN7(i915)) ?
>  		INTEL_INFO(i915)->num_rings - 1 :
>  		0;
> +	bool force_restore = false;
>  	int len;
>  	u32 *cs;
>  
> @@ -1471,6 +1472,12 @@ static inline int mi_set_context(struct i915_request *rq, u32 flags)
>  	len = 4;
>  	if (IS_GEN7(i915))
>  		len += 2 + (num_rings ? 4*num_rings + 6 : 0);
> +	if (flags & MI_FORCE_RESTORE) {
> +		GEM_BUG_ON(flags & MI_RESTORE_INHIBIT);
> +		flags &= ~MI_FORCE_RESTORE;
> +		force_restore = true;
> +		len += 2;
> +	}
>  
>  	cs = intel_ring_begin(rq, len);
>  	if (IS_ERR(cs))
> @@ -1495,6 +1502,21 @@ static inline int mi_set_context(struct i915_request *rq, u32 flags)
>  		}
>  	}
>  
> +	if (force_restore) {
> +		/*
> +		 * The HW doesn't handle being told to restore the current
> +		 * context very well. Quite often it likes goes to go off and
> +		 * sulk, especially when it is meant to be reloading PP_DIR.
> +		 * A very simple fix to force the reload is to simply switch
> +		 * away from the current context and back again.
> +		 */
> +		*cs++ = MI_SET_CONTEXT;
> +		*cs++ = i915_ggtt_offset(to_intel_context(i915->kernel_context,
> +							  engine)->state) |
> +			MI_MM_SPACE_GTT |
> +			MI_RESTORE_INHIBIT;

The above comment could be more verbose about the INHIBIT flag,
as we discussed on IRC. We don't actually restore the kernel
context image state, and we will trample it with the current ctx.

But as we don't ever run anything through kernel_context,
this should be fine, and creating another context just
for switching through seems overkill.

Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>


> +	}
> +
>  	*cs++ = MI_NOOP;
>  	*cs++ = MI_SET_CONTEXT;
>  	*cs++ = i915_ggtt_offset(rq->hw_context->state) | flags;
> @@ -1585,11 +1607,14 @@ static int switch_context(struct i915_request *rq)
>  
>  		to_mm->pd_dirty_rings &= ~intel_engine_flag(engine);
>  		engine->legacy_active_ppgtt = to_mm;
> -		hw_flags = MI_FORCE_RESTORE;
> +
> +		if (to_ctx == from_ctx) {
> +			hw_flags = MI_FORCE_RESTORE;
> +			from_ctx = NULL;
> +		}
>  	}
>  
> -	if (rq->hw_context->state &&
> -	    (to_ctx != from_ctx || hw_flags & MI_FORCE_RESTORE)) {
> +	if (rq->hw_context->state && to_ctx != from_ctx) {
>  		GEM_BUG_ON(engine->id != RCS);
>  
>  		/*
> -- 
> 2.17.1
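
For readers following the mi_set_context() hunks quoted above, the dword
sequence can be sketched in isolation. This is a simplified model, not
the driver function: toy_emit_set_context() is a hypothetical helper, the
TOY_* encodings are placeholders chosen for the example and should not be
read as the hardware values, and the gen7 semaphore preamble and any
trailing commands of the real function are left out. It only shows how
MI_FORCE_RESTORE is stripped from the flags and replaced by an extra
MI_SET_CONTEXT to the kernel context with MI_RESTORE_INHIBIT set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder encodings for illustration only. */
#define TOY_MI_NOOP		0u
#define TOY_MI_SET_CONTEXT	0x18000000u
#define TOY_MI_RESTORE_INHIBIT	(1u << 0)
#define TOY_MI_FORCE_RESTORE	(1u << 1)
#define TOY_MI_MM_SPACE_GTT	(1u << 8)

/*
 * Write the core context-switch dwords into cs and return how many
 * were written. kernel_state and target_state stand in for the GGTT
 * offsets of the two context images.
 */
static size_t toy_emit_set_context(uint32_t *cs, uint32_t kernel_state,
				   uint32_t target_state, uint32_t flags)
{
	uint32_t *start = cs;
	int force_restore = 0;

	if (flags & TOY_MI_FORCE_RESTORE) {
		/* A forced restore that is also inhibited makes no sense. */
		assert(!(flags & TOY_MI_RESTORE_INHIBIT));
		flags &= ~TOY_MI_FORCE_RESTORE;
		force_restore = 1;
	}

	if (force_restore) {
		/* Step away to the kernel context without loading its image. */
		*cs++ = TOY_MI_SET_CONTEXT;
		*cs++ = kernel_state |
			TOY_MI_MM_SPACE_GTT |
			TOY_MI_RESTORE_INHIBIT;
	}

	/* Then switch (back) to the target context as usual. */
	*cs++ = TOY_MI_NOOP;
	*cs++ = TOY_MI_SET_CONTEXT;
	*cs++ = target_state | flags;

	return (size_t)(cs - start);
}
```

With the force flag set the helper writes five dwords, two more than the
plain switch, which matches the "len += 2" adjustment in the patch.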

Joonas Lahtinen June 11, 2018, 10:11 a.m. UTC | #4

Quoting Chris Wilson (2018-06-10 22:43:10)
> An issue encountered with switching mm on gen7 is that the GPU likes to
> hang (with the VS unit busy) when told to force restore the current
> context. We can simply workaround this by substituting the
> MI_FORCE_RESTORE flag with a round-trip through the kernel_context,
> forcing the context to be saved and restored; thereby reloading the
> PP_DIR registers and updating the modified page directory!
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Matthew Auld <matthew.william.auld@gmail.com>

The commit message checks out against the code, and as you obviously
have it crunching through the test suites, it must be the right thing
to do.

Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

Regards, Joonas

Tvrtko Ursulin June 11, 2018, 10:23 a.m. UTC | #5

On 11/06/2018 11:04, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-06-11 11:00:31)
>>
>> On 10/06/2018 20:43, Chris Wilson wrote:
>>> @@ -1585,11 +1607,14 @@ static int switch_context(struct i915_request *rq)
>>>    
>>>                to_mm->pd_dirty_rings &= ~intel_engine_flag(engine);
>>>                engine->legacy_active_ppgtt = to_mm;
>>> -             hw_flags = MI_FORCE_RESTORE;
>>> +
>>> +             if (to_ctx == from_ctx) {
>>
>> Contexts can be the same here , when the parent condition is "if (to_mm
>> != from_mm || to_mm ...)" ?
>>
>>> +                     hw_flags = MI_FORCE_RESTORE;
>>> +                     from_ctx = NULL;
>>
>> Now on the error path we can end up with engine->legacy_active_context
>> == NULL, but commands to switch have been put to the ring. Is that OK?
> 
> On the error path, we unwind the commands in the ring and will overwrite
> them with the next request (see i915_request_alloc() err_unwind:)
> 
> So leaving legacy_active_context = NULL here just ensures we emit a
> context switch next time, no matter what. That is no bad thing as the HW
> ignores redundant MI_SET_CONTEXT (and requires the FORCE_RESTORE flag to
> force a redundant reload).

But we don't set MI_FORCE_RESTORE if the old context is equal to the new
one, since we discarded the information about the previous one?

Regards,

Tvrtko

Chris Wilson June 11, 2018, 10:33 a.m. UTC | #6

Quoting Tvrtko Ursulin (2018-06-11 11:23:53)
> 
> On 11/06/2018 11:04, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-06-11 11:00:31)
> >>
> >> On 10/06/2018 20:43, Chris Wilson wrote:
> >>> @@ -1585,11 +1607,14 @@ static int switch_context(struct i915_request *rq)
> >>>    
> >>>                to_mm->pd_dirty_rings &= ~intel_engine_flag(engine);
> >>>                engine->legacy_active_ppgtt = to_mm;
> >>> -             hw_flags = MI_FORCE_RESTORE;
> >>> +
> >>> +             if (to_ctx == from_ctx) {
> >>
> >> Contexts can be the same here , when the parent condition is "if (to_mm
> >> != from_mm || to_mm ...)" ?
> >>
> >>> +                     hw_flags = MI_FORCE_RESTORE;
> >>> +                     from_ctx = NULL;
> >>
> >> Now on the error path we can end up with engine->legacy_active_context
> >> == NULL, but commands to switch have been put to the ring. Is that OK?
> > 
> > On the error path, we unwind the commands in the ring and will overwrite
> > them with the next request (see i915_request_alloc() err_unwind:)
> > 
> > So leaving legacy_active_context = NULL here just ensures we emit a
> > context switch next time, no matter what. That is no bad thing as the HW
> > ignores redundant MI_SET_CONTEXT (and requires the FORCE_RESTORE flag to
> > force a redundant reload).
> 
> But we don't set MI_FORCE_RESTORE if the old context is equal to new 
> one, since we discarded information about the previous one?

Bah, no, I was thinking the opposite: that we would just end up with more
FORCE_RESTORE. In the next patch, we kill it. Let's just undo all the
unnecessary changes here (can you tell I was avoiding work?)
-Chris
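
The objection in #5 can be illustrated with a small model of the
behaviour described in #2, namely that the hardware ignores a redundant
MI_SET_CONTEXT. Everything here is hypothetical (toy_hw,
toy_hw_set_context(), the integer context ids) and models only what the
thread states, not real hardware. It shows why a plain switch to the
already-current context reloads nothing, while the round-trip through
the kernel context does.

```c
#include <assert.h>

#define TOY_KERNEL_CTX 0	/* hypothetical id for the kernel context */

/* Toy model of the hardware's notion of the current context. */
struct toy_hw {
	int current_ctx;
	unsigned int restores;	/* times a context image was actually loaded */
};

/*
 * Model of MI_SET_CONTEXT as described in the thread: a switch to the
 * context that is already current is ignored; otherwise the context
 * becomes current and its image is loaded unless restore is inhibited.
 */
static void toy_hw_set_context(struct toy_hw *hw, int ctx, int inhibit)
{
	if (hw->current_ctx == ctx)
		return;		/* redundant switch: nothing is reloaded */

	hw->current_ctx = ctx;
	if (!inhibit)
		hw->restores++;
}

/* The workaround from the patch: step away, then come back. */
static void toy_hw_round_trip(struct toy_hw *hw, int ctx)
{
	toy_hw_set_context(hw, TOY_KERNEL_CTX, 1);
	toy_hw_set_context(hw, ctx, 0);
}
```

In this model, emitting a plain switch after the driver has forgotten the
previous context leaves restores unchanged if the hardware is still on
that context, which is the gap Tvrtko points out; only the round-trip
bumps the counter.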
