[3/9] drm/i915/gt: Unlock engine-pm after queuing the kernel context switch

Message ID 20191120093302.3723715-3-chris@chris-wilson.co.uk (mailing list archive)
State New, archived
Series [1/9] drm/i915/selftests: Take a ref to the request we wait upon

Commit Message

Chris Wilson Nov. 20, 2019, 9:32 a.m. UTC
In commit a79ca656b648 ("drm/i915: Push the wakeref->count deferral to
the backend"), I erroneously concluded that we last modify the engine
inside __i915_request_commit() meaning that we could enable concurrent
submission for userspace as we enqueued this request. However, this
falls into a trap with other users of the engine->kernel_context waking
up and submitting their request before the idle-switch is queued, with
the result that the kernel_context is executed out-of-sequence most
likely upsetting the GPU and certainly ourselves when we try to retire
the out-of-sequence requests.

As such we need to hold onto the effective engine->kernel_context mutex
lock (via the engine pm mutex proxy) until we have finished queuing the
request to the engine.

v2: Serialise against concurrent intel_gt_retire_requests()
v3: Describe the hairy locking scheme with intel_gt_retire_requests()
for future reference.
v4: Combine timeline->lock and engine pm release; it's hairy.

Fixes: a79ca656b648 ("drm/i915: Push the wakeref->count deferral to the backend")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_engine_pm.c | 47 +++++++++++++++++++----
 1 file changed, 40 insertions(+), 7 deletions(-)
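
For orientation, the reordering at the heart of the patch condenses to the
sketch below (the lockdep __timeline_mark_lock/unlock annotations and the
GEM_BUG_ON are dropped, so treat it as illustrative rather than a drop-in;
all names are those used in the diff):

static bool switch_to_kernel_context(struct intel_engine_cs *engine)
{
	struct intel_context *ce = engine->kernel_context;
	struct i915_request *rq;

	rq = __i915_request_create(ce, GFP_NOWAIT);
	if (IS_ERR(rq))
		/* Context switch failed, hope for the best! Maybe reset? */
		return true;

	/* Check again on the next retirement. */
	engine->wakeref_serial = engine->serial + 1;
	i915_request_add_active_barriers(rq);

	rq->sched.attr.priority = I915_PRIORITY_BARRIER;
	__i915_request_commit(rq);

	/* Queue before releasing our exclusive hold on the engine. */
	__i915_request_queue(rq, NULL);

	/*
	 * Only now expose ourselves to intel_gt_retire_requests() and to new
	 * submission: enter the timeline and release the engine-pm inside a
	 * single gt->timelines.lock critical section.
	 */
	__intel_timeline_enter_and_pm_release(ce->timeline, engine);

	return false;
}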

Comments

Tvrtko Ursulin Nov. 20, 2019, 11:58 a.m. UTC | #1
On 20/11/2019 09:32, Chris Wilson wrote:
> In commit a79ca656b648 ("drm/i915: Push the wakeref->count deferral to
> the backend"), I erroneously concluded that we last modify the engine
> inside __i915_request_commit() meaning that we could enable concurrent
> submission for userspace as we enqueued this request. However, this
> falls into a trap with other users of the engine->kernel_context waking
> up and submitting their request before the idle-switch is queued, with
> the result that the kernel_context is executed out-of-sequence most
> likely upsetting the GPU and certainly ourselves when we try to retire
> the out-of-sequence requests.
> 
> As such we need to hold onto the effective engine->kernel_context mutex
> lock (via the engine pm mutex proxy) until we have finished queuing the
> request to the engine.
> 
> v2: Serialise against concurrent intel_gt_retire_requests()
> v3: Describe the hairy locking scheme with intel_gt_retire_requests()
> for future reference.
> v4: Combine timeline->lock and engine pm release; it's hairy.
> 
> Fixes: a79ca656b648 ("drm/i915: Push the wakeref->count deferral to the backend")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_engine_pm.c | 47 +++++++++++++++++++----
>   1 file changed, 40 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> index 3c0f490ff2c7..1f517357a268 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> @@ -73,8 +73,25 @@ static inline void __timeline_mark_unlock(struct intel_context *ce,
>   
>   #endif /* !IS_ENABLED(CONFIG_LOCKDEP) */
>   
> +static void
> +__intel_timeline_enter_and_pm_release(struct intel_timeline *tl,
> +				      struct intel_engine_cs *engine)
> +{
> +	struct intel_gt_timelines *timelines = &engine->gt->timelines;
> +
> +	spin_lock(&timelines->lock);
> +
> +	if (!atomic_fetch_inc(&tl->active_count))
> +		list_add_tail(&tl->link, &timelines->active_list);

Hmm, with this new part it maybe matches/answers my question from
"drm/i915/gt: Close race between engine_park and
intel_gt_retire_requests". I think at least, since it now adds a second
place where a timeline can enter the active_list.

But no, where does the intel_timeline_enter race come from? It can't be
userspace submission, since they are blocked on wf->lock.

Regards,

Tvrtko

> +
> +	__intel_wakeref_defer_park(&engine->wakeref);
> +
> +	spin_unlock(&timelines->lock);
> +}
> +
>   static bool switch_to_kernel_context(struct intel_engine_cs *engine)
>   {
> +	struct intel_context *ce = engine->kernel_context;
>   	struct i915_request *rq;
>   	unsigned long flags;
>   	bool result = true;
> @@ -98,16 +115,31 @@ static bool switch_to_kernel_context(struct intel_engine_cs *engine)
>   	 * This should hold true as we can only park the engine after
>   	 * retiring the last request, thus all rings should be empty and
>   	 * all timelines idle.
> +	 *
> +	 * For unlocking, there are 2 other parties and the GPU who have a
> +	 * stake here.
> +	 *
> +	 * A new gpu user will be waiting on the engine-pm to start their
> +	 * engine_unpark. New waiters are predicated on engine->wakeref.count
> +	 * and so intel_wakeref_defer_park() acts like a mutex_unlock of the
> +	 * engine->wakeref.
> +	 *
> +	 * The other party is intel_gt_retire_requests(), which is walking the
> +	 * list of active timelines looking for completions. Meanwhile as soon
> +	 * as we call __i915_request_queue(), the GPU may complete our request.
> +	 * Ergo, if we put ourselves on the timelines.active_list
> +	 * (see intel_timeline_enter()) before we increment the
> +	 * engine->wakeref.count, we may see the request completion and retire
> +	 * it causing an underflow of the engine->wakeref.
>   	 */
> -	flags = __timeline_mark_lock(engine->kernel_context);
> +	flags = __timeline_mark_lock(ce);
> +	GEM_BUG_ON(atomic_read(&ce->timeline->active_count) < 0);
>   
> -	rq = __i915_request_create(engine->kernel_context, GFP_NOWAIT);
> +	rq = __i915_request_create(ce, GFP_NOWAIT);
>   	if (IS_ERR(rq))
>   		/* Context switch failed, hope for the best! Maybe reset? */
>   		goto out_unlock;
>   
> -	intel_timeline_enter(i915_request_timeline(rq));
> -
>   	/* Check again on the next retirement. */
>   	engine->wakeref_serial = engine->serial + 1;
>   	i915_request_add_active_barriers(rq);
> @@ -116,13 +148,14 @@ static bool switch_to_kernel_context(struct intel_engine_cs *engine)
>   	rq->sched.attr.priority = I915_PRIORITY_BARRIER;
>   	__i915_request_commit(rq);
>   
> -	/* Release our exclusive hold on the engine */
> -	__intel_wakeref_defer_park(&engine->wakeref);
>   	__i915_request_queue(rq, NULL);
>   
> +	/* Expose ourselves to intel_gt_retire_requests() and new submission */
> +	__intel_timeline_enter_and_pm_release(ce->timeline, engine);
> +
>   	result = false;
>   out_unlock:
> -	__timeline_mark_unlock(engine->kernel_context, flags);
> +	__timeline_mark_unlock(ce, flags);
>   	return result;
>   }
>   
>
Chris Wilson Nov. 20, 2019, 12:07 p.m. UTC | #2
Quoting Tvrtko Ursulin (2019-11-20 11:58:27)
> 
> On 20/11/2019 09:32, Chris Wilson wrote:
> > In commit a79ca656b648 ("drm/i915: Push the wakeref->count deferral to
> > the backend"), I erroneously concluded that we last modify the engine
> > inside __i915_request_commit() meaning that we could enable concurrent
> > submission for userspace as we enqueued this request. However, this
> > falls into a trap with other users of the engine->kernel_context waking
> > up and submitting their request before the idle-switch is queued, with
> > the result that the kernel_context is executed out-of-sequence most
> > likely upsetting the GPU and certainly ourselves when we try to retire
> > the out-of-sequence requests.
> > 
> > As such we need to hold onto the effective engine->kernel_context mutex
> > lock (via the engine pm mutex proxy) until we have finished queuing the
> > request to the engine.
> > 
> > v2: Serialise against concurrent intel_gt_retire_requests()
> > v3: Describe the hairy locking scheme with intel_gt_retire_requests()
> > for future reference.
> > v4: Combine timeline->lock and engine pm release; it's hairy.
> > 
> > Fixes: a79ca656b648 ("drm/i915: Push the wakeref->count deferral to the backend")
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_engine_pm.c | 47 +++++++++++++++++++----
> >   1 file changed, 40 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> > index 3c0f490ff2c7..1f517357a268 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> > @@ -73,8 +73,25 @@ static inline void __timeline_mark_unlock(struct intel_context *ce,
> >   
> >   #endif /* !IS_ENABLED(CONFIG_LOCKDEP) */
> >   
> > +static void
> > +__intel_timeline_enter_and_pm_release(struct intel_timeline *tl,
> > +                                   struct intel_engine_cs *engine)
> > +{
> > +     struct intel_gt_timelines *timelines = &engine->gt->timelines;
> > +
> > +     spin_lock(&timelines->lock);
> > +
> > +     if (!atomic_fetch_inc(&tl->active_count))
> > +             list_add_tail(&tl->link, &timelines->active_list);
> 
> Hmm with these new part it maybe matches/answers my question from 
> "drm/i915/gt: Close race between engine_park and 
> intel_gt_retire_requests". I think at least. Since it now adds a second 
> place timeline can enter the active_list.
> 
> But no, where does the intel_timeline_enter race come from? Can't be 
> userspace submission since they are blocked on wf->lock.

The race is not just with intel_timeline_enter, but with
intel_gt_retire_requests.

As soon as we are on that list, we may be retired. If we are retired
before adjusting the engine->wakeref.count, we are b0rked.

As soon as we adjust the engine->wakeref.count, another submission may
call intel_timeline_enter, and again may even retire this request. The
enter itself is perfectly fine, but we need to serialise against the
retires.
-Chris
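
Put schematically, the two windows described above are roughly as follows
(an illustrative interleaving pieced together from the explanation, not
taken from the patch; function names are those used in the diff):

  Window 1 - timeline made visible before the wakeref is raised:

    parking CPU                            intel_gt_retire_requests()
    -----------                            --------------------------
    list_add_tail(&tl->link, ...)
                                           walks timelines->active_list,
                                           sees the completed barrier
                                           request and retires it
    __intel_wakeref_defer_park()           -> engine->wakeref underflows

  Window 2 - wakeref raised before the timeline is entered:

    parking CPU                            new submitter
    -----------                            -------------
    __intel_wakeref_defer_park()
                                           unblocked; may call
                                           intel_timeline_enter() and even
                                           retire the barrier request
    list_add_tail(&tl->link, ...)          must be serialised against that

  Doing both steps inside one gt->timelines.lock critical section, the same
  lock intel_gt_retire_requests() holds while walking the active_list,
  closes both windows.
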
Tvrtko Ursulin Nov. 20, 2019, 12:40 p.m. UTC | #3
On 20/11/2019 12:07, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2019-11-20 11:58:27)
>>
>> On 20/11/2019 09:32, Chris Wilson wrote:
>>> In commit a79ca656b648 ("drm/i915: Push the wakeref->count deferral to
>>> the backend"), I erroneously concluded that we last modify the engine
>>> inside __i915_request_commit() meaning that we could enable concurrent
>>> submission for userspace as we enqueued this request. However, this
>>> falls into a trap with other users of the engine->kernel_context waking
>>> up and submitting their request before the idle-switch is queued, with
>>> the result that the kernel_context is executed out-of-sequence most
>>> likely upsetting the GPU and certainly ourselves when we try to retire
>>> the out-of-sequence requests.
>>>
>>> As such we need to hold onto the effective engine->kernel_context mutex
>>> lock (via the engine pm mutex proxy) until we have finished queuing the
>>> request to the engine.
>>>
>>> v2: Serialise against concurrent intel_gt_retire_requests()
>>> v3: Describe the hairy locking scheme with intel_gt_retire_requests()
>>> for future reference.
>>> v4: Combine timeline->lock and engine pm release; it's hairy.
>>>
>>> Fixes: a79ca656b648 ("drm/i915: Push the wakeref->count deferral to the backend")
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/gt/intel_engine_pm.c | 47 +++++++++++++++++++----
>>>    1 file changed, 40 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>> index 3c0f490ff2c7..1f517357a268 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>> @@ -73,8 +73,25 @@ static inline void __timeline_mark_unlock(struct intel_context *ce,
>>>    
>>>    #endif /* !IS_ENABLED(CONFIG_LOCKDEP) */
>>>    
>>> +static void
>>> +__intel_timeline_enter_and_pm_release(struct intel_timeline *tl,
>>> +                                   struct intel_engine_cs *engine)
>>> +{
>>> +     struct intel_gt_timelines *timelines = &engine->gt->timelines;
>>> +
>>> +     spin_lock(&timelines->lock);
>>> +
>>> +     if (!atomic_fetch_inc(&tl->active_count))
>>> +             list_add_tail(&tl->link, &timelines->active_list);
>>
>> Hmm with these new part it maybe matches/answers my question from
>> "drm/i915/gt: Close race between engine_park and
>> intel_gt_retire_requests". I think at least. Since it now adds a second
>> place timeline can enter the active_list.
>>
>> But no, where does the intel_timeline_enter race come from? Can't be
>> userspace submission since they are blocked on wf->lock.
> 
> The race is not just with intel_timeline_enter, but with
> intel_gt_retire_requests.
> 
> As soon as we are on that list, we may be retired. If we are retired
> before adjusting the engine->wakeref.count, we are b0rked.
> 
> As soon as we adjust the engine->wakeref.count, another submission may
> call intel_timeline_enter, and again may even retire this request. The
> enter itself is perfectly fine, but we need to serialise against the
> retires.

I think the two patches combined work, I am just not sure the two
atomic_fetch_inc()->list_add() paths are needed now that you re-ordered
__i915_request_queue and __intel_wakeref_defer_park - that's the part
which is confusing me. But it also does no harm...

Regards,

Tvrtko
Chris Wilson Nov. 20, 2019, 12:44 p.m. UTC | #4
Quoting Tvrtko Ursulin (2019-11-20 12:40:13)
> 
> On 20/11/2019 12:07, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2019-11-20 11:58:27)
> >>
> >> On 20/11/2019 09:32, Chris Wilson wrote:
> >>> In commit a79ca656b648 ("drm/i915: Push the wakeref->count deferral to
> >>> the backend"), I erroneously concluded that we last modify the engine
> >>> inside __i915_request_commit() meaning that we could enable concurrent
> >>> submission for userspace as we enqueued this request. However, this
> >>> falls into a trap with other users of the engine->kernel_context waking
> >>> up and submitting their request before the idle-switch is queued, with
> >>> the result that the kernel_context is executed out-of-sequence most
> >>> likely upsetting the GPU and certainly ourselves when we try to retire
> >>> the out-of-sequence requests.
> >>>
> >>> As such we need to hold onto the effective engine->kernel_context mutex
> >>> lock (via the engine pm mutex proxy) until we have finished queuing the
> >>> request to the engine.
> >>>
> >>> v2: Serialise against concurrent intel_gt_retire_requests()
> >>> v3: Describe the hairy locking scheme with intel_gt_retire_requests()
> >>> for future reference.
> >>> v4: Combine timeline->lock and engine pm release; it's hairy.
> >>>
> >>> Fixes: a79ca656b648 ("drm/i915: Push the wakeref->count deferral to the backend")
> >>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> >>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>> ---
> >>>    drivers/gpu/drm/i915/gt/intel_engine_pm.c | 47 +++++++++++++++++++----
> >>>    1 file changed, 40 insertions(+), 7 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> >>> index 3c0f490ff2c7..1f517357a268 100644
> >>> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> >>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> >>> @@ -73,8 +73,25 @@ static inline void __timeline_mark_unlock(struct intel_context *ce,
> >>>    
> >>>    #endif /* !IS_ENABLED(CONFIG_LOCKDEP) */
> >>>    
> >>> +static void
> >>> +__intel_timeline_enter_and_pm_release(struct intel_timeline *tl,
> >>> +                                   struct intel_engine_cs *engine)
> >>> +{
> >>> +     struct intel_gt_timelines *timelines = &engine->gt->timelines;
> >>> +
> >>> +     spin_lock(&timelines->lock);
> >>> +
> >>> +     if (!atomic_fetch_inc(&tl->active_count))
> >>> +             list_add_tail(&tl->link, &timelines->active_list);
> >>
> >> Hmm with these new part it maybe matches/answers my question from
> >> "drm/i915/gt: Close race between engine_park and
> >> intel_gt_retire_requests". I think at least. Since it now adds a second
> >> place timeline can enter the active_list.
> >>
> >> But no, where does the intel_timeline_enter race come from? Can't be
> >> userspace submission since they are blocked on wf->lock.
> > 
> > The race is not just with intel_timeline_enter, but with
> > intel_gt_retire_requests.
> > 
> > As soon as we are on that list, we may be retired. If we are retired
> > before adjusting the engine->wakeref.count, we are b0rked.
> > 
> > As soon as we adjust the engine->wakeref.count, another submission may
> > call intel_timeline_enter, and again may even retire this request. The
> > enter itself is perfectly fine, but we need to serialise against the
> > retires.
> 
> I think the two patches combined work, I am just not sure two 
> atomic_fetch_inc()->list_add() are needed now that you re-ordered 
> __i915_requests_queue and __intel_wakeref_defer_park - that's the part 
> which is confusing me. But it also doesn't harm...

I tried to get away without it, but the selftests hammer very heavily on
the engine->kernel_context, so we do encounter scenarios where we are
using the kernel_context to park on one CPU while submitting a new
request on another.

I would have got away with it but for these pesky selftests.
-Chris
Tvrtko Ursulin Nov. 20, 2019, 1:19 p.m. UTC | #5
On 20/11/2019 12:44, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2019-11-20 12:40:13)
>>
>> On 20/11/2019 12:07, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2019-11-20 11:58:27)
>>>>
>>>> On 20/11/2019 09:32, Chris Wilson wrote:
>>>>> In commit a79ca656b648 ("drm/i915: Push the wakeref->count deferral to
>>>>> the backend"), I erroneously concluded that we last modify the engine
>>>>> inside __i915_request_commit() meaning that we could enable concurrent
>>>>> submission for userspace as we enqueued this request. However, this
>>>>> falls into a trap with other users of the engine->kernel_context waking
>>>>> up and submitting their request before the idle-switch is queued, with
>>>>> the result that the kernel_context is executed out-of-sequence most
>>>>> likely upsetting the GPU and certainly ourselves when we try to retire
>>>>> the out-of-sequence requests.
>>>>>
>>>>> As such we need to hold onto the effective engine->kernel_context mutex
>>>>> lock (via the engine pm mutex proxy) until we have finished queuing the
>>>>> request to the engine.
>>>>>
>>>>> v2: Serialise against concurrent intel_gt_retire_requests()
>>>>> v3: Describe the hairy locking scheme with intel_gt_retire_requests()
>>>>> for future reference.
>>>>> v4: Combine timeline->lock and engine pm release; it's hairy.
>>>>>
>>>>> Fixes: a79ca656b648 ("drm/i915: Push the wakeref->count deferral to the backend")
>>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>> ---
>>>>>     drivers/gpu/drm/i915/gt/intel_engine_pm.c | 47 +++++++++++++++++++----
>>>>>     1 file changed, 40 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>> index 3c0f490ff2c7..1f517357a268 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>> @@ -73,8 +73,25 @@ static inline void __timeline_mark_unlock(struct intel_context *ce,
>>>>>     
>>>>>     #endif /* !IS_ENABLED(CONFIG_LOCKDEP) */
>>>>>     
>>>>> +static void
>>>>> +__intel_timeline_enter_and_pm_release(struct intel_timeline *tl,
>>>>> +                                   struct intel_engine_cs *engine)
>>>>> +{
>>>>> +     struct intel_gt_timelines *timelines = &engine->gt->timelines;
>>>>> +
>>>>> +     spin_lock(&timelines->lock);
>>>>> +
>>>>> +     if (!atomic_fetch_inc(&tl->active_count))
>>>>> +             list_add_tail(&tl->link, &timelines->active_list);
>>>>
>>>> Hmm with these new part it maybe matches/answers my question from
>>>> "drm/i915/gt: Close race between engine_park and
>>>> intel_gt_retire_requests". I think at least. Since it now adds a second
>>>> place timeline can enter the active_list.
>>>>
>>>> But no, where does the intel_timeline_enter race come from? Can't be
>>>> userspace submission since they are blocked on wf->lock.
>>>
>>> The race is not just with intel_timeline_enter, but with
>>> intel_gt_retire_requests.
>>>
>>> As soon as we are on that list, we may be retired. If we are retired
>>> before adjusting the engine->wakeref.count, we are b0rked.
>>>
>>> As soon as we adjust the engine->wakeref.count, another submission may
>>> call intel_timeline_enter, and again may even retire this request. The
>>> enter itself is perfectly fine, but we need to serialise against the
>>> retires.
>>
>> I think the two patches combined work, I am just not sure two
>> atomic_fetch_inc()->list_add() are needed now that you re-ordered
>> __i915_requests_queue and __intel_wakeref_defer_park - that's the part
>> which is confusing me. But it also doesn't harm...
> 
> I tried to get away with not, but the selftests hammer very heavily on
> the engine->kernel_context so we do encounter the scenarios where we are
> using the kernel_context to park on one cpu while submitting a new
> request on another.
> 
> I would have got away with it but for these pesky selftests.

Okay, I'll trust you on that:

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko

Patch

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
index 3c0f490ff2c7..1f517357a268 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
@@ -73,8 +73,25 @@  static inline void __timeline_mark_unlock(struct intel_context *ce,
 
 #endif /* !IS_ENABLED(CONFIG_LOCKDEP) */
 
+static void
+__intel_timeline_enter_and_pm_release(struct intel_timeline *tl,
+				      struct intel_engine_cs *engine)
+{
+	struct intel_gt_timelines *timelines = &engine->gt->timelines;
+
+	spin_lock(&timelines->lock);
+
+	if (!atomic_fetch_inc(&tl->active_count))
+		list_add_tail(&tl->link, &timelines->active_list);
+
+	__intel_wakeref_defer_park(&engine->wakeref);
+
+	spin_unlock(&timelines->lock);
+}
+
 static bool switch_to_kernel_context(struct intel_engine_cs *engine)
 {
+	struct intel_context *ce = engine->kernel_context;
 	struct i915_request *rq;
 	unsigned long flags;
 	bool result = true;
@@ -98,16 +115,31 @@  static bool switch_to_kernel_context(struct intel_engine_cs *engine)
 	 * This should hold true as we can only park the engine after
 	 * retiring the last request, thus all rings should be empty and
 	 * all timelines idle.
+	 *
+	 * For unlocking, there are 2 other parties and the GPU who have a
+	 * stake here.
+	 *
+	 * A new gpu user will be waiting on the engine-pm to start their
+	 * engine_unpark. New waiters are predicated on engine->wakeref.count
+	 * and so intel_wakeref_defer_park() acts like a mutex_unlock of the
+	 * engine->wakeref.
+	 *
+	 * The other party is intel_gt_retire_requests(), which is walking the
+	 * list of active timelines looking for completions. Meanwhile as soon
+	 * as we call __i915_request_queue(), the GPU may complete our request.
+	 * Ergo, if we put ourselves on the timelines.active_list
+	 * (see intel_timeline_enter()) before we increment the
+	 * engine->wakeref.count, we may see the request completion and retire
+	 * it causing an underflow of the engine->wakeref.
 	 */
-	flags = __timeline_mark_lock(engine->kernel_context);
+	flags = __timeline_mark_lock(ce);
+	GEM_BUG_ON(atomic_read(&ce->timeline->active_count) < 0);
 
-	rq = __i915_request_create(engine->kernel_context, GFP_NOWAIT);
+	rq = __i915_request_create(ce, GFP_NOWAIT);
 	if (IS_ERR(rq))
 		/* Context switch failed, hope for the best! Maybe reset? */
 		goto out_unlock;
 
-	intel_timeline_enter(i915_request_timeline(rq));
-
 	/* Check again on the next retirement. */
 	engine->wakeref_serial = engine->serial + 1;
 	i915_request_add_active_barriers(rq);
@@ -116,13 +148,14 @@  static bool switch_to_kernel_context(struct intel_engine_cs *engine)
 	rq->sched.attr.priority = I915_PRIORITY_BARRIER;
 	__i915_request_commit(rq);
 
-	/* Release our exclusive hold on the engine */
-	__intel_wakeref_defer_park(&engine->wakeref);
 	__i915_request_queue(rq, NULL);
 
+	/* Expose ourselves to intel_gt_retire_requests() and new submission */
+	__intel_timeline_enter_and_pm_release(ce->timeline, engine);
+
 	result = false;
 out_unlock:
-	__timeline_mark_unlock(engine->kernel_context, flags);
+	__timeline_mark_unlock(ce, flags);
 	return result;
 }