
[v4] drm/i915: Execlists small cleanups and micro-optimisations

Message ID 1456505912-22286-1-git-send-email-tvrtko.ursulin@linux.intel.com (mailing list archive)
State New, archived

Commit Message

Tvrtko Ursulin Feb. 26, 2016, 4:58 p.m. UTC
From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Assorted changes covering code cleanup, removal of an invariant
conditional from the interrupt handler, reduced lock contention
and optimised MMIO access.

 * Remove needless initialization.
 * Improve cache locality by reorganizing code and/or using
   branch hints to keep unexpected or error conditions out
   of line.
 * Favor the busy submit path over the empty queue.
 * Less branching in hot paths.

v2:

 * Avoid mmio reads when possible. (Chris Wilson)
 * Use natural integer size for csb indices.
 * Remove useless return value from execlists_update_context.
 * Extract the 32-bit ppgtt PDP update so it is out of line and
   shared by its two callers.
 * Grab forcewake across all mmio operations to ease the
   load on the uncore lock and use cheaper mmio ops.

v3:

 * Removed some more pointless u8 data types.
 * Removed unused return from execlists_context_queue.
 * Commit message updates.

v4:
 * Unclumsify the unqueue if statement. (Chris Wilson)
 * Hide forcewake from the queuing function. (Chris Wilson)

Version 3 now makes the irq handling code path ~20% smaller on
48-bit PPGTT hardware, and a little less elsewhere. Hot paths
are mostly inlined now, and hammering on the uncore spinlock is
greatly reduced, along with mmio traffic to some extent.

Benchmarking with "gem_latency -n 100" (keeps submitting
batches of 100 nop instructions) shows approximately 4% higher
throughput, 2% less CPU time and 22% smaller latencies. This was
on a big-core machine; small cores could benefit even more.

The most likely reasons for the improvements are the MMIO
optimization and the reduction in uncore lock traffic.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/intel_lrc.c        | 214 +++++++++++++++++---------------
 drivers/gpu/drm/i915/intel_ringbuffer.h |   3 +-
 2 files changed, 114 insertions(+), 103 deletions(-)
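
For readers skimming, the forcewake/uncore-lock point above boils down
to the bracketing pattern sketched below. This is an illustrative
condensation of the hunks in the patch at the bottom, not new code: in
the actual patch the lock and forcewake are taken once around the whole
unqueue/irq-handler paths rather than inside the ELSP write helper, and
the helpers and macros named here are the existing i915 ones used in
the diff.

static void elsp_write_sketch(struct drm_i915_private *dev_priv,
                              struct intel_engine_cs *ring,
                              u64 desc0, u64 desc1)
{
        /* Take the uncore lock and forcewake once for the whole burst... */
        spin_lock(&dev_priv->uncore.lock);
        intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);

        /* ...then use the cheaper _FW accessors, which skip the
         * per-access forcewake and lock handling. Both descriptors are
         * written, port 1 before port 0. */
        I915_WRITE_FW(RING_ELSP(ring), upper_32_bits(desc1));
        I915_WRITE_FW(RING_ELSP(ring), lower_32_bits(desc1));
        I915_WRITE_FW(RING_ELSP(ring), upper_32_bits(desc0));
        I915_WRITE_FW(RING_ELSP(ring), lower_32_bits(desc0));

        /* ELSP is a write-only register, so post through a nearby one. */
        POSTING_READ_FW(RING_EXECLIST_STATUS_LO(ring));

        intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
        spin_unlock(&dev_priv->uncore.lock);
}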

Comments

Chris Wilson Feb. 26, 2016, 8:24 p.m. UTC | #1
On Fri, Feb 26, 2016 at 04:58:32PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> Assorted changes in the areas of code cleanup, reduction of
> invariant conditional in the interrupt handler and lock
> contention and MMIO access optimisation.
> 
>  * Remove needless initialization.
>  * Improve cache locality by reorganizing code and/or using
>    branch hints to keep unexpected or error conditions out
>    of line.
>  * Favor busy submit path vs. empty queue.
>  * Less branching in hot-paths.
> 
> v2:
> 
>  * Avoid mmio reads when possible. (Chris Wilson)
>  * Use natural integer size for csb indices.
>  * Remove useless return value from execlists_update_context.
>  * Extract 32-bit ppgtt PDPs update so it is out of line and
>    shared with two callers.
>  * Grab forcewake across all mmio operations to ease the
>    load on uncore lock and use chepear mmio ops.
> 
> v3:
> 
>  * Removed some more pointless u8 data types.
>  * Removed unused return from execlists_context_queue.
>  * Commit message updates.
> 
> v4:
>  * Unclumsify the unqueue if statement. (Chris Wilson)
>  * Hide forcewake from the queuing function. (Chris Wilson)
> 
> Version 3 now makes the irq handling code path ~20% smaller on
> 48-bit PPGTT hardware, and a little bit less elsewhere. Hot
> paths are mostly in-line now and hammering on the uncore
> spinlock is greatly reduced together with mmio traffic to an
> extent.
> 
> Benchmarking with "gem_latency -n 100" (keep submitting
> batches with 100 nop instruction) shows approximately 4% higher
> throughput, 2% less CPU time and 22% smaller latencies. This was
> on a big-core while small-cores could benefit even more.

Just add a quick comment about "gem_latency -n 0" suggesting an oddity
with synchronous workloads that bears further study (just so that we
have the hint/reminder about the test case to run).
 
> Most likely reason for the improvements are the MMIO
> optimization and uncore lock traffic reduction.
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris
Tvrtko Ursulin Feb. 29, 2016, 10:45 a.m. UTC | #2
On 26/02/16 20:24, Chris Wilson wrote:
> On Fri, Feb 26, 2016 at 04:58:32PM +0000, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> Assorted changes in the areas of code cleanup, reduction of
>> invariant conditional in the interrupt handler and lock
>> contention and MMIO access optimisation.
>>
>>   * Remove needless initialization.
>>   * Improve cache locality by reorganizing code and/or using
>>     branch hints to keep unexpected or error conditions out
>>     of line.
>>   * Favor busy submit path vs. empty queue.
>>   * Less branching in hot-paths.
>>
>> v2:
>>
>>   * Avoid mmio reads when possible. (Chris Wilson)
>>   * Use natural integer size for csb indices.
>>   * Remove useless return value from execlists_update_context.
>>   * Extract 32-bit ppgtt PDPs update so it is out of line and
>>     shared with two callers.
>>   * Grab forcewake across all mmio operations to ease the
>>     load on uncore lock and use chepear mmio ops.
>>
>> v3:
>>
>>   * Removed some more pointless u8 data types.
>>   * Removed unused return from execlists_context_queue.
>>   * Commit message updates.
>>
>> v4:
>>   * Unclumsify the unqueue if statement. (Chris Wilson)
>>   * Hide forcewake from the queuing function. (Chris Wilson)
>>
>> Version 3 now makes the irq handling code path ~20% smaller on
>> 48-bit PPGTT hardware, and a little bit less elsewhere. Hot
>> paths are mostly in-line now and hammering on the uncore
>> spinlock is greatly reduced together with mmio traffic to an
>> extent.
>>
>> Benchmarking with "gem_latency -n 100" (keep submitting
>> batches with 100 nop instruction) shows approximately 4% higher
>> throughput, 2% less CPU time and 22% smaller latencies. This was
>> on a big-core while small-cores could benefit even more.
>
> Just add a quick comment about "gem_latency -n 0" suggesting an oddity
> with synchronous workloads that bears further study (just so that we
> have the hint/reminder about the test case to run).
>

This ok?

"""
One unexplained result is with "gem_latency -n 0" (dispatching
empty batches) which shows 5% more throughput, 8% less CPU time,
25% better producer and consumer latencies, but 15% higher
dispatch latency which looks like a possible measuring artifact.
"""

>> Most likely reason for the improvements are the MMIO
>> optimization and uncore lock traffic reduction.
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>

Thanks!

Regards,

Tvrtko
Chris Wilson Feb. 29, 2016, 10:53 a.m. UTC | #3
On Mon, Feb 29, 2016 at 10:45:34AM +0000, Tvrtko Ursulin wrote:
> This ok?
> 
> """
> One unexplained result is with "gem_latency -n 0" (dispatching
> empty batches) which shows 5% more throughput, 8% less CPU time,
> 25% better producer and consumer latencies, but 15% higher
> dispatch latency which looks like a possible measuring artifact.
> """

I doubt it is a measuring artefact since throughput = 1/(dispatch +
latency + test overhead), and the dispatch latency here is larger than
the wakeup latency and so has greater impact on throughput in this
scenario.
-Chris
Tvrtko Ursulin Feb. 29, 2016, 11:01 a.m. UTC | #4
On 29/02/16 10:53, Chris Wilson wrote:
> On Mon, Feb 29, 2016 at 10:45:34AM +0000, Tvrtko Ursulin wrote:
>> This ok?
>>
>> """
>> One unexplained result is with "gem_latency -n 0" (dispatching
>> empty batches) which shows 5% more throughput, 8% less CPU time,
>> 25% better producer and consumer latencies, but 15% higher
>> dispatch latency which looks like a possible measuring artifact.
>> """
>
> I doubt it is a measuring artefact since throughput = 1/(dispatch +
> latency + test overhead), and the dispatch latency here is larger than
> the wakeup latency and so has greater impact on throughput in this
> scenario.

I don't follow you. If dispatch latency has a larger effect on
throughput, how do you explain it increasing while throughput is still
better?

I see in gem_latency this block:

	measure_latency(p, &p->latency);
	igt_stats_push(&p->dispatch, *p->last_timestamp - start);

measure_latency waits for the batch to complete, and then the dispatch
latency uses p->last_timestamp, which is written by the GPU rather than
being a CPU view of the latency?

Regards,

Tvrtko
Chris Wilson Feb. 29, 2016, 11:13 a.m. UTC | #5
On Mon, Feb 29, 2016 at 11:01:49AM +0000, Tvrtko Ursulin wrote:
> 
> On 29/02/16 10:53, Chris Wilson wrote:
> >On Mon, Feb 29, 2016 at 10:45:34AM +0000, Tvrtko Ursulin wrote:
> >>This ok?
> >>
> >>"""
> >>One unexplained result is with "gem_latency -n 0" (dispatching
> >>empty batches) which shows 5% more throughput, 8% less CPU time,
> >>25% better producer and consumer latencies, but 15% higher
> >>dispatch latency which looks like a possible measuring artifact.
> >>"""
> >
> >I doubt it is a measuring artefact since throughput = 1/(dispatch +
> >latency + test overhead), and the dispatch latency here is larger than
> >the wakeup latency and so has greater impact on throughput in this
> >scenario.
> 
> I don't follow you, if dispatch latency has larger effect on
> throughput how to explain the increase and still better throughput?
> 
> I see in gem_latency this block:
> 
> 	measure_latency(p, &p->latency);
> 	igt_stats_push(&p->dispatch, *p->last_timestamp - start);
> 
> measure_latency waits for the batch to complete and then dispatch
> latency uses p->last_timestamp which is something written by the GPU
> and not a CPU view of the latency ?

Exactly, measurements are made entirely from the running engine clock
(which is a ~80ns clock and should be verified during init). The register
is read before dispatch, inside the batch and then at wakeup, but the
information is presented as dispatch = batch - start and
wakeup = end - batch, so to get the duration (end - start) we need
to add the two together. Throughput will also include some overhead from
the test iteration (that will mainly be scheduler interference).

My comment about dispatch having the greater effect is in terms of
its higher absolute value (so the relative % means a larger change wrt
throughput).
-Chris
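
A rough sketch of the measurement model described above, for reference.
The struct and function names are illustrative rather than gem_latency's
actual code; only igt_stats_push() is the real IGT helper quoted earlier.
All three timestamps come from the same engine clock, so dispatch and
wakeup share one time base and their sum is the total batch duration that
feeds into throughput (together with per-iteration test overhead).

#include <stdint.h>
#include "igt_stats.h"  /* igt_stats_t, igt_stats_push() */

struct latency_sample {
        uint32_t start; /* engine timestamp read just before dispatch */
        uint32_t batch; /* engine timestamp written by the batch itself */
        uint32_t end;   /* engine timestamp read at wakeup */
};

/* dispatch = batch - start, wakeup = end - batch; added together they
 * give the full duration (end - start). */
static void account_sample(const struct latency_sample *s,
                           igt_stats_t *dispatch, igt_stats_t *wakeup)
{
        igt_stats_push(dispatch, s->batch - s->start);
        igt_stats_push(wakeup, s->end - s->batch);
}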
Tvrtko Ursulin Feb. 29, 2016, 11:40 a.m. UTC | #6
On 29/02/16 11:13, Chris Wilson wrote:
> On Mon, Feb 29, 2016 at 11:01:49AM +0000, Tvrtko Ursulin wrote:
>>
>> On 29/02/16 10:53, Chris Wilson wrote:
>>> On Mon, Feb 29, 2016 at 10:45:34AM +0000, Tvrtko Ursulin wrote:
>>>> This ok?
>>>>
>>>> """
>>>> One unexplained result is with "gem_latency -n 0" (dispatching
>>>> empty batches) which shows 5% more throughput, 8% less CPU time,
>>>> 25% better producer and consumer latencies, but 15% higher
>>>> dispatch latency which looks like a possible measuring artifact.
>>>> """
>>>
>>> I doubt it is a measuring artefact since throughput = 1/(dispatch +
>>> latency + test overhead), and the dispatch latency here is larger than
>>> the wakeup latency and so has greater impact on throughput in this
>>> scenario.
>>
>> I don't follow you, if dispatch latency has larger effect on
>> throughput how to explain the increase and still better throughput?
>>
>> I see in gem_latency this block:
>>
>> 	measure_latency(p, &p->latency);
>> 	igt_stats_push(&p->dispatch, *p->last_timestamp - start);
>>
>> measure_latency waits for the batch to complete and then dispatch
>> latency uses p->last_timestamp which is something written by the GPU
>> and not a CPU view of the latency ?
>
> Exactly, measurements are entirely made from the running engine clock
> (which is ~80ns clock, and should be verified during init). The register
> is read before dispatch, inside the batch and then at wakeup, but the
> information is presented as dispatch = batch - start and
> wakeup = end - batch, so to get the duration (end - start) we need
> to add the two together. Throughput will also include some overhead from
> the test iteration (that will mainly be scheduler interference).
>
> My comment about dispatch having greater effect, is in terms of
> its higher absolute value (so the relative % means a larger change wrt
> throughput).

Change to this then?

"""
     One unexplained result is with "gem_latency -n 0" (dispatching
     empty batches) which shows 5% more throughput, 8% less CPU time,
     25% better producer and consumer latencies, but 15% higher
     dispatch latency which looks like an amplified effect of test
     overhead.
"""

Regards,

Tvrtko
Chris Wilson Feb. 29, 2016, 11:48 a.m. UTC | #7
On Mon, Feb 29, 2016 at 11:40:37AM +0000, Tvrtko Ursulin wrote:
> 
> 
> On 29/02/16 11:13, Chris Wilson wrote:
> >On Mon, Feb 29, 2016 at 11:01:49AM +0000, Tvrtko Ursulin wrote:
> >>
> >>On 29/02/16 10:53, Chris Wilson wrote:
> >>>On Mon, Feb 29, 2016 at 10:45:34AM +0000, Tvrtko Ursulin wrote:
> >>>>This ok?
> >>>>
> >>>>"""
> >>>>One unexplained result is with "gem_latency -n 0" (dispatching
> >>>>empty batches) which shows 5% more throughput, 8% less CPU time,
> >>>>25% better producer and consumer latencies, but 15% higher
> >>>>dispatch latency which looks like a possible measuring artifact.
> >>>>"""
> >>>
> >>>I doubt it is a measuring artefact since throughput = 1/(dispatch +
> >>>latency + test overhead), and the dispatch latency here is larger than
> >>>the wakeup latency and so has greater impact on throughput in this
> >>>scenario.
> >>
> >>I don't follow you, if dispatch latency has larger effect on
> >>throughput how to explain the increase and still better throughput?
> >>
> >>I see in gem_latency this block:
> >>
> >>	measure_latency(p, &p->latency);
> >>	igt_stats_push(&p->dispatch, *p->last_timestamp - start);
> >>
> >>measure_latency waits for the batch to complete and then dispatch
> >>latency uses p->last_timestamp which is something written by the GPU
> >>and not a CPU view of the latency ?
> >
> >Exactly, measurements are entirely made from the running engine clock
> >(which is ~80ns clock, and should be verified during init). The register
> >is read before dispatch, inside the batch and then at wakeup, but the
> >information is presented as dispatch = batch - start and
> >wakeup = end - batch, so to get the duration (end - start) we need
> >to add the two together. Throughput will also include some overhead from
> >the test iteration (that will mainly be scheduler interference).
> >
> >My comment about dispatch having greater effect, is in terms of
> >its higher absolute value (so the relative % means a larger change wrt
> >throughput).
> 
> Change to this then?
> 
> """
>     One unexplained result is with "gem_latency -n 0" (dispatching
>     empty batches) which shows 5% more throughput, 8% less CPU time,
>     25% better producer and consumer latencies, but 15% higher
>     dispatch latency which looks like an amplified effect of test
>     overhead.
> """

No. Dispatch latency is important, and this attempts to pass the change
off as a test effect when, to the best of my knowledge, it is a valid
external observation of the system.
-Chris
Tvrtko Ursulin Feb. 29, 2016, 11:59 a.m. UTC | #8
On 29/02/16 11:48, Chris Wilson wrote:
> On Mon, Feb 29, 2016 at 11:40:37AM +0000, Tvrtko Ursulin wrote:
>>
>>
>> On 29/02/16 11:13, Chris Wilson wrote:
>>> On Mon, Feb 29, 2016 at 11:01:49AM +0000, Tvrtko Ursulin wrote:
>>>>
>>>> On 29/02/16 10:53, Chris Wilson wrote:
>>>>> On Mon, Feb 29, 2016 at 10:45:34AM +0000, Tvrtko Ursulin wrote:
>>>>>> This ok?
>>>>>>
>>>>>> """
>>>>>> One unexplained result is with "gem_latency -n 0" (dispatching
>>>>>> empty batches) which shows 5% more throughput, 8% less CPU time,
>>>>>> 25% better producer and consumer latencies, but 15% higher
>>>>>> dispatch latency which looks like a possible measuring artifact.
>>>>>> """
>>>>>
>>>>> I doubt it is a measuring artefact since throughput = 1/(dispatch +
>>>>> latency + test overhead), and the dispatch latency here is larger than
>>>>> the wakeup latency and so has greater impact on throughput in this
>>>>> scenario.
>>>>
>>>> I don't follow you, if dispatch latency has larger effect on
>>>> throughput how to explain the increase and still better throughput?
>>>>
>>>> I see in gem_latency this block:
>>>>
>>>> 	measure_latency(p, &p->latency);
>>>> 	igt_stats_push(&p->dispatch, *p->last_timestamp - start);
>>>>
>>>> measure_latency waits for the batch to complete and then dispatch
>>>> latency uses p->last_timestamp which is something written by the GPU
>>>> and not a CPU view of the latency ?
>>>
>>> Exactly, measurements are entirely made from the running engine clock
>>> (which is ~80ns clock, and should be verified during init). The register
>>> is read before dispatch, inside the batch and then at wakeup, but the
>>> information is presented as dispatch = batch - start and
>>> wakeup = end - batch, so to get the duration (end - start) we need
>>> to add the two together. Throughput will also include some overhead from
>>> the test iteration (that will mainly be scheduler interference).
>>>
>>> My comment about dispatch having greater effect, is in terms of
>>> its higher absolute value (so the relative % means a larger change wrt
>>> throughput).
>>
>> Change to this then?
>>
>> """
>>      One unexplained result is with "gem_latency -n 0" (dispatching
>>      empty batches) which shows 5% more throughput, 8% less CPU time,
>>      25% better producer and consumer latencies, but 15% higher
>>      dispatch latency which looks like an amplified effect of test
>>      overhead.
>> """
>
> No. Dispatch latency is important and this attempts to pass the change
> off a test effect when to the best of my knowledge it is a valid external
> observation of the system.

I just don't understand how it can be valid when we have executed more 
empty batches than before in a unit of time.

Even the combined dispatch + wakeup latency is worse, yet throughput is 
still better.

That sounds impossible to me, so it must be the effect of using two 
different time sources: the CPU side to measure throughput and the GPU 
side to measure dispatch latency. I don't know, could you suggest a 
paragraph to add to the commit message so we can close on this?

Regards,

Tvrtko
Tvrtko Ursulin March 1, 2016, 10:21 a.m. UTC | #9
On 29/02/16 11:59, Tvrtko Ursulin wrote:
>
>
> On 29/02/16 11:48, Chris Wilson wrote:
>> On Mon, Feb 29, 2016 at 11:40:37AM +0000, Tvrtko Ursulin wrote:
>>>
>>>
>>> On 29/02/16 11:13, Chris Wilson wrote:
>>>> On Mon, Feb 29, 2016 at 11:01:49AM +0000, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 29/02/16 10:53, Chris Wilson wrote:
>>>>>> On Mon, Feb 29, 2016 at 10:45:34AM +0000, Tvrtko Ursulin wrote:
>>>>>>> This ok?
>>>>>>>
>>>>>>> """
>>>>>>> One unexplained result is with "gem_latency -n 0" (dispatching
>>>>>>> empty batches) which shows 5% more throughput, 8% less CPU time,
>>>>>>> 25% better producer and consumer latencies, but 15% higher
>>>>>>> dispatch latency which looks like a possible measuring artifact.
>>>>>>> """
>>>>>>
>>>>>> I doubt it is a measuring artefact since throughput = 1/(dispatch +
>>>>>> latency + test overhead), and the dispatch latency here is larger
>>>>>> than
>>>>>> the wakeup latency and so has greater impact on throughput in this
>>>>>> scenario.
>>>>>
>>>>> I don't follow you, if dispatch latency has larger effect on
>>>>> throughput how to explain the increase and still better throughput?
>>>>>
>>>>> I see in gem_latency this block:
>>>>>
>>>>>     measure_latency(p, &p->latency);
>>>>>     igt_stats_push(&p->dispatch, *p->last_timestamp - start);
>>>>>
>>>>> measure_latency waits for the batch to complete and then dispatch
>>>>> latency uses p->last_timestamp which is something written by the GPU
>>>>> and not a CPU view of the latency ?
>>>>
>>>> Exactly, measurements are entirely made from the running engine clock
>>>> (which is ~80ns clock, and should be verified during init). The
>>>> register
>>>> is read before dispatch, inside the batch and then at wakeup, but the
>>>> information is presented as dispatch = batch - start and
>>>> wakeup = end - batch, so to get the duration (end - start) we need
>>>> to add the two together. Throughput will also include some overhead
>>>> from
>>>> the test iteration (that will mainly be scheduler interference).
>>>>
>>>> My comment about dispatch having greater effect, is in terms of
>>>> its higher absolute value (so the relative % means a larger change wrt
>>>> throughput).
>>>
>>> Change to this then?
>>>
>>> """
>>>      One unexplained result is with "gem_latency -n 0" (dispatching
>>>      empty batches) which shows 5% more throughput, 8% less CPU time,
>>>      25% better producer and consumer latencies, but 15% higher
>>>      dispatch latency which looks like an amplified effect of test
>>>      overhead.
>>> """
>>
>> No. Dispatch latency is important and this attempts to pass the change
>> off a test effect when to the best of my knowledge it is a valid external
>> observation of the system.
>
> I just don't understand how can it be valid when we have executed more
> empty batches than before in a unit of time.
>
> Because even dispatch + wake up latency is worse, but throughput is
> still better.
>
> Sounds impossible to me so it must be the effect of using two different
> time sources. CPU side to measure throughput and GPU side to measure
> dispatch latency.I don't know, could you suggest a paragraph to add to
> the commit message so we can close on this?

Happy with simply leaving out any attempt at explaining the oddity, like:

"""
One odd result is with "gem_latency -n 0" (dispatching empty batches) 
which shows 5% more throughput, 8% less CPU time, 25% better producer 
and consumer latencies, but 15% higher dispatch latency which is yet 
unexplained.
"""

?

Regards,

Tvrtko
Chris Wilson March 1, 2016, 10:32 a.m. UTC | #10
On Mon, Feb 29, 2016 at 11:59:26AM +0000, Tvrtko Ursulin wrote:
> 
> 
> On 29/02/16 11:48, Chris Wilson wrote:
> >On Mon, Feb 29, 2016 at 11:40:37AM +0000, Tvrtko Ursulin wrote:
> >>
> >>
> >>On 29/02/16 11:13, Chris Wilson wrote:
> >>>On Mon, Feb 29, 2016 at 11:01:49AM +0000, Tvrtko Ursulin wrote:
> >>>>
> >>>>On 29/02/16 10:53, Chris Wilson wrote:
> >>>>>On Mon, Feb 29, 2016 at 10:45:34AM +0000, Tvrtko Ursulin wrote:
> >>>>>>This ok?
> >>>>>>
> >>>>>>"""
> >>>>>>One unexplained result is with "gem_latency -n 0" (dispatching
> >>>>>>empty batches) which shows 5% more throughput, 8% less CPU time,
> >>>>>>25% better producer and consumer latencies, but 15% higher
> >>>>>>dispatch latency which looks like a possible measuring artifact.
> >>>>>>"""
> >>>>>
> >>>>>I doubt it is a measuring artefact since throughput = 1/(dispatch +
> >>>>>latency + test overhead), and the dispatch latency here is larger than
> >>>>>the wakeup latency and so has greater impact on throughput in this
> >>>>>scenario.
> >>>>
> >>>>I don't follow you, if dispatch latency has larger effect on
> >>>>throughput how to explain the increase and still better throughput?
> >>>>
> >>>>I see in gem_latency this block:
> >>>>
> >>>>	measure_latency(p, &p->latency);
> >>>>	igt_stats_push(&p->dispatch, *p->last_timestamp - start);
> >>>>
> >>>>measure_latency waits for the batch to complete and then dispatch
> >>>>latency uses p->last_timestamp which is something written by the GPU
> >>>>and not a CPU view of the latency ?
> >>>
> >>>Exactly, measurements are entirely made from the running engine clock
> >>>(which is ~80ns clock, and should be verified during init). The register
> >>>is read before dispatch, inside the batch and then at wakeup, but the
> >>>information is presented as dispatch = batch - start and
> >>>wakeup = end - batch, so to get the duration (end - start) we need
> >>>to add the two together. Throughput will also include some overhead from
> >>>the test iteration (that will mainly be scheduler interference).
> >>>
> >>>My comment about dispatch having greater effect, is in terms of
> >>>its higher absolute value (so the relative % means a larger change wrt
> >>>throughput).
> >>
> >>Change to this then?
> >>
> >>"""
> >>     One unexplained result is with "gem_latency -n 0" (dispatching
> >>     empty batches) which shows 5% more throughput, 8% less CPU time,
> >>     25% better producer and consumer latencies, but 15% higher
> >>     dispatch latency which looks like an amplified effect of test
> >>     overhead.
> >>"""
> >
> >No. Dispatch latency is important and this attempts to pass the change
> >off a test effect when to the best of my knowledge it is a valid external
> >observation of the system.
> 
> I just don't understand how can it be valid when we have executed
> more empty batches than before in a unit of time.
> 
> Because even dispatch + wake up latency is worse, but throughput is
> still better.
> 
> Sounds impossible to me so it must be the effect of using two
> different time sources. CPU side to measure throughput and GPU side
> to measure dispatch latency.

Sure, but it is not impossible. The actual throughput (measured as the
number of cycles per test run) is a product of the CPU scheduler on top
of everything else. We could measure the test overhead by measuring the
GPU cycles between batches, and the most likely explanation is exactly
which task the isr is interrupting; setting maxcpus=1 would cancel the
variation should that be the case.

> I don't know, could you suggest a
> paragraph to add to the commit message so we can close on this?

On to the next...
-Chris
Chris Wilson March 1, 2016, 10:32 a.m. UTC | #11
On Tue, Mar 01, 2016 at 10:21:45AM +0000, Tvrtko Ursulin wrote:
> 
> 
> On 29/02/16 11:59, Tvrtko Ursulin wrote:
> >
> >
> >On 29/02/16 11:48, Chris Wilson wrote:
> >>On Mon, Feb 29, 2016 at 11:40:37AM +0000, Tvrtko Ursulin wrote:
> >>>
> >>>
> >>>On 29/02/16 11:13, Chris Wilson wrote:
> >>>>On Mon, Feb 29, 2016 at 11:01:49AM +0000, Tvrtko Ursulin wrote:
> >>>>>
> >>>>>On 29/02/16 10:53, Chris Wilson wrote:
> >>>>>>On Mon, Feb 29, 2016 at 10:45:34AM +0000, Tvrtko Ursulin wrote:
> >>>>>>>This ok?
> >>>>>>>
> >>>>>>>"""
> >>>>>>>One unexplained result is with "gem_latency -n 0" (dispatching
> >>>>>>>empty batches) which shows 5% more throughput, 8% less CPU time,
> >>>>>>>25% better producer and consumer latencies, but 15% higher
> >>>>>>>dispatch latency which looks like a possible measuring artifact.
> >>>>>>>"""
> >>>>>>
> >>>>>>I doubt it is a measuring artefact since throughput = 1/(dispatch +
> >>>>>>latency + test overhead), and the dispatch latency here is larger
> >>>>>>than
> >>>>>>the wakeup latency and so has greater impact on throughput in this
> >>>>>>scenario.
> >>>>>
> >>>>>I don't follow you, if dispatch latency has larger effect on
> >>>>>throughput how to explain the increase and still better throughput?
> >>>>>
> >>>>>I see in gem_latency this block:
> >>>>>
> >>>>>    measure_latency(p, &p->latency);
> >>>>>    igt_stats_push(&p->dispatch, *p->last_timestamp - start);
> >>>>>
> >>>>>measure_latency waits for the batch to complete and then dispatch
> >>>>>latency uses p->last_timestamp which is something written by the GPU
> >>>>>and not a CPU view of the latency ?
> >>>>
> >>>>Exactly, measurements are entirely made from the running engine clock
> >>>>(which is ~80ns clock, and should be verified during init). The
> >>>>register
> >>>>is read before dispatch, inside the batch and then at wakeup, but the
> >>>>information is presented as dispatch = batch - start and
> >>>>wakeup = end - batch, so to get the duration (end - start) we need
> >>>>to add the two together. Throughput will also include some overhead
> >>>>from
> >>>>the test iteration (that will mainly be scheduler interference).
> >>>>
> >>>>My comment about dispatch having greater effect, is in terms of
> >>>>its higher absolute value (so the relative % means a larger change wrt
> >>>>throughput).
> >>>
> >>>Change to this then?
> >>>
> >>>"""
> >>>     One unexplained result is with "gem_latency -n 0" (dispatching
> >>>     empty batches) which shows 5% more throughput, 8% less CPU time,
> >>>     25% better producer and consumer latencies, but 15% higher
> >>>     dispatch latency which looks like an amplified effect of test
> >>>     overhead.
> >>>"""
> >>
> >>No. Dispatch latency is important and this attempts to pass the change
> >>off a test effect when to the best of my knowledge it is a valid external
> >>observation of the system.
> >
> >I just don't understand how can it be valid when we have executed more
> >empty batches than before in a unit of time.
> >
> >Because even dispatch + wake up latency is worse, but throughput is
> >still better.
> >
> >Sounds impossible to me so it must be the effect of using two different
> >time sources. CPU side to measure throughput and GPU side to measure
> >dispatch latency.I don't know, could you suggest a paragraph to add to
> >the commit message so we can close on this?
> 
> Happy with simply leaving out any attempts of explaining the oddity like:
> 
> """
> One odd result is with "gem_latency -n 0" (dispatching empty
> batches) which shows 5% more throughput, 8% less CPU time, 25%
> better producer and consumer latencies, but 15% higher dispatch
> latency which is yet unexplained.
> """

Yes!
-Chris
Tvrtko Ursulin March 1, 2016, 10:41 a.m. UTC | #12
On 01/03/16 10:32, Chris Wilson wrote:
> On Tue, Mar 01, 2016 at 10:21:45AM +0000, Tvrtko Ursulin wrote:
>>
>>
>> On 29/02/16 11:59, Tvrtko Ursulin wrote:
>>>
>>>
>>> On 29/02/16 11:48, Chris Wilson wrote:
>>>> On Mon, Feb 29, 2016 at 11:40:37AM +0000, Tvrtko Ursulin wrote:
>>>>>
>>>>>
>>>>> On 29/02/16 11:13, Chris Wilson wrote:
>>>>>> On Mon, Feb 29, 2016 at 11:01:49AM +0000, Tvrtko Ursulin wrote:
>>>>>>>
>>>>>>> On 29/02/16 10:53, Chris Wilson wrote:
>>>>>>>> On Mon, Feb 29, 2016 at 10:45:34AM +0000, Tvrtko Ursulin wrote:
>>>>>>>>> This ok?
>>>>>>>>>
>>>>>>>>> """
>>>>>>>>> One unexplained result is with "gem_latency -n 0" (dispatching
>>>>>>>>> empty batches) which shows 5% more throughput, 8% less CPU time,
>>>>>>>>> 25% better producer and consumer latencies, but 15% higher
>>>>>>>>> dispatch latency which looks like a possible measuring artifact.
>>>>>>>>> """
>>>>>>>>
>>>>>>>> I doubt it is a measuring artefact since throughput = 1/(dispatch +
>>>>>>>> latency + test overhead), and the dispatch latency here is larger
>>>>>>>> than
>>>>>>>> the wakeup latency and so has greater impact on throughput in this
>>>>>>>> scenario.
>>>>>>>
>>>>>>> I don't follow you, if dispatch latency has larger effect on
>>>>>>> throughput how to explain the increase and still better throughput?
>>>>>>>
>>>>>>> I see in gem_latency this block:
>>>>>>>
>>>>>>>     measure_latency(p, &p->latency);
>>>>>>>     igt_stats_push(&p->dispatch, *p->last_timestamp - start);
>>>>>>>
>>>>>>> measure_latency waits for the batch to complete and then dispatch
>>>>>>> latency uses p->last_timestamp which is something written by the GPU
>>>>>>> and not a CPU view of the latency ?
>>>>>>
>>>>>> Exactly, measurements are entirely made from the running engine clock
>>>>>> (which is ~80ns clock, and should be verified during init). The
>>>>>> register
>>>>>> is read before dispatch, inside the batch and then at wakeup, but the
>>>>>> information is presented as dispatch = batch - start and
>>>>>> wakeup = end - batch, so to get the duration (end - start) we need
>>>>>> to add the two together. Throughput will also include some overhead
>>>>>> from
>>>>>> the test iteration (that will mainly be scheduler interference).
>>>>>>
>>>>>> My comment about dispatch having greater effect, is in terms of
>>>>>> its higher absolute value (so the relative % means a larger change wrt
>>>>>> throughput).
>>>>>
>>>>> Change to this then?
>>>>>
>>>>> """
>>>>>      One unexplained result is with "gem_latency -n 0" (dispatching
>>>>>      empty batches) which shows 5% more throughput, 8% less CPU time,
>>>>>      25% better producer and consumer latencies, but 15% higher
>>>>>      dispatch latency which looks like an amplified effect of test
>>>>>      overhead.
>>>>> """
>>>>
>>>> No. Dispatch latency is important and this attempts to pass the change
>>>> off a test effect when to the best of my knowledge it is a valid external
>>>> observation of the system.
>>>
>>> I just don't understand how can it be valid when we have executed more
>>> empty batches than before in a unit of time.
>>>
>>> Because even dispatch + wake up latency is worse, but throughput is
>>> still better.
>>>
>>> Sounds impossible to me so it must be the effect of using two different
>>> time sources. CPU side to measure throughput and GPU side to measure
>>> dispatch latency.I don't know, could you suggest a paragraph to add to
>>> the commit message so we can close on this?
>>
>> Happy with simply leaving out any attempts of explaining the oddity like:
>>
>> """
>> One odd result is with "gem_latency -n 0" (dispatching empty
>> batches) which shows 5% more throughput, 8% less CPU time, 25%
>> better producer and consumer latencies, but 15% higher dispatch
>> latency which is yet unexplained.
>> """
>
> Yes!

Thanks! Patch merged.

I'll try reading the CSB outside the execlist lock (into a temporary 
buffer) to see if that helps any.

What about your patch to move it all to a bottom-half handler? Are we 
going to progress that one?

Regards,

Tvrtko
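
Purely as an illustration of the idea mentioned above (and not code from
this series), snapshotting the CSB entries into a temporary buffer before
taking the execlist lock could look roughly like the sketch below,
reusing the register helpers from the patch; the function name is
hypothetical and forcewake is assumed to be held around the _FW reads.

static void csb_snapshot_idea(struct intel_engine_cs *ring)
{
        u32 buf[GEN8_CSB_ENTRIES][2];
        unsigned int head, tail, count = 0, i;

        /* Copy fresh CSB entries out of MMIO without execlist_lock held. */
        head = ring->next_context_status_buffer;
        tail = GEN8_CSB_WRITE_PTR(I915_READ_FW(RING_CONTEXT_STATUS_PTR(ring)));
        if (head > tail)
                tail += GEN8_CSB_ENTRIES;

        while (head++ < tail) {
                unsigned int idx = head % GEN8_CSB_ENTRIES;

                buf[count][0] = I915_READ_FW(RING_CONTEXT_STATUS_BUF_LO(ring, idx));
                buf[count][1] = I915_READ_FW(RING_CONTEXT_STATUS_BUF_HI(ring, idx));
                count++;
        }

        /* Only now take the lock, to act on the local copies. */
        spin_lock(&ring->execlist_lock);
        for (i = 0; i < count; i++) {
                /* buf[i][0] is the status dword, buf[i][1] the context id;
                 * process them as intel_lrc_irq_handler() does today. */
        }
        spin_unlock(&ring->execlist_lock);
}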
Chris Wilson March 1, 2016, 10:57 a.m. UTC | #13
On Tue, Mar 01, 2016 at 10:41:26AM +0000, Tvrtko Ursulin wrote:
> What about your patch to move it all to a bottom handler? Are we
> going to progress that one?

I looked for the regressions and found an interesting slowdown in single nop
batches (almost 2x as slow). Another one that should not happen, as along
that idle-execute-sync path the code should be the same (the elsp should
be written directly by the submitter in both cases). CPU time looks
pretty much identical between the two, so I'm left questioning the
scheduler and the impact of using another kthread. The effect is also
only apparent on bsw, not big core, perhaps more indication of a
context switch versus CPU cache? I also found another way to lock up the
interrupt handler, thanks to its while(1) looping over GT + display
interrupts.
-Chris

Patch

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index f0a57afc8dff..247daf80664b 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -269,6 +269,9 @@  logical_ring_init_platform_invariants(struct intel_engine_cs *ring)
 {
 	struct drm_device *dev = ring->dev;
 
+	if (IS_GEN8(dev) || IS_GEN9(dev))
+		ring->idle_lite_restore_wa = ~0;
+
 	ring->disable_lite_restore_wa = (IS_SKL_REVID(dev, 0, SKL_REVID_B0) ||
 					IS_BXT_REVID(dev, 0, BXT_REVID_A1)) &&
 					(ring->id == VCS || ring->id == VCS2);
@@ -372,8 +375,6 @@  static void execlists_elsp_write(struct drm_i915_gem_request *rq0,
 	rq0->elsp_submitted++;
 
 	/* You must always write both descriptors in the order below. */
-	spin_lock(&dev_priv->uncore.lock);
-	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
 	I915_WRITE_FW(RING_ELSP(ring), upper_32_bits(desc[1]));
 	I915_WRITE_FW(RING_ELSP(ring), lower_32_bits(desc[1]));
 
@@ -383,11 +384,18 @@  static void execlists_elsp_write(struct drm_i915_gem_request *rq0,
 
 	/* ELSP is a wo register, use another nearby reg for posting */
 	POSTING_READ_FW(RING_EXECLIST_STATUS_LO(ring));
-	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
-	spin_unlock(&dev_priv->uncore.lock);
 }
 
-static int execlists_update_context(struct drm_i915_gem_request *rq)
+static void
+execlists_update_context_pdps(struct i915_hw_ppgtt *ppgtt, u32 *reg_state)
+{
+	ASSIGN_CTX_PDP(ppgtt, reg_state, 3);
+	ASSIGN_CTX_PDP(ppgtt, reg_state, 2);
+	ASSIGN_CTX_PDP(ppgtt, reg_state, 1);
+	ASSIGN_CTX_PDP(ppgtt, reg_state, 0);
+}
+
+static void execlists_update_context(struct drm_i915_gem_request *rq)
 {
 	struct intel_engine_cs *ring = rq->ring;
 	struct i915_hw_ppgtt *ppgtt = rq->ctx->ppgtt;
@@ -395,19 +403,13 @@  static int execlists_update_context(struct drm_i915_gem_request *rq)
 
 	reg_state[CTX_RING_TAIL+1] = rq->tail;
 
-	if (ppgtt && !USES_FULL_48BIT_PPGTT(ppgtt->base.dev)) {
-		/* True 32b PPGTT with dynamic page allocation: update PDP
-		 * registers and point the unallocated PDPs to scratch page.
-		 * PML4 is allocated during ppgtt init, so this is not needed
-		 * in 48-bit mode.
-		 */
-		ASSIGN_CTX_PDP(ppgtt, reg_state, 3);
-		ASSIGN_CTX_PDP(ppgtt, reg_state, 2);
-		ASSIGN_CTX_PDP(ppgtt, reg_state, 1);
-		ASSIGN_CTX_PDP(ppgtt, reg_state, 0);
-	}
-
-	return 0;
+	/* True 32b PPGTT with dynamic page allocation: update PDP
+	 * registers and point the unallocated PDPs to scratch page.
+	 * PML4 is allocated during ppgtt init, so this is not needed
+	 * in 48-bit mode.
+	 */
+	if (ppgtt && !USES_FULL_48BIT_PPGTT(ppgtt->base.dev))
+		execlists_update_context_pdps(ppgtt, reg_state);
 }
 
 static void execlists_submit_requests(struct drm_i915_gem_request *rq0,
@@ -421,10 +423,10 @@  static void execlists_submit_requests(struct drm_i915_gem_request *rq0,
 	execlists_elsp_write(rq0, rq1);
 }
 
-static void execlists_context_unqueue(struct intel_engine_cs *ring)
+static void execlists_context_unqueue__locked(struct intel_engine_cs *ring)
 {
 	struct drm_i915_gem_request *req0 = NULL, *req1 = NULL;
-	struct drm_i915_gem_request *cursor = NULL, *tmp = NULL;
+	struct drm_i915_gem_request *cursor, *tmp;
 
 	assert_spin_locked(&ring->execlist_lock);
 
@@ -434,9 +436,6 @@  static void execlists_context_unqueue(struct intel_engine_cs *ring)
 	 */
 	WARN_ON(!intel_irqs_enabled(ring->dev->dev_private));
 
-	if (list_empty(&ring->execlist_queue))
-		return;
-
 	/* Try to read in pairs */
 	list_for_each_entry_safe(cursor, tmp, &ring->execlist_queue,
 				 execlist_link) {
@@ -451,37 +450,48 @@  static void execlists_context_unqueue(struct intel_engine_cs *ring)
 			req0 = cursor;
 		} else {
 			req1 = cursor;
+			WARN_ON(req1->elsp_submitted);
 			break;
 		}
 	}
 
-	if (IS_GEN8(ring->dev) || IS_GEN9(ring->dev)) {
+	if (unlikely(!req0))
+		return;
+
+	if (req0->elsp_submitted & ring->idle_lite_restore_wa) {
 		/*
-		 * WaIdleLiteRestore: make sure we never cause a lite
-		 * restore with HEAD==TAIL
+		 * WaIdleLiteRestore: make sure we never cause a lite restore
+		 * with HEAD==TAIL.
+		 *
+		 * Apply the wa NOOPS to prevent ring:HEAD == req:TAIL as we
+		 * resubmit the request. See gen8_emit_request() for where we
+		 * prepare the padding after the end of the request.
 		 */
-		if (req0->elsp_submitted) {
-			/*
-			 * Apply the wa NOOPS to prevent ring:HEAD == req:TAIL
-			 * as we resubmit the request. See gen8_emit_request()
-			 * for where we prepare the padding after the end of the
-			 * request.
-			 */
-			struct intel_ringbuffer *ringbuf;
+		struct intel_ringbuffer *ringbuf;
 
-			ringbuf = req0->ctx->engine[ring->id].ringbuf;
-			req0->tail += 8;
-			req0->tail &= ringbuf->size - 1;
-		}
+		ringbuf = req0->ctx->engine[ring->id].ringbuf;
+		req0->tail += 8;
+		req0->tail &= ringbuf->size - 1;
 	}
 
-	WARN_ON(req1 && req1->elsp_submitted);
-
 	execlists_submit_requests(req0, req1);
 }
 
-static bool execlists_check_remove_request(struct intel_engine_cs *ring,
-					   u32 request_id)
+static void execlists_context_unqueue(struct intel_engine_cs *ring)
+{
+	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+
+	spin_lock(&dev_priv->uncore.lock);
+	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
+
+	execlists_context_unqueue__locked(ring);
+
+	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
+	spin_unlock(&dev_priv->uncore.lock);
+}
+
+static unsigned int
+execlists_check_remove_request(struct intel_engine_cs *ring, u32 request_id)
 {
 	struct drm_i915_gem_request *head_req;
 
@@ -491,33 +501,41 @@  static bool execlists_check_remove_request(struct intel_engine_cs *ring,
 					    struct drm_i915_gem_request,
 					    execlist_link);
 
-	if (head_req != NULL) {
-		if (intel_execlists_ctx_id(head_req->ctx, ring) == request_id) {
-			WARN(head_req->elsp_submitted == 0,
-			     "Never submitted head request\n");
+	if (!head_req)
+		return 0;
 
-			if (--head_req->elsp_submitted <= 0) {
-				list_move_tail(&head_req->execlist_link,
-					       &ring->execlist_retired_req_list);
-				return true;
-			}
-		}
-	}
+	if (unlikely(intel_execlists_ctx_id(head_req->ctx, ring) != request_id))
+		return 0;
+
+	WARN(head_req->elsp_submitted == 0, "Never submitted head request\n");
+
+	if (--head_req->elsp_submitted > 0)
+		return 0;
+
+	list_move_tail(&head_req->execlist_link,
+		       &ring->execlist_retired_req_list);
 
-	return false;
+	return 1;
 }
 
-static void get_context_status(struct intel_engine_cs *ring,
-			       u8 read_pointer,
-			       u32 *status, u32 *context_id)
+static u32
+get_context_status(struct intel_engine_cs *ring, unsigned int read_pointer,
+		   u32 *context_id)
 {
 	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+	u32 status;
 
-	if (WARN_ON(read_pointer >= GEN8_CSB_ENTRIES))
-		return;
+	read_pointer %= GEN8_CSB_ENTRIES;
+
+	status = I915_READ_FW(RING_CONTEXT_STATUS_BUF_LO(ring, read_pointer));
+
+	if (status & GEN8_CTX_STATUS_IDLE_ACTIVE)
+		return 0;
 
-	*status = I915_READ(RING_CONTEXT_STATUS_BUF_LO(ring, read_pointer));
-	*context_id = I915_READ(RING_CONTEXT_STATUS_BUF_HI(ring, read_pointer));
+	*context_id = I915_READ_FW(RING_CONTEXT_STATUS_BUF_HI(ring,
+							      read_pointer));
+
+	return status;
 }
 
 /**
@@ -531,30 +549,27 @@  void intel_lrc_irq_handler(struct intel_engine_cs *ring)
 {
 	struct drm_i915_private *dev_priv = ring->dev->dev_private;
 	u32 status_pointer;
-	u8 read_pointer;
-	u8 write_pointer;
+	unsigned int read_pointer, write_pointer;
 	u32 status = 0;
 	u32 status_id;
-	u32 submit_contexts = 0;
+	unsigned int submit_contexts = 0;
+
+	spin_lock(&ring->execlist_lock);
 
-	status_pointer = I915_READ(RING_CONTEXT_STATUS_PTR(ring));
+	spin_lock(&dev_priv->uncore.lock);
+	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
+
+	status_pointer = I915_READ_FW(RING_CONTEXT_STATUS_PTR(ring));
 
 	read_pointer = ring->next_context_status_buffer;
 	write_pointer = GEN8_CSB_WRITE_PTR(status_pointer);
 	if (read_pointer > write_pointer)
 		write_pointer += GEN8_CSB_ENTRIES;
 
-	spin_lock(&ring->execlist_lock);
-
 	while (read_pointer < write_pointer) {
+		status = get_context_status(ring, ++read_pointer, &status_id);
 
-		get_context_status(ring, ++read_pointer % GEN8_CSB_ENTRIES,
-				   &status, &status_id);
-
-		if (status & GEN8_CTX_STATUS_IDLE_ACTIVE)
-			continue;
-
-		if (status & GEN8_CTX_STATUS_PREEMPTED) {
+		if (unlikely(status & GEN8_CTX_STATUS_PREEMPTED)) {
 			if (status & GEN8_CTX_STATUS_LITE_RESTORE) {
 				if (execlists_check_remove_request(ring, status_id))
 					WARN(1, "Lite Restored request removed from queue\n");
@@ -562,37 +577,36 @@  void intel_lrc_irq_handler(struct intel_engine_cs *ring)
 				WARN(1, "Preemption without Lite Restore\n");
 		}
 
-		if ((status & GEN8_CTX_STATUS_ACTIVE_IDLE) ||
-		    (status & GEN8_CTX_STATUS_ELEMENT_SWITCH)) {
-			if (execlists_check_remove_request(ring, status_id))
-				submit_contexts++;
-		}
+		if (status & (GEN8_CTX_STATUS_ACTIVE_IDLE |
+		    GEN8_CTX_STATUS_ELEMENT_SWITCH))
+			submit_contexts +=
+				execlists_check_remove_request(ring, status_id);
 	}
 
-	if (ring->disable_lite_restore_wa) {
-		/* Prevent a ctx to preempt itself */
-		if ((status & GEN8_CTX_STATUS_ACTIVE_IDLE) &&
-		    (submit_contexts != 0))
-			execlists_context_unqueue(ring);
-	} else if (submit_contexts != 0) {
-		execlists_context_unqueue(ring);
+	if (submit_contexts) {
+		if (!ring->disable_lite_restore_wa ||
+		    (status & GEN8_CTX_STATUS_ACTIVE_IDLE))
+			execlists_context_unqueue__locked(ring);
 	}
 
-	spin_unlock(&ring->execlist_lock);
-
-	if (unlikely(submit_contexts > 2))
-		DRM_ERROR("More than two context complete events?\n");
-
 	ring->next_context_status_buffer = write_pointer % GEN8_CSB_ENTRIES;
 
 	/* Update the read pointer to the old write pointer. Manual ringbuffer
 	 * management ftw </sarcasm> */
-	I915_WRITE(RING_CONTEXT_STATUS_PTR(ring),
-		   _MASKED_FIELD(GEN8_CSB_READ_PTR_MASK,
-				 ring->next_context_status_buffer << 8));
+	I915_WRITE_FW(RING_CONTEXT_STATUS_PTR(ring),
+		      _MASKED_FIELD(GEN8_CSB_READ_PTR_MASK,
+				    ring->next_context_status_buffer << 8));
+
+	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
+	spin_unlock(&dev_priv->uncore.lock);
+
+	spin_unlock(&ring->execlist_lock);
+
+	if (unlikely(submit_contexts > 2))
+		DRM_ERROR("More than two context complete events?\n");
 }
 
-static int execlists_context_queue(struct drm_i915_gem_request *request)
+static void execlists_context_queue(struct drm_i915_gem_request *request)
 {
 	struct intel_engine_cs *ring = request->ring;
 	struct drm_i915_gem_request *cursor;
@@ -629,8 +643,6 @@  static int execlists_context_queue(struct drm_i915_gem_request *request)
 		execlists_context_unqueue(ring);
 
 	spin_unlock_irq(&ring->execlist_lock);
-
-	return 0;
 }
 
 static int logical_ring_invalidate_all_caches(struct drm_i915_gem_request *req)
@@ -1549,7 +1561,7 @@  static int gen8_init_common_ring(struct intel_engine_cs *ring)
 {
 	struct drm_device *dev = ring->dev;
 	struct drm_i915_private *dev_priv = dev->dev_private;
-	u8 next_context_status_buffer_hw;
+	unsigned int next_context_status_buffer_hw;
 
 	lrc_setup_hardware_status_page(ring,
 				dev_priv->kernel_context->engine[ring->id].state);
@@ -2012,6 +2024,7 @@  void intel_logical_ring_cleanup(struct intel_engine_cs *ring)
 		ring->status_page.obj = NULL;
 	}
 
+	ring->idle_lite_restore_wa = 0;
 	ring->disable_lite_restore_wa = false;
 	ring->ctx_desc_template = 0;
 
@@ -2416,10 +2429,7 @@  populate_lr_context(struct intel_context *ctx, struct drm_i915_gem_object *ctx_o
 		 * With dynamic page allocation, PDPs may not be allocated at
 		 * this point. Point the unallocated PDPs to the scratch page
 		 */
-		ASSIGN_CTX_PDP(ppgtt, reg_state, 3);
-		ASSIGN_CTX_PDP(ppgtt, reg_state, 2);
-		ASSIGN_CTX_PDP(ppgtt, reg_state, 1);
-		ASSIGN_CTX_PDP(ppgtt, reg_state, 0);
+		execlists_update_context_pdps(ppgtt, reg_state);
 	}
 
 	if (ring->id == RCS) {
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 566b0ae10ce0..dd910d30a380 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -271,7 +271,8 @@  struct  intel_engine_cs {
 	spinlock_t execlist_lock;
 	struct list_head execlist_queue;
 	struct list_head execlist_retired_req_list;
-	u8 next_context_status_buffer;
+	unsigned int next_context_status_buffer;
+	unsigned int idle_lite_restore_wa;
 	bool disable_lite_restore_wa;
 	u32 ctx_desc_template;
 	u32             irq_keep_mask; /* bitmask for interrupts that should not be masked */