
[2/2] drm/i915/guc: default to using GuC submission where possible

Message ID 1461349375-5394-2-git-send-email-david.s.gordon@intel.com (mailing list archive)
State New, archived

Commit Message

Dave Gordon April 22, 2016, 6:22 p.m. UTC
This patch simply changes the default value of "enable_guc_submission"
from 0 (never) to -1 (auto). This means that GuC submission will be
used if the platform has a GuC, the GuC supports the request submission
protocol, and any required GuC firmware was successfully loaded. If any
of these conditions are not met, the driver will fall back to using
execlist mode.
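
For reference, resolving the "auto" value amounts to logic along these
lines (an illustrative sketch only; the helper names below are
placeholders, not the driver's actual functions):

/*
 * Sketch of how the -1 ("auto") default could be resolved at load time.
 * HAS_GUC() and intel_guc_fw_loaded() are placeholder names.
 */
static void sanitize_guc_submission_option(struct drm_i915_private *dev_priv)
{
        /* auto: enable only where GuC submission can actually work */
        if (i915.enable_guc_submission < 0)
                i915.enable_guc_submission = HAS_GUC(dev_priv);

        /* submission also requires the firmware to have loaded */
        if (i915.enable_guc_submission && !intel_guc_fw_loaded(dev_priv))
                i915.enable_guc_submission = 0; /* fall back to execlists */
}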

Signed-off-by: Dave Gordon <david.s.gordon@intel.com>
---
 drivers/gpu/drm/i915/i915_params.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Chris Wilson April 22, 2016, 6:45 p.m. UTC | #1
On Fri, Apr 22, 2016 at 07:22:55PM +0100, Dave Gordon wrote:
> This patch simply changes the default value of "enable_guc_submission"
> from 0 (never) to -1 (auto). This means that GuC submission will be
> used if the platform has a GuC, the GuC supports the request submission
> protocol, and any required GuC firmware was successfully loaded. If any
> of these conditions are not met, the driver will fall back to using
> execlist mode.

There are several shortcomings yet, in particular you've decided to add
new ABI...

i915_guc_wq_check_space(): Do not return ETIMEDOUT. This function's
return code is ABI.

Your choices are either EAGAIN if you think the hardware will catch up, or
more likely EIO and reset the hardware.

guc_add_workqueue_item: cannot fail

It must be fixed such that it is a void function without possibility of
failure. Not that you even successfully hooked up the failure paths in
the current code.

Same for guc_ring_doorbell.

And what exactly is that atomic64_cmpxchg() serialising with? There are
no other CPUs contending with the write, and neither does the GuC (and I
doubt it is taking any notice of the lock cmpxchg). Using cmpxchg where
a single WRITE_ONCE() of a 32bit value wins the perf prize for hottest
instruction and function in the kernel.
-Chris
Chris Wilson April 22, 2016, 6:51 p.m. UTC | #2
On Fri, Apr 22, 2016 at 07:45:15PM +0100, Chris Wilson wrote:
> On Fri, Apr 22, 2016 at 07:22:55PM +0100, Dave Gordon wrote:
> > This patch simply changes the default value of "enable_guc_submission"
> > from 0 (never) to -1 (auto). This means that GuC submission will be
> > used if the platform has a GuC, the GuC supports the request submission
> > protocol, and any required GuC firmware was successfully loaded. If any
> > of these conditions are not met, the driver will fall back to using
> > execlist mode.

I just remembered something else.

 * Work Items:
 * There are several types of work items that the host may place into a
 * workqueue, each with its own requirements and limitations. Currently only
 * WQ_TYPE_INORDER is needed to support legacy submission via GuC, which
 * represents in-order queue. The kernel driver packs ring tail pointer and an
 * ELSP context descriptor dword into Work Item.

Is this right? You only allocate a single client covering all engines and
specify INORDER. We expect parallel execution between engines, is this
supported? Empirically it seems like guc is only executing commands in
series across engines and not in parallel.
-Chris
Dave Gordon April 25, 2016, 7:31 a.m. UTC | #3
On 22/04/16 19:51, Chris Wilson wrote:
> On Fri, Apr 22, 2016 at 07:45:15PM +0100, Chris Wilson wrote:
>> On Fri, Apr 22, 2016 at 07:22:55PM +0100, Dave Gordon wrote:
>>> This patch simply changes the default value of "enable_guc_submission"
>>> from 0 (never) to -1 (auto). This means that GuC submission will be
>>> used if the platform has a GuC, the GuC supports the request submission
>>> protocol, and any required GuC firmware was successfully loaded. If any
>>> of these conditions are not met, the driver will fall back to using
>>> execlist mode.
>
> I just remembered something else.
>
>   * Work Items:
>   * There are several types of work items that the host may place into a
>   * workqueue, each with its own requirements and limitations. Currently only
>   * WQ_TYPE_INORDER is needed to support legacy submission via GuC, which
>   * represents in-order queue. The kernel driver packs ring tail pointer and an
>   * ELSP context descriptor dword into Work Item.
>
> Is this right? You only allocate a single client covering all engines and
> specify INORDER. We expect parallel execution between engines, is this
> supported? Empirically it seems like guc is only executing commands in
> series across engines and not in parallel.
> -Chris

AFAIK, INORDER represents in-order executions of elements in the GuC's 
(internal) submission queue, which is per-engine; i.e. this option 
bypasses the GuC's internal scheduling algorithms and makes the GuC 
behave as a simple dispatcher. It demultiplexes work queue items into 
the multiple submission queues, then executes them in order from there.

Alex can probably confirm this in the GuC code, but I really think we'd 
have noticed if execution were serialised across engines. For a start, 
the validation tests that have one engine busy-spin while waiting for a 
batch on a different engine to update a buffer wouldn't ever finish.

For other reasons, however, John & I are planning to test a 
one-client-per-engine configuration for use by the GPU scheduler.

.Dave.
Chris Wilson April 25, 2016, 8:29 a.m. UTC | #4
On Mon, Apr 25, 2016 at 08:31:07AM +0100, Dave Gordon wrote:
> On 22/04/16 19:51, Chris Wilson wrote:
> >On Fri, Apr 22, 2016 at 07:45:15PM +0100, Chris Wilson wrote:
> >>On Fri, Apr 22, 2016 at 07:22:55PM +0100, Dave Gordon wrote:
> >>>This patch simply changes the default value of "enable_guc_submission"
> >>>from 0 (never) to -1 (auto). This means that GuC submission will be
> >>>used if the platform has a GuC, the GuC supports the request submission
> >>>protocol, and any required GuC firmware was successfully loaded. If any
> >>>of these conditions are not met, the driver will fall back to using
> >>>execlist mode.
> >
> >I just remembered something else.
> >
> >  * Work Items:
> >  * There are several types of work items that the host may place into a
> >  * workqueue, each with its own requirements and limitations. Currently only
> >  * WQ_TYPE_INORDER is needed to support legacy submission via GuC, which
> >  * represents in-order queue. The kernel driver packs ring tail pointer and an
> >  * ELSP context descriptor dword into Work Item.
> >
> >Is this right? You only allocate a single client covering all engines and
> >specify INORDER. We expect parallel execution between engines, is this
> >supported? Empirically it seems like guc is only executing commands in
> >series across engines and not in parallel.
> >-Chris
> 
> AFAIK, INORDER represents in-order executions of elements in the
> GuC's (internal) submission queue, which is per-engine; i.e. this
> option bypasses the GuC's internal scheduling algorithms and makes
> the GuC behave as a simple dispatcher. It demultiplexes work queue
> items into the multiple submission queues, then executes them in
> order from there.
> 
> Alex can probably confirm this in the GuC code, but I really think
> we'd have noticed if execution were serialised across engines. For a
> start, the validation tests that have one engine busy-spin while
> waiting for a batch on a different engine to update a buffer
> wouldn't ever finish.

That doesn't seem to be the issue, we can run in parallel it seems
(busy-spin on one engine doesn't prevent a write on the second). It's 
just the latency it seems. Overall the execution latency goes up
substantially with guc, and in this case it does not seem to be executing
the second execbuf on the second ring until after the first completes.
-Chris
Dave Gordon April 25, 2016, 10:07 a.m. UTC | #5
On 22/04/16 19:45, Chris Wilson wrote:
> On Fri, Apr 22, 2016 at 07:22:55PM +0100, Dave Gordon wrote:
>> This patch simply changes the default value of "enable_guc_submission"
>> from 0 (never) to -1 (auto). This means that GuC submission will be
>> used if the platform has a GuC, the GuC supports the request submission
>> protocol, and any required GuC firmware was successfully loaded. If any
>> of these conditions are not met, the driver will fall back to using
>> execlist mode.
>
> There are several shortcomings yet, in particular you've decided to add
> new ABI...
>
> i915_guc_wq_check_space(): Do not return ETIMEDOUT. This function's
> return code is ABI.
>
> Your choices are either EAGAIN if you think the hardware will catch up, or
> more likely EIO and reset the hardware.

This layer doesn't have enough information to determine that we need a 
reset, so it'll have to be EAGAIN. TDR should catch the case where we 
continue to be unable to submit for an extended period, for whatever reason.

I don't like the spinwait anyway, so maybe we should just return the 
immediate result of sampling the WQ space once, and let the upper layers 
deal with how or whether to wait-and-retry.
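
Something like this, say (a sketch only -- the field names and the
CIRC_SPACE() accounting are illustrative rather than the exact code):

/* Sample the work queue once (illustrative field names); no spinning,
 * the caller decides the retry policy. */
static int i915_guc_wq_check_space_once(struct i915_guc_client *gc,
                                        u32 wqi_size)
{
        struct guc_process_desc *desc = gc->client_base + gc->proc_desc_offset;
        u32 freespace = CIRC_SPACE(gc->wq_tail, desc->head, gc->wq_size);

        return freespace >= wqi_size ? 0 : -EAGAIN;
}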

> guc_add_workqueue_item: cannot fail
>
> It must be fixed such that it is a void function without possibility of
> failure.

The code clearly indicates that this "shouldn't happen", and yes, it 
would indicate an inconsistency in the internal state of the driver if 
the WARN-and-return-error path were taken.

However, the reason for including such a check is because we're crossing 
between logical submodules here. The GuC submission code requires that 
the LRC code (which is NOT GuC-specific) has previously called the 
precheck-for-space function and got the go-ahead signal. So this is 
essentially checking that the internal protocol has been followed 
correctly. If the LRC code were updated such that it might miss the 
precheck-for-space in some paths, this WARN-and-fail check would help 
identify the protocol error.

I can change it to a BUG() if you prefer!
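
To illustrate the protocol (a sketch only; helper and field names here
are made up for clarity, not the exact code):

/*
 * The LRC layer must have called the space precheck and received 0
 * before we get here; with that guaranteed, adding the work item
 * cannot legitimately fail and the function could even become void.
 */
static void guc_add_workqueue_item(struct i915_guc_client *gc,
                                   struct drm_i915_gem_request *rq)
{
        struct guc_wq_item *wqi;

        /* Failing here means the precheck was skipped: a driver bug */
        WARN_ON(!guc_wq_has_space(gc, sizeof(*wqi)));   /* or BUG_ON() */

        wqi = guc_next_wq_slot(gc);                     /* hypothetical helper */

        /* pack the ring tail and context descriptor, per the WQ format */
        wqi->header = WQ_TYPE_INORDER;  /* plus length/flags in the real layout */
        wqi->context_desc = lower_32_bits(lrc_descriptor(rq)); /* hypothetical helper */
        wqi->ring_tail = rq->tail;
}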

> Not that you even successfully hooked up the failure paths in
> the current code.

The current code is broken at least because the whole callchain is 
declared as returning an int error code, only to find that the topmost 
level says that add_request/emit_request are not allowed to fail!

It would be rather better if the top level handled all errors in 
submission, even if it were only by dropping the request and resetting 
the engine!

> Same for guc_ring_doorbell.

The rationale for an error return is slightly different here, as this 
would not indicate an internal logic error in the driver, but rather a 
failure of communication between the driver and the doorbell hardware 
(or possibly the GuC firmware, but I don't think a GuC bug could cause 
more than one failure-to-update, hence the retry count is only 2).

If the doorbell hardware is not working correctly, there is nothing we 
can do about it here; the only option is to report the failure up to 
some layer that can recover, probably by resetting the GPU.

> And what exactly is that atomic64_cmpxchg() serialising with? There are
> no other CPUs contending with the write, and neither does the GuC (and I
> doubt it is taking any notice of the lock cmpxchg). Using cmpxchg where
> a single WRITE_ONCE() of a 32bit value wins the perf prize for hottest
> instruction and function in the kernel.
> -Chris

The doorbell controller hardware, I should think. The BSpec describes 
using LOCK_CMPXCHG8B to update doorbells, so I think this code is just 
based on what it says there. If the CPU hardware doesn't implement it 
efficiently, surely the GPU h/w designers wouldn't have mandated it in 
this way?
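
For reference, the sequence in question is essentially this (a sketch;
the cookie layout and GUC_DOORBELL_ENABLED handling are illustrative):

/* Ring the doorbell with a single locked cmpxchg on the 8-byte cookie */
static int guc_ring_doorbell_sketch(struct i915_guc_client *gc)
{
        struct guc_doorbell_info *db = gc->client_base + gc->doorbell_offset;
        union {
                struct {
                        u32 db_status;
                        u32 cookie;
                };              /* illustrative cookie layout */
                u64 value;
        } old, new;
        int attempt = 2;        /* allow one retry, as discussed above */

        while (attempt--) {
                old.db_status = db->db_status;
                old.cookie = db->cookie;
                new.db_status = GUC_DOORBELL_ENABLED;
                new.cookie = old.cookie + 1;

                /* the LOCK CMPXCHG8B write the doorbell hardware snoops */
                if (atomic64_cmpxchg((atomic64_t *)db, old.value, new.value)
                    == old.value)
                        return 0;
        }

        return -EIO;    /* or report success and rely on a later GPU reset */
}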

Maybe Alex or Tom know more about the CPU<->doorbell<->GuC signalling.

.Dave.
Chris Wilson April 25, 2016, 10:39 a.m. UTC | #6
On Mon, Apr 25, 2016 at 11:07:13AM +0100, Dave Gordon wrote:
> On 22/04/16 19:45, Chris Wilson wrote:
> >On Fri, Apr 22, 2016 at 07:22:55PM +0100, Dave Gordon wrote:
> >>This patch simply changes the default value of "enable_guc_submission"
> >>from 0 (never) to -1 (auto). This means that GuC submission will be
> >>used if the platform has a GuC, the GuC supports the request submission
> >>protocol, and any required GuC firmware was successfully loaded. If any
> >>of these conditions are not met, the driver will fall back to using
> >>execlist mode.
> >
> >There are several shortcomings yet, in particular you've decided to add
> >new ABI...
> >
> >i915_guc_wq_check_space(): Do not return ETIMEDOUT. This function's
> >return code is ABI.
> >
> >Your choices are either EAGAIN if you think the hardware will catch up, or
> >more likely EIO and reset the hardware.
> 
> This layer doesn't have enough information to determine that we need
> a reset, so it'll have to be EAGAIN. TDR should catch the case where
> we continue to be unable to submit for an extended period, for
> whatever reason.

You don't? This is an interface to the hardware; if you don't know whether
the GuC is responding, who does? The other layers only look at whether we're
advancing through the requests (either breadcrumbs or timer ticks).

If you return EAGAIN, we busyspin through the entire execbuffer. That's
acceptable - as we will catch the reset at some point. If we busyspin
here, we have to hook into the reset to allow it to steal the
struct_mutex.

> I don't like the spinwait anyway, so maybe we should just return the
> immediate result of sampling the WQ space once, and let the upper
> layers deal with how or whether to wait-and-retry.

I would have it busyspin for a few microseconds. None of the upper
layers have any more clue than you here, though one option is to keep
the list of requests and wait for the oldest to expire - and we should
be able to find that from the engine, I forget if there is a 1:1
correspondence between engine and workqueue though.
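
i.e. something along these lines (a sketch; the 10us bound and the
helper names are arbitrary):

/* Spin briefly for WQ space, then give up and let the caller cope */
static int guc_wq_wait_for_space(struct i915_guc_client *gc, u32 wqi_size)
{
        ktime_t timeout = ktime_add_us(ktime_get_raw(), 10);

        do {
                if (guc_wq_space(gc) >= wqi_size)       /* illustrative helper */
                        return 0;

                cpu_relax();
        } while (!ktime_after(ktime_get_raw(), timeout));

        return -EAGAIN;
}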

> >guc_add_workqueue_item: cannot fail
> >
> >It must be fixed such that it is a void function without possibility of
> >failure.
> 
> The code clearly indicates that this "shouldn't happen", and yes, it
> would indicate an inconsistency in the internal state of the driver
> if the WARN-and-return-error path were taken.
> 
> However, the reason for including such a check is because we're
> crossing between logical submodules here. The GuC submission code
> requires that the LRC code (which is NOT GuC-specific) has
> previously called the precheck-for-space function and got the
> go-ahead signal. So this is essentially checking that the internal
> protocol has been followed correctly. If the LRC code were updated
> such that it might miss the precheck-for-space in some paths, this
> WARN-and-fail check would help identify the protocol error.
> 
> I can change it to a BUG() if you prefer!

Personally, yes. And if it is ever hit in practice and not by an igt,
that looks doubly bad.

add-request cannot fail as we cannot unwind the transactions for the
request. We could defer the global state updates until after the
transaction is committed, but don't today and we then have to find a way
to mitigate that cost to every request vs making the error path simpler.

> >Not that you even successfully hooked up the failure paths in
> >the current code.
> 
> The current code is broken at least because the whole callchain is
> declared as returning an int error code, only to find that the
> topmost level says that add_request/emit_request are not allowed to
> fail!
> 
> It would be rather better if the top level handled all errors in
> submission, even if it were only by dropping the request and
> resetting the engine!
> 
> >Same for guc_ring_doorbell.
> 
> The rationale for an error return is slightly different here, as
> this would not indicate an internal logic error in the driver, but
> rather a failure of communication between the driver and the
> doorbell hardware (or possibly the GuC firmware, but I don't think a
> GuC bug could cause more than one failure-to-update, hence the retry
> count is only 2).
> 
> If the doorbell hardware is not working correctly, there is nothing
> we can do about it here; the only option is to report the failure up
> to some layer that can recover, probably by resetting the GPU.

Exactly. Just pretend it worked and we will reset the GPU when we find
out it didn't.

> >And what exactly is that atomic64_cmpxchg() serialising with? There are
> >no other CPUs contending with the write, and neither does the GuC (and I
> >doubt it is taking any notice of the lock cmpxchg). Using cmpxchg where
> >a single WRITE_ONCE() of a 32bit value wins the perf prize for hottest
> >instruction and function in the kernel.
> 
> The doorbell controller hardware, I should think. The BSpec
> describes using LOCK_CMPXCHG8B to update doorbells, so I think this
> code is just based on what it says there. If the CPU hardware
> doesn't implement it efficiently, surely the GPU h/w designers
> wouldn't have mandated it in this way?

Wow, I'm surprised that they would put it into the same domain. Still,
unless you are actually serialising with another writer, what is the
point of using lock cmpxchg? E.g. an xchg would be enough to enforce
ordering, and you should ask them again if this is not a little overkill
for one-way signaling.
-Chris
Dave Gordon April 26, 2016, 8:49 a.m. UTC | #7
On 25/04/16 11:39, Chris Wilson wrote:
> On Mon, Apr 25, 2016 at 11:07:13AM +0100, Dave Gordon wrote:
>> On 22/04/16 19:45, Chris Wilson wrote:
>>> On Fri, Apr 22, 2016 at 07:22:55PM +0100, Dave Gordon wrote:
>>>> This patch simply changes the default value of "enable_guc_submission"
>>>> from 0 (never) to -1 (auto). This means that GuC submission will be
>>>> used if the platform has a GuC, the GuC supports the request submission
>>>> protocol, and any required GuC firmware was successfully loaded. If any
>>>> of these conditions are not met, the driver will fall back to using
>>>> execlist mode.
>>>
>>> There are several shortcomings yet, in particular you've decided to add
>>> new ABI...
>>>
>>> i915_guc_wq_check_space(): Do not return ETIMEDOUT. This function's
>>> return code is ABI.
>>>
>>> Your choices are either EAGAIN if you think the hardware will catch up, or
>>> more likely EIO and reset the hardware.
>>
>> This layer doesn't have enough information to determine that we need
>> a reset, so it'll have to be EAGAIN. TDR should catch the case where
>> we continue to be unable to submit for an extended period, for
>> whatever reason.
>
> You don't? This is an interface to the hardware; if you don't know whether
> the GuC is responding, who does? The other layers only look at whether we're
> advancing through the requests (either breadcrumbs or timer ticks).

Well this particular function is only looking at the state of the WQ, 
and doesn't track how fast we're sending requests or how many requests 
are outstanding. So it can't really say whether a lack of space is a 
real issue, or just a transient condition that we'll recover from (but 
see below).

> If you return EAGAIN, we busyspin through the entire execbuffer. That's
> acceptable - as we will catch the reset at some point. If we busyspin
> here, we have to hook into the reset to allow it to steal the
> struct_mutex.
>
>> I don't like the spinwait anyway, so maybe we should just return the
>> immediate result of sampling the WQ space once, and let the upper
>> layers deal with how or whether to wait-and-retry.
>
> I would have it busyspin for a few microseconds. None of the upper
> layers have any more clue than you here, though one option is to keep
> the list of requests and wait for the oldest to expire - and we should
> be able to find that from the engine, I forget if there is a 1:1
> correspondence between engine and workqueue though.

So, I tested the idea of just returning -EAGAIN immediately, and it 
seems that we *can* fill the WQ -- but only in one particular scenario. 
That was gem_ctx_switch -- presumably because that involves a lot of 
work by the GPU and extra overhead for the GuC. Otherwise, the WQ was 
never full. And in that test, the strategy of returning -EAGAIN worked 
perfectly well, with no other errors triggered as a result. So this may 
well be the right way to go -- it's certainly the simplest :)

>>> guc_add_workqueue_item: cannot fail
>>>
>>> It must be fixed such that it is a void function without possibility of
>>> failure.
>>
>> The code clearly indicates that this "shouldn't happen", and yes, it
>> would indicate an inconsistency in the internal state of the driver
>> if the WARN-and-return-error path were taken.
>>
>> However, the reason for including such a check is because we're
>> crossing between logical submodules here. The GuC submission code
>> requires that the LRC code (which is NOT GuC-specific) has
>> previously called the precheck-for-space function and got the
>> go-ahead signal. So this is essentially checking that the internal
>> protocol has been followed correctly. If the LRC code were updated
>> such that it might miss the precheck-for-space in some paths, this
>> WARN-and-fail check would help identify the protocol error.
>>
>> I can change it to a BUG() if you prefer!
>
> Personally, yes. And if it is ever hit in practice and not by an igt,
> that looks doubly bad.
>
> add-request cannot fail as we cannot unwind the transactions for the
> request. We could defer the global state updates until after the
> transaction is committed, but don't today and we then have to find a way
> to mitigate that cost to every request vs making the error path simpler.

Or we could make the global state updates reversible, as we're doing in 
the scheduler code. But one way or another, I think we should be moving 
towards a transactional model where it either works or fails with no 
side-effects.
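
In outline (a sketch only; the names are placeholders):

/*
 * Transactional submit: everything that can fail happens first and has
 * no side-effects; the commit step is void and cannot fail.
 */
static int submit_request(struct drm_i915_gem_request *rq)
{
        int ret;

        ret = reserve_submission_resources(rq); /* may fail, no state touched */
        if (ret)
                return ret;                     /* nothing to unwind */

        commit_submission(rq);                  /* void: must not fail */
        return 0;
}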

>>> Not that you even successfully hooked up the failure paths in
>>> the current code.
>>
>> The current code is broken at least because the whole callchain is
>> declared as returning an int error code, only to find that the
>> topmost level says that add_request/emit_request are not allowed to
>> fail!
>>
>> It would be rather better if the top level handled all errors in
>> submission, even if it were only by dropping the request and
>> resetting the engine!
>>
>>> Same for guc_ring_doorbell.
>>
>> The rationale for an error return is slightly different here, as
>> this would not indicate an internal logic error in the driver, but
>> rather a failure of communication between the driver and the
>> doorbell hardware (or possibly the GuC firmware, but I don't think a
>> GuC bug could cause more than one failure-to-update, hence the retry
>> count is only 2).
>>
>> If the doorbell hardware is not working correctly, there is nothing
>> we can do about it here; the only option is to report the failure up
>> to some layer that can recover, probably by resetting the GPU.
>
> Exactly. Just pretend it worked and we will reset the GPU when we find
> out it didn't.

It's important to keep s/w state self-consistent, otherwise code will 
likely crash when its preconditions are not met; and it's good to keep 
s/w state consistent with that of the h/w, otherwise system behaviour is 
likely to be unpredictable.

So I'd rather actively report a detected problem than just let a 
background task nondeterministically find out later that "something went 
wrong". Much easier to pin down the first symptom of the failure that 
way, and therefore maybe easier to find the root cause.

>>> And what exactly is that atomic64_cmpxchg() serialising with? There are
>>> no other CPUs contending with the write, and neither does the GuC (and I
>>> doubt it is taking any notice of the lock cmpxchg). Using cmpxchg where
>>> a single WRITE_ONCE() of a 32bit value wins the perf prize for hottest
>>> instruction and function in the kernel.
>>
>> The doorbell controller hardware, I should think. The BSpec
>> describes using LOCK_CMPXCHG8B to update doorbells, so I think this
>> code is just based on what it says there. If the CPU hardware
>> doesn't implement it efficiently, surely the GPU h/w designers
>> wouldn't have mandated it in this way?
>
> Wow, I'm surprised that they would put it into the same domain. Still,
> unless you are actually serialising with another writer, what is the
> point of using lock cmpxchg? E.g. an xchg would be enough to enforce
> ordering, and you should ask them again if this is not a little overkill
> for one-way signaling.
> -Chris

The doorbell controller is essentially a cache-snoop, looking for writes 
to the nominated line and sending an interrupt to the GuC when it sees 
one. If I were designing that, I might well think it would be simpler 
(for the hardware) only to watch for writes that were special in some 
way, such as having LOCK asserted. That seems a perfectly reasonable way 
of distinguishing the special "ring the doorbell, wake the GuC" write 
from just any old access that might at first look like a cache hit. 
(BTW, I don't think the current doorbell controller actually *does* 
watch only for locked cycles, but some version might). So I'd rather 
stick with the exact sequence that the BSpec suggests, at least until 
it's all been thoroughly tested ("correctness first, optimise later").

.Dave.
Dave Gordon April 26, 2016, 9:52 a.m. UTC | #8
On 26/04/16 09:49, Dave Gordon wrote:
> On 25/04/16 11:39, Chris Wilson wrote:
>> On Mon, Apr 25, 2016 at 11:07:13AM +0100, Dave Gordon wrote:
>>> On 22/04/16 19:45, Chris Wilson wrote:

[snip]

>>>> And what exactly is that atomic64_cmpxchg() serialising with? There are
>>>> no other CPUs contending with the write, and neither does the GuC
>>>> (and I
>>>> doubt it is taking any notice of the lock cmpxchg). Using cmpxchg where
>>>> a single WRITE_ONCE() of a 32bit value wins the perf prize for hottest
>>>> instruction and function in the kernel.
>>>
>>> The doorbell controller hardware, I should think. The BSpec
>>> describes using LOCK_CMPXCHG8B to update doorbells, so I think this
>>> code is just based on what it says there. If the CPU hardware
>>> doesn't implement it efficiently, surely the GPU h/w designers
>>> wouldn't have mandated it in this way?
>>
>> Wow, I'm surprised that they would put it into the same domain. Still,
>> unless you are actually serialising with another writer, what is the
>> point of using lock cmpxchg? E.g. an xchg would be enough to enforce
>> ordering, and you should ask them again if this is not a little overkill
>> for one-way signaling.
>> -Chris

As for performance, while LOCK_CMPXCHG8B might be an expensive 
instruction, we're only executing ONE per request. I suspect that the 
cumulative cost of all the extra memory accesses caused by extra 
indirections and poor structure layout costs far more than any single
instruction ever can.

Top things in this area might be:

* all the macros taking "dev" instead of "dev_priv"
* pointer dances in general (a->b->c.d->e) where we could add a shortcut 
pointer from a to c (or c.d), or from a or b to e.
* way too much repetition of a->b->c, a->b->d, a->b->e in the same 
function, which the compiler *may* optimise, but probably won't if there 
are any function calls around. Adding a local for a->b will almost 
certainly help, or at least incur no penalty and be easier to read.
* awkwardly sized or misaligned structure members, and bitfield bools 
rather than 1-byte flags

So let's nibble away at these before we worry about the cost of a single 
x86 instruction!
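
As a contrived illustration of the pointer-dance point above (the
structures and field names here are made up):

/* made-up structures, purely for illustration */
struct wq_stats { int used; int size; };
struct client { struct wq_stats wq; };
struct dev_priv_t { struct client *execbuf_client; };

/* repeated a->b->c chains: harder to read, may be reloaded around calls */
static int wq_space_before(struct dev_priv_t *dev_priv)
{
        return dev_priv->execbuf_client->wq.size -
               dev_priv->execbuf_client->wq.used;
}

/* hoist the common prefix into a local: clearer, and at worst no slower */
static int wq_space_after(struct dev_priv_t *dev_priv)
{
        struct client *client = dev_priv->execbuf_client;

        return client->wq.size - client->wq.used;
}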

.Dave.
Chris Wilson April 26, 2016, 10:35 a.m. UTC | #9
On Tue, Apr 26, 2016 at 10:52:41AM +0100, Dave Gordon wrote:
> On 26/04/16 09:49, Dave Gordon wrote:
> >On 25/04/16 11:39, Chris Wilson wrote:
> >>On Mon, Apr 25, 2016 at 11:07:13AM +0100, Dave Gordon wrote:
> >>>On 22/04/16 19:45, Chris Wilson wrote:
> 
> [snip]
> 
> >>>>And what exactly is that atomic64_cmpxchg() serialising with? There are
> >>>>no other CPUs contending with the write, and neither does the GuC
> >>>>(and I
> >>>>doubt it is taking any notice of the lock cmpxchg). Using cmpxchg where
> >>>>a single WRITE_ONCE() of a 32bit value wins the perf prize for hottest
> >>>>instruction and function in the kernel.
> >>>
> >>>The doorbell controller hardware, I should think. The BSpec
> >>>describes using LOCK_CMPXCHG8B to update doorbells, so I think this
> >>>code is just based on what it says there. If the CPU hardware
> >>>doesn't implement it efficiently, surely the GPU h/w designers
> >>>wouldn't have mandated it in this way?
> >>
> >>Wow, I'm surprised that they would put it into the same domain. Still,
> >>unless you are actually serialising with another writer, what is the
> >>point of using lock cmpxchg? E.g. an xchg would be enough to enforce
> >>ordering, and you should ask them again if this is not a little overkill
> >>for one-way signaling.
> >>-Chris
> 
> As for performance, while LOCK_CMPXCHG8B might be an expensive
> instruction, we're only executing ONE per request. I suspect that
> the cumulative cost of all the extra memory accesses caused by extra
> indirections and poor structure layout costs far more than any single
> instruction ever can.
> 
> Top things in this area might be:
> 
> * all the macros taking "dev" instead of "dev_priv"
> * pointer dances in general (a->b->c.d->e) where we could add a
> shortcut pointer from a to c (or c.d), or from a or b to e.
> * way too much repetition of a->b->c, a->b->d, a->b->e in the same
> function, which the compiler *may* optimise, but probably won't if
> there are any function calls around. Adding a local for a->b will
> almost certainly help, or at least incur no penalty and be easier to
> read.
> * awkwardly sized or misaligned structure members, and bitfield
> bools rather than 1-byte flags
> 
> So let's nibble away at these before we worry about the cost of a
> single x86 instruction!

You can either assume that I applied the patches I sent to the ml for
the above points, or just look at the delta between execlists and guc
and worry about the regressions.
-Chris
Dave Gordon April 26, 2016, 1:36 p.m. UTC | #10
On 26/04/16 11:35, Chris Wilson wrote:
> On Tue, Apr 26, 2016 at 10:52:41AM +0100, Dave Gordon wrote:
>> On 26/04/16 09:49, Dave Gordon wrote:
>>> On 25/04/16 11:39, Chris Wilson wrote:
>>>> On Mon, Apr 25, 2016 at 11:07:13AM +0100, Dave Gordon wrote:
>>>>> On 22/04/16 19:45, Chris Wilson wrote:
>>
>> [snip]
>>
>>>>>> And what exactly is that atomic64_cmpxchg() serialising with? There are
>>>>>> no other CPUs contending with the write, and neither does the GuC
>>>>>> (and I
>>>>>> doubt it is taking any notice of the lock cmpxchg). Using cmpxchg where
>>>>>> a single WRITE_ONCE() of a 32bit value wins the perf prize for hottest
>>>>>> instruction and function in the kernel.
>>>>>
>>>>> The doorbell controller hardware, I should think. The BSpec
>>>>> describes using LOCK_CMPXCHG8B to update doorbells, so I think this
>>>>> code is just based on what it says there. If the CPU hardware
>>>>> doesn't implement it efficiently, surely the GPU h/w designers
>>>>> wouldn't have mandated it in this way?
>>>>
>>>> Wow, I'm surprised that they would put it into the same domain. Still,
>>>> unless you are actually serialising with another writer, what is the
>>>> point of using lock cmpxchg? E.g. an xchg would be enough to enforce
>>>> ordering, and you should ask them again if this is not a little overkill
>>>> for one-way signaling.
>>>> -Chris
>>
>> As for performance, while LOCK_CMPXCHG8B might be an expensive
>> instruction, we're only executing ONE per request. I suspect that
>> the cumulative cost of all the extra memory accesses caused by extra
>> indirections and poor structure layout costs far more than any single
>> instruction ever can.
>>
>> Top things in this area might be:
>>
>> * all the macros taking "dev" instead of "dev_priv"
>> * pointer dances in general (a->b->c.d->e) where we could add a
>> shortcut pointer from a to c (or c.d), or from a or b to e.
>> * way too much repetition of a->b->c, a->b->d, a->b->e in the same
>> function, which the compiler *may* optimise, but probably won't if
>> there are any function calls around. Adding a local for a->b will
>> almost certainly help, or at least incur no penalty and be easier to
>> read.
>> * awkwardly sized or misaligned structure members, and bitfield
>> bools rather than 1-byte flags
>>
>> So let's nibble away at these before we worry about the cost of a
>> single x86 instruction!
>
> You can either assume that I applied the patches I sent to the ml for
> the above points, or just look at the delta between execlists and guc
> and worry about the regressions.
> -Chris

With the latest version of this patchset (not yet posted), I see GuC 
submission just slightly *faster* than execlists on the render ring :)
However it's still slower on the others, where the minimum execlist time 
is less.

Just running noops, execlist submission on render shows a downward trend 
as more noop batches are submitted, flattening to a limit of just over 
3us/batch. GuC mode shows the same trend, and bottoms out at just UNDER 
3us/batch. The crossover seems to be 512-1024 batches/cycle, where the 
reduced interrupt overhead outweighs the added latency.

It probably helps that we're no longer mapping & unmapping the doorbell 
page on each request!
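
That is, e.g. mapping it once when the client is created and reusing the
pointer thereafter, roughly (a sketch; helper and field names are
illustrative):

/* Set up a persistent doorbell mapping at client-creation time
 * (illustrative names, not the exact driver code) */
static void guc_client_map_doorbell(struct i915_guc_client *client)
{
        struct page *page =
                i915_gem_object_get_page(client->client_obj,
                                         client->doorbell_offset >> PAGE_SHIFT);

        /* kept for the client's lifetime; every request just writes here */
        client->doorbell = kmap(page) +
                           offset_in_page(client->doorbell_offset);
}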

.Dave.
Daniel Vetter April 26, 2016, 2 p.m. UTC | #11
On Mon, Apr 25, 2016 at 09:29:42AM +0100, Chris Wilson wrote:
> On Mon, Apr 25, 2016 at 08:31:07AM +0100, Dave Gordon wrote:
> > On 22/04/16 19:51, Chris Wilson wrote:
> > >On Fri, Apr 22, 2016 at 07:45:15PM +0100, Chris Wilson wrote:
> > >>On Fri, Apr 22, 2016 at 07:22:55PM +0100, Dave Gordon wrote:
> > >>>This patch simply changes the default value of "enable_guc_submission"
> > >>>from 0 (never) to -1 (auto). This means that GuC submission will be
> > >>>used if the platform has a GuC, the GuC supports the request submission
> > >>>protocol, and any required GuC firmware was successfully loaded. If any
> > >>>of these conditions are not met, the driver will fall back to using
> > >>>execlist mode.
> > >
> > >I just remembered something else.
> > >
> > >  * Work Items:
> > >  * There are several types of work items that the host may place into a
> > >  * workqueue, each with its own requirements and limitations. Currently only
> > >  * WQ_TYPE_INORDER is needed to support legacy submission via GuC, which
> > >  * represents in-order queue. The kernel driver packs ring tail pointer and an
> > >  * ELSP context descriptor dword into Work Item.
> > >
> > >Is this right? You only allocate a single client covering all engines and
> > >specify INORDER. We expect parallel execution between engines, is this
> > >supported? Empirically it seems like guc is only executing commands in
> > >series across engines and not in parallel.
> > >-Chris
> > 
> > AFAIK, INORDER represents in-order executions of elements in the
> > GuC's (internal) submission queue, which is per-engine; i.e. this
> > option bypasses the GuC's internal scheduling algorithms and makes
> > the GuC behave as a simple dispatcher. It demultiplexes work queue
> > items into the multiple submission queues, then executes them in
> > order from there.
> > 
> > Alex can probably confirm this in the GuC code, but I really think
> > we'd have noticed if execution were serialised across engines. For a
> > start, the validation tests that have one engine busy-spin while
> > waiting for a batch on a different engine to update a buffer
> > wouldn't ever finish.
> 
> That doesn't seem to be the issue, we can run in parallel it seems
> (busy-spin on one engine doesn't prevent a write on the second). It's 
> just the latency it seems. Overall the execution latency goes up
> substantially with guc, and in this case it does not seem to be executing
> the second execbuf on the second ring until after the first completes.

That sounds like a decent bug in guc code, and defeats the point of all
the work to speed up execlist submission going on right now.

Can we have non-slow guc somehow? Do we need to escalate this to the
firmware folks and first make sure they have a firmware released which
doesn't like to twiddle thumbs (assuming it's a guc issue indeed and not
in how we submit things)?

Afaiui the point of guc was to reduce submission latency by again having a
queue to submit to, instead of the 1.5 submit ports with execlist. There's
other reasons on top, but if firmware engineers butchered that it doesn't
look good.
-Daniel
Dave Gordon April 27, 2016, 5:53 p.m. UTC | #12
On 26/04/16 15:00, Daniel Vetter wrote:
> On Mon, Apr 25, 2016 at 09:29:42AM +0100, Chris Wilson wrote:
>> On Mon, Apr 25, 2016 at 08:31:07AM +0100, Dave Gordon wrote:
>>> On 22/04/16 19:51, Chris Wilson wrote:
>>>> On Fri, Apr 22, 2016 at 07:45:15PM +0100, Chris Wilson wrote:
>>>>> On Fri, Apr 22, 2016 at 07:22:55PM +0100, Dave Gordon wrote:
>>>>>> This patch simply changes the default value of "enable_guc_submission"
>>>>>> from 0 (never) to -1 (auto). This means that GuC submission will be
>>>>>> used if the platform has a GuC, the GuC supports the request submission
>>>>>> protocol, and any required GuC firmware was successfully loaded. If any
>>>>>> of these conditions are not met, the driver will fall back to using
>>>>>> execlist mode.
>>>>
>>>> I just remembered something else.
>>>>
>>>>   * Work Items:
>>>>   * There are several types of work items that the host may place into a
>>>>   * workqueue, each with its own requirements and limitations. Currently only
>>>>   * WQ_TYPE_INORDER is needed to support legacy submission via GuC, which
>>>>   * represents in-order queue. The kernel driver packs ring tail pointer and an
>>>>   * ELSP context descriptor dword into Work Item.
>>>>
>>>> Is this right? You only allocate a single client covering all engines and
>>>> specify INORDER. We expect parallel execution between engines, is this
>>>> supported? Empirically it seems like guc is only executing commands in
>>>> series across engines and not in parallel.
>>>> -Chris
>>>
>>> AFAIK, INORDER represents in-order executions of elements in the
>>> GuC's (internal) submission queue, which is per-engine; i.e. this
>>> option bypasses the GuC's internal scheduling algorithms and makes
>>> the GuC behave as a simple dispatcher. It demultiplexes work queue
>>> items into the multiple submission queues, then executes them in
>>> order from there.
>>>
>>> Alex can probably confirm this in the GuC code, but I really think
>>> we'd have noticed if execution were serialised across engines. For a
>>> start, the validation tests that have one engine busy-spin while
>>> waiting for a batch on a different engine to update a buffer
>>> wouldn't ever finish.
>>
>> That doesn't seem to be the issue, we can run in parallel it seems
>> (busy-spin on one engine doesn't prevent a write on the second). It's
>> just the latency it seems. Overall the execution latency goes up
>> substantially with guc, and in this case it does not seem to be executing
>> the second execbuf on the second ring until after the first completes.
>
> That sounds like a decent bug in guc code, and defeats the point of all
> the work to speed up execlist submission going on right now.
>
> Can we have non-slow guc somehow? Do we need to escalate this to the
> firmware folks and first make sure they have a firmware released which
> doesn't like to twiddle thumbs (assuming it's a guc issue indeed and not
> in how we submit things)?

According to the numbers I was getting yesterday, GuC submission is now 
slightly faster than execlists on the render engine (because execlists 
is slower on that engine), but still a bit slower on the others. See

http://www.spinics.net/lists/intel-gfx/msg94140.html

> Afaiui the point of guc was to reduce submission latency by again having a
> queue to submit to, instead of the 1.5 submit ports with execlist. There's
> other reasons on top, but if firmware engineers butchered that it doesn't
> look good.
> -Daniel

I don't think it was ever about latency. I think the GuC was added to 
reduce the overhead of fielding context-switch interrupts on the CPU.

.Dave.

Patch

diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
index 79be9f8..0df8aa4 100644
--- a/drivers/gpu/drm/i915/i915_params.c
+++ b/drivers/gpu/drm/i915/i915_params.c
@@ -55,7 +55,7 @@  struct i915_params i915 __read_mostly = {
 	.nuclear_pageflip = 0,
 	.edp_vswing = 0,
 	.enable_guc_loading = -1,
-	.enable_guc_submission = 0,
+	.enable_guc_submission = -1,
 	.guc_log_level = -1,
 	.enable_dp_mst = true,
 	.inject_load_failure = 0,
@@ -206,7 +206,7 @@  struct i915_params i915 __read_mostly = {
 module_param_named_unsafe(enable_guc_submission, i915.enable_guc_submission, int, 0400);
 MODULE_PARM_DESC(enable_guc_submission,
 		"Enable GuC submission "
-		"(-1=auto, 0=never [default], 1=if available, 2=required)");
+		"(-1=auto [default], 0=never, 1=if available, 2=required)");
 
 module_param_named(guc_log_level, i915.guc_log_level, int, 0400);
 MODULE_PARM_DESC(guc_log_level,