
[3/3] drm/i915: Do not serialize forcewake acquire across domains

Message ID 1459788671-17501-3-git-send-email-tvrtko.ursulin@linux.intel.com (mailing list archive)
State New, archived

Commit Message

Tvrtko Ursulin April 4, 2016, 4:51 p.m. UTC
From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

On platforms with multiple forcewake domains it seems more efficient
to request all desired ones and then to wait for acks to avoid
needlessly serializing on each domain.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/intel_uncore.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
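
For quick reference, here is the change restated as before/after C, lifted
directly from the diff at the bottom of this page (the rest of
fw_domains_get is unchanged):

	/* Before: request and then wait for each domain in turn, so
	 * the per-domain ack latencies add up.
	 */
	for_each_fw_domain_mask(d, fw_domains, dev_priv) {
		fw_domain_wait_ack_clear(d);
		fw_domain_get(d);
		fw_domain_wait_ack(d);
	}

	/* After: issue every wake request first, then wait for the
	 * acks in a second pass, so the latencies overlap.
	 */
	for_each_fw_domain_mask(d, fw_domains, dev_priv) {
		fw_domain_wait_ack_clear(d);
		fw_domain_get(d);
	}

	for_each_fw_domain_mask(d, fw_domains, dev_priv)
		fw_domain_wait_ack(d);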

Comments

Chris Wilson April 4, 2016, 7:07 p.m. UTC | #1
On Mon, Apr 04, 2016 at 05:51:11PM +0100, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> On platforms with multiple forcewake domains it seems more efficient
> to request all desired ones and then to wait for acks to avoid
> needlessly serializing on each domain.

Not convinced since we have more machines with one domain than two. What
I did was to compact the domains array so that we only iterated over the
known set - but that feels overkill when we only have two domains today.

For the same reason (only one machine with two domains), I didn't think
separate functions to iterate over one domain and another to iterate
over all was worth it.

What you can do though is remove an excess posting read from
fw_domains_put.

Compared to the cost of a register access (the spinlock irq mostly) the
iterator doesn't strike me as being that worthwhile an optimisation
target.
-Chris
Tvrtko Ursulin April 5, 2016, 9:02 a.m. UTC | #2
On 04/04/16 20:07, Chris Wilson wrote:
> On Mon, Apr 04, 2016 at 05:51:11PM +0100, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> On platforms with multiple forcewake domains it seems more efficient
>> to request all desired ones and then to wait for acks to avoid
>> needlessly serializing on each domain.
>
> Not convinced since we have more machines with one domain than two. What
> I did was to compact the domains array so that we only iterated over the
> known set - but that feels overkill when we only have two domains today.
>
> For the same reason (only one machine with two domains), I didn't think
>> separate functions to iterate over one domain and another to iterate
> over all was worth it.
>
> What you can do though is remove an excess posting read from
> fw_domains_put.
>
> Compared to the cost of a register access (the spinlock irq mostly) the
> iterator doesn't strike me as being that worthwhile an optimisation
> target.

Correct, I thought we agreed that the majority of the CPU time 
attributed to fw_domains_get is from the busy spinning while waiting on 
the ack from the GPU.

This patch is not optimising the iterator, but requests all domains to 
be woken up and then waits for acks. It changes the time spent busy 
spinning from Td1 + ... + Tdn to max(Td1, ..., Tdn).
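
In symbols (same Td notation; this assumes the per-domain waits fully
overlap once every wake request is in flight):

	\[
	T_{\text{spin}} = \sum_{i=1}^{n} T_{d_i}
	\;\longrightarrow\;
	T_{\text{spin}} \approx \max_{1 \le i \le n} T_{d_i}
	\]

so with two domains of comparable ack time the spinning is roughly halved.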

Yes it is only interesting for platforms with more than one fw domain. 
But since we agreed the iterator is not significant, the fact that it 
adds two loops* over the array should not be noticeable vs. the gain 
for multi-fw domain machines (of which there will be more and more as 
time goes by).

Regards,

Tvrtko

* Also, because patch 2/3 from this series has shrunk the iterator 
considerably, fw_domains_get remains pretty much the same size now 
with two loops as it was before with one.
Chris Wilson April 5, 2016, 9:47 a.m. UTC | #3
On Tue, Apr 05, 2016 at 10:02:28AM +0100, Tvrtko Ursulin wrote:
> 
> On 04/04/16 20:07, Chris Wilson wrote:
> >On Mon, Apr 04, 2016 at 05:51:11PM +0100, Tvrtko Ursulin wrote:
> >>From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>
> >>On platforms with multiple forcewake domains it seems more efficient
> >>to request all desired ones and then to wait for acks to avoid
> >>needlessly serializing on each domain.
> >
> >Not convinced since we have more machines with one domain than two. What
> >I did was to compact the domains array so that we only iterated over the
> >known set - but that feels overkill when we only have two domains today.
> >
> >For the same reason (only one machine with two domains), I didn't think
> >separate functions to iterate over one domain and another to iterate
> >over all was worth it.
> >
> >What you can do though is remove an excess posting read from
> >fw_domains_put.
> >
> >Compared to the cost of a register access (the spinlock irq mostly) the
> >iterator doesn't strike me as being that worthwhile an optimisation
> >target.
> 
> Correct, I thought we agreed that the majority of the CPU time
> attributed to fw_domains_get is from the busy spinning while waiting
> on the ack from the GPU.
> 
> This patch is not optimising the iterator, but requests all domains
> to be woken up and then waits for acks. It changes the time spent
> busy spinning from Td1 + ... + Tdn to max(Td1, ..., Tdn).
> 
> Yes it is only interesting for platforms with more than one fw
> domain. But since we agreed the iterator is not significant, the
> fact that it adds two loops* over the array should not be noticeable
> vs. the gain for multi-fw domain machines (of which there will be
> more and more as time goes by).

Then we should first be looking at the cases where we are requesting
multiple domains to be woken when we only need one.

Anyway you've changed my mind,
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris

Patch

diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 49edd641b434..03674c3cfaf7 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -115,8 +115,10 @@  fw_domains_get(struct drm_i915_private *dev_priv, enum forcewake_domains fw_doma
 	for_each_fw_domain_mask(d, fw_domains, dev_priv) {
 		fw_domain_wait_ack_clear(d);
 		fw_domain_get(d);
-		fw_domain_wait_ack(d);
 	}
+
+	for_each_fw_domain_mask(d, fw_domains, dev_priv)
+		fw_domain_wait_ack(d);
 }
 
 static void