Message ID: 1459788671-17501-3-git-send-email-tvrtko.ursulin@linux.intel.com (mailing list archive)
State: New, archived
On Mon, Apr 04, 2016 at 05:51:11PM +0100, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> On platforms with multiple forcewake domains it seems more efficient
> to request all desired ones and then to wait for acks to avoid
> needlessly serializing on each domain.

Not convinced since we have more machines with one domain than two. What
I did was to compact the domains array so that we only iterated over the
known set - but that feels overkill when we only have two domains today.

For the same reason (only one machine with two domains), I didn't think
separate functions to iterate over one domain and another to iterate
over all was worth it.

What you can do though is remove an excess posting read from
fw_domains_put.

Compared to the cost of a register access (the spinlock irq mostly) the
iterator doesn't strike me as being that worthwhile an optimisation
target.
-Chris
On 04/04/16 20:07, Chris Wilson wrote:
> On Mon, Apr 04, 2016 at 05:51:11PM +0100, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> On platforms with multiple forcewake domains it seems more efficient
>> to request all desired ones and then to wait for acks to avoid
>> needlessly serializing on each domain.
>
> Not convinced since we have more machines with one domain than two. What
> I did was to compact the domains array so that we only iterated over the
> known set - but that feels overkill when we only have two domains today.
>
> For the same reason (only one machine with two domains), I didn't think
> separate functions to iterate over one domain and another to iterate
> over all was worth it.
>
> What you can do though is remove an excess posting read from
> fw_domains_put.
>
> Compared to the cost of a register access (the spinlock irq mostly) the
> iterator doesn't strike me as being that worthwhile an optimisation
> target.

Correct, I thought we agreed that the majority of the CPU time
attributed to fw_domains_get is from the busy spinning while waiting
on the ack from the GPU.

This patch is not optimising the iterator, but requests all domains to
be woken up and then waits for the acks. It changes the time spent busy
spinning from Td1 + ... + Tdn to max(Td1 ... Tdn).

Yes, it is only interesting for platforms with more than one fw domain.
But since we agreed the iterator is not significant, the fact that it
now does two passes* over the array should not be noticeable vs. the
gain for multi-fw-domain machines (of which there will be more and more
as time goes by).

Regards,
Tvrtko

* Also, because patch 2/3 of this series has shrunk the iterator
considerably, fw_domains_get remains pretty much the same size now with
two loops as it was before with one.
On Tue, Apr 05, 2016 at 10:02:28AM +0100, Tvrtko Ursulin wrote:
>
> On 04/04/16 20:07, Chris Wilson wrote:
> >On Mon, Apr 04, 2016 at 05:51:11PM +0100, Tvrtko Ursulin wrote:
> >>From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>
> >>On platforms with multiple forcewake domains it seems more efficient
> >>to request all desired ones and then to wait for acks to avoid
> >>needlessly serializing on each domain.
> >
> >Not convinced since we have more machines with one domain than two. What
> >I did was to compact the domains array so that we only iterated over the
> >known set - but that feels overkill when we only have two domains today.
> >
> >For the same reason (only one machine with two domains), I didn't think
> >separate functions to iterate over one domain and another to iterate
> >over all was worth it.
> >
> >What you can do though is remove an excess posting read from
> >fw_domains_put.
> >
> >Compared to the cost of a register access (the spinlock irq mostly) the
> >iterator doesn't strike me as being that worthwhile an optimisation
> >target.
>
> Correct, I thought we agreed that the majority of the CPU time
> attributed to fw_domains_get is from the busy spinning while waiting
> on the ack from the GPU.
>
> This patch is not optimising the iterator, but requests all domains to
> be woken up and then waits for the acks. It changes the time spent busy
> spinning from Td1 + ... + Tdn to max(Td1 ... Tdn).
>
> Yes, it is only interesting for platforms with more than one fw domain.
> But since we agreed the iterator is not significant, the fact that it
> now does two passes over the array should not be noticeable vs. the
> gain for multi-fw-domain machines (of which there will be more and more
> as time goes by).

Then we should first be looking at the cases where we are requesting
multiple domains to be woken but only need one.

Anyway, you've changed my mind,
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 49edd641b434..03674c3cfaf7 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -115,8 +115,10 @@ fw_domains_get(struct drm_i915_private *dev_priv, enum forcewake_domains fw_doma
 	for_each_fw_domain_mask(d, fw_domains, dev_priv) {
 		fw_domain_wait_ack_clear(d);
 		fw_domain_get(d);
-		fw_domain_wait_ack(d);
 	}
+
+	for_each_fw_domain_mask(d, fw_domains, dev_priv)
+		fw_domain_wait_ack(d);
 }
 
 static void