drm/i915: Replace some more busy waits with normal ones

Message ID	1458743527-25392-1-git-send-email-tvrtko.ursulin@linux.intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <intel-gfx-bounces@lists.freedesktop.org> From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> To: Intel-gfx@lists.freedesktop.org Date: Wed, 23 Mar 2016 14:32:07 +0000 Message-Id: <1458743527-25392-1-git-send-email-tvrtko.ursulin@linux.intel.com> Cc: Mika Kuoppala <mika.kuoppala@intel.com>, Paulo Zanoni <paulo.r.zanoni@intel.com> Subject: [Intel-gfx] [PATCH] drm/i915: Replace some more busy waits with normal ones Precedence: list MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

Tvrtko Ursulin March 23, 2016, 2:32 p.m. UTC

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

When I added an assert to catch non-atomic users of
wait_for_atomic_us in 0351b93992aa463cc3e7f358ddec2709f9390756
("drm/i915: Do not lie about atomic timeout granularity"),
I have missed some callers which use it from obviously
non-atomic context.

Replace them with sleeping waits which support micro-second
timeout granularity since 3f177625ee896f5d3c62fa6a49554a9c0243bceb
("drm/i915: Add wait_for_us").

Note however than a fix for wait_for is needed to a clock with
more granularity than jiffies. In the above referenced patch
I have switched the arguments to micro-seconds, but failed to
upgrade the clock as well, as Mika has later discovered.

Open question here is whether we should allow sleeping waits
of less than 10us which usleep_range recommends against. And
this patch actually touches one call site which asks for 1us
timeout.

These might be better served with wait_for_atomic_us, in which
case the inatomic warning there should be made dependant on
the requested timeout.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Paulo Zanoni <paulo.r.zanoni@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/intel_display.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

Tvrtko Ursulin March 23, 2016, 2:38 p.m. UTC | #1

Should have sent this as RFC..

On 23/03/16 14:32, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> When I added an assert to catch non-atomic users of
> wait_for_atomic_us in 0351b93992aa463cc3e7f358ddec2709f9390756
> ("drm/i915: Do not lie about atomic timeout granularity"),
> I have missed some callers which use it from obviously
> non-atomic context.
>
> Replace them with sleeping waits which support micro-second
> timeout granularity since 3f177625ee896f5d3c62fa6a49554a9c0243bceb
> ("drm/i915: Add wait_for_us").
>
> Note however than a fix for wait_for is needed to a clock with
> more granularity than jiffies. In the above referenced patch
> I have switched the arguments to micro-seconds, but failed to
> upgrade the clock as well, as Mika has later discovered.
>
> Open question here is whether we should allow sleeping waits
> of less than 10us which usleep_range recommends against. And
> this patch actually touches one call site which asks for 1us
> timeout.
>
> These might be better served with wait_for_atomic_us, in which
> case the inatomic warning there should be made dependant on
> the requested timeout.

For discussion - does the above sound like a better plan than this 
patch? To sum up my proposal:

1. Allow wait for_atomic_us for < 10us waits and keep using it for such 
waiters.

2. Upgrade the clock in wait_for to something more precise than jiffies 
so timeouts from 10us and up can be handled properly. Note that 
currently this is  only and issue in the failure/timeout mode. In the 
expected case the current implementation is fine.

Equally as under 1), put a BUILD_BUG_ON in wait_for for <10us waits.

Regards,

Tvrtko

kernel test robot March 23, 2016, 2:50 p.m. UTC | #2

Hi Tvrtko,

[auto build test ERROR on drm-intel/for-linux-next]
[also build test ERROR on v4.5 next-20160323]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/Tvrtko-Ursulin/drm-i915-Replace-some-more-busy-waits-with-normal-ones/20160323-224126
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: i386-randconfig-x000-201612 (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/gpu/drm/i915/intel_display.c: In function 'lpt_reset_fdi_mphy':
>> drivers/gpu/drm/i915/intel_display.c:8386:6: error: implicit declaration of function 'wait_for_us' [-Werror=implicit-function-declaration]
     if (wait_for_us(I915_READ(SOUTH_CHICKEN2) &
         ^
   cc1: some warnings being treated as errors

vim +/wait_for_us +8386 drivers/gpu/drm/i915/intel_display.c

  8380		uint32_t tmp;
  8381	
  8382		tmp = I915_READ(SOUTH_CHICKEN2);
  8383		tmp |= FDI_MPHY_IOSFSB_RESET_CTL;
  8384		I915_WRITE(SOUTH_CHICKEN2, tmp);
  8385	
> 8386		if (wait_for_us(I915_READ(SOUTH_CHICKEN2) &
  8387				FDI_MPHY_IOSFSB_RESET_STATUS, 100))
  8388			DRM_ERROR("FDI mPHY reset assert timeout\n");
  8389	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

kernel test robot March 23, 2016, 2:58 p.m. UTC | #3

Hi Tvrtko,

[auto build test WARNING on drm-intel/for-linux-next]
[also build test WARNING on v4.5 next-20160323]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/Tvrtko-Ursulin/drm-i915-Replace-some-more-busy-waits-with-normal-ones/20160323-224126
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: x86_64-randconfig-x000-201612 (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   In file included from include/uapi/linux/stddef.h:1:0,
                    from include/linux/stddef.h:4,
                    from include/uapi/linux/posix_types.h:4,
                    from include/uapi/linux/types.h:13,
                    from include/linux/types.h:5,
                    from include/linux/list.h:4,
                    from include/linux/dmi.h:4,
                    from drivers/gpu/drm/i915/intel_display.c:27:
   drivers/gpu/drm/i915/intel_display.c: In function 'lpt_reset_fdi_mphy':
   drivers/gpu/drm/i915/intel_display.c:8386:6: error: implicit declaration of function 'wait_for_us' [-Werror=implicit-function-declaration]
     if (wait_for_us(I915_READ(SOUTH_CHICKEN2) &
         ^
   include/linux/compiler.h:147:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^
>> drivers/gpu/drm/i915/intel_display.c:8386:2: note: in expansion of macro 'if'
     if (wait_for_us(I915_READ(SOUTH_CHICKEN2) &
     ^
   cc1: some warnings being treated as errors

vim +/if +8386 drivers/gpu/drm/i915/intel_display.c

  8370			I915_WRITE(PCH_DREF_CONTROL, val);
  8371			POSTING_READ(PCH_DREF_CONTROL);
  8372			udelay(200);
  8373		}
  8374	
  8375		BUG_ON(val != final);
  8376	}
  8377	
  8378	static void lpt_reset_fdi_mphy(struct drm_i915_private *dev_priv)
  8379	{
  8380		uint32_t tmp;
  8381	
  8382		tmp = I915_READ(SOUTH_CHICKEN2);
  8383		tmp |= FDI_MPHY_IOSFSB_RESET_CTL;
  8384		I915_WRITE(SOUTH_CHICKEN2, tmp);
  8385	
> 8386		if (wait_for_us(I915_READ(SOUTH_CHICKEN2) &
  8387				FDI_MPHY_IOSFSB_RESET_STATUS, 100))
  8388			DRM_ERROR("FDI mPHY reset assert timeout\n");
  8389	
  8390		tmp = I915_READ(SOUTH_CHICKEN2);
  8391		tmp &= ~FDI_MPHY_IOSFSB_RESET_CTL;
  8392		I915_WRITE(SOUTH_CHICKEN2, tmp);
  8393	
  8394		if (wait_for_us((I915_READ(SOUTH_CHICKEN2) &

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

Mika Kuoppala March 23, 2016, 3:43 p.m. UTC | #4

Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> writes:

> [ text/plain ]
>
> Should have sent this as RFC..
>
> On 23/03/16 14:32, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> When I added an assert to catch non-atomic users of
>> wait_for_atomic_us in 0351b93992aa463cc3e7f358ddec2709f9390756
>> ("drm/i915: Do not lie about atomic timeout granularity"),
>> I have missed some callers which use it from obviously
>> non-atomic context.
>>
>> Replace them with sleeping waits which support micro-second
>> timeout granularity since 3f177625ee896f5d3c62fa6a49554a9c0243bceb
>> ("drm/i915: Add wait_for_us").
>>
>> Note however than a fix for wait_for is needed to a clock with
>> more granularity than jiffies. In the above referenced patch
>> I have switched the arguments to micro-seconds, but failed to
>> upgrade the clock as well, as Mika has later discovered.
>>
>> Open question here is whether we should allow sleeping waits
>> of less than 10us which usleep_range recommends against. And
>> this patch actually touches one call site which asks for 1us
>> timeout.
>>
>> These might be better served with wait_for_atomic_us, in which
>> case the inatomic warning there should be made dependant on
>> the requested timeout.
>
> For discussion - does the above sound like a better plan than this 
> patch? To sum up my proposal:
>

What I have aimed for was that we only have wait_for and wait_for_atomic.

The sleeping one operates on 1ms granularity and the nonsleeping
one on usecs.

> 1. Allow wait for_atomic_us for < 10us waits and keep using it for such 
> waiters.

I have modified the wait_for to do few busy cycles on the
start of the wait and then adaptive backoff if condition is not
yet met. In hopes that we could convert few atomic_waits for this.

> 2. Upgrade the clock in wait_for to something more precise than jiffies 
> so timeouts from 10us and up can be handled properly. Note that 
> currently this is  only and issue in the failure/timeout mode. In the 
> expected case the current implementation is fine.
>

I would not go this route. If you really really want <1ms response
this should be explicit in the callsite. Disclaimer: i don't
know all the callsites and requirements.

> Equally as under 1), put a BUILD_BUG_ON in wait_for for <10us waits.
>

This is what I had in mind (wip/rfc):
https://cgit.freedesktop.org/~miku/drm-intel/log/?h=wait_until

Spiced with your patch and few build_bug_on, I think the
wait_for_atomic(_us) might become rare thing.

Thanks,
-Mika

> Regards,
>
> Tvrtko
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx

Tvrtko Ursulin March 23, 2016, 4:24 p.m. UTC | #5

On 23/03/16 15:43, Mika Kuoppala wrote:
> Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> writes:
>
>> [ text/plain ]
>>
>> Should have sent this as RFC..
>>
>> On 23/03/16 14:32, Tvrtko Ursulin wrote:
>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>
>>> When I added an assert to catch non-atomic users of
>>> wait_for_atomic_us in 0351b93992aa463cc3e7f358ddec2709f9390756
>>> ("drm/i915: Do not lie about atomic timeout granularity"),
>>> I have missed some callers which use it from obviously
>>> non-atomic context.
>>>
>>> Replace them with sleeping waits which support micro-second
>>> timeout granularity since 3f177625ee896f5d3c62fa6a49554a9c0243bceb
>>> ("drm/i915: Add wait_for_us").
>>>
>>> Note however than a fix for wait_for is needed to a clock with
>>> more granularity than jiffies. In the above referenced patch
>>> I have switched the arguments to micro-seconds, but failed to
>>> upgrade the clock as well, as Mika has later discovered.
>>>
>>> Open question here is whether we should allow sleeping waits
>>> of less than 10us which usleep_range recommends against. And
>>> this patch actually touches one call site which asks for 1us
>>> timeout.
>>>
>>> These might be better served with wait_for_atomic_us, in which
>>> case the inatomic warning there should be made dependant on
>>> the requested timeout.
>>
>> For discussion - does the above sound like a better plan than this
>> patch? To sum up my proposal:
>>
>
> What I have aimed for was that we only have wait_for and wait_for_atomic.
>
> The sleeping one operates on 1ms granularity and the nonsleeping
> one on usecs.

Okay, if you think 1ms is enough for all callers.

>> 1. Allow wait for_atomic_us for < 10us waits and keep using it for such
>> waiters.
>
> I have modified the wait_for to do few busy cycles on the
> start of the wait and then adaptive backoff if condition is not
> yet met. In hopes that we could convert few atomic_waits for this.

Sounds good.

>> 2. Upgrade the clock in wait_for to something more precise than jiffies
>> so timeouts from 10us and up can be handled properly. Note that
>> currently this is  only and issue in the failure/timeout mode. In the
>> expected case the current implementation is fine.
>>
>
> I would not go this route. If you really really want <1ms response
> this should be explicit in the callsite. Disclaimer: i don't
> know all the callsites and requirements.

It is explicit, just that it is currently broken on the timeout front. 
But never mind, if the precision is really not required then it is good 
to get rid of it.

>> Equally as under 1), put a BUILD_BUG_ON in wait_for for <10us waits.
>>
>
> This is what I had in mind (wip/rfc):
> https://cgit.freedesktop.org/~miku/drm-intel/log/?h=wait_until
>
> Spiced with your patch and few build_bug_on, I think the
> wait_for_atomic(_us) might become rare thing.

I had a brief look at your tree - looks like a comprehensive approach 
and in overall a good one. Have spotted some small details, but I can 
comment on when you post it.

Biggest thing to make sure is that you don't add a lot of cycles to the 
forcewake loops since for example fw_domains_get can be the hottest i915 
function on some benchmarks.

(This area slightly annoys me anyway with redundant looping over 
forcewake domains and we could also potentially optimize the ack waiting 
by first requesting all we want, and then doing the waits. That would be 
one additional loop, but if removed the other one, code would stay at 
the same number of domain loops.)

Regards,

Tvrtko

Chris Wilson March 23, 2016, 4:40 p.m. UTC | #6

On Wed, Mar 23, 2016 at 04:24:48PM +0000, Tvrtko Ursulin wrote:
> Biggest thing to make sure is that you don't add a lot of cycles to
> the forcewake loops since for example fw_domains_get can be the
> hottest i915 function on some benchmarks.
> 
> (This area slightly annoys me anyway with redundant looping over
> forcewake domains and we could also potentially optimize the ack
> waiting by first requesting all we want, and then doing the waits.
> That would be one additional loop, but if removed the other one,
> code would stay at the same number of domain loops.)

I hear you. I just end up weeping in the corner when I see fw_domain_get
on the profile.

We already do have a mitigation scheme to hold onto the forcewake for an
extra jiffie every time. I don't like it, but without it fw_domains_get
becomes a real hog.

Note that one thing we can actually do is restrict the domains we wakeup
for the engines (engine->fw_domain) in execlists_submit, that should
help chv/skl+ a small amount.

I don't have a good idea for how to keep rc6 residency high but avoid
forcewake when those darn elsp require forcewake. As does gen6+ legacy
RING_TAIL writes. And even then that spinlock causes quite a bit of
traffic when it shouldn't be contended. I've been thinking of whether we
can have multiple locks (hashed by register) but we would then still
need some cross-communication for the common forcewake.
-Chris

Tvrtko Ursulin March 24, 2016, 11:37 a.m. UTC | #7

On 23/03/16 16:40, Chris Wilson wrote:
> On Wed, Mar 23, 2016 at 04:24:48PM +0000, Tvrtko Ursulin wrote:
>> Biggest thing to make sure is that you don't add a lot of cycles to
>> the forcewake loops since for example fw_domains_get can be the
>> hottest i915 function on some benchmarks.
>>
>> (This area slightly annoys me anyway with redundant looping over
>> forcewake domains and we could also potentially optimize the ack
>> waiting by first requesting all we want, and then doing the waits.
>> That would be one additional loop, but if removed the other one,
>> code would stay at the same number of domain loops.)
>
> I hear you. I just end up weeping in the corner when I see fw_domain_get
> on the profile.
>
> We already do have a mitigation scheme to hold onto the forcewake for an
> extra jiffie every time. I don't like it, but without it fw_domains_get
> becomes a real hog.

I am pretty sure I've seen some tests which somehow defeat the jiffie 
delay and we end up re-acquiring every ms/jiffie. This is something I 
wanted to get to the bottom of but did not get round to yet. It was 
totally unexpected because the test is hammering on everything.

> Note that one thing we can actually do is restrict the domains we wakeup
> for the engines (engine->fw_domain) in execlists_submit, that should
> help chv/skl+ a small amount.

I even have a patch to do that somewhere. :)

> I don't have a good idea for how to keep rc6 residency high but avoid
> forcewake when those darn elsp require forcewake. As does gen6+ legacy
> RING_TAIL writes. And even then that spinlock causes quite a bit of
> traffic when it shouldn't be contended. I've been thinking of whether we
> can have multiple locks (hashed by register) but we would then still
> need some cross-communication for the common forcewake.

Maybe it is not worth it at this point. This is pretty well optimised 
now and could switch to the next target. Like maybe move to active and 
retire__read, or retired_req_list, or something.

Regards,

Tvrtko

Chris Wilson March 24, 2016, 12:27 p.m. UTC | #8

On Thu, Mar 24, 2016 at 11:37:07AM +0000, Tvrtko Ursulin wrote:
> 
> On 23/03/16 16:40, Chris Wilson wrote:
> >On Wed, Mar 23, 2016 at 04:24:48PM +0000, Tvrtko Ursulin wrote:
> >>Biggest thing to make sure is that you don't add a lot of cycles to
> >>the forcewake loops since for example fw_domains_get can be the
> >>hottest i915 function on some benchmarks.
> >>
> >>(This area slightly annoys me anyway with redundant looping over
> >>forcewake domains and we could also potentially optimize the ack
> >>waiting by first requesting all we want, and then doing the waits.
> >>That would be one additional loop, but if removed the other one,
> >>code would stay at the same number of domain loops.)
> >
> >I hear you. I just end up weeping in the corner when I see fw_domain_get
> >on the profile.
> >
> >We already do have a mitigation scheme to hold onto the forcewake for an
> >extra jiffie every time. I don't like it, but without it fw_domains_get
> >becomes a real hog.
> 
> I am pretty sure I've seen some tests which somehow defeat the
> jiffie delay and we end up re-acquiring every ms/jiffie. This is
> something I wanted to get to the bottom of but did not get round to
> yet. It was totally unexpected because the test is hammering on
> everything.

Absolutely sure it is not just the delay in acquiring the ack? And
spinning on waiting for the thread_c0 doesn't come cheap? I've just
written off fw_domain_get being high on the profiles simply due to that
we have to spin so long (I'm jaded because on Sandybridge spinning for
50us+ isn't uncommon iirc).

You're right though, we should instrument it and check it is working.

> >Note that one thing we can actually do is restrict the domains we wakeup
> >for the engines (engine->fw_domain) in execlists_submit, that should
> >help chv/skl+ a small amount.
> 
> I even have a patch to do that somewhere. :)

So did I!

> >I don't have a good idea for how to keep rc6 residency high but avoid
> >forcewake when those darn elsp require forcewake. As does gen6+ legacy
> >RING_TAIL writes. And even then that spinlock causes quite a bit of
> >traffic when it shouldn't be contended. I've been thinking of whether we
> >can have multiple locks (hashed by register) but we would then still
> >need some cross-communication for the common forcewake.
> 
> Maybe it is not worth it at this point. This is pretty well
> optimised now and could switch to the next target. Like maybe move
> to active and retire__read, or retired_req_list, or something.

There is a challenge in that we are both lazy in how often we check for
retirement/idleness (because it's work that we can postpone and if we do
so, quite often that work is no longer required!) and that just because
the driver believes the GPU should be busy, it can quite happily power
itself down between operations (though that's really for gen6/gen7
semaphores with the current state of the driver).

I don't have a better plan then to reduce the mmio and spinlocks where
possible.
-Chris

Tvrtko Ursulin March 24, 2016, 1:06 p.m. UTC | #9

On 24/03/16 12:27, Chris Wilson wrote:
> On Thu, Mar 24, 2016 at 11:37:07AM +0000, Tvrtko Ursulin wrote:
>>
>> On 23/03/16 16:40, Chris Wilson wrote:
>>> On Wed, Mar 23, 2016 at 04:24:48PM +0000, Tvrtko Ursulin wrote:
>>>> Biggest thing to make sure is that you don't add a lot of cycles to
>>>> the forcewake loops since for example fw_domains_get can be the
>>>> hottest i915 function on some benchmarks.
>>>>
>>>> (This area slightly annoys me anyway with redundant looping over
>>>> forcewake domains and we could also potentially optimize the ack
>>>> waiting by first requesting all we want, and then doing the waits.
>>>> That would be one additional loop, but if removed the other one,
>>>> code would stay at the same number of domain loops.)
>>>
>>> I hear you. I just end up weeping in the corner when I see fw_domain_get
>>> on the profile.
>>>
>>> We already do have a mitigation scheme to hold onto the forcewake for an
>>> extra jiffie every time. I don't like it, but without it fw_domains_get
>>> becomes a real hog.
>>
>> I am pretty sure I've seen some tests which somehow defeat the
>> jiffie delay and we end up re-acquiring every ms/jiffie. This is
>> something I wanted to get to the bottom of but did not get round to
>> yet. It was totally unexpected because the test is hammering on
>> everything.
>
> Absolutely sure it is not just the delay in acquiring the ack? And
> spinning on waiting for the thread_c0 doesn't come cheap? I've just
> written off fw_domain_get being high on the profiles simply due to that
> we have to spin so long (I'm jaded because on Sandybridge spinning for
> 50us+ isn't uncommon iirc).

I am not sure, I just know I had a printk in the timer release and it 
was firing every millisecond which completely perplexed me since I was 
running gem_exec_nop/all at the time.

Good point on that the cost might actually be in the wait for acks.

Regards,

Tvrtko

Chris Wilson March 24, 2016, 1:16 p.m. UTC | #10

On Thu, Mar 24, 2016 at 01:06:41PM +0000, Tvrtko Ursulin wrote:
> 
> On 24/03/16 12:27, Chris Wilson wrote:
> >On Thu, Mar 24, 2016 at 11:37:07AM +0000, Tvrtko Ursulin wrote:
> >>
> >>On 23/03/16 16:40, Chris Wilson wrote:
> >>>On Wed, Mar 23, 2016 at 04:24:48PM +0000, Tvrtko Ursulin wrote:
> >>>>Biggest thing to make sure is that you don't add a lot of cycles to
> >>>>the forcewake loops since for example fw_domains_get can be the
> >>>>hottest i915 function on some benchmarks.
> >>>>
> >>>>(This area slightly annoys me anyway with redundant looping over
> >>>>forcewake domains and we could also potentially optimize the ack
> >>>>waiting by first requesting all we want, and then doing the waits.
> >>>>That would be one additional loop, but if removed the other one,
> >>>>code would stay at the same number of domain loops.)
> >>>
> >>>I hear you. I just end up weeping in the corner when I see fw_domain_get
> >>>on the profile.
> >>>
> >>>We already do have a mitigation scheme to hold onto the forcewake for an
> >>>extra jiffie every time. I don't like it, but without it fw_domains_get
> >>>becomes a real hog.
> >>
> >>I am pretty sure I've seen some tests which somehow defeat the
> >>jiffie delay and we end up re-acquiring every ms/jiffie. This is
> >>something I wanted to get to the bottom of but did not get round to
> >>yet. It was totally unexpected because the test is hammering on
> >>everything.
> >
> >Absolutely sure it is not just the delay in acquiring the ack? And
> >spinning on waiting for the thread_c0 doesn't come cheap? I've just
> >written off fw_domain_get being high on the profiles simply due to that
> >we have to spin so long (I'm jaded because on Sandybridge spinning for
> >50us+ isn't uncommon iirc).
> 
> I am not sure, I just know I had a printk in the timer release and
> it was firing every millisecond which completely perplexed me since
> I was running gem_exec_nop/all at the time.

Well we don't need to arm the timer in both the get and put, do we Mika!

Mika, please send a fix so we only arm the timer when putting. And blame
the reviewer.

Oh, that is bad. /o\
-Chris

Chris Wilson March 24, 2016, 1:31 p.m. UTC | #11

On Thu, Mar 24, 2016 at 01:16:40PM +0000, Chris Wilson wrote:
> On Thu, Mar 24, 2016 at 01:06:41PM +0000, Tvrtko Ursulin wrote:
> > 
> > On 24/03/16 12:27, Chris Wilson wrote:
> > >On Thu, Mar 24, 2016 at 11:37:07AM +0000, Tvrtko Ursulin wrote:
> > >>
> > >>On 23/03/16 16:40, Chris Wilson wrote:
> > >>>On Wed, Mar 23, 2016 at 04:24:48PM +0000, Tvrtko Ursulin wrote:
> > >>>>Biggest thing to make sure is that you don't add a lot of cycles to
> > >>>>the forcewake loops since for example fw_domains_get can be the
> > >>>>hottest i915 function on some benchmarks.
> > >>>>
> > >>>>(This area slightly annoys me anyway with redundant looping over
> > >>>>forcewake domains and we could also potentially optimize the ack
> > >>>>waiting by first requesting all we want, and then doing the waits.
> > >>>>That would be one additional loop, but if removed the other one,
> > >>>>code would stay at the same number of domain loops.)
> > >>>
> > >>>I hear you. I just end up weeping in the corner when I see fw_domain_get
> > >>>on the profile.
> > >>>
> > >>>We already do have a mitigation scheme to hold onto the forcewake for an
> > >>>extra jiffie every time. I don't like it, but without it fw_domains_get
> > >>>becomes a real hog.
> > >>
> > >>I am pretty sure I've seen some tests which somehow defeat the
> > >>jiffie delay and we end up re-acquiring every ms/jiffie. This is
> > >>something I wanted to get to the bottom of but did not get round to
> > >>yet. It was totally unexpected because the test is hammering on
> > >>everything.
> > >
> > >Absolutely sure it is not just the delay in acquiring the ack? And
> > >spinning on waiting for the thread_c0 doesn't come cheap? I've just
> > >written off fw_domain_get being high on the profiles simply due to that
> > >we have to spin so long (I'm jaded because on Sandybridge spinning for
> > >50us+ isn't uncommon iirc).
> > 
> > I am not sure, I just know I had a printk in the timer release and
> > it was firing every millisecond which completely perplexed me since
> > I was running gem_exec_nop/all at the time.
> 
> Well we don't need to arm the timer in both the get and put, do we Mika!
> 
> Mika, please send a fix so we only arm the timer when putting. And blame
> the reviewer.

Even worse, you copied my mistake! Darn.
-Chris

drm/i915: Replace some more busy waits with normal ones

Commit Message

Comments

Patch