[4/5] drm/i915: Do not lie about atomic wait granularity
diff mbox

Message ID 1457015805-23742-4-git-send-email-tvrtko.ursulin@linux.intel.com
State New
Headers show

Commit Message

Tvrtko Ursulin March 3, 2016, 2:36 p.m. UTC
From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Currently the wait_for_atomic_us only allows for a millisecond
granularity which is not nice towards callers requesting small
micro-second waits.

Re-implement it so micro-second granularity is really supported
and not just in the name of the macro.

This has another beneficial side effect that it improves
"gem_latency -n 100" results by approximately 2.5% (throughput
and latencies) and 3% (CPU usage). (Note this improvement is
relative to not yet merged execlist lock uncontention patch
which moves the CSB MMIO outside this lock.)

v2:
  * Warn when used from non-atomic context (if possible).
  * Warn on too long atomic waits.

v3:
  * Added comment explaining CONFIG_PREEMPT_COUNT.
  * Fixed pre-processor indentation.
  (Chris Wilson)

v4:
 * Commit msg update (gem_latency) and rebase.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/intel_drv.h | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

Comments

Chris Wilson March 3, 2016, 3:11 p.m. UTC | #1
On Thu, Mar 03, 2016 at 02:36:44PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> Currently the wait_for_atomic_us only allows for a millisecond
> granularity which is not nice towards callers requesting small
> micro-second waits.

Hmm, by granularity I think of the interval between COND reads. That
should be the limiting factor in how fast we respond to the change in
event (and so for the atomic variants should be virtually the same).

Your suggestion that it is icache or similar is probably along the right
path. A quick code size measurement of a test function would be a good
supporting argument.
 
> Re-implement it so micro-second granularity is really supported
> and not just in the name of the macro.
> 
> This has another beneficial side effect that it improves
> "gem_latency -n 100" results by approximately 2.5% (throughput
> and latencies) and 3% (CPU usage). (Note this improvement is
> relative to not yet merged execlist lock uncontention patch
> which moves the CSB MMIO outside this lock.)
> 
> v2:
>   * Warn when used from non-atomic context (if possible).
>   * Warn on too long atomic waits.
> 
> v3:
>   * Added comment explaining CONFIG_PREEMPT_COUNT.
>   * Fixed pre-processor indentation.
>   (Chris Wilson)
> 
> v4:
>  * Commit msg update (gem_latency) and rebase.
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/intel_drv.h | 31 +++++++++++++++++++++++++++++--
>  1 file changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> index c2a62e9554b3..b6fcb5c800e7 100644
> --- a/drivers/gpu/drm/i915/intel_drv.h
> +++ b/drivers/gpu/drm/i915/intel_drv.h
> @@ -44,6 +44,10 @@
>   * contexts. Note that it's important that we check the condition again after
>   * having timed out, since the timeout could be due to preemption or similar and
>   * we've never had a chance to check the condition before the timeout.
> + *
> + * TODO: When modesetting has fully transitioned to atomic, the below
> + * drm_can_sleep() can be removed and in_atomic()/!in_atomic() asserts
> + * added.
>   */
>  #define _wait_for(COND, US, W) ({ \
>  	unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1;	\
> @@ -66,8 +70,31 @@
>  #define wait_for(COND, MS)	  	_wait_for((COND), (MS) * 1000, 1000)
>  #define wait_for_us(COND, US)	  	_wait_for((COND), (US), 1)
>  
> -#define wait_for_atomic(COND, MS) 	_wait_for((COND), (MS) * 1000, 0)
> -#define wait_for_atomic_us(COND, US)	_wait_for((COND), (US), 0)
> +/* If CONFIG_PREEMPT_COUNT is disabled, in_atomic() always reports false. */
> +#if defined(CONFIG_DRM_I915_DEBUG) && defined(CONFIG_PREEMPT_COUNT)
> +# define _WAIT_FOR_ATOMIC_CHECK WARN_ON_ONCE(!in_atomic())
> +#else
> +# define _WAIT_FOR_ATOMIC_CHECK do { } while(0)
> +#endif
> +
> +#define _wait_for_atomic(COND, US) ({ \
> +	unsigned long end__; \
> +	int ret__ = 0; \
> +	_WAIT_FOR_ATOMIC_CHECK; \
> +	BUILD_BUG_ON((US) > 50000); \
> +	end__ = (local_clock() >> 10) + (US) + 1; \
> +	while (!(COND)) { \
> +		if (time_after((unsigned long)(local_clock() >> 10), end__)) { \

A comment here about why we can forgo the COND recheck would be
worthwhile.

/* Unlike the regular wait_for(), this atomic variant cannot be
 * preempted (and we'll just ignore the issue of irq interruptions) and
 * so we know that no time has passed since the last check of COND and
 * can immediately report the timeout.
 */
-Chris
Tvrtko Ursulin March 3, 2016, 3:43 p.m. UTC | #2
On 03/03/16 15:11, Chris Wilson wrote:
> On Thu, Mar 03, 2016 at 02:36:44PM +0000, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> Currently the wait_for_atomic_us only allows for a millisecond
>> granularity which is not nice towards callers requesting small
>> micro-second waits.
>
> Hmm, by granularity I think of the interval between COND reads. That
> should be the limiting factor in how fast we respond to the change in
> event (and so for the atomic variants should be virtually the same).

Yeah not that granularity, should I say "timeout granularity" then in 
the commit? Before if someone wanted to wait for, say 10us, it would 
have busy looped for one jiffie instead. In the timeout scenario that 
is. That is the headline improvement.

> Your suggestion that it is icache or similar is probably along the right
> path. A quick code size measurement of a test function would be a good
> supporting argument.

I am not sure. It shaves ~3.3% (12 of original 353 bytes) of the whole 
fw_domains_get which even if it is quite hot as you know, and even 3% 
from the main loop in wait for atomic (3% here is a glorious one byte). 
So I am not sure, but gem_latency results were quite convincing. 
Unexplained for me.

>> Re-implement it so micro-second granularity is really supported
>> and not just in the name of the macro.
>>
>> This has another beneficial side effect that it improves
>> "gem_latency -n 100" results by approximately 2.5% (throughput
>> and latencies) and 3% (CPU usage). (Note this improvement is
>> relative to not yet merged execlist lock uncontention patch
>> which moves the CSB MMIO outside this lock.)
>>
>> v2:
>>    * Warn when used from non-atomic context (if possible).
>>    * Warn on too long atomic waits.
>>
>> v3:
>>    * Added comment explaining CONFIG_PREEMPT_COUNT.
>>    * Fixed pre-processor indentation.
>>    (Chris Wilson)
>>
>> v4:
>>   * Commit msg update (gem_latency) and rebase.
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
>> ---
>>   drivers/gpu/drm/i915/intel_drv.h | 31 +++++++++++++++++++++++++++++--
>>   1 file changed, 29 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
>> index c2a62e9554b3..b6fcb5c800e7 100644
>> --- a/drivers/gpu/drm/i915/intel_drv.h
>> +++ b/drivers/gpu/drm/i915/intel_drv.h
>> @@ -44,6 +44,10 @@
>>    * contexts. Note that it's important that we check the condition again after
>>    * having timed out, since the timeout could be due to preemption or similar and
>>    * we've never had a chance to check the condition before the timeout.
>> + *
>> + * TODO: When modesetting has fully transitioned to atomic, the below
>> + * drm_can_sleep() can be removed and in_atomic()/!in_atomic() asserts
>> + * added.
>>    */
>>   #define _wait_for(COND, US, W) ({ \
>>   	unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1;	\
>> @@ -66,8 +70,31 @@
>>   #define wait_for(COND, MS)	  	_wait_for((COND), (MS) * 1000, 1000)
>>   #define wait_for_us(COND, US)	  	_wait_for((COND), (US), 1)
>>
>> -#define wait_for_atomic(COND, MS) 	_wait_for((COND), (MS) * 1000, 0)
>> -#define wait_for_atomic_us(COND, US)	_wait_for((COND), (US), 0)
>> +/* If CONFIG_PREEMPT_COUNT is disabled, in_atomic() always reports false. */
>> +#if defined(CONFIG_DRM_I915_DEBUG) && defined(CONFIG_PREEMPT_COUNT)
>> +# define _WAIT_FOR_ATOMIC_CHECK WARN_ON_ONCE(!in_atomic())
>> +#else
>> +# define _WAIT_FOR_ATOMIC_CHECK do { } while(0)
>> +#endif
>> +
>> +#define _wait_for_atomic(COND, US) ({ \
>> +	unsigned long end__; \
>> +	int ret__ = 0; \
>> +	_WAIT_FOR_ATOMIC_CHECK; \
>> +	BUILD_BUG_ON((US) > 50000); \
>> +	end__ = (local_clock() >> 10) + (US) + 1; \
>> +	while (!(COND)) { \
>> +		if (time_after((unsigned long)(local_clock() >> 10), end__)) { \
>
> A comment here about why we can forgo the COND recheck would be
> worthwhile.
>
> /* Unlike the regular wait_for(), this atomic variant cannot be
>   * preempted (and we'll just ignore the issue of irq interruptions) and
>   * so we know that no time has passed since the last check of COND and
>   * can immediately report the timeout.
>   */

Will add.

Regards,

Tvrtko
Chris Wilson March 3, 2016, 4:10 p.m. UTC | #3
On Thu, Mar 03, 2016 at 03:43:49PM +0000, Tvrtko Ursulin wrote:
> 
> On 03/03/16 15:11, Chris Wilson wrote:
> >On Thu, Mar 03, 2016 at 02:36:44PM +0000, Tvrtko Ursulin wrote:
> >>From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>
> >>Currently the wait_for_atomic_us only allows for a millisecond
> >>granularity which is not nice towards callers requesting small
> >>micro-second waits.
> >
> >Hmm, by granularity I think of the interval between COND reads. That
> >should be the limiting factor in how fast we respond to the change in
> >event (and so for the atomic variants should be virtually the same).
> 
> Yeah not that granularity, should I say "timeout granularity" then
> in the commit? Before if someone wanted to wait for, say 10us, it
> would have busy looped for one jiffie instead. In the timeout
> scenario that is. That is the headline improvement.

"timeout granularity" would be much clearer

> >Your suggestion that it is icache or similar is probably along the right
> >path. A quick code size measurement of a test function would be a good
> >supporting argument.
> 
> I am not sure. It shaves ~3.3% (12 of original 353 bytes) of the
> whole fw_domains_get which even if it is quite hot as you know, and
> even 3% from the main loop in wait for atomic (3% here is a glorious
> one byte). So I am not sure, but gem_latency results were quite
> convincing. Unexplained for me.

Ok. The small improvement in addition to stronger compile-time checks is
certainly enough justification. The motivation is usually to avoid
hitting the wait_for(register) loop in the first place, and a good
mystery is always a good read for a rainy day.
-Chris

Patch
diff mbox

diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
index c2a62e9554b3..b6fcb5c800e7 100644
--- a/drivers/gpu/drm/i915/intel_drv.h
+++ b/drivers/gpu/drm/i915/intel_drv.h
@@ -44,6 +44,10 @@ 
  * contexts. Note that it's important that we check the condition again after
  * having timed out, since the timeout could be due to preemption or similar and
  * we've never had a chance to check the condition before the timeout.
+ *
+ * TODO: When modesetting has fully transitioned to atomic, the below
+ * drm_can_sleep() can be removed and in_atomic()/!in_atomic() asserts
+ * added.
  */
 #define _wait_for(COND, US, W) ({ \
 	unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1;	\
@@ -66,8 +70,31 @@ 
 #define wait_for(COND, MS)	  	_wait_for((COND), (MS) * 1000, 1000)
 #define wait_for_us(COND, US)	  	_wait_for((COND), (US), 1)
 
-#define wait_for_atomic(COND, MS) 	_wait_for((COND), (MS) * 1000, 0)
-#define wait_for_atomic_us(COND, US)	_wait_for((COND), (US), 0)
+/* If CONFIG_PREEMPT_COUNT is disabled, in_atomic() always reports false. */
+#if defined(CONFIG_DRM_I915_DEBUG) && defined(CONFIG_PREEMPT_COUNT)
+# define _WAIT_FOR_ATOMIC_CHECK WARN_ON_ONCE(!in_atomic())
+#else
+# define _WAIT_FOR_ATOMIC_CHECK do { } while(0)
+#endif
+
+#define _wait_for_atomic(COND, US) ({ \
+	unsigned long end__; \
+	int ret__ = 0; \
+	_WAIT_FOR_ATOMIC_CHECK; \
+	BUILD_BUG_ON((US) > 50000); \
+	end__ = (local_clock() >> 10) + (US) + 1; \
+	while (!(COND)) { \
+		if (time_after((unsigned long)(local_clock() >> 10), end__)) { \
+			ret__ = -ETIMEDOUT; \
+			break; \
+		} \
+		cpu_relax(); \
+	} \
+	ret__; \
+})
+
+#define wait_for_atomic(COND, MS)	_wait_for_atomic((COND), (MS) * 1000)
+#define wait_for_atomic_us(COND, US)	_wait_for_atomic((COND), (US))
 
 #define KHz(x) (1000 * (x))
 #define MHz(x) KHz(1000 * (x))