[4/5] drm/i915: Do not lie about atomic wait granularity

Message ID	1457015805-23742-4-git-send-email-tvrtko.ursulin@linux.intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <intel-gfx-bounces@lists.freedesktop.org> From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> To: Intel-gfx@lists.freedesktop.org Date: Thu, 3 Mar 2016 14:36:44 +0000 Message-Id: <1457015805-23742-4-git-send-email-tvrtko.ursulin@linux.intel.com> In-Reply-To: <1457015805-23742-1-git-send-email-tvrtko.ursulin@linux.intel.com> References: <1457015805-23742-1-git-send-email-tvrtko.ursulin@linux.intel.com> Subject: [Intel-gfx] [PATCH 4/5] drm/i915: Do not lie about atomic wait granularity Precedence: list MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

Message ID

1457015805-23742-4-git-send-email-tvrtko.ursulin@linux.intel.com (mailing list archive)

State

New, archived

Headers

From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Intel-gfx@lists.freedesktop.org
Date: Thu,  3 Mar 2016 14:36:44 +0000
Message-Id: <1457015805-23742-4-git-send-email-tvrtko.ursulin@linux.intel.com>
In-Reply-To: <1457015805-23742-1-git-send-email-tvrtko.ursulin@linux.intel.com>
References: <1457015805-23742-1-git-send-email-tvrtko.ursulin@linux.intel.com>
Subject: [Intel-gfx] [PATCH 4/5] drm/i915: Do not lie about atomic wait
	granularity
Precedence: list
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Errors-To: intel-gfx-bounces@lists.freedesktop.org
Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

Commit Message

Tvrtko Ursulin March 3, 2016, 2:36 p.m. UTC

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Currently the wait_for_atomic_us only allows for a millisecond
granularity which is not nice towards callers requesting small
micro-second waits.

Re-implement it so micro-second granularity is really supported
and not just in the name of the macro.

This has another beneficial side effect that it improves
"gem_latency -n 100" results by approximately 2.5% (throughput
and latencies) and 3% (CPU usage). (Note this improvement is
relative to not yet merged execlist lock uncontention patch
which moves the CSB MMIO outside this lock.)

v2:
  * Warn when used from non-atomic context (if possible).
  * Warn on too long atomic waits.

v3:
  * Added comment explaining CONFIG_PREEMPT_COUNT.
  * Fixed pre-processor indentation.
  (Chris Wilson)

v4:
 * Commit msg update (gem_latency) and rebase.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/intel_drv.h | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

Comments

Chris Wilson March 3, 2016, 3:11 p.m. UTC | #1

On Thu, Mar 03, 2016 at 02:36:44PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> Currently the wait_for_atomic_us only allows for a millisecond
> granularity which is not nice towards callers requesting small
> micro-second waits.

Hmm, by granularity I think of the interval between COND reads. That
should be the limiting factor in how fast we respond to the change in
event (and so for the atomic variants should be virtually the same).

Your suggestion that it is icache or similar is probably along the right
path. A quick code size measurement of a test function would be a good
supporting argument.
 
> Re-implement it so micro-second granularity is really supported
> and not just in the name of the macro.
> 
> This has another beneficial side effect that it improves
> "gem_latency -n 100" results by approximately 2.5% (throughput
> and latencies) and 3% (CPU usage). (Note this improvement is
> relative to not yet merged execlist lock uncontention patch
> which moves the CSB MMIO outside this lock.)
> 
> v2:
>   * Warn when used from non-atomic context (if possible).
>   * Warn on too long atomic waits.
> 
> v3:
>   * Added comment explaining CONFIG_PREEMPT_COUNT.
>   * Fixed pre-processor indentation.
>   (Chris Wilson)
> 
> v4:
>  * Commit msg update (gem_latency) and rebase.
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/intel_drv.h | 31 +++++++++++++++++++++++++++++--
>  1 file changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> index c2a62e9554b3..b6fcb5c800e7 100644
> --- a/drivers/gpu/drm/i915/intel_drv.h
> +++ b/drivers/gpu/drm/i915/intel_drv.h
> @@ -44,6 +44,10 @@
>   * contexts. Note that it's important that we check the condition again after
>   * having timed out, since the timeout could be due to preemption or similar and
>   * we've never had a chance to check the condition before the timeout.
> + *
> + * TODO: When modesetting has fully transitioned to atomic, the below
> + * drm_can_sleep() can be removed and in_atomic()/!in_atomic() asserts
> + * added.
>   */
>  #define _wait_for(COND, US, W) ({ \
>  	unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1;	\
> @@ -66,8 +70,31 @@
>  #define wait_for(COND, MS)	  	_wait_for((COND), (MS) * 1000, 1000)
>  #define wait_for_us(COND, US)	  	_wait_for((COND), (US), 1)
>  
> -#define wait_for_atomic(COND, MS) 	_wait_for((COND), (MS) * 1000, 0)
> -#define wait_for_atomic_us(COND, US)	_wait_for((COND), (US), 0)
> +/* If CONFIG_PREEMPT_COUNT is disabled, in_atomic() always reports false. */
> +#if defined(CONFIG_DRM_I915_DEBUG) && defined(CONFIG_PREEMPT_COUNT)
> +# define _WAIT_FOR_ATOMIC_CHECK WARN_ON_ONCE(!in_atomic())
> +#else
> +# define _WAIT_FOR_ATOMIC_CHECK do { } while(0)
> +#endif
> +
> +#define _wait_for_atomic(COND, US) ({ \
> +	unsigned long end__; \
> +	int ret__ = 0; \
> +	_WAIT_FOR_ATOMIC_CHECK; \
> +	BUILD_BUG_ON((US) > 50000); \
> +	end__ = (local_clock() >> 10) + (US) + 1; \
> +	while (!(COND)) { \
> +		if (time_after((unsigned long)(local_clock() >> 10), end__)) { \

A comment here about why we can forgo the COND recheck would be
worthwhile.

/* Unlike the regular wait_for(), this atomic variant cannot be
 * preempted (and we'll just ignore the issue of irq interruptions) and
 * so we know that no time has passed since the last check of COND and
 * can immediately report the timeout.
 */
-Chris

Tvrtko Ursulin March 3, 2016, 3:43 p.m. UTC | #2

On 03/03/16 15:11, Chris Wilson wrote:
> On Thu, Mar 03, 2016 at 02:36:44PM +0000, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> Currently the wait_for_atomic_us only allows for a millisecond
>> granularity which is not nice towards callers requesting small
>> micro-second waits.
>
> Hmm, by granularity I think of the interval between COND reads. That
> should be the limiting factor in how fast we respond to the change in
> event (and so for the atomic variants should be virtually the same).

Yeah not that granularity, should I say "timeout granularity" then in 
the commit? Before if someone wanted to wait for, say 10us, it would 
have busy looped for one jiffie instead. In the timeout scenario that 
is. That is the headline improvement.

> Your suggestion that it is icache or similar is probably along the right
> path. A quick code size measurement of a test function would be a good
> supporting argument.

I am not sure. It shaves ~3.3% (12 of original 353 bytes) of the whole 
fw_domains_get which even if it is quite hot as you know, and even 3% 
from the main loop in wait for atomic (3% here is a glorious one byte). 
So I am not sure, but gem_latency results were quite convincing. 
Unexplained for me.

>> Re-implement it so micro-second granularity is really supported
>> and not just in the name of the macro.
>>
>> This has another beneficial side effect that it improves
>> "gem_latency -n 100" results by approximately 2.5% (throughput
>> and latencies) and 3% (CPU usage). (Note this improvement is
>> relative to not yet merged execlist lock uncontention patch
>> which moves the CSB MMIO outside this lock.)
>>
>> v2:
>>    * Warn when used from non-atomic context (if possible).
>>    * Warn on too long atomic waits.
>>
>> v3:
>>    * Added comment explaining CONFIG_PREEMPT_COUNT.
>>    * Fixed pre-processor indentation.
>>    (Chris Wilson)
>>
>> v4:
>>   * Commit msg update (gem_latency) and rebase.
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
>> ---
>>   drivers/gpu/drm/i915/intel_drv.h | 31 +++++++++++++++++++++++++++++--
>>   1 file changed, 29 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
>> index c2a62e9554b3..b6fcb5c800e7 100644
>> --- a/drivers/gpu/drm/i915/intel_drv.h
>> +++ b/drivers/gpu/drm/i915/intel_drv.h
>> @@ -44,6 +44,10 @@
>>    * contexts. Note that it's important that we check the condition again after
>>    * having timed out, since the timeout could be due to preemption or similar and
>>    * we've never had a chance to check the condition before the timeout.
>> + *
>> + * TODO: When modesetting has fully transitioned to atomic, the below
>> + * drm_can_sleep() can be removed and in_atomic()/!in_atomic() asserts
>> + * added.
>>    */
>>   #define _wait_for(COND, US, W) ({ \
>>   	unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1;	\
>> @@ -66,8 +70,31 @@
>>   #define wait_for(COND, MS)	  	_wait_for((COND), (MS) * 1000, 1000)
>>   #define wait_for_us(COND, US)	  	_wait_for((COND), (US), 1)
>>
>> -#define wait_for_atomic(COND, MS) 	_wait_for((COND), (MS) * 1000, 0)
>> -#define wait_for_atomic_us(COND, US)	_wait_for((COND), (US), 0)
>> +/* If CONFIG_PREEMPT_COUNT is disabled, in_atomic() always reports false. */
>> +#if defined(CONFIG_DRM_I915_DEBUG) && defined(CONFIG_PREEMPT_COUNT)
>> +# define _WAIT_FOR_ATOMIC_CHECK WARN_ON_ONCE(!in_atomic())
>> +#else
>> +# define _WAIT_FOR_ATOMIC_CHECK do { } while(0)
>> +#endif
>> +
>> +#define _wait_for_atomic(COND, US) ({ \
>> +	unsigned long end__; \
>> +	int ret__ = 0; \
>> +	_WAIT_FOR_ATOMIC_CHECK; \
>> +	BUILD_BUG_ON((US) > 50000); \
>> +	end__ = (local_clock() >> 10) + (US) + 1; \
>> +	while (!(COND)) { \
>> +		if (time_after((unsigned long)(local_clock() >> 10), end__)) { \
>
> A comment here about why we can forgo the COND recheck would be
> worthwhile.
>
> /* Unlike the regular wait_for(), this atomic variant cannot be
>   * preempted (and we'll just ignore the issue of irq interruptions) and
>   * so we know that no time has passed since the last check of COND and
>   * can immediately report the timeout.
>   */

Will add.

Regards,

Tvrtko

Chris Wilson March 3, 2016, 4:10 p.m. UTC | #3

On Thu, Mar 03, 2016 at 03:43:49PM +0000, Tvrtko Ursulin wrote:
> 
> On 03/03/16 15:11, Chris Wilson wrote:
> >On Thu, Mar 03, 2016 at 02:36:44PM +0000, Tvrtko Ursulin wrote:
> >>From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>
> >>Currently the wait_for_atomic_us only allows for a millisecond
> >>granularity which is not nice towards callers requesting small
> >>micro-second waits.
> >
> >Hmm, by granularity I think of the interval between COND reads. That
> >should be the limiting factor in how fast we respond to the change in
> >event (and so for the atomic variants should be virtually the same).
> 
> Yeah not that granularity, should I say "timeout granularity" then
> in the commit? Before if someone wanted to wait for, say 10us, it
> would have busy looped for one jiffie instead. In the timeout
> scenario that is. That is the headline improvement.

"timeout granularity" would be much clearer

> >Your suggestion that it is icache or similar is probably along the right
> >path. A quick code size measurement of a test function would be a good
> >supporting argument.
> 
> I am not sure. It shaves ~3.3% (12 of original 353 bytes) of the
> whole fw_domains_get which even if it is quite hot as you know, and
> even 3% from the main loop in wait for atomic (3% here is a glorious
> one byte). So I am not sure, but gem_latency results were quite
> convincing. Unexplained for me.

Ok. The small improvement in addition to stronger compile-time checks is
certainly enough justification. The motivation is usually to avoid
hitting the wait_for(register) loop in the first place, and a good
mystery is always a good read for a rainy day.
-Chris

diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
index c2a62e9554b3..b6fcb5c800e7 100644
--- a/drivers/gpu/drm/i915/intel_drv.h
+++ b/drivers/gpu/drm/i915/intel_drv.h
@@ -44,6 +44,10 @@ 
  * contexts. Note that it's important that we check the condition again after
  * having timed out, since the timeout could be due to preemption or similar and
  * we've never had a chance to check the condition before the timeout.
+ *
+ * TODO: When modesetting has fully transitioned to atomic, the below
+ * drm_can_sleep() can be removed and in_atomic()/!in_atomic() asserts
+ * added.
  */
 #define _wait_for(COND, US, W) ({ \
 	unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1;	\
@@ -66,8 +70,31 @@ 
 #define wait_for(COND, MS)	  	_wait_for((COND), (MS) * 1000, 1000)
 #define wait_for_us(COND, US)	  	_wait_for((COND), (US), 1)
 
-#define wait_for_atomic(COND, MS) 	_wait_for((COND), (MS) * 1000, 0)
-#define wait_for_atomic_us(COND, US)	_wait_for((COND), (US), 0)
+/* If CONFIG_PREEMPT_COUNT is disabled, in_atomic() always reports false. */
+#if defined(CONFIG_DRM_I915_DEBUG) && defined(CONFIG_PREEMPT_COUNT)
+# define _WAIT_FOR_ATOMIC_CHECK WARN_ON_ONCE(!in_atomic())
+#else
+# define _WAIT_FOR_ATOMIC_CHECK do { } while(0)
+#endif
+
+#define _wait_for_atomic(COND, US) ({ \
+	unsigned long end__; \
+	int ret__ = 0; \
+	_WAIT_FOR_ATOMIC_CHECK; \
+	BUILD_BUG_ON((US) > 50000); \
+	end__ = (local_clock() >> 10) + (US) + 1; \
+	while (!(COND)) { \
+		if (time_after((unsigned long)(local_clock() >> 10), end__)) { \
+			ret__ = -ETIMEDOUT; \
+			break; \
+		} \
+		cpu_relax(); \
+	} \
+	ret__; \
+})
+
+#define wait_for_atomic(COND, MS)	_wait_for_atomic((COND), (MS) * 1000)
+#define wait_for_atomic_us(COND, US)	_wait_for_atomic((COND), (US))
 
 #define KHz(x) (1000 * (x))
 #define MHz(x) KHz(1000 * (x))

[4/5] drm/i915: Do not lie about atomic wait granularity

Commit Message

Comments

Patch