Message ID: 20180606144010.9367-1-tvrtko.ursulin@linux.intel.com (mailing list archive)
State:      New, archived
Quoting Tvrtko Ursulin (2018-06-06 15:40:10)
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> We add a PMU counter to expose the number of requests currently executing
> on the GPU.
>
> This is useful to analyze the overall load of the system.
>
> v2:
>  * Rebase.
>  * Drop floating point constant. (Chris Wilson)
>
> v3:
>  * Change scale to 1024 for faster arithmetics. (Chris Wilson)
>
> v4:
>  * Refactored for timer period accounting.
>
> v5:
>  * Avoid 64-division. (Chris Wilson)
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
> #define ENGINE_SAMPLE_BITS (1 << I915_PMU_SAMPLE_BITS)
>
> @@ -226,6 +227,13 @@ engines_sample(struct drm_i915_private *dev_priv, unsigned int period_ns)
>                                 div_u64((u64)period_ns *
>                                         I915_SAMPLE_QUEUED_DIVISOR,
>                                         1000000));
> +
> +               if (engine->pmu.enable & BIT(I915_SAMPLE_RUNNING))
> +                       add_sample_mult(&engine->pmu.sample[I915_SAMPLE_RUNNING],
> +                                       last_seqno - current_seqno,
> +                                       div_u64((u64)period_ns *
> +                                               I915_SAMPLE_QUEUED_DIVISOR,
> +                                               1000000));

Are we worried about losing precision with qd.ns?

	add_sample_mult(SAMPLE, x, period_ns);

here

> @@ -560,7 +569,8 @@ static u64 __i915_pmu_event_read(struct perf_event *event)
>                 val = engine->pmu.sample[sample].cur;
>
>                 if (sample == I915_SAMPLE_QUEUED ||
> -                   sample == I915_SAMPLE_RUNNABLE)
> +                   sample == I915_SAMPLE_RUNNABLE ||
> +                   sample == I915_SAMPLE_RUNNING)
>                         val = div_u64(val, MSEC_PER_SEC); /* to qd */

and

	val = div_u64(val * I915_SAMPLE_QUEUED_DIVISOR, NSEC_PER_SEC);

So that gives us a limit of ~1 million qd (assuming the user cares about
1s intervals). Up to 8 million wlog with

	val = div_u64(val * I915_SAMPLE_QUEUED_DIVISOR/8, NSEC_PER_SEC/8);

Anyway, I am just concerned about having more than one 64b division and
want to provoke you into thinking of a way of avoiding it :)
-Chris
On 06/06/2018 16:23, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-06-06 15:40:10)
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> We add a PMU counter to expose the number of requests currently executing
>> on the GPU.
>>
>> This is useful to analyze the overall load of the system.
>>
>> v2:
>>  * Rebase.
>>  * Drop floating point constant. (Chris Wilson)
>>
>> v3:
>>  * Change scale to 1024 for faster arithmetics. (Chris Wilson)
>>
>> v4:
>>  * Refactored for timer period accounting.
>>
>> v5:
>>  * Avoid 64-division. (Chris Wilson)
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> ---
>> #define ENGINE_SAMPLE_BITS (1 << I915_PMU_SAMPLE_BITS)
>>
>> @@ -226,6 +227,13 @@ engines_sample(struct drm_i915_private *dev_priv, unsigned int period_ns)
>>                                 div_u64((u64)period_ns *
>>                                         I915_SAMPLE_QUEUED_DIVISOR,
>>                                         1000000));
>> +
>> +               if (engine->pmu.enable & BIT(I915_SAMPLE_RUNNING))
>> +                       add_sample_mult(&engine->pmu.sample[I915_SAMPLE_RUNNING],
>> +                                       last_seqno - current_seqno,
>> +                                       div_u64((u64)period_ns *
>> +                                               I915_SAMPLE_QUEUED_DIVISOR,
>> +                                               1000000));
>
> Are we worried about losing precision with qd.ns?
>
> 	add_sample_mult(SAMPLE, x, period_ns);
>
> here
>
>> @@ -560,7 +569,8 @@ static u64 __i915_pmu_event_read(struct perf_event *event)
>>                 val = engine->pmu.sample[sample].cur;
>>
>>                 if (sample == I915_SAMPLE_QUEUED ||
>> -                   sample == I915_SAMPLE_RUNNABLE)
>> +                   sample == I915_SAMPLE_RUNNABLE ||
>> +                   sample == I915_SAMPLE_RUNNING)
>>                         val = div_u64(val, MSEC_PER_SEC); /* to qd */
>
> and
>
> 	val = div_u64(val * I915_SAMPLE_QUEUED_DIVISOR, NSEC_PER_SEC);

Yeah, that works, thanks.

> So that gives us a limit of ~1 million qd (assuming the user cares about
> 1s intervals). Up to 8 million wlog with
>
> 	val = div_u64(val * I915_SAMPLE_QUEUED_DIVISOR/8, NSEC_PER_SEC/8);

Or keep it in qd.us, as for frequency. I think precision is plenty in any
case.

> Anyway, I am just concerned about having more than one 64b division and
> want to provoke you into thinking of a way of avoiding it :)

It is an optimized 64-bit divide by a constant, or "64-division" as I
fumbled it in the commit message :), so not as bad as a full 64/64 divide,
but your idea is still very good.

Regards,

Tvrtko
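For reference, the scheme agreed above can be sketched as follows. This is an illustrative kernel-side fragment, not the follow-up patch itself: the two helper names are hypothetical, while struct i915_pmu_sample (with its u64 cur field), div_u64(), NSEC_PER_SEC and I915_SAMPLE_QUEUED_DIVISOR are the existing kernel and patch symbols. The sampling timer accumulates raw qd.ns with a plain multiply, and the only 64-bit division happens when userspace reads the counter:

/* Sampling timer path: one multiply, no division (helper hypothetical). */
static void sample_qd(struct i915_pmu_sample *s, u32 qd, u32 period_ns)
{
	/* Accumulate queue-depth * nanoseconds. */
	s->cur += (u64)qd * period_ns;
}

/* Read path: a single div_u64 converts qd.ns to qd scaled by 1024. */
static u64 read_qd(const struct i915_pmu_sample *s)
{
	return div_u64(s->cur * I915_SAMPLE_QUEUED_DIVISOR, NSEC_PER_SEC);
}

The multiply by 1024 in read_qd() consumes 10 bits of headroom on the accumulated value; the "I915_SAMPLE_QUEUED_DIVISOR/8, NSEC_PER_SEC/8" variant quoted above buys back three of those bits before the multiplication can overflow u64, at the cost of slightly coarser rounding.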
diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
index 46a516a748c8..9ecaf662b5c1 100644
--- a/drivers/gpu/drm/i915/i915_pmu.c
+++ b/drivers/gpu/drm/i915/i915_pmu.c
@@ -17,7 +17,8 @@
 	 BIT(I915_SAMPLE_WAIT) | \
 	 BIT(I915_SAMPLE_SEMA) | \
 	 BIT(I915_SAMPLE_QUEUED) | \
-	 BIT(I915_SAMPLE_RUNNABLE))
+	 BIT(I915_SAMPLE_RUNNABLE) | \
+	 BIT(I915_SAMPLE_RUNNING))
 
 #define ENGINE_SAMPLE_BITS (1 << I915_PMU_SAMPLE_BITS)
 
@@ -226,6 +227,13 @@ engines_sample(struct drm_i915_private *dev_priv, unsigned int period_ns)
 					div_u64((u64)period_ns *
 						I915_SAMPLE_QUEUED_DIVISOR,
 						1000000));
+
+		if (engine->pmu.enable & BIT(I915_SAMPLE_RUNNING))
+			add_sample_mult(&engine->pmu.sample[I915_SAMPLE_RUNNING],
+					last_seqno - current_seqno,
+					div_u64((u64)period_ns *
+						I915_SAMPLE_QUEUED_DIVISOR,
+						1000000));
 	}
 
 	if (fw)
@@ -341,6 +349,7 @@ engine_event_status(struct intel_engine_cs *engine,
 	case I915_SAMPLE_WAIT:
 	case I915_SAMPLE_QUEUED:
 	case I915_SAMPLE_RUNNABLE:
+	case I915_SAMPLE_RUNNING:
 		break;
 	case I915_SAMPLE_SEMA:
 		if (INTEL_GEN(engine->i915) < 6)
@@ -560,7 +569,8 @@ static u64 __i915_pmu_event_read(struct perf_event *event)
 			val = engine->pmu.sample[sample].cur;
 
 			if (sample == I915_SAMPLE_QUEUED ||
-			    sample == I915_SAMPLE_RUNNABLE)
+			    sample == I915_SAMPLE_RUNNABLE ||
+			    sample == I915_SAMPLE_RUNNING)
 				val = div_u64(val, MSEC_PER_SEC); /* to qd */
 		}
 	} else {
@@ -858,6 +868,7 @@ add_pmu_attr(struct perf_pmu_events_attr *attr, const char *name,
 /* No brackets or quotes below please. */
 #define I915_SAMPLE_QUEUED_SCALE 0.0009765625
 #define I915_SAMPLE_RUNNABLE_SCALE 0.0009765625
+#define I915_SAMPLE_RUNNING_SCALE 0.0009765625
 
 static struct attribute **
 create_event_attributes(struct drm_i915_private *i915)
@@ -885,6 +896,8 @@ create_event_attributes(struct drm_i915_private *i915)
 				     __stringify(I915_SAMPLE_QUEUED_SCALE)),
 		__engine_event_scale(I915_SAMPLE_RUNNABLE, "runnable",
 				     __stringify(I915_SAMPLE_RUNNABLE_SCALE)),
+		__engine_event_scale(I915_SAMPLE_RUNNING, "running",
+				     __stringify(I915_SAMPLE_RUNNING_SCALE)),
 	};
 	unsigned int count = 0;
 	struct perf_pmu_events_attr *pmu_attr = NULL, *pmu_iter;
@@ -900,6 +913,9 @@ create_event_attributes(struct drm_i915_private *i915)
 	BUILD_BUG_ON(I915_SAMPLE_RUNNABLE_DIVISOR !=
 		     (1 / I915_SAMPLE_RUNNABLE_SCALE));
 
+	BUILD_BUG_ON(I915_SAMPLE_RUNNING_DIVISOR !=
+		     (1 / I915_SAMPLE_RUNNING_SCALE));
+
 	/* Count how many counters we will be exposing. */
 	for (i = 0; i < ARRAY_SIZE(events); i++) {
 		if (!config_status(i915, events[i].config))
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 703cea694f0d..bff20cfd6870 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -420,7 +420,7 @@ struct intel_engine_cs {
 		 *
 		 * Our internal timer stores the current counters in this field.
 		 */
-#define I915_ENGINE_SAMPLE_MAX (I915_SAMPLE_RUNNABLE + 1)
+#define I915_ENGINE_SAMPLE_MAX (I915_SAMPLE_RUNNING + 1)
 		struct i915_pmu_sample sample[I915_ENGINE_SAMPLE_MAX];
 	} pmu;
 
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index cf0265b20e37..9a00c30e4071 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -113,11 +113,13 @@ enum drm_i915_pmu_engine_sample {
 	I915_SAMPLE_SEMA = 2,
 	I915_SAMPLE_QUEUED = 3,
 	I915_SAMPLE_RUNNABLE = 4,
+	I915_SAMPLE_RUNNING = 5,
 };
 
 /* Divide counter value by divisor to get the real value. */
 #define I915_SAMPLE_QUEUED_DIVISOR (1024)
 #define I915_SAMPLE_RUNNABLE_DIVISOR (1024)
+#define I915_SAMPLE_RUNNING_DIVISOR (1024)
 
 #define I915_PMU_SAMPLE_BITS (4)
 #define I915_PMU_SAMPLE_MASK (0xf)
@@ -145,6 +147,9 @@ enum drm_i915_pmu_engine_sample {
 #define I915_PMU_ENGINE_RUNNABLE(class, instance) \
 	__I915_PMU_ENGINE(class, instance, I915_SAMPLE_RUNNABLE)
 
+#define I915_PMU_ENGINE_RUNNING(class, instance) \
+	__I915_PMU_ENGINE(class, instance, I915_SAMPLE_RUNNING)
+
 #define __I915_PMU_OTHER(x) (__I915_PMU_ENGINE(0xff, 0xff, 0xf) + 1 + (x))
 
 #define I915_PMU_ACTUAL_FREQUENCY	__I915_PMU_OTHER(0)
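To illustrate the new uAPI, here is a hypothetical userspace consumer, not part of the patch: it assumes the I915_PMU_ENGINE_RUNNING and I915_SAMPLE_RUNNING_DIVISOR additions above are present in the installed drm/i915_drm.h (the include path may vary), and that a render engine (I915_ENGINE_CLASS_RENDER, instance 0) exists. The i915 PMU is a dynamic, uncore-style PMU, so its perf type is looked up in sysfs and the event is opened system-wide (pid == -1, one valid CPU), which typically requires root:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>
#include <drm/i915_drm.h>

int main(void)
{
	struct perf_event_attr attr;
	unsigned int type;
	uint64_t v0 = 0, v1 = 0;
	FILE *f;
	int fd;

	/* The i915 PMU is dynamic; its perf type is published in sysfs. */
	f = fopen("/sys/bus/event_source/devices/i915/type", "r");
	if (!f)
		return 1;
	if (fscanf(f, "%u", &type) != 1) {
		fclose(f);
		return 1;
	}
	fclose(f);

	memset(&attr, 0, sizeof(attr));
	attr.type = type;
	attr.size = sizeof(attr);
	/* Requests executing on the render engine (class 0, instance 0). */
	attr.config = I915_PMU_ENGINE_RUNNING(I915_ENGINE_CLASS_RENDER, 0);

	/* Uncore-style event: system-wide, pid == -1, a single CPU. */
	fd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
	if (fd < 0)
		return 1;

	/* Sample the counter over a ~1s window. */
	read(fd, &v0, sizeof(v0));
	sleep(1);
	read(fd, &v1, sizeof(v1));
	close(fd);

	/*
	 * Per the read path above, the counter advances by
	 * qd * seconds * 1024, so the delta over one second divided
	 * by the divisor approximates the average queue depth.
	 */
	printf("average running: %.2f\n",
	       (double)(v1 - v0) / I915_SAMPLE_RUNNING_DIVISOR);
	return 0;
}

Note that perf tooling does not need the manual divide: the patch publishes a matching .scale attribute of 0.0009765625 (1/1024) for the "running" event, so perf stat applies the scaling automatically.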