[1/5] drm/i915: Track per-context engine busyness

Message ID 20191216120704.958-2-tvrtko.ursulin@linux.intel.com
State New
Series
  • Per client engine busyness

Commit Message

Tvrtko Ursulin Dec. 16, 2019, 12:07 p.m. UTC
From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Some customers want to know how much GPU time their clients are using
in order to make dynamic load-balancing decisions.

With the hooks already in place to track overall engine busyness, we
can extend them slightly to split that time between contexts.

v2: Fix accounting for tail updates.
v3: Rebase.
v4: Mark currently running contexts as active on stats enable.
v5: Include some headers to fix the build.
v6: Added fine-grained lock.
v7: Convert to seqlock. (Chris Wilson)
v8: Rebase and tidy with helpers.
v9: Refactor.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       | 20 ++++++++
 drivers/gpu/drm/i915/gt/intel_context.h       | 11 +++++
 drivers/gpu/drm/i915/gt/intel_context_types.h |  9 ++++
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 16 ++++++-
 drivers/gpu/drm/i915/gt/intel_lrc.c           | 47 ++++++++++++++++---
 5 files changed, 95 insertions(+), 8 deletions(-)

Comments

Chris Wilson Dec. 16, 2019, 12:40 p.m. UTC | #1
Quoting Tvrtko Ursulin (2019-12-16 12:07:00)
> @@ -1389,6 +1415,9 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
>                 write_desc(execlists,
>                            rq ? execlists_update_context(rq) : 0,
>                            n);
> +
> +               if (n == 0)
> +                       intel_context_stats_start(&rq->hw_context->stats);

Too early? (Think preemption requests that may not begin for a few
hundred ms.) Mark it as started on promotion instead (should be within a
few microseconds, if not ideally a few 10 ns)? Then you will also have
better symmetry in process_csb, suggesting that we can have a routine
that takes the current *execlists->active with fewer code changes.
-Chris
Tvrtko Ursulin Dec. 16, 2019, 1:09 p.m. UTC | #2
On 16/12/2019 12:40, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2019-12-16 12:07:00)
>> @@ -1389,6 +1415,9 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
>>                  write_desc(execlists,
>>                             rq ? execlists_update_context(rq) : 0,
>>                             n);
>> +
>> +               if (n == 0)
>> +                       intel_context_stats_start(&rq->hw_context->stats);
> 
> Too early? (Think preemption requests that may not begin for a few
> hundred ms.) Mark it as started on promotion instead (should be within a
> few microseconds, if not ideally a few 10 ns)? Then you will also have
> better symmetry in process_csb, suggesting that we can have a routine
> that takes the current *execlists->active with fewer code changes.

Good point, I was disliking the csb latencies and completely missed the 
preemption side of things. Symmetry will be much better in more than one 
aspect.

Regards,

Tvrtko
Tvrtko Ursulin Jan. 30, 2020, 6:05 p.m. UTC | #3
On 16/12/2019 13:09, Tvrtko Ursulin wrote:
> 
> On 16/12/2019 12:40, Chris Wilson wrote:
>> Quoting Tvrtko Ursulin (2019-12-16 12:07:00)
>>> @@ -1389,6 +1415,9 @@ static void execlists_submit_ports(struct 
>>> intel_engine_cs *engine)
>>>                  write_desc(execlists,
>>>                             rq ? execlists_update_context(rq) : 0,
>>>                             n);
>>> +
>>> +               if (n == 0)
>>> +                       
>>> intel_context_stats_start(&rq->hw_context->stats);
>>
>> Too early? (Think preemption requests that may not begin for a few
>> hundred ms.) Mark it as started on promotion instead (should be within a
>> few microseconds, if not ideally a few 10 ns)? Then you will also have
>> better symmetry in process_csb, suggesting that we can have a routine
>> that takes the current *execlists->active with fewer code changes.
> 
> Good point, I was disliking the csb latencies and completely missed the 
> preemption side of things. Symmetry will be much better in more than one 
> aspect.

Downside of having it in process_csb is really bad accuracy with short 
batches like gem_exec_nop. :( process_csb() latency I think. It gets a 
little bit better for this particular workload if I move the start point 
to submit_ports(), but that has that other problem with preemption.

After these woes I was hopeful the pphwsp context runtime could have an 
advantage here, but then I discovered it is occasionally not monotonic. At 
least with the spammy gem_exec_nop it occasionally but regularly jumps a 
tiny bit backward:

[ 8802.082980]  (new=7282101 old=7282063 d=38)
[ 8802.083007]  (new=7282139 old=7282101 d=38)
[ 8802.083051]  (new=7282250 old=7282139 d=111)
[ 8802.083077]  (new=7282214 old=7282250 d=-36)
[ 8802.083103]  (new=7282255 old=7282214 d=41)
[ 8802.083129]  (new=7282293 old=7282255 d=38)
[ 8802.083155]  (new=7282331 old=7282293 d=38)

Ouch. Time to sleep on it.

Regards,

Tvrtko
Chris Wilson Jan. 30, 2020, 9:42 p.m. UTC | #4
Quoting Tvrtko Ursulin (2020-01-30 18:05:03)
> 
> On 16/12/2019 13:09, Tvrtko Ursulin wrote:
> > 
> > On 16/12/2019 12:40, Chris Wilson wrote:
> >> Quoting Tvrtko Ursulin (2019-12-16 12:07:00)
> >>> @@ -1389,6 +1415,9 @@ static void execlists_submit_ports(struct 
> >>> intel_engine_cs *engine)
> >>>                  write_desc(execlists,
> >>>                             rq ? execlists_update_context(rq) : 0,
> >>>                             n);
> >>> +
> >>> +               if (n == 0)
> >>> +                       
> >>> intel_context_stats_start(&rq->hw_context->stats);
> >>
> >> Too early? (Think preemption requests that may not begin for a few
> >> hundred ms.) Mark it as started on promotion instead (should be within a
> >> few microseconds, if not ideally a few 10 ns)? Then you will also have
> >> better symmetry in process_csb, suggesting that we can have a routine
> >> that takes the current *execlists->active with fewer code changes.
> > 
> > Good point, I was disliking the csb latencies and completely missed the 
> > preemption side of things. Symmetry will be much better in more than one 
> > aspect.
> 
> Downside of having it in process_csb is really bad accuracy with short 
> batches like gem_exec_nop. :( process_csb() latency I think.

Scary. I hope we can get some insight and kill some latency. Usually
ends in looking at CPU scheduler traces in dismay.

NVMe touts "interrupt steering" as crucial to maintaining caches at high
packet frequencies.

We may also see some gains from staring at profilers, but off the top of
my head the worst latency is due to engine->active.lock contention, and
associated irq-off periods.

> It gets a 
> little bit better for this particular workload if I move the start point 
> to submit_ports(), but that has that other problem with preemption.
> 
> After these woes I was hopeful the pphwsp context runtime could have an 
> advantage here, but then I discovered it is occasionally not monotonic. At 
> least with the spammy gem_exec_nop it occasionally but regularly jumps a 
> tiny bit backward:
> 
> [ 8802.082980]  (new=7282101 old=7282063 d=38)
> [ 8802.083007]  (new=7282139 old=7282101 d=38)
> [ 8802.083051]  (new=7282250 old=7282139 d=111)
> [ 8802.083077]  (new=7282214 old=7282250 d=-36)
> [ 8802.083103]  (new=7282255 old=7282214 d=41)
> [ 8802.083129]  (new=7282293 old=7282255 d=38)
> [ 8802.083155]  (new=7282331 old=7282293 d=38)
> 
> Ouch. Time to sleep on it.

Also scary. How, how, how???
-Chris

Patch

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index b1e346d2d35f..b211b48d6cae 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -243,6 +243,7 @@  intel_context_init(struct intel_context *ce,
 	INIT_LIST_HEAD(&ce->signals);
 
 	mutex_init(&ce->pin_mutex);
+	seqlock_init(&ce->stats.lock);
 
 	i915_active_init(&ce->active,
 			 __intel_context_active, __intel_context_retire);
@@ -337,6 +338,25 @@  struct i915_request *intel_context_create_request(struct intel_context *ce)
 	return rq;
 }
 
+ktime_t intel_context_get_busy_time(struct intel_context *ce)
+{
+	unsigned int seq;
+	ktime_t total;
+
+	do {
+		seq = read_seqbegin(&ce->stats.lock);
+
+		total = ce->stats.total;
+
+		if (ce->stats.active)
+			total = ktime_add(total,
+					  ktime_sub(ktime_get(),
+						    ce->stats.start));
+	} while (read_seqretry(&ce->stats.lock, seq));
+
+	return total;
+}
+
 #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
 #include "selftest_context.c"
 #endif
diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index b39eb1fcfbca..3a15cf32f0a3 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -160,4 +160,15 @@  static inline struct intel_ring *__intel_context_ring_size(u64 sz)
 	return u64_to_ptr(struct intel_ring, sz);
 }
 
+static inline void
+__intel_context_stats_start(struct intel_context_stats *stats, ktime_t now)
+{
+	if (!stats->active) {
+		stats->start = now;
+		stats->active = true;
+	}
+}
+
+ktime_t intel_context_get_busy_time(struct intel_context *ce);
+
 #endif /* __INTEL_CONTEXT_H__ */
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index d1204cc899a3..12cbad0798cb 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -11,6 +11,7 @@ 
 #include <linux/list.h>
 #include <linux/mutex.h>
 #include <linux/types.h>
+#include <linux/seqlock.h>
 
 #include "i915_active_types.h"
 #include "i915_utils.h"
@@ -76,6 +77,14 @@  struct intel_context {
 
 	/** sseu: Control eu/slice partitioning */
 	struct intel_sseu sseu;
+
+	/** stats: Context GPU engine busyness tracking. */
+	struct intel_context_stats {
+		seqlock_t lock;
+		bool active;
+		ktime_t start;
+		ktime_t total;
+	} stats;
 };
 
 #endif /* __INTEL_CONTEXT_TYPES__ */
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 3d1d48bf90cf..ac08781c8b24 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1577,8 +1577,20 @@  int intel_enable_engine_stats(struct intel_engine_cs *engine)
 
 		engine->stats.enabled_at = ktime_get();
 
-		/* XXX submission method oblivious? */
-		for (port = execlists->active; (rq = *port); port++)
+		/*
+		 * Mark currently running context as active.
+		 * XXX submission method oblivious?
+		 */
+
+		rq = NULL;
+		port = execlists->active;
+		if (port)
+			rq = *port;
+		if (rq)
+			__intel_context_stats_start(&rq->hw_context->stats,
+						    engine->stats.enabled_at);
+
+		for (; (rq = *port); port++)
 			engine->stats.active++;
 
 		for (port = execlists->pending; (rq = *port); port++) {
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 4ebfecd95032..1f158cb439bc 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -940,6 +940,7 @@  static void intel_engine_context_in(struct intel_engine_cs *engine)
 	if (engine->stats.enabled > 0) {
 		if (engine->stats.active++ == 0)
 			engine->stats.start = ktime_get();
+
 		GEM_BUG_ON(engine->stats.active == 0);
 	}
 
@@ -1088,6 +1089,30 @@  static void reset_active(struct i915_request *rq,
 	ce->lrc_desc |= CTX_DESC_FORCE_RESTORE;
 }
 
+static void
+intel_context_stats_start(struct intel_context_stats *stats)
+{
+	write_seqlock(&stats->lock);
+	__intel_context_stats_start(stats, ktime_get());
+	write_sequnlock(&stats->lock);
+}
+
+static void
+intel_context_stats_stop(struct intel_context_stats *stats)
+{
+	unsigned long flags;
+
+	if (!READ_ONCE(stats->active))
+		return;
+
+	write_seqlock_irqsave(&stats->lock, flags);
+	GEM_BUG_ON(!READ_ONCE(stats->active));
+	stats->total = ktime_add(stats->total,
+				 ktime_sub(ktime_get(), stats->start));
+	stats->active = false;
+	write_sequnlock_irqrestore(&stats->lock, flags);
+}
+
 static inline struct intel_engine_cs *
 __execlists_schedule_in(struct i915_request *rq)
 {
@@ -1155,7 +1180,7 @@  static inline void
 __execlists_schedule_out(struct i915_request *rq,
 			 struct intel_engine_cs * const engine)
 {
-	struct intel_context * const ce = rq->hw_context;
+	struct intel_context *ce = rq->hw_context;
 
 	/*
 	 * NB process_csb() is not under the engine->active.lock and hence
@@ -1172,6 +1197,7 @@  __execlists_schedule_out(struct i915_request *rq,
 		intel_engine_add_retire(engine, ce->timeline);
 
 	intel_engine_context_out(engine);
+	intel_context_stats_stop(&ce->stats);
 	execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_OUT);
 	intel_gt_pm_put_async(engine->gt);
 
@@ -1389,6 +1415,9 @@  static void execlists_submit_ports(struct intel_engine_cs *engine)
 		write_desc(execlists,
 			   rq ? execlists_update_context(rq) : 0,
 			   n);
+
+		if (n == 0)
+			intel_context_stats_start(&rq->hw_context->stats);
 	}
 
 	/* we need to manually load the submit queue */
@@ -2197,7 +2226,11 @@  static void process_csb(struct intel_engine_cs *engine)
 
 			WRITE_ONCE(execlists->pending[0], NULL);
 		} else {
-			GEM_BUG_ON(!*execlists->active);
+			struct i915_request *rq = *execlists->active++;
+
+			GEM_BUG_ON(!rq);
+			GEM_BUG_ON(execlists->active - execlists->inflight >
+				   execlists_num_ports(execlists));
 
 			/* port0 completed, advanced to port1 */
 			trace_ports(execlists, "completed", execlists->active);
@@ -2208,12 +2241,14 @@  static void process_csb(struct intel_engine_cs *engine)
 			 * coherent (visible from the CPU) before the
 			 * user interrupt and CSB is processed.
 			 */
-			GEM_BUG_ON(!i915_request_completed(*execlists->active) &&
+			GEM_BUG_ON(!i915_request_completed(rq) &&
 				   !reset_in_progress(execlists));
-			execlists_schedule_out(*execlists->active++);
 
-			GEM_BUG_ON(execlists->active - execlists->inflight >
-				   execlists_num_ports(execlists));
+			execlists_schedule_out(rq);
+			rq = *execlists->active;
+			if (rq)
+				intel_context_stats_start(&rq->hw_context->stats);
+
 		}
 	} while (head != tail);