[22/41] drm/i915: Fair low-latency scheduling

The first "scheduler" was a topological sorting of requests into
priority order. The execution order was deterministic: the earliest
submitted, highest priority request would be executed first. Priority
inheritance ensured that inversions were kept at bay, and allowed us to
dynamically boost priorities (e.g. for interactive pageflips).

The minimalistic timeslicing scheme was an attempt to introduce fairness
between long running requests, by evicting the active request at the end
of a timeslice and moving it to the back of its priority queue (while
ensuring that dependencies were kept in order). For short running
requests from many clients of equal priority, the scheme is still very
much FIFO submission ordering, and as unfair as before.

To impose fairness, we need an external metric that ensures that clients
are interspersed, so we don't execute one long chain from client A before
executing any of client B. This could be imposed by the clients
themselves by using fences based on an external clock, that is they only
submit work for a "frame" at frame-intervals, instead of submitting as
much work as they are able to. The standard SwapBuffers approach is akin
to double buffering, where as one frame is being executed, the next is
being submitted, such that there is always a maximum of two frames per
client in the pipeline and so ideally maintains consistent input-output
latency. Even this scheme exhibits unfairness under load as a single
client will execute two frames back to back before the next, and with
enough clients, deadlines will be missed.

The idea introduced by BFS/MuQSS is that fairness is introduced by
metering with an external clock. Every request, when it becomes ready to
execute, is assigned a virtual deadline, and execution order is then
determined by earliest deadline. Priority is used as a hint, rather than
strict ordering, where high priority requests have earlier deadlines,
but not necessarily earlier than outstanding work. Thus work is executed
in order of 'readiness', with timeslicing to demote long running work.
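
As a rough illustration of the paragraph above (a minimal sketch, not the
code in this patch), the deadline can be derived from the time a request
becomes ready plus a slice that shrinks with priority. The struct, the
helper names, the 1ms base slice and the shift-based scaling are all
assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request: only the fields needed for this sketch. */
struct request {
	int prio;          /* priority hint, higher is more urgent */
	uint64_t deadline; /* virtual deadline in ns, the actual sort key */
};

#define BASE_SLICE_NS 1000000ull /* assumed 1ms base timeslice */

/* Higher priority shrinks the slice, lower priority stretches it. */
static uint64_t prio_slice_ns(int prio)
{
	if (prio > 10)
		prio = 10;
	if (prio < -10)
		prio = -10;

	if (prio >= 0)
		return BASE_SLICE_NS >> prio;

	return BASE_SLICE_NS << -prio;
}

/* Assign the virtual deadline at the moment the request becomes ready. */
static void request_ready(struct request *rq, uint64_t now_ns)
{
	rq->deadline = now_ns + prio_slice_ns(rq->prio);
}

/* Earliest deadline first; priority is never compared directly. */
static struct request *pick_next(struct request *rqs, size_t count)
{
	struct request *best = NULL;
	size_t i;

	for (i = 0; i < count; i++)
		if (!best || rqs[i].deadline < best->deadline)
			best = &rqs[i];

	return best;
}
```

Under these assumptions, a priority 2 request that becomes ready at 0.9ms
receives a deadline of 1.15ms and so still runs after a priority 0 request
that became ready at time 0 (deadline 1ms), whereas had both become ready
together the priority 2 request would run first.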

The Achilles' heel of this scheduler is its strong preference for
low-latency and favouring of new queues. Whereas it was easy to dominate
the old scheduler by flooding it with many requests over a short period
of time, the new scheduler can be dominated by a 'synchronous' client
that waits for each of its requests to complete before submitting the
next. As such a client has no history, it is always considered
ready-to-run and receives an earlier deadline than the long running
requests. This is compensated for by refreshing the current execution's
deadline and by disallowing preemption for timeslice shuffling.

In contrast, one key advantage of disconnecting the sort key from the
priority value is that we can freely adjust the deadline to compensate
for other factors. This is used in conjunction with submitting requests
ahead-of-schedule that then busywait on the GPU using semaphores. Since
we don't want to spend a timeslice busywaiting instead of doing real
work when available, we deprioritise work by giving the semaphore waits
a later virtual deadline. The priority deboost is applied to semaphore
workloads after they miss a semaphore wait and a new context is pending.
The request is then restored to its normal priority once the semaphores
are signaled so that it is not unfairly penalised under contention by
remaining at a far future deadline. This is a much improved and cleaner
version of commit f9e9e9de58c7 ("drm/i915: Prioritise non-busywait
semaphore workloads").
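
The deboost and restore described above might be sketched as follows. This
is only an illustration under stated assumptions: the struct, the function
names and the use of UINT64_MAX as the "far future" deadline are
hypothetical and are not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEADLINE_NEVER UINT64_MAX /* assumed "far future" sentinel */

/* Hypothetical request that busywaits on a semaphore. */
struct sema_request {
	uint64_t deadline;       /* current sort key */
	uint64_t saved_deadline; /* deadline to restore once signaled */
	bool deboosted;
};

/*
 * Called when the request missed its semaphore wait. Only push the
 * deadline out if another context is pending, otherwise there is no
 * real work being displaced by the busywait.
 */
static void semaphore_missed(struct sema_request *rq, bool other_pending)
{
	if (!other_pending || rq->deboosted)
		return;

	rq->saved_deadline = rq->deadline;
	rq->deadline = DEADLINE_NEVER;
	rq->deboosted = true;
}

/* Called once the semaphores are signaled: undo the deboost. */
static void semaphore_signaled(struct sema_request *rq)
{
	if (!rq->deboosted)
		return;

	rq->deadline = rq->saved_deadline;
	rq->deboosted = false;
}
```

Keeping the original deadline alongside the deboosted one is what lets the
request return to its place in the queue rather than stay parked behind
all other work.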

To check the impact on throughput (often the downfall of latency
sensitive schedulers), we used gem_wsim to simulate various transcode
workloads with different load balancers, and varying the number of
competing [heterogeneous] clients. On Kabylake gt3e running at fixed
clocks,

+delta%------------------------------------------------------------------+
|       a                                                                |
|       a                                                                |
|       a                                                                |
|       a                                                                |
|       aa                                                               |
|      aaa                                                               |
|      aaaa                                                              |
|     aaaaaa                                                             |
|     aaaaaa                                                             |
|     aaaaaa   a                a                                        |
| aa  aaaaaa a a      a  a   aa a       a         a       a             a|
||______M__A__________|                                                  |
+------------------------------------------------------------------------+
    N           Min           Max        Median          Avg       Stddev
  108    -4.6326643     47.797855 -0.00069639128     2.116185   7.6764049

Reviewing the same workloads on Tigerlake,

+delta%------------------------------------------------------------------+
|       a                                                                |
|       a                                                                |
|       a                                                                |
|       aa a                                                             |
|       aaaa                                                             |
|       aaaa                                                             |
|    aaaaaaa                                                             |
|    aaaaaaa                                                             |
|    aaaaaaa      a   a   aa  a         a                         a      |
| aaaaaaaaaa a aa a a a aaaa aa   a     a        aa               a     a|
||_______M____A_____________|                                            |
+------------------------------------------------------------------------+
    N           Min           Max        Median          Avg       Stddev
  108     -4.258712      46.83081    0.36853159    4.1415662     9.461689

The expectation is that by deliberately increasing the number of context
switches to improve fairness between clients, throughput will be
diminished. What we do see are small fluctuations around no change,
with the median result being improved throughput. The dramatic
improvement is from reintroducing the improved no-semaphore boosting,
which avoids accidentally preventing scheduling of ready workloads due
to busy spinners.

This scheduler is based on MuQSS by Dr Con Kolivas.

Testcase: igt/gem_exec_fair
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |   2 -
 .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
 drivers/gpu/drm/i915/gt/intel_engine_pm.c     |   4 +-
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  14 -
 .../drm/i915/gt/intel_execlists_submission.c  | 205 ++++-----
 drivers/gpu/drm/i915/gt/selftest_execlists.c  |  30 +-
 drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |   5 +-
 drivers/gpu/drm/i915/gt/selftest_lrc.c        |   1 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   5 -
 drivers/gpu/drm/i915/i915_priolist_types.h    |   7 +-
 drivers/gpu/drm/i915/i915_request.c           |  14 +-
 drivers/gpu/drm/i915/i915_scheduler.c         | 433 +++++++++++++-----
 drivers/gpu/drm/i915/i915_scheduler.h         |  16 +-
 drivers/gpu/drm/i915/i915_scheduler_types.h   |  23 +
 drivers/gpu/drm/i915/selftests/i915_request.c |   1 +
 .../gpu/drm/i915/selftests/i915_scheduler.c   | 136 ++++++
 16 files changed, 630 insertions(+), 267 deletions(-)

Message ID	20210125140136.10494-22-chris@chris-wilson.co.uk (mailing list archive)
State	New, archived
From	Chris Wilson <chris@chris-wilson.co.uk>
To	intel-gfx@lists.freedesktop.org
Cc	thomas.hellstrom@intel.com, Chris Wilson <chris@chris-wilson.co.uk>
Date	Mon, 25 Jan 2021 14:01:17 +0000
