[10/31] drm/i915: Fair low-latency scheduling

Message ID 20210208105236.28498-10-chris@chris-wilson.co.uk (mailing list archive)
State New, archived
Series [01/31] drm/i915/gt: Ratelimit heartbeat completion probing

Commit Message

Chris Wilson Feb. 8, 2021, 10:52 a.m. UTC
The first "scheduler" was a topographical sorting of requests into
priority order. The execution order was deterministic, the earliest
submitted, highest priority request would be executed first. Priority
inheritance ensured that inversions were kept at bay, and allowed us to
dynamically boost priorities (e.g. for interactive pageflips).

The minimalistic timeslicing scheme was an attempt to introduce fairness
between long running requests, by evicting the active request at the end
of a timeslice and moving it to the back of its priority queue (while
ensuring that dependencies were kept in order). For short running
requests from many clients of equal priority, the scheme is still very
much FIFO submission ordering, and as unfair as before.

To impose fairness, we need an external metric that ensures that clients
are interspersed, so we don't execute one long chain from client A before
executing any of client B. This could be imposed by the clients
themselves by using fences based on an external clock; that is, they only
submit work for a "frame" at frame-intervals, instead of submitting as
much work as they are able to. The standard SwapBuffers approach is akin
to double buffering, where, as one frame is being executed, the next is
being submitted, such that there is always a maximum of two frames per
client in the pipeline and so ideally maintains consistent input-output
latency. Even this scheme exhibits unfairness under load as a single
client will execute two frames back to back before the next, and with
enough clients, deadlines will be missed.

The idea introduced by BFS/MuQSS is that fairness is achieved by
metering with an external clock. Every request, when it becomes ready to
execute, is assigned a virtual deadline, and execution order is then
determined by earliest deadline. Priority is used as a hint, rather than
strict ordering, where high priority requests have earlier deadlines,
but not necessarily earlier than outstanding work. Thus work is executed
in order of 'readiness', with timeslicing to demote long running work.
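
As a rough illustration of the deadline calculation (the helper names and
the priority-to-scale mapping below are assumptions for this sketch, not
the driver's code), the deadline is simply "readiness time plus a
priority-scaled slice", and the run queue is then kept in order of that
value:

  /* Toy userspace model of virtual-deadline ordering. */
  #include <stdint.h>
  #include <stdio.h>

  /* Higher priority shrinks the slice added to "now", giving an earlier
   * deadline; priority biases the ordering but does not strictly override
   * an older request that has been waiting longer. */
  static uint64_t virtual_deadline(uint64_t now_ns, uint64_t slice_ns, int prio)
  {
          return prio >= 0 ? now_ns + (slice_ns >> prio)
                           : now_ns + (slice_ns << -prio);
  }

  int main(void)
  {
          const uint64_t ms = 1000000;
          /* normal-priority work ready at t=0 vs high-priority work ready at t=4ms */
          uint64_t dl_norm = virtual_deadline(0,      5 * ms, 0);
          uint64_t dl_high = virtual_deadline(4 * ms, 5 * ms, 2);

          printf("normal dl=%llu, high dl=%llu -> %s runs first\n",
                 (unsigned long long)dl_norm, (unsigned long long)dl_high,
                 dl_norm <= dl_high ? "normal" : "high");
          return 0;
  }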

The Achilles' heel of this scheduler is its strong preference for
low-latency and favouring of new queues. Whereas it was easy to dominate
the old scheduler by flooding it with many requests over a short period
of time, the new scheduler can be dominated by a 'synchronous' client
that waits for each of its requests to complete before submitting the
next. As such a client has no history, it is always considered
ready-to-run and receives an earlier deadline than the long running
requests. This is compensated for by refreshing the current execution's
deadline and by disallowing preemption for timeslice shuffling.

In contrast, one key advantage of disconnecting the sort key from the
priority value is that we can freely adjust the deadline to compensate
for other factors. This is used in conjunction with submitting requests
ahead-of-schedule that then busywait on the GPU using semaphores. Since
we don't want to spend a timeslice busywaiting instead of doing real
work when available, we deprioritise work by giving the semaphore waits
a later virtual deadline. The priority deboost is applied to semaphore
workloads after they miss a semaphore wait and a new context is pending.
The request is then restored to its normal priority once the semaphores
are signaled so that it is not unfairly penalised under contention by
remaining at a far future deadline. This is a much improved and cleaner
version of commit f9e9e9de58c7 ("drm/i915: Prioritise non-busywait
semaphore workloads").
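
A minimal sketch of that deboost/restore flow (the struct, the helpers and
the penalty constant below are hypothetical, not the patch's API):

  #include <stdint.h>

  struct toy_request {
          uint64_t deadline;     /* virtual deadline, smaller = sooner */
          uint64_t slice_ns;
          int prio;
  };

  /* While the request busywaits on a semaphore and another context is
   * pending, push its deadline far into the future so real work runs. */
  static void semaphore_deboost(struct toy_request *rq, uint64_t now_ns,
                                uint64_t penalty_ns)
  {
          rq->deadline = now_ns + penalty_ns;
  }

  /* Once the semaphore signals, recompute the deadline from the point of
   * readiness [now] so the request is not stranded in the far future. */
  static void semaphore_signaled(struct toy_request *rq, uint64_t now_ns)
  {
          rq->deadline = now_ns + (rq->prio >= 0 ? rq->slice_ns >> rq->prio
                                                 : rq->slice_ns << -rq->prio);
  }

  int main(void)
  {
          struct toy_request rq = { .slice_ns = 5000000, .prio = 0 };

          semaphore_deboost(&rq, 0, 100000000);  /* spinning: dl pushed to 100ms */
          semaphore_signaled(&rq, 2000000);      /* signaled at 2ms: dl back to 7ms */
          return rq.deadline == 7000000 ? 0 : 1;
  }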

To check the impact on throughput (often the downfall of latency
sensitive schedulers), we used gem_wsim to simulate various transcode
workloads with different load balancers, and varying the number of
competing [heterogeneous] clients. On Kabylake gt3e running at fixed
cpu/gpu clocks,

+delta%------------------------------------------------------------------+
|       a                                                                |
|       a                                                                |
|       a                                                                |
|       a                                                                |
|       aa                                                               |
|      aaa                                                               |
|      aaaa                                                              |
|     aaaaaa                                                             |
|     aaaaaa                                                             |
|     aaaaaa   a                a                                        |
| aa  aaaaaa a a      a  a   aa a       a         a       a             a|
||______M__A__________|                                                  |
+------------------------------------------------------------------------+
    N           Min           Max        Median          Avg       Stddev
  108    -4.6326643     47.797855 -0.00069639128     2.116185   7.6764049

Each point is the relative percentage change in gem_wsim's work-per-second
score [using the median result of 120 25s runs, the relative change
computed as (B/A - 1) * 100]; 0 being no change.

Reviewing the same workloads on Tigerlake,

+delta%------------------------------------------------------------------+
|       a                                                                |
|       a                                                                |
|       a                                                                |
|       aa a                                                             |
|       aaaa                                                             |
|       aaaa                                                             |
|    aaaaaaa                                                             |
|    aaaaaaa                                                             |
|    aaaaaaa      a   a   aa  a         a                         a      |
| aaaaaaaaaa a aa a a a aaaa aa   a     a        aa               a     a|
||_______M____A_____________|                                            |
+------------------------------------------------------------------------+
    N           Min           Max        Median          Avg       Stddev
  108     -4.258712      46.83081    0.36853159    4.1415662     9.461689

The expectation is that by deliberately increasing the number of context
switches to improve fairness between clients, throughput will be
diminished. What we do see are small fluctuations around no change, with
the median result being improved throughput. The dramatic improvement is
from reintroducing the improved no-semaphore boosting, which avoids
accidentally preventing scheduling of ready workloads due to busy
spinners (i.e. avoids wasting cycles when there is work to be done).

We expect to see no change in single client workloads such as games,
though running multiple applications on a desktop should have reduced
jitter i.e. smoother input-output latency.

This scheduler is based on MuQSS by Dr Con Kolivas.

v2: More commentary, especially around where we reset the deadlines.

Testcase: igt/gem_exec_fair
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/Kconfig.profile          |  62 +++
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |   2 -
 .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
 drivers/gpu/drm/i915/gt/intel_engine_pm.c     |   4 +-
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  14 -
 drivers/gpu/drm/i915/gt/intel_engine_user.c   |   1 +
 .../drm/i915/gt/intel_execlists_submission.c  | 233 ++++----
 drivers/gpu/drm/i915/gt/selftest_execlists.c  |  30 +-
 drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |   5 +-
 drivers/gpu/drm/i915/gt/selftest_lrc.c        |   1 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   4 -
 drivers/gpu/drm/i915/i915_priolist_types.h    |   7 +-
 drivers/gpu/drm/i915/i915_request.c           |  19 +-
 drivers/gpu/drm/i915/i915_scheduler.c         | 518 +++++++++++++-----
 drivers/gpu/drm/i915/i915_scheduler.h         |  18 +-
 drivers/gpu/drm/i915/i915_scheduler_types.h   |  39 ++
 drivers/gpu/drm/i915/selftests/i915_request.c |   1 +
 .../gpu/drm/i915/selftests/i915_scheduler.c   | 136 +++++
 include/uapi/drm/i915_drm.h                   |   1 +
 19 files changed, 810 insertions(+), 286 deletions(-)

Comments

Tvrtko Ursulin Feb. 8, 2021, 2:56 p.m. UTC | #1
On 08/02/2021 10:52, Chris Wilson wrote:
> The first "scheduler" was a topographical sorting of requests into
> priority order. The execution order was deterministic, the earliest
> submitted, highest priority request would be executed first. Priority
> inheritance ensured that inversions were kept at bay, and allowed us to
> dynamically boost priorities (e.g. for interactive pageflips).
> 
> The minimalistic timeslicing scheme was an attempt to introduce fairness
> between long running requests, by evicting the active request at the end
> of a timeslice and moving it to the back of its priority queue (while
> ensuring that dependencies were kept in order). For short running
> requests from many clients of equal priority, the scheme is still very
> much FIFO submission ordering, and as unfair as before.
> 
> To impose fairness, we need an external metric that ensures that clients
> are interspersed, so we don't execute one long chain from client A before
> executing any of client B. This could be imposed by the clients
> themselves by using fences based on an external clock; that is, they only
> submit work for a "frame" at frame-intervals, instead of submitting as
> much work as they are able to. The standard SwapBuffers approach is akin
> to double buffering, where, as one frame is being executed, the next is
> being submitted, such that there is always a maximum of two frames per
> client in the pipeline and so ideally maintains consistent input-output
> latency. Even this scheme exhibits unfairness under load as a single
> client will execute two frames back to back before the next, and with
> enough clients, deadlines will be missed.
> 
> The idea introduced by BFS/MuQSS is that fairness is achieved by
> metering with an external clock. Every request, when it becomes ready to
> execute, is assigned a virtual deadline, and execution order is then
> determined by earliest deadline. Priority is used as a hint, rather than
> strict ordering, where high priority requests have earlier deadlines,
> but not necessarily earlier than outstanding work. Thus work is executed
> in order of 'readiness', with timeslicing to demote long running work.
> 
> The Achilles' heel of this scheduler is its strong preference for
> low-latency and favouring of new queues. Whereas it was easy to dominate
> the old scheduler by flooding it with many requests over a short period
> of time, the new scheduler can be dominated by a 'synchronous' client
> that waits for each of its requests to complete before submitting the
> next. As such a client has no history, it is always considered
> ready-to-run and receives an earlier deadline than the long running
> requests. This is compensated for by refreshing the current execution's
> deadline and by disallowing preemption for timeslice shuffling.
> 
> In contrast, one key advantage of disconnecting the sort key from the
> priority value is that we can freely adjust the deadline to compensate
> for other factors. This is used in conjunction with submitting requests
> ahead-of-schedule that then busywait on the GPU using semaphores. Since
> we don't want to spend a timeslice busywaiting instead of doing real
> work when available, we deprioritise work by giving the semaphore waits
> a later virtual deadline. The priority deboost is applied to semaphore
> workloads after they miss a semaphore wait and a new context is pending.
> The request is then restored to its normal priority once the semaphores
> are signaled so that it is not unfairly penalised under contention by
> remaining at a far future deadline. This is a much improved and cleaner
> version of commit f9e9e9de58c7 ("drm/i915: Prioritise non-busywait
> semaphore workloads").
> 
> To check the impact on throughput (often the downfall of latency
> sensitive schedulers), we used gem_wsim to simulate various transcode
> workloads with different load balancers, and varying the number of
> competing [heterogeneous] clients. On Kabylake gt3e running at fixed
> cpu/gpu clocks,
> 
> +delta%------------------------------------------------------------------+
> |       a                                                                |
> |       a                                                                |
> |       a                                                                |
> |       a                                                                |
> |       aa                                                               |
> |      aaa                                                               |
> |      aaaa                                                              |
> |     aaaaaa                                                             |
> |     aaaaaa                                                             |
> |     aaaaaa   a                a                                        |
> | aa  aaaaaa a a      a  a   aa a       a         a       a             a|
> ||______M__A__________|                                                  |
> +------------------------------------------------------------------------+
>      N           Min           Max        Median          Avg       Stddev
>    108    -4.6326643     47.797855 -0.00069639128     2.116185   7.6764049
> 
> Each point is the relative percentage change in gem_wsim's work-per-second
> score [using the median result of 120 25s runs, the relative change
> computed as (B/A - 1) * 100]; 0 being no change.
> 
> Reviewing the same workloads on Tigerlake,
> 
> +delta%------------------------------------------------------------------+
> |       a                                                                |
> |       a                                                                |
> |       a                                                                |
> |       aa a                                                             |
> |       aaaa                                                             |
> |       aaaa                                                             |
> |    aaaaaaa                                                             |
> |    aaaaaaa                                                             |
> |    aaaaaaa      a   a   aa  a         a                         a      |
> | aaaaaaaaaa a aa a a a aaaa aa   a     a        aa               a     a|
> ||_______M____A_____________|                                            |
> +------------------------------------------------------------------------+
>      N           Min           Max        Median          Avg       Stddev
>    108     -4.258712      46.83081    0.36853159    4.1415662     9.461689
> 
> The expectation is that by deliberately increasing the number of context
> switches to improve fairness between clients, throughput will be
> diminished. What we do see are small fluctuations around no change, with
> the median result being improved throughput. The dramatic improvement is
> from reintroducing the improved no-semaphore boosting, which avoids
> accidentally preventing scheduling of ready workloads due to busy
> spinners (i.e. avoids wasting cycles when there is work to be done).
> 
> We expect to see no change in single client workloads such as games,
> though running multiple applications on a desktop should have reduced
> jitter i.e. smoother input-output latency.
> 
> This scheduler is based on MuQSS by Dr Con Kolivas.
> 
> v2: More commentary, especially around where we reset the deadlines.
> 
> Testcase: igt/gem_exec_fair
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>   drivers/gpu/drm/i915/Kconfig.profile          |  62 +++
>   drivers/gpu/drm/i915/gt/intel_engine_cs.c     |   2 -
>   .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
>   drivers/gpu/drm/i915/gt/intel_engine_pm.c     |   4 +-
>   drivers/gpu/drm/i915/gt/intel_engine_types.h  |  14 -
>   drivers/gpu/drm/i915/gt/intel_engine_user.c   |   1 +
>   .../drm/i915/gt/intel_execlists_submission.c  | 233 ++++----
>   drivers/gpu/drm/i915/gt/selftest_execlists.c  |  30 +-
>   drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |   5 +-
>   drivers/gpu/drm/i915/gt/selftest_lrc.c        |   1 +
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   4 -
>   drivers/gpu/drm/i915/i915_priolist_types.h    |   7 +-
>   drivers/gpu/drm/i915/i915_request.c           |  19 +-
>   drivers/gpu/drm/i915/i915_scheduler.c         | 518 +++++++++++++-----
>   drivers/gpu/drm/i915/i915_scheduler.h         |  18 +-
>   drivers/gpu/drm/i915/i915_scheduler_types.h   |  39 ++
>   drivers/gpu/drm/i915/selftests/i915_request.c |   1 +
>   .../gpu/drm/i915/selftests/i915_scheduler.c   | 136 +++++
>   include/uapi/drm/i915_drm.h                   |   1 +
>   19 files changed, 810 insertions(+), 286 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> index 35bbe2b80596..f1d009906f71 100644
> --- a/drivers/gpu/drm/i915/Kconfig.profile
> +++ b/drivers/gpu/drm/i915/Kconfig.profile
> @@ -1,3 +1,65 @@
> +choice
> +	prompt "Preferred scheduler"
> +	default DRM_I915_SCHED_VIRTUAL_DEADLINE
> +	help
> +	  Select the preferred method to decide the order of execution.
> +
> +	  The scheduler is used for two purposes. First to defer unready
> +	  jobs to not block execution of independent ready clients, so
> +	  preventing GPU stalls while work waits for other tasks. The second
> +	  purpose is to decide which task to run next, as well as decide
> +	  if that task should preempt the currently running task, or if
> +	  the current task has exceeded its allotment of GPU time and should
> +	  be replaced.
> +
> +	config DRM_I915_SCHED_FIFO
> +	bool "FIFO"
> +	help
> +	  No task reordering, tasks are executed in order of readiness.
> +	  First in, first out.
> +
> +	  Unready tasks do not block execution of other, independent clients.
> +	  A client will not be scheduled for execution until all of its
> +	  prerequisite work has completed.
> +
> +	  This disables the scheduler and puts it into a pass-through mode.
> +
> +	config DRM_I915_SCHED_PRIORITY
> +	bool "Priority"
> +	help
> +	  Strict priority ordering, equal priority tasks are executed
> +	  in order of readiness. Clients are liable to starve other clients,
> +	  causing uneven execution and excess task latency. High priority
> +	  clients will preempt lower priority clients and will run
> +	  uninterrupted.
> +
> +	  Note that interactive desktops will implicitly perform priority
> +	  boosting to minimise frame jitter.
> +
> +	config DRM_I915_SCHED_VIRTUAL_DEADLINE
> +	bool "Virtual Deadline"
> +	help
> +	  A fair scheduler based on MuQSS with priority-hinting.
> +
> +	  When a task is ready for execution, it is given a quota (from the
> +	  engine's timeslice) and a virtual deadline. The virtual deadline is
> +	  derived from the current time and the timeslice scaled by the
> +	  task's priority. Higher priority tasks are given an earlier
> +	  deadline and receive a large portion of the execution bandwidth.
> +
> +	  Requests are then executed in order of deadline completion.
> +	  Requests with earlier deadlines and higher priority than currently
> +	  executing on the engine will preempt the active task.
> +
> +endchoice
> +
> +config DRM_I915_SCHED
> +	int
> +	default 2 if DRM_I915_SCHED_VIRTUAL_DEADLINE
> +	default 1 if DRM_I915_SCHED_PRIORITY
> +	default 0 if DRM_I915_SCHED_FIFO
> +	default -1
> +
>   config DRM_I915_FENCE_TIMEOUT
>   	int "Timeout for unsignaled foreign fences (ms, jiffy granularity)"
>   	default 10000 # milliseconds
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index da2447f18daa..7d34bf03670b 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -579,8 +579,6 @@ void intel_engine_init_execlists(struct intel_engine_cs *engine)
>   	memset(execlists->pending, 0, sizeof(execlists->pending));
>   	execlists->active =
>   		memset(execlists->inflight, 0, sizeof(execlists->inflight));
> -
> -	execlists->queue_priority_hint = INT_MIN;
>   }
>   
>   static void cleanup_status_page(struct intel_engine_cs *engine)
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> index 5ed263f36f93..1d0e7daa6285 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> @@ -313,6 +313,7 @@ static int __intel_engine_pulse(struct intel_engine_cs *engine)
>   	if (IS_ERR(rq))
>   		return PTR_ERR(rq);
>   
> +	rq->sched.deadline = 0;
>   	__set_bit(I915_FENCE_FLAG_SENTINEL, &rq->fence.flags);
>   
>   	heartbeat_commit(rq, &attr);
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> index 27d9d17b35cb..ef5064ea54e5 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> @@ -211,6 +211,7 @@ static bool switch_to_kernel_context(struct intel_engine_cs *engine)
>   	i915_request_add_active_barriers(rq);
>   
>   	/* Install ourselves as a preemption barrier */
> +	rq->sched.deadline = 0;
>   	rq->sched.attr.priority = I915_PRIORITY_BARRIER;
>   	if (likely(!__i915_request_commit(rq))) { /* engine should be idle! */
>   		/*
> @@ -271,9 +272,6 @@ static int __engine_park(struct intel_wakeref *wf)
>   	intel_engine_park_heartbeat(engine);
>   	intel_breadcrumbs_park(engine->breadcrumbs);
>   
> -	/* Must be reset upon idling, or we may miss the busy wakeup. */
> -	GEM_BUG_ON(engine->execlists.queue_priority_hint != INT_MIN);
> -
>   	if (engine->park)
>   		engine->park(engine);
>   
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> index 08bddc5263aa..d1024e8717e1 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> @@ -223,20 +223,6 @@ struct intel_engine_execlists {
>   	 */
>   	unsigned int port_mask;
>   
> -	/**
> -	 * @queue_priority_hint: Highest pending priority.
> -	 *
> -	 * When we add requests into the queue, or adjust the priority of
> -	 * executing requests, we compute the maximum priority of those
> -	 * pending requests. We can then use this value to determine if
> -	 * we need to preempt the executing requests to service the queue.
> -	 * However, since the we may have recorded the priority of an inflight
> -	 * request we wanted to preempt but since completed, at the time of
> -	 * dequeuing the priority hint may no longer may match the highest
> -	 * available request priority.
> -	 */
> -	int queue_priority_hint;
> -
>   	struct rb_root_cached virtual;
>   
>   	/**
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_user.c b/drivers/gpu/drm/i915/gt/intel_engine_user.c
> index 3fab439ba22b..92632afa52ae 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_user.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_user.c
> @@ -102,6 +102,7 @@ static void set_scheduler_caps(struct drm_i915_private *i915)
>   #define MAP(x, y) { I915_SCHED_##x, ilog2(I915_SCHEDULER_CAP_##y) }
>   		MAP(ACTIVE_BIT, ENABLED),
>   		MAP(PRIORITY_BIT, PRIORITY),
> +		MAP(DEADLINE_BIT, FAIR),
>   		MAP(TIMESLICE_BIT, TIMESLICING),
>   #undef MAP
>   	};
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> index 4a0258347c10..e249b1423309 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> @@ -180,7 +180,7 @@ struct virtual_engine {
>   	 */
>   	struct ve_node {
>   		struct rb_node rb;
> -		int prio;
> +		u64 deadline;
>   	} nodes[I915_NUM_ENGINES];
>   
>   	/*
> @@ -256,25 +256,12 @@ static void ring_set_paused(const struct intel_engine_cs *engine, int state)
>   
>   static int rq_prio(const struct i915_request *rq)
>   {
> -	return READ_ONCE(rq->sched.attr.priority);
> +	return rq->sched.attr.priority;
>   }
>   
> -static int effective_prio(const struct i915_request *rq)
> +static u64 rq_deadline(const struct i915_request *rq)
>   {
> -	int prio = rq_prio(rq);
> -
> -	/*
> -	 * If this request is special and must not be interrupted at any
> -	 * cost, so be it. Note we are only checking the most recent request
> -	 * in the context and so may be masking an earlier vip request. It
> -	 * is hoped that under the conditions where nopreempt is used, this
> -	 * will not matter (i.e. all requests to that context will be
> -	 * nopreempt for as long as desired).
> -	 */
> -	if (i915_request_has_nopreempt(rq))
> -		prio = I915_PRIORITY_UNPREEMPTABLE;
> -
> -	return prio;
> +	return rq->sched.deadline;
>   }
>   
>   static struct i915_request *first_request(const struct i915_sched *se)
> @@ -289,62 +276,62 @@ static struct i915_request *first_request(const struct i915_sched *se)
>   					sched.link);
>   }
>   
> -static int queue_prio(const struct i915_sched *se)
> +static struct i915_request *first_virtual(const struct intel_engine_cs *engine)
>   {
> -	struct i915_request *rq;
> +	struct rb_node *rb;
>   
> -	rq = first_request(se);
> -	if (!rq)
> -		return INT_MIN;
> +	rb = rb_first_cached(&engine->execlists.virtual);
> +	if (!rb)
> +		return NULL;
>   
> -	return rq_prio(rq);
> +	return READ_ONCE(rb_entry(rb,
> +				  struct virtual_engine,
> +				  nodes[engine->id].rb)->request);
>   }
>   
> -static int virtual_prio(const struct intel_engine_execlists *el)
> +static const struct i915_request *
> +next_elsp_request(const struct i915_sched *se, const struct i915_request *rq)
>   {
> -	struct rb_node *rb = rb_first_cached(&el->virtual);
> +	if (i915_sched_is_last_request(se, rq))
> +		return NULL;
>   
> -	return rb ? rb_entry(rb, struct ve_node, rb)->prio : INT_MIN;
> +	return list_next_entry(rq, sched.link);
>   }
>   
> -static bool need_preempt(struct intel_engine_cs *engine,
> +static bool
> +dl_before(const struct i915_request *next, const struct i915_request *prev)
> +{
> +	return !prev || (next && rq_deadline(next) < rq_deadline(prev));
> +}
> +
> +static bool need_preempt(const struct intel_engine_cs *engine,
>   			 const struct i915_request *rq)
>   {
>   	const struct i915_sched *se = &engine->sched;
> -	int last_prio;
> +	const struct i915_request *first = NULL;
> +	const struct i915_request *next;
>   
>   	if (!i915_sched_use_busywait(se))
>   		return false;
>   
>   	/*
> -	 * Check if the current priority hint merits a preemption attempt.
> -	 *
> -	 * We record the highest value priority we saw during rescheduling
> -	 * prior to this dequeue, therefore we know that if it is strictly
> -	 * less than the current tail of ESLP[0], we do not need to force
> -	 * a preempt-to-idle cycle.
> -	 *
> -	 * However, the priority hint is a mere hint that we may need to
> -	 * preempt. If that hint is stale or we may be trying to preempt
> -	 * ourselves, ignore the request.
> -	 *
> -	 * More naturally we would write
> -	 *      prio >= max(0, last);
> -	 * except that we wish to prevent triggering preemption at the same
> -	 * priority level: the task that is running should remain running
> -	 * to preserve FIFO ordering of dependencies.
> +	 * If this request is special and must not be interrupted at any
> +	 * cost, so be it. Note we are only checking the most recent request
> +	 * in the context and so may be masking an earlier vip request. It
> +	 * is hoped that under the conditions where nopreempt is used, this
> +	 * will not matter (i.e. all requests to that context will be
> +	 * nopreempt for as long as desired).
>   	 */
> -	last_prio = max(effective_prio(rq), I915_PRIORITY_NORMAL - 1);
> -	if (engine->execlists.queue_priority_hint <= last_prio)
> +	if (i915_request_has_nopreempt(rq))
>   		return false;
>   
>   	/*
>   	 * Check against the first request in ELSP[1], it will, thanks to the
>   	 * power of PI, be the highest priority of that context.
>   	 */
> -	if (!list_is_last(&rq->sched.link, &se->requests) &&
> -	    rq_prio(list_next_entry(rq, sched.link)) > last_prio)
> -		return true;
> +	next = next_elsp_request(se, rq);
> +	if (dl_before(next, first))

Here first is always NULL so dl_before always returns true, meaning it 
appears redundant to call it.
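
For reference, a standalone toy reproducing that observation (the struct is
illustrative; dl_before() has the same shape as the hunk above): with
prev == NULL the !prev clause short-circuits to true for any next, so the
guarded assignment is effectively unconditional.

  #include <stdbool.h>
  #include <stdio.h>

  struct toy_req { unsigned long long deadline; };

  static unsigned long long rq_deadline(const struct toy_req *rq)
  {
          return rq->deadline;
  }

  /* Same shape as the patch's dl_before(). */
  static bool dl_before(const struct toy_req *next, const struct toy_req *prev)
  {
          return !prev || (next && rq_deadline(next) < rq_deadline(prev));
  }

  int main(void)
  {
          struct toy_req next = { .deadline = 100 };

          /* With first still NULL, "if (dl_before(next, first)) first = next;"
           * reduces to an unconditional "first = next;". */
          printf("%d %d\n", dl_before(&next, NULL), dl_before(NULL, NULL));
          return 0;  /* prints "1 1" */
  }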

> +		first = next;
>   
>   	/*
>   	 * If the inflight context did not trigger the preemption, then maybe
> @@ -356,8 +343,31 @@ static bool need_preempt(struct intel_engine_cs *engine,
>   	 * ELSP[0] or ELSP[1] as, thanks again to PI, if it was the same
>   	 * context, it's priority would not exceed ELSP[0] aka last_prio.
>   	 */
> -	return max(virtual_prio(&engine->execlists),
> -		   queue_prio(se)) > last_prio;
> +	next = first_request(se);
> +	if (dl_before(next, first))
> +		first = next;
> +
> +	next = first_virtual(engine);
> +	if (dl_before(next, first))
> +		first = next;
> +
> +	if (!dl_before(first, rq))
> +		return false;

Ends up picking the earliest deadline among the candidates: elsp[1] (or
maybe the next in the context, depending on the coalescing criteria), the
first in the priolist, and the first virtual.

Virtual has a separate queue so that's understandable, but can "elsp[1]"
really have an earlier deadline than first_request() (head of the priolist)?

> +
> +	/*
> +	 * While a request may have been queued that has an earlier deadline
> +	 * than is currently running, we only allow it to perform an urgent
> +	 * preemption if it also has higher priority. The cost of frequently
> +	 * switching between contexts is noticeable, so we try to keep
> +	 * the deadline shuffling only to timeslice boundaries.
> +	 */
> +	ENGINE_TRACE(engine,
> +		     "preempt for first=%llx:%llu, dl=%llu, prio=%d?\n",
> +		     first->fence.context,
> +		     first->fence.seqno,
> +		     rq_deadline(first),
> +		     rq_prio(first));
> +	return rq_prio(first) > max(rq_prio(rq), I915_PRIORITY_NORMAL - 1);

Okay.

>   }
>   
>   __maybe_unused static bool
> @@ -374,7 +384,15 @@ assert_priority_queue(const struct i915_request *prev,
>   	if (i915_request_is_active(prev))
>   		return true;
>   
> -	return rq_prio(prev) >= rq_prio(next);
> +	if (rq_deadline(prev) <= rq_deadline(next))
> +		return true;
> +
> +	ENGINE_TRACE(prev->engine,
> +		     "next %llx:%lld dl %lld is before prev %llx:%lld dl %lld\n",
> +		     next->fence.context, next->fence.seqno, rq_deadline(next),
> +		     prev->fence.context, prev->fence.seqno, rq_deadline(prev));
> +
> +	return false;
>   }
>   
>   static void
> @@ -555,10 +573,25 @@ static void __execlists_schedule_out(struct i915_request * const rq,
>   	/*
>   	 * If we have just completed this context, the engine may now be
>   	 * idle and we want to re-enter powersaving.
> +	 *
> +	 * If the context is still active, update the deadline on the next
> +	 * request as we submitted it much earlier with an estimation based
> +	 * on this request and all those before it consuming their whole budget.
> +	 * Since the next request is ready but may have a deadline set far in
> +	 * the future, we will prefer any new client before executing this
> +	 * context again. If the other clients are submitting synchronous
> +	 * workloads, each submission appears as a fresh piece of work and ready
> +	 * to run; each time they will receive a deadline that is likely earlier
> +	 * than the accumulated deadline of this context. So we re-evaluate this
> +	 * context's deadline and put it on an equal footing with the
> +	 * synchronous clients.
>   	 */
> -	if (intel_timeline_is_last(ce->timeline, rq) &&
> -	    __i915_request_is_complete(rq))
> -		intel_engine_add_retire(engine, ce->timeline);
> +	if (__i915_request_is_complete(rq)) {
> +		if (!intel_timeline_is_last(ce->timeline, rq))
> +			i915_request_update_deadline(list_next_entry(rq, link));
> +		else
> +			intel_engine_add_retire(engine, ce->timeline);
> +	}
>   
>   	ccid = ce->lrc.ccid;
>   	ccid >>= GEN11_SW_CTX_ID_SHIFT - 32;
> @@ -668,14 +701,14 @@ dump_port(char *buf, int buflen, const char *prefix, struct i915_request *rq)
>   	if (!rq)
>   		return "";
>   
> -	snprintf(buf, buflen, "%sccid:%x %llx:%lld%s prio %d",
> +	snprintf(buf, buflen, "%sccid:%x %llx:%lld%s dl:%llu",
>   		 prefix,
>   		 rq->context->lrc.ccid,
>   		 rq->fence.context, rq->fence.seqno,
>   		 __i915_request_is_complete(rq) ? "!" :
>   		 __i915_request_has_started(rq) ? "*" :
>   		 "",
> -		 rq_prio(rq));
> +		 rq_deadline(rq));
>   
>   	return buf;
>   }
> @@ -1197,11 +1230,11 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
>   	if (last) {
>   		if (need_preempt(engine, last)) {
>   			ENGINE_TRACE(engine,
> -				     "preempting last=%llx:%lld, prio=%d, hint=%d\n",
> +				     "preempting last=%llx:%llu, dl=%llu, prio=%d\n",
>   				     last->fence.context,
>   				     last->fence.seqno,
> -				     last->sched.attr.priority,
> -				     execlists->queue_priority_hint);
> +				     rq_deadline(last),
> +				     rq_prio(last));
>   			record_preemption(execlists);
>   
>   			/*
> @@ -1223,11 +1256,11 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
>   			last = NULL;
>   		} else if (timeslice_expired(engine, last)) {
>   			ENGINE_TRACE(engine,
> -				     "expired:%s last=%llx:%lld, prio=%d, hint=%d, yield?=%s\n",
> +				     "expired:%s last=%llx:%llu, deadline=%llu, now=%llu, yield?=%s\n",
>   				     yesno(timer_expired(&execlists->timer)),
>   				     last->fence.context, last->fence.seqno,
> -				     rq_prio(last),
> -				     execlists->queue_priority_hint,
> +				     rq_deadline(last),
> +				     i915_sched_to_ticks(ktime_get()),
>   				     yesno(timeslice_yield(execlists, last)));
>   
>   			/*
> @@ -1298,7 +1331,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
>   		GEM_BUG_ON(rq->engine != &ve->base);
>   		GEM_BUG_ON(rq->context != &ve->context);
>   
> -		if (unlikely(rq_prio(rq) < queue_prio(se))) {
> +		if (!dl_before(rq, first_request(se))) {
>   			spin_unlock(&ve->base.sched.lock);
>   			break;
>   		}
> @@ -1310,16 +1343,15 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
>   		}
>   
>   		ENGINE_TRACE(engine,
> -			     "virtual rq=%llx:%lld%s, new engine? %s\n",
> +			     "virtual rq=%llx:%lld%s, dl %llx, new engine? %s\n",
>   			     rq->fence.context,
>   			     rq->fence.seqno,
>   			     __i915_request_is_complete(rq) ? "!" :
>   			     __i915_request_has_started(rq) ? "*" :
>   			     "",
> +			     rq_deadline(rq),
>   			     yesno(engine != ve->siblings[0]));
> -
>   		WRITE_ONCE(ve->request, NULL);
> -		WRITE_ONCE(ve->base.execlists.queue_priority_hint, INT_MIN);
>   
>   		rb = &ve->nodes[engine->id].rb;
>   		rb_erase_cached(rb, &execlists->virtual);
> @@ -1407,6 +1439,9 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
>   			if (rq->execution_mask != engine->mask)
>   				goto done;
>   
> +			if (unlikely(dl_before(first_virtual(engine), rq)))
> +				goto done;
> +
>   			/*
>   			 * If GVT overrides us we only ever submit
>   			 * port[0], leaving port[1] empty. Note that we
> @@ -1440,24 +1475,6 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
>   	}
>   done:
>   	*port++ = i915_request_get(last);
> -
> -	/*
> -	 * Here be a bit of magic! Or sleight-of-hand, whichever you prefer.
> -	 *
> -	 * We choose the priority hint such that if we add a request of greater
> -	 * priority than this, we kick the submission tasklet to decide on
> -	 * the right order of submitting the requests to hardware. We must
> -	 * also be prepared to reorder requests as they are in-flight on the
> -	 * HW. We derive the priority hint then as the first "hole" in
> -	 * the HW submission ports and if there are no available slots,
> -	 * the priority of the lowest executing request, i.e. last.
> -	 *
> -	 * When we do receive a higher priority request ready to run from the
> -	 * user, see queue_request(), the priority hint is bumped to that
> -	 * request triggering preemption on the next dequeue (or subsequent
> -	 * interrupt for secondary ports).
> -	 */
> -	execlists->queue_priority_hint = pl->priority;
>   	spin_unlock(&se->lock);
>   
>   	/*
> @@ -2653,15 +2670,6 @@ static void execlists_reset_rewind(struct intel_engine_cs *engine, bool stalled)
>   	rcu_read_unlock();
>   }
>   
> -static void nop_submission_tasklet(struct tasklet_struct *t)
> -{
> -	struct intel_engine_cs * const engine =
> -		from_tasklet(engine, t, sched.tasklet);
> -
> -	/* The driver is wedged; don't process any more events. */
> -	WRITE_ONCE(engine->execlists.queue_priority_hint, INT_MIN);
> -}
> -
>   static void execlists_reset_cancel(struct intel_engine_cs *engine)
>   {
>   	struct intel_engine_execlists * const execlists = &engine->execlists;
> @@ -2710,17 +2718,10 @@ static void execlists_reset_cancel(struct intel_engine_cs *engine)
>   				i915_request_put(rq);
>   			}
>   			i915_request_put(rq);
> -
> -			ve->base.execlists.queue_priority_hint = INT_MIN;
>   		}
>   		spin_unlock(&ve->base.sched.lock);
>   	}
>   
> -	execlists->queue_priority_hint = INT_MIN;
> -
> -	GEM_BUG_ON(__tasklet_is_enabled(&se->tasklet));
> -	se->tasklet.callback = nop_submission_tasklet;
> -
>   	spin_unlock_irqrestore(&se->lock, flags);
>   	rcu_read_unlock();
>   
> @@ -2831,7 +2832,6 @@ static bool can_preempt(struct intel_engine_cs *engine)
>   static void execlists_set_default_submission(struct intel_engine_cs *engine)
>   {
>   	engine->sched.submit_request = i915_request_enqueue;
> -	engine->sched.tasklet.callback = execlists_submission_tasklet;
>   }
>   
>   static void execlists_shutdown(struct intel_engine_cs *engine)
> @@ -2957,7 +2957,7 @@ static void init_execlists(struct intel_engine_cs *engine)
>   	engine->sched.show = execlists_show;
>   	tasklet_setup(&engine->sched.tasklet, execlists_submission_tasklet);
>   
> -	i915_sched_select_mode(&engine->sched, I915_SCHED_MODE_PRIORITY);
> +	i915_sched_select_mode(&engine->sched, I915_SCHED_MODE_DEADLINE);
>   
>   	if (IS_ACTIVE(CONFIG_DRM_I915_TIMESLICE_DURATION) &&
>   	    intel_engine_has_preemption(engine))
> @@ -3193,7 +3193,8 @@ static const struct intel_context_ops virtual_context_ops = {
>   	.destroy = virtual_context_destroy,
>   };
>   
> -static intel_engine_mask_t virtual_submission_mask(struct virtual_engine *ve)
> +static intel_engine_mask_t
> +virtual_submission_mask(struct virtual_engine *ve, u64 *deadline)
>   {
>   	struct i915_request *rq;
>   	intel_engine_mask_t mask;
> @@ -3210,9 +3211,11 @@ static intel_engine_mask_t virtual_submission_mask(struct virtual_engine *ve)
>   		mask = ve->siblings[0]->mask;
>   	}
>   
> -	ENGINE_TRACE(&ve->base, "rq=%llx:%lld, mask=%x, prio=%d\n",
> +	*deadline = rq_deadline(rq);
> +
> +	ENGINE_TRACE(&ve->base, "rq=%llx:%llu, mask=%x, dl=%llu\n",
>   		     rq->fence.context, rq->fence.seqno,
> -		     mask, ve->base.execlists.queue_priority_hint);
> +		     mask, *deadline);
>   
>   	return mask;
>   }
> @@ -3221,12 +3224,12 @@ static void virtual_submission_tasklet(struct tasklet_struct *t)
>   {
>   	struct virtual_engine * const ve =
>   		from_tasklet(ve, t, base.sched.tasklet);
> -	const int prio = READ_ONCE(ve->base.execlists.queue_priority_hint);
>   	intel_engine_mask_t mask;
>   	unsigned int n;
> +	u64 deadline;
>   
>   	rcu_read_lock();
> -	mask = virtual_submission_mask(ve);
> +	mask = virtual_submission_mask(ve, &deadline);
>   	rcu_read_unlock();
>   	if (unlikely(!mask))
>   		return;
> @@ -3260,7 +3263,8 @@ static void virtual_submission_tasklet(struct tasklet_struct *t)
>   			 */
>   			first = rb_first_cached(&sibling->execlists.virtual) ==
>   				&node->rb;
> -			if (prio == node->prio || (prio > node->prio && first))
> +			if (deadline == node->deadline ||
> +			    (deadline < node->deadline && first))
>   				goto submit_engine;
>   
>   			rb_erase_cached(&node->rb, &sibling->execlists.virtual);
> @@ -3274,7 +3278,7 @@ static void virtual_submission_tasklet(struct tasklet_struct *t)
>   
>   			rb = *parent;
>   			other = rb_entry(rb, typeof(*other), rb);
> -			if (prio > other->prio) {
> +			if (deadline < other->deadline) {
>   				parent = &rb->rb_left;
>   			} else {
>   				parent = &rb->rb_right;
> @@ -3289,8 +3293,8 @@ static void virtual_submission_tasklet(struct tasklet_struct *t)
>   
>   submit_engine:
>   		GEM_BUG_ON(RB_EMPTY_NODE(&node->rb));
> -		node->prio = prio;
> -		if (first && prio > sibling->execlists.queue_priority_hint)
> +		node->deadline = deadline;
> +		if (first)
>   			i915_sched_kick(se);
>   
>   unlock_engine:
> @@ -3327,7 +3331,9 @@ static void virtual_submit_request(struct i915_request *rq)
>   		i915_request_put(ve->request);
>   	}
>   
> -	ve->base.execlists.queue_priority_hint = rq_prio(rq);
> +	rq->sched.deadline =
> +		min(rq->sched.deadline,
> +		    i915_scheduler_next_virtual_deadline(rq_prio(rq)));
>   	ve->request = i915_request_get(rq);
>   
>   	GEM_BUG_ON(!list_empty(virtual_queue(ve)));
> @@ -3429,7 +3435,6 @@ intel_execlists_create_virtual(struct intel_engine_cs **siblings,
>   	ve->base.bond_execute = virtual_bond_execute;
>   
>   	INIT_LIST_HEAD(virtual_queue(ve));
> -	ve->base.execlists.queue_priority_hint = INT_MIN;
>   
>   	intel_context_init(&ve->context, &ve->base);
>   
> diff --git a/drivers/gpu/drm/i915/gt/selftest_execlists.c b/drivers/gpu/drm/i915/gt/selftest_execlists.c
> index be99fbd7cfab..112a09aa0d8d 100644
> --- a/drivers/gpu/drm/i915/gt/selftest_execlists.c
> +++ b/drivers/gpu/drm/i915/gt/selftest_execlists.c
> @@ -868,7 +868,7 @@ semaphore_queue(struct intel_engine_cs *engine, struct i915_vma *vma, int idx)
>   static int
>   release_queue(struct intel_engine_cs *engine,
>   	      struct i915_vma *vma,
> -	      int idx, int prio)
> +	      int idx, u64 deadline)
>   {
>   	struct i915_request *rq;
>   	u32 *cs;
> @@ -893,10 +893,7 @@ release_queue(struct intel_engine_cs *engine,
>   	i915_request_get(rq);
>   	i915_request_add(rq);
>   
> -	local_bh_disable();
> -	i915_request_set_priority(rq, prio);
> -	local_bh_enable(); /* kick tasklet */
> -
> +	i915_request_set_deadline(rq, deadline);
>   	i915_request_put(rq);
>   
>   	return 0;
> @@ -910,6 +907,7 @@ slice_semaphore_queue(struct intel_engine_cs *outer,
>   	struct intel_engine_cs *engine;
>   	struct i915_request *head;
>   	enum intel_engine_id id;
> +	long timeout;
>   	int err, i, n = 0;
>   
>   	head = semaphore_queue(outer, vma, n++);
> @@ -933,12 +931,16 @@ slice_semaphore_queue(struct intel_engine_cs *outer,
>   		}
>   	}
>   
> -	err = release_queue(outer, vma, n, I915_PRIORITY_BARRIER);
> +	err = release_queue(outer, vma, n, 0);
>   	if (err)
>   		goto out;
>   
> -	if (i915_request_wait(head, 0,
> -			      2 * outer->gt->info.num_engines * (count + 2) * (count + 3)) < 0) {
> +	/* Expected number of pessimal slices required */
> +	timeout = outer->gt->info.num_engines * (count + 2) * (count + 3);
> +	timeout *= 4; /* safety factor, including bucketing */
> +	timeout += HZ / 2; /* and include the request completion */
> +
> +	if (i915_request_wait(head, 0, timeout) < 0) {
>   		pr_err("%s: Failed to slice along semaphore chain of length (%d, %d)!\n",
>   		       outer->name, count, n);
>   		GEM_TRACE_DUMP();
> @@ -1043,6 +1045,8 @@ create_rewinder(struct intel_context *ce,
>   		err = i915_request_await_dma_fence(rq, &wait->fence);
>   		if (err)
>   			goto err;
> +
> +		i915_request_set_deadline(rq, rq_deadline(wait));
>   	}
>   
>   	cs = intel_ring_begin(rq, 14);
> @@ -1319,6 +1323,7 @@ static int live_timeslice_queue(void *arg)
>   			goto err_heartbeat;
>   		}
>   		i915_request_set_priority(rq, I915_PRIORITY_MAX);
> +		i915_request_set_deadline(rq, 0);
>   		err = wait_for_submit(engine, rq, HZ / 2);
>   		if (err) {
>   			pr_err("%s: Timed out trying to submit semaphores\n",
> @@ -1341,10 +1346,9 @@ static int live_timeslice_queue(void *arg)
>   		}
>   
>   		GEM_BUG_ON(i915_request_completed(rq));
> -		GEM_BUG_ON(execlists_active(&engine->execlists) != rq);
>   
>   		/* Queue: semaphore signal, matching priority as semaphore */
> -		err = release_queue(engine, vma, 1, effective_prio(rq));
> +		err = release_queue(engine, vma, 1, rq_deadline(rq));
>   		if (err)
>   			goto err_rq;
>   
> @@ -1455,6 +1459,7 @@ static int live_timeslice_nopreempt(void *arg)
>   			goto out_spin;
>   		}
>   
> +		rq->sched.deadline = 0;
>   		rq->sched.attr.priority = I915_PRIORITY_BARRIER;
>   		i915_request_get(rq);
>   		i915_request_add(rq);
> @@ -1818,6 +1823,7 @@ static int live_late_preempt(void *arg)
>   
>   	/* Make sure ctx_lo stays before ctx_hi until we trigger preemption. */
>   	ctx_lo->sched.priority = 1;
> +	ctx_hi->sched.priority = I915_PRIORITY_MIN;
>   
>   	for_each_engine(engine, gt, id) {
>   		struct igt_live_test t;
> @@ -2985,6 +2991,9 @@ static int live_preempt_gang(void *arg)
>   		while (rq) { /* wait for each rq from highest to lowest prio */
>   			struct i915_request *n = list_next_entry(rq, mock.link);
>   
> +			/* With deadlines, no strict priority ordering */
> +			i915_request_set_deadline(rq, 0);
> +
>   			if (err == 0 && i915_request_wait(rq, 0, HZ / 5) < 0) {
>   				struct drm_printer p =
>   					drm_info_printer(engine->i915->drm.dev);
> @@ -3207,6 +3216,7 @@ static int preempt_user(struct intel_engine_cs *engine,
>   	i915_request_add(rq);
>   
>   	i915_request_set_priority(rq, I915_PRIORITY_MAX);
> +	i915_request_set_deadline(rq, 0);
>   
>   	if (i915_request_wait(rq, 0, HZ / 2) < 0)
>   		err = -ETIME;
> diff --git a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
> index cdb0ceff3be1..5323fd56efd6 100644
> --- a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
> +++ b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
> @@ -1010,7 +1010,10 @@ static int __igt_reset_engines(struct intel_gt *gt,
>   					break;
>   				}
>   
> -				if (i915_request_wait(rq, 0, HZ / 5) < 0) {
> +				/* With deadlines, no strict priority */
> +				i915_request_set_deadline(rq, 0);
> +
> +				if (i915_request_wait(rq, 0, HZ / 2) < 0) {
>   					struct drm_printer p =
>   						drm_info_printer(gt->i915->drm.dev);
>   
> diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c
> index 6d73add47109..b7dd5646c882 100644
> --- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
> +++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
> @@ -1257,6 +1257,7 @@ poison_registers(struct intel_context *ce,
>   
>   	intel_ring_advance(rq, cs);
>   
> +	rq->sched.deadline = 0;
>   	rq->sched.attr.priority = I915_PRIORITY_BARRIER;
>   err_rq:
>   	i915_request_add(rq);
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index c16393df42a0..79205e9a84ba 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -216,7 +216,6 @@ static void __guc_dequeue(struct intel_engine_cs *engine)
>   		last = rq;
>   	}
>   done:
> -	execlists->queue_priority_hint = pl->priority;
>   	if (submit) {
>   		*port = schedule_in(last, port - execlists->inflight);
>   		*++port = NULL;
> @@ -322,7 +321,6 @@ static void guc_reset_rewind(struct intel_engine_cs *engine, bool stalled)
>   
>   static void guc_reset_cancel(struct intel_engine_cs *engine)
>   {
> -	struct intel_engine_execlists * const execlists = &engine->execlists;
>   	struct i915_sched *se = intel_engine_get_scheduler(engine);
>   	unsigned long flags;
>   
> @@ -346,8 +344,6 @@ static void guc_reset_cancel(struct intel_engine_cs *engine)
>   
>   	__i915_sched_cancel_queue(se);
>   
> -	execlists->queue_priority_hint = INT_MIN;
> -
>   	spin_unlock_irqrestore(&se->lock, flags);
>   	intel_engine_signal_breadcrumbs(engine);
>   }
> diff --git a/drivers/gpu/drm/i915/i915_priolist_types.h b/drivers/gpu/drm/i915/i915_priolist_types.h
> index ee7482b9c813..542b47078104 100644
> --- a/drivers/gpu/drm/i915/i915_priolist_types.h
> +++ b/drivers/gpu/drm/i915/i915_priolist_types.h
> @@ -22,6 +22,8 @@ enum {
>   
>   	/* Interactive workload, scheduled for immediate pageflipping */
>   	I915_PRIORITY_DISPLAY,
> +
> +	__I915_PRIORITY_KERNEL__
>   };
>   
>   /* Smallest priority value that cannot be bumped. */
> @@ -35,8 +37,7 @@ enum {
>    * i.e. nothing can have higher priority and force us to usurp the
>    * active request.
>    */
> -#define I915_PRIORITY_UNPREEMPTABLE INT_MAX
> -#define I915_PRIORITY_BARRIER (I915_PRIORITY_UNPREEMPTABLE - 1)
> +#define I915_PRIORITY_BARRIER INT_MAX
>   
>   /*
>    * The slab returns power-of-two chunks of memory, so fill out the
> @@ -82,7 +83,7 @@ enum {
>    */
>   struct i915_priolist {
>   	struct list_head requests;
> -	int priority;
> +	u64 deadline;
>   
>   	int level;
>   	struct i915_priolist *next[I915_PRIOLIST_HEIGHT];
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index e7b4c4bc41a6..ce828dc73402 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -467,7 +467,7 @@ bool __i915_request_submit(struct i915_request *request)
>   	struct i915_sched *se = intel_engine_get_scheduler(engine);
>   	bool result = false;
>   
> -	RQ_TRACE(request, "\n");
> +	RQ_TRACE(request, "dl %llu\n", request->sched.deadline);
>   
>   	lockdep_assert_held(&se->lock);
>   
> @@ -650,6 +650,12 @@ semaphore_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
>   
>   	switch (state) {
>   	case FENCE_COMPLETE:
> +		/*
> +		 * The request is now ready to run; re-evaluate its deadline
> +		 * to remove the semaphore deprioritisation and to assign
> +		 * a deadline relative to its point-of-readiness [now].
> +		 */
> +		i915_request_update_deadline(rq);
>   		break;
>   
>   	case FENCE_FREE:
> @@ -1810,14 +1816,15 @@ long i915_request_wait(struct i915_request *rq,
>   	return timeout;
>   }
>   
> -static int print_sched_attr(const struct i915_sched_attr *attr,
> -			    char *buf, int x, int len)
> +static int print_sched(const struct i915_sched_node *node,
> +		       char *buf, int x, int len)
>   {
> -	if (attr->priority == I915_PRIORITY_INVALID)
> +	if (node->attr.priority == I915_PRIORITY_INVALID)
>   		return x;
>   
>   	x += snprintf(buf + x, len - x,
> -		      " prio=%d", attr->priority);
> +		      " prio=%d, dl=%llu",
> +		      node->attr.priority, node->deadline);
>   
>   	return x;
>   }
> @@ -1903,7 +1910,7 @@ void i915_request_show(struct drm_printer *m,
>   	 *    - the request has been temporarily suspended from execution
>   	 */
>   
> -	x = print_sched_attr(&rq->sched.attr, buf, x, sizeof(buf));
> +	x = print_sched(&rq->sched, buf, x, sizeof(buf));
>   
>   	drm_printf(m, "%s%.*s%c %llx:%lld%s%s %s @ %dms: %s\n",
>   		   prefix, indent, "                ",
> diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
> index 518eac67959e..1d77ece46241 100644
> --- a/drivers/gpu/drm/i915/i915_scheduler.c
> +++ b/drivers/gpu/drm/i915/i915_scheduler.c
> @@ -54,6 +54,11 @@ static void node_put(struct i915_sched_node *node)
>   	i915_request_put(container_of(node, struct i915_request, sched));
>   }
>   
> +static inline u64 rq_deadline(const struct i915_request *rq)
> +{
> +	return READ_ONCE(rq->sched.deadline);
> +}
> +
>   static inline int rq_prio(const struct i915_request *rq)
>   {
>   	return READ_ONCE(rq->sched.attr.priority);
> @@ -67,6 +72,14 @@ static int ipi_get_prio(struct i915_request *rq)
>   	return xchg(&rq->sched.ipi_priority, I915_PRIORITY_INVALID);
>   }
>   
> +static u64 ipi_get_deadline(struct i915_request *rq)
> +{
> +	if (READ_ONCE(rq->sched.ipi_deadline) == I915_DEADLINE_NEVER)
> +		return I915_DEADLINE_NEVER;
> +
> +	return xchg64(&rq->sched.ipi_deadline, I915_DEADLINE_NEVER);
> +}
> +
>   static void ipi_schedule(struct work_struct *wrk)
>   {
>   	struct i915_sched_ipi *ipi = container_of(wrk, typeof(*ipi), work);
> @@ -74,9 +87,11 @@ static void ipi_schedule(struct work_struct *wrk)
>   
>   	do {
>   		struct i915_request *rn = xchg(&rq->sched.ipi_link, NULL);
> +		u64 deadline;
>   		int prio;
>   
>   		prio = ipi_get_prio(rq);
> +		deadline = ipi_get_deadline(rq);
>   
>   		/*
>   		 * For cross-engine scheduling to work we rely on one of two
> @@ -101,6 +116,7 @@ static void ipi_schedule(struct work_struct *wrk)
>   		 */
>   		local_bh_disable();
>   		i915_request_set_priority(rq, prio);
> +		i915_request_set_deadline(rq, deadline);
>   		local_bh_enable();
>   
>   		i915_request_put(rq);
> @@ -158,7 +174,10 @@ i915_sched_default_revoke_context(struct intel_context *ce,
>   
>   void i915_sched_select_mode(struct i915_sched *se, enum i915_sched_mode mode)
>   {
> -	switch (mode) {
> +	switch (min_t(int, mode, CONFIG_DRM_I915_SCHED)) {
> +	case I915_SCHED_MODE_DEADLINE:
> +		__set_bit(I915_SCHED_DEADLINE_BIT, &se->flags);
> +		fallthrough;
>   	case I915_SCHED_MODE_PRIORITY:
>   		__set_bit(I915_SCHED_PRIORITY_BIT, &se->flags);
>   		fallthrough;
> @@ -175,8 +194,8 @@ static void init_priolist(struct i915_priolist_root *const root)
>   	struct i915_priolist *pl = &root->sentinel;
>   
>   	memset_p((void **)pl->next, pl, ARRAY_SIZE(pl->next));
> +	pl->deadline = I915_DEADLINE_NEVER;
>   	pl->requests.prev = NULL;
> -	pl->priority = INT_MIN;
>   	pl->level = -1;
>   }
>   
> @@ -339,19 +358,20 @@ static inline unsigned int random_level(struct i915_priolist_root *root)
>   }
>   
>   static struct list_head *
> -lookup_priolist(struct i915_sched *se, int prio)
> +lookup_priolist(struct i915_sched * const se, u64 deadline)
>   {
>   	struct i915_priolist *update[I915_PRIOLIST_HEIGHT];
>   	struct i915_priolist_root *const root = &se->queue;
>   	struct i915_priolist *pl, *tmp;
>   	int lvl;
>   
> +	GEM_BUG_ON(deadline == I915_DEADLINE_NEVER);
>   	lockdep_assert_held(&se->lock);
>   	if (unlikely(se->no_priolist))
> -		prio = I915_PRIORITY_NORMAL;
> +		deadline = 0;
>   
>   	for_each_priolist(pl, root) { /* recycle any empty elements before us */
> -		if (pl->priority <= prio || !list_empty(&pl->requests))
> +		if (pl->deadline >= deadline || !list_empty(&pl->requests))
>   			break;
>   
>   		__i915_sched_dequeue_next(se);
> @@ -361,14 +381,14 @@ lookup_priolist(struct i915_sched *se, int prio)
>   	pl = &root->sentinel;
>   	lvl = pl->level;
>   	while (lvl >= 0) {
> -		while (tmp = pl->next[lvl], tmp->priority >= prio)
> +		while (tmp = pl->next[lvl], tmp->deadline <= deadline)
>   			pl = tmp;
> -		if (pl->priority == prio)
> +		if (pl->deadline == deadline)
>   			goto out;
>   		update[lvl--] = pl;
>   	}
>   
> -	if (prio == I915_PRIORITY_NORMAL) {
> +	if (!deadline) {
>   		pl = &se->default_priolist;
>   	} else if (!pl_empty(&root->sentinel.requests)) {
>   		pl = pl_pop(&root->sentinel.requests);
> @@ -376,7 +396,7 @@ lookup_priolist(struct i915_sched *se, int prio)
>   		pl = kmem_cache_alloc(global.slab_priorities, GFP_ATOMIC);
>   		/* Convert an allocation failure to a priority bump */
>   		if (unlikely(!pl)) {
> -			prio = I915_PRIORITY_NORMAL; /* recurses just once */
> +			deadline = 0; /* recurses just once */
>   
>   			/*
>   			 * To maintain ordering with all rendering, after an
> @@ -392,7 +412,7 @@ lookup_priolist(struct i915_sched *se, int prio)
>   		}
>   	}
>   
> -	pl->priority = prio;
> +	pl->deadline = deadline;
>   	INIT_LIST_HEAD(&pl->requests);
>   
>   	lvl = random_level(root);
> @@ -420,7 +440,7 @@ lookup_priolist(struct i915_sched *se, int prio)
>   		chk = &root->sentinel;
>   		lvl = chk->level;
>   		do {
> -			while (tmp = chk->next[lvl], tmp->priority >= prio)
> +			while (tmp = chk->next[lvl], tmp->deadline <= deadline)
>   				chk = tmp;
>   		} while (--lvl >= 0);
>   
> @@ -438,7 +458,7 @@ static void __remove_priolist(struct i915_sched *se, struct list_head *plist)
>   	struct i915_priolist *pl, *tmp;
>   	struct i915_priolist *old =
>   		container_of(plist, struct i915_priolist, requests);
> -	int prio = old->priority;
> +	u64 deadline = old->deadline;
>   	int lvl;
>   
>   	lockdep_assert_held(&se->lock);
> @@ -448,11 +468,11 @@ static void __remove_priolist(struct i915_sched *se, struct list_head *plist)
>   	lvl = pl->level;
>   	GEM_BUG_ON(lvl < 0);
>   
> -	if (prio != I915_PRIORITY_NORMAL)
> +	if (deadline)
>   		pl_push(old, &pl->requests);
>   
>   	do {
> -		while (tmp = pl->next[lvl], tmp->priority > prio)
> +		while (tmp = pl->next[lvl], tmp->deadline < deadline)
>   			pl = tmp;
>   		if (lvl <= old->level) {
>   			pl->next[lvl] = old->next[lvl];
> @@ -495,7 +515,7 @@ struct i915_priolist *__i915_sched_dequeue_next(struct i915_sched *se)
>   	GEM_BUG_ON(pl == s);
>   
>   	/* Keep pl->next[0] valid for for_each_priolist iteration */
> -	if (pl->priority != I915_PRIORITY_NORMAL)
> +	if (pl->deadline)
>   		pl_push(pl, &s->requests);
>   
>   	lvl = pl->level;
> @@ -531,52 +551,267 @@ stack_pop(struct i915_request *rq,
>   	return rq;
>   }
>   
> -static inline bool need_preempt(int prio, int active)
> +static void ipi_deadline(struct i915_request *rq, u64 deadline)
>   {
> -	/*
> -	 * Allow preemption of low -> normal -> high, but we do
> -	 * not allow low priority tasks to preempt other low priority
> -	 * tasks under the impression that latency for low priority
> -	 * tasks does not matter (as much as background throughput),
> -	 * so kiss.
> -	 */
> -	return prio >= max(I915_PRIORITY_NORMAL, active);
> +	u64 old = READ_ONCE(rq->sched.ipi_deadline);
> +
> +	do {
> +		if (deadline >= old)
> +			return;
> +	} while (!try_cmpxchg64(&rq->sched.ipi_deadline, &old, deadline));
> +
> +	__ipi_add(rq);
>   }
>   
> -static void kick_submission(struct intel_engine_cs *engine,
> -			    const struct i915_request *rq,
> -			    int prio)
> +static bool is_first_priolist(const struct i915_sched *se,
> +			      const struct list_head *requests)
>   {
> -	const struct i915_request *inflight;
> +	return requests == &se->queue.sentinel.next[0]->requests;
> +}
> +
> +static bool
> +__i915_request_set_deadline(struct i915_sched * const se,
> +			    struct i915_request *rq,
> +			    u64 deadline)
> +{
> +	struct intel_engine_cs *engine = rq->engine;
> +	struct list_head *pos = &rq->sched.signalers_list;
> +	struct list_head *plist;
> +
> +	if (unlikely(!i915_request_in_priority_queue(rq))) {
> +		rq->sched.deadline = deadline;
> +		return false;
> +	}
> +
> +	/* Fifo and depth-first replacement ensure our deps execute first */
> +	plist = lookup_priolist(se, deadline);
> +
> +	rq->sched.dfs.prev = NULL;
> +	do {
> +		if (i915_sched_has_deadlines(se)) {
> +			list_for_each_continue(pos, &rq->sched.signalers_list) {
> +				struct i915_dependency *p =
> +					list_entry(pos, typeof(*p), signal_link);
> +				struct i915_request *s =
> +					container_of(p->signaler, typeof(*s), sched);
> +
> +				if (rq_deadline(s) <= deadline)
> +					continue;
> +
> +				if (__i915_request_is_complete(s))
> +					continue;
> +
> +				if (s->engine != engine) {
> +					ipi_deadline(s, deadline);
> +					continue;
> +				}
> +
> +				/* Remember our position along this branch */
> +				rq = stack_push(s, rq, pos);
> +				pos = &rq->sched.signalers_list;
> +			}
> +		}
> +
> +		RQ_TRACE(rq, "set-deadline:%llu\n", deadline);
> +		WRITE_ONCE(rq->sched.deadline, deadline);
> +
> +		/*
> +		 * Once the request is ready, it will be placed into the
> +		 * priority lists and then onto the HW runlist. Before the
> +		 * request is ready, it does not contribute to our preemption
> +		 * decisions and we can safely ignore it, as it will, and
> +		 * any preemption required, be dealt with upon submission.
> +		 * See engine->submit_request()
> +		 */
> +		GEM_BUG_ON(i915_request_get_scheduler(rq) != se);
> +		if (i915_request_in_priority_queue(rq))
> +			remove_from_priolist(se, rq, plist, true);
> +	} while ((rq = stack_pop(rq, &pos)));
> +
> +	return is_first_priolist(se, plist);
> +}
> +
> +void i915_request_set_deadline(struct i915_request *rq, u64 deadline)
> +{
> +	struct intel_engine_cs *engine;
> +	unsigned long flags;
> +
> +	if (deadline >= rq_deadline(rq))
> +		return;
> +
> +	engine = lock_engine_irqsave(rq, flags);
> +	if (!i915_sched_has_deadlines(&engine->sched))
> +		goto unlock;
> +
> +	if (deadline >= rq_deadline(rq))
> +		goto unlock;
> +
> +	if (__i915_request_is_complete(rq))
> +		goto unlock;
> +
> +	rcu_read_lock();
> +	if (__i915_request_set_deadline(&engine->sched, rq, deadline))
> +		i915_sched_kick(&engine->sched);
> +	rcu_read_unlock();
> +	GEM_BUG_ON(rq_deadline(rq) != deadline);
> +
> +unlock:
> +	spin_unlock_irqrestore(&engine->sched.lock, flags);
> +}
> +
> +static u64 prio_slice(int prio)
> +{
> +	u64 slice;
> +	int sf;
>   
>   	/*
> -	 * We only need to kick the tasklet once for the high priority
> -	 * new context we add into the queue.
> +	 * This is the central heuristic to the virtual deadlines. By
> +	 * imposing that each task takes an equal amount of time, we
> +	 * let each client have an equal slice of the GPU time. By
> +	 * bringing the virtual deadline forward, that client will then
> +	 * have more GPU time, and vice versa a lower priority client will
> +	 * have a later deadline and receive less GPU time.
> +	 *
> +	 * In BFS/MuQSS, the prio_ratios[] are based on the task nice range of
> +	 * [-20, 20], with each lower priority having a ~10% longer deadline,
> +	 * with the note that the proportion of CPU time between two clients
> +	 * of different priority will be the square of the relative prio_slice.
> +	 *
> +	 * This property that the budget of each client is proportional to
> +	 * the relative priority, and that the scheduler fairly distributes
> +	 * work according to that budget, opens up a very powerful tool
> +	 * for managing clients.
> +	 *
> +	 * In contrast, this prio_slice() curve was chosen because it gave good
> +	 * results with igt/gem_exec_schedule. It may not be the best choice!
> +	 *
> +	 * With a 1ms scheduling quantum:
> +	 *
> +	 *   MAX USER:  ~32us deadline
> +	 *   0:         ~16ms deadline
> +	 *   MIN_USER: 1000ms deadline
>   	 */
> -	if (prio <= engine->execlists.queue_priority_hint)
> -		return;
>   
> -	/* Nothing currently active? We're overdue for a submission! */
> -	inflight = execlists_active(&engine->execlists);
> -	if (!inflight)
> -		return;
> +	if (prio >= __I915_PRIORITY_KERNEL__)
> +		return INT_MAX - prio;
> +
> +	slice = __I915_PRIORITY_KERNEL__ - prio;
> +	if (prio >= 0)
> +		sf = 20 - 6;
> +	else
> +		sf = 20 - 1;
> +
> +	return slice << sf;
> +}
> +
> +static u64 virtual_deadline(u64 kt, int priority)
> +{
> +	return i915_sched_to_ticks(kt + prio_slice(priority));
> +}
> +
> +u64 i915_scheduler_next_virtual_deadline(int priority)
> +{
> +	return virtual_deadline(ktime_get_mono_fast_ns(), priority);
> +}

This helper becomes a bit odd in that the only two callers are rewind
and defer. And it queries ktime, whereas before the deadline was set
based on signalers.

Where is the place which sets the ktime-based deadline (converted to
ticks) for requests with no signalers?

> +
> +static u64 signal_deadline(const struct i915_request *rq)
> +{
> +	u64 last = ktime_get_mono_fast_ns();
> +	const struct i915_dependency *p;
>   
>   	/*
> -	 * If we are already the currently executing context, don't
> -	 * bother evaluating if we should preempt ourselves.
> +	 * Find the earliest point at which we will become 'ready',
> +	 * which we infer from the deadline of all active signalers.
> +	 * We will position ourselves at the end of that chain of work.
>   	 */
> -	if (inflight->context == rq->context)
> -		return;
>   
> -	SCHED_TRACE(&engine->sched,
> -		    "bumping queue-priority-hint:%d for rq:" RQ_FMT ", inflight:" RQ_FMT " prio %d\n",
> -		    prio,
> -		    RQ_ARG(rq), RQ_ARG(inflight),
> -		    inflight->sched.attr.priority);
> +	rcu_read_lock();
> +	for_each_signaler(p, rq) {
> +		const struct i915_request *s =
> +			container_of(p->signaler, typeof(*s), sched);
> +		u64 deadline;
> +		int prio;
>   
> -	engine->execlists.queue_priority_hint = prio;
> -	if (need_preempt(prio, rq_prio(inflight)))
> -		intel_engine_kick_scheduler(engine);
> +		if (__i915_request_is_complete(s))
> +			continue;
> +
> +		if (s->timeline == rq->timeline &&
> +		    __i915_request_has_started(s))
> +			continue;
> +
> +		prio = rq_prio(s);
> +		if (prio < rq_prio(rq))
> +			continue;
> +
> +		deadline = rq_deadline(s);
> +		if (deadline == I915_DEADLINE_NEVER) /* retired & reused */
> +			continue;
> +
> +		if (s->context == rq->context) /* break ties in favour of hot */
> +			deadline--;
> +
> +		deadline = i915_sched_to_ns(deadline);
> +		if (p->flags & I915_DEPENDENCY_WEAK)
> +			deadline -= prio_slice(prio);
> +
> +		last = max(last, deadline);
> +	}
> +	rcu_read_unlock();
> +
> +	return last;
> +}
> +
> +static int adj_prio(const struct i915_request *rq)
> +{
> +	int prio = rq_prio(rq);
> +
> +	/*
> +	 * Deprioritize semaphore waiters. We only want to run these if there
> +	 * is nothing ready to run first.
> +	 *
> +	 * Note by giving a more distant deadline (due to a lower priority)
> +	 * we do not prevent them from having a slice of the GPU, and if there
> +	 * is still contention at that point, we expect to immediately yield
> +	 * on the semaphore.
> +	 *
> +	 * When all semaphores are signaled, we will update the request
> +	 * to remove the semaphore penalty.
> +	 */
> +	if (!i915_sw_fence_signaled(&rq->semaphore))
> +		prio -= __I915_PRIORITY_KERNEL__;
> +
> +	return prio;
> +}
> +
> +static u64
> +earliest_deadline(const struct i915_sched *se, const struct i915_request *rq)
> +{
> +	/*
> +	 * At its heart, the scheduler is simply a topological sort into
> +	 * a linear sequence of requests. As we use a single ascending index,
> +	 * we can repurpose the sort to achieve different goals, or to disable
> +	 * the sort entirely and funnel all requests onto a single list for
> +	 * immediate extraction.
> +	 */
> +	if (i915_sched_has_deadlines(se))
> +		return virtual_deadline(signal_deadline(rq), rq_prio(rq));
> +	else if (i915_sched_has_priorities(se))
> +		return INT_MAX - rq_prio(rq);
> +	else
> +		return 0;
> +}
> +
> +static bool
> +set_earliest_deadline(struct i915_sched *se, struct i915_request *rq, u64 old)
> +{
> +	u64 dl;
> +
> +	/* Recompute our deadlines and promote after a priority change */
> +	dl = min(earliest_deadline(se, rq), rq_deadline(rq));
> +	if (dl >= old)
> +		return false;
> +
> +	return __i915_request_set_deadline(se, rq, dl);
>   }
>   
>   static void ipi_priority(struct i915_request *rq, int prio)
> @@ -591,17 +826,16 @@ static void ipi_priority(struct i915_request *rq, int prio)
>   	__ipi_add(rq);
>   }
>   
> -static void __i915_request_set_priority(struct i915_request *rq, int prio)
> +static bool
> +__i915_request_set_priority(struct i915_sched * const se,
> +			    struct i915_request *rq,
> +			    int prio)
>   {
>   	struct intel_engine_cs *engine = rq->engine;
> -	struct i915_sched *se = intel_engine_get_scheduler(engine);
>   	struct list_head *pos = &rq->sched.signalers_list;
> -	struct list_head *plist;
> +	bool kick = false;
>   
> -	SCHED_TRACE(&engine->sched, "PI for " RQ_FMT ", prio:%d\n",
> -		    RQ_ARG(rq), prio);
> -
> -	plist = lookup_priolist(se, prio);
> +	SCHED_TRACE(se, "PI for " RQ_FMT ", prio:%d\n", RQ_ARG(rq), prio);
>   
>   	/*
>   	 * Recursively bump all dependent priorities to match the new request.
> @@ -623,31 +857,37 @@ static void __i915_request_set_priority(struct i915_request *rq, int prio)
>   	 */
>   	rq->sched.dfs.prev = NULL;
>   	do {
> -		list_for_each_continue(pos, &rq->sched.signalers_list) {
> -			struct i915_dependency *p =
> -				list_entry(pos, typeof(*p), signal_link);
> -			struct i915_request *s =
> -				container_of(p->signaler, typeof(*s), sched);
> +		struct i915_request *next;
>   
> -			if (rq_prio(s) >= prio)
> -				continue;
> +		if (i915_sched_has_priorities(i915_request_get_scheduler(rq))) {
> +			list_for_each_continue(pos, &rq->sched.signalers_list) {
> +				struct i915_dependency *p =
> +					list_entry(pos, typeof(*p), signal_link);
> +				struct i915_request *s =
> +					container_of(p->signaler, typeof(*s), sched);
>   
> -			if (__i915_request_is_complete(s))
> -				continue;
> +				if (rq_prio(s) >= prio)
> +					continue;
>   
> -			if (s->engine != engine) {
> -				ipi_priority(s, prio);
> -				continue;
> +				if (__i915_request_is_complete(s))
> +					continue;
> +
> +				if (s->engine != engine) {
> +					ipi_priority(s, prio);
> +					continue;
> +				}
> +
> +				/* Remember our position along this branch */
> +				rq = stack_push(s, rq, pos);
> +				pos = &rq->sched.signalers_list;
>   			}
> -
> -			/* Remember our position along this branch */
> -			rq = stack_push(s, rq, pos);
> -			pos = &rq->sched.signalers_list;
>   		}
>   
>   		RQ_TRACE(rq, "set-priority:%d\n", prio);
>   		WRITE_ONCE(rq->sched.attr.priority, prio);
>   
> +		next = stack_pop(rq, &pos);
> +
>   		/*
>   		 * Once the request is ready, it will be placed into the
>   		 * priority lists and then onto the HW runlist. Before the
> @@ -656,16 +896,15 @@ static void __i915_request_set_priority(struct i915_request *rq, int prio)
>   		 * any preemption required, be dealt with upon submission.
>   		 * See engine->submit_request()
>   		 */
> -		if (!i915_request_is_ready(rq))
> -			continue;
> -
>   		GEM_BUG_ON(rq->engine != engine);
> -		if (i915_request_in_priority_queue(rq))
> -			remove_from_priolist(se, rq, plist, true);
> +		if (i915_request_is_ready(rq) &&
> +		    set_earliest_deadline(se, rq, rq_deadline(rq)))
> +			kick = true;
>   
> -		/* Defer (tasklet) submission until after all updates. */
> -		kick_submission(engine, rq, prio);
> -	} while ((rq = stack_pop(rq, &pos)));
> +		rq = next;
> +	} while (rq);
> +
> +	return kick;
>   }
>   
>   #define all_signalers_checked(p, rq) \
> @@ -718,13 +957,9 @@ void i915_request_set_priority(struct i915_request *rq, int prio)
>   	if (__i915_request_is_complete(rq))
>   		goto unlock;
>   
> -	if (!i915_sched_has_priorities(&engine->sched)) {
> -		rq->sched.attr.priority = prio;
> -		goto unlock;
> -	}
> -
>   	rcu_read_lock();
> -	__i915_request_set_priority(rq, prio);
> +	if (__i915_request_set_priority(&engine->sched, rq, prio))
> +		i915_sched_kick(&engine->sched);
>   	rcu_read_unlock();
>   	GEM_BUG_ON(rq_prio(rq) != prio);
>   
> @@ -737,7 +972,7 @@ void __i915_sched_defer_request(struct intel_engine_cs *engine,
>   {
>   	struct list_head *pos = &rq->sched.waiters_list;
>   	struct i915_sched *se = intel_engine_get_scheduler(engine);
> -	const int prio = rq_prio(rq);
> +	u64 deadline = rq_deadline(rq);
>   	struct i915_request *rn;
>   	LIST_HEAD(dfs);
>   
> @@ -746,6 +981,9 @@ void __i915_sched_defer_request(struct intel_engine_cs *engine,
>   	lockdep_assert_held(&se->lock);
>   	GEM_BUG_ON(!test_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags));
>   
> +	if (i915_sched_has_deadlines(se))
> +		deadline = max(deadline, i915_scheduler_next_virtual_deadline(adj_prio(rq)));
> +
>   	/*
>   	 * When we defer a request, we must maintain its order with respect
>   	 * to those that are waiting upon it. So we traverse its chain of
> @@ -771,62 +1009,51 @@ void __i915_sched_defer_request(struct intel_engine_cs *engine,
>   				   __i915_request_has_started(w) &&
>   				   !__i915_request_is_complete(rq));
>   
> +			/* An unready waiter imposes no deadline */
>   			if (!i915_request_in_priority_queue(w))
>   				continue;
>   
>   			/*
> -			 * We also need to reorder within the same priority.
> +			 * We also need to reorder within the same deadline.
>   			 *
>   			 * This is unlike priority-inheritance, where if the
>   			 * signaler already has a higher priority [earlier
>   			 * deadline] than us, we can ignore as it will be
>   			 * scheduled first. If a waiter already has the
> -			 * same priority, we still have to push it to the end
> +			 * same deadline, we still have to push it to the end
>   			 * of the list. This unfortunately means we cannot
>   			 * use the rq_deadline() itself as a 'visited' bit.
>   			 */
> -			if (rq_prio(w) < prio)
> +			if (rq_deadline(w) > deadline)
>   				continue;
>   
> -			GEM_BUG_ON(rq_prio(w) != prio);
> -
>   			/* Remember our position along this branch */
>   			rq = stack_push(w, rq, pos);
>   			pos = &rq->sched.waiters_list;
>   		}
>   
> +		RQ_TRACE(rq, "set-deadline:%llu\n", deadline);
> +		WRITE_ONCE(rq->sched.deadline, deadline);
> +
>   		/* Note list is reversed for waiters wrt signal hierarchy */
> -		GEM_BUG_ON(rq->engine != engine);
> +		GEM_BUG_ON(i915_request_get_scheduler(rq) != se);
>   		remove_from_priolist(se, rq, &dfs, false);
>   
>   		/* Track our visit, and prevent duplicate processing */
>   		clear_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
>   	} while ((rq = stack_pop(rq, &pos)));
>   
> -	pos = lookup_priolist(se, prio);
> +	pos = lookup_priolist(se, deadline);
>   	list_for_each_entry_safe(rq, rn, &dfs, sched.link) {
>   		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
>   		list_add_tail(&rq->sched.link, pos);
>   	}
>   }
>   
> -static void queue_request(struct i915_sched *se, struct i915_request *rq)
> +static bool queue_request(struct i915_sched *se, struct i915_request *rq)
>   {
> -	GEM_BUG_ON(!list_empty(&rq->sched.link));
> -	list_add_tail(&rq->sched.link, lookup_priolist(se, rq_prio(rq)));
>   	set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> -}
> -
> -static bool submit_queue(struct intel_engine_cs *engine,
> -			 const struct i915_request *rq)
> -{
> -	struct intel_engine_execlists *execlists = &engine->execlists;
> -
> -	if (rq_prio(rq) <= execlists->queue_priority_hint)
> -		return false;
> -
> -	execlists->queue_priority_hint = rq_prio(rq);
> -	return true;
> +	return set_earliest_deadline(se, rq, I915_DEADLINE_NEVER);
>   }
>   
>   static bool hold_request(const struct i915_request *rq)
> @@ -864,8 +1091,8 @@ static bool ancestor_on_hold(const struct i915_sched *se,
>   
>   void i915_request_enqueue(struct i915_request *rq)
>   {
> -	struct intel_engine_cs *engine = rq->engine;
> -	struct i915_sched *se = intel_engine_get_scheduler(engine);
> +	struct i915_sched *se = i915_request_get_scheduler(rq);
> +	u64 dl = earliest_deadline(se, rq);
>   	unsigned long flags;
>   	bool kick = false;
>   
> @@ -880,11 +1107,11 @@ void i915_request_enqueue(struct i915_request *rq)
>   		list_add_tail(&rq->sched.link, &se->hold);
>   		i915_request_set_hold(rq);
>   	} else {
> -		queue_request(se, rq);
> -
> +		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> +		kick = __i915_request_set_deadline(se, rq,
> +						   min(dl, rq_deadline(rq)));

What is this min for? dl has been computed above based on rq, so I
wonder why rq_deadline has to be considered again.

Is it because earliest_deadline does not actually consider
rq->sched.deadline? So conceptually, how would earliest_deadline be
described?

> +		GEM_BUG_ON(rq_deadline(rq) == I915_DEADLINE_NEVER);
>   		GEM_BUG_ON(i915_sched_is_idle(se));
> -
> -		kick = submit_queue(engine, rq);
>   	}
>   
>   	GEM_BUG_ON(list_empty(&rq->sched.link));
> @@ -898,8 +1125,8 @@ __i915_sched_rewind_requests(struct intel_engine_cs *engine)
>   {
>   	struct i915_sched *se = intel_engine_get_scheduler(engine);
>   	struct i915_request *rq, *rn, *active = NULL;
> +	u64 deadline = I915_DEADLINE_NEVER;
>   	struct list_head *pl;
> -	int prio = I915_PRIORITY_INVALID;
>   
>   	lockdep_assert_held(&se->lock);
>   
> @@ -911,13 +1138,21 @@ __i915_sched_rewind_requests(struct intel_engine_cs *engine)
>   
>   		__i915_request_unsubmit(rq);
>   
> -		GEM_BUG_ON(rq_prio(rq) == I915_PRIORITY_INVALID);
> -		if (rq_prio(rq) != prio) {
> -			prio = rq_prio(rq);
> -			pl = lookup_priolist(se, prio);
> +		if (__i915_request_has_started(rq) &&
> +		    i915_sched_has_deadlines(se)) {
> +			u64 deadline =
> +				i915_scheduler_next_virtual_deadline(rq_prio(rq));
> +			rq->sched.deadline = min(rq_deadline(rq), deadline);
> +		}
> +		GEM_BUG_ON(rq_deadline(rq) == I915_DEADLINE_NEVER);
> +
> +		if (rq_deadline(rq) != deadline) {
> +			deadline = rq_deadline(rq);
> +			pl = lookup_priolist(se, deadline);
>   		}
>   		GEM_BUG_ON(i915_sched_is_idle(se));
>   
> +		GEM_BUG_ON(i915_request_in_priority_queue(rq));
>   		list_move(&rq->sched.link, pl);
>   		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
>   
> @@ -1023,14 +1258,10 @@ void __i915_sched_resume_request(struct intel_engine_cs *engine,
>   {
>   	struct i915_sched *se = intel_engine_get_scheduler(engine);
>   	LIST_HEAD(list);
> +	bool submit = false;
>   
>   	lockdep_assert_held(&se->lock);
>   
> -	if (rq_prio(rq) > engine->execlists.queue_priority_hint) {
> -		engine->execlists.queue_priority_hint = rq_prio(rq);
> -		i915_sched_kick(se);
> -	}
> -
>   	if (!i915_request_on_hold(rq))
>   		return;
>   
> @@ -1051,7 +1282,7 @@ void __i915_sched_resume_request(struct intel_engine_cs *engine,
>   		i915_request_clear_hold(rq);
>   		list_del_init(&rq->sched.link);
>   
> -		queue_request(se, rq);
> +		submit |= queue_request(se, rq);
>   
>   		/* Also release any children on this engine that are ready */
>   		for_each_waiter(p, rq) {
> @@ -1081,6 +1312,24 @@ void __i915_sched_resume_request(struct intel_engine_cs *engine,
>   
>   		rq = list_first_entry_or_null(&list, typeof(*rq), sched.link);
>   	} while (rq);
> +
> +	if (submit)
> +		i915_sched_kick(se);
> +}
> +
> +static u64
> +update_deadline(const struct i915_request *rq)
> +{
> +	return earliest_deadline(i915_request_get_scheduler(rq), rq);
> +}
> +
> +void i915_request_update_deadline(struct i915_request *rq)
> +{
> +	if (!i915_request_in_priority_queue(rq))
> +		return;
> +
> +	/* Recompute our deadlines and promote after a priority change */
> +	i915_request_set_deadline(rq, update_deadline(rq));
>   }
>   
>   void i915_sched_resume_request(struct intel_engine_cs *engine,
> @@ -1134,10 +1383,12 @@ void i915_sched_node_init(struct i915_sched_node *node)
>   void i915_sched_node_reinit(struct i915_sched_node *node)
>   {
>   	node->attr.priority = I915_PRIORITY_INVALID;
> +	node->deadline = I915_DEADLINE_NEVER;
>   	node->semaphores = 0;
>   	node->flags = 0;
>   
>   	GEM_BUG_ON(node->ipi_link);
> +	node->ipi_deadline = I915_DEADLINE_NEVER;
>   	node->ipi_priority = I915_PRIORITY_INVALID;
>   
>   	GEM_BUG_ON(!list_empty(&node->signalers_list));
> @@ -1378,6 +1629,20 @@ print_request_ring(struct drm_printer *m, const struct i915_request *rq)
>   	}
>   }
>   
> +static const char *repr_mode(const struct i915_sched *se)
> +{
> +	if (i915_sched_has_deadlines(se))
> +		return "Deadline";
> +
> +	if (i915_sched_has_priorities(se))
> +		return "Priority";
> +
> +	if (i915_sched_is_active(se))
> +		return "FIFO";
> +
> +	return "None";
> +}
> +
>   void i915_sched_show(struct drm_printer *m,
>   		     struct i915_sched *se,
>   		     void (*show_request)(struct drm_printer *m,
> @@ -1419,6 +1684,9 @@ void i915_sched_show(struct drm_printer *m,
>   		}
>   	}
>   
> +	drm_printf(m, "Scheduler: %s (%s)\n", repr_mode(se),
> +		   enableddisabled(test_bit(I915_SCHED_ENABLE_BIT,
> +					    &se->flags)));
>   	drm_printf(m, "Tasklet queued? %s (%s)\n",
>   		   yesno(test_bit(TASKLET_STATE_SCHED, &se->tasklet.state)),
>   		   enableddisabled(!atomic_read(&se->tasklet.count)));
> diff --git a/drivers/gpu/drm/i915/i915_scheduler.h b/drivers/gpu/drm/i915/i915_scheduler.h
> index 872d221f6ba7..14714e56ad80 100644
> --- a/drivers/gpu/drm/i915/i915_scheduler.h
> +++ b/drivers/gpu/drm/i915/i915_scheduler.h
> @@ -47,7 +47,14 @@ void i915_sched_select_mode(struct i915_sched *se, enum i915_sched_mode mode);
>   void i915_sched_park(struct i915_sched *se);
>   void i915_sched_fini(struct i915_sched *se);
>   
> +void i915_sched_select_mode(struct i915_sched *se, enum i915_sched_mode mode);
> +
>   void i915_request_set_priority(struct i915_request *request, int prio);
> +void i915_request_set_deadline(struct i915_request *request, u64 deadline);
> +
> +void i915_request_update_deadline(struct i915_request *request);
> +
> +u64 i915_scheduler_next_virtual_deadline(int priority);
>   
>   void i915_request_enqueue(struct i915_request *request);
>   
> @@ -85,11 +92,14 @@ static inline void i915_sched_disable(struct i915_sched *se)
>   	clear_bit(I915_SCHED_ENABLE_BIT, &se->flags);
>   }
>   
> -void __i915_priolist_free(struct i915_priolist *p);
> -static inline void i915_priolist_free(struct i915_priolist *p)
> +static inline u64 i915_sched_to_ticks(ktime_t kt)
>   {
> -	if (p->priority != I915_PRIORITY_NORMAL)
> -		__i915_priolist_free(p);
> +	return ktime_to_ns(kt) >> I915_SCHED_DEADLINE_SHIFT;
> +}
> +
> +static inline u64 i915_sched_to_ns(u64 deadline)
> +{
> +	return deadline << I915_SCHED_DEADLINE_SHIFT;
>   }
>   
>   static inline bool i915_sched_is_idle(const struct i915_sched *se)
> diff --git a/drivers/gpu/drm/i915/i915_scheduler_types.h b/drivers/gpu/drm/i915/i915_scheduler_types.h
> index bc668f375097..89cccda35ecd 100644
> --- a/drivers/gpu/drm/i915/i915_scheduler_types.h
> +++ b/drivers/gpu/drm/i915/i915_scheduler_types.h
> @@ -22,6 +22,7 @@ enum {
>   	I915_SCHED_ENABLE_BIT = 0,
>   	I915_SCHED_ACTIVE_BIT, /* can reorder the request flow */
>   	I915_SCHED_PRIORITY_BIT, /* priority sorting of queue */
> +	I915_SCHED_DEADLINE_BIT, /* sorting by virtual deadline */
>   	I915_SCHED_TIMESLICE_BIT, /* multitasking for long workloads */
>   	I915_SCHED_PREEMPT_RESET_BIT, /* reset if preemption times out */
>   	I915_SCHED_BUSYWAIT_BIT, /* preempt-to-busy */
> @@ -51,6 +52,7 @@ enum i915_sched_mode {
>   	I915_SCHED_MODE_NONE = -1, /* inactive, no bubble prevention */
>   	I915_SCHED_MODE_FIFO, /* pass-through of ready, first in first out */
>   	I915_SCHED_MODE_PRIORITY, /* reorder strictly by priority */
> +	I915_SCHED_MODE_DEADLINE, /* reorder to meet soft deadlines; fair */
>   };
>   
>   /**
> @@ -207,8 +209,31 @@ struct i915_sched_node {
>   #define I915_SCHED_HAS_EXTERNAL_CHAIN	BIT(0)
>   	unsigned long semaphores;
>   
> +	/**
> +	 * @deadline: [virtual] deadline
> +	 *
> +	 * When the request is ready for execution, it is given a quota
> +	 * (the engine's timeslice) and a virtual deadline. The virtual
> +	 * deadline is derived from the current time:
> +	 *     ktime_get() + (prio_ratio * timeslice)
> +	 *
> +	 * Requests are then executed in order of deadline completion.
> +	 * Requests with earlier deadlines than currently executing on
> +	 * the engine will preempt the active requests.
> +	 *
> +	 * By treating it as a virtual deadline, we use it as a hint for
> +	 * when it is appropriate for a request to start with respect to
> +	 * all other requests in the system. It is not a hard deadline, as
> +	 * we allow requests to miss them, and we do not account for the
> +	 * request runtime.
> +	 */
> +	u64 deadline;
> +#define I915_SCHED_DEADLINE_SHIFT 19 /* i.e. roughly 500us buckets */
> +#define I915_DEADLINE_NEVER U64_MAX
> +
>   	/* handle being scheduled for PI from outside of our active.lock */
>   	struct i915_request *ipi_link;
> +	u64 ipi_deadline;
>   	int ipi_priority;
>   };
>   
> @@ -236,14 +261,28 @@ struct i915_dependency {
>   
>   static inline bool i915_sched_is_active(const struct i915_sched *se)
>   {
> +	if (CONFIG_DRM_I915_SCHED < 0)
> +		return false;
> +
>   	return test_bit(I915_SCHED_ACTIVE_BIT, &se->flags);
>   }
>   
>   static inline bool i915_sched_has_priorities(const struct i915_sched *se)
>   {
> +	if (CONFIG_DRM_I915_SCHED < 1)
> +		return false;
> +
>   	return test_bit(I915_SCHED_PRIORITY_BIT, &se->flags);
>   }
>   
> +static inline bool i915_sched_has_deadlines(const struct i915_sched *se)
> +{
> +	if (CONFIG_DRM_I915_SCHED < 2)
> +		return false;
> +
> +	return test_bit(I915_SCHED_DEADLINE_BIT, &se->flags);
> +}
> +
>   static inline bool i915_sched_has_timeslices(const struct i915_sched *se)
>   {
>   	if (!IS_ACTIVE(CONFIG_DRM_I915_TIMESLICE_DURATION))
> diff --git a/drivers/gpu/drm/i915/selftests/i915_request.c b/drivers/gpu/drm/i915/selftests/i915_request.c
> index 8035ea7565ed..c5d7427bd429 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_request.c
> +++ b/drivers/gpu/drm/i915/selftests/i915_request.c
> @@ -2129,6 +2129,7 @@ static int measure_preemption(struct intel_context *ce)
>   
>   		intel_ring_advance(rq, cs);
>   		rq->sched.attr.priority = I915_PRIORITY_BARRIER;
> +		rq->sched.deadline = 0;
>   
>   		elapsed[i - 1] = ENGINE_READ_FW(ce->engine, RING_TIMESTAMP);
>   		i915_request_add(rq);
> diff --git a/drivers/gpu/drm/i915/selftests/i915_scheduler.c b/drivers/gpu/drm/i915/selftests/i915_scheduler.c
> index 2bb2d3d07d06..59df7f834dad 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_scheduler.c
> +++ b/drivers/gpu/drm/i915/selftests/i915_scheduler.c
> @@ -12,6 +12,40 @@
>   #include "selftests/igt_spinner.h"
>   #include "selftests/i915_random.h"
>   
> +static int mock_scheduler_slices(void *dummy)
> +{
> +	u64 min, max, normal, kernel;
> +
> +	min = prio_slice(I915_PRIORITY_MIN);
> +	pr_info("%8s slice: %lluus\n", "min", min >> 10);
> +
> +	normal = prio_slice(0);
> +	pr_info("%8s slice: %lluus\n", "normal", normal >> 10);
> +
> +	max = prio_slice(I915_PRIORITY_MAX);
> +	pr_info("%8s slice: %lluus\n", "max", max >> 10);
> +
> +	kernel = prio_slice(I915_PRIORITY_BARRIER);
> +	pr_info("%8s slice: %lluus\n", "kernel", kernel >> 10);
> +
> +	if (kernel != 0) {
> +		pr_err("kernel prio slice should be 0\n");
> +		return -EINVAL;
> +	}
> +
> +	if (max >= normal) {
> +		pr_err("maximum prio slice should be shorter than normal\n");
> +		return -EINVAL;
> +	}
> +
> +	if (min <= normal) {
> +		pr_err("minimum prio slice should be longer than normal\n");
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
>   static int mock_skiplist_levels(void *dummy)
>   {
>   	struct i915_priolist_root root = {};
> @@ -54,6 +88,7 @@ static int mock_skiplist_levels(void *dummy)
>   int i915_scheduler_mock_selftests(void)
>   {
>   	static const struct i915_subtest tests[] = {
> +		SUBTEST(mock_scheduler_slices),
>   		SUBTEST(mock_skiplist_levels),
>   	};
>   
> @@ -556,6 +591,53 @@ static int igt_priority_chains(void *arg)
>   	return igt_schedule_chains(arg, igt_priority);
>   }
>   
> +static bool igt_deadline(struct i915_request *rq,
> +			 unsigned long v, unsigned long e)
> +{
> +	i915_request_set_deadline(rq, 0);
> +	GEM_BUG_ON(rq_deadline(rq) != 0);
> +	return true;
> +}
> +
> +static int igt_deadline_chains(void *arg)
> +{
> +	return igt_schedule_chains(arg, igt_deadline);
> +}
> +
> +static bool igt_defer(struct i915_request *rq, unsigned long v, unsigned long e)
> +{
> +	struct intel_engine_cs *engine = rq->engine;
> +	struct i915_sched *se = intel_engine_get_scheduler(engine);
> +
> +	/* XXX No generic means to unwind incomplete requests yet */
> +	if (!i915_request_in_priority_queue(rq))
> +		return false;
> +
> +	if (!intel_engine_has_preemption(engine))
> +		return false;
> +
> +	spin_lock_irq(&se->lock);
> +
> +	/* Push all the requests to the same deadline */
> +	__i915_request_set_deadline(se, rq, 0);
> +	GEM_BUG_ON(rq_deadline(rq) != 0);
> +
> +	/* Then the very first request must be the one everyone depends on */
> +	rq = list_first_entry(lookup_priolist(se, 0), typeof(*rq), sched.link);
> +	GEM_BUG_ON(rq->engine != engine);
> +
> +	/* Deferring the first request will then have to defer all requests */
> +	__i915_sched_defer_request(engine, rq);
> +
> +	spin_unlock_irq(&se->lock);
> +	return true;
> +}
> +
> +static int igt_deadline_defer(void *arg)
> +{
> +	return igt_schedule_chains(arg, igt_defer);
> +}
> +
>   static struct i915_request *
>   __write_timestamp(struct intel_engine_cs *engine,
>   		  struct drm_i915_gem_object *obj,
> @@ -767,13 +849,22 @@ static int igt_priority_cycle(void *arg)
>   	return __igt_schedule_cycle(arg, igt_priority);
>   }
>   
> +static int igt_deadline_cycle(void *arg)
> +{
> +	return __igt_schedule_cycle(arg, igt_deadline);
> +}
> +
>   int i915_scheduler_live_selftests(struct drm_i915_private *i915)
>   {
>   	static const struct i915_subtest tests[] = {
> +		SUBTEST(igt_deadline_chains),
>   		SUBTEST(igt_priority_chains),
>   
>   		SUBTEST(igt_schedule_cycle),
> +		SUBTEST(igt_deadline_cycle),
>   		SUBTEST(igt_priority_cycle),
> +
> +		SUBTEST(igt_deadline_defer),
>   	};
>   
>   	return i915_subtests(tests, i915);
> @@ -909,9 +1000,54 @@ static int sparse_priority(void *arg)
>   	return sparse(arg, set_priority);
>   }
>   
> +static u64 __set_deadline(struct i915_request *rq, u64 deadline)
> +{
> +	u64 dt;
> +
> +	preempt_disable();
> +	dt = ktime_get_raw_fast_ns();
> +	i915_request_set_deadline(rq, deadline);
> +	dt = ktime_get_raw_fast_ns() - dt;
> +	preempt_enable();
> +
> +	return dt;
> +}
> +
> +static bool set_deadline(struct i915_request *rq,
> +			 unsigned long v, unsigned long e)
> +{
> +	report("set-deadline", v, e, __set_deadline(rq, 0));
> +	return true;
> +}
> +
> +static int single_deadline(void *arg)
> +{
> +	return single(arg, set_deadline);
> +}
> +
> +static int wide_deadline(void *arg)
> +{
> +	return wide(arg, set_deadline);
> +}
> +
> +static int inv_deadline(void *arg)
> +{
> +	return inv(arg, set_deadline);
> +}
> +
> +static int sparse_deadline(void *arg)
> +{
> +	return sparse(arg, set_deadline);
> +}
> +
>   int i915_scheduler_perf_selftests(struct drm_i915_private *i915)
>   {
>   	static const struct i915_subtest tests[] = {
> +		SUBTEST(single_deadline),
> +		SUBTEST(wide_deadline),
> +		SUBTEST(inv_deadline),
> +		SUBTEST(sparse_deadline),
> +
>   		SUBTEST(single_priority),
>   		SUBTEST(wide_priority),
>   		SUBTEST(inv_priority),
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index cda0f391d965..4efc5801173c 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -525,6 +525,7 @@ typedef struct drm_i915_irq_wait {
>   #define   I915_SCHEDULER_CAP_SEMAPHORES	(1ul << 3)
>   #define   I915_SCHEDULER_CAP_ENGINE_BUSY_STATS	(1ul << 4)
>   #define   I915_SCHEDULER_CAP_TIMESLICING	(1ul << 5)
> +#define   I915_SCHEDULER_CAP_FAIR	(1ul << 6)
>   
>   #define I915_PARAM_HUC_STATUS		 42
>   
> 

Regards,

Tvrtko
Chris Wilson Feb. 8, 2021, 3:29 p.m. UTC | #2
Quoting Tvrtko Ursulin (2021-02-08 14:56:31)
> On 08/02/2021 10:52, Chris Wilson wrote:
> > +static bool need_preempt(const struct intel_engine_cs *engine,
> >                        const struct i915_request *rq)
> >   {
> >       const struct i915_sched *se = &engine->sched;
> > -     int last_prio;
> > +     const struct i915_request *first = NULL;
> > +     const struct i915_request *next;
> >   
> >       if (!i915_sched_use_busywait(se))
> >               return false;
> >   
> >       /*
> > -      * Check if the current priority hint merits a preemption attempt.
> > -      *
> > -      * We record the highest value priority we saw during rescheduling
> > -      * prior to this dequeue, therefore we know that if it is strictly
> > -      * less than the current tail of ESLP[0], we do not need to force
> > -      * a preempt-to-idle cycle.
> > -      *
> > -      * However, the priority hint is a mere hint that we may need to
> > -      * preempt. If that hint is stale or we may be trying to preempt
> > -      * ourselves, ignore the request.
> > -      *
> > -      * More naturally we would write
> > -      *      prio >= max(0, last);
> > -      * except that we wish to prevent triggering preemption at the same
> > -      * priority level: the task that is running should remain running
> > -      * to preserve FIFO ordering of dependencies.
> > +      * If this request is special and must not be interrupted at any
> > +      * cost, so be it. Note we are only checking the most recent request
> > +      * in the context and so may be masking an earlier vip request. It
> > +      * is hoped that under the conditions where nopreempt is used, this
> > +      * will not matter (i.e. all requests to that context will be
> > +      * nopreempt for as long as desired).
> >        */
> > -     last_prio = max(effective_prio(rq), I915_PRIORITY_NORMAL - 1);
> > -     if (engine->execlists.queue_priority_hint <= last_prio)
> > +     if (i915_request_has_nopreempt(rq))
> >               return false;
> >   
> >       /*
> >        * Check against the first request in ELSP[1], it will, thanks to the
> >        * power of PI, be the highest priority of that context.
> >        */
> > -     if (!list_is_last(&rq->sched.link, &se->requests) &&
> > -         rq_prio(list_next_entry(rq, sched.link)) > last_prio)
> > -             return true;
> > +     next = next_elsp_request(se, rq);
> > +     if (dl_before(next, first))
> 
> Here first is always NULL so dl_before always returns true, meaning it 
> appears redundant to call it.

I was applying a pattern :)
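
For reference, dl_before() itself is not in the quoted hunk; going by
the observation that dl_before(next, NULL) is always true, the assumed
semantics would be roughly this sketch (inferred from the discussion,
not necessarily the exact code in the patch):

static bool dl_before(const struct i915_request *next,
		      const struct i915_request *prev)
{
	/*
	 * A NULL prev always loses (so the first candidate is taken),
	 * a NULL next never wins, otherwise the request with the
	 * earlier virtual deadline does.
	 */
	return !prev || (next && rq_deadline(next) < rq_deadline(prev));
}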

> 
> > +             first = next;
> >   
> >       /*
> >        * If the inflight context did not trigger the preemption, then maybe
> > @@ -356,8 +343,31 @@ static bool need_preempt(struct intel_engine_cs *engine,
> >        * ELSP[0] or ELSP[1] as, thanks again to PI, if it was the same
> >        * context, it's priority would not exceed ELSP[0] aka last_prio.
> >        */
> > -     return max(virtual_prio(&engine->execlists),
> > -                queue_prio(se)) > last_prio;
> > +     next = first_request(se);
> > +     if (dl_before(next, first))
> > +             first = next; > +
> > +     next = first_virtual(engine);
> > +     if (dl_before(next, first))
> > +             first = next;
> > +
> > +     if (!dl_before(first, rq))
> > +             return false;
> 
> Ends up earliest deadline between list of picks: elsp[1] (or maybe next 
> in context, depends on coalescing criteria), first in the priolist, 
> first virtual.
> 
> Virtual has a separate queue so that's understandable, but can "elsp[1]" 
> really have an earlier deadling than first_request() (head of thepriolist)?

elsp[1] could have been promoted and thus now have an earlier deadline
than elsp[0]. Consider the heartbeat as a trivial example that is first
submitted at very low priority, but by the end has absolute priority.
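
Putting rough numbers on that (illustrative only, using the approximate
figures from the prio_slice() comment in the patch):

  normal client coalesced into ELSP[0]:  deadline ~ submit + 16ms
  heartbeat in ELSP[1] at minimum prio:  deadline ~ submit + 1000ms
  same heartbeat after PI to BARRIER:    deadline pulled back to ~now

so by the time we re-evaluate, ELSP[1] can indeed hold the earlier
deadline of the two.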

> > +static u64 virtual_deadline(u64 kt, int priority)
> > +{
> > +     return i915_sched_to_ticks(kt + prio_slice(priority));
> > +}
> > +
> > +u64 i915_scheduler_next_virtual_deadline(int priority)
> > +{
> > +     return virtual_deadline(ktime_get_mono_fast_ns(), priority);
> > +}
> 
> This helpers becomes a bit odd in that the only two callers are rewind 
> and defer. And it queries ktime, while before deadline was set based on 
> signalers.
> 
> Where is the place which set the ktime based deadline (converted to 
> ticks) for requests with no signalers?

signal_deadline() with no signalers returns now. So the first request in
a sequence is queued with virtual_deadline(now() + prio_slice()).
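
Spelled out with the rough numbers from the patch (the ~16ms slice for
priority 0, and the ~500us tick implied by I915_SCHED_DEADLINE_SHIFT=19,
are approximations):

  signal_deadline(rq)       -> now                   (no active signalers)
  virtual_deadline(now, 0)  -> (now + ~16ms) >> 19   (~500us ticks)

so two independent priority-0 clients submitting a millisecond apart end
up only a couple of ticks apart and are interleaved, rather than one of
them draining its whole queue first.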

> >   void i915_request_enqueue(struct i915_request *rq)
> >   {
> > -     struct intel_engine_cs *engine = rq->engine;
> > -     struct i915_sched *se = intel_engine_get_scheduler(engine);
> > +     struct i915_sched *se = i915_request_get_scheduler(rq);
> > +     u64 dl = earliest_deadline(se, rq);
> >       unsigned long flags;
> >       bool kick = false;
> >   
> > @@ -880,11 +1107,11 @@ void i915_request_enqueue(struct i915_request *rq)
> >               list_add_tail(&rq->sched.link, &se->hold);
> >               i915_request_set_hold(rq);
> >       } else {
> > -             queue_request(se, rq);
> > -
> > +             set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> > +             kick = __i915_request_set_deadline(se, rq,
> > +                                                min(dl, rq_deadline(rq)));
> 
> What is this min for? Dl has been computed above based on rq, so I 
> wonder why rq_deadline has to be considered again.

earliest_deadline() only looks at the signalers (or now if none) and
picks the next deadline in that sequence. However, for some requests we
may set the deadline explicitly (e.g. the heartbeat has a known
deadline, and for vblank rendering we can approximate one), and so we
also consider whatever deadline has already been specified.
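
i.e. something along these lines (annotated sketch of the quoted
enqueue path, with a made-up helper name purely for illustration):

static void enqueue_deadline_sketch(struct i915_sched *se,
				    struct i915_request *rq)
{
	/* chain-derived deadline: "now" + prio_slice() via the signalers */
	u64 dl = earliest_deadline(se, rq);

	/* keep an explicit, earlier deadline (heartbeat, flip) if one was set */
	if (rq_deadline(rq) < dl)
		dl = rq_deadline(rq);

	/* equivalent to the min(dl, rq_deadline(rq)) in i915_request_enqueue() */
	__i915_request_set_deadline(se, rq, dl);
}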

> Because earliest_deadline does not actually consider rq->sched.deadline? 
> So conceptually earliest_deadline would be described as what?

sequence_deadline() ?

earliest_deadline_for_this_sequence() ?
-Chris
Tvrtko Ursulin Feb. 8, 2021, 4:03 p.m. UTC | #3
On 08/02/2021 15:29, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2021-02-08 14:56:31)
>> On 08/02/2021 10:52, Chris Wilson wrote:
>>> +static bool need_preempt(const struct intel_engine_cs *engine,
>>>                         const struct i915_request *rq)
>>>    {
>>>        const struct i915_sched *se = &engine->sched;
>>> -     int last_prio;
>>> +     const struct i915_request *first = NULL;
>>> +     const struct i915_request *next;
>>>    
>>>        if (!i915_sched_use_busywait(se))
>>>                return false;
>>>    
>>>        /*
>>> -      * Check if the current priority hint merits a preemption attempt.
>>> -      *
>>> -      * We record the highest value priority we saw during rescheduling
>>> -      * prior to this dequeue, therefore we know that if it is strictly
>>> -      * less than the current tail of ESLP[0], we do not need to force
>>> -      * a preempt-to-idle cycle.
>>> -      *
>>> -      * However, the priority hint is a mere hint that we may need to
>>> -      * preempt. If that hint is stale or we may be trying to preempt
>>> -      * ourselves, ignore the request.
>>> -      *
>>> -      * More naturally we would write
>>> -      *      prio >= max(0, last);
>>> -      * except that we wish to prevent triggering preemption at the same
>>> -      * priority level: the task that is running should remain running
>>> -      * to preserve FIFO ordering of dependencies.
>>> +      * If this request is special and must not be interrupted at any
>>> +      * cost, so be it. Note we are only checking the most recent request
>>> +      * in the context and so may be masking an earlier vip request. It
>>> +      * is hoped that under the conditions where nopreempt is used, this
>>> +      * will not matter (i.e. all requests to that context will be
>>> +      * nopreempt for as long as desired).
>>>         */
>>> -     last_prio = max(effective_prio(rq), I915_PRIORITY_NORMAL - 1);
>>> -     if (engine->execlists.queue_priority_hint <= last_prio)
>>> +     if (i915_request_has_nopreempt(rq))
>>>                return false;
>>>    
>>>        /*
>>>         * Check against the first request in ELSP[1], it will, thanks to the
>>>         * power of PI, be the highest priority of that context.
>>>         */
>>> -     if (!list_is_last(&rq->sched.link, &se->requests) &&
>>> -         rq_prio(list_next_entry(rq, sched.link)) > last_prio)
>>> -             return true;
>>> +     next = next_elsp_request(se, rq);
>>> +     if (dl_before(next, first))
>>
>> Here first is always NULL so dl_before always returns true, meaning it
>> appears redundant to call it.
> 
> I was applying a pattern :)

Yeah, thought so. It's fine.

> 
>>
>>> +             first = next;
>>>    
>>>        /*
>>>         * If the inflight context did not trigger the preemption, then maybe
>>> @@ -356,8 +343,31 @@ static bool need_preempt(struct intel_engine_cs *engine,
>>>         * ELSP[0] or ELSP[1] as, thanks again to PI, if it was the same
>>>         * context, it's priority would not exceed ELSP[0] aka last_prio.
>>>         */
>>> -     return max(virtual_prio(&engine->execlists),
>>> -                queue_prio(se)) > last_prio;
>>> +     next = first_request(se);
>>> +     if (dl_before(next, first))
>>> +             first = next; > +
>>> +     next = first_virtual(engine);
>>> +     if (dl_before(next, first))
>>> +             first = next;
>>> +
>>> +     if (!dl_before(first, rq))
>>> +             return false;
>>
>> Ends up earliest deadline between list of picks: elsp[1] (or maybe next
>> in context, depends on coalescing criteria), first in the priolist,
>> first virtual.
>>
>> Virtual has a separate queue so that's understandable, but can "elsp[1]"
>> really have an earlier deadling than first_request() (head of thepriolist)?
> 
> elsp[1] could have been promoted and thus now have an earlier deadline
> than elsp[0]. Consider the heartbeat as a trivial example that is first
> submitted at very low priority, but by the end has absolute priority.

The tree is not kept sorted at all times, or at least not at the time
need_preempt peeks at it?

> 
>>> +static u64 virtual_deadline(u64 kt, int priority)
>>> +{
>>> +     return i915_sched_to_ticks(kt + prio_slice(priority));
>>> +}
>>> +
>>> +u64 i915_scheduler_next_virtual_deadline(int priority)
>>> +{
>>> +     return virtual_deadline(ktime_get_mono_fast_ns(), priority);
>>> +}
>>
>> This helpers becomes a bit odd in that the only two callers are rewind
>> and defer. And it queries ktime, while before deadline was set based on
>> signalers.
>>
>> Where is the place which set the ktime based deadline (converted to
>> ticks) for requests with no signalers?
> 
> signal_deadline() with no signalers returns now. So the first request in
> a sequence is queued with virtual_deadline(now() + prio_slice()).

Ah ok.

> 
>>>    void i915_request_enqueue(struct i915_request *rq)
>>>    {
>>> -     struct intel_engine_cs *engine = rq->engine;
>>> -     struct i915_sched *se = intel_engine_get_scheduler(engine);
>>> +     struct i915_sched *se = i915_request_get_scheduler(rq);
>>> +     u64 dl = earliest_deadline(se, rq);
>>>        unsigned long flags;
>>>        bool kick = false;
>>>    
>>> @@ -880,11 +1107,11 @@ void i915_request_enqueue(struct i915_request *rq)
>>>                list_add_tail(&rq->sched.link, &se->hold);
>>>                i915_request_set_hold(rq);
>>>        } else {
>>> -             queue_request(se, rq);
>>> -
>>> +             set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
>>> +             kick = __i915_request_set_deadline(se, rq,
>>> +                                                min(dl, rq_deadline(rq)));
>>
>> What is this min for? Dl has been computed above based on rq, so I
>> wonder why rq_deadline has to be considered again.
> 
> earliest_deadline() only looks at the signalers (or now if none) and
> picks the next deadline in that sequence. However, some requests we may
> set the deadline explicitly (e.g. heartbeat has a known deadline, vblank
> rendering we can approximate a deadline) and so we also consider what
> deadline has already been specified.
> 
>> Because earliest_deadline does not actually consider rq->sched.deadline?
>> So conceptually earliest_deadline would be described as what?
> 
> sequence_deadline() ?
> 
> earliest_deadline_for_this_sequence() ?

Don't know really. Don't think it's a matter of names, just me building
a good image of the operation.

But as earliest does imply earliest, which then gets potentially 
overwritten with something even earlier, hm.. baseline? :) Default? 
Nah.. Scheduling_deadline? Tree deadline? Sorted deadline?

Regards,

Tvrtko
Chris Wilson Feb. 8, 2021, 4:11 p.m. UTC | #4
Quoting Tvrtko Ursulin (2021-02-08 16:03:03)
> 
> On 08/02/2021 15:29, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2021-02-08 14:56:31)
> >> On 08/02/2021 10:52, Chris Wilson wrote:
> >>> +static bool need_preempt(const struct intel_engine_cs *engine,
> >>>                         const struct i915_request *rq)
> >>>    {
> >>>        const struct i915_sched *se = &engine->sched;
> >>> -     int last_prio;
> >>> +     const struct i915_request *first = NULL;
> >>> +     const struct i915_request *next;
> >>>    
> >>>        if (!i915_sched_use_busywait(se))
> >>>                return false;
> >>>    
> >>>        /*
> >>> -      * Check if the current priority hint merits a preemption attempt.
> >>> -      *
> >>> -      * We record the highest value priority we saw during rescheduling
> >>> -      * prior to this dequeue, therefore we know that if it is strictly
> >>> -      * less than the current tail of ESLP[0], we do not need to force
> >>> -      * a preempt-to-idle cycle.
> >>> -      *
> >>> -      * However, the priority hint is a mere hint that we may need to
> >>> -      * preempt. If that hint is stale or we may be trying to preempt
> >>> -      * ourselves, ignore the request.
> >>> -      *
> >>> -      * More naturally we would write
> >>> -      *      prio >= max(0, last);
> >>> -      * except that we wish to prevent triggering preemption at the same
> >>> -      * priority level: the task that is running should remain running
> >>> -      * to preserve FIFO ordering of dependencies.
> >>> +      * If this request is special and must not be interrupted at any
> >>> +      * cost, so be it. Note we are only checking the most recent request
> >>> +      * in the context and so may be masking an earlier vip request. It
> >>> +      * is hoped that under the conditions where nopreempt is used, this
> >>> +      * will not matter (i.e. all requests to that context will be
> >>> +      * nopreempt for as long as desired).
> >>>         */
> >>> -     last_prio = max(effective_prio(rq), I915_PRIORITY_NORMAL - 1);
> >>> -     if (engine->execlists.queue_priority_hint <= last_prio)
> >>> +     if (i915_request_has_nopreempt(rq))
> >>>                return false;
> >>>    
> >>>        /*
> >>>         * Check against the first request in ELSP[1], it will, thanks to the
> >>>         * power of PI, be the highest priority of that context.
> >>>         */
> >>> -     if (!list_is_last(&rq->sched.link, &se->requests) &&
> >>> -         rq_prio(list_next_entry(rq, sched.link)) > last_prio)
> >>> -             return true;
> >>> +     next = next_elsp_request(se, rq);
> >>> +     if (dl_before(next, first))
> >>
> >> Here first is always NULL so dl_before always returns true, meaning it
> >> appears redundant to call it.
> > 
> > I was applying a pattern :)
> 
> Yeah, thought so. It's fine.
> 
> > 
> >>
> >>> +             first = next;
> >>>    
> >>>        /*
> >>>         * If the inflight context did not trigger the preemption, then maybe
> >>> @@ -356,8 +343,31 @@ static bool need_preempt(struct intel_engine_cs *engine,
> >>>         * ELSP[0] or ELSP[1] as, thanks again to PI, if it was the same
> >>>         * context, it's priority would not exceed ELSP[0] aka last_prio.
> >>>         */
> >>> -     return max(virtual_prio(&engine->execlists),
> >>> -                queue_prio(se)) > last_prio;
> >>> +     next = first_request(se);
> >>> +     if (dl_before(next, first))
> >>> +             first = next; > +
> >>> +     next = first_virtual(engine);
> >>> +     if (dl_before(next, first))
> >>> +             first = next;
> >>> +
> >>> +     if (!dl_before(first, rq))
> >>> +             return false;
> >>
> >> Ends up earliest deadline between list of picks: elsp[1] (or maybe next
> >> in context, depends on coalescing criteria), first in the priolist,
> >> first virtual.
> >>
> >> Virtual has a separate queue so that's understandable, but can "elsp[1]"
> >> really have an earlier deadling than first_request() (head of thepriolist)?
> > 
> > elsp[1] could have been promoted and thus now have an earlier deadline
> > than elsp[0]. Consider the heartbeat as a trivial example that is first
> > submitted at very low priority, but by the end has absolute priority.
> 
> The tree is not kept sorted at all times, or at least at the time 
> need_preempt peeks at it?

The tree of priorities/deadlines itself is sorted. ELSP[] is the HW
runlist, which is a snapshot at the time of submission (and while it
should have been in order then, it may not be now).

need_preempt() tries to answer the question of "if I were to unwind
everything, would the first request in the resulting priority tree be of
earlier deadline & higher priority than the currently running request?".
So we have to guess the future shape of the tree.
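
As a made-up example in ticks: with ELSP[0] (rq) at deadline 240, a
promoted ELSP[1] at 200, the priolist head at 220 and no virtual
candidate, "first" resolves to ELSP[1] at 200; dl_before(first, rq)
holds, so as far as this check goes the unwound tree would be headed by
something earlier than the running request and preemption is warranted.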

> >>> +static u64 virtual_deadline(u64 kt, int priority)
> >>> +{
> >>> +     return i915_sched_to_ticks(kt + prio_slice(priority));
> >>> +}
> >>> +
> >>> +u64 i915_scheduler_next_virtual_deadline(int priority)
> >>> +{
> >>> +     return virtual_deadline(ktime_get_mono_fast_ns(), priority);
> >>> +}
> >>
> >> This helpers becomes a bit odd in that the only two callers are rewind
> >> and defer. And it queries ktime, while before deadline was set based on
> >> signalers.
> >>
> >> Where is the place which set the ktime based deadline (converted to
> >> ticks) for requests with no signalers?
> > 
> > signal_deadline() with no signalers returns now. So the first request in
> > a sequence is queued with virtual_deadline(now() + prio_slice()).
> 
> Ah ok.
> 
> > 
> >>>    void i915_request_enqueue(struct i915_request *rq)
> >>>    {
> >>> -     struct intel_engine_cs *engine = rq->engine;
> >>> -     struct i915_sched *se = intel_engine_get_scheduler(engine);
> >>> +     struct i915_sched *se = i915_request_get_scheduler(rq);
> >>> +     u64 dl = earliest_deadline(se, rq);
> >>>        unsigned long flags;
> >>>        bool kick = false;
> >>>    
> >>> @@ -880,11 +1107,11 @@ void i915_request_enqueue(struct i915_request *rq)
> >>>                list_add_tail(&rq->sched.link, &se->hold);
> >>>                i915_request_set_hold(rq);
> >>>        } else {
> >>> -             queue_request(se, rq);
> >>> -
> >>> +             set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> >>> +             kick = __i915_request_set_deadline(se, rq,
> >>> +                                                min(dl, rq_deadline(rq)));
> >>
> >> What is this min for? Dl has been computed above based on rq, so I
> >> wonder why rq_deadline has to be considered again.
> > 
> > earliest_deadline() only looks at the signalers (or now if none) and
> > picks the next deadline in that sequence. However, some requests we may
> > set the deadline explicitly (e.g. heartbeat has a known deadline, vblank
> > rendering we can approximate a deadline) and so we also consider what
> > deadline has already been specified.
> > 
> >> Because earliest_deadline does not actually consider rq->sched.deadline?
> >> So conceptually earliest_deadline would be described as what?
> > 
> > sequence_deadline() ?
> > 
> > earliest_deadline_for_this_sequence() ?
> 
> Don't know really. Don't think it's a matter of names just me building a 
> good image of the operation.
> 
> But as earliest does imply earliest, which then gets potentially 
> overwritten with something even earlier, hm.. baseline? :) Default? 
> Nah.. Scheduling_deadline? Tree deadline? Sorted deadline?

Maybe current_deadline().
-Chris
Tvrtko Ursulin Feb. 9, 2021, 9:37 a.m. UTC | #5
On 08/02/2021 10:52, Chris Wilson wrote:

> diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> index 35bbe2b80596..f1d009906f71 100644
> --- a/drivers/gpu/drm/i915/Kconfig.profile
> +++ b/drivers/gpu/drm/i915/Kconfig.profile
> @@ -1,3 +1,65 @@
> +choice
> +	prompt "Preferred scheduler"
> +	default DRM_I915_SCHED_VIRTUAL_DEADLINE
> +	help
> +	  Select the preferred method to decide the order of execution.
> +
> +	  The scheduler is used for two purposes. First to defer unready
> +	  jobs to not block execution of independent ready clients, so
> +	  preventing GPU stalls while work waits for other tasks. The second
> +	  purpose is to decide which task to run next, as well as decide
> +	  if that task should preempt the currently running task, or if
> +	  the current task has exceeded its allotment of GPU time and should
> +	  be replaced.
> +
> +	config DRM_I915_SCHED_FIFO
> +	bool "FIFO"
> +	help
> +	  No task reordering, tasks are executed in order of readiness.
> +	  First in, first out.
> +
> +	  Unready tasks do not block execution of other, independent clients.
> +	  A client will not be scheduled for execution until all of its
> +	  prerequisite work has completed.
> +
> +	  This disables the scheduler and puts it into a pass-through mode.
> +
> +	config DRM_I915_SCHED_PRIORITY
> +	bool "Priority"
> +	help
> +	  Strict priority ordering, equal priority tasks are executed
> +	  in order of readiness. Clients are liable to starve other clients,
> +	  causing uneven execution and excess task latency. High priority
> +	  clients will preempt lower priority clients and will run
> +	  uninterrupted.
> +
> +	  Note that interactive desktops will implicitly perform priority
> +	  boosting to minimise frame jitter.
> +
> +	config DRM_I915_SCHED_VIRTUAL_DEADLINE
> +	bool "Virtual Deadline"
> +	help
> +	  A fair scheduler based on MuQSS with priority-hinting.
> +
> +	  When a task is ready for execution, it is given a quota (from the
> +	  engine's timeslice) and a virtual deadline. The virtual deadline is
> +	  derived from the current time and the timeslice scaled by the
> +	  task's priority. Higher priority tasks are given an earlier
> +	  deadline and receive a large portion of the execution bandwidth.
> +
> +	  Requests are then executed in order of deadline completion.
> +	  Requests with earlier deadlines and higher priority than currently
> +	  executing on the engine will preempt the active task.
> +
> +endchoice
> +
> +config DRM_I915_SCHED
> +	int
> +	default 2 if DRM_I915_SCHED_VIRTUAL_DEADLINE
> +	default 1 if DRM_I915_SCHED_PRIORITY
> +	default 0 if DRM_I915_SCHED_FIFO
> +	default -1

Default -1 would mean it would ask the user and not default to deadline?

Implementation-wise it is very neat how you did it, so there is basically 
very little cost for the compiled-out options. And the code maintenance cost 
of supporting multiple options is pretty trivial as well.

The only cost I can see is potential bug reports if the "wrong" scheduler was 
picked by someone. What, or who, do you envisage the use cases would be 
for not going with deadline? (I think deadline should be the default.)

Then there is a question of how these kconfig will interact, or at least 
what their semantics would be, considering the GuC.

I think we can modify the kconfig blurb to say they only apply to 
execlists platforms, once we get a GuC scheduling platform upstream. And 
fudge some sched mode bits for sysfs reporting in that case.

Regards,

Tvrtko
Chris Wilson Feb. 9, 2021, 10:31 a.m. UTC | #6
Quoting Tvrtko Ursulin (2021-02-09 09:37:19)
> 
> On 08/02/2021 10:52, Chris Wilson wrote:
> 
> > diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> > index 35bbe2b80596..f1d009906f71 100644
> > --- a/drivers/gpu/drm/i915/Kconfig.profile
> > +++ b/drivers/gpu/drm/i915/Kconfig.profile
> > @@ -1,3 +1,65 @@
> > +choice
> > +     prompt "Preferred scheduler"
> > +     default DRM_I915_SCHED_VIRTUAL_DEADLINE
> > +     help
> > +       Select the preferred method to decide the order of execution.
> > +
> > +       The scheduler is used for two purposes. First to defer unready
> > +       jobs to not block execution of independent ready clients, so
> > +       preventing GPU stalls while work waits for other tasks. The second
> > +       purpose is to decide which task to run next, as well as decide
> > +       if that task should preempt the currently running task, or if
> > +       the current task has exceeded its allotment of GPU time and should
> > +       be replaced.
> > +
> > +     config DRM_I915_SCHED_FIFO
> > +     bool "FIFO"
> > +     help
> > +       No task reordering, tasks are executed in order of readiness.
> > +       First in, first out.
> > +
> > +       Unready tasks do not block execution of other, independent clients.
> > +       A client will not be scheduled for execution until all of its
> > +       prerequisite work has completed.
> > +
> > +       This disables the scheduler and puts it into a pass-through mode.
> > +
> > +     config DRM_I915_SCHED_PRIORITY
> > +     bool "Priority"
> > +     help
> > +       Strict priority ordering, equal priority tasks are executed
> > +       in order of readiness. Clients are liable to starve other clients,
> > +       causing uneven execution and excess task latency. High priority
> > +       clients will preempt lower priority clients and will run
> > +       uninterrupted.
> > +
> > +       Note that interactive desktops will implicitly perform priority
> > +       boosting to minimise frame jitter.
> > +
> > +     config DRM_I915_SCHED_VIRTUAL_DEADLINE
> > +     bool "Virtual Deadline"
> > +     help
> > +       A fair scheduler based on MuQSS with priority-hinting.
> > +
> > +       When a task is ready for execution, it is given a quota (from the
> > +       engine's timeslice) and a virtual deadline. The virtual deadline is
> > +       derived from the current time and the timeslice scaled by the
> > +       task's priority. Higher priority tasks are given an earlier
> > +       deadline and receive a large portion of the execution bandwidth.
> > +
> > +       Requests are then executed in order of deadline completion.
> > +       Requests with earlier deadlines and higher priority than currently
> > +       executing on the engine will preempt the active task.
> > +
> > +endchoice
> > +
> > +config DRM_I915_SCHED
> > +     int
> > +     default 2 if DRM_I915_SCHED_VIRTUAL_DEADLINE
> > +     default 1 if DRM_I915_SCHED_PRIORITY
> > +     default 0 if DRM_I915_SCHED_FIFO
> > +     default -1
> 
> Default -1 would mean it would ask the user and not default to deadline?

CONFIG_DRM_I915_SCHED is unnamed; it is never itself presented to the
user. The choice is, and that ends up setting one of the 3 values, which
is then mapped to an integer value by DRM_I915_SCHED. That was done to
give a hierarchy to the policies, which resulted in the cascade of
supporting fifo as a subset of priorities and priorities as a subset of
deadlines. That also ties nicely into the different backends being able
to select different scheduling levels for themselves (no scheduling at
all for legacy ringbuffer and mock, deadlines for execlists/ringscheduler,
and fifo for guc).
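
As a rough illustration of that cascade: each backend asks for the highest
scheduling level it supports and the Kconfig choice acts as a compile-time
ceiling. A minimal standalone sketch (the enum names and clamping helper are
illustrative, not the driver code; the integer values mirror the mapping
above):

#include <stdio.h>

/* Mirrors the Kconfig mapping: -1 none, 0 fifo, 1 priority, 2 deadline */
#define CONFIG_DRM_I915_SCHED 2

enum sched_mode {
	SCHED_MODE_NONE = -1,
	SCHED_MODE_FIFO = 0,
	SCHED_MODE_PRIORITY = 1,
	SCHED_MODE_DEADLINE = 2,
};

/*
 * A backend requests the highest level it can support; the Kconfig choice
 * clamps it, and each level implies all the ones below it
 * (deadline => priority => fifo => none).
 */
static enum sched_mode effective_mode(enum sched_mode requested)
{
	return requested < CONFIG_DRM_I915_SCHED ?
	       requested : (enum sched_mode)CONFIG_DRM_I915_SCHED;
}

int main(void)
{
	/* e.g. execlists asks for deadlines, a fifo-only backend for fifo */
	printf("execlists -> %d\n", effective_mode(SCHED_MODE_DEADLINE));
	printf("fifo-only -> %d\n", effective_mode(SCHED_MODE_FIFO));
	return 0;
}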

> Implementation-wise it is very neat how you did it, so there is basically
> very little cost for the compiled-out options. And the code maintenance cost
> of supporting multiple options is pretty trivial as well.
> 
> The only cost I can see is potential bug reports if the "wrong" scheduler was
> picked by someone. What, or who, do you envisage the use cases would be
> for not going with deadline? (I think deadline should be the default.)

The first thing I did with it was compare none/priority/deadlines with
wsim and ift; that's what I would expect most people to try as well
(replacing wsim with their favourite benchmark). For instance, it was
reassuring that timeslicing just worked, even without priorities. Beyond
testing, it is a gesture towards putting policy back into the hands of the
user, though to truly do that we would make it a sysfs attribute.

That found a couple of bugs, making sure i915_sched_defer_request
degraded back into sorting by priorities (or not), and suggested that maybe
we should try harder to avoid semaphores without the more adaptable
scheduling modes.

As for feedback in bugs, the choice should be included with the engine
state dump.

> Then there is a question of how these kconfig will interact, or at least 
> what their semantics would be, considering the GuC.

Hence the weasel word of "preferred". This config is the maximum
scheduling level; if the backend does not provide for request reordering
at all (e.g. the ringbuffer), then a user wishing to use a different
scheduler is out of luck. Also, being a module-level parameter, different
devices within the system may support different schedulers, and yet we
still want them to interact, which poses a very real risk of priority
inversion across the boundaries. That I do not have an answer for, just
the intention to write tests to demonstrate the issue.

> I think we can modify the kconfig blurb to say they only apply to 
> execlists platforms, once we get a GuC scheduling platform upstream. And 
> fudge some sched mode bits for sysfs reporting in that case.

Aye, we will need some fudging for the GuC as it presents a very limited
interface and probably merits some unique caps.
-Chris
Tvrtko Ursulin Feb. 9, 2021, 10:40 a.m. UTC | #7
On 09/02/2021 10:31, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2021-02-09 09:37:19)
>>
>> On 08/02/2021 10:52, Chris Wilson wrote:
>>
>>> diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
>>> index 35bbe2b80596..f1d009906f71 100644
>>> --- a/drivers/gpu/drm/i915/Kconfig.profile
>>> +++ b/drivers/gpu/drm/i915/Kconfig.profile
>>> @@ -1,3 +1,65 @@
>>> +choice
>>> +     prompt "Preferred scheduler"
>>> +     default DRM_I915_SCHED_VIRTUAL_DEADLINE
>>> +     help
>>> +       Select the preferred method to decide the order of execution.
>>> +
>>> +       The scheduler is used for two purposes. First to defer unready
>>> +       jobs to not block execution of independent ready clients, so
>>> +       preventing GPU stalls while work waits for other tasks. The second
>>> +       purpose is to decide which task to run next, as well as decide
>>> +       if that task should preempt the currently running task, or if
>>> +       the current task has exceeded its allotment of GPU time and should
>>> +       be replaced.
>>> +
>>> +     config DRM_I915_SCHED_FIFO
>>> +     bool "FIFO"
>>> +     help
>>> +       No task reordering, tasks are executed in order of readiness.
>>> +       First in, first out.
>>> +
>>> +       Unready tasks do not block execution of other, independent clients.
>>> +       A client will not be scheduled for execution until all of its
>>> +       prerequisite work has completed.
>>> +
>>> +       This disables the scheduler and puts it into a pass-through mode.
>>> +
>>> +     config DRM_I915_SCHED_PRIORITY
>>> +     bool "Priority"
>>> +     help
>>> +       Strict priority ordering, equal priority tasks are executed
>>> +       in order of readiness. Clients are liable to starve other clients,
>>> +       causing uneven execution and excess task latency. High priority
>>> +       clients will preempt lower priority clients and will run
>>> +       uninterrupted.
>>> +
>>> +       Note that interactive desktops will implicitly perform priority
>>> +       boosting to minimise frame jitter.
>>> +
>>> +     config DRM_I915_SCHED_VIRTUAL_DEADLINE
>>> +     bool "Virtual Deadline"
>>> +     help
>>> +       A fair scheduler based on MuQSS with priority-hinting.
>>> +
>>> +       When a task is ready for execution, it is given a quota (from the
>>> +       engine's timeslice) and a virtual deadline. The virtual deadline is
>>> +       derived from the current time and the timeslice scaled by the
>>> +       task's priority. Higher priority tasks are given an earlier
>>> +       deadline and receive a large portion of the execution bandwidth.
>>> +
>>> +       Requests are then executed in order of deadline completion.
>>> +       Requests with earlier deadlines and higher priority than currently
>>> +       executing on the engine will preempt the active task.
>>> +
>>> +endchoice
>>> +
>>> +config DRM_I915_SCHED
>>> +     int
>>> +     default 2 if DRM_I915_SCHED_VIRTUAL_DEADLINE
>>> +     default 1 if DRM_I915_SCHED_PRIORITY
>>> +     default 0 if DRM_I915_SCHED_FIFO
>>> +     default -1
>>
>> Default -1 would mean it would ask the user and not default to deadline?
> 
> CONFIG_DRM_I915_SCHED is unnamed; it is never itself presented to the
> user. The choice is, and that ends up setting one of the 3 values, which
> is then mapped to an integer value by DRM_I915_SCHED. That was done to
> give a hierarchy to the policies, which resulted in the cascade of
> supporting fifo as a subset of priorities and priorities as a subset of
> deadlines. That also ties nicely into the different backends being able
> to select different scheduling levels for themselves (no scheduling at
> all for legacy ringbuffer and mock, deadlines for execlists/ringscheduler,
> and fifo for guc).

Yes sorry, there is "default DRM_I915_SCHED_VIRTUAL_DEADLINE" above 
which I missed.

>> Implementation-wise it is very neat how you did it, so there is basically
>> very little cost for the compiled-out options. And the code maintenance cost
>> of supporting multiple options is pretty trivial as well.
>>
>> The only cost I can see is potential bug reports if the "wrong" scheduler was
>> picked by someone. What, or who, do you envisage the use cases would be
>> for not going with deadline? (I think deadline should be the default.)
> 
> The first thing I did with it was compare none/priority/deadlines with
> wsim and ift; that's what I would expect most people to try as well
> (replacing wsim with their favourite benchmark). For instance, it was
> reassuring that timeslicing just worked, even without priorities. Beyond
> testing, it is a gesture towards putting policy back into the hands of the
> user, though to truly do that we would make it a sysfs attribute.
> 
> That found a couple of bugs, making sure i915_sched_defer_request
> degraded back into sorting by priorities (or not), and suggested that maybe
> we should try harder to avoid semaphores without the more adaptable
> scheduling modes.
> 
> As for feedback in bugs, the choice should be included with the engine
> state dump.

I think, as a minimum, some strong sentences should be put into the
"Preferred scheduler" kconfig help saying not to change the default away
from deadline unless one really, really knows what they are doing. You
know, the usual kconfig language for these sorts of situations.

>> Then there is a question of how these kconfig will interact, or at least
>> what their semantics would be, considering the GuC.
> 
> Hence the weasel word of "preferred". This config is the maximum
> scheduling level; if the backend does not provide for request reordering
> at all (e.g. the ringbuffer), then a user wishing to use a different
> scheduler is out of luck. Also, being a module-level parameter, different
> devices within the system may support different schedulers, and yet we
> still want them to interact, which poses a very real risk of priority
> inversion across the boundaries. That I do not have an answer for, just
> the intention to write tests to demonstrate the issue.

Yes, modparam vs multi-GPU is something we can solve in a generic fashion one day.

Regards,

Tvrtko
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
index 35bbe2b80596..f1d009906f71 100644
--- a/drivers/gpu/drm/i915/Kconfig.profile
+++ b/drivers/gpu/drm/i915/Kconfig.profile
@@ -1,3 +1,65 @@ 
+choice
+	prompt "Preferred scheduler"
+	default DRM_I915_SCHED_VIRTUAL_DEADLINE
+	help
+	  Select the preferred method to decide the order of execution.
+
+	  The scheduler is used for two purposes. First to defer unready
+	  jobs to not block execution of independent ready clients, so
+	  preventing GPU stalls while work waits for other tasks. The second
+	  purpose is to decide which task to run next, as well as decide
+	  if that task should preempt the currently running task, or if
+	  the current task has exceeded its allotment of GPU time and should
+	  be replaced.
+
+	config DRM_I915_SCHED_FIFO
+	bool "FIFO"
+	help
+	  No task reordering, tasks are executed in order of readiness.
+	  First in, first out.
+
+	  Unready tasks do not block execution of other, independent clients.
+	  A client will not be scheduled for execution until all of its
+	  prerequisite work has completed.
+
+	  This disables the scheduler and puts it into a pass-through mode.
+
+	config DRM_I915_SCHED_PRIORITY
+	bool "Priority"
+	help
+	  Strict priority ordering, equal priority tasks are executed
+	  in order of readiness. Clients are liable to starve other clients,
+	  causing uneven execution and excess task latency. High priority
+	  clients will preempt lower priority clients and will run
+	  uninterrupted.
+
+	  Note that interactive desktops will implicitly perform priority
+	  boosting to minimise frame jitter.
+
+	config DRM_I915_SCHED_VIRTUAL_DEADLINE
+	bool "Virtual Deadline"
+	help
+	  A fair scheduler based on MuQSS with priority-hinting.
+
+	  When a task is ready for execution, it is given a quota (from the
+	  engine's timeslice) and a virtual deadline. The virtual deadline is
+	  derived from the current time and the timeslice scaled by the
+	  task's priority. Higher priority tasks are given an earlier
+	  deadline and receive a large portion of the execution bandwidth.
+
+	  Requests are then executed in order of deadline completion.
+	  Requests with earlier deadlines and higher priority than currently
+	  executing on the engine will preempt the active task.
+
+endchoice
+
+config DRM_I915_SCHED
+	int
+	default 2 if DRM_I915_SCHED_VIRTUAL_DEADLINE
+	default 1 if DRM_I915_SCHED_PRIORITY
+	default 0 if DRM_I915_SCHED_FIFO
+	default -1
+
 config DRM_I915_FENCE_TIMEOUT
 	int "Timeout for unsignaled foreign fences (ms, jiffy granularity)"
 	default 10000 # milliseconds
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index da2447f18daa..7d34bf03670b 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -579,8 +579,6 @@  void intel_engine_init_execlists(struct intel_engine_cs *engine)
 	memset(execlists->pending, 0, sizeof(execlists->pending));
 	execlists->active =
 		memset(execlists->inflight, 0, sizeof(execlists->inflight));
-
-	execlists->queue_priority_hint = INT_MIN;
 }
 
 static void cleanup_status_page(struct intel_engine_cs *engine)
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index 5ed263f36f93..1d0e7daa6285 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -313,6 +313,7 @@  static int __intel_engine_pulse(struct intel_engine_cs *engine)
 	if (IS_ERR(rq))
 		return PTR_ERR(rq);
 
+	rq->sched.deadline = 0;
 	__set_bit(I915_FENCE_FLAG_SENTINEL, &rq->fence.flags);
 
 	heartbeat_commit(rq, &attr);
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
index 27d9d17b35cb..ef5064ea54e5 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
@@ -211,6 +211,7 @@  static bool switch_to_kernel_context(struct intel_engine_cs *engine)
 	i915_request_add_active_barriers(rq);
 
 	/* Install ourselves as a preemption barrier */
+	rq->sched.deadline = 0;
 	rq->sched.attr.priority = I915_PRIORITY_BARRIER;
 	if (likely(!__i915_request_commit(rq))) { /* engine should be idle! */
 		/*
@@ -271,9 +272,6 @@  static int __engine_park(struct intel_wakeref *wf)
 	intel_engine_park_heartbeat(engine);
 	intel_breadcrumbs_park(engine->breadcrumbs);
 
-	/* Must be reset upon idling, or we may miss the busy wakeup. */
-	GEM_BUG_ON(engine->execlists.queue_priority_hint != INT_MIN);
-
 	if (engine->park)
 		engine->park(engine);
 
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index 08bddc5263aa..d1024e8717e1 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -223,20 +223,6 @@  struct intel_engine_execlists {
 	 */
 	unsigned int port_mask;
 
-	/**
-	 * @queue_priority_hint: Highest pending priority.
-	 *
-	 * When we add requests into the queue, or adjust the priority of
-	 * executing requests, we compute the maximum priority of those
-	 * pending requests. We can then use this value to determine if
-	 * we need to preempt the executing requests to service the queue.
-	 * However, since the we may have recorded the priority of an inflight
-	 * request we wanted to preempt but since completed, at the time of
-	 * dequeuing the priority hint may no longer may match the highest
-	 * available request priority.
-	 */
-	int queue_priority_hint;
-
 	struct rb_root_cached virtual;
 
 	/**
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_user.c b/drivers/gpu/drm/i915/gt/intel_engine_user.c
index 3fab439ba22b..92632afa52ae 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_user.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_user.c
@@ -102,6 +102,7 @@  static void set_scheduler_caps(struct drm_i915_private *i915)
 #define MAP(x, y) { I915_SCHED_##x, ilog2(I915_SCHEDULER_CAP_##y) }
 		MAP(ACTIVE_BIT, ENABLED),
 		MAP(PRIORITY_BIT, PRIORITY),
+		MAP(DEADLINE_BIT, FAIR),
 		MAP(TIMESLICE_BIT, TIMESLICING),
 #undef MAP
 	};
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 4a0258347c10..e249b1423309 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -180,7 +180,7 @@  struct virtual_engine {
 	 */
 	struct ve_node {
 		struct rb_node rb;
-		int prio;
+		u64 deadline;
 	} nodes[I915_NUM_ENGINES];
 
 	/*
@@ -256,25 +256,12 @@  static void ring_set_paused(const struct intel_engine_cs *engine, int state)
 
 static int rq_prio(const struct i915_request *rq)
 {
-	return READ_ONCE(rq->sched.attr.priority);
+	return rq->sched.attr.priority;
 }
 
-static int effective_prio(const struct i915_request *rq)
+static u64 rq_deadline(const struct i915_request *rq)
 {
-	int prio = rq_prio(rq);
-
-	/*
-	 * If this request is special and must not be interrupted at any
-	 * cost, so be it. Note we are only checking the most recent request
-	 * in the context and so may be masking an earlier vip request. It
-	 * is hoped that under the conditions where nopreempt is used, this
-	 * will not matter (i.e. all requests to that context will be
-	 * nopreempt for as long as desired).
-	 */
-	if (i915_request_has_nopreempt(rq))
-		prio = I915_PRIORITY_UNPREEMPTABLE;
-
-	return prio;
+	return rq->sched.deadline;
 }
 
 static struct i915_request *first_request(const struct i915_sched *se)
@@ -289,62 +276,62 @@  static struct i915_request *first_request(const struct i915_sched *se)
 					sched.link);
 }
 
-static int queue_prio(const struct i915_sched *se)
+static struct i915_request *first_virtual(const struct intel_engine_cs *engine)
 {
-	struct i915_request *rq;
+	struct rb_node *rb;
 
-	rq = first_request(se);
-	if (!rq)
-		return INT_MIN;
+	rb = rb_first_cached(&engine->execlists.virtual);
+	if (!rb)
+		return NULL;
 
-	return rq_prio(rq);
+	return READ_ONCE(rb_entry(rb,
+				  struct virtual_engine,
+				  nodes[engine->id].rb)->request);
 }
 
-static int virtual_prio(const struct intel_engine_execlists *el)
+static const struct i915_request *
+next_elsp_request(const struct i915_sched *se, const struct i915_request *rq)
 {
-	struct rb_node *rb = rb_first_cached(&el->virtual);
+	if (i915_sched_is_last_request(se, rq))
+		return NULL;
 
-	return rb ? rb_entry(rb, struct ve_node, rb)->prio : INT_MIN;
+	return list_next_entry(rq, sched.link);
 }
 
-static bool need_preempt(struct intel_engine_cs *engine,
+static bool
+dl_before(const struct i915_request *next, const struct i915_request *prev)
+{
+	return !prev || (next && rq_deadline(next) < rq_deadline(prev));
+}
+
+static bool need_preempt(const struct intel_engine_cs *engine,
 			 const struct i915_request *rq)
 {
 	const struct i915_sched *se = &engine->sched;
-	int last_prio;
+	const struct i915_request *first = NULL;
+	const struct i915_request *next;
 
 	if (!i915_sched_use_busywait(se))
 		return false;
 
 	/*
-	 * Check if the current priority hint merits a preemption attempt.
-	 *
-	 * We record the highest value priority we saw during rescheduling
-	 * prior to this dequeue, therefore we know that if it is strictly
-	 * less than the current tail of ESLP[0], we do not need to force
-	 * a preempt-to-idle cycle.
-	 *
-	 * However, the priority hint is a mere hint that we may need to
-	 * preempt. If that hint is stale or we may be trying to preempt
-	 * ourselves, ignore the request.
-	 *
-	 * More naturally we would write
-	 *      prio >= max(0, last);
-	 * except that we wish to prevent triggering preemption at the same
-	 * priority level: the task that is running should remain running
-	 * to preserve FIFO ordering of dependencies.
+	 * If this request is special and must not be interrupted at any
+	 * cost, so be it. Note we are only checking the most recent request
+	 * in the context and so may be masking an earlier vip request. It
+	 * is hoped that under the conditions where nopreempt is used, this
+	 * will not matter (i.e. all requests to that context will be
+	 * nopreempt for as long as desired).
 	 */
-	last_prio = max(effective_prio(rq), I915_PRIORITY_NORMAL - 1);
-	if (engine->execlists.queue_priority_hint <= last_prio)
+	if (i915_request_has_nopreempt(rq))
 		return false;
 
 	/*
 	 * Check against the first request in ELSP[1], it will, thanks to the
 	 * power of PI, be the highest priority of that context.
 	 */
-	if (!list_is_last(&rq->sched.link, &se->requests) &&
-	    rq_prio(list_next_entry(rq, sched.link)) > last_prio)
-		return true;
+	next = next_elsp_request(se, rq);
+	if (dl_before(next, first))
+		first = next;
 
 	/*
 	 * If the inflight context did not trigger the preemption, then maybe
@@ -356,8 +343,31 @@  static bool need_preempt(struct intel_engine_cs *engine,
 	 * ELSP[0] or ELSP[1] as, thanks again to PI, if it was the same
 	 * context, it's priority would not exceed ELSP[0] aka last_prio.
 	 */
-	return max(virtual_prio(&engine->execlists),
-		   queue_prio(se)) > last_prio;
+	next = first_request(se);
+	if (dl_before(next, first))
+		first = next;
+
+	next = first_virtual(engine);
+	if (dl_before(next, first))
+		first = next;
+
+	if (!dl_before(first, rq))
+		return false;
+
+	/*
+	 * While a request may have been queued that has an earlier deadline
+	 * than is currently running, we only allow it to perform an urgent
+	 * preemption if it also has higher priority. The cost of frequently
+	 * switching between contexts is noticeable, so we try to keep
+	 * the deadline shuffling only to timeslice boundaries.
+	 */
+	ENGINE_TRACE(engine,
+		     "preempt for first=%llx:%llu, dl=%llu, prio=%d?\n",
+		     first->fence.context,
+		     first->fence.seqno,
+		     rq_deadline(first),
+		     rq_prio(first));
+	return rq_prio(first) > max(rq_prio(rq), I915_PRIORITY_NORMAL - 1);
 }
 
 __maybe_unused static bool
@@ -374,7 +384,15 @@  assert_priority_queue(const struct i915_request *prev,
 	if (i915_request_is_active(prev))
 		return true;
 
-	return rq_prio(prev) >= rq_prio(next);
+	if (rq_deadline(prev) <= rq_deadline(next))
+		return true;
+
+	ENGINE_TRACE(prev->engine,
+		     "next %llx:%lld dl %lld is before prev %llx:%lld dl %lld\n",
+		     next->fence.context, next->fence.seqno, rq_deadline(next),
+		     prev->fence.context, prev->fence.seqno, rq_deadline(prev));
+
+	return false;
 }
 
 static void
@@ -555,10 +573,25 @@  static void __execlists_schedule_out(struct i915_request * const rq,
 	/*
 	 * If we have just completed this context, the engine may now be
 	 * idle and we want to re-enter powersaving.
+	 *
+	 * If the context is still active, update the deadline on the next
+	 * request as we submitted it much earlier with an estimation based
+	 * on this request and all those before it consuming their whole budget.
+	 * Since the next request is ready but may have a deadline set far in
+	 * the future, we will prefer any new client before executing this
+	 * context again. If the other clients are submitting synchronous
+	 * workloads, each submission appears as a fresh piece of work and ready
+	 * to run; each time they will receive a deadline that is likely earlier
+	 * than the accumulated deadline of this context. So we re-evaluate this
+	 * context's deadline and put it on an equal footing with the
+	 * synchronous clients.
 	 */
-	if (intel_timeline_is_last(ce->timeline, rq) &&
-	    __i915_request_is_complete(rq))
-		intel_engine_add_retire(engine, ce->timeline);
+	if (__i915_request_is_complete(rq)) {
+		if (!intel_timeline_is_last(ce->timeline, rq))
+			i915_request_update_deadline(list_next_entry(rq, link));
+		else
+			intel_engine_add_retire(engine, ce->timeline);
+	}
 
 	ccid = ce->lrc.ccid;
 	ccid >>= GEN11_SW_CTX_ID_SHIFT - 32;
@@ -668,14 +701,14 @@  dump_port(char *buf, int buflen, const char *prefix, struct i915_request *rq)
 	if (!rq)
 		return "";
 
-	snprintf(buf, buflen, "%sccid:%x %llx:%lld%s prio %d",
+	snprintf(buf, buflen, "%sccid:%x %llx:%lld%s dl:%llu",
 		 prefix,
 		 rq->context->lrc.ccid,
 		 rq->fence.context, rq->fence.seqno,
 		 __i915_request_is_complete(rq) ? "!" :
 		 __i915_request_has_started(rq) ? "*" :
 		 "",
-		 rq_prio(rq));
+		 rq_deadline(rq));
 
 	return buf;
 }
@@ -1197,11 +1230,11 @@  static void execlists_dequeue(struct intel_engine_cs *engine)
 	if (last) {
 		if (need_preempt(engine, last)) {
 			ENGINE_TRACE(engine,
-				     "preempting last=%llx:%lld, prio=%d, hint=%d\n",
+				     "preempting last=%llx:%llu, dl=%llu, prio=%d\n",
 				     last->fence.context,
 				     last->fence.seqno,
-				     last->sched.attr.priority,
-				     execlists->queue_priority_hint);
+				     rq_deadline(last),
+				     rq_prio(last));
 			record_preemption(execlists);
 
 			/*
@@ -1223,11 +1256,11 @@  static void execlists_dequeue(struct intel_engine_cs *engine)
 			last = NULL;
 		} else if (timeslice_expired(engine, last)) {
 			ENGINE_TRACE(engine,
-				     "expired:%s last=%llx:%lld, prio=%d, hint=%d, yield?=%s\n",
+				     "expired:%s last=%llx:%llu, deadline=%llu, now=%llu, yield?=%s\n",
 				     yesno(timer_expired(&execlists->timer)),
 				     last->fence.context, last->fence.seqno,
-				     rq_prio(last),
-				     execlists->queue_priority_hint,
+				     rq_deadline(last),
+				     i915_sched_to_ticks(ktime_get()),
 				     yesno(timeslice_yield(execlists, last)));
 
 			/*
@@ -1298,7 +1331,7 @@  static void execlists_dequeue(struct intel_engine_cs *engine)
 		GEM_BUG_ON(rq->engine != &ve->base);
 		GEM_BUG_ON(rq->context != &ve->context);
 
-		if (unlikely(rq_prio(rq) < queue_prio(se))) {
+		if (!dl_before(rq, first_request(se))) {
 			spin_unlock(&ve->base.sched.lock);
 			break;
 		}
@@ -1310,16 +1343,15 @@  static void execlists_dequeue(struct intel_engine_cs *engine)
 		}
 
 		ENGINE_TRACE(engine,
-			     "virtual rq=%llx:%lld%s, new engine? %s\n",
+			     "virtual rq=%llx:%lld%s, dl %llx, new engine? %s\n",
 			     rq->fence.context,
 			     rq->fence.seqno,
 			     __i915_request_is_complete(rq) ? "!" :
 			     __i915_request_has_started(rq) ? "*" :
 			     "",
+			     rq_deadline(rq),
 			     yesno(engine != ve->siblings[0]));
-
 		WRITE_ONCE(ve->request, NULL);
-		WRITE_ONCE(ve->base.execlists.queue_priority_hint, INT_MIN);
 
 		rb = &ve->nodes[engine->id].rb;
 		rb_erase_cached(rb, &execlists->virtual);
@@ -1407,6 +1439,9 @@  static void execlists_dequeue(struct intel_engine_cs *engine)
 			if (rq->execution_mask != engine->mask)
 				goto done;
 
+			if (unlikely(dl_before(first_virtual(engine), rq)))
+				goto done;
+
 			/*
 			 * If GVT overrides us we only ever submit
 			 * port[0], leaving port[1] empty. Note that we
@@ -1440,24 +1475,6 @@  static void execlists_dequeue(struct intel_engine_cs *engine)
 	}
 done:
 	*port++ = i915_request_get(last);
-
-	/*
-	 * Here be a bit of magic! Or sleight-of-hand, whichever you prefer.
-	 *
-	 * We choose the priority hint such that if we add a request of greater
-	 * priority than this, we kick the submission tasklet to decide on
-	 * the right order of submitting the requests to hardware. We must
-	 * also be prepared to reorder requests as they are in-flight on the
-	 * HW. We derive the priority hint then as the first "hole" in
-	 * the HW submission ports and if there are no available slots,
-	 * the priority of the lowest executing request, i.e. last.
-	 *
-	 * When we do receive a higher priority request ready to run from the
-	 * user, see queue_request(), the priority hint is bumped to that
-	 * request triggering preemption on the next dequeue (or subsequent
-	 * interrupt for secondary ports).
-	 */
-	execlists->queue_priority_hint = pl->priority;
 	spin_unlock(&se->lock);
 
 	/*
@@ -2653,15 +2670,6 @@  static void execlists_reset_rewind(struct intel_engine_cs *engine, bool stalled)
 	rcu_read_unlock();
 }
 
-static void nop_submission_tasklet(struct tasklet_struct *t)
-{
-	struct intel_engine_cs * const engine =
-		from_tasklet(engine, t, sched.tasklet);
-
-	/* The driver is wedged; don't process any more events. */
-	WRITE_ONCE(engine->execlists.queue_priority_hint, INT_MIN);
-}
-
 static void execlists_reset_cancel(struct intel_engine_cs *engine)
 {
 	struct intel_engine_execlists * const execlists = &engine->execlists;
@@ -2710,17 +2718,10 @@  static void execlists_reset_cancel(struct intel_engine_cs *engine)
 				i915_request_put(rq);
 			}
 			i915_request_put(rq);
-
-			ve->base.execlists.queue_priority_hint = INT_MIN;
 		}
 		spin_unlock(&ve->base.sched.lock);
 	}
 
-	execlists->queue_priority_hint = INT_MIN;
-
-	GEM_BUG_ON(__tasklet_is_enabled(&se->tasklet));
-	se->tasklet.callback = nop_submission_tasklet;
-
 	spin_unlock_irqrestore(&se->lock, flags);
 	rcu_read_unlock();
 
@@ -2831,7 +2832,6 @@  static bool can_preempt(struct intel_engine_cs *engine)
 static void execlists_set_default_submission(struct intel_engine_cs *engine)
 {
 	engine->sched.submit_request = i915_request_enqueue;
-	engine->sched.tasklet.callback = execlists_submission_tasklet;
 }
 
 static void execlists_shutdown(struct intel_engine_cs *engine)
@@ -2957,7 +2957,7 @@  static void init_execlists(struct intel_engine_cs *engine)
 	engine->sched.show = execlists_show;
 	tasklet_setup(&engine->sched.tasklet, execlists_submission_tasklet);
 
-	i915_sched_select_mode(&engine->sched, I915_SCHED_MODE_PRIORITY);
+	i915_sched_select_mode(&engine->sched, I915_SCHED_MODE_DEADLINE);
 
 	if (IS_ACTIVE(CONFIG_DRM_I915_TIMESLICE_DURATION) &&
 	    intel_engine_has_preemption(engine))
@@ -3193,7 +3193,8 @@  static const struct intel_context_ops virtual_context_ops = {
 	.destroy = virtual_context_destroy,
 };
 
-static intel_engine_mask_t virtual_submission_mask(struct virtual_engine *ve)
+static intel_engine_mask_t
+virtual_submission_mask(struct virtual_engine *ve, u64 *deadline)
 {
 	struct i915_request *rq;
 	intel_engine_mask_t mask;
@@ -3210,9 +3211,11 @@  static intel_engine_mask_t virtual_submission_mask(struct virtual_engine *ve)
 		mask = ve->siblings[0]->mask;
 	}
 
-	ENGINE_TRACE(&ve->base, "rq=%llx:%lld, mask=%x, prio=%d\n",
+	*deadline = rq_deadline(rq);
+
+	ENGINE_TRACE(&ve->base, "rq=%llx:%llu, mask=%x, dl=%llu\n",
 		     rq->fence.context, rq->fence.seqno,
-		     mask, ve->base.execlists.queue_priority_hint);
+		     mask, *deadline);
 
 	return mask;
 }
@@ -3221,12 +3224,12 @@  static void virtual_submission_tasklet(struct tasklet_struct *t)
 {
 	struct virtual_engine * const ve =
 		from_tasklet(ve, t, base.sched.tasklet);
-	const int prio = READ_ONCE(ve->base.execlists.queue_priority_hint);
 	intel_engine_mask_t mask;
 	unsigned int n;
+	u64 deadline;
 
 	rcu_read_lock();
-	mask = virtual_submission_mask(ve);
+	mask = virtual_submission_mask(ve, &deadline);
 	rcu_read_unlock();
 	if (unlikely(!mask))
 		return;
@@ -3260,7 +3263,8 @@  static void virtual_submission_tasklet(struct tasklet_struct *t)
 			 */
 			first = rb_first_cached(&sibling->execlists.virtual) ==
 				&node->rb;
-			if (prio == node->prio || (prio > node->prio && first))
+			if (deadline == node->deadline ||
+			    (deadline < node->deadline && first))
 				goto submit_engine;
 
 			rb_erase_cached(&node->rb, &sibling->execlists.virtual);
@@ -3274,7 +3278,7 @@  static void virtual_submission_tasklet(struct tasklet_struct *t)
 
 			rb = *parent;
 			other = rb_entry(rb, typeof(*other), rb);
-			if (prio > other->prio) {
+			if (deadline < other->deadline) {
 				parent = &rb->rb_left;
 			} else {
 				parent = &rb->rb_right;
@@ -3289,8 +3293,8 @@  static void virtual_submission_tasklet(struct tasklet_struct *t)
 
 submit_engine:
 		GEM_BUG_ON(RB_EMPTY_NODE(&node->rb));
-		node->prio = prio;
-		if (first && prio > sibling->execlists.queue_priority_hint)
+		node->deadline = deadline;
+		if (first)
 			i915_sched_kick(se);
 
 unlock_engine:
@@ -3327,7 +3331,9 @@  static void virtual_submit_request(struct i915_request *rq)
 		i915_request_put(ve->request);
 	}
 
-	ve->base.execlists.queue_priority_hint = rq_prio(rq);
+	rq->sched.deadline =
+		min(rq->sched.deadline,
+		    i915_scheduler_next_virtual_deadline(rq_prio(rq)));
 	ve->request = i915_request_get(rq);
 
 	GEM_BUG_ON(!list_empty(virtual_queue(ve)));
@@ -3429,7 +3435,6 @@  intel_execlists_create_virtual(struct intel_engine_cs **siblings,
 	ve->base.bond_execute = virtual_bond_execute;
 
 	INIT_LIST_HEAD(virtual_queue(ve));
-	ve->base.execlists.queue_priority_hint = INT_MIN;
 
 	intel_context_init(&ve->context, &ve->base);
 
diff --git a/drivers/gpu/drm/i915/gt/selftest_execlists.c b/drivers/gpu/drm/i915/gt/selftest_execlists.c
index be99fbd7cfab..112a09aa0d8d 100644
--- a/drivers/gpu/drm/i915/gt/selftest_execlists.c
+++ b/drivers/gpu/drm/i915/gt/selftest_execlists.c
@@ -868,7 +868,7 @@  semaphore_queue(struct intel_engine_cs *engine, struct i915_vma *vma, int idx)
 static int
 release_queue(struct intel_engine_cs *engine,
 	      struct i915_vma *vma,
-	      int idx, int prio)
+	      int idx, u64 deadline)
 {
 	struct i915_request *rq;
 	u32 *cs;
@@ -893,10 +893,7 @@  release_queue(struct intel_engine_cs *engine,
 	i915_request_get(rq);
 	i915_request_add(rq);
 
-	local_bh_disable();
-	i915_request_set_priority(rq, prio);
-	local_bh_enable(); /* kick tasklet */
-
+	i915_request_set_deadline(rq, deadline);
 	i915_request_put(rq);
 
 	return 0;
@@ -910,6 +907,7 @@  slice_semaphore_queue(struct intel_engine_cs *outer,
 	struct intel_engine_cs *engine;
 	struct i915_request *head;
 	enum intel_engine_id id;
+	long timeout;
 	int err, i, n = 0;
 
 	head = semaphore_queue(outer, vma, n++);
@@ -933,12 +931,16 @@  slice_semaphore_queue(struct intel_engine_cs *outer,
 		}
 	}
 
-	err = release_queue(outer, vma, n, I915_PRIORITY_BARRIER);
+	err = release_queue(outer, vma, n, 0);
 	if (err)
 		goto out;
 
-	if (i915_request_wait(head, 0,
-			      2 * outer->gt->info.num_engines * (count + 2) * (count + 3)) < 0) {
+	/* Expected number of pessimal slices required */
+	timeout = outer->gt->info.num_engines * (count + 2) * (count + 3);
+	timeout *= 4; /* safety factor, including bucketing */
+	timeout += HZ / 2; /* and include the request completion */
+
+	if (i915_request_wait(head, 0, timeout) < 0) {
 		pr_err("%s: Failed to slice along semaphore chain of length (%d, %d)!\n",
 		       outer->name, count, n);
 		GEM_TRACE_DUMP();
@@ -1043,6 +1045,8 @@  create_rewinder(struct intel_context *ce,
 		err = i915_request_await_dma_fence(rq, &wait->fence);
 		if (err)
 			goto err;
+
+		i915_request_set_deadline(rq, rq_deadline(wait));
 	}
 
 	cs = intel_ring_begin(rq, 14);
@@ -1319,6 +1323,7 @@  static int live_timeslice_queue(void *arg)
 			goto err_heartbeat;
 		}
 		i915_request_set_priority(rq, I915_PRIORITY_MAX);
+		i915_request_set_deadline(rq, 0);
 		err = wait_for_submit(engine, rq, HZ / 2);
 		if (err) {
 			pr_err("%s: Timed out trying to submit semaphores\n",
@@ -1341,10 +1346,9 @@  static int live_timeslice_queue(void *arg)
 		}
 
 		GEM_BUG_ON(i915_request_completed(rq));
-		GEM_BUG_ON(execlists_active(&engine->execlists) != rq);
 
 		/* Queue: semaphore signal, matching priority as semaphore */
-		err = release_queue(engine, vma, 1, effective_prio(rq));
+		err = release_queue(engine, vma, 1, rq_deadline(rq));
 		if (err)
 			goto err_rq;
 
@@ -1455,6 +1459,7 @@  static int live_timeslice_nopreempt(void *arg)
 			goto out_spin;
 		}
 
+		rq->sched.deadline = 0;
 		rq->sched.attr.priority = I915_PRIORITY_BARRIER;
 		i915_request_get(rq);
 		i915_request_add(rq);
@@ -1818,6 +1823,7 @@  static int live_late_preempt(void *arg)
 
 	/* Make sure ctx_lo stays before ctx_hi until we trigger preemption. */
 	ctx_lo->sched.priority = 1;
+	ctx_hi->sched.priority = I915_PRIORITY_MIN;
 
 	for_each_engine(engine, gt, id) {
 		struct igt_live_test t;
@@ -2985,6 +2991,9 @@  static int live_preempt_gang(void *arg)
 		while (rq) { /* wait for each rq from highest to lowest prio */
 			struct i915_request *n = list_next_entry(rq, mock.link);
 
+			/* With deadlines, no strict priority ordering */
+			i915_request_set_deadline(rq, 0);
+
 			if (err == 0 && i915_request_wait(rq, 0, HZ / 5) < 0) {
 				struct drm_printer p =
 					drm_info_printer(engine->i915->drm.dev);
@@ -3207,6 +3216,7 @@  static int preempt_user(struct intel_engine_cs *engine,
 	i915_request_add(rq);
 
 	i915_request_set_priority(rq, I915_PRIORITY_MAX);
+	i915_request_set_deadline(rq, 0);
 
 	if (i915_request_wait(rq, 0, HZ / 2) < 0)
 		err = -ETIME;
diff --git a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
index cdb0ceff3be1..5323fd56efd6 100644
--- a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
+++ b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
@@ -1010,7 +1010,10 @@  static int __igt_reset_engines(struct intel_gt *gt,
 					break;
 				}
 
-				if (i915_request_wait(rq, 0, HZ / 5) < 0) {
+				/* With deadlines, no strict priority */
+				i915_request_set_deadline(rq, 0);
+
+				if (i915_request_wait(rq, 0, HZ / 2) < 0) {
 					struct drm_printer p =
 						drm_info_printer(gt->i915->drm.dev);
 
diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c
index 6d73add47109..b7dd5646c882 100644
--- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
+++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
@@ -1257,6 +1257,7 @@  poison_registers(struct intel_context *ce,
 
 	intel_ring_advance(rq, cs);
 
+	rq->sched.deadline = 0;
 	rq->sched.attr.priority = I915_PRIORITY_BARRIER;
 err_rq:
 	i915_request_add(rq);
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index c16393df42a0..79205e9a84ba 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -216,7 +216,6 @@  static void __guc_dequeue(struct intel_engine_cs *engine)
 		last = rq;
 	}
 done:
-	execlists->queue_priority_hint = pl->priority;
 	if (submit) {
 		*port = schedule_in(last, port - execlists->inflight);
 		*++port = NULL;
@@ -322,7 +321,6 @@  static void guc_reset_rewind(struct intel_engine_cs *engine, bool stalled)
 
 static void guc_reset_cancel(struct intel_engine_cs *engine)
 {
-	struct intel_engine_execlists * const execlists = &engine->execlists;
 	struct i915_sched *se = intel_engine_get_scheduler(engine);
 	unsigned long flags;
 
@@ -346,8 +344,6 @@  static void guc_reset_cancel(struct intel_engine_cs *engine)
 
 	__i915_sched_cancel_queue(se);
 
-	execlists->queue_priority_hint = INT_MIN;
-
 	spin_unlock_irqrestore(&se->lock, flags);
 	intel_engine_signal_breadcrumbs(engine);
 }
diff --git a/drivers/gpu/drm/i915/i915_priolist_types.h b/drivers/gpu/drm/i915/i915_priolist_types.h
index ee7482b9c813..542b47078104 100644
--- a/drivers/gpu/drm/i915/i915_priolist_types.h
+++ b/drivers/gpu/drm/i915/i915_priolist_types.h
@@ -22,6 +22,8 @@  enum {
 
 	/* Interactive workload, scheduled for immediate pageflipping */
 	I915_PRIORITY_DISPLAY,
+
+	__I915_PRIORITY_KERNEL__
 };
 
 /* Smallest priority value that cannot be bumped. */
@@ -35,8 +37,7 @@  enum {
  * i.e. nothing can have higher priority and force us to usurp the
  * active request.
  */
-#define I915_PRIORITY_UNPREEMPTABLE INT_MAX
-#define I915_PRIORITY_BARRIER (I915_PRIORITY_UNPREEMPTABLE - 1)
+#define I915_PRIORITY_BARRIER INT_MAX
 
 /*
  * The slab returns power-of-two chunks of memory, so fill out the
@@ -82,7 +83,7 @@  enum {
  */
 struct i915_priolist {
 	struct list_head requests;
-	int priority;
+	u64 deadline;
 
 	int level;
 	struct i915_priolist *next[I915_PRIOLIST_HEIGHT];
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index e7b4c4bc41a6..ce828dc73402 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -467,7 +467,7 @@  bool __i915_request_submit(struct i915_request *request)
 	struct i915_sched *se = intel_engine_get_scheduler(engine);
 	bool result = false;
 
-	RQ_TRACE(request, "\n");
+	RQ_TRACE(request, "dl %llu\n", request->sched.deadline);
 
 	lockdep_assert_held(&se->lock);
 
@@ -650,6 +650,12 @@  semaphore_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
 
 	switch (state) {
 	case FENCE_COMPLETE:
+		/*
+		 * The request is now ready to run; re-evaluate its deadline
+		 * to remove the semaphore deprioritisation and to assign
+		 * a deadline relative to its point-of-readiness [now].
+		 */
+		i915_request_update_deadline(rq);
 		break;
 
 	case FENCE_FREE:
@@ -1810,14 +1816,15 @@  long i915_request_wait(struct i915_request *rq,
 	return timeout;
 }
 
-static int print_sched_attr(const struct i915_sched_attr *attr,
-			    char *buf, int x, int len)
+static int print_sched(const struct i915_sched_node *node,
+		       char *buf, int x, int len)
 {
-	if (attr->priority == I915_PRIORITY_INVALID)
+	if (node->attr.priority == I915_PRIORITY_INVALID)
 		return x;
 
 	x += snprintf(buf + x, len - x,
-		      " prio=%d", attr->priority);
+		      " prio=%d, dl=%llu",
+		      node->attr.priority, node->deadline);
 
 	return x;
 }
@@ -1903,7 +1910,7 @@  void i915_request_show(struct drm_printer *m,
 	 *    - the request has been temporarily suspended from execution
 	 */
 
-	x = print_sched_attr(&rq->sched.attr, buf, x, sizeof(buf));
+	x = print_sched(&rq->sched, buf, x, sizeof(buf));
 
 	drm_printf(m, "%s%.*s%c %llx:%lld%s%s %s @ %dms: %s\n",
 		   prefix, indent, "                ",
diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
index 518eac67959e..1d77ece46241 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.c
+++ b/drivers/gpu/drm/i915/i915_scheduler.c
@@ -54,6 +54,11 @@  static void node_put(struct i915_sched_node *node)
 	i915_request_put(container_of(node, struct i915_request, sched));
 }
 
+static inline u64 rq_deadline(const struct i915_request *rq)
+{
+	return READ_ONCE(rq->sched.deadline);
+}
+
 static inline int rq_prio(const struct i915_request *rq)
 {
 	return READ_ONCE(rq->sched.attr.priority);
@@ -67,6 +72,14 @@  static int ipi_get_prio(struct i915_request *rq)
 	return xchg(&rq->sched.ipi_priority, I915_PRIORITY_INVALID);
 }
 
+static u64 ipi_get_deadline(struct i915_request *rq)
+{
+	if (READ_ONCE(rq->sched.ipi_deadline) == I915_DEADLINE_NEVER)
+		return I915_DEADLINE_NEVER;
+
+	return xchg64(&rq->sched.ipi_deadline, I915_DEADLINE_NEVER);
+}
+
 static void ipi_schedule(struct work_struct *wrk)
 {
 	struct i915_sched_ipi *ipi = container_of(wrk, typeof(*ipi), work);
@@ -74,9 +87,11 @@  static void ipi_schedule(struct work_struct *wrk)
 
 	do {
 		struct i915_request *rn = xchg(&rq->sched.ipi_link, NULL);
+		u64 deadline;
 		int prio;
 
 		prio = ipi_get_prio(rq);
+		deadline = ipi_get_deadline(rq);
 
 		/*
 		 * For cross-engine scheduling to work we rely on one of two
@@ -101,6 +116,7 @@  static void ipi_schedule(struct work_struct *wrk)
 		 */
 		local_bh_disable();
 		i915_request_set_priority(rq, prio);
+		i915_request_set_deadline(rq, deadline);
 		local_bh_enable();
 
 		i915_request_put(rq);
@@ -158,7 +174,10 @@  i915_sched_default_revoke_context(struct intel_context *ce,
 
 void i915_sched_select_mode(struct i915_sched *se, enum i915_sched_mode mode)
 {
-	switch (mode) {
+	switch (min_t(int, mode, CONFIG_DRM_I915_SCHED)) {
+	case I915_SCHED_MODE_DEADLINE:
+		__set_bit(I915_SCHED_DEADLINE_BIT, &se->flags);
+		fallthrough;
 	case I915_SCHED_MODE_PRIORITY:
 		__set_bit(I915_SCHED_PRIORITY_BIT, &se->flags);
 		fallthrough;
@@ -175,8 +194,8 @@  static void init_priolist(struct i915_priolist_root *const root)
 	struct i915_priolist *pl = &root->sentinel;
 
 	memset_p((void **)pl->next, pl, ARRAY_SIZE(pl->next));
+	pl->deadline = I915_DEADLINE_NEVER;
 	pl->requests.prev = NULL;
-	pl->priority = INT_MIN;
 	pl->level = -1;
 }
 
@@ -339,19 +358,20 @@  static inline unsigned int random_level(struct i915_priolist_root *root)
 }
 
 static struct list_head *
-lookup_priolist(struct i915_sched *se, int prio)
+lookup_priolist(struct i915_sched * const se, u64 deadline)
 {
 	struct i915_priolist *update[I915_PRIOLIST_HEIGHT];
 	struct i915_priolist_root *const root = &se->queue;
 	struct i915_priolist *pl, *tmp;
 	int lvl;
 
+	GEM_BUG_ON(deadline == I915_DEADLINE_NEVER);
 	lockdep_assert_held(&se->lock);
 	if (unlikely(se->no_priolist))
-		prio = I915_PRIORITY_NORMAL;
+		deadline = 0;
 
 	for_each_priolist(pl, root) { /* recycle any empty elements before us */
-		if (pl->priority <= prio || !list_empty(&pl->requests))
+		if (pl->deadline >= deadline || !list_empty(&pl->requests))
 			break;
 
 		__i915_sched_dequeue_next(se);
@@ -361,14 +381,14 @@  lookup_priolist(struct i915_sched *se, int prio)
 	pl = &root->sentinel;
 	lvl = pl->level;
 	while (lvl >= 0) {
-		while (tmp = pl->next[lvl], tmp->priority >= prio)
+		while (tmp = pl->next[lvl], tmp->deadline <= deadline)
 			pl = tmp;
-		if (pl->priority == prio)
+		if (pl->deadline == deadline)
 			goto out;
 		update[lvl--] = pl;
 	}
 
-	if (prio == I915_PRIORITY_NORMAL) {
+	if (!deadline) {
 		pl = &se->default_priolist;
 	} else if (!pl_empty(&root->sentinel.requests)) {
 		pl = pl_pop(&root->sentinel.requests);
@@ -376,7 +396,7 @@  lookup_priolist(struct i915_sched *se, int prio)
 		pl = kmem_cache_alloc(global.slab_priorities, GFP_ATOMIC);
 		/* Convert an allocation failure to a priority bump */
 		if (unlikely(!pl)) {
-			prio = I915_PRIORITY_NORMAL; /* recurses just once */
+			deadline = 0; /* recurses just once */
 
 			/*
 			 * To maintain ordering with all rendering, after an
@@ -392,7 +412,7 @@  lookup_priolist(struct i915_sched *se, int prio)
 		}
 	}
 
-	pl->priority = prio;
+	pl->deadline = deadline;
 	INIT_LIST_HEAD(&pl->requests);
 
 	lvl = random_level(root);
@@ -420,7 +440,7 @@  lookup_priolist(struct i915_sched *se, int prio)
 		chk = &root->sentinel;
 		lvl = chk->level;
 		do {
-			while (tmp = chk->next[lvl], tmp->priority >= prio)
+			while (tmp = chk->next[lvl], tmp->deadline <= deadline)
 				chk = tmp;
 		} while (--lvl >= 0);
 
@@ -438,7 +458,7 @@  static void __remove_priolist(struct i915_sched *se, struct list_head *plist)
 	struct i915_priolist *pl, *tmp;
 	struct i915_priolist *old =
 		container_of(plist, struct i915_priolist, requests);
-	int prio = old->priority;
+	u64 deadline = old->deadline;
 	int lvl;
 
 	lockdep_assert_held(&se->lock);
@@ -448,11 +468,11 @@  static void __remove_priolist(struct i915_sched *se, struct list_head *plist)
 	lvl = pl->level;
 	GEM_BUG_ON(lvl < 0);
 
-	if (prio != I915_PRIORITY_NORMAL)
+	if (deadline)
 		pl_push(old, &pl->requests);
 
 	do {
-		while (tmp = pl->next[lvl], tmp->priority > prio)
+		while (tmp = pl->next[lvl], tmp->deadline < deadline)
 			pl = tmp;
 		if (lvl <= old->level) {
 			pl->next[lvl] = old->next[lvl];
@@ -495,7 +515,7 @@  struct i915_priolist *__i915_sched_dequeue_next(struct i915_sched *se)
 	GEM_BUG_ON(pl == s);
 
 	/* Keep pl->next[0] valid for for_each_priolist iteration */
-	if (pl->priority != I915_PRIORITY_NORMAL)
+	if (pl->deadline)
 		pl_push(pl, &s->requests);
 
 	lvl = pl->level;
@@ -531,52 +551,267 @@  stack_pop(struct i915_request *rq,
 	return rq;
 }
 
-static inline bool need_preempt(int prio, int active)
+static void ipi_deadline(struct i915_request *rq, u64 deadline)
 {
-	/*
-	 * Allow preemption of low -> normal -> high, but we do
-	 * not allow low priority tasks to preempt other low priority
-	 * tasks under the impression that latency for low priority
-	 * tasks does not matter (as much as background throughput),
-	 * so kiss.
-	 */
-	return prio >= max(I915_PRIORITY_NORMAL, active);
+	u64 old = READ_ONCE(rq->sched.ipi_deadline);
+
+	do {
+		if (deadline >= old)
+			return;
+	} while (!try_cmpxchg64(&rq->sched.ipi_deadline, &old, deadline));
+
+	__ipi_add(rq);
 }
 
-static void kick_submission(struct intel_engine_cs *engine,
-			    const struct i915_request *rq,
-			    int prio)
+static bool is_first_priolist(const struct i915_sched *se,
+			      const struct list_head *requests)
 {
-	const struct i915_request *inflight;
+	return requests == &se->queue.sentinel.next[0]->requests;
+}
+
+static bool
+__i915_request_set_deadline(struct i915_sched * const se,
+			    struct i915_request *rq,
+			    u64 deadline)
+{
+	struct intel_engine_cs *engine = rq->engine;
+	struct list_head *pos = &rq->sched.signalers_list;
+	struct list_head *plist;
+
+	if (unlikely(!i915_request_in_priority_queue(rq))) {
+		rq->sched.deadline = deadline;
+		return false;
+	}
+
+	/* Fifo and depth-first replacement ensure our deps execute first */
+	plist = lookup_priolist(se, deadline);
+
+	rq->sched.dfs.prev = NULL;
+	do {
+		if (i915_sched_has_deadlines(se)) {
+			list_for_each_continue(pos, &rq->sched.signalers_list) {
+				struct i915_dependency *p =
+					list_entry(pos, typeof(*p), signal_link);
+				struct i915_request *s =
+					container_of(p->signaler, typeof(*s), sched);
+
+				if (rq_deadline(s) <= deadline)
+					continue;
+
+				if (__i915_request_is_complete(s))
+					continue;
+
+				if (s->engine != engine) {
+					ipi_deadline(s, deadline);
+					continue;
+				}
+
+				/* Remember our position along this branch */
+				rq = stack_push(s, rq, pos);
+				pos = &rq->sched.signalers_list;
+			}
+		}
+
+		RQ_TRACE(rq, "set-deadline:%llu\n", deadline);
+		WRITE_ONCE(rq->sched.deadline, deadline);
+
+		/*
+		 * Once the request is ready, it will be placed into the
+		 * priority lists and then onto the HW runlist. Before the
+		 * request is ready, it does not contribute to our preemption
+		 * decisions and we can safely ignore it, as it will, and
+		 * any preemption required, be dealt with upon submission.
+		 * See engine->submit_request()
+		 */
+		GEM_BUG_ON(i915_request_get_scheduler(rq) != se);
+		if (i915_request_in_priority_queue(rq))
+			remove_from_priolist(se, rq, plist, true);
+	} while ((rq = stack_pop(rq, &pos)));
+
+	return is_first_priolist(se, plist);
+}
+
+void i915_request_set_deadline(struct i915_request *rq, u64 deadline)
+{
+	struct intel_engine_cs *engine;
+	unsigned long flags;
+
+	if (deadline >= rq_deadline(rq))
+		return;
+
+	engine = lock_engine_irqsave(rq, flags);
+	if (!i915_sched_has_deadlines(&engine->sched))
+		goto unlock;
+
+	if (deadline >= rq_deadline(rq))
+		goto unlock;
+
+	if (__i915_request_is_complete(rq))
+		goto unlock;
+
+	rcu_read_lock();
+	if (__i915_request_set_deadline(&engine->sched, rq, deadline))
+		i915_sched_kick(&engine->sched);
+	rcu_read_unlock();
+	GEM_BUG_ON(rq_deadline(rq) != deadline);
+
+unlock:
+	spin_unlock_irqrestore(&engine->sched.lock, flags);
+}
+
+static u64 prio_slice(int prio)
+{
+	u64 slice;
+	int sf;
 
 	/*
-	 * We only need to kick the tasklet once for the high priority
-	 * new context we add into the queue.
+	 * This is the central heuristic to the virtual deadlines. By
+	 * imposing that each task takes an equal amount of time, we
+	 * let each client have an equal slice of the GPU time. By
+	 * bringing the virtual deadline forward, that client will then
+	 * have more GPU time, and vice versa a lower priority client will
+	 * have a later deadline and receive less GPU time.
+	 *
+	 * In BFS/MuQSS, the prio_ratios[] are based on the task nice range of
+	 * [-20, 20], with each lower priority having a ~10% longer deadline,
+	 * with the note that the proportion of CPU time between two clients
+	 * of different priority will be the square of the relative prio_slice.
+	 *
+	 * This property that the budget of each client is proportional to
+	 * the relative priority, and that the scheduler fairly distributes
+	 * work according to that budget, opens up a very powerful tool
+	 * for managing clients.
+	 *
+	 * In contrast, this prio_slice() curve was chosen because it gave good
+	 * results with igt/gem_exec_schedule. It may not be the best choice!
+	 *
+	 * With a 1ms scheduling quantum:
+	 *
+	 *   MAX USER:  ~32us deadline
+	 *   0:         ~16ms deadline
+	 *   MIN_USER: 1000ms deadline
 	 */
-	if (prio <= engine->execlists.queue_priority_hint)
-		return;
 
-	/* Nothing currently active? We're overdue for a submission! */
-	inflight = execlists_active(&engine->execlists);
-	if (!inflight)
-		return;
+	if (prio >= __I915_PRIORITY_KERNEL__)
+		return INT_MAX - prio;
+
+	slice = __I915_PRIORITY_KERNEL__ - prio;
+	if (prio >= 0)
+		sf = 20 - 6;
+	else
+		sf = 20 - 1;
+
+	return slice << sf;
+}
+
+static u64 virtual_deadline(u64 kt, int priority)
+{
+	return i915_sched_to_ticks(kt + prio_slice(priority));
+}
+
+u64 i915_scheduler_next_virtual_deadline(int priority)
+{
+	return virtual_deadline(ktime_get_mono_fast_ns(), priority);
+}
+
+static u64 signal_deadline(const struct i915_request *rq)
+{
+	u64 last = ktime_get_mono_fast_ns();
+	const struct i915_dependency *p;
 
 	/*
-	 * If we are already the currently executing context, don't
-	 * bother evaluating if we should preempt ourselves.
+	 * Find the earliest point at which we will become 'ready',
+	 * which we infer from the deadline of all active signalers.
+	 * We will position ourselves at the end of that chain of work.
 	 */
-	if (inflight->context == rq->context)
-		return;
 
-	SCHED_TRACE(&engine->sched,
-		    "bumping queue-priority-hint:%d for rq:" RQ_FMT ", inflight:" RQ_FMT " prio %d\n",
-		    prio,
-		    RQ_ARG(rq), RQ_ARG(inflight),
-		    inflight->sched.attr.priority);
+	rcu_read_lock();
+	for_each_signaler(p, rq) {
+		const struct i915_request *s =
+			container_of(p->signaler, typeof(*s), sched);
+		u64 deadline;
+		int prio;
 
-	engine->execlists.queue_priority_hint = prio;
-	if (need_preempt(prio, rq_prio(inflight)))
-		intel_engine_kick_scheduler(engine);
+		if (__i915_request_is_complete(s))
+			continue;
+
+		if (s->timeline == rq->timeline &&
+		    __i915_request_has_started(s))
+			continue;
+
+		prio = rq_prio(s);
+		if (prio < rq_prio(rq))
+			continue;
+
+		deadline = rq_deadline(s);
+		if (deadline == I915_DEADLINE_NEVER) /* retired & reused */
+			continue;
+
+		if (s->context == rq->context) /* break ties in favour of hot */
+			deadline--;
+
+		deadline = i915_sched_to_ns(deadline);
+		if (p->flags & I915_DEPENDENCY_WEAK)
+			deadline -= prio_slice(prio);
+
+		last = max(last, deadline);
+	}
+	rcu_read_unlock();
+
+	return last;
+}
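
As a concrete illustration of the loop above, the toy model below reduces
signal_deadline() to a flat array of signalers: the request is positioned at
the latest deadline among its still-active, equal-or-higher priority
signalers, and never earlier than now. The types and names are invented for
the example and are not driver API.

#include <stdint.h>
#include <stdio.h>

/* Invented stand-in for the signaler state consulted by signal_deadline() */
struct toy_signaler {
	uint64_t deadline_ns;	/* signaler's deadline, already in ns */
	int prio;
	int complete;
};

static uint64_t toy_signal_deadline(uint64_t now_ns, int rq_prio,
				    const struct toy_signaler *s, unsigned int n)
{
	uint64_t last = now_ns;
	unsigned int i;

	for (i = 0; i < n; i++) {
		if (s[i].complete)
			continue;	/* finished work imposes no wait */
		if (s[i].prio < rq_prio)
			continue;	/* lower priority signalers are skipped */
		if (s[i].deadline_ns > last)
			last = s[i].deadline_ns;
	}

	return last;	/* earliest point at which we become 'ready' */
}

int main(void)
{
	const struct toy_signaler sig[] = {
		{ .deadline_ns = 2000000, .prio = 0 },			/* 2ms */
		{ .deadline_ns = 5000000, .prio = 0 },			/* 5ms */
		{ .deadline_ns = 9000000, .prio = 0, .complete = 1 },	/* done */
	};

	/* Queues behind the 5ms signaler; the completed 9ms one is ignored */
	printf("ready at %llu ns\n",
	       (unsigned long long)toy_signal_deadline(1000000, 0, sig, 3));
	return 0;
}
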
+
+static int adj_prio(const struct i915_request *rq)
+{
+	int prio = rq_prio(rq);
+
+	/*
+	 * Deprioritize semaphore waiters. We only want to run these if there
+	 * is nothing ready to run first.
+	 *
+	 * Note that by giving a more distant deadline (due to a lower
+	 * priority) we do not prevent them from having a slice of the GPU;
+	 * if there is still contention at that point, we expect the request
+	 * to yield on the semaphore immediately.
+	 *
+	 * When all semaphores are signaled, we will update the request
+	 * to remove the semaphore penalty.
+	 */
+	if (!i915_sw_fence_signaled(&rq->semaphore))
+		prio -= __I915_PRIORITY_KERNEL__;
+
+	return prio;
+}
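
To put a rough number on the penalty, reusing the illustrative constants from
the prio_slice() sketch above (so treat the figures as approximate): a
default-priority request with unsignaled semaphores drops from priority 0 to
about -1024, which stretches its slack from roughly 16ms to (1024 + 1024) << 19
ns, i.e. on the order of a second. It still receives a deadline, and therefore
eventually a slice of the GPU, but only once nothing with an earlier deadline
is runnable.
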
+
+static u64
+earliest_deadline(const struct i915_sched *se, const struct i915_request *rq)
+{
+	/*
+	 * At its heart, the scheduler is simply a topological sort into
+	 * a linear sequence of requests. As we use a single ascending index,
+	 * we can repurpose the sort to achieve different goals, or to disable
+	 * the sort entirely and funnel all requests onto a single list for
+	 * immediate extraction.
+	 */
+	if (i915_sched_has_deadlines(se))
+		return virtual_deadline(signal_deadline(rq), rq_prio(rq));
+	else if (i915_sched_has_priorities(se))
+		return INT_MAX - rq_prio(rq);
+	else
+		return 0;
+}
+
+static bool
+set_earliest_deadline(struct i915_sched *se, struct i915_request *rq, u64 old)
+{
+	u64 dl;
+
+	/* Recompute our deadlines and promote after a priority change */
+	dl = min(earliest_deadline(se, rq), rq_deadline(rq));
+	if (dl >= old)
+		return false;
+
+	return __i915_request_set_deadline(se, rq, dl);
 }
 
 static void ipi_priority(struct i915_request *rq, int prio)
@@ -591,17 +826,16 @@  static void ipi_priority(struct i915_request *rq, int prio)
 	__ipi_add(rq);
 }
 
-static void __i915_request_set_priority(struct i915_request *rq, int prio)
+static bool
+__i915_request_set_priority(struct i915_sched * const se,
+			    struct i915_request *rq,
+			    int prio)
 {
 	struct intel_engine_cs *engine = rq->engine;
-	struct i915_sched *se = intel_engine_get_scheduler(engine);
 	struct list_head *pos = &rq->sched.signalers_list;
-	struct list_head *plist;
+	bool kick = false;
 
-	SCHED_TRACE(&engine->sched, "PI for " RQ_FMT ", prio:%d\n",
-		    RQ_ARG(rq), prio);
-
-	plist = lookup_priolist(se, prio);
+	SCHED_TRACE(se, "PI for " RQ_FMT ", prio:%d\n", RQ_ARG(rq), prio);
 
 	/*
 	 * Recursively bump all dependent priorities to match the new request.
@@ -623,31 +857,37 @@  static void __i915_request_set_priority(struct i915_request *rq, int prio)
 	 */
 	rq->sched.dfs.prev = NULL;
 	do {
-		list_for_each_continue(pos, &rq->sched.signalers_list) {
-			struct i915_dependency *p =
-				list_entry(pos, typeof(*p), signal_link);
-			struct i915_request *s =
-				container_of(p->signaler, typeof(*s), sched);
+		struct i915_request *next;
 
-			if (rq_prio(s) >= prio)
-				continue;
+		if (i915_sched_has_priorities(i915_request_get_scheduler(rq))) {
+			list_for_each_continue(pos, &rq->sched.signalers_list) {
+				struct i915_dependency *p =
+					list_entry(pos, typeof(*p), signal_link);
+				struct i915_request *s =
+					container_of(p->signaler, typeof(*s), sched);
 
-			if (__i915_request_is_complete(s))
-				continue;
+				if (rq_prio(s) >= prio)
+					continue;
 
-			if (s->engine != engine) {
-				ipi_priority(s, prio);
-				continue;
+				if (__i915_request_is_complete(s))
+					continue;
+
+				if (s->engine != engine) {
+					ipi_priority(s, prio);
+					continue;
+				}
+
+				/* Remember our position along this branch */
+				rq = stack_push(s, rq, pos);
+				pos = &rq->sched.signalers_list;
 			}
-
-			/* Remember our position along this branch */
-			rq = stack_push(s, rq, pos);
-			pos = &rq->sched.signalers_list;
 		}
 
 		RQ_TRACE(rq, "set-priority:%d\n", prio);
 		WRITE_ONCE(rq->sched.attr.priority, prio);
 
+		next = stack_pop(rq, &pos);
+
 		/*
 		 * Once the request is ready, it will be placed into the
 		 * priority lists and then onto the HW runlist. Before the
@@ -656,16 +896,15 @@  static void __i915_request_set_priority(struct i915_request *rq, int prio)
 		 * any preemption required, be dealt with upon submission.
 		 * See engine->submit_request()
 		 */
-		if (!i915_request_is_ready(rq))
-			continue;
-
 		GEM_BUG_ON(rq->engine != engine);
-		if (i915_request_in_priority_queue(rq))
-			remove_from_priolist(se, rq, plist, true);
+		if (i915_request_is_ready(rq) &&
+		    set_earliest_deadline(se, rq, rq_deadline(rq)))
+			kick = true;
 
-		/* Defer (tasklet) submission until after all updates. */
-		kick_submission(engine, rq, prio);
-	} while ((rq = stack_pop(rq, &pos)));
+		rq = next;
+	} while (rq);
+
+	return kick;
 }
 
 #define all_signalers_checked(p, rq) \
@@ -718,13 +957,9 @@  void i915_request_set_priority(struct i915_request *rq, int prio)
 	if (__i915_request_is_complete(rq))
 		goto unlock;
 
-	if (!i915_sched_has_priorities(&engine->sched)) {
-		rq->sched.attr.priority = prio;
-		goto unlock;
-	}
-
 	rcu_read_lock();
-	__i915_request_set_priority(rq, prio);
+	if (__i915_request_set_priority(&engine->sched, rq, prio))
+		i915_sched_kick(&engine->sched);
 	rcu_read_unlock();
 	GEM_BUG_ON(rq_prio(rq) != prio);
 
@@ -737,7 +972,7 @@  void __i915_sched_defer_request(struct intel_engine_cs *engine,
 {
 	struct list_head *pos = &rq->sched.waiters_list;
 	struct i915_sched *se = intel_engine_get_scheduler(engine);
-	const int prio = rq_prio(rq);
+	u64 deadline = rq_deadline(rq);
 	struct i915_request *rn;
 	LIST_HEAD(dfs);
 
@@ -746,6 +981,9 @@  void __i915_sched_defer_request(struct intel_engine_cs *engine,
 	lockdep_assert_held(&se->lock);
 	GEM_BUG_ON(!test_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags));
 
+	if (i915_sched_has_deadlines(se))
+		deadline = max(deadline, i915_scheduler_next_virtual_deadline(adj_prio(rq)));
+
 	/*
 	 * When we defer a request, we must maintain its order with respect
 	 * to those that are waiting upon it. So we traverse its chain of
@@ -771,62 +1009,51 @@  void __i915_sched_defer_request(struct intel_engine_cs *engine,
 				   __i915_request_has_started(w) &&
 				   !__i915_request_is_complete(rq));
 
+			/* An unready waiter imposes no deadline */
 			if (!i915_request_in_priority_queue(w))
 				continue;
 
 			/*
-			 * We also need to reorder within the same priority.
+			 * We also need to reorder within the same deadline.
 			 *
 			 * This is unlike priority-inheritance, where if the
 			 * signaler already has a higher priority [earlier
 			 * deadline] than us, we can ignore as it will be
 			 * scheduled first. If a waiter already has the
-			 * same priority, we still have to push it to the end
+			 * same deadline, we still have to push it to the end
 			 * of the list. This unfortunately means we cannot
 			 * use the rq_deadline() itself as a 'visited' bit.
 			 */
-			if (rq_prio(w) < prio)
+			if (rq_deadline(w) > deadline)
 				continue;
 
-			GEM_BUG_ON(rq_prio(w) != prio);
-
 			/* Remember our position along this branch */
 			rq = stack_push(w, rq, pos);
 			pos = &rq->sched.waiters_list;
 		}
 
+		RQ_TRACE(rq, "set-deadline:%llu\n", deadline);
+		WRITE_ONCE(rq->sched.deadline, deadline);
+
 		/* Note list is reversed for waiters wrt signal hierarchy */
-		GEM_BUG_ON(rq->engine != engine);
+		GEM_BUG_ON(i915_request_get_scheduler(rq) != se);
 		remove_from_priolist(se, rq, &dfs, false);
 
 		/* Track our visit, and prevent duplicate processing */
 		clear_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
 	} while ((rq = stack_pop(rq, &pos)));
 
-	pos = lookup_priolist(se, prio);
+	pos = lookup_priolist(se, deadline);
 	list_for_each_entry_safe(rq, rn, &dfs, sched.link) {
 		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
 		list_add_tail(&rq->sched.link, pos);
 	}
 }
 
-static void queue_request(struct i915_sched *se, struct i915_request *rq)
+static bool queue_request(struct i915_sched *se, struct i915_request *rq)
 {
-	GEM_BUG_ON(!list_empty(&rq->sched.link));
-	list_add_tail(&rq->sched.link, lookup_priolist(se, rq_prio(rq)));
 	set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
-}
-
-static bool submit_queue(struct intel_engine_cs *engine,
-			 const struct i915_request *rq)
-{
-	struct intel_engine_execlists *execlists = &engine->execlists;
-
-	if (rq_prio(rq) <= execlists->queue_priority_hint)
-		return false;
-
-	execlists->queue_priority_hint = rq_prio(rq);
-	return true;
+	return set_earliest_deadline(se, rq, I915_DEADLINE_NEVER);
 }
 
 static bool hold_request(const struct i915_request *rq)
@@ -864,8 +1091,8 @@  static bool ancestor_on_hold(const struct i915_sched *se,
 
 void i915_request_enqueue(struct i915_request *rq)
 {
-	struct intel_engine_cs *engine = rq->engine;
-	struct i915_sched *se = intel_engine_get_scheduler(engine);
+	struct i915_sched *se = i915_request_get_scheduler(rq);
+	u64 dl = earliest_deadline(se, rq);
 	unsigned long flags;
 	bool kick = false;
 
@@ -880,11 +1107,11 @@  void i915_request_enqueue(struct i915_request *rq)
 		list_add_tail(&rq->sched.link, &se->hold);
 		i915_request_set_hold(rq);
 	} else {
-		queue_request(se, rq);
-
+		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+		kick = __i915_request_set_deadline(se, rq,
+						   min(dl, rq_deadline(rq)));
+		GEM_BUG_ON(rq_deadline(rq) == I915_DEADLINE_NEVER);
 		GEM_BUG_ON(i915_sched_is_idle(se));
-
-		kick = submit_queue(engine, rq);
 	}
 
 	GEM_BUG_ON(list_empty(&rq->sched.link));
@@ -898,8 +1125,8 @@  __i915_sched_rewind_requests(struct intel_engine_cs *engine)
 {
 	struct i915_sched *se = intel_engine_get_scheduler(engine);
 	struct i915_request *rq, *rn, *active = NULL;
+	u64 deadline = I915_DEADLINE_NEVER;
 	struct list_head *pl;
-	int prio = I915_PRIORITY_INVALID;
 
 	lockdep_assert_held(&se->lock);
 
@@ -911,13 +1138,21 @@  __i915_sched_rewind_requests(struct intel_engine_cs *engine)
 
 		__i915_request_unsubmit(rq);
 
-		GEM_BUG_ON(rq_prio(rq) == I915_PRIORITY_INVALID);
-		if (rq_prio(rq) != prio) {
-			prio = rq_prio(rq);
-			pl = lookup_priolist(se, prio);
+		if (__i915_request_has_started(rq) &&
+		    i915_sched_has_deadlines(se)) {
+			u64 dl =
+				i915_scheduler_next_virtual_deadline(rq_prio(rq));
+			rq->sched.deadline = min(rq_deadline(rq), dl);
+		}
+		GEM_BUG_ON(rq_deadline(rq) == I915_DEADLINE_NEVER);
+
+		if (rq_deadline(rq) != deadline) {
+			deadline = rq_deadline(rq);
+			pl = lookup_priolist(se, deadline);
 		}
 		GEM_BUG_ON(i915_sched_is_idle(se));
 
+		GEM_BUG_ON(i915_request_in_priority_queue(rq));
 		list_move(&rq->sched.link, pl);
 		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
 
@@ -1023,14 +1258,10 @@  void __i915_sched_resume_request(struct intel_engine_cs *engine,
 {
 	struct i915_sched *se = intel_engine_get_scheduler(engine);
 	LIST_HEAD(list);
+	bool submit = false;
 
 	lockdep_assert_held(&se->lock);
 
-	if (rq_prio(rq) > engine->execlists.queue_priority_hint) {
-		engine->execlists.queue_priority_hint = rq_prio(rq);
-		i915_sched_kick(se);
-	}
-
 	if (!i915_request_on_hold(rq))
 		return;
 
@@ -1051,7 +1282,7 @@  void __i915_sched_resume_request(struct intel_engine_cs *engine,
 		i915_request_clear_hold(rq);
 		list_del_init(&rq->sched.link);
 
-		queue_request(se, rq);
+		submit |= queue_request(se, rq);
 
 		/* Also release any children on this engine that are ready */
 		for_each_waiter(p, rq) {
@@ -1081,6 +1312,24 @@  void __i915_sched_resume_request(struct intel_engine_cs *engine,
 
 		rq = list_first_entry_or_null(&list, typeof(*rq), sched.link);
 	} while (rq);
+
+	if (submit)
+		i915_sched_kick(se);
+}
+
+static u64
+update_deadline(const struct i915_request *rq)
+{
+	return earliest_deadline(i915_request_get_scheduler(rq), rq);
+}
+
+void i915_request_update_deadline(struct i915_request *rq)
+{
+	if (!i915_request_in_priority_queue(rq))
+		return;
+
+	/* Recompute our deadlines and promote after a priority change */
+	i915_request_set_deadline(rq, update_deadline(rq));
 }
 
 void i915_sched_resume_request(struct intel_engine_cs *engine,
@@ -1134,10 +1383,12 @@  void i915_sched_node_init(struct i915_sched_node *node)
 void i915_sched_node_reinit(struct i915_sched_node *node)
 {
 	node->attr.priority = I915_PRIORITY_INVALID;
+	node->deadline = I915_DEADLINE_NEVER;
 	node->semaphores = 0;
 	node->flags = 0;
 
 	GEM_BUG_ON(node->ipi_link);
+	node->ipi_deadline = I915_DEADLINE_NEVER;
 	node->ipi_priority = I915_PRIORITY_INVALID;
 
 	GEM_BUG_ON(!list_empty(&node->signalers_list));
@@ -1378,6 +1629,20 @@  print_request_ring(struct drm_printer *m, const struct i915_request *rq)
 	}
 }
 
+static const char *repr_mode(const struct i915_sched *se)
+{
+	if (i915_sched_has_deadlines(se))
+		return "Deadline";
+
+	if (i915_sched_has_priorities(se))
+		return "Priority";
+
+	if (i915_sched_is_active(se))
+		return "FIFO";
+
+	return "None";
+}
+
 void i915_sched_show(struct drm_printer *m,
 		     struct i915_sched *se,
 		     void (*show_request)(struct drm_printer *m,
@@ -1419,6 +1684,9 @@  void i915_sched_show(struct drm_printer *m,
 		}
 	}
 
+	drm_printf(m, "Scheduler: %s (%s)\n", repr_mode(se),
+		   enableddisabled(test_bit(I915_SCHED_ENABLE_BIT,
+					    &se->flags)));
 	drm_printf(m, "Tasklet queued? %s (%s)\n",
 		   yesno(test_bit(TASKLET_STATE_SCHED, &se->tasklet.state)),
 		   enableddisabled(!atomic_read(&se->tasklet.count)));
diff --git a/drivers/gpu/drm/i915/i915_scheduler.h b/drivers/gpu/drm/i915/i915_scheduler.h
index 872d221f6ba7..14714e56ad80 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.h
+++ b/drivers/gpu/drm/i915/i915_scheduler.h
@@ -47,7 +47,14 @@  void i915_sched_select_mode(struct i915_sched *se, enum i915_sched_mode mode);
 void i915_sched_park(struct i915_sched *se);
 void i915_sched_fini(struct i915_sched *se);
 
+void i915_sched_select_mode(struct i915_sched *se, enum i915_sched_mode mode);
+
 void i915_request_set_priority(struct i915_request *request, int prio);
+void i915_request_set_deadline(struct i915_request *request, u64 deadline);
+
+void i915_request_update_deadline(struct i915_request *request);
+
+u64 i915_scheduler_next_virtual_deadline(int priority);
 
 void i915_request_enqueue(struct i915_request *request);
 
@@ -85,11 +92,14 @@  static inline void i915_sched_disable(struct i915_sched *se)
 	clear_bit(I915_SCHED_ENABLE_BIT, &se->flags);
 }
 
-void __i915_priolist_free(struct i915_priolist *p);
-static inline void i915_priolist_free(struct i915_priolist *p)
+static inline u64 i915_sched_to_ticks(ktime_t kt)
 {
-	if (p->priority != I915_PRIORITY_NORMAL)
-		__i915_priolist_free(p);
+	return ktime_to_ns(kt) >> I915_SCHED_DEADLINE_SHIFT;
+}
+
+static inline u64 i915_sched_to_ns(u64 deadline)
+{
+	return deadline << I915_SCHED_DEADLINE_SHIFT;
 }
 
 static inline bool i915_sched_is_idle(const struct i915_sched *se)
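
Since these two helpers fix the resolution of the deadline field, a quick
standalone check of the bucket size implied by the shift may be useful; the
shift value mirrors I915_SCHED_DEADLINE_SHIFT from i915_scheduler_types.h
below, and the arithmetic matches i915_sched_to_ticks()/i915_sched_to_ns().

#include <stdint.h>
#include <stdio.h>

#define DEADLINE_SHIFT 19	/* mirrors I915_SCHED_DEADLINE_SHIFT */

int main(void)
{
	uint64_t ns = 16000000;				/* a ~16ms slack */
	uint64_t ticks = ns >> DEADLINE_SHIFT;		/* i915_sched_to_ticks() */
	uint64_t back = ticks << DEADLINE_SHIFT;	/* i915_sched_to_ns() */

	/* One tick is 2^19 ns = 524288 ns, so deadlines quantise to ~0.5ms */
	printf("bucket %u ns: 16ms -> %llu ticks -> %llu ns\n",
	       1u << DEADLINE_SHIFT, (unsigned long long)ticks,
	       (unsigned long long)back);
	return 0;
}

Deadlines that land in the same ~0.5ms bucket compare equal, presumably
trading sub-bucket precision for fewer distinct keys in the priority list.
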
diff --git a/drivers/gpu/drm/i915/i915_scheduler_types.h b/drivers/gpu/drm/i915/i915_scheduler_types.h
index bc668f375097..89cccda35ecd 100644
--- a/drivers/gpu/drm/i915/i915_scheduler_types.h
+++ b/drivers/gpu/drm/i915/i915_scheduler_types.h
@@ -22,6 +22,7 @@  enum {
 	I915_SCHED_ENABLE_BIT = 0,
 	I915_SCHED_ACTIVE_BIT, /* can reorder the request flow */
 	I915_SCHED_PRIORITY_BIT, /* priority sorting of queue */
+	I915_SCHED_DEADLINE_BIT, /* sorting by virtual deadline */
 	I915_SCHED_TIMESLICE_BIT, /* multitasking for long workloads */
 	I915_SCHED_PREEMPT_RESET_BIT, /* reset if preemption times out */
 	I915_SCHED_BUSYWAIT_BIT, /* preempt-to-busy */
@@ -51,6 +52,7 @@  enum i915_sched_mode {
 	I915_SCHED_MODE_NONE = -1, /* inactive, no bubble prevention */
 	I915_SCHED_MODE_FIFO, /* pass-through of ready, first in first out */
 	I915_SCHED_MODE_PRIORITY, /* reorder strictly by priority */
+	I915_SCHED_MODE_DEADLINE, /* reorder to meet soft deadlines; fair */
 };
 
 /**
@@ -207,8 +209,31 @@  struct i915_sched_node {
 #define I915_SCHED_HAS_EXTERNAL_CHAIN	BIT(0)
 	unsigned long semaphores;
 
+	/**
+	 * @deadline: [virtual] deadline
+	 *
+	 * When the request is ready for execution, it is given a quota
+	 * (the engine's timeslice) and a virtual deadline. The virtual
+	 * deadline is derived from the point at which it becomes ready:
+	 *     max(now, latest signaler deadline) + prio_slice(priority)
+	 *
+	 * Requests are then executed in order of their deadlines, earliest
+	 * first. A request with an earlier deadline than those currently
+	 * executing on the engine will preempt the active requests.
+	 *
+	 * By treating it as a virtual deadline, we use it as a hint for
+	 * when it is appropriate for a request to start with respect to
+	 * all other requests in the system. It is not a hard deadline, as
+	 * requests are allowed to miss it, and we do not account for the
+	 * request's runtime.
+	 */
+	u64 deadline;
+#define I915_SCHED_DEADLINE_SHIFT 19 /* i.e. roughly 500us buckets */
+#define I915_DEADLINE_NEVER U64_MAX
+
 	/* handle being scheduled for PI from outside of our active.lock */
 	struct i915_request *ipi_link;
+	u64 ipi_deadline;
 	int ipi_priority;
 };
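
Tying the pieces together with the illustrative numbers used earlier: a
default-priority request that becomes ready now gets a virtual deadline of
roughly now + 16ms, stored in ~0.5ms ticks, while a maximum user priority
request ready at the same moment lands around now + 32us and therefore sorts,
and if necessary preempts, ahead of it.
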
 
@@ -236,14 +261,28 @@  struct i915_dependency {
 
 static inline bool i915_sched_is_active(const struct i915_sched *se)
 {
+	if (CONFIG_DRM_I915_SCHED < 0)
+		return false;
+
 	return test_bit(I915_SCHED_ACTIVE_BIT, &se->flags);
 }
 
 static inline bool i915_sched_has_priorities(const struct i915_sched *se)
 {
+	if (CONFIG_DRM_I915_SCHED < 1)
+		return false;
+
 	return test_bit(I915_SCHED_PRIORITY_BIT, &se->flags);
 }
 
+static inline bool i915_sched_has_deadlines(const struct i915_sched *se)
+{
+	if (CONFIG_DRM_I915_SCHED < 2)
+		return false;
+
+	return test_bit(I915_SCHED_DEADLINE_BIT, &se->flags);
+}
+
 static inline bool i915_sched_has_timeslices(const struct i915_sched *se)
 {
 	if (!IS_ACTIVE(CONFIG_DRM_I915_TIMESLICE_DURATION))
diff --git a/drivers/gpu/drm/i915/selftests/i915_request.c b/drivers/gpu/drm/i915/selftests/i915_request.c
index 8035ea7565ed..c5d7427bd429 100644
--- a/drivers/gpu/drm/i915/selftests/i915_request.c
+++ b/drivers/gpu/drm/i915/selftests/i915_request.c
@@ -2129,6 +2129,7 @@  static int measure_preemption(struct intel_context *ce)
 
 		intel_ring_advance(rq, cs);
 		rq->sched.attr.priority = I915_PRIORITY_BARRIER;
+		rq->sched.deadline = 0;
 
 		elapsed[i - 1] = ENGINE_READ_FW(ce->engine, RING_TIMESTAMP);
 		i915_request_add(rq);
diff --git a/drivers/gpu/drm/i915/selftests/i915_scheduler.c b/drivers/gpu/drm/i915/selftests/i915_scheduler.c
index 2bb2d3d07d06..59df7f834dad 100644
--- a/drivers/gpu/drm/i915/selftests/i915_scheduler.c
+++ b/drivers/gpu/drm/i915/selftests/i915_scheduler.c
@@ -12,6 +12,40 @@ 
 #include "selftests/igt_spinner.h"
 #include "selftests/i915_random.h"
 
+static int mock_scheduler_slices(void *dummy)
+{
+	u64 min, max, normal, kernel;
+
+	min = prio_slice(I915_PRIORITY_MIN);
+	pr_info("%8s slice: %lluus\n", "min", min >> 10);
+
+	normal = prio_slice(0);
+	pr_info("%8s slice: %lluus\n", "normal", normal >> 10);
+
+	max = prio_slice(I915_PRIORITY_MAX);
+	pr_info("%8s slice: %lluus\n", "max", max >> 10);
+
+	kernel = prio_slice(I915_PRIORITY_BARRIER);
+	pr_info("%8s slice: %lluus\n", "kernel", kernel >> 10);
+
+	if (kernel != 0) {
+		pr_err("kernel prio slice should be 0\n");
+		return -EINVAL;
+	}
+
+	if (max >= normal) {
+		pr_err("maximum prio slice should be shorter than normal\n");
+		return -EINVAL;
+	}
+
+	if (min <= normal) {
+		pr_err("minimum prio slice should be longer than normal\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 static int mock_skiplist_levels(void *dummy)
 {
 	struct i915_priolist_root root = {};
@@ -54,6 +88,7 @@  static int mock_skiplist_levels(void *dummy)
 int i915_scheduler_mock_selftests(void)
 {
 	static const struct i915_subtest tests[] = {
+		SUBTEST(mock_scheduler_slices),
 		SUBTEST(mock_skiplist_levels),
 	};
 
@@ -556,6 +591,53 @@  static int igt_priority_chains(void *arg)
 	return igt_schedule_chains(arg, igt_priority);
 }
 
+static bool igt_deadline(struct i915_request *rq,
+			 unsigned long v, unsigned long e)
+{
+	i915_request_set_deadline(rq, 0);
+	GEM_BUG_ON(rq_deadline(rq) != 0);
+	return true;
+}
+
+static int igt_deadline_chains(void *arg)
+{
+	return igt_schedule_chains(arg, igt_deadline);
+}
+
+static bool igt_defer(struct i915_request *rq, unsigned long v, unsigned long e)
+{
+	struct intel_engine_cs *engine = rq->engine;
+	struct i915_sched *se = intel_engine_get_scheduler(engine);
+
+	/* XXX No generic means to unwind incomplete requests yet */
+	if (!i915_request_in_priority_queue(rq))
+		return false;
+
+	if (!intel_engine_has_preemption(engine))
+		return false;
+
+	spin_lock_irq(&se->lock);
+
+	/* Push all the requests to the same deadline */
+	__i915_request_set_deadline(se, rq, 0);
+	GEM_BUG_ON(rq_deadline(rq) != 0);
+
+	/* Then the very first request must be the one everyone depends on */
+	rq = list_first_entry(lookup_priolist(se, 0), typeof(*rq), sched.link);
+	GEM_BUG_ON(rq->engine != engine);
+
+	/* Deferring the first request will then have to defer all requests */
+	__i915_sched_defer_request(engine, rq);
+
+	spin_unlock_irq(&se->lock);
+	return true;
+}
+
+static int igt_deadline_defer(void *arg)
+{
+	return igt_schedule_chains(arg, igt_defer);
+}
+
 static struct i915_request *
 __write_timestamp(struct intel_engine_cs *engine,
 		  struct drm_i915_gem_object *obj,
@@ -767,13 +849,22 @@  static int igt_priority_cycle(void *arg)
 	return __igt_schedule_cycle(arg, igt_priority);
 }
 
+static int igt_deadline_cycle(void *arg)
+{
+	return __igt_schedule_cycle(arg, igt_deadline);
+}
+
 int i915_scheduler_live_selftests(struct drm_i915_private *i915)
 {
 	static const struct i915_subtest tests[] = {
+		SUBTEST(igt_deadline_chains),
 		SUBTEST(igt_priority_chains),
 
 		SUBTEST(igt_schedule_cycle),
+		SUBTEST(igt_deadline_cycle),
 		SUBTEST(igt_priority_cycle),
+
+		SUBTEST(igt_deadline_defer),
 	};
 
 	return i915_subtests(tests, i915);
@@ -909,9 +1000,54 @@  static int sparse_priority(void *arg)
 	return sparse(arg, set_priority);
 }
 
+static u64 __set_deadline(struct i915_request *rq, u64 deadline)
+{
+	u64 dt;
+
+	preempt_disable();
+	dt = ktime_get_raw_fast_ns();
+	i915_request_set_deadline(rq, deadline);
+	dt = ktime_get_raw_fast_ns() - dt;
+	preempt_enable();
+
+	return dt;
+}
+
+static bool set_deadline(struct i915_request *rq,
+			 unsigned long v, unsigned long e)
+{
+	report("set-deadline", v, e, __set_deadline(rq, 0));
+	return true;
+}
+
+static int single_deadline(void *arg)
+{
+	return single(arg, set_deadline);
+}
+
+static int wide_deadline(void *arg)
+{
+	return wide(arg, set_deadline);
+}
+
+static int inv_deadline(void *arg)
+{
+	return inv(arg, set_deadline);
+}
+
+static int sparse_deadline(void *arg)
+{
+	return sparse(arg, set_deadline);
+}
+
 int i915_scheduler_perf_selftests(struct drm_i915_private *i915)
 {
 	static const struct i915_subtest tests[] = {
+		SUBTEST(single_deadline),
+		SUBTEST(wide_deadline),
+		SUBTEST(inv_deadline),
+		SUBTEST(sparse_deadline),
+
 		SUBTEST(single_priority),
 		SUBTEST(wide_priority),
 		SUBTEST(inv_priority),
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index cda0f391d965..4efc5801173c 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -525,6 +525,7 @@  typedef struct drm_i915_irq_wait {
 #define   I915_SCHEDULER_CAP_SEMAPHORES	(1ul << 3)
 #define   I915_SCHEDULER_CAP_ENGINE_BUSY_STATS	(1ul << 4)
 #define   I915_SCHEDULER_CAP_TIMESLICING	(1ul << 5)
+#define   I915_SCHEDULER_CAP_FAIR	(1ul << 6)
 
 #define I915_PARAM_HUC_STATUS		 42