From patchwork Tue Mar  1 13:25:37 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
X-Patchwork-Id: 8465171
Return-Path: <intel-gfx-bounces@lists.freedesktop.org>
X-Original-To: patchwork-intel-gfx@patchwork.kernel.org
Delivered-To: patchwork-parsemail@patchwork1.web.kernel.org
Received: from mail.kernel.org (mail.kernel.org [198.145.29.136])
	by patchwork1.web.kernel.org (Postfix) with ESMTP id 3AA039F38C
	for <patchwork-intel-gfx@patchwork.kernel.org>;
	Tue,  1 Mar 2016 13:25:47 +0000 (UTC)
Received: from mail.kernel.org (localhost [127.0.0.1])
	by mail.kernel.org (Postfix) with ESMTP id 2C2BF2027D
	for <patchwork-intel-gfx@patchwork.kernel.org>;
	Tue,  1 Mar 2016 13:25:46 +0000 (UTC)
Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177])
	by mail.kernel.org (Postfix) with ESMTP id E8D2820123
	for <patchwork-intel-gfx@patchwork.kernel.org>;
	Tue,  1 Mar 2016 13:25:44 +0000 (UTC)
Received: from gabe.freedesktop.org (localhost [127.0.0.1])
	by gabe.freedesktop.org (Postfix) with ESMTP id 54E006E5FE;
	Tue,  1 Mar 2016 13:25:42 +0000 (UTC)
X-Original-To: Intel-gfx@lists.freedesktop.org
Delivered-To: Intel-gfx@lists.freedesktop.org
Received: from mga14.intel.com (mga14.intel.com [192.55.52.115])
	by gabe.freedesktop.org (Postfix) with ESMTP id DA3596E5FF
	for <Intel-gfx@lists.freedesktop.org>;
	Tue,  1 Mar 2016 13:25:39 +0000 (UTC)
Received: from fmsmga003.fm.intel.com ([10.253.24.29])
	by fmsmga103.fm.intel.com with ESMTP; 01 Mar 2016 05:25:39 -0800
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.22,523,1449561600"; d="scan'208";a="661827510"
Received: from tursulin-linux.isw.intel.com ([10.102.226.196])
	by FMSMGA003.fm.intel.com with ESMTP; 01 Mar 2016 05:25:38 -0800
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Intel-gfx@lists.freedesktop.org
Date: Tue,  1 Mar 2016 13:25:37 +0000
Message-Id: <1456838737-6669-1-git-send-email-tvrtko.ursulin@linux.intel.com>
X-Mailer: git-send-email 1.9.1
Subject: [Intel-gfx] [PATCH] drm/i915: Move CSB MMIO reads out of the
	execlists lock
X-BeenThere: intel-gfx@lists.freedesktop.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: Intel graphics driver community testing & development
	<intel-gfx.lists.freedesktop.org>
List-Unsubscribe: <https://lists.freedesktop.org/mailman/options/intel-gfx>,
	<mailto:intel-gfx-request@lists.freedesktop.org?subject=unsubscribe>
List-Archive: <https://lists.freedesktop.org/archives/intel-gfx>
List-Post: <mailto:intel-gfx@lists.freedesktop.org>
List-Help: <mailto:intel-gfx-request@lists.freedesktop.org?subject=help>
List-Subscribe: <https://lists.freedesktop.org/mailman/listinfo/intel-gfx>,
	<mailto:intel-gfx-request@lists.freedesktop.org?subject=subscribe>
MIME-Version: 1.0
Errors-To: intel-gfx-bounces@lists.freedesktop.org
Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>
X-Spam-Status: No, score=-3.2 required=5.0 tests=BAYES_00,HK_RANDOM_FROM,
	RCVD_IN_DNSWL_MED, RP_MATCHES_RCVD,
	UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

By reading the CSB (slow MMIO accesses) into a temporary local
buffer we can decrease the duration of holding the execlist
lock.

Main advantage is that during heavy batch buffer submission we
reduce the execlist lock contention, which should decrease the
latency and CPU usage between the submitting userspace process
and interrupt handling.

Downside is that we need to grab and relase the forcewake twice,
but as the below numbers will show this is completely hidden
by the primary gains.

Testing with "gem_latency -n 100" (submit batch buffers with a
hundred nops each) shows more than doubling of the throughput
and more than halving of the dispatch latency, overall latency
and CPU time spend in the submitting process.

Submitting empty batches ("gem_latency -n 0") does not seem
significantly affected by this change with throughput and CPU
time improving by half a percent, and overall latency worsening
by the same amount.

Above tests were done in a hundred runs on a big core Broadwell.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/intel_lrc.c | 73 ++++++++++++++++++++--------------------
 1 file changed, 36 insertions(+), 37 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 27c9ee3f7372..9000d6ac8d65 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -416,15 +416,23 @@ static void execlists_update_context(struct drm_i915_gem_request *rq)
 static void execlists_submit_requests(struct drm_i915_gem_request *rq0,
 				      struct drm_i915_gem_request *rq1)
 {
+	struct drm_i915_private *dev_priv = rq0->ring->dev->dev_private;
+
 	execlists_update_context(rq0);
 
 	if (rq1)
 		execlists_update_context(rq1);
 
+	spin_lock(&dev_priv->uncore.lock);
+	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
+
 	execlists_elsp_write(rq0, rq1);
+
+	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
+	spin_unlock(&dev_priv->uncore.lock);
 }
 
-static void execlists_context_unqueue__locked(struct intel_engine_cs *ring)
+static void execlists_context_unqueue(struct intel_engine_cs *ring)
 {
 	struct drm_i915_gem_request *req0 = NULL, *req1 = NULL;
 	struct drm_i915_gem_request *cursor, *tmp;
@@ -478,19 +486,6 @@ static void execlists_context_unqueue__locked(struct intel_engine_cs *ring)
 	execlists_submit_requests(req0, req1);
 }
 
-static void execlists_context_unqueue(struct intel_engine_cs *ring)
-{
-	struct drm_i915_private *dev_priv = ring->dev->dev_private;
-
-	spin_lock(&dev_priv->uncore.lock);
-	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
-
-	execlists_context_unqueue__locked(ring);
-
-	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
-	spin_unlock(&dev_priv->uncore.lock);
-}
-
 static unsigned int
 execlists_check_remove_request(struct intel_engine_cs *ring, u32 request_id)
 {
@@ -551,12 +546,10 @@ void intel_lrc_irq_handler(struct intel_engine_cs *ring)
 	struct drm_i915_private *dev_priv = ring->dev->dev_private;
 	u32 status_pointer;
 	unsigned int read_pointer, write_pointer;
-	u32 status = 0;
-	u32 status_id;
+	u32 csb[GEN8_CSB_ENTRIES][2];
+	unsigned int csb_read = 0, i;
 	unsigned int submit_contexts = 0;
 
-	spin_lock(&ring->execlist_lock);
-
 	spin_lock(&dev_priv->uncore.lock);
 	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
 
@@ -568,39 +561,45 @@ void intel_lrc_irq_handler(struct intel_engine_cs *ring)
 		write_pointer += GEN8_CSB_ENTRIES;
 
 	while (read_pointer < write_pointer) {
-		status = get_context_status(ring, ++read_pointer, &status_id);
+		csb[csb_read][0] = get_context_status(ring, ++read_pointer,
+						      &csb[csb_read][1]);
+		csb_read++;
+	}
 
-		if (unlikely(status & GEN8_CTX_STATUS_PREEMPTED)) {
-			if (status & GEN8_CTX_STATUS_LITE_RESTORE) {
-				if (execlists_check_remove_request(ring, status_id))
+	ring->next_context_status_buffer = write_pointer % GEN8_CSB_ENTRIES;
+
+	/* Update the read pointer to the old write pointer. Manual ringbuffer
+	 * management ftw </sarcasm> */
+	I915_WRITE_FW(RING_CONTEXT_STATUS_PTR(ring),
+		      _MASKED_FIELD(GEN8_CSB_READ_PTR_MASK,
+				    ring->next_context_status_buffer << 8));
+
+	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
+	spin_unlock(&dev_priv->uncore.lock);
+
+	spin_lock(&ring->execlist_lock);
+
+	for (i = 0; i < csb_read; i++) {
+		if (unlikely(csb[i][0] & GEN8_CTX_STATUS_PREEMPTED)) {
+			if (csb[i][0] & GEN8_CTX_STATUS_LITE_RESTORE) {
+				if (execlists_check_remove_request(ring, csb[i][1]))
 					WARN(1, "Lite Restored request removed from queue\n");
 			} else
 				WARN(1, "Preemption without Lite Restore\n");
 		}
 
-		if (status & (GEN8_CTX_STATUS_ACTIVE_IDLE |
+		if (csb[i][0] & (GEN8_CTX_STATUS_ACTIVE_IDLE |
 		    GEN8_CTX_STATUS_ELEMENT_SWITCH))
 			submit_contexts +=
-				execlists_check_remove_request(ring, status_id);
+				execlists_check_remove_request(ring, csb[i][1]);
 	}
 
 	if (submit_contexts) {
 		if (!ring->disable_lite_restore_wa ||
-		    (status & GEN8_CTX_STATUS_ACTIVE_IDLE))
-			execlists_context_unqueue__locked(ring);
+		    (csb[i][0] & GEN8_CTX_STATUS_ACTIVE_IDLE))
+			execlists_context_unqueue(ring);
 	}
 
-	ring->next_context_status_buffer = write_pointer % GEN8_CSB_ENTRIES;
-
-	/* Update the read pointer to the old write pointer. Manual ringbuffer
-	 * management ftw </sarcasm> */
-	I915_WRITE_FW(RING_CONTEXT_STATUS_PTR(ring),
-		      _MASKED_FIELD(GEN8_CSB_READ_PTR_MASK,
-				    ring->next_context_status_buffer << 8));
-
-	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
-	spin_unlock(&dev_priv->uncore.lock);
-
 	spin_unlock(&ring->execlist_lock);
 
 	if (unlikely(submit_contexts > 2))