diff mbox series

[1/2] drm/i915/active: Fix misuse of non-idle barriers as fence trackers

Message ID 20230213232128.59217-2-janusz.krzysztofik@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series Resolve barrier tasks list related issues | expand

Commit Message

Janusz Krzysztofik Feb. 13, 2023, 11:21 p.m. UTC
Users reported oopses on list corruptions when using i915 perf with a
number of concurrently running graphics applications.  Root cause analysis
pointed out to an issue in barrier processing code -- a race among perf
open / close replacing active barriers with perf requests on kernel
contexts and concurrent barrier preallocate / acquire operations performed
during user context first pin / last unpin.

When adding a request to a composite tracker, we try to use an existing
fence tracker already registered with that composite tracker.  The tracker
we obtain can already track another fence, can be an idle barrier, or an
active barrier.  The first two cases seem easy to handle and we seem to
do that correctly.  In the last case, we attempt to replace the active
barrier with our request.  However, when the tracker occurs a barrier
and we try to delete that barrier from a list of barrier tasks it belongs
to, we ignore return value from that operation, which informs us whether
the deletion succeeded or not, and we reuse the barrier as if it was
idle.

On the other side, barriers are now deleted from a barrier tasks list by
temporarily removing the list content, traversing that content with skip
over the node to be deleted, then adding the modified content back to the
list.  Since that complex operation is not serialized with other
concurrent uses of the list, including similar barrier deletions,
functions that depend on the list being either empty or not empty can take
wrong decisions.

A failed barrier deletion can be a side effect of the way we have it
implemented -- another thread can have temporarily emptied the list before
we manage to do that.  If we ignore such failure, that other thread can
then add our fence tracker back to the barrier tasks list.  Since the same
structure field of the tracker is used as a list node with both barrier
tasks lists and fence callback lists, list corruption occurs.  However,
list corruptions were still observed when running the user workload on top
of an experimental patch that serialized all operations on barrier tasks
lists with a spinlock.  Then, other race scenarios leading to list
corruptions likely exist.

Based on those observations, respecting the return code from barrier
deletion seems required for effectively fixing the issue.  However, we
need to handle those few above mentioned cases carefully.

Respect results of barrier deletion attempts -- mark the barrier as idle
only after successfully deleted from the list.  Then, before proceeding
with setting our fence as the one currently tracked, make sure that the
tracker we've got is not a non-idle barrier.  If that check fails, don't
use that tracker but go back and try to acquire a new, usable one.

Extensive testing shows that this patch effectively fixes barrier related
list corruptions.  However, other potentially related issues have been
observed, reported by a new subtest we have developed for this case,
with this fix applied.  Then, we may still want to get back to this soon
and refactor our intentionally racy but potentially fragile way of barrier
tasks lists handling.

Fixes: d8af05ff38ae ("drm/i915: Allow sharing the idle-barrier from other kernel requests")
References: https://gitlab.freedesktop.org/drm/intel/-/issues/6333
Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
Cc: stable@vger.kernel.org # v5.4
---
 drivers/gpu/drm/i915/i915_active.c | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
index 7412abf166a8c..f9282b8c87c1c 100644
--- a/drivers/gpu/drm/i915/i915_active.c
+++ b/drivers/gpu/drm/i915/i915_active.c
@@ -422,12 +422,12 @@  replace_barrier(struct i915_active *ref, struct i915_active_fence *active)
 	 * we can use it to substitute for the pending idle-barrer
 	 * request that we want to emit on the kernel_context.
 	 */
-	__active_del_barrier(ref, node_from_active(active));
-	return true;
+	return __active_del_barrier(ref, node_from_active(active));
 }
 
 int i915_active_add_request(struct i915_active *ref, struct i915_request *rq)
 {
+	u64 idx = i915_request_timeline(rq)->fence_context;
 	struct dma_fence *fence = &rq->fence;
 	struct i915_active_fence *active;
 	int err;
@@ -437,16 +437,19 @@  int i915_active_add_request(struct i915_active *ref, struct i915_request *rq)
 	if (err)
 		return err;
 
-	active = active_instance(ref, i915_request_timeline(rq)->fence_context);
-	if (!active) {
-		err = -ENOMEM;
-		goto out;
-	}
+	do {
+		active = active_instance(ref, idx);
+		if (!active) {
+			err = -ENOMEM;
+			goto out;
+		}
+
+		if (replace_barrier(ref, active)) {
+			RCU_INIT_POINTER(active->fence, NULL);
+			atomic_dec(&ref->count);
+		}
+	} while (is_barrier(active));
 
-	if (replace_barrier(ref, active)) {
-		RCU_INIT_POINTER(active->fence, NULL);
-		atomic_dec(&ref->count);
-	}
 	if (!__i915_active_fence_set(active, fence))
 		__i915_active_acquire(ref);