From patchwork Fri Oct 2 12:25:06 2020
X-Patchwork-Submitter: Boris Brezillon
X-Patchwork-Id: 11813331
From: Boris Brezillon
To: Rob Herring, Tomeu Vizoso, Alyssa Rosenzweig, Steven Price, Robin Murphy
Cc: Boris Brezillon, stable@vger.kernel.org, dri-devel@lists.freedesktop.org
Subject: [PATCH v3] drm/panfrost: Fix job timeout handling
Date: Fri, 2 Oct 2020 14:25:06 +0200
Message-Id: <20201002122506.1374183-1-boris.brezillon@collabora.com>

If two or more jobs end up timing out concurrently, only one of them
(the one attached to the scheduler acquiring the lock) is fully handled.
The other one remains in a dangling state where it's no longer part of
the scheduling queue, but still blocks something in the scheduler,
leading to repetitive timeouts when new jobs are queued.

Let's make sure all bad jobs are properly handled by the thread that
acquires the reset lock.
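For readers skimming the diff below, here is a condensed outline of the timeout
path this patch puts in place: each handler stops only its own queue under the
new per-queue lock, and the single thread that wins reset_lock stops and flushes
every queue before resetting the GPU. This is an illustrative sketch, not the
patch body: panfrost_job_get_slot() and panfrost_device_reset() are assumed from
the existing driver, and the fence-signaling and error-reporting steps of the
real handler are elided.

/*
 * Condensed outline of the new timeout flow (illustrative only).
 * panfrost_job_get_slot() and panfrost_device_reset() are existing
 * driver helpers; fence signaling on in-flight jobs is elided, and
 * the real patch does an extra flush to also wait for timeout
 * handlers that were already running on still-active queues.
 */
static void panfrost_job_timedout_outline(struct drm_sched_job *sched_job)
{
	struct panfrost_job *job = to_panfrost_job(sched_job);
	struct panfrost_device *pfdev = job->pfdev;
	int js = panfrost_job_get_slot(job);
	int i;

	/*
	 * Stop our own queue and mark it stopped. If another timeout
	 * handler already stopped it, that handler owns the recovery
	 * and there is nothing left to do here.
	 */
	if (!panfrost_scheduler_stop(&pfdev->js->queue[js], sched_job))
		return;

	/* Only one thread gets to reset the GPU. */
	if (!mutex_trylock(&pfdev->reset_lock))
		return;

	/*
	 * The reset owner stops every queue and cancels any timeout
	 * work still pending, so no bad job is left dangling.
	 */
	for (i = 0; i < NUM_JOB_SLOTS; i++) {
		struct drm_gpu_scheduler *sched = &pfdev->js->queue[i].sched;

		if (panfrost_scheduler_stop(&pfdev->js->queue[i], NULL))
			cancel_delayed_work_sync(&sched->work_tdr);
		pfdev->js->queue[i].stopped = false;
	}

	panfrost_device_reset(pfdev);

	for (i = 0; i < NUM_JOB_SLOTS; i++)
		drm_sched_resubmit_jobs(&pfdev->js->queue[i].sched);

	/* Release the reset lock before restarting the schedulers. */
	mutex_unlock(&pfdev->reset_lock);

	for (i = 0; i < NUM_JOB_SLOTS; i++)
		drm_sched_start(&pfdev->js->queue[i].sched, true);
}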
v3:
- Add Steven's R-b
- Don't take the sched_lock when stopping the schedulers

v2:
- Fix the subject prefix
- Stop the scheduler before returning from panfrost_job_timedout()
- Call cancel_delayed_work_sync() after drm_sched_stop() to make sure no
  timeout handlers are in flight when we reset the GPU (Steven Price)
- Make sure we release the reset lock before restarting the schedulers
  (Steven Price)

Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver")
Cc: <stable@vger.kernel.org>
Signed-off-by: Boris Brezillon
Reviewed-by: Steven Price
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 62 +++++++++++++++++++++----
 1 file changed, 53 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 30e7b7196dab..d0469e944143 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -25,7 +25,8 @@
 
 struct panfrost_queue_state {
 	struct drm_gpu_scheduler sched;
-
+	bool stopped;
+	struct mutex lock;
 	u64 fence_context;
 	u64 emit_seqno;
 };
@@ -369,6 +370,24 @@ void panfrost_job_enable_interrupts(struct panfrost_device *pfdev)
 	job_write(pfdev, JOB_INT_MASK, irq_mask);
 }
 
+static bool panfrost_scheduler_stop(struct panfrost_queue_state *queue,
+				    struct drm_sched_job *bad)
+{
+	bool stopped = false;
+
+	mutex_lock(&queue->lock);
+	if (!queue->stopped) {
+		drm_sched_stop(&queue->sched, bad);
+		if (bad)
+			drm_sched_increase_karma(bad);
+		queue->stopped = true;
+		stopped = true;
+	}
+	mutex_unlock(&queue->lock);
+
+	return stopped;
+}
+
 static void panfrost_job_timedout(struct drm_sched_job *sched_job)
 {
 	struct panfrost_job *job = to_panfrost_job(sched_job);
@@ -392,19 +411,39 @@ static void panfrost_job_timedout(struct drm_sched_job *sched_job)
 		job_read(pfdev, JS_TAIL_LO(js)),
 		sched_job);
 
+	/* Scheduler is already stopped, nothing to do. */
+	if (!panfrost_scheduler_stop(&pfdev->js->queue[js], sched_job))
+		return;
+
 	if (!mutex_trylock(&pfdev->reset_lock))
 		return;
 
 	for (i = 0; i < NUM_JOB_SLOTS; i++) {
 		struct drm_gpu_scheduler *sched = &pfdev->js->queue[i].sched;
 
-		drm_sched_stop(sched, sched_job);
-		if (js != i)
-			/* Ensure any timeouts on other slots have finished */
+		/*
+		 * If the queue is still active, make sure we wait for any
+		 * pending timeouts.
+		 */
+		if (!pfdev->js->queue[i].stopped)
 			cancel_delayed_work_sync(&sched->work_tdr);
-	}
 
-	drm_sched_increase_karma(sched_job);
+		/*
+		 * If the scheduler was not already stopped, there's a tiny
+		 * chance a timeout has expired just before we stopped it, and
+		 * drm_sched_stop() does not flush pending works. Let's flush
+		 * them now so the timeout handler doesn't get called in the
+		 * middle of a reset.
+		 */
+		if (panfrost_scheduler_stop(&pfdev->js->queue[i], NULL))
+			cancel_delayed_work_sync(&sched->work_tdr);
+
+		/*
+		 * Now that we cancelled the pending timeouts, we can safely
+		 * reset the stopped state.
+		 */
+		pfdev->js->queue[i].stopped = false;
+	}
 
 	spin_lock_irqsave(&pfdev->js->job_lock, flags);
 	for (i = 0; i < NUM_JOB_SLOTS; i++) {
@@ -421,11 +460,11 @@ static void panfrost_job_timedout(struct drm_sched_job *sched_job)
 	for (i = 0; i < NUM_JOB_SLOTS; i++)
 		drm_sched_resubmit_jobs(&pfdev->js->queue[i].sched);
 
+	mutex_unlock(&pfdev->reset_lock);
+
 	/* restart scheduler after GPU is usable again */
 	for (i = 0; i < NUM_JOB_SLOTS; i++)
 		drm_sched_start(&pfdev->js->queue[i].sched, true);
-
-	mutex_unlock(&pfdev->reset_lock);
 }
 
 static const struct drm_sched_backend_ops panfrost_sched_ops = {
@@ -558,6 +597,7 @@ int panfrost_job_open(struct panfrost_file_priv *panfrost_priv)
 	int ret, i;
 
 	for (i = 0; i < NUM_JOB_SLOTS; i++) {
+		mutex_init(&js->queue[i].lock);
 		sched = &js->queue[i].sched;
 		ret = drm_sched_entity_init(&panfrost_priv->sched_entity[i],
 					    DRM_SCHED_PRIORITY_NORMAL, &sched,
@@ -570,10 +610,14 @@ int panfrost_job_open(struct panfrost_file_priv *panfrost_priv)
 
 void panfrost_job_close(struct panfrost_file_priv *panfrost_priv)
 {
+	struct panfrost_device *pfdev = panfrost_priv->pfdev;
+	struct panfrost_job_slot *js = pfdev->js;
 	int i;
 
-	for (i = 0; i < NUM_JOB_SLOTS; i++)
+	for (i = 0; i < NUM_JOB_SLOTS; i++) {
 		drm_sched_entity_destroy(&panfrost_priv->sched_entity[i]);
+		mutex_destroy(&js->queue[i].lock);
+	}
 }
 
 int panfrost_job_is_idle(struct panfrost_device *pfdev)