From patchwork Thu Oct 29 10:47:32 2020
X-Patchwork-Submitter: Boris Brezillon
X-Patchwork-Id: 11866027
From: Boris Brezillon
To: Rob Herring, Tomeu Vizoso, Alyssa Rosenzweig, Steven Price, Robin Murphy
Cc: Boris Brezillon, stable@vger.kernel.org, dri-devel@lists.freedesktop.org
Subject: [PATCH v2] drm/panfrost: Fix a race in the job timeout handling (again)
Date: Thu, 29 Oct 2020 11:47:32 +0100
Message-Id: <20201029104732.293437-1-boris.brezillon@collabora.com>
List-Id: Direct Rendering Infrastructure - Development

In our last attempt to fix races in the panfrost_job_timedout() path we
overlooked the case where a re-submitted job immediately triggers a
fault. This led to a situation where we try to stop a scheduler that's
not resumed yet and lose the 'timedout' event without restarting the
timeout, thus blocking the whole queue.

Let's fix that by tracking timeouts occurring between the
drm_sched_resubmit_jobs() and drm_sched_start() calls.

v2:
- Fix another race (reported by Steven)

Fixes: 1a11a88cfd9a ("drm/panfrost: Fix job timeout handling")
Cc:
Signed-off-by: Boris Brezillon
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 61 +++++++++++++++++--------
 1 file changed, 43 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index d0469e944143..0f9a34f5c6d0 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -26,6 +26,7 @@
 struct panfrost_queue_state {
 	struct drm_gpu_scheduler sched;
 	bool stopped;
+	bool timedout;
 	struct mutex lock;
 	u64 fence_context;
 	u64 emit_seqno;
@@ -383,11 +384,33 @@ static bool panfrost_scheduler_stop(struct panfrost_queue_state *queue,
 		queue->stopped = true;
 		stopped = true;
 	}
+	queue->timedout = true;
 	mutex_unlock(&queue->lock);
 
 	return stopped;
 }
 
+static void panfrost_scheduler_start(struct panfrost_queue_state *queue)
+{
+	if (WARN_ON(!queue->stopped))
+		return;
+
+	mutex_lock(&queue->lock);
+	drm_sched_start(&queue->sched, true);
+
+	/*
+	 * We might have missed fault-timeouts (AKA immediate timeouts) while
+	 * the scheduler was stopped. Let's fake a new fault to trigger an
+	 * immediate reset.
+	 */
+	if (queue->timedout)
+		drm_sched_fault(&queue->sched);
+
+	queue->timedout = false;
+	queue->stopped = false;
+	mutex_unlock(&queue->lock);
+}
+
 static void panfrost_job_timedout(struct drm_sched_job *sched_job)
 {
 	struct panfrost_job *job = to_panfrost_job(sched_job);
@@ -422,27 +445,20 @@ static void panfrost_job_timedout(struct drm_sched_job *sched_job)
 		struct drm_gpu_scheduler *sched = &pfdev->js->queue[i].sched;
 
 		/*
-		 * If the queue is still active, make sure we wait for any
-		 * pending timeouts.
+		 * Stop the scheduler and wait for any pending timeout handler
+		 * to return.
 		 */
-		if (!pfdev->js->queue[i].stopped)
+		panfrost_scheduler_stop(&pfdev->js->queue[i], NULL);
+		if (i != js)
 			cancel_delayed_work_sync(&sched->work_tdr);
 
 		/*
-		 * If the scheduler was not already stopped, there's a tiny
-		 * chance a timeout has expired just before we stopped it, and
-		 * drm_sched_stop() does not flush pending works. Let's flush
-		 * them now so the timeout handler doesn't get called in the
-		 * middle of a reset.
+		 * We do another stop after cancel_delayed_work_sync() to make
+		 * sure we don't race against another thread finishing its
+		 * reset (the restart queue steps are not protected by the
+		 * reset lock).
 		 */
-		if (panfrost_scheduler_stop(&pfdev->js->queue[i], NULL))
-			cancel_delayed_work_sync(&sched->work_tdr);
-
-		/*
-		 * Now that we cancelled the pending timeouts, we can safely
-		 * reset the stopped state.
-		 */
-		pfdev->js->queue[i].stopped = false;
+		panfrost_scheduler_stop(&pfdev->js->queue[i], NULL);
 	}
 
 	spin_lock_irqsave(&pfdev->js->job_lock, flags);
@@ -457,14 +473,23 @@ static void panfrost_job_timedout(struct drm_sched_job *sched_job)
 
 	panfrost_device_reset(pfdev);
 
-	for (i = 0; i < NUM_JOB_SLOTS; i++)
+	for (i = 0; i < NUM_JOB_SLOTS; i++) {
+		/*
+		 * The GPU is idle, and the scheduler is stopped, we can safely
+		 * reset the ->timedout state without taking any lock. We need
+		 * to do that before calling drm_sched_resubmit_jobs() though,
+		 * because the resubmission might trigger immediate faults
+		 * which we want to catch.
+		 */
+		pfdev->js->queue[i].timedout = false;
 		drm_sched_resubmit_jobs(&pfdev->js->queue[i].sched);
+	}
 
 	mutex_unlock(&pfdev->reset_lock);
 
 	/* restart scheduler after GPU is usable again */
 	for (i = 0; i < NUM_JOB_SLOTS; i++)
-		drm_sched_start(&pfdev->js->queue[i].sched, true);
+		panfrost_scheduler_start(&pfdev->js->queue[i]);
 }
 
 static const struct drm_sched_backend_ops panfrost_sched_ops = {
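
Note for readers following the flag handling: the stop / resubmit / start
sequence described in the commit message can be modelled in user space
roughly as below. This is only a single-threaded sketch under stated
assumptions, not driver code. The queue_model names are hypothetical, the
mutex is left out, and drm_sched_fault() is replaced by a counter. It only
illustrates why ->timedout must be cleared before resubmission and checked
again when the queue is restarted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two flags in struct panfrost_queue_state. */
struct queue_model {
	bool stopped;
	bool timedout;
	int faults_replayed;	/* stands in for drm_sched_fault() calls */
};

/*
 * Modelled on panfrost_scheduler_stop(): only the first caller reports
 * that it stopped the queue, but every call records a timeout.
 */
static bool queue_model_stop(struct queue_model *q)
{
	bool stopped = false;

	if (!q->stopped) {
		q->stopped = true;
		stopped = true;
	}
	q->timedout = true;

	return stopped;
}

/*
 * Modelled on panfrost_scheduler_start(): replay a fault if a timeout
 * was recorded while the queue was stopped, then clear both flags.
 */
static void queue_model_start(struct queue_model *q)
{
	assert(q->stopped);	/* the patch uses WARN_ON() here */

	if (q->timedout)
		q->faults_replayed++;

	q->timedout = false;
	q->stopped = false;
}

/* Walk one reset cycle and return how many faults were replayed. */
static int queue_model_scenario(bool fault_during_resubmit)
{
	struct queue_model q = { false, false, 0 };

	/* The timeout handler stops the queue. */
	queue_model_stop(&q);

	/* Reset path: clear ->timedout before resubmitting jobs. */
	q.timedout = false;

	/* A resubmitted job may fault before the queue is restarted. */
	if (fault_during_resubmit)
		queue_model_stop(&q);

	queue_model_start(&q);

	return q.faults_replayed;
}
```

In this model, queue_model_scenario(true) replays exactly one fault and
queue_model_scenario(false) replays none. The first case corresponds to the
lost 'timedout' event the patch is meant to recover.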