[2/2] drm/lima: mask irqs in timeout path before hard reset

Message ID	20240401212002.1191549-3-nunes.erico@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <dri-devel-bounces@lists.freedesktop.org> From: Erico Nunes <nunes.erico@gmail.com> To: Qiang Yu <yuq825@gmail.com>, anarsoul@gmail.com, dri-devel@lists.freedesktop.org, lima@lists.freedesktop.org Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>, Maxime Ripard <mripard@kernel.org>, Thomas Zimmermann <tzimmermann@suse.de>, David Airlie <airlied@gmail.com>, Daniel Vetter <daniel@ffwll.ch>, christian.koenig@amd.com, megi@xff.cz, linux-kernel@vger.kernel.org, Erico Nunes <nunes.erico@gmail.com> Subject: [PATCH 2/2] drm/lima: mask irqs in timeout path before hard reset Date: Mon, 1 Apr 2024 23:20:02 +0200 Message-ID: <20240401212002.1191549-3-nunes.erico@gmail.com> In-Reply-To: <20240401212002.1191549-1-nunes.erico@gmail.com> References: <20240401212002.1191549-1-nunes.erico@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: list Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
Series	drm/lima: fix devfreq refcount imbalance for job timeouts \| expand [0/2] drm/lima: fix devfreq refcount imbalance for job timeouts [1/2] drm/lima: add mask irq callback to gp and pp [2/2] drm/lima: mask irqs in timeout path before hard reset

Message ID

20240401212002.1191549-3-nunes.erico@gmail.com (mailing list archive)

State

New, archived

Headers

From: Erico Nunes <nunes.erico@gmail.com>
To: Qiang Yu <yuq825@gmail.com>, anarsoul@gmail.com,
 dri-devel@lists.freedesktop.org, lima@lists.freedesktop.org
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>,
 Maxime Ripard <mripard@kernel.org>,
 Thomas Zimmermann <tzimmermann@suse.de>, David Airlie <airlied@gmail.com>,
 Daniel Vetter <daniel@ffwll.ch>, christian.koenig@amd.com, megi@xff.cz,
 linux-kernel@vger.kernel.org, Erico Nunes <nunes.erico@gmail.com>
Subject: [PATCH 2/2] drm/lima: mask irqs in timeout path before hard reset
Date: Mon,  1 Apr 2024 23:20:02 +0200
Message-ID: <20240401212002.1191549-3-nunes.erico@gmail.com>
In-Reply-To: <20240401212002.1191549-1-nunes.erico@gmail.com>
References: <20240401212002.1191549-1-nunes.erico@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: list
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>

Series

drm/lima: fix devfreq refcount imbalance for job timeouts | expand

Commit Message

Erico Nunes April 1, 2024, 9:20 p.m. UTC

There is a race condition in which a rendering job might take just long
enough to trigger the drm sched job timeout handler but also still
complete before the hard reset is done by the timeout handler.
This runs into race conditions not expected by the timeout handler.
In some very specific cases it currently may result in a refcount
imbalance on lima_pm_idle, with a stack dump such as:

[10136.669170] WARNING: CPU: 0 PID: 0 at drivers/gpu/drm/lima/lima_devfreq.c:205 lima_devfreq_record_idle+0xa0/0xb0
...
[10136.669459] pc : lima_devfreq_record_idle+0xa0/0xb0
...
[10136.669628] Call trace:
[10136.669634]  lima_devfreq_record_idle+0xa0/0xb0
[10136.669646]  lima_sched_pipe_task_done+0x5c/0xb0
[10136.669656]  lima_gp_irq_handler+0xa8/0x120
[10136.669666]  __handle_irq_event_percpu+0x48/0x160
[10136.669679]  handle_irq_event+0x4c/0xc0

We can prevent that race condition entirely by masking the irqs at the
beginning of the timeout handler, at which point we give up on waiting
for that job entirely.
The irqs will be enabled again at the next hard reset which is already
done as a recovery by the timeout handler.

Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
---
 drivers/gpu/drm/lima/lima_sched.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Qiang Yu April 4, 2024, 12:32 p.m. UTC | #1

Reviewed-by: Qiang Yu <yuq825@gmail.com>

On Tue, Apr 2, 2024 at 5:20 AM Erico Nunes <nunes.erico@gmail.com> wrote:
>
> There is a race condition in which a rendering job might take just long
> enough to trigger the drm sched job timeout handler but also still
> complete before the hard reset is done by the timeout handler.
> This runs into race conditions not expected by the timeout handler.
> In some very specific cases it currently may result in a refcount
> imbalance on lima_pm_idle, with a stack dump such as:
>
> [10136.669170] WARNING: CPU: 0 PID: 0 at drivers/gpu/drm/lima/lima_devfreq.c:205 lima_devfreq_record_idle+0xa0/0xb0
> ...
> [10136.669459] pc : lima_devfreq_record_idle+0xa0/0xb0
> ...
> [10136.669628] Call trace:
> [10136.669634]  lima_devfreq_record_idle+0xa0/0xb0
> [10136.669646]  lima_sched_pipe_task_done+0x5c/0xb0
> [10136.669656]  lima_gp_irq_handler+0xa8/0x120
> [10136.669666]  __handle_irq_event_percpu+0x48/0x160
> [10136.669679]  handle_irq_event+0x4c/0xc0
>
> We can prevent that race condition entirely by masking the irqs at the
> beginning of the timeout handler, at which point we give up on waiting
> for that job entirely.
> The irqs will be enabled again at the next hard reset which is already
> done as a recovery by the timeout handler.
>
> Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
> ---
>  drivers/gpu/drm/lima/lima_sched.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
> index 66841503a618..bbf3f8feab94 100644
> --- a/drivers/gpu/drm/lima/lima_sched.c
> +++ b/drivers/gpu/drm/lima/lima_sched.c
> @@ -430,6 +430,13 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
>                 return DRM_GPU_SCHED_STAT_NOMINAL;
>         }
>
> +       /*
> +        * The task might still finish while this timeout handler runs.
> +        * To prevent a race condition on its completion, mask all irqs
> +        * on the running core until the next hard reset completes.
> +        */
> +       pipe->task_mask_irq(pipe);
> +
>         if (!pipe->error)
>                 DRM_ERROR("%s job timeout\n", lima_ip_name(ip));
>
> --
> 2.44.0
>

diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
index 66841503a618..bbf3f8feab94 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -430,6 +430,13 @@  static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
 		return DRM_GPU_SCHED_STAT_NOMINAL;
 	}
 
+	/*
+	 * The task might still finish while this timeout handler runs.
+	 * To prevent a race condition on its completion, mask all irqs
+	 * on the running core until the next hard reset completes.
+	 */
+	pipe->task_mask_irq(pipe);
+
 	if (!pipe->error)
 		DRM_ERROR("%s job timeout\n", lima_ip_name(ip));

[2/2] drm/lima: mask irqs in timeout path before hard reset

Commit Message

Comments

Patch