From patchwork Mon Jun 12 12:07:31 2023
X-Patchwork-Submitter: Oded Gabbay
X-Patchwork-Id: 13276412
From: Oded Gabbay
To: dri-devel@lists.freedesktop.org
Subject: [PATCH 1/3] accel/habanalabs: remove pdev check on idle check
Date: Mon, 12 Jun 2023 15:07:31 +0300
Message-Id: <20230612120733.3079507-1-ogabbay@kernel.org>

Our simulator supports the idle check, so there is no longer any need to
check whether pdev exists.

Signed-off-by: Oded Gabbay
Reviewed-by: Ofir Bitton
---
 drivers/accel/habanalabs/common/device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index 0d02f1f7b994..5e61761b8c11 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -424,7 +424,7 @@ static void hpriv_release(struct kref *ref)
 	/* Check the device idle status and reset if not idle.
 	 * Skip it if already in reset, or if device is going to be reset in any case.
 	 */
-	if (!hdev->reset_info.in_reset && !reset_device && hdev->pdev && !hdev->pldm)
+	if (!hdev->reset_info.in_reset && !reset_device && !hdev->pldm)
 		device_is_idle = hdev->asic_funcs->is_device_idle(hdev, idle_mask,
 							HL_BUSY_ENGINES_MASK_EXT_SIZE, NULL);
 	if (!device_is_idle) {

From patchwork Mon Jun 12 12:07:32 2023
X-Patchwork-Submitter: Oded Gabbay
X-Patchwork-Id: 13276413
From: Oded Gabbay
To: dri-devel@lists.freedesktop.org
Subject: [PATCH 2/3] accel/habanalabs: reset device if scrubbing failed
Date: Mon, 12 Jun 2023 15:07:32 +0300
Message-Id: <20230612120733.3079507-2-ogabbay@kernel.org>
In-Reply-To: <20230612120733.3079507-1-ogabbay@kernel.org>
References: <20230612120733.3079507-1-ogabbay@kernel.org>

If scrubbing memory after the user released the device has failed, the
device is in a bad state and should be reset.

Signed-off-by: Oded Gabbay
Reviewed-by: Ofir Bitton
---
 drivers/accel/habanalabs/common/device.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index 5e61761b8c11..d7d9198b2103 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -454,8 +454,10 @@ static void hpriv_release(struct kref *ref)
 		/* Scrubbing is handled within hl_device_reset(), so here need to do it directly */
 		int rc = hdev->asic_funcs->scrub_device_mem(hdev);
 
-		if (rc)
+		if (rc) {
 			dev_err(hdev->dev, "failed to scrub memory from hpriv release (%d)\n", rc);
+			hl_device_reset(hdev, HL_DRV_RESET_HARD);
+		}
 	}
 
 	/* Now we can mark the compute_ctx as not active. Even if a reset is running in a different

From patchwork Mon Jun 12 12:07:33 2023
X-Patchwork-Submitter: Oded Gabbay
X-Patchwork-Id: 13276414
From: Oded Gabbay
To: dri-devel@lists.freedesktop.org
Cc: Ofir Bitton
Subject: [PATCH 3/3] accel/habanalabs: dump temperature threshold boot error
Date: Mon, 12 Jun 2023 15:07:33 +0300
Message-Id: <20230612120733.3079507-3-ogabbay@kernel.org>
In-Reply-To: <20230612120733.3079507-1-ogabbay@kernel.org>
References: <20230612120733.3079507-1-ogabbay@kernel.org>

From: Ofir Bitton

Dump an error that the f/w reports during boot. This error indicates a
failure to set the temperature sensor threshold.

Signed-off-by: Ofir Bitton
Reviewed-by: Oded Gabbay
Signed-off-by: Oded Gabbay
---
 drivers/accel/habanalabs/common/firmware_if.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/accel/habanalabs/common/firmware_if.c b/drivers/accel/habanalabs/common/firmware_if.c
index 370508e98854..c7da69dbfa0a 100644
--- a/drivers/accel/habanalabs/common/firmware_if.c
+++ b/drivers/accel/habanalabs/common/firmware_if.c
@@ -724,6 +724,11 @@ static bool fw_report_boot_dev0(struct hl_device *hdev, u32 err_val,
 		err_exists = true;
 	}
 
+	if (err_val & CPU_BOOT_ERR0_TMP_THRESH_INIT_FAIL) {
+		dev_err(hdev->dev, "Device boot error - Failed to set threshold for temperature sensor\n");
+		err_exists = true;
+	}
+
 	if (err_val & CPU_BOOT_ERR0_DEVICE_UNUSABLE_FAIL) {
 		/* Ignore this bit, don't prevent driver loading */
 		dev_dbg(hdev->dev, "device unusable status is set\n");
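The new check extends the existing if-chain in fw_report_boot_dev0(). For each error, the function tests one bit of err_val, logs a message, and sets err_exists. The one exception is CPU_BOOT_ERR0_DEVICE_UNUSABLE_FAIL, which is logged at debug level and does not block driver loading. The sketch below shows the same pattern as a table-driven decoder in plain userspace C. It is for illustration only and is not the driver's code. The DEMO_* bit values, struct boot_err_desc and report_boot_errors() are hypothetical. The real CPU_BOOT_ERR0_* values come from the habanalabs firmware interface headers, and fprintf() stands in for dev_err()/dev_dbg().

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical bit assignments for illustration only; NOT the real
 * CPU_BOOT_ERR0_* values from the habanalabs firmware headers. */
#define DEMO_ERR_TMP_THRESH_INIT_FAIL	(1u << 0)
#define DEMO_ERR_DEVICE_UNUSABLE_FAIL	(1u << 1)
#define DEMO_ERR_OTHER_FAIL		(1u << 2)

/* Hypothetical descriptor: one entry per error bit. */
struct boot_err_desc {
	uint32_t mask;
	const char *msg;
	bool fatal;	/* false: log at debug level, don't fail the boot check */
};

static const struct boot_err_desc boot_errs[] = {
	{ DEMO_ERR_TMP_THRESH_INIT_FAIL,
	  "Failed to set threshold for temperature sensor", true },
	{ DEMO_ERR_DEVICE_UNUSABLE_FAIL, "device unusable status is set", false },
	{ DEMO_ERR_OTHER_FAIL, "some other boot failure", true },
};

/* Logs every set bit and returns true if any fatal error bit is set. */
static bool report_boot_errors(uint32_t err_val)
{
	bool err_exists = false;
	size_t i;

	for (i = 0; i < sizeof(boot_errs) / sizeof(boot_errs[0]); i++) {
		if (!(err_val & boot_errs[i].mask))
			continue;

		if (boot_errs[i].fatal) {
			/* stands in for dev_err() */
			fprintf(stderr, "Device boot error - %s\n", boot_errs[i].msg);
			err_exists = true;
		} else {
			/* stands in for dev_dbg(): report, but don't count as an error */
			fprintf(stderr, "debug: %s\n", boot_errs[i].msg);
		}
	}

	return err_exists;
}
```

With this layout, adding a new error bit means adding one table entry instead of another if-block. The driver itself keeps the explicit if-chain, which leaves room for per-bit special handling.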