From patchwork Wed Nov 15 16:39:03 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13457093 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id D2F9DC2BB3F for ; Wed, 15 Nov 2023 16:39:26 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id C7C2710E10F; Wed, 15 Nov 2023 16:39:25 +0000 (UTC) Received: from sin.source.kernel.org (sin.source.kernel.org [IPv6:2604:1380:40e1:4800::1]) by gabe.freedesktop.org (Postfix) with ESMTPS id EB9CA10E0DD for ; Wed, 15 Nov 2023 16:39:22 +0000 (UTC) Received: from smtp.kernel.org (transwarp.subspace.kernel.org [100.75.92.58]) by sin.source.kernel.org (Postfix) with ESMTP id 55CE8CE1C90; Wed, 15 Nov 2023 16:39:19 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 78672C433C7; Wed, 15 Nov 2023 16:39:17 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1700066358; bh=CeqcgqbPmKuc6ujFnqT9hCYTQXV2xrXzIPAKrvqD3A8=; h=From:To:Cc:Subject:Date:From; b=FKGTa0ne/R38V0DiQupeeWYJNf6oPa+bN6URqC6pwh2/K2mCGtfVLQTNrxNXaRUrp 729EnLgDLA1DOhh5+/+hyqUotvOgn/Fih3rk5ZeO2znvbrSKV1qtZ/m8AZ7pIVRgoI 62byx0R6AbopsK13tD4OaaJAgIn/taFmabWLklii9mxM7tOuiEmTuVp/4mWzf+NH7w dBKyUZveGnvUudEcmaAstcAgHtDb/FxaQaKpwkfTXeFPXaCyFiz2fgtMHDamu7pF+K OkbGe9MpySfbk+uVKrpnn/N1pXqK8Ad/gWTMOH4C29VrMipWhSP3LEtgQytYZR4PZl NX4OC9+9HgxjQ== From: Oded Gabbay To: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org Subject: [PATCH 01/10] accel/habanalabs/gaudi2: assume hard-reset by FW upon PCIe AXI drain Date: Wed, 15 Nov 2023 18:39:03 +0200 Message-Id: 
<20231115163912.1243175-1-ogabbay@kernel.org> X-Mailer: git-send-email 2.34.1 MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Tomer Tayar Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Tomer Tayar When a PCIe AXI drain event happens, it is possible that the driver cannot access the device through PCIe, and therefore cannot send a hard-reset request to FW. Starting from FW version 1.13, FW will initiate a hard-reset in such a case without waiting for a reset request from the driver. Signed-off-by: Tomer Tayar Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- drivers/accel/habanalabs/common/habanalabs.h | 8 ++++++++ drivers/accel/habanalabs/gaudi2/gaudi2.c | 2 ++ 2 files changed, 10 insertions(+) diff --git a/drivers/accel/habanalabs/common/habanalabs.h b/drivers/accel/habanalabs/common/habanalabs.h index 1655c101c705..5c69a482b8de 100644 --- a/drivers/accel/habanalabs/common/habanalabs.h +++ b/drivers/accel/habanalabs/common/habanalabs.h @@ -3594,6 +3594,14 @@ static inline bool hl_is_fw_sw_ver_below(struct hl_device *hdev, u32 fw_sw_major return false; } +static inline bool hl_is_fw_sw_ver_equal_or_greater(struct hl_device *hdev, u32 fw_sw_major, + u32 fw_sw_minor) +{ + return (hdev->fw_sw_major_ver > fw_sw_major || + (hdev->fw_sw_major_ver == fw_sw_major && + hdev->fw_sw_minor_ver >= fw_sw_minor)); +} + /* * Kernel module functions that can be accessed by entire module */ diff --git a/drivers/accel/habanalabs/gaudi2/gaudi2.c b/drivers/accel/habanalabs/gaudi2/gaudi2.c index 819660c684cf..b739078c2d87 100644 --- a/drivers/accel/habanalabs/gaudi2/gaudi2.c +++ b/drivers/accel/habanalabs/gaudi2/gaudi2.c @@ -10007,6 +10007,8 @@ static void gaudi2_handle_eqe(struct hl_device *hdev, struct hl_eq_entry *eq_ent error_count = 
gaudi2_handle_pcie_drain(hdev, &eq_entry->pcie_drain_ind_data); reset_flags |= HL_DRV_RESET_FW_FATAL_ERR; event_mask |= HL_NOTIFIER_EVENT_GENERAL_HW_ERR; + if (hl_is_fw_sw_ver_equal_or_greater(hdev, 1, 13)) + is_critical = true; break; case GAUDI2_EVENT_PSOC59_RPM_ERROR_OR_DRAIN: From patchwork Wed Nov 15 16:39:04 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13457094 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 8468BC072A2 for ; Wed, 15 Nov 2023 16:39:29 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id F185410E0F0; Wed, 15 Nov 2023 16:39:25 +0000 (UTC) Received: from sin.source.kernel.org (sin.source.kernel.org [145.40.73.55]) by gabe.freedesktop.org (Postfix) with ESMTPS id 0EEB210E0E0 for ; Wed, 15 Nov 2023 16:39:23 +0000 (UTC) Received: from smtp.kernel.org (transwarp.subspace.kernel.org [100.75.92.58]) by sin.source.kernel.org (Postfix) with ESMTP id EFEECCE1D95; Wed, 15 Nov 2023 16:39:20 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 1ED0CC433C8; Wed, 15 Nov 2023 16:39:18 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1700066360; bh=diI9h96MMsr8an2MYVnN2Z42kQPggaGEHH/n5mgX6Q4=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=UXvWV165zKWxS9koxbLHt6+oJBIVlM3FymNxwckh4dFcNysCimVu71/Fx20C2PfSX fHOImhDq2KujxGRmu2e9gXy9KgARgcBzu8pwjjaO+LnuXHhGHbydN0AlEvVTR8kV3P ggi5OpbbNy3Qd9VrY4cA69gqOlcuakaOlPtBbVbqYo9kEVhjM4M9W+pcG0ouMYnssi z2MuCO2aJNtpVCA9mjUg6R+0q86XNx8NznJT1uNyjd8KeV4dyFVMciYxotvwlaoskk 
SxQfwzrx5H8il70GeeWjxzwHSif6FxzUT5Vc14se6CO5o2WhYwZE9UkS1iXs0SLov5 75PkEOmzQ2fkA== From: Oded Gabbay To: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org Subject: [PATCH 02/10] accel/habanalabs: add log when eq event is not received Date: Wed, 15 Nov 2023 18:39:04 +0200 Message-Id: <20231115163912.1243175-2-ogabbay@kernel.org> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20231115163912.1243175-1-ogabbay@kernel.org> References: <20231115163912.1243175-1-ogabbay@kernel.org> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Farah Kassabri Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Farah Kassabri Add an error log when no EQ event is received from FW, to cover a scenario in which FW is stuck for some reason. In such a case the driver receives neither the EQ error interrupt nor the EQ heartbeat event, and just initiates a reset without any indication in dmesg of the reason.
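The check that gains the log message can be modeled in user space as follows. This is only a sketch under stated assumptions: `struct fake_dev`, `eq_heartbeat_tick()` and the two counters are hypothetical stand-ins for illustration, and `fprintf()` stands in for the driver's `dev_err()` call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the driver's device state (not struct hl_device). */
struct fake_dev {
	bool eq_health_check_supported;
	bool eq_heartbeat_received;
	int errors_logged;	/* messages that would reach dmesg */
	int resets;		/* resets that would be requested */
};

/* One periodic tick: consume the heartbeat flag, or log and then reset. */
static void eq_heartbeat_tick(struct fake_dev *dev)
{
	assert(dev != NULL);

	if (!dev->eq_health_check_supported)
		return;

	if (dev->eq_heartbeat_received) {
		/* Heartbeat seen since the last tick: re-arm for the next one. */
		dev->eq_heartbeat_received = false;
	} else {
		/* The point of the patch: state the reason before resetting. */
		fprintf(stderr, "EQ heartbeat event was not received!\n");
		dev->errors_logged++;
		dev->resets++;
	}
}
```

In this model a tick that finds the flag set only clears it, while a tick that finds it clear logs once and requests one reset, so the reset is never silent.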
Signed-off-by: Farah Kassabri Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- drivers/accel/habanalabs/common/device.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c index 9711e8fc979d..d95a981b2906 100644 --- a/drivers/accel/habanalabs/common/device.c +++ b/drivers/accel/habanalabs/common/device.c @@ -1049,10 +1049,12 @@ static void hl_device_eq_heartbeat(struct hl_device *hdev) if (!prop->cpucp_info.eq_health_check_supported) return; - if (hdev->eq_heartbeat_received) + if (hdev->eq_heartbeat_received) { hdev->eq_heartbeat_received = false; - else + } else { + dev_err(hdev->dev, "EQ heartbeat event was not received!\n"); hl_device_cond_reset(hdev, HL_DRV_RESET_HARD, event_mask); + } } static void hl_device_heartbeat(struct work_struct *work) From patchwork Wed Nov 15 16:39:05 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13457095 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 68B4FC2BB3F for ; Wed, 15 Nov 2023 16:39:33 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 6981210E114; Wed, 15 Nov 2023 16:39:32 +0000 (UTC) Received: from sin.source.kernel.org (sin.source.kernel.org [145.40.73.55]) by gabe.freedesktop.org (Postfix) with ESMTPS id 1234E10E0F0 for ; Wed, 15 Nov 2023 16:39:25 +0000 (UTC) Received: from smtp.kernel.org (transwarp.subspace.kernel.org [100.75.92.58]) by sin.source.kernel.org (Postfix) with ESMTP id 64E7ECE1D02 for ; Wed, 15 Nov 2023 16:39:22 +0000 (UTC) Received: 
by smtp.kernel.org (Postfix) with ESMTPSA id BA64EC433C7; Wed, 15 Nov 2023 16:39:20 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1700066361; bh=NbtSh1xIKXUY0wdMyrpHBah0tEYTkXI939m/krGZXYA=; h=From:To:Subject:Date:In-Reply-To:References:From; b=tFj3UGZgfqVN6AHFEPIheLZkcKXr0yXlbZNf+ejnR2Yp5h4MTosnNtrWxlPfvX87c 43z7FChfk+BotOyOXk1RAmUky+v0Dz7GB8gQWj6qcHUbsJgx8k0wmEpfBNaZRst+Ux AH5VUkatVXRq926yqif8vnMa/sTIMtkr3319WeZI8P4VDdFx+zbO/hzM/5ByYmCHcT KpxHZv+SAQXe0lHEL83JQV7vK6Wa0y1SOx/3RO6BlpoEdrjwheKIFrHsY9P81YcvH9 Y1jLpvvfQSj3PDvWc8WfefBkrqy4Qlb4MWgdilZj3vGjDVgKoK31ZCB5blT8GX/Qpe RoZzrk9nE4B9w== From: Oded Gabbay To: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org Subject: [PATCH 03/10] accel/habanalabs: add support for Gaudi2C device Date: Wed, 15 Nov 2023 18:39:05 +0200 Message-Id: <20231115163912.1243175-3-ogabbay@kernel.org> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20231115163912.1243175-1-ogabbay@kernel.org> References: <20231115163912.1243175-1-ogabbay@kernel.org> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" A Gaudi2 with a PCI revision ID of '3' is a Gaudi2C device, and should be detected and initialized as Gaudi2.
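The detection described here amounts to a mapping from PCI revision ID to ASIC type and then to a display name. The following is a minimal sketch of that mapping, not the driver code: the enum `asic_type` is a trimmed-down hypothetical list, the helper names are invented for illustration, and the `REV_ID_A` to Gaudi2 mapping is an assumption, since only the `REV_ID_B` and `REV_ID_C` cases are visible in this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Revision ID values as listed in the pci_general.h hunk of this patch. */
enum rev_id {
	REV_ID_INVALID = 0x00,
	REV_ID_A = 0x01,
	REV_ID_B = 0x02,
	REV_ID_C = 0x03
};

/* Hypothetical, trimmed-down ASIC type list used by this sketch only. */
enum asic_type { ASIC_INVALID, ASIC_GAUDI2, ASIC_GAUDI2B, ASIC_GAUDI2C };

/* Map a PCI revision ID to an ASIC type (REV_ID_A mapping is assumed). */
static enum asic_type gaudi2_type_from_rev(uint8_t rev)
{
	switch (rev) {
	case REV_ID_A: return ASIC_GAUDI2;
	case REV_ID_B: return ASIC_GAUDI2B;
	case REV_ID_C: return ASIC_GAUDI2C;
	default:       return ASIC_INVALID;
	}
}

/* Name reported for each type; NULL means "unrecognized ASIC type". */
static const char *asic_name(enum asic_type type)
{
	assert(type <= ASIC_GAUDI2C);

	switch (type) {
	case ASIC_GAUDI2:  return "GAUDI2";
	case ASIC_GAUDI2B: return "GAUDI2B";
	case ASIC_GAUDI2C: return "GAUDI2C";
	default:           return NULL;
	}
}
```

The sketch keeps the two steps separate because the patch touches them in separate files: the revision-to-type switch in one place, and the type-to-name switches in others.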
Signed-off-by: Oded Gabbay --- drivers/accel/habanalabs/common/device.c | 3 +++ drivers/accel/habanalabs/common/habanalabs.h | 2 ++ drivers/accel/habanalabs/common/habanalabs_drv.c | 3 +++ drivers/accel/habanalabs/common/mmu/mmu.c | 1 + drivers/accel/habanalabs/common/sysfs.c | 3 +++ drivers/accel/habanalabs/include/hw_ip/pci/pci_general.h | 1 + 6 files changed, 13 insertions(+) diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c index d95a981b2906..d9447aeb3937 100644 --- a/drivers/accel/habanalabs/common/device.c +++ b/drivers/accel/habanalabs/common/device.c @@ -853,6 +853,9 @@ static int device_early_init(struct hl_device *hdev) gaudi2_set_asic_funcs(hdev); strscpy(hdev->asic_name, "GAUDI2B", sizeof(hdev->asic_name)); break; + case ASIC_GAUDI2C: + gaudi2_set_asic_funcs(hdev); + strscpy(hdev->asic_name, "GAUDI2C", sizeof(hdev->asic_name)); break; default: dev_err(hdev->dev, "Unrecognized ASIC type %d\n", diff --git a/drivers/accel/habanalabs/common/habanalabs.h b/drivers/accel/habanalabs/common/habanalabs.h index 5c69a482b8de..7b0209e5bad6 100644 --- a/drivers/accel/habanalabs/common/habanalabs.h +++ b/drivers/accel/habanalabs/common/habanalabs.h @@ -1262,6 +1262,7 @@ struct hl_dec { * @ASIC_GAUDI_SEC: Gaudi secured device (HL-2000). * @ASIC_GAUDI2: Gaudi2 device. * @ASIC_GAUDI2B: Gaudi2B device. + * @ASIC_GAUDI2C: Gaudi2C device. 
*/ enum hl_asic_type { ASIC_INVALID, @@ -1270,6 +1271,7 @@ enum hl_asic_type { ASIC_GAUDI_SEC, ASIC_GAUDI2, ASIC_GAUDI2B, + ASIC_GAUDI2C, }; struct hl_cs_parser; diff --git a/drivers/accel/habanalabs/common/habanalabs_drv.c b/drivers/accel/habanalabs/common/habanalabs_drv.c index 35ae0ff347f5..e542fd40e16c 100644 --- a/drivers/accel/habanalabs/common/habanalabs_drv.c +++ b/drivers/accel/habanalabs/common/habanalabs_drv.c @@ -141,6 +141,9 @@ static enum hl_asic_type get_asic_type(struct hl_device *hdev) case REV_ID_B: asic_type = ASIC_GAUDI2B; break; + case REV_ID_C: + asic_type = ASIC_GAUDI2C; + break; default: break; } diff --git a/drivers/accel/habanalabs/common/mmu/mmu.c b/drivers/accel/habanalabs/common/mmu/mmu.c index b2145716c605..b654302a68fc 100644 --- a/drivers/accel/habanalabs/common/mmu/mmu.c +++ b/drivers/accel/habanalabs/common/mmu/mmu.c @@ -596,6 +596,7 @@ int hl_mmu_if_set_funcs(struct hl_device *hdev) break; case ASIC_GAUDI2: case ASIC_GAUDI2B: + case ASIC_GAUDI2C: /* MMUs in Gaudi2 are always host resident */ hl_mmu_v2_hr_set_funcs(hdev, &hdev->mmu_func[MMU_HR_PGT]); break; diff --git a/drivers/accel/habanalabs/common/sysfs.c b/drivers/accel/habanalabs/common/sysfs.c index 01f89f029355..278606373055 100644 --- a/drivers/accel/habanalabs/common/sysfs.c +++ b/drivers/accel/habanalabs/common/sysfs.c @@ -251,6 +251,9 @@ static ssize_t device_type_show(struct device *dev, case ASIC_GAUDI2B: str = "GAUDI2B"; break; + case ASIC_GAUDI2C: + str = "GAUDI2C"; + break; default: dev_err(hdev->dev, "Unrecognized ASIC type %d\n", hdev->asic_type); diff --git a/drivers/accel/habanalabs/include/hw_ip/pci/pci_general.h b/drivers/accel/habanalabs/include/hw_ip/pci/pci_general.h index f5d497dc9bdc..4f951cada077 100644 --- a/drivers/accel/habanalabs/include/hw_ip/pci/pci_general.h +++ b/drivers/accel/habanalabs/include/hw_ip/pci/pci_general.h @@ -25,6 +25,7 @@ enum hl_revision_id { REV_ID_INVALID = 0x00, REV_ID_A = 0x01, REV_ID_B = 0x02, + REV_ID_C = 0x03 }; #endif /* 
INCLUDE_PCI_GENERAL_H_ */ From patchwork Wed Nov 15 16:39:06 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13457096 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 5928BC072A2 for ; Wed, 15 Nov 2023 16:39:35 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 7EC2C10E11B; Wed, 15 Nov 2023 16:39:32 +0000 (UTC) Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by gabe.freedesktop.org (Postfix) with ESMTPS id 461B310E0F0 for ; Wed, 15 Nov 2023 16:39:24 +0000 (UTC) Received: from smtp.kernel.org (transwarp.subspace.kernel.org [100.75.92.58]) by dfw.source.kernel.org (Postfix) with ESMTP id 98B9D6154F; Wed, 15 Nov 2023 16:39:23 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 2E81BC433C9; Wed, 15 Nov 2023 16:39:21 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1700066363; bh=hH/MtPMheBE4FRnp6r7JcZVzm3gOuvyKUwNseHulg94=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=kVyduxGw7RGktXNcN7T+SGle97iooX2quLtnli9mnVApOKEa8BtmFwngJ42Km8A32 jARlxlOP9xx6aSCE5ndmrJWMEYasSv0811arbT+bdmSpDodAnf3PY44FCYS+ok/O3b uBsOix7P4nF7OBGQtbnE6JD/7+Nfp+SllJcuYjUzKChNuoO8zis+ifrbTAnDBJczU3 S1M3V2jTYPkMTsMaKM649uBdQJlMDKMR67tlzwYw6BEHvOpnY/fPVRWGq8/Jm/oOeq rR4FIwzNS193Ipxn92zAvkqOBH4Pw+anMtS4zxQ7HaTgkYzunzrrN9VD44/l4PDasM PZrGVAHbIGckA== From: Oded Gabbay To: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org Subject: [PATCH 04/10] accel/habanalabs: fix EQ heartbeat mechanism Date: Wed, 15 Nov 2023 18:39:06 +0200 
Message-Id: <20231115163912.1243175-4-ogabbay@kernel.org> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20231115163912.1243175-1-ogabbay@kernel.org> References: <20231115163912.1243175-1-ogabbay@kernel.org> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Farah Kassabri Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Farah Kassabri Stop rescheduling another heartbeat check when EQ heartbeat check fails as it generates confusing logs in dmesg that the heartbeat fails. Signed-off-by: Farah Kassabri Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- drivers/accel/habanalabs/common/device.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c index d9447aeb3937..6bf5f1d0d005 100644 --- a/drivers/accel/habanalabs/common/device.c +++ b/drivers/accel/habanalabs/common/device.c @@ -1044,20 +1044,21 @@ static bool is_pci_link_healthy(struct hl_device *hdev) return (vendor_id == PCI_VENDOR_ID_HABANALABS); } -static void hl_device_eq_heartbeat(struct hl_device *hdev) +static int hl_device_eq_heartbeat_check(struct hl_device *hdev) { - u64 event_mask = HL_NOTIFIER_EVENT_DEVICE_RESET | HL_NOTIFIER_EVENT_DEVICE_UNAVAILABLE; struct asic_fixed_properties *prop = &hdev->asic_prop; if (!prop->cpucp_info.eq_health_check_supported) - return; + return 0; if (hdev->eq_heartbeat_received) { hdev->eq_heartbeat_received = false; } else { dev_err(hdev->dev, "EQ heartbeat event was not received!\n"); - hl_device_cond_reset(hdev, HL_DRV_RESET_HARD, event_mask); + return -EIO; } + + return 0; } static void hl_device_heartbeat(struct work_struct *work) @@ -1074,10 +1075,9 @@ static void hl_device_heartbeat(struct work_struct *work) /* * For EQ health check need to 
check if driver received the heartbeat eq event * in order to validate the eq is working. + * Only if both the EQ is healthy and we managed to send the next heartbeat reschedule. */ - hl_device_eq_heartbeat(hdev); - - if (!hdev->asic_funcs->send_heartbeat(hdev)) + if ((!hl_device_eq_heartbeat_check(hdev)) && (!hdev->asic_funcs->send_heartbeat(hdev))) goto reschedule; if (hl_device_operational(hdev, NULL)) From patchwork Wed Nov 15 16:39:07 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13457098 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 9C146C54FB9 for ; Wed, 15 Nov 2023 16:39:40 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 0DD0D10E264; Wed, 15 Nov 2023 16:39:39 +0000 (UTC) Received: from ams.source.kernel.org (ams.source.kernel.org [IPv6:2604:1380:4601:e00::1]) by gabe.freedesktop.org (Postfix) with ESMTPS id 734DE10E116 for ; Wed, 15 Nov 2023 16:39:27 +0000 (UTC) Received: from smtp.kernel.org (transwarp.subspace.kernel.org [100.75.92.58]) by ams.source.kernel.org (Postfix) with ESMTP id 90959B81A67; Wed, 15 Nov 2023 16:39:25 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id CF136C433C8; Wed, 15 Nov 2023 16:39:23 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1700066364; bh=gfk8zALZbdkLMNgU9Gvg6x8rMFGs5nttVZo1WOrF8z4=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=Y1sPNITGGaeCwOfznhyoXmKJGEFd0tZ++yksjkpidz9p2R/pI7A2/vyZIT56zhhQl H2Zu9ECdXB+0ViFFD6rL81ZhRKBUQZyPJH+TRGJd0uykgthPOEZQoH4pkqwp1OgRO8 
eYlBs17wIOYFJLxu3w1XbQdFeHSizwRRIevYRVRgGJ//mZ4V8jxK8KU7JV/vsSeZTb URJPS9RZMjJLmAgylqCEozfgTL0d/2w3DOcTaTRb/MK3jaEUlXJkdguHa2E+ZApntr tMVIhYfRW36fi9h6gfZaDTvrREzef+uaBCG6JSDvqYshritwA52uQ5OjRT9PMK+uf9 VCIgX3hpVyDsQ== From: Oded Gabbay To: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org Subject: [PATCH 05/10] accel/habanalabs/gaudi2: fix undef opcode reporting Date: Wed, 15 Nov 2023 18:39:07 +0200 Message-Id: <20231115163912.1243175-5-ogabbay@kernel.org> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20231115163912.1243175-1-ogabbay@kernel.org> References: <20231115163912.1243175-1-ogabbay@kernel.org> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Dafna Hirschfeld Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Dafna Hirschfeld Currently the undefined opcode event bit is set only for the lower CP and only if 'write_enable' is true. It should be set regardless, and for all streams, in order to report that event to userspace.
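The intended behavior can be sketched as a small stand-alone function. It is an illustration under assumptions, not the driver code: the bit positions, `struct undef_opcode_info` and `report_undef_opcode()` are hypothetical, and the real handler also records a timestamp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define UNDEF_CMD_ERR_BIT	(1u << 0)	/* hypothetical status bit position */
#define EVENT_UNDEF_OPCODE	(1ull << 0)	/* hypothetical notifier event bit */

/* Hypothetical, reduced version of the captured error record. */
struct undef_opcode_info {
	bool write_enable;
	uint32_t engine_id;
};

/* Called once per status register, for every stream and for the lower CP. */
static void report_undef_opcode(uint32_t glbl_sts, uint32_t engine_id,
				struct undef_opcode_info *info, uint64_t *event_mask)
{
	assert(info != NULL && event_mask != NULL);

	if (!(glbl_sts & UNDEF_CMD_ERR_BIT))
		return;

	/* Always notify userspace, whether or not capturing is armed. */
	*event_mask |= EVENT_UNDEF_OPCODE;

	/* Capture details only while capturing is armed. */
	if (info->write_enable) {
		memset(info, 0, sizeof(*info));	/* also clears write_enable */
		info->engine_id = engine_id;
	}
}
```

The event bit is set before the `write_enable` test, so a second error that arrives after the first capture still reaches userspace even though its details are not recorded.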
Signed-off-by: Dafna Hirschfeld Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- drivers/accel/habanalabs/gaudi2/gaudi2.c | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/drivers/accel/habanalabs/gaudi2/gaudi2.c b/drivers/accel/habanalabs/gaudi2/gaudi2.c index b739078c2d87..5075f92d15cc 100644 --- a/drivers/accel/habanalabs/gaudi2/gaudi2.c +++ b/drivers/accel/habanalabs/gaudi2/gaudi2.c @@ -7929,21 +7929,19 @@ static int gaudi2_handle_qman_err_generic(struct hl_device *hdev, u16 event_type error_count++; } - if (i == QMAN_STREAMS && error_count) { - /* check for undefined opcode */ - if (glbl_sts_val & PDMA0_QM_GLBL_ERR_STS_CP_UNDEF_CMD_ERR_MASK && - hdev->captured_err_info.undef_opcode.write_enable) { + /* check for undefined opcode */ + if (glbl_sts_val & PDMA0_QM_GLBL_ERR_STS_CP_UNDEF_CMD_ERR_MASK) { + *event_mask |= HL_NOTIFIER_EVENT_UNDEFINED_OPCODE; + if (hdev->captured_err_info.undef_opcode.write_enable) { memset(&hdev->captured_err_info.undef_opcode, 0, sizeof(hdev->captured_err_info.undef_opcode)); - - hdev->captured_err_info.undef_opcode.write_enable = false; hdev->captured_err_info.undef_opcode.timestamp = ktime_get(); hdev->captured_err_info.undef_opcode.engine_id = gaudi2_queue_id_to_engine_id[qid_base]; - *event_mask |= HL_NOTIFIER_EVENT_UNDEFINED_OPCODE; } - handle_lower_qman_data_on_err(hdev, qman_base, *event_mask); + if (i == QMAN_STREAMS) + handle_lower_qman_data_on_err(hdev, qman_base, *event_mask); } } From patchwork Wed Nov 15 16:39:08 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13457097 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org 
(Postfix) with ESMTPS id 57756C2BB3F for ; Wed, 15 Nov 2023 16:39:39 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id B646110E158; Wed, 15 Nov 2023 16:39:38 +0000 (UTC) Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by gabe.freedesktop.org (Postfix) with ESMTPS id 72C1510E114 for ; Wed, 15 Nov 2023 16:39:27 +0000 (UTC) Received: from smtp.kernel.org (transwarp.subspace.kernel.org [100.75.92.58]) by dfw.source.kernel.org (Postfix) with ESMTP id E5C876154F; Wed, 15 Nov 2023 16:39:26 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 7B26EC433C9; Wed, 15 Nov 2023 16:39:25 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1700066366; bh=O/lr4P1EsL7YBuHracXPErJjSaguLam5N+lR2mNBaRQ=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=HloXa8yfi2b8+zalpwQqL7zMe2cDWjk3K16Qu3i1dSW5lnc3nTXjYE+Jeco7gTwPF QFdW9p28kq86x5Kf/alAeyE4tW5CzZ5g0XJ4+eDT0wP9whbbzvfw7P6S3oeMpTNOSR qnVmmB+ZhRgR7quZL+8hkHs3XAx7EWRmv4cuiuvZfQbhi00Ksw2966oNpDIbGSJT/C ndxjkOJZNv0CG0ai5a+kuDri42Gv0KvbyrVmFup51ZvPkKW3dBn17zaViLe4e4X05P wvpsBQf3MUThIlbMtFVLIoa6JhkLd/U87bdhY7hvNcvoSiwQnOSSCC6RSV6S7tjzeg lFCtprQEDNgww== From: Oded Gabbay To: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org Subject: [PATCH 06/10] accel/habanalabs: remove 'get temperature' debug print Date: Wed, 15 Nov 2023 18:39:08 +0200 Message-Id: <20231115163912.1243175-6-ogabbay@kernel.org> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20231115163912.1243175-1-ogabbay@kernel.org> References: <20231115163912.1243175-1-ogabbay@kernel.org> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Ofir Bitton Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" 
From: Ofir Bitton The print was added long back for a specific debug and can now be removed. Signed-off-by: Ofir Bitton Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- drivers/accel/habanalabs/common/hwmon.c | 4 ---- 1 file changed, 4 deletions(-) diff --git a/drivers/accel/habanalabs/common/hwmon.c b/drivers/accel/habanalabs/common/hwmon.c index 8598056216e7..1ee2ee07e9ed 100644 --- a/drivers/accel/habanalabs/common/hwmon.c +++ b/drivers/accel/habanalabs/common/hwmon.c @@ -578,10 +578,6 @@ int hl_get_temperature(struct hl_device *hdev, CPUCP_PKT_CTL_OPCODE_SHIFT); pkt.sensor_index = __cpu_to_le16(sensor_index); pkt.type = __cpu_to_le16(attr); - - dev_dbg(hdev->dev, "get temp, ctl 0x%x, sensor %d, type %d\n", - pkt.ctl, pkt.sensor_index, pkt.type); - rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt), 0, &result); From patchwork Wed Nov 15 16:39:09 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13457101 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 961B0C2BB3F for ; Wed, 15 Nov 2023 16:39:50 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id CBAFA10E0DD; Wed, 15 Nov 2023 16:39:49 +0000 (UTC) Received: from sin.source.kernel.org (sin.source.kernel.org [IPv6:2604:1380:40e1:4800::1]) by gabe.freedesktop.org (Postfix) with ESMTPS id D476F10E114 for ; Wed, 15 Nov 2023 16:39:30 +0000 (UTC) Received: from smtp.kernel.org (transwarp.subspace.kernel.org [100.75.92.58]) by sin.source.kernel.org (Postfix) with ESMTP id 0A72ECE1C90; Wed, 15 Nov 2023 16:39:29 +0000 (UTC) Received: by 
smtp.kernel.org (Postfix) with ESMTPSA id 28D4CC433C8; Wed, 15 Nov 2023 16:39:26 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1700066368; bh=exKYJcOqE7PAuCguPkuv8mAJu56vWsUJCYEBl4ctl6U=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=e3HR+gVGFqu0vVmFFLsLi6cHtZ8Jb2lvPyuWJmDx9KCYdiSnWppVoyrZdcX/xOoio xhK1BicgbAqrEjdyI8x95IuviXOVX8ZeRGn5eBjumWTtXODIuNpZkEKg17Bkr7t41Z mrX7Bf5mPU+j/ZbD715h7vd0Bi5lYFWNNiKSSbPSxww6r/K6+5TrcuonS8zyUmMQm8 zMJgongjUWJsXNCMYaunSPZEUSLQKQEN3riWs111z1cFYCE0Y70TEvrLCOl1DErBaj Aj80uJgu2mBoCXNztCZa4TgMyls1nu2cVD4ZGZl/CQOJhp0GNEg4h6LQL7TtMDj1C8 yDuBVdDL7I1HA== From: Oded Gabbay To: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org Subject: [PATCH 07/10] accel/habanalabs: set hard reset flag if graceful reset is skipped Date: Wed, 15 Nov 2023 18:39:09 +0200 Message-Id: <20231115163912.1243175-7-ogabbay@kernel.org> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20231115163912.1243175-1-ogabbay@kernel.org> References: <20231115163912.1243175-1-ogabbay@kernel.org> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Tomer Tayar Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Tomer Tayar hl_device_cond_reset() might be called with the hard reset flag unset, because a compute reset upon device release as part of a graceful reset is valid. If the conditions for graceful reset are not met, hl_device_reset() will be called for an immediate reset. In this case a compute reset is not valid, so it will be replaced with a hard reset together with a debug message about it. This message might be confusing, as it implies that a compute reset was requested when it shouldn't. To prevent this confusion, set the hard reset flag in hl_device_cond_reset() if going to an immediate reset. 
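The flag handling described above can be sketched as follows. This is a simplified model under assumptions: the flag values, the `graceful_possible` parameter and the `cond_reset()` helper are hypothetical, and the real function decides between the two paths from device and context state rather than from a boolean argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values, chosen for this sketch only. */
#define RESET_HARD		(1u << 0)
#define RESET_FW_FATAL_ERR	(1u << 1)

/*
 * Returns true when an immediate reset is issued and stores in *reset_flags
 * the flags it would be issued with. Returns false on the graceful path.
 */
static bool cond_reset(bool graceful_possible, uint32_t flags, uint32_t *reset_flags)
{
	assert(reset_flags != NULL);

	if (graceful_possible)
		return false;	/* a compute reset on device release stays valid */

	/* Immediate reset: a compute reset is not valid, so request a hard one. */
	*reset_flags = flags | RESET_HARD;
	return true;
}
```

Because the hard flag is added by the caller of the immediate reset, the reset routine never sees a compute-reset request on this path and has nothing to replace or to report.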
Signed-off-by: Tomer Tayar Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- drivers/accel/habanalabs/common/device.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c index 6bf5f1d0d005..a365791a9f5c 100644 --- a/drivers/accel/habanalabs/common/device.c +++ b/drivers/accel/habanalabs/common/device.c @@ -2040,7 +2040,7 @@ int hl_device_cond_reset(struct hl_device *hdev, u32 flags, u64 event_mask) if (ctx) hl_ctx_put(ctx); - return hl_device_reset(hdev, flags); + return hl_device_reset(hdev, flags | HL_DRV_RESET_HARD); } static void hl_notifier_event_send(struct hl_notifier_event *notifier_event, u64 event_mask) From patchwork Wed Nov 15 16:39:10 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13457099 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 5D4FCC072A2 for ; Wed, 15 Nov 2023 16:39:42 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 8B6FF10E54E; Wed, 15 Nov 2023 16:39:39 +0000 (UTC) Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by gabe.freedesktop.org (Postfix) with ESMTPS id CB2A110E0DD for ; Wed, 15 Nov 2023 16:39:30 +0000 (UTC) Received: from smtp.kernel.org (transwarp.subspace.kernel.org [100.75.92.58]) by dfw.source.kernel.org (Postfix) with ESMTP id 40C9A6154F; Wed, 15 Nov 2023 16:39:30 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id C7886C433C9; Wed, 15 Nov 2023 16:39:28 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; 
d=kernel.org; s=k20201202; t=1700066369; bh=WZZpUNtc2DPKkk+uXGa4GT4m3SV97GOyHrjsNd9er2Y=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=GRzEctaIriss8XyXc5pNIHTyjlT/C7pWhsoqkZfJ/Lk4BNxVWSAh/cqkcdRmmDv/N viCBbDwH3eJi593IonCjeJ6I9Oquec3a3mmgrUe838VhxepGO9JLNatoJydrwuhoXf s0sKdOCx41ZVekUXqMWKLtM7yAs6WmgfSVqvgUD5r+7i9zlLatYoUfhY7FucSZbRxe plMlrl8x6mEnIZZ7/WT6UeFlBCh6lgZczEI870ARaBfcGCLuf+jQRYW1NKwylmU3xA 3gqKvIBP7rO/If0H2/za29YTnKxopAgQ74US6ONe9+D7ycf9qOc+frk8znQpKoiUDz ayEvZJEcFjmvw== From: Oded Gabbay To: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org Subject: [PATCH 08/10] accel/habanalabs/gaudi2: get the correct QM CQ info upon an error Date: Wed, 15 Nov 2023 18:39:10 +0200 Message-Id: <20231115163912.1243175-8-ogabbay@kernel.org> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20231115163912.1243175-1-ogabbay@kernel.org> References: <20231115163912.1243175-1-ogabbay@kernel.org> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Tomer Tayar Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Tomer Tayar Upon a QM error, the address/size from both the CQ and the ARC_CQ are printed, although the instruction that led to the error was received from only one of them. Moreover, in case of a QM undefined opcode, only one of these address/size sets will be captured based on the value of ARC_CQ_PTR. However, this value can be non-zero even if currently the CQ is used, in case the CQ/ARC_CQ are alternately used. Under the assumption of having a stop-on-error configuration, modify to use CP_STS.CUR_CQ field to get the relevant CQ for the QM error. 
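The selection described above can be modeled on a plain register snapshot. This is a sketch under assumptions: `struct qman_regs`, `struct cq_info`, `current_cq()` and the bit position of `CP_STS_CUR_CQ_MASK` are hypothetical, and the real code reads the registers from the device relative to the QMAN base address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CP_STS_CUR_CQ_MASK	(1u << 0)	/* hypothetical bit position */

/* Hypothetical snapshot of the relevant QMAN registers. */
struct qman_regs {
	uint32_t cp_sts;
	uint32_t cq_ptr_lo, cq_ptr_hi, cq_tsize;
	uint32_t arc_cq_ptr_lo, arc_cq_ptr_hi, arc_cq_tsize;
};

struct cq_info {
	bool is_arc_cq;
	uint64_t ptr;
	uint32_t size;
};

/* Pick the CQ that CP_STS.CUR_CQ names: 0 is the legacy CQ, 1 is the ARC_CQ. */
static struct cq_info current_cq(const struct qman_regs *r)
{
	struct cq_info info;
	uint32_t lo, hi;

	assert(r != NULL);

	info.is_arc_cq = (r->cp_sts & CP_STS_CUR_CQ_MASK) != 0;

	if (info.is_arc_cq) {
		lo = r->arc_cq_ptr_lo;
		hi = r->arc_cq_ptr_hi;
		info.size = r->arc_cq_tsize;
	} else {
		lo = r->cq_ptr_lo;
		hi = r->cq_ptr_hi;
		info.size = r->cq_tsize;
	}

	info.ptr = ((uint64_t)hi << 32) | lo;
	return info;
}
```

Keying on the status bit rather than on a non-zero ARC_CQ pointer matters when the two queues are used alternately, since the unused queue may still hold a stale non-zero pointer.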

Signed-off-by: Tomer Tayar
Reviewed-by: Oded Gabbay
Signed-off-by: Oded Gabbay
---
 drivers/accel/habanalabs/gaudi2/gaudi2.c      | 44 +++++++++----------
 .../include/gaudi2/asic_reg/gaudi2_regs.h     |  1 +
 2 files changed, 23 insertions(+), 22 deletions(-)

diff --git a/drivers/accel/habanalabs/gaudi2/gaudi2.c b/drivers/accel/habanalabs/gaudi2/gaudi2.c
index 5075f92d15cc..77c480725a84 100644
--- a/drivers/accel/habanalabs/gaudi2/gaudi2.c
+++ b/drivers/accel/habanalabs/gaudi2/gaudi2.c
@@ -7860,36 +7860,36 @@ static bool gaudi2_handle_ecc_event(struct hl_device *hdev, u16 event_type,
 
 static void handle_lower_qman_data_on_err(struct hl_device *hdev, u64 qman_base, u64 event_mask)
 {
-	u32 lo, hi, cq_ptr_size, arc_cq_ptr_size;
-	u64 cq_ptr, arc_cq_ptr, cp_current_inst;
-
-	lo = RREG32(qman_base + QM_CQ_PTR_LO_4_OFFSET);
-	hi = RREG32(qman_base + QM_CQ_PTR_HI_4_OFFSET);
-	cq_ptr = ((u64) hi) << 32 | lo;
-	cq_ptr_size = RREG32(qman_base + QM_CQ_TSIZE_4_OFFSET);
-
-	lo = RREG32(qman_base + QM_ARC_CQ_PTR_LO_OFFSET);
-	hi = RREG32(qman_base + QM_ARC_CQ_PTR_HI_OFFSET);
-	arc_cq_ptr = ((u64) hi) << 32 | lo;
-	arc_cq_ptr_size = RREG32(qman_base + QM_ARC_CQ_TSIZE_OFFSET);
+	u32 lo, hi, cq_ptr_size, cp_sts;
+	u64 cq_ptr, cp_current_inst;
+	bool is_arc_cq;
+
+	cp_sts = RREG32(qman_base + QM_CP_STS_4_OFFSET);
+	is_arc_cq = FIELD_GET(PDMA0_QM_CP_STS_CUR_CQ_MASK, cp_sts); /* 0 - legacy CQ, 1 - ARC_CQ */
+
+	if (is_arc_cq) {
+		lo = RREG32(qman_base + QM_ARC_CQ_PTR_LO_OFFSET);
+		hi = RREG32(qman_base + QM_ARC_CQ_PTR_HI_OFFSET);
+		cq_ptr = ((u64) hi) << 32 | lo;
+		cq_ptr_size = RREG32(qman_base + QM_ARC_CQ_TSIZE_OFFSET);
+	} else {
+		lo = RREG32(qman_base + QM_CQ_PTR_LO_4_OFFSET);
+		hi = RREG32(qman_base + QM_CQ_PTR_HI_4_OFFSET);
+		cq_ptr = ((u64) hi) << 32 | lo;
+		cq_ptr_size = RREG32(qman_base + QM_CQ_TSIZE_4_OFFSET);
+	}
 
 	lo = RREG32(qman_base + QM_CP_CURRENT_INST_LO_4_OFFSET);
 	hi = RREG32(qman_base + QM_CP_CURRENT_INST_HI_4_OFFSET);
 	cp_current_inst = ((u64) hi) << 32 | lo;
 
 	dev_info(hdev->dev,
-		"LowerQM. CQ: {ptr %#llx, size %u}, ARC_CQ: {ptr %#llx, size %u}, CP: {instruction %#llx}\n",
-		cq_ptr, cq_ptr_size, arc_cq_ptr, arc_cq_ptr_size, cp_current_inst);
+		"LowerQM. %sCQ: {ptr %#llx, size %u}, CP: {instruction %#llx}\n",
+		is_arc_cq ? "ARC_" : "", cq_ptr, cq_ptr_size, cp_current_inst);
 
 	if (event_mask & HL_NOTIFIER_EVENT_UNDEFINED_OPCODE) {
-		if (arc_cq_ptr) {
-			hdev->captured_err_info.undef_opcode.cq_addr = arc_cq_ptr;
-			hdev->captured_err_info.undef_opcode.cq_size = arc_cq_ptr_size;
-		} else {
-			hdev->captured_err_info.undef_opcode.cq_addr = cq_ptr;
-			hdev->captured_err_info.undef_opcode.cq_size = cq_ptr_size;
-		}
-
+		hdev->captured_err_info.undef_opcode.cq_addr = cq_ptr;
+		hdev->captured_err_info.undef_opcode.cq_size = cq_ptr_size;
 		hdev->captured_err_info.undef_opcode.stream_id = QMAN_STREAMS;
 	}
 }
diff --git a/drivers/accel/habanalabs/include/gaudi2/asic_reg/gaudi2_regs.h b/drivers/accel/habanalabs/include/gaudi2/asic_reg/gaudi2_regs.h
index a08378d0802b..8018214a7b59 100644
--- a/drivers/accel/habanalabs/include/gaudi2/asic_reg/gaudi2_regs.h
+++ b/drivers/accel/habanalabs/include/gaudi2/asic_reg/gaudi2_regs.h
@@ -250,6 +250,7 @@
 #define QM_ARC_CQ_PTR_HI_OFFSET	(mmPDMA0_QM_ARC_CQ_PTR_HI - mmPDMA0_QM_BASE)
 #define QM_ARC_CQ_TSIZE_OFFSET	(mmPDMA0_QM_ARC_CQ_TSIZE - mmPDMA0_QM_BASE)
 
+#define QM_CP_STS_4_OFFSET	(mmPDMA0_QM_CP_STS_4 - mmPDMA0_QM_BASE)
 #define QM_CP_CURRENT_INST_LO_4_OFFSET	(mmPDMA0_QM_CP_CURRENT_INST_LO_4 - mmPDMA0_QM_BASE)
 #define QM_CP_CURRENT_INST_HI_4_OFFSET	(mmPDMA0_QM_CP_CURRENT_INST_HI_4 - mmPDMA0_QM_BASE)
 

From patchwork Wed Nov 15 16:39:11 2023
X-Patchwork-Submitter: Oded Gabbay
X-Patchwork-Id: 13457102
From: Oded Gabbay
To: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org
Subject: [PATCH 09/10] accel/habanalabs: print error code when mapping fails
Date: Wed, 15 Nov 2023 18:39:11 +0200
Message-Id: <20231115163912.1243175-9-ogabbay@kernel.org>
In-Reply-To: <20231115163912.1243175-1-ogabbay@kernel.org>
References: <20231115163912.1243175-1-ogabbay@kernel.org>
Cc: Dani Liberman

From: Dani Liberman

Failure to map is considered a non-trivial error and we need to notify
the user about it.

Signed-off-by: Dani Liberman
Reviewed-by: Oded Gabbay
Signed-off-by: Oded Gabbay
---
 drivers/accel/habanalabs/common/memory.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/accel/habanalabs/common/memory.c b/drivers/accel/habanalabs/common/memory.c
index 0b8689fe0b64..3348ad12c237 100644
--- a/drivers/accel/habanalabs/common/memory.c
+++ b/drivers/accel/habanalabs/common/memory.c
@@ -955,8 +955,8 @@ static int map_phys_pg_pack(struct hl_ctx *ctx, u64 vaddr,
 				(i + 1) == phys_pg_pack->npages);
 		if (rc) {
 			dev_err(hdev->dev,
-				"map failed for handle %u, npages: %llu, mapped: %llu",
-				phys_pg_pack->handle, phys_pg_pack->npages,
+				"map failed (%d) for handle %u, npages: %llu, mapped: %llu\n",
+				rc, phys_pg_pack->handle, phys_pg_pack->npages,
 				mapped_pg_cnt);
 			goto err;
 		}
@@ -1186,7 +1186,8 @@ static int map_device_va(struct hl_ctx *ctx, struct hl_mem_in *args, u64 *device
 
 	rc = map_phys_pg_pack(ctx, ret_vaddr, phys_pg_pack);
 	if (rc) {
-		dev_err(hdev->dev, "mapping page pack failed for handle %u\n", handle);
+		dev_err(hdev->dev, "mapping page pack failed (%d) for handle %u\n",
+			rc, handle);
 		mutex_unlock(&hdev->mmu_lock);
 		goto map_err;
 	}

From patchwork Wed Nov 15 16:39:12 2023
X-Patchwork-Submitter: Oded Gabbay
X-Patchwork-Id: 13457100
From: Oded Gabbay
To: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org
Subject: [PATCH 10/10] accel/habanalabs: expose module id through sysfs
Date: Wed, 15 Nov 2023 18:39:12 +0200
Message-Id: <20231115163912.1243175-10-ogabbay@kernel.org>
In-Reply-To: <20231115163912.1243175-1-ogabbay@kernel.org>
References: <20231115163912.1243175-1-ogabbay@kernel.org>
Cc: Dani Liberman

From: Dani Liberman

Module ID exposes the physical location of the device in the
server, from the point of view of the devices, with regard to how they
are connected by the internal fabric.

This information is already exposed in our INFO ioctl, but there are
utilities and scripts running in data centers that already access sysfs
for topology information, and it is easier for them to keep getting that
information from sysfs instead of opening a file descriptor.

Signed-off-by: Dani Liberman
Reviewed-by: Oded Gabbay
Signed-off-by: Oded Gabbay
---
 Documentation/ABI/testing/sysfs-driver-habanalabs |  6 ++++++
 drivers/accel/habanalabs/common/sysfs.c           | 10 ++++++++++
 2 files changed, 16 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-driver-habanalabs b/Documentation/ABI/testing/sysfs-driver-habanalabs
index c63ca1ad500d..89fe3b09d4ad 100644
--- a/Documentation/ABI/testing/sysfs-driver-habanalabs
+++ b/Documentation/ABI/testing/sysfs-driver-habanalabs
@@ -149,6 +149,12 @@ Contact: ogabbay@kernel.org
 Description:    Displays the current clock frequency, in Hz, of the MME compute
                 engine. This property is valid only for the Goya ASIC family
 
+What:           /sys/class/accel/accel/device/module_id
+Date:           Nov 2023
+KernelVersion:  not yet upstreamed
+Contact:        ogabbay@kernel.org
+Description:    Displays the device's module id
+
 What:           /sys/class/accel/accel/device/pci_addr
 Date:           Jan 2019
 KernelVersion:  5.1
diff --git a/drivers/accel/habanalabs/common/sysfs.c b/drivers/accel/habanalabs/common/sysfs.c
index 278606373055..8d2164691d81 100644
--- a/drivers/accel/habanalabs/common/sysfs.c
+++ b/drivers/accel/habanalabs/common/sysfs.c
@@ -386,6 +386,14 @@ static ssize_t security_enabled_show(struct device *dev,
 	return sprintf(buf, "%d\n", hdev->asic_prop.fw_security_enabled);
 }
 
+static ssize_t module_id_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return sprintf(buf, "%u\n", le32_to_cpu(hdev->asic_prop.cpucp_info.card_location));
+}
+
 static DEVICE_ATTR_RO(armcp_kernel_ver);
 static DEVICE_ATTR_RO(armcp_ver);
 static DEVICE_ATTR_RO(cpld_ver);
@@ -405,6 +413,7 @@ static DEVICE_ATTR_RO(thermal_ver);
 static DEVICE_ATTR_RO(uboot_ver);
 static DEVICE_ATTR_RO(fw_os_ver);
 static DEVICE_ATTR_RO(security_enabled);
+static DEVICE_ATTR_RO(module_id);
 
 static struct bin_attribute bin_attr_eeprom = {
 	.attr = {.name = "eeprom", .mode = (0444)},
@@ -430,6 +439,7 @@ static struct attribute *hl_dev_attrs[] = {
 	&dev_attr_uboot_ver.attr,
 	&dev_attr_fw_os_ver.attr,
 	&dev_attr_security_enabled.attr,
+	&dev_attr_module_id.attr,
 	NULL,
 };
 