From patchwork Mon May 1 09:47:49 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13227432 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 2134EC77B73 for ; Mon, 1 May 2023 09:48:13 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id D90BE10E2F4; Mon, 1 May 2023 09:48:04 +0000 (UTC) Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by gabe.freedesktop.org (Postfix) with ESMTPS id 3DF3310E223 for ; Mon, 1 May 2023 09:48:03 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id E9A5C60B9E; Mon, 1 May 2023 09:48:00 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 7C111C433EF; Mon, 1 May 2023 09:47:59 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1682934480; bh=nQELu0P7nNiZdfkp3nnHF/gTwZ62dDbBTJqXinFtWRo=; h=From:To:Cc:Subject:Date:From; b=EmeZDxLo94dvDbrzwIvVEDRpqda3T8E1EhNI2YthbIJT4Tn8x08qljX14N2R7aWYe PWoCYhsxT4uoic1k2viEKcIkLAN984O6eEPlnbiadLWwmndPnI1o/IRcBC1CvQeVmc Mu9wz8xFGXuQi3D6cdEeNJ5I3Ia4ucl6HBFNF7dPLr+2SK92MhRMl3mDV1mKs+Fnbk wr/ZuRL7j1VgKCuojm9mw8PxklRC2RCErT/w+CcTDUqRc181CtKSH1oPK64Hv7QmF0 WOun0OVJgAsuORCd2amPgnI2ZsSqb19ORa2UH8IESvoCjFeRhAgHQj2XrGwbSBRDPp l0dl8dxGtvzKg== From: Oded Gabbay To: dri-devel@lists.freedesktop.org Subject: [PATCH 1/6] accel/habanalabs: add missing tpc interrupt info Date: Mon, 1 May 2023 12:47:49 +0300 Message-Id: <20230501094754.100030-1-ogabbay@kernel.org> X-Mailer: git-send-email 2.40.1 MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Dafna Hirschfeld Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Dafna Hirschfeld For some reason the last possible tpc interrupt cause in gaudi2_tpc_interrupts_cause is missing from the code. Signed-off-by: Dafna Hirschfeld Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- drivers/accel/habanalabs/gaudi2/gaudi2.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/accel/habanalabs/gaudi2/gaudi2.c b/drivers/accel/habanalabs/gaudi2/gaudi2.c index 5c80e7af5b78..058741698d71 100644 --- a/drivers/accel/habanalabs/gaudi2/gaudi2.c +++ b/drivers/accel/habanalabs/gaudi2/gaudi2.c @@ -63,7 +63,7 @@ #define GAUDI2_NUM_OF_CPU_SEI_ERR_CAUSE 3 #define GAUDI2_NUM_OF_QM_SEI_ERR_CAUSE 2 #define GAUDI2_NUM_OF_ROT_ERR_CAUSE 22 -#define GAUDI2_NUM_OF_TPC_INTR_CAUSE 30 +#define GAUDI2_NUM_OF_TPC_INTR_CAUSE 31 #define GAUDI2_NUM_OF_DEC_ERR_CAUSE 25 #define GAUDI2_NUM_OF_MME_ERR_CAUSE 16 #define GAUDI2_NUM_OF_MME_SBTE_ERR_CAUSE 5 @@ -891,6 +891,7 @@ static const char * const gaudi2_tpc_interrupts_cause[GAUDI2_NUM_OF_TPC_INTR_CAU "invalid_lock_access", "LD_L protection violation", "ST_L protection violation", + "D$ L0CS mismatch", }; static const char * const guadi2_mme_error_cause[GAUDI2_NUM_OF_MME_ERR_CAUSE] = { From patchwork Mon May 1 09:47:50 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13227431 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id AB5C4C77B61 for ; Mon, 1 May 2023 09:48:06 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id ADF3510E25E; Mon, 1 May 2023 09:48:04 +0000 (UTC) Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by gabe.freedesktop.org (Postfix) with ESMTPS id 6C31210E223 for ; Mon, 1 May 2023 09:48:03 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 58ED160ACF; Mon, 1 May 2023 09:48:02 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id DEF2BC4339B; Mon, 1 May 2023 09:48:00 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1682934481; bh=qTX/9lCiVwISyLdcD/wOaAxdWt30rB2llfGmYTLEzYA=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=gVU7nZQezkf5psn002TJdw14o6use20M1fV3S7Vv+Leg0HqUaK1f2T+n2LM/wNxIF A/48n3EoB3jcQjcy0Npy05/vZ7qOF7EoL6Ht+Abb1SszZc4iFpmTxANrwV4grE165p U5KSYC7JqyHgQeMkfZqcFIg0WcrJAyzDabg9LdCA1cnuuazwhzhu7yB3ZxfJL/FUdn rPWsjohohQzgkeryZEXCYalqiyefDv015oOIv70W7xjg0DG7jDffIbjSuIdHht5yP4 MIpzlxbPpaRE/OayHItKY97yXeCnwf0IBt4IvUMPVvsLpXXO3ViFFGuVpT77bu44Zv 0Mj0K70ZOoQdQ== From: Oded Gabbay To: dri-devel@lists.freedesktop.org Subject: [PATCH 2/6] accel/habanalabs: add pci health check during heartbeat Date: Mon, 1 May 2023 12:47:50 +0300 Message-Id: <20230501094754.100030-2-ogabbay@kernel.org> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230501094754.100030-1-ogabbay@kernel.org> References: <20230501094754.100030-1-ogabbay@kernel.org> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Ofir Bitton Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Ofir Bitton Currently upon a heartbeat failure, we don't know if the failure is due to firmware hang or due to a bad PCI link. Hence, we are reading a PCI config space register with a known value (vendor ID) so we will know which of the two possibilities caused the heartbeat failure. Signed-off-by: Ofir Bitton Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- drivers/accel/habanalabs/common/device.c | 15 ++++++++++++++- drivers/accel/habanalabs/common/habanalabs.h | 2 ++ drivers/accel/habanalabs/common/habanalabs_drv.c | 2 -- 3 files changed, 16 insertions(+), 3 deletions(-) diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c index c8f4a34d2831..98b46b2e1898 100644 --- a/drivers/accel/habanalabs/common/device.c +++ b/drivers/accel/habanalabs/common/device.c @@ -981,6 +981,18 @@ static void device_early_fini(struct hl_device *hdev) hdev->asic_funcs->early_fini(hdev); } +static bool is_pci_link_healthy(struct hl_device *hdev) +{ + u16 vendor_id; + + if (!hdev->pdev) + return false; + + pci_read_config_word(hdev->pdev, PCI_VENDOR_ID, &vendor_id); + + return (vendor_id == PCI_VENDOR_ID_HABANALABS); +} + static void hl_device_heartbeat(struct work_struct *work) { struct hl_device *hdev = container_of(work, struct hl_device, @@ -995,7 +1007,8 @@ static void hl_device_heartbeat(struct work_struct *work) goto reschedule; if (hl_device_operational(hdev, NULL)) - dev_err(hdev->dev, "Device heartbeat failed!\n"); + dev_err(hdev->dev, "Device heartbeat failed! PCI link is %s\n", + is_pci_link_healthy(hdev) ? "healthy" : "broken"); info.err_type = HL_INFO_FW_HEARTBEAT_ERR; info.event_mask = &event_mask; diff --git a/drivers/accel/habanalabs/common/habanalabs.h b/drivers/accel/habanalabs/common/habanalabs.h index f83ea96c6530..ad412cc01aba 100644 --- a/drivers/accel/habanalabs/common/habanalabs.h +++ b/drivers/accel/habanalabs/common/habanalabs.h @@ -36,6 +36,8 @@ struct hl_device; struct hl_fpriv; +#define PCI_VENDOR_ID_HABANALABS 0x1da3 + /* Use upper bits of mmap offset to store habana driver specific information. * bits[63:59] - Encode mmap type * bits[45:0] - mmap offset value diff --git a/drivers/accel/habanalabs/common/habanalabs_drv.c b/drivers/accel/habanalabs/common/habanalabs_drv.c index a4b3f50f1cba..ea80638fc6b9 100644 --- a/drivers/accel/habanalabs/common/habanalabs_drv.c +++ b/drivers/accel/habanalabs/common/habanalabs_drv.c @@ -54,8 +54,6 @@ module_param(boot_error_status_mask, ulong, 0444); MODULE_PARM_DESC(boot_error_status_mask, "Mask of the error status during device CPU boot (If bitX is cleared then error X is masked. Default all 1's)"); -#define PCI_VENDOR_ID_HABANALABS 0x1da3 - #define PCI_IDS_GOYA 0x0001 #define PCI_IDS_GAUDI 0x1000 #define PCI_IDS_GAUDI_SEC 0x1010 From patchwork Mon May 1 09:47:51 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13227434 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 95CF5C77B73 for ; Mon, 1 May 2023 09:48:18 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 43F3F10E315; Mon, 1 May 2023 09:48:07 +0000 (UTC) Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by gabe.freedesktop.org (Postfix) with ESMTPS id 98F0510E223 for ; Mon, 1 May 2023 09:48:04 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id F3FF260C12; Mon, 1 May 2023 09:48:03 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 4CFFEC433EF; Mon, 1 May 2023 09:48:02 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1682934483; bh=ja+ZCwMRg9UzW7YI9W2BAIW77mgPcJJMcMcDfaGWTIQ=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=TjkJG0FKB8XYZzS+cCJZXAWA/J/Fj+JhbB4fVVeY1NizBaCOMzBC12CGdqXOLou7b Uwjd3gntWfbpaLR8A9JUChk9gkL65g1iGvkfEDZpfyvROUuFMXrzmfz4yUumOGNses eiJzVOK2qj8WFxP2zovLc1X7SqvmWOrwZUEoUaIB+ICMEQIkMt3jJbn/4gf21Om7vP wbFJxckL0yQex0bkujMLiXMyxbdXWXF5P3PFH/zDeip/VAm/0r6MThAXKqASsqnsnO tXx9josyLQ+uCsQ9R8yTR8vp9DDZVlD5lo+1m65hAVCDXMCtISkJ3zM17BO6Jyd1aE gHS5On4B96ixA== From: Oded Gabbay To: dri-devel@lists.freedesktop.org Subject: [PATCH 3/6] accel/habanalabs: expose debugfs files later Date: Mon, 1 May 2023 12:47:51 +0300 Message-Id: <20230501094754.100030-3-ogabbay@kernel.org> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230501094754.100030-1-ogabbay@kernel.org> References: <20230501094754.100030-1-ogabbay@kernel.org> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Tomer Tayar Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Tomer Tayar Currently the debugfs root folder and files for a device are created at an early step, before the device initialization and before the char device and sysfs files are exposed to user. As there is no real reason not to do it together with the device creation, postpone it to be done right afterwards. The initialization of the debugfs entry structure is left in its current position because it is used before creating the files. Signed-off-by: Tomer Tayar Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- drivers/accel/habanalabs/common/debugfs.c | 37 +++++++----- drivers/accel/habanalabs/common/device.c | 62 +++++++++++--------- drivers/accel/habanalabs/common/habanalabs.h | 6 +- 3 files changed, 61 insertions(+), 44 deletions(-) diff --git a/drivers/accel/habanalabs/common/debugfs.c b/drivers/accel/habanalabs/common/debugfs.c index 22dd17c077c0..29b78c7cf5de 100644 --- a/drivers/accel/habanalabs/common/debugfs.c +++ b/drivers/accel/habanalabs/common/debugfs.c @@ -1756,17 +1756,15 @@ static void add_files_to_device(struct hl_device *hdev, struct hl_dbg_device_ent } } -void hl_debugfs_add_device(struct hl_device *hdev) +int hl_debugfs_device_init(struct hl_device *hdev) { struct hl_dbg_device_entry *dev_entry = &hdev->hl_debugfs; int count = ARRAY_SIZE(hl_debugfs_list); dev_entry->hdev = hdev; - dev_entry->entry_arr = kmalloc_array(count, - sizeof(struct hl_debugfs_entry), - GFP_KERNEL); + dev_entry->entry_arr = kmalloc_array(count, sizeof(struct hl_debugfs_entry), GFP_KERNEL); if (!dev_entry->entry_arr) - return; + return -ENOMEM; dev_entry->data_dma_blob_desc.size = 0; dev_entry->data_dma_blob_desc.data = NULL; @@ -1787,21 +1785,14 @@ void hl_debugfs_add_device(struct hl_device *hdev) spin_lock_init(&dev_entry->userptr_spinlock); mutex_init(&dev_entry->ctx_mem_hash_mutex); - dev_entry->root = debugfs_create_dir(dev_name(hdev->dev), - hl_debug_root); - - add_files_to_device(hdev, dev_entry, dev_entry->root); - if (!hdev->asic_prop.fw_security_enabled) - add_secured_nodes(dev_entry, dev_entry->root); + return 0; } -void hl_debugfs_remove_device(struct hl_device *hdev) +void hl_debugfs_device_fini(struct hl_device *hdev) { struct hl_dbg_device_entry *entry = &hdev->hl_debugfs; int i; - debugfs_remove_recursive(entry->root); - mutex_destroy(&entry->ctx_mem_hash_mutex); mutex_destroy(&entry->file_mutex); @@ -1814,6 +1805,24 @@ void hl_debugfs_remove_device(struct hl_device *hdev) kfree(entry->entry_arr); } +void hl_debugfs_add_device(struct hl_device *hdev) +{ + struct hl_dbg_device_entry *dev_entry = &hdev->hl_debugfs; + + dev_entry->root = debugfs_create_dir(dev_name(hdev->dev), hl_debug_root); + + add_files_to_device(hdev, dev_entry, dev_entry->root); + if (!hdev->asic_prop.fw_security_enabled) + add_secured_nodes(dev_entry, dev_entry->root); +} + +void hl_debugfs_remove_device(struct hl_device *hdev) +{ + struct hl_dbg_device_entry *entry = &hdev->hl_debugfs; + + debugfs_remove_recursive(entry->root); +} + void hl_debugfs_add_file(struct hl_fpriv *hpriv) { struct hl_dbg_device_entry *dev_entry = &hpriv->hdev->hl_debugfs; diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c index 98b46b2e1898..cab5a63db8c1 100644 --- a/drivers/accel/habanalabs/common/device.c +++ b/drivers/accel/habanalabs/common/device.c @@ -674,7 +674,7 @@ static int device_init_cdev(struct hl_device *hdev, struct class *class, return 0; } -static int device_cdev_sysfs_add(struct hl_device *hdev) +static int cdev_sysfs_debugfs_add(struct hl_device *hdev) { int rc; @@ -699,7 +699,9 @@ static int device_cdev_sysfs_add(struct hl_device *hdev) goto delete_ctrl_cdev_device; } - hdev->cdev_sysfs_created = true; + hl_debugfs_add_device(hdev); + + hdev->cdev_sysfs_debugfs_created = true; return 0; @@ -710,11 +712,12 @@ static int device_cdev_sysfs_add(struct hl_device *hdev) return rc; } -static void device_cdev_sysfs_del(struct hl_device *hdev) +static void cdev_sysfs_debugfs_remove(struct hl_device *hdev) { - if (!hdev->cdev_sysfs_created) + if (!hdev->cdev_sysfs_debugfs_created) goto put_devices; + hl_debugfs_remove_device(hdev); hl_sysfs_fini(hdev); cdev_device_del(&hdev->cdev_ctrl, hdev->dev_ctrl); cdev_device_del(&hdev->cdev, hdev->dev); @@ -2054,7 +2057,7 @@ static int create_cdev(struct hl_device *hdev) int hl_device_init(struct hl_device *hdev) { int i, rc, cq_cnt, user_interrupt_cnt, cq_ready_cnt; - bool add_cdev_sysfs_on_err = false; + bool expose_interfaces_on_err = false; rc = create_cdev(hdev); if (rc) @@ -2170,16 +2173,22 @@ int hl_device_init(struct hl_device *hdev) hdev->device_release_watchdog_timeout_sec = HL_DEVICE_RELEASE_WATCHDOG_TIMEOUT_SEC; hdev->memory_scrub_val = MEM_SCRUB_DEFAULT_VAL; - hl_debugfs_add_device(hdev); - /* debugfs nodes are created in hl_ctx_init so it must be called after - * hl_debugfs_add_device. + rc = hl_debugfs_device_init(hdev); + if (rc) { + dev_err(hdev->dev, "failed to initialize debugfs entry structure\n"); + kfree(hdev->kernel_ctx); + goto mmu_fini; + } + + /* The debugfs entry structure is accessed in hl_ctx_init(), so it must be called after + * hl_debugfs_device_init(). */ rc = hl_ctx_init(hdev, hdev->kernel_ctx, true); if (rc) { dev_err(hdev->dev, "failed to initialize kernel context\n"); kfree(hdev->kernel_ctx); - goto remove_device_from_debugfs; + goto debugfs_device_fini; } rc = hl_cb_pool_init(hdev); @@ -2195,11 +2204,10 @@ int hl_device_init(struct hl_device *hdev) } /* - * From this point, override rc (=0) in case of an error to allow - * debugging (by adding char devices and create sysfs nodes as part of - * the error flow). + * From this point, override rc (=0) in case of an error to allow debugging + * (by adding char devices and creating sysfs/debugfs files as part of the error flow). */ - add_cdev_sysfs_on_err = true; + expose_interfaces_on_err = true; /* Device is now enabled as part of the initialization requires * communication with the device firmware to get information that @@ -2241,15 +2249,13 @@ int hl_device_init(struct hl_device *hdev) } /* - * Expose devices and sysfs nodes to user. - * From here there is no need to add char devices and create sysfs nodes - * in case of an error. + * Expose devices and sysfs/debugfs files to user. + * From here there is no need to expose them in case of an error. */ - add_cdev_sysfs_on_err = false; - rc = device_cdev_sysfs_add(hdev); + expose_interfaces_on_err = false; + rc = cdev_sysfs_debugfs_add(hdev); if (rc) { - dev_err(hdev->dev, - "Failed to add char devices and sysfs nodes\n"); + dev_err(hdev->dev, "Failed to add char devices and sysfs/debugfs files\n"); rc = 0; goto out_disabled; } @@ -2295,8 +2301,8 @@ int hl_device_init(struct hl_device *hdev) if (hl_ctx_put(hdev->kernel_ctx) != 1) dev_err(hdev->dev, "kernel ctx is still alive on initialization failure\n"); -remove_device_from_debugfs: - hl_debugfs_remove_device(hdev); +debugfs_device_fini: + hl_debugfs_device_fini(hdev); mmu_fini: hl_mmu_fini(hdev); eq_fini: @@ -2320,8 +2326,8 @@ int hl_device_init(struct hl_device *hdev) put_device(hdev->dev); out_disabled: hdev->disabled = true; - if (add_cdev_sysfs_on_err) - device_cdev_sysfs_add(hdev); + if (expose_interfaces_on_err) + cdev_sysfs_debugfs_add(hdev); if (hdev->pdev) dev_err(&hdev->pdev->dev, "Failed to initialize hl%d. Device %s is NOT usable !\n", @@ -2447,8 +2453,6 @@ void hl_device_fini(struct hl_device *hdev) if ((hdev->kernel_ctx) && (hl_ctx_put(hdev->kernel_ctx) != 1)) dev_err(hdev->dev, "kernel ctx is still alive\n"); - hl_debugfs_remove_device(hdev); - hl_dec_fini(hdev); hl_vm_fini(hdev); @@ -2473,8 +2477,10 @@ void hl_device_fini(struct hl_device *hdev) device_early_fini(hdev); - /* Hide devices and sysfs nodes from user */ - device_cdev_sysfs_del(hdev); + /* Hide devices and sysfs/debugfs files from user */ + cdev_sysfs_debugfs_remove(hdev); + + hl_debugfs_device_fini(hdev); pr_info("removed device successfully\n"); } diff --git a/drivers/accel/habanalabs/common/habanalabs.h b/drivers/accel/habanalabs/common/habanalabs.h index ad412cc01aba..ea0914d08bdc 100644 --- a/drivers/accel/habanalabs/common/habanalabs.h +++ b/drivers/accel/habanalabs/common/habanalabs.h @@ -3292,7 +3292,7 @@ struct hl_reset_info { * @in_debug: whether the device is in a state where the profiling/tracing infrastructure * can be used. This indication is needed because in some ASICs we need to do * specific operations to enable that infrastructure. - * @cdev_sysfs_created: were char devices and sysfs nodes created. + * @cdev_sysfs_debugfs_created: were char devices and sysfs/debugfs files created. * @stop_on_err: true if engines should stop on error. * @supports_sync_stream: is sync stream supported. * @sync_stream_queue_idx: helper index for sync stream queues initialization. @@ -3459,7 +3459,7 @@ struct hl_device { u8 init_done; u8 device_cpu_disabled; u8 in_debug; - u8 cdev_sysfs_created; + u8 cdev_sysfs_debugfs_created; u8 stop_on_err; u8 supports_sync_stream; u8 sync_stream_queue_idx; @@ -3978,6 +3978,8 @@ void hl_handle_fw_err(struct hl_device *hdev, struct hl_info_fw_err_info *info); void hl_debugfs_init(void); void hl_debugfs_fini(void); +int hl_debugfs_device_init(struct hl_device *hdev); +void hl_debugfs_device_fini(struct hl_device *hdev); void hl_debugfs_add_device(struct hl_device *hdev); void hl_debugfs_remove_device(struct hl_device *hdev); void hl_debugfs_add_file(struct hl_fpriv *hpriv); From patchwork Mon May 1 09:47:52 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13227433 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 3CED9C77B73 for ; Mon, 1 May 2023 09:48:16 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id EDF1F10E223; Mon, 1 May 2023 09:48:06 +0000 (UTC) Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by gabe.freedesktop.org (Postfix) with ESMTPS id F000B10E317 for ; Mon, 1 May 2023 09:48:05 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 5C86060B9E; Mon, 1 May 2023 09:48:05 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id E7A8BC4339B; Mon, 1 May 2023 09:48:03 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1682934484; bh=NMm0YcU+N5c7ZMiwUlWzPWemYIlkFYmYd+Og7A9ZcFU=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=XEMzaU5fy8nzJiPJfIZ/mU4zivqxHMVGwPAndtecZZFR2/xw7FezA+UuCvwzhYwth 8roYffr0cZlMQiykmU86Rh6zwvp6I/fqvaExs3wMaTjgErCQ1LrH2LjVcJC0uH0gEa m8Tk3nY0qKjs0l/yhMtiD/8e9OCWMtq7lB3GD3Wy4GksDdey+JYTLelpJ7H9ungaxz NIF5S68UoyQP6He6p0wIDmaTZy0zW1z5HVCNlnCGtn/X0kw+NnAFbjBo2uPe3NoWzG W5Wq5i1L2zLaU9RpnSYbFC78+BRZps5/hHdVLbuVE/SyIL5ysVnd67ReWa6eL9uCPV eZY+rcMhOv6Bw== From: Oded Gabbay To: dri-devel@lists.freedesktop.org Subject: [PATCH 4/6] accel/habanalabs: poll for device status update following WFE cmd Date: Mon, 1 May 2023 12:47:52 +0300 Message-Id: <20230501094754.100030-4-ogabbay@kernel.org> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230501094754.100030-1-ogabbay@kernel.org> References: <20230501094754.100030-1-ogabbay@kernel.org> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Koby Elbaz Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Koby Elbaz Currently, we rely on COMMS protocol's ack to verify that WFE command has been acknowledged by the FW. However, this does not guarantee that the device status has been updated. Although unlikely, this could trigger a race since the driver expects the device to be halted at that stage, but it might not be. Therefore, we increase WFE's robustness by polling on the status register that will be updated once the device is actually halted. Signed-off-by: Koby Elbaz Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- drivers/accel/habanalabs/common/firmware_if.c | 28 +++++++++++++++---- 1 file changed, 23 insertions(+), 5 deletions(-) diff --git a/drivers/accel/habanalabs/common/firmware_if.c b/drivers/accel/habanalabs/common/firmware_if.c index e48f024d8649..eb51d7f70aec 100644 --- a/drivers/accel/habanalabs/common/firmware_if.c +++ b/drivers/accel/habanalabs/common/firmware_if.c @@ -1368,8 +1368,10 @@ void hl_fw_ask_hard_reset_without_linux(struct hl_device *hdev) void hl_fw_ask_halt_machine_without_linux(struct hl_device *hdev) { - struct static_fw_load_mgr *static_loader = - &hdev->fw_loader.static_loader; + struct fw_load_mgr *fw_loader = &hdev->fw_loader; + u32 status, cpu_boot_status_reg, cpu_timeout; + struct static_fw_load_mgr *static_loader; + struct pre_fw_load_props *pre_fw_load; int rc; if (hdev->device_cpu_is_halted) @@ -1377,12 +1379,28 @@ void hl_fw_ask_halt_machine_without_linux(struct hl_device *hdev) /* Stop device CPU to make sure nothing bad happens */ if (hdev->asic_prop.dynamic_fw_load) { + pre_fw_load = &fw_loader->pre_fw_load; + cpu_timeout = fw_loader->cpu_timeout; + cpu_boot_status_reg = pre_fw_load->cpu_boot_status_reg; + rc = hl_fw_dynamic_send_protocol_cmd(hdev, &hdev->fw_loader, - COMMS_GOTO_WFE, 0, false, - hdev->fw_loader.cpu_timeout); - if (rc) + COMMS_GOTO_WFE, 0, false, cpu_timeout); + if (rc) { dev_err(hdev->dev, "Failed sending COMMS_GOTO_WFE\n"); + } else { + rc = hl_poll_timeout( + hdev, + cpu_boot_status_reg, + status, + status == CPU_BOOT_STATUS_IN_WFE, + hdev->fw_poll_interval_usec, + cpu_timeout); + if (rc) + dev_err(hdev->dev, "Current status=%u. Timed-out updating to WFE\n", + status); + } } else { + static_loader = &hdev->fw_loader.static_loader; WREG32(static_loader->kmd_msg_to_cpu_reg, KMD_MSG_GOTO_WFE); msleep(static_loader->cpu_reset_wait_msec); From patchwork Mon May 1 09:47:53 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13227435 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id BD560C77B61 for ; Mon, 1 May 2023 09:48:20 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 4D0DB10E317; Mon, 1 May 2023 09:48:09 +0000 (UTC) Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by gabe.freedesktop.org (Postfix) with ESMTPS id 6A77910E317 for ; Mon, 1 May 2023 09:48:07 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id C5F0260C12; Mon, 1 May 2023 09:48:06 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 557FFC433EF; Mon, 1 May 2023 09:48:05 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1682934486; bh=v9WI2lR65DIK5tBKAXgoWcIG3gqL4iEel8mPIC4USB4=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=OOejkdc2SYVFmdMU1kScnqYHKFNFSSzvj4JQW0iaTp11ecqd8btwxK2gUu1HsZqbf s4W0hpXq79ZFc0r84LbmdQgQa1Z50lsUAADe4Eu28WIZTroXgHyK2oUSpBpFFI1aaB qvkvy4gtfLnPjDYsi6fvOROnGiaLNMaqR2FcMuXjdDcL6plOiTS6Z50Ercv51gedh9 HR7CRu/vvc2eql6jnOlQz9SIGXOaO7yfbEROR7+LtuLOLHWK5uj56Z+5ZFMATslomr pUAlWtGssrPN1+mgcuTaYEfWdBh2kDMpvH8VIJ7m6qyQc07KeIzvLEqm/YHbyLqAls 7Utl3pClE/mww== From: Oded Gabbay To: dri-devel@lists.freedesktop.org Subject: [PATCH 5/6] accel/habanalabs: fix a static warning - 'dubious: x & !y' Date: Mon, 1 May 2023 12:47:53 +0300 Message-Id: <20230501094754.100030-5-ogabbay@kernel.org> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230501094754.100030-1-ogabbay@kernel.org> References: <20230501094754.100030-1-ogabbay@kernel.org> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Koby Elbaz Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Koby Elbaz Use a straight forward approach to get a conditional result. Signed-off-by: Koby Elbaz Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- drivers/accel/habanalabs/gaudi2/gaudi2.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/accel/habanalabs/gaudi2/gaudi2.c b/drivers/accel/habanalabs/gaudi2/gaudi2.c index 058741698d71..240fecfab608 100644 --- a/drivers/accel/habanalabs/gaudi2/gaudi2.c +++ b/drivers/accel/habanalabs/gaudi2/gaudi2.c @@ -4527,7 +4527,7 @@ static int gaudi2_set_tpc_engine_mode(struct hl_device *hdev, u32 engine_id, u32 reg_base = gaudi2_tpc_cfg_blocks_bases[tpc_id]; reg_addr = reg_base + TPC_CFG_STALL_OFFSET; reg_val = FIELD_PREP(DCORE0_TPC0_CFG_TPC_STALL_V_MASK, - !!(engine_command == HL_ENGINE_STALL)); + (engine_command == HL_ENGINE_STALL) ? 1 : 0); WREG32(reg_addr, reg_val); if (engine_command == HL_ENGINE_RESUME) { @@ -4551,7 +4551,7 @@ static int gaudi2_set_mme_engine_mode(struct hl_device *hdev, u32 engine_id, u32 reg_base = gaudi2_mme_ctrl_lo_blocks_bases[mme_id]; reg_addr = reg_base + MME_CTRL_LO_QM_STALL_OFFSET; reg_val = FIELD_PREP(DCORE0_MME_CTRL_LO_QM_STALL_V_MASK, - !!(engine_command == HL_ENGINE_STALL)); + (engine_command == HL_ENGINE_STALL) ? 1 : 0); WREG32(reg_addr, reg_val); return 0; @@ -4572,7 +4572,7 @@ static int gaudi2_set_edma_engine_mode(struct hl_device *hdev, u32 engine_id, u3 reg_base = gaudi2_dma_core_blocks_bases[edma_id]; reg_addr = reg_base + EDMA_CORE_CFG_STALL_OFFSET; reg_val = FIELD_PREP(DCORE0_EDMA0_CORE_CFG_1_HALT_MASK, - !!(engine_command == HL_ENGINE_STALL)); + (engine_command == HL_ENGINE_STALL) ? 1 : 0); WREG32(reg_addr, reg_val); if (engine_command == HL_ENGINE_STALL) { From patchwork Mon May 1 09:47:54 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13227436 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id EC847C7EE21 for ; Mon, 1 May 2023 09:48:22 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 16CE010E28F; Mon, 1 May 2023 09:48:11 +0000 (UTC) Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by gabe.freedesktop.org (Postfix) with ESMTPS id C9F3110E317 for ; Mon, 1 May 2023 09:48:08 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 32D2A60B9E; Mon, 1 May 2023 09:48:08 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id B7A8DC4339B; Mon, 1 May 2023 09:48:06 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1682934487; bh=LpsEPzX3v5sPO3o34me3QulIAMOh6n5Q5j+IzRCfXsI=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=V0GEflYR0lwokFsW9TR+8lXlzMFN6nu8UpdXcUsel9dQB50bg+ZLEedVSFAPUQcVJ GE/OORIgIc7/B0jTtmxJgwavIW+HbTmVEeDZEL6YH7RCDIHLYZMGfcTVS3V9Zu1v25 sLTizrOHYxjPFQSjTcZClkYWdbX+AiKHODPJRzS/PTd6wek8k4B7wqFdBiuXCJ12tB TirgFq82oTovLuEqMfrENyBKMDPbInR7FfeL2DylvJO0yNdMZcr+55TSWXcarsvaJa WFPKfGCJ5AbwZNYrnntgbXQ8mhYZUpoDYeIFhMHsrD5xDhf8IKbc9Om6dJoUX+DcOU 2RL901WOaghuA== From: Oded Gabbay To: dri-devel@lists.freedesktop.org Subject: [PATCH 6/6] accel/habanalabs: always fetch pci addr_dec error info Date: Mon, 1 May 2023 12:47:54 +0300 Message-Id: <20230501094754.100030-6-ogabbay@kernel.org> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230501094754.100030-1-ogabbay@kernel.org> References: <20230501094754.100030-1-ogabbay@kernel.org> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Ofir Bitton Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Ofir Bitton Due to missing indication of address decode source (LBW/HBW bus), we should always try and fetch extended information. Signed-off-by: Ofir Bitton Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- drivers/accel/habanalabs/gaudi2/gaudi2.c | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/drivers/accel/habanalabs/gaudi2/gaudi2.c b/drivers/accel/habanalabs/gaudi2/gaudi2.c index 240fecfab608..d21ef9997d05 100644 --- a/drivers/accel/habanalabs/gaudi2/gaudi2.c +++ b/drivers/accel/habanalabs/gaudi2/gaudi2.c @@ -8892,14 +8892,12 @@ static int gaudi2_print_pcie_addr_dec_info(struct hl_device *hdev, u16 event_typ "err cause: %s", gaudi2_pcie_addr_dec_error_cause[i]); error_count++; - switch (intr_cause_data & BIT_ULL(i)) { - case PCIE_WRAP_PCIE_IC_SEI_INTR_IND_AXI_LBW_ERR_INTR_MASK: - hl_check_for_glbl_errors(hdev); - break; - case PCIE_WRAP_PCIE_IC_SEI_INTR_IND_BAD_ACCESS_INTR_MASK: - gaudi2_print_pcie_mstr_rr_mstr_if_razwi_info(hdev, event_mask); - break; - } + /* + * Always check for LBW and HBW additional info as the indication itself is + * sometimes missing + */ + hl_check_for_glbl_errors(hdev); + gaudi2_print_pcie_mstr_rr_mstr_if_razwi_info(hdev, event_mask); } return error_count;