From patchwork Wed Jun 19 06:34:17 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ofir Bitton
X-Patchwork-Id: 13703444
From: Ofir Bitton
To: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org
Subject: [PATCH 1/9] accel/habanalabs/gaudi2: reduce interrupt count to 128
Date: Wed, 19 Jun 2024 09:34:17 +0300
Message-Id: <20240619063425.1377327-1-obitton@habana.ai>
X-Mailer: git-send-email 2.34.1
List-Id: Direct Rendering Infrastructure - Development

Some systems allow at most 128 MSI-X interrupts. Hence, reduce the
interrupt count from 512 to 128.

Signed-off-by: Ofir Bitton
Reviewed-by: Ofir Bitton
Reviewed-by: Tomer Tayar
---
 drivers/accel/habanalabs/gaudi2/gaudi2P.h        | 8 ++++----
 drivers/accel/habanalabs/include/gaudi2/gaudi2.h | 4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/accel/habanalabs/gaudi2/gaudi2P.h b/drivers/accel/habanalabs/gaudi2/gaudi2P.h
index eee41387b269..05117272cac7 100644
--- a/drivers/accel/habanalabs/gaudi2/gaudi2P.h
+++ b/drivers/accel/habanalabs/gaudi2/gaudi2P.h
@@ -384,7 +384,7 @@ enum gaudi2_edma_id {
 /* User interrupt count is aligned with HW CQ count.
  * We have 64 CQ's per dcore, CQ0 in dcore 0 is reserved for legacy mode
  */
-#define GAUDI2_NUM_USER_INTERRUPTS 255
+#define GAUDI2_NUM_USER_INTERRUPTS 64
 #define GAUDI2_NUM_RESERVED_INTERRUPTS 1
 #define GAUDI2_TOTAL_USER_INTERRUPTS (GAUDI2_NUM_USER_INTERRUPTS + GAUDI2_NUM_RESERVED_INTERRUPTS)
 
@@ -416,11 +416,11 @@ enum gaudi2_irq_num {
 	GAUDI2_IRQ_NUM_NIC_PORT_LAST = (GAUDI2_IRQ_NUM_NIC_PORT_FIRST + NIC_NUMBER_OF_PORTS - 1),
 	GAUDI2_IRQ_NUM_TPC_ASSERT,
 	GAUDI2_IRQ_NUM_EQ_ERROR,
+	GAUDI2_IRQ_NUM_USER_FIRST,
+	GAUDI2_IRQ_NUM_USER_LAST = (GAUDI2_IRQ_NUM_USER_FIRST + GAUDI2_NUM_USER_INTERRUPTS - 1),
 	GAUDI2_IRQ_NUM_RESERVED_FIRST,
-	GAUDI2_IRQ_NUM_RESERVED_LAST = (GAUDI2_MSIX_ENTRIES - GAUDI2_TOTAL_USER_INTERRUPTS - 1),
+	GAUDI2_IRQ_NUM_RESERVED_LAST = (GAUDI2_MSIX_ENTRIES - GAUDI2_NUM_RESERVED_INTERRUPTS - 1),
 	GAUDI2_IRQ_NUM_UNEXPECTED_ERROR = RESERVED_MSIX_UNEXPECTED_USER_ERROR_INTERRUPT,
-	GAUDI2_IRQ_NUM_USER_FIRST = GAUDI2_IRQ_NUM_UNEXPECTED_ERROR + 1,
-	GAUDI2_IRQ_NUM_USER_LAST = (GAUDI2_IRQ_NUM_USER_FIRST + GAUDI2_NUM_USER_INTERRUPTS - 1),
 	GAUDI2_IRQ_NUM_LAST = (GAUDI2_MSIX_ENTRIES - 1)
 };
 
diff --git a/drivers/accel/habanalabs/include/gaudi2/gaudi2.h b/drivers/accel/habanalabs/include/gaudi2/gaudi2.h
index 0231d6c55b4a..753d46a2836b 100644
--- a/drivers/accel/habanalabs/include/gaudi2/gaudi2.h
+++ b/drivers/accel/habanalabs/include/gaudi2/gaudi2.h
@@ -63,9 +63,9 @@
 #define RESERVED_VA_RANGE_FOR_ARC_ON_HOST_HPAGE_START 0xFFF0F80000000000ull
 #define RESERVED_VA_RANGE_FOR_ARC_ON_HOST_HPAGE_END 0xFFF0FFFFFFFFFFFFull
 
-#define RESERVED_MSIX_UNEXPECTED_USER_ERROR_INTERRUPT 256
+#define RESERVED_MSIX_UNEXPECTED_USER_ERROR_INTERRUPT 127
 
-#define GAUDI2_MSIX_ENTRIES 512
+#define GAUDI2_MSIX_ENTRIES 128
 
 #define QMAN_PQ_ENTRY_SIZE 16 /* Bytes */
 

From patchwork Wed Jun 19 06:34:18 2024
X-Patchwork-Submitter: Ofir Bitton
X-Patchwork-Id: 13703452
From: Ofir Bitton
To: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org
Cc: Oded Gabbay, Daniel Vetter
Subject: [PATCH 2/9] MAINTAINERS: Change habanalabs maintainer and git repo path
Date: Wed, 19 Jun 2024 09:34:18 +0300
Message-Id: <20240619063425.1377327-2-obitton@habana.ai>
In-Reply-To: <20240619063425.1377327-1-obitton@habana.ai>
References: <20240619063425.1377327-1-obitton@habana.ai>

From: Oded Gabbay

Because I left habana, Ofir Bitton is now the habanalabs driver
maintainer.

The git repo also changed location to the Habana GitHub website.

Signed-off-by: Oded Gabbay
Acked-by: Daniel Vetter
---
 MAINTAINERS | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index d7cfe9895a44..0f645095d22f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9449,11 +9449,11 @@ S: Maintained
 F:	block/partitions/efi.*
 
 HABANALABS PCI DRIVER
-M:	Oded Gabbay
+M:	Ofir Bitton
 L:	dri-devel@lists.freedesktop.org
 S:	Supported
 C:	irc://irc.oftc.net/dri-devel
-T:	git https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux.git
+T:	git https://github.com/HabanaAI/drivers.accel.habanalabs.kernel.git
 F:	Documentation/ABI/testing/debugfs-driver-habanalabs
 F:	Documentation/ABI/testing/sysfs-driver-habanalabs
 F:	drivers/accel/habanalabs/

From patchwork Wed Jun 19 06:34:19 2024
X-Patchwork-Submitter: Ofir Bitton
X-Patchwork-Id: 13703448
From: Ofir Bitton
To: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org
Cc: Ilia Levi
Subject: [PATCH 3/9] accel/habanalabs: additional print in device-in-use info
Date: Wed, 19 Jun 2024 09:34:19 +0300
Message-Id: <20240619063425.1377327-3-obitton@habana.ai>
In-Reply-To: <20240619063425.1377327-1-obitton@habana.ai>
References: <20240619063425.1377327-1-obitton@habana.ai>

From: Ilia Levi

When device release triggers a hard reset, there is a printout of the cause.
Currently listed causes (that increment context refcount) are active
command submissions and exported DMA buffer objects. In any other case,
the printout emits "unknown reason". We identify and print another
reason - allocated command buffers.

Signed-off-by: Ilia Levi
Reviewed-by: Ofir Bitton
---
 drivers/accel/habanalabs/common/device.c      | 19 +++++++---
 drivers/accel/habanalabs/common/habanalabs.h  | 14 ++++++-
 .../accel/habanalabs/common/habanalabs_drv.c  |  2 +-
 drivers/accel/habanalabs/common/memory_mgr.c  | 37 ++++++++++++++++++-
 4 files changed, 63 insertions(+), 9 deletions(-)

diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index 78e65c6b76a7..2fa6bf4c97af 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -549,7 +549,8 @@ int hl_hpriv_put(struct hl_fpriv *hpriv)
 	return kref_put(&hpriv->refcount, hpriv_release);
 }
 
-static void print_device_in_use_info(struct hl_device *hdev, const char *message)
+static void print_device_in_use_info(struct hl_device *hdev,
+		struct hl_mem_mgr_fini_stats *mm_fini_stats, const char *message)
 {
 	u32 active_cs_num, dmabuf_export_cnt;
 	bool unknown_reason = true;
@@ -573,6 +574,12 @@ static void print_device_in_use_info(struct hl_device *hdev, const char *message
 			dmabuf_export_cnt);
 	}
 
+	if (mm_fini_stats->n_busy_cb) {
+		unknown_reason = false;
+		offset += scnprintf(buf + offset, size - offset, " [%u live CB handles]",
+				mm_fini_stats->n_busy_cb);
+	}
+
 	if (unknown_reason)
 		scnprintf(buf + offset, size - offset, " [unknown reason]");
 
@@ -590,6 +597,7 @@ void hl_device_release(struct drm_device *ddev, struct drm_file *file_priv)
 {
 	struct hl_fpriv *hpriv = file_priv->driver_priv;
 	struct hl_device *hdev = to_hl_device(ddev);
+	struct hl_mem_mgr_fini_stats mm_fini_stats;
 
 	if (!hdev) {
 		pr_crit("Closing FD after device was removed. Memory leak will occur and it is advised to reboot.\n");
@@ -601,12 +609,13 @@ void hl_device_release(struct drm_device *ddev, struct drm_file *file_priv)
 	/* Memory buffers might be still in use at this point and thus the handles IDR destruction
 	 * is postponed to hpriv_release().
 	 */
-	hl_mem_mgr_fini(&hpriv->mem_mgr);
+	hl_mem_mgr_fini(&hpriv->mem_mgr, &mm_fini_stats);
 
 	hdev->compute_ctx_in_release = 1;
 
 	if (!hl_hpriv_put(hpriv)) {
-		print_device_in_use_info(hdev, "User process closed FD but device still in use");
+		print_device_in_use_info(hdev, &mm_fini_stats,
+				"User process closed FD but device still in use");
 		hl_device_reset(hdev, HL_DRV_RESET_HARD);
 	}
 
@@ -976,7 +985,7 @@ static int device_early_init(struct hl_device *hdev)
 	return 0;
 
 free_cb_mgr:
-	hl_mem_mgr_fini(&hdev->kernel_mem_mgr);
+	hl_mem_mgr_fini(&hdev->kernel_mem_mgr, NULL);
 	hl_mem_mgr_idr_destroy(&hdev->kernel_mem_mgr);
 free_chip_info:
 	kfree(hdev->hl_chip_info);
@@ -1020,7 +1029,7 @@ static void device_early_fini(struct hl_device *hdev)
 
 	mutex_destroy(&hdev->clk_throttling.lock);
 
-	hl_mem_mgr_fini(&hdev->kernel_mem_mgr);
+	hl_mem_mgr_fini(&hdev->kernel_mem_mgr, NULL);
 	hl_mem_mgr_idr_destroy(&hdev->kernel_mem_mgr);
 
 	kfree(hdev->hl_chip_info);
diff --git a/drivers/accel/habanalabs/common/habanalabs.h b/drivers/accel/habanalabs/common/habanalabs.h
index 3ea1b131cd42..f4ac5e9b1974 100644
--- a/drivers/accel/habanalabs/common/habanalabs.h
+++ b/drivers/accel/habanalabs/common/habanalabs.h
@@ -904,6 +904,18 @@ struct hl_mem_mgr {
 	struct idr handles;
 };
 
+/**
+ * struct hl_mem_mgr_fini_stats - describes statistics returned during memory manager teardown.
+ * @n_busy_cb: the amount of CB handles that could not be removed
+ * @n_busy_ts: the amount of TS handles that could not be removed
+ * @n_busy_other: the amount of any other type of handles that could not be removed
+ */
+struct hl_mem_mgr_fini_stats {
+	u32 n_busy_cb;
+	u32 n_busy_ts;
+	u32 n_busy_other;
+};
+
 /**
  * struct hl_mmap_mem_buf_behavior - describes unified memory manager buffer behavior
  * @topic: string identifier used for logging
@@ -4036,7 +4048,7 @@ char *hl_format_as_binary(char *buf, size_t buf_len, u32 n);
 const char *hl_sync_engine_to_string(enum hl_sync_engine_type engine_type);
 
 void hl_mem_mgr_init(struct device *dev, struct hl_mem_mgr *mmg);
-void hl_mem_mgr_fini(struct hl_mem_mgr *mmg);
+void hl_mem_mgr_fini(struct hl_mem_mgr *mmg, struct hl_mem_mgr_fini_stats *stats);
 void hl_mem_mgr_idr_destroy(struct hl_mem_mgr *mmg);
 int hl_mem_mgr_mmap(struct hl_mem_mgr *mmg, struct vm_area_struct *vma,
 		void *args);
diff --git a/drivers/accel/habanalabs/common/habanalabs_drv.c b/drivers/accel/habanalabs/common/habanalabs_drv.c
index b1613a82c7f2..708dfd10f39c 100644
--- a/drivers/accel/habanalabs/common/habanalabs_drv.c
+++ b/drivers/accel/habanalabs/common/habanalabs_drv.c
@@ -263,7 +263,7 @@ int hl_device_open(struct drm_device *ddev, struct drm_file *file_priv)
 
 out_err:
 	mutex_unlock(&hdev->fpriv_list_lock);
-	hl_mem_mgr_fini(&hpriv->mem_mgr);
+	hl_mem_mgr_fini(&hpriv->mem_mgr, NULL);
 	hl_mem_mgr_idr_destroy(&hpriv->mem_mgr);
 	hl_ctx_mgr_fini(hpriv->hdev, &hpriv->ctx_mgr);
 	mutex_destroy(&hpriv->ctx_lock);
diff --git a/drivers/accel/habanalabs/common/memory_mgr.c b/drivers/accel/habanalabs/common/memory_mgr.c
index c4d84df355b0..99cd83139d46 100644
--- a/drivers/accel/habanalabs/common/memory_mgr.c
+++ b/drivers/accel/habanalabs/common/memory_mgr.c
@@ -318,28 +318,61 @@ void hl_mem_mgr_init(struct device *dev, struct hl_mem_mgr *mmg)
 	idr_init(&mmg->handles);
 }
 
+static void hl_mem_mgr_fini_stats_reset(struct hl_mem_mgr_fini_stats *stats)
+{
+	if (!stats)
+		return;
+
+	memset(stats, 0, sizeof(*stats));
+}
+
+static void hl_mem_mgr_fini_stats_inc(u64 mem_id, struct hl_mem_mgr_fini_stats *stats)
+{
+	if (!stats)
+		return;
+
+	switch (mem_id) {
+	case HL_MMAP_TYPE_CB:
+		++stats->n_busy_cb;
+		break;
+	case HL_MMAP_TYPE_TS_BUFF:
+		++stats->n_busy_ts;
+		break;
+	default:
+		/* we currently store only CB/TS so this shouldn't happen */
+		++stats->n_busy_other;
+	}
+}
+
 /**
  * hl_mem_mgr_fini - release unified memory manager
  *
  * @mmg: parent unified memory manager
+ * @stats: if non-NULL, will return some counters for handles that could not be removed.
  *
  * Release the unified memory manager. Shall be called from an interrupt context.
  */
-void hl_mem_mgr_fini(struct hl_mem_mgr *mmg)
+void hl_mem_mgr_fini(struct hl_mem_mgr *mmg, struct hl_mem_mgr_fini_stats *stats)
 {
 	struct hl_mmap_mem_buf *buf;
 	struct idr *idp;
 	const char *topic;
+	u64 mem_id;
 	u32 id;
 
+	hl_mem_mgr_fini_stats_reset(stats);
+
 	idp = &mmg->handles;
 
 	idr_for_each_entry(idp, buf, id) {
 		topic = buf->behavior->topic;
-		if (hl_mmap_mem_buf_put(buf) != 1)
+		mem_id = buf->behavior->mem_id;
+		if (hl_mmap_mem_buf_put(buf) != 1) {
 			dev_err(mmg->dev,
 				"%s: Buff handle %u for CTX is still alive\n",
 				topic, id);
+			hl_mem_mgr_fini_stats_inc(mem_id, stats);
+		}
 	}
 }
 

From patchwork Wed Jun 19 06:34:20 2024
X-Patchwork-Submitter: Ofir Bitton
X-Patchwork-Id: 13703445
From: Ofir Bitton
To: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org
Cc: Farah Kassabri
Subject: [PATCH 4/9] accel/habanalabs: add more info upon cpu pkt timeout
Date: Wed, 19 Jun 2024 09:34:20 +0300
Message-Id: <20240619063425.1377327-4-obitton@habana.ai>
In-Reply-To: <20240619063425.1377327-1-obitton@habana.ai>
References: <20240619063425.1377327-1-obitton@habana.ai>

From: Farah Kassabri

In order to have better debuggability upon encountering FW issues,
we are adding additional
info once CPU packet timeout expires.

Signed-off-by: Farah Kassabri
Reviewed-by: Ofir Bitton
---
 drivers/accel/habanalabs/common/firmware_if.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/accel/habanalabs/common/firmware_if.c b/drivers/accel/habanalabs/common/firmware_if.c
index 6f0c40b12072..3cd8a1f69980 100644
--- a/drivers/accel/habanalabs/common/firmware_if.c
+++ b/drivers/accel/habanalabs/common/firmware_if.c
@@ -460,11 +460,19 @@ int hl_fw_send_cpu_message(struct hl_device *hdev, u32 hw_queue_id, u32 *msg,
 		/* If FW performed reset just before sending it a packet, we will get a timeout.
 		 * This is expected behavior, hence no need for error message.
 		 */
-		if (!hl_device_operational(hdev, NULL) && !hdev->reset_info.in_compute_reset)
+		if (!hl_device_operational(hdev, NULL) && !hdev->reset_info.in_compute_reset) {
 			dev_dbg(hdev->dev, "Device CPU packet timeout (0x%x) due to FW reset\n",
 					tmp);
-		else
-			dev_err(hdev->dev, "Device CPU packet timeout (status = 0x%x)\n", tmp);
+		} else {
+			struct hl_bd *bd = queue->kernel_address;
+
+			bd += hl_pi_2_offset(queue->pi);
+
+			dev_err(hdev->dev, "Device CPU packet timeout (status = 0x%x)\n"
+				"Pkt info: dma_addr: 0x%llx, kernel_addr: %p, len:0x%x, ctl: 0x%x, ptr:0x%llx, dram_bd:%u\n",
+				tmp, pkt_dma_addr, (void *)pkt, bd->len, bd->ctl, bd->ptr,
+				queue->dram_bd);
+		}
 		hdev->device_cpu_disabled = true;
 		goto out;
 	}

From patchwork Wed Jun 19 06:34:21 2024
X-Patchwork-Submitter: Ofir Bitton
X-Patchwork-Id: 13703446
From: Ofir Bitton
To: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org
Cc: Tomer Tayar
Subject: [PATCH 5/9] accel/habanalabs: revise print on EQ heartbeat failure
Date: Wed, 19 Jun 2024 09:34:21 +0300
Message-Id: <20240619063425.1377327-5-obitton@habana.ai>
In-Reply-To: <20240619063425.1377327-1-obitton@habana.ai>
References: <20240619063425.1377327-1-obitton@habana.ai>

From: Tomer Tayar

Don't print the "previous EQ index" value in case of an EQ heartbeat
failure, because it is incremented along with the EQ CI and therefore
redundant.
In addition, as the CPU-CP PI is zeroed when it reaches a value that is
twice the queue size, add a value of the CI with a similar wrap around,
to make it easier to compare the values.

Signed-off-by: Tomer Tayar
Reviewed-by: Ofir Bitton
---
 drivers/accel/habanalabs/common/device.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index 2fa6bf4c97af..3efc26dd9497 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -1064,23 +1064,24 @@ static bool is_pci_link_healthy(struct hl_device *hdev)
 
 static bool hl_device_eq_heartbeat_received(struct hl_device *hdev)
 {
+	struct eq_heartbeat_debug_info *heartbeat_debug_info = &hdev->heartbeat_debug_info;
+	u32 cpu_q_id = heartbeat_debug_info->cpu_queue_id, pq_pi_mask = (HL_QUEUE_LENGTH << 1) - 1;
 	struct asic_fixed_properties *prop = &hdev->asic_prop;
-	u32 cpu_q_id;
 
 	if (!prop->cpucp_info.eq_health_check_supported)
 		return true;
 
 	if (!hdev->eq_heartbeat_received) {
-		cpu_q_id = hdev->heartbeat_debug_info.cpu_queue_id;
-
 		dev_err(hdev->dev, "EQ heartbeat event was not received!\n");
 
-		dev_err(hdev->dev, "Heartbeat events counter: %u, Q_PI: %u, Q_CI: %u, EQ CI: %u, EQ prev: %u\n",
-			hdev->heartbeat_debug_info.heartbeat_event_counter,
-			hdev->kernel_queues[cpu_q_id].pi,
-			atomic_read(&hdev->kernel_queues[cpu_q_id].ci),
-			hdev->event_queue.ci,
-			hdev->event_queue.prev_eqe_index);
+		dev_err(hdev->dev,
+			"Heartbeat events counter: %u, EQ CI: %u, PQ PI: %u, PQ CI: %u (%u)\n",
+			heartbeat_debug_info->heartbeat_event_counter,
+			hdev->event_queue.ci,
+			hdev->kernel_queues[cpu_q_id].pi,
+			atomic_read(&hdev->kernel_queues[cpu_q_id].ci),
+			atomic_read(&hdev->kernel_queues[cpu_q_id].ci) & pq_pi_mask);
+
 		return false;
 	}
 

From patchwork Wed Jun 19 06:34:22 2024
X-Patchwork-Submitter: Ofir Bitton
X-Patchwork-Id: 13703450
From: Ofir Bitton
To: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org
Cc: Tomer Tayar
Subject: [PATCH 6/9] accel/habanalabs: dump the EQ entries headers on EQ heartbeat failure
Date: Wed, 19 Jun 2024 09:34:22 +0300
Message-Id: <20240619063425.1377327-6-obitton@habana.ai>
In-Reply-To: <20240619063425.1377327-1-obitton@habana.ai>
References: <20240619063425.1377327-1-obitton@habana.ai>

From: Tomer Tayar

Add a dump of the EQ entries headers upon an EQ heartbeat failure.

Signed-off-by: Tomer Tayar
Reviewed-by: Ofir Bitton
---
 drivers/accel/habanalabs/common/device.c     |  2 ++
 drivers/accel/habanalabs/common/habanalabs.h |  1 +
 drivers/accel/habanalabs/common/irq.c        | 25 ++++++++++++++++++++
 include/linux/habanalabs/cpucp_if.h          |  3 +++
 4 files changed, 31 insertions(+)

diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index 3efc26dd9497..7bd7c2eb5dd2 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -1082,6 +1082,8 @@ static bool hl_device_eq_heartbeat_received(struct hl_device *hdev)
 			atomic_read(&hdev->kernel_queues[cpu_q_id].ci),
 			atomic_read(&hdev->kernel_queues[cpu_q_id].ci) & pq_pi_mask);
 
+		hl_eq_dump(hdev, &hdev->event_queue);
+
 		return false;
 	}
 
diff --git a/drivers/accel/habanalabs/common/habanalabs.h b/drivers/accel/habanalabs/common/habanalabs.h
index f4ac5e9b1974..ce78b331e244 100644
--- a/drivers/accel/habanalabs/common/habanalabs.h
+++ b/drivers/accel/habanalabs/common/habanalabs.h
@@ -3754,6 +3754,7 @@ int hl_eq_init(struct hl_device *hdev, struct hl_eq *q);
 void hl_eq_fini(struct hl_device *hdev, struct hl_eq *q);
 void hl_cq_reset(struct hl_device *hdev, struct hl_cq *q);
 void hl_eq_reset(struct hl_device *hdev, struct hl_eq *q);
+void hl_eq_dump(struct hl_device *hdev, struct hl_eq *q);
 irqreturn_t hl_irq_handler_cq(int irq, void *arg);
 irqreturn_t hl_irq_handler_eq(int irq, void *arg);
 irqreturn_t hl_irq_handler_dec_abnrm(int irq, void *arg);
diff --git a/drivers/accel/habanalabs/common/irq.c b/drivers/accel/habanalabs/common/irq.c
index 2caf2df4de08..7c9f2f6a2870 100644
--- a/drivers/accel/habanalabs/common/irq.c
+++ b/drivers/accel/habanalabs/common/irq.c
@@ -697,3 +697,28 @@ void hl_eq_reset(struct hl_device *hdev, struct hl_eq *q)
 
 	memset(q->kernel_address, 0, q->size);
 }
+
+void hl_eq_dump(struct hl_device *hdev, struct hl_eq *q)
+{
+	u32 eq_length, eqe_size, ctl, ready, mode, type, index;
+	struct hl_eq_header *hdr;
+	u8 *ptr;
+	int i;
+
+	eq_length = HL_EQ_LENGTH;
+	eqe_size = q->size / HL_EQ_LENGTH;
+
+	dev_info(hdev->dev, "Contents of EQ entries headers:\n");
+
+	for (i = 0, ptr = q->kernel_address ; i < eq_length ; ++i, ptr += eqe_size) {
+		hdr = (struct hl_eq_header *) ptr;
+		ctl = le32_to_cpu(hdr->ctl);
+		ready = FIELD_GET(EQ_CTL_READY_MASK, ctl);
+		mode = FIELD_GET(EQ_CTL_EVENT_MODE_MASK, ctl);
+		type = FIELD_GET(EQ_CTL_EVENT_TYPE_MASK, ctl);
+		index = FIELD_GET(EQ_CTL_INDEX_MASK, ctl);
+
+		dev_info(hdev->dev, "%02u: %#010x [ready: %u, mode %u, type %04u, index %05u]\n",
+			i, ctl, ready, mode, type, index);
+	}
+}
diff --git a/include/linux/habanalabs/cpucp_if.h b/include/linux/habanalabs/cpucp_if.h
index 1ed17887f1a8..7ed3fdd55dda 100644
--- a/include/linux/habanalabs/cpucp_if.h
+++ b/include/linux/habanalabs/cpucp_if.h
@@ -397,6 +397,9 @@ struct hl_eq_entry {
 #define EQ_CTL_READY_SHIFT 31
 #define EQ_CTL_READY_MASK 0x80000000
 
+#define EQ_CTL_EVENT_MODE_SHIFT 28
+#define EQ_CTL_EVENT_MODE_MASK 0x70000000
+
 #define EQ_CTL_EVENT_TYPE_SHIFT 16
 #define EQ_CTL_EVENT_TYPE_MASK 0x0FFF0000
 

From patchwork Wed Jun 19 06:34:23 2024
X-Patchwork-Submitter: Ofir Bitton
X-Patchwork-Id: 13703447
From: Ofir Bitton
To: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org
Cc: Tomer Tayar
Subject: [PATCH 7/9] accel/habanalabs: print timestamp of last PQ heartbeat on EQ heartbeat failure
Date: Wed, 19 Jun
2024 09:34:23 +0300 Message-Id: <20240619063425.1377327-7-obitton@habana.ai> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240619063425.1377327-1-obitton@habana.ai> References: <20240619063425.1377327-1-obitton@habana.ai> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Tomer Tayar The test packet which is sent to FW for the PQ heartbeat is also used as the trigger in FW to send the EQ heartbeat event. Add the time of the last sent packet to the debug info which is printed upon an EQ heartbeat failure. Signed-off-by: Tomer Tayar Reviewed-by: Ofir Bitton --- drivers/accel/habanalabs/common/device.c | 37 +++++++++++++++++-- drivers/accel/habanalabs/common/firmware_if.c | 16 ++++---- drivers/accel/habanalabs/common/habanalabs.h | 5 +++ 3 files changed, 46 insertions(+), 12 deletions(-) diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c index 7bd7c2eb5dd2..050c278e5ddb 100644 --- a/drivers/accel/habanalabs/common/device.c +++ b/drivers/accel/habanalabs/common/device.c @@ -1062,11 +1062,28 @@ static bool is_pci_link_healthy(struct hl_device *hdev) return (device_id == hdev->pdev->device); } +static void stringify_time_of_last_heartbeat(struct hl_device *hdev, char *time_str, size_t size, + bool is_pq_hb) +{ + time64_t seconds = is_pq_hb ? 
hdev->heartbeat_debug_info.last_pq_heartbeat_ts + : hdev->heartbeat_debug_info.last_eq_heartbeat_ts; + struct tm tm; + + if (!seconds) + return; + + time64_to_tm(seconds, 0, &tm); + + snprintf(time_str, size, "%ld-%02d-%02d %02d:%02d:%02d (UTC)", + tm.tm_year + 1900, tm.tm_mon + 1, tm.tm_mday, tm.tm_hour, tm.tm_min, tm.tm_sec); +} + static bool hl_device_eq_heartbeat_received(struct hl_device *hdev) { struct eq_heartbeat_debug_info *heartbeat_debug_info = &hdev->heartbeat_debug_info; u32 cpu_q_id = heartbeat_debug_info->cpu_queue_id, pq_pi_mask = (HL_QUEUE_LENGTH << 1) - 1; struct asic_fixed_properties *prop = &hdev->asic_prop; + char pq_time_str[64] = "N/A", eq_time_str[64] = "N/A"; if (!prop->cpucp_info.eq_health_check_supported) return true; @@ -1074,13 +1091,17 @@ static bool hl_device_eq_heartbeat_received(struct hl_device *hdev) if (!hdev->eq_heartbeat_received) { dev_err(hdev->dev, "EQ heartbeat event was not received!\n"); + stringify_time_of_last_heartbeat(hdev, pq_time_str, sizeof(pq_time_str), true); + stringify_time_of_last_heartbeat(hdev, eq_time_str, sizeof(eq_time_str), false); dev_err(hdev->dev, - "Heartbeat events counter: %u, EQ CI: %u, PQ PI: %u, PQ CI: %u (%u)\n", - heartbeat_debug_info->heartbeat_event_counter, + "EQ: {CI %u, HB counter %u, last HB time: %s}, PQ: {PI: %u, CI: %u (%u), last HB time: %s}\n", hdev->event_queue.ci, + heartbeat_debug_info->heartbeat_event_counter, + eq_time_str, hdev->kernel_queues[cpu_q_id].pi, atomic_read(&hdev->kernel_queues[cpu_q_id].ci), - atomic_read(&hdev->kernel_queues[cpu_q_id].ci) & pq_pi_mask); + atomic_read(&hdev->kernel_queues[cpu_q_id].ci) & pq_pi_mask, + pq_time_str); hl_eq_dump(hdev, &hdev->event_queue); @@ -1562,12 +1583,19 @@ static void handle_reset_trigger(struct hl_device *hdev, u32 flags) } } +static void reset_heartbeat_debug_info(struct hl_device *hdev) +{ + hdev->heartbeat_debug_info.last_pq_heartbeat_ts = 0; + hdev->heartbeat_debug_info.last_eq_heartbeat_ts = 0; + 
hdev->heartbeat_debug_info.heartbeat_event_counter = 0; +} + static inline void device_heartbeat_schedule(struct hl_device *hdev) { if (!hdev->heartbeat) return; - hdev->heartbeat_debug_info.heartbeat_event_counter = 0; + reset_heartbeat_debug_info(hdev); /* * Before scheduling the heartbeat driver will check if eq event has received. @@ -2883,6 +2911,7 @@ void hl_set_irq_affinity(struct hl_device *hdev, int irq) void hl_eq_heartbeat_event_handle(struct hl_device *hdev) { hdev->heartbeat_debug_info.heartbeat_event_counter++; + hdev->heartbeat_debug_info.last_eq_heartbeat_ts = ktime_get_real_seconds(); hdev->eq_heartbeat_received = true; } diff --git a/drivers/accel/habanalabs/common/firmware_if.c b/drivers/accel/habanalabs/common/firmware_if.c index 3cd8a1f69980..eeb6b2a80fc7 100644 --- a/drivers/accel/habanalabs/common/firmware_if.c +++ b/drivers/accel/habanalabs/common/firmware_if.c @@ -466,12 +466,12 @@ int hl_fw_send_cpu_message(struct hl_device *hdev, u32 hw_queue_id, u32 *msg, } else { struct hl_bd *bd = queue->kernel_address; - bd += hl_pi_2_offset(queue->pi); + bd += hl_pi_2_offset(pi); dev_err(hdev->dev, "Device CPU packet timeout (status = 0x%x)\n" - "Pkt info: dma_addr: 0x%llx, kernel_addr: %p, len:0x%x, ctl: 0x%x, ptr:0x%llx, dram_bd:%u\n", - tmp, pkt_dma_addr, (void *)pkt, bd->len, bd->ctl, bd->ptr, - queue->dram_bd); + "Pkt info[%u]: dma_addr: 0x%llx, kernel_addr: %p, len:0x%x, ctl: 0x%x, ptr:0x%llx, dram_bd:%u\n", + tmp, pi, pkt_dma_addr, (void *)pkt, bd->len, bd->ctl, bd->ptr, + queue->dram_bd); } hdev->device_cpu_disabled = true; goto out; @@ -681,12 +681,10 @@ int hl_fw_send_heartbeat(struct hl_device *hdev) int rc; memset(&hb_pkt, 0, sizeof(hb_pkt)); - hb_pkt.ctl = cpu_to_le32(CPUCP_PACKET_TEST << - CPUCP_PKT_CTL_OPCODE_SHIFT); + hb_pkt.ctl = cpu_to_le32(CPUCP_PACKET_TEST << CPUCP_PKT_CTL_OPCODE_SHIFT); hb_pkt.value = cpu_to_le64(CPUCP_PACKET_FENCE_VAL); - rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &hb_pkt, - sizeof(hb_pkt), 0, 
&result); + rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &hb_pkt, sizeof(hb_pkt), 0, &result); if ((rc) || (result != CPUCP_PACKET_FENCE_VAL)) return -EIO; @@ -697,6 +695,8 @@ int hl_fw_send_heartbeat(struct hl_device *hdev) rc = -EIO; } + hdev->heartbeat_debug_info.last_pq_heartbeat_ts = ktime_get_real_seconds(); + return rc; } diff --git a/drivers/accel/habanalabs/common/habanalabs.h b/drivers/accel/habanalabs/common/habanalabs.h index ce78b331e244..a06e5a966f45 100644 --- a/drivers/accel/habanalabs/common/habanalabs.h +++ b/drivers/accel/habanalabs/common/habanalabs.h @@ -3196,10 +3196,15 @@ struct hl_reset_info { /** * struct eq_heartbeat_debug_info - stores debug info to be used upon heartbeat failure. + * @last_pq_heartbeat_ts: timestamp of the last test packet that was sent to FW. + * This packet is the trigger in FW to send the EQ heartbeat event. + * @last_eq_heartbeat_ts: timestamp of the last EQ heartbeat event that was received from FW. * @heartbeat_event_counter: number of heartbeat events received. 
* @cpu_queue_id: used to read the queue pi/ci */ struct eq_heartbeat_debug_info { + time64_t last_pq_heartbeat_ts; + time64_t last_eq_heartbeat_ts; u32 heartbeat_event_counter; u32 cpu_queue_id; }; From patchwork Wed Jun 19 06:34:24 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ofir Bitton X-Patchwork-Id: 13703451 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 1A7FEC27C53 for ; Wed, 19 Jun 2024 06:35:05 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 3BC1310E905; Wed, 19 Jun 2024 06:35:04 +0000 (UTC) Authentication-Results: gabe.freedesktop.org; dkim=pass (2048-bit key; unprotected) header.d=habana.ai header.i=@habana.ai header.b="O8YH0G0k"; dkim-atps=neutral Received: from mail02.habana.ai (habanamailrelay.habana.ai [213.57.90.13]) by gabe.freedesktop.org (Postfix) with ESMTPS id 9DAD810E950 for ; Wed, 19 Jun 2024 06:34:58 +0000 (UTC) Received: internal info suppressed DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=habana.ai; s=default; t=1718778906; bh=WmGwTdZ13Z8IYOL25S14t4gDqP3Y6eNfWVJIbC6SN+M=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=O8YH0G0kyvX/V85a+M29ZhwxTrsZOR6+qjlJgz1IbDCYaljs61qgao/LEpVrn1Gdp 1uF2CSNK/I17BTFlZq1UFKxpiPW/7WR14oBg906VTWEGO/Uv6p6xHkfuXo1MN0KYF5 Xn+N6wOrbrCF4V0rDwEBZ/5bRq+7NO6cXRPJHirNxuKZccM3gt/TiQJf/5vda3DSZG tHQfE6rfhUGeeQkHM4m5FtmGRe0VEy/yGEDRPeTvUJtJCHkv236iO/sV1tVmPaM8uF 56UvbRgi3UnwH64So5Enh9TnGtC+Zrf3VNLFHDDTl+9MYhXLJAjIRpKiLLPWYoVPY3 QHUsCVSj6YvOQ== Received: from obitton-vm-u22.habana-labs.com (localhost [127.0.0.1]) by obitton-vm-u22.habana-labs.com 
(8.15.2/8.15.2/Debian-22ubuntu3) with ESMTP id 45J6YQB81377354; Wed, 19 Jun 2024 09:34:33 +0300 From: Ofir Bitton To: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org Cc: Tomer Tayar Subject: [PATCH 8/9] accel/habanalabs: move heartbeat work initialization to early init Date: Wed, 19 Jun 2024 09:34:24 +0300 Message-Id: <20240619063425.1377327-8-obitton@habana.ai> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240619063425.1377327-1-obitton@habana.ai> References: <20240619063425.1377327-1-obitton@habana.ai> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Tomer Tayar The device heartbeat work is currently initialized at device_heartbeat_schedule() which is called at the end of hl_device_init(). However hl_device_init() can fail at a previous step, and in such a case, a subsequent call to hl_device_fini() will lead to calling cleanup_resources() and accessing this work uninitialized. As there is no real need to re-initialize this work every time it is rescheduled, move this initialization to device_early_init() to be done once and early enough. 
Signed-off-by: Tomer Tayar Reviewed-by: Ofir Bitton --- drivers/accel/habanalabs/common/device.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c index 050c278e5ddb..e0cf3b4343bb 100644 --- a/drivers/accel/habanalabs/common/device.c +++ b/drivers/accel/habanalabs/common/device.c @@ -30,6 +30,8 @@ enum dma_alloc_type { #define MEM_SCRUB_DEFAULT_VAL 0x1122334455667788 +static void hl_device_heartbeat(struct work_struct *work); + /* * hl_set_dram_bar- sets the bar to allow later access to address * @@ -963,6 +965,8 @@ static int device_early_init(struct hl_device *hdev) goto free_cb_mgr; } + INIT_DELAYED_WORK(&hdev->work_heartbeat, hl_device_heartbeat); + INIT_DELAYED_WORK(&hdev->device_reset_work.reset_work, device_hard_reset_pending); hdev->device_reset_work.hdev = hdev; hdev->device_fini_pending = 0; @@ -1604,8 +1608,6 @@ static inline void device_heartbeat_schedule(struct hl_device *hdev) */ hdev->eq_heartbeat_received = true; - INIT_DELAYED_WORK(&hdev->work_heartbeat, hl_device_heartbeat); - schedule_delayed_work(&hdev->work_heartbeat, usecs_to_jiffies(HL_HEARTBEAT_PER_USEC)); } From patchwork Wed Jun 19 06:34:25 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Ofir Bitton X-Patchwork-Id: 13703449 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 5364CC27C79 for ; Wed, 19 Jun 2024 06:34:58 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 7306010E8CD; Wed, 19 Jun 2024 06:34:57 +0000 (UTC) Authentication-Results: 
gabe.freedesktop.org; dkim=pass (2048-bit key; unprotected) header.d=habana.ai header.i=@habana.ai header.b="M81PsjXq"; dkim-atps=neutral Received: from mail02.habana.ai (habanamailrelay.habana.ai [213.57.90.13]) by gabe.freedesktop.org (Postfix) with ESMTPS id 8814C10E8FB for ; Wed, 19 Jun 2024 06:34:47 +0000 (UTC) Received: internal info suppressed DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=habana.ai; s=default; t=1718778894; bh=yVIz9ixUV8yNzzj5u0E3tyrmSA6nt1p/MJB5rrHIEOA=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=M81PsjXqqwh1IM+TailoLyqtoGDjX8+DyTrmFk+LBcnZaCzq6V+kWHB74zH0PRF74 vzT54D3A6PjTWohor4uPd6J8ONwhiNBc3jLMSwLA+Zpe5/r6821KHkPuz88HzLxydM Tcts5N1cOamLU6uAO0PNFhypeggiEnlmzFeZYQeBBJTRPHmkJrR39lQyGwcCFfCvxV CCphi4N2Zkst5I3FODjUS0+aiu/DLaj5EnyoO8Z7E2SsN6JkWaiJkCmrZ+QzdjNIzh nN/k4m40eCjWX3zEeHTk2iBN/aNgmt6SMFCd+Lfppejx/pNAR0Td7SQHFtZReKMjak zRIi98WRuUHLA== Received: from obitton-vm-u22.habana-labs.com (localhost [127.0.0.1]) by obitton-vm-u22.habana-labs.com (8.15.2/8.15.2/Debian-22ubuntu3) with ESMTP id 45J6YQB91377354; Wed, 19 Jun 2024 09:34:33 +0300 From: Ofir Bitton To: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org Cc: Didi Freiman Subject: [PATCH 9/9] accel/habanalabs: gradual sleep in polling memory macro Date: Wed, 19 Jun 2024 09:34:25 +0300 Message-Id: <20240619063425.1377327-9-obitton@habana.ai> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240619063425.1377327-1-obitton@habana.ai> References: <20240619063425.1377327-1-obitton@habana.ai> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Didi Freiman It’s better to avoid long sleeps right from the beginning of the polling since the data may be available much sooner than the sleep period. 
Because polling host memory is inexpensive, this change gradually increases the sleep time up to the user-requested period. Signed-off-by: Didi Freiman Reviewed-by: Ofir Bitton --- drivers/accel/habanalabs/common/habanalabs.h | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/drivers/accel/habanalabs/common/habanalabs.h b/drivers/accel/habanalabs/common/habanalabs.h index a06e5a966f45..6f27ce4fa01b 100644 --- a/drivers/accel/habanalabs/common/habanalabs.h +++ b/drivers/accel/habanalabs/common/habanalabs.h @@ -2729,11 +2729,16 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val); * updated directly by the device. If false, the host memory being polled will * be updated by host CPU. Required so host knows whether or not the memory * might need to be byte-swapped before returning value to caller. + * + * On the first 4 polling iterations the macro goes to sleep for short period of + * time that gradually increases and reaches sleep_us on the fifth iteration. */ #define hl_poll_timeout_memory(hdev, addr, val, cond, sleep_us, timeout_us, \ mem_written_by_device) \ ({ \ + u64 __sleep_step_us; \ ktime_t __timeout; \ + u8 __step = 8; \ \ __timeout = ktime_add_us(ktime_get(), timeout_us); \ might_sleep_if(sleep_us); \ @@ -2751,8 +2756,10 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val); (val) = le32_to_cpu(*(__le32 *) &(val)); \ break; \ } \ - if (sleep_us) \ - usleep_range((sleep_us >> 2) + 1, sleep_us); \ + __sleep_step_us = sleep_us >> __step; \ + if (__sleep_step_us) \ + usleep_range((__sleep_step_us >> 2) + 1, __sleep_step_us); \ + __step >>= 1; \ } \ (cond) ? 0 : -ETIMEDOUT; \ })
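
Note on the sleep schedule in patch 9/9: the macro starts with __step = 8 and halves it after every polling iteration, so the upper bound passed to usleep_range() is sleep_us >> 8, >> 4, >> 2, >> 1, and then the full sleep_us from the fifth iteration on. An iteration whose shifted value is 0 does not sleep at all. The following is a minimal userspace sketch of that arithmetic only; gradual_sleep_us() is a hypothetical helper written for illustration and is not part of the driver.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not driver code): returns the upper bound, in
 * microseconds, that the patched hl_poll_timeout_memory() would pass to
 * usleep_range() on the given 0-based polling iteration.
 * A return value of 0 means the macro skips the sleep on that iteration.
 */
static uint64_t gradual_sleep_us(uint64_t sleep_us, unsigned int iteration)
{
	uint8_t step = 8;	/* same initial value as __step in the macro */
	unsigned int i;

	/* The macro halves __step after every iteration: 8, 4, 2, 1, 0, 0, ... */
	for (i = 0; i < iteration; i++)
		step >>= 1;

	/* With step == 0 the shift is a no-op and the full sleep_us is used. */
	return sleep_us >> step;
}
```

For a caller passing sleep_us = 1000, this gives upper bounds of 3, 62, 250, 500 and 1000 usec for the first five iterations. With a small request such as sleep_us = 100, the first iteration evaluates to 0 and is effectively a busy re-check.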