From patchwork Thu Jun 8 13:38:38 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13272344 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 3A201C7EE25 for ; Thu, 8 Jun 2023 13:38:59 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 5A99910E5B5; Thu, 8 Jun 2023 13:38:58 +0000 (UTC) Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by gabe.freedesktop.org (Postfix) with ESMTPS id 22A4310E5B5 for ; Thu, 8 Jun 2023 13:38:57 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 7EE7C63C24; Thu, 8 Jun 2023 13:38:55 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 0CF78C433EF; Thu, 8 Jun 2023 13:38:53 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1686231534; bh=pLSnDM48gI3wCeowijVY1uBrUUQN0ZfedCJxGcz+reI=; h=From:To:Cc:Subject:Date:From; b=QdHrlSCYZ/RwvjKGYig0Rdzd6JX12ibJQaggE3P6I3wRg0bCxtIQ7SlJcIp2TlDik j4BWHrT4LSQcQuO/fApAk8YO6tv5gxIeXMPVlnAqHPeiq3ICaQs+Qi7OHmeUl44BFm 3rlIRXGuGIDSxpyt4D3d7858IiIcNYFxXGUktLO20GecwJob/9DIxYf+A9GDX7CPT8 AGiii4E+jwTgiAg21M71sTSCZdkHg19ZwF6p5qjyYb9Iw+uYXydaXjHlmJeGrSTCEz pB0OZSSDpURs+DiP6j5VtFVvSPpuTaGN7VDcKR8OSXCeaCSNoLWB852HnBCtwk5PNQ OelqhPe284E1g== From: Oded Gabbay To: dri-devel@lists.freedesktop.org Subject: [PATCH 01/12] accel/habanalabs: prevent immediate hard reset due to 2 adjacent H/W events Date: Thu, 8 Jun 2023 16:38:38 +0300 Message-Id: <20230608133849.2739411-1-ogabbay@kernel.org> X-Mailer: git-send-email 2.40.1 MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Tomer Tayar Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Tomer Tayar When a H/W event is received while a user is registered to events, no immediate hard reset will happen, and instead the user will be notified and will have some time to handle it and eventually release the device, after which the reset will be done. If a user, as part of the handling and as part of the cleanup steps towards releasing the device, unregisters from receiving those events, and at that time an adjacent H/W event is received, it will be assumed that the user is not registered to events and thus an immediate hard reset is required. To prevent such an unwanted immediate reset, modify the driver to perform it if the user is not registered to events AND we don't already have a pending reset for a previous H/W event. Signed-off-by: Tomer Tayar Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- drivers/accel/habanalabs/common/device.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c index b97339d1f7c6..1e61e79c42e5 100644 --- a/drivers/accel/habanalabs/common/device.c +++ b/drivers/accel/habanalabs/common/device.c @@ -1916,7 +1916,16 @@ int hl_device_cond_reset(struct hl_device *hdev, u32 flags, u64 event_mask) } ctx = hl_get_compute_ctx(hdev); - if (!ctx || !ctx->hpriv->notifier_event.eventfd) + if (!ctx) + goto device_reset; + + /* + * There is no point in postponing the reset if user is not registered for events. + * However if no eventfd_ctx exists but the device release watchdog is already scheduled, it + * just implies that user has unregistered as part of handling a previous event. In this + * case an immediate reset is not required. + */ + if (!ctx->hpriv->notifier_event.eventfd && !hdev->reset_info.watchdog_active) goto device_reset; /* Schedule the device release watchdog work unless reset is already in progress or if the From patchwork Thu Jun 8 13:38:39 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13272345 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 36475C7EE25 for ; Thu, 8 Jun 2023 13:39:04 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id CFB3F10E5BC; Thu, 8 Jun 2023 13:38:58 +0000 (UTC) Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by gabe.freedesktop.org (Postfix) with ESMTPS id 87AEA10E5B5 for ; Thu, 8 Jun 2023 13:38:57 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id EE69F615C6; Thu, 8 Jun 2023 13:38:56 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 77892C4339B; Thu, 8 Jun 2023 13:38:55 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1686231536; bh=mOlbTkDwyB+4/Q/+Yy1kHQOljI7tE3g8StidJZ/Wurc=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=uCxHByDgGE05DMg4lS5f+OyK/rRf9XxC2r1AOn/O6GBKmcxmKPHkGSb43yI5PXfHq bw1Rg2Ckd2YTkeA4hob8ZF5GLA1sVqqer3eu7beyIk/JXpfey1ajd3MnpfDeBq2wWD w4FlfW685UohaC83YyGYtdUFEwYst5xoOHNryRBZ5/j1nFoP6vnKPY9SRWPZu20P9S nFabVp3+ejd5WiTpz2d85UXo/v44liHH1mHQWDDp2IkWkwYSc0lTJ0NWn+mJ0pjD0p zW6+b0MQUvfqIVXy6KnTHXigCJTppK1X0BoZoR5BcXE+/Y4RSEWrEJuuVvLlt79Oz+ IZCftGewE00Vg== From: Oded Gabbay To: dri-devel@lists.freedesktop.org Subject: [PATCH 02/12] accel/habanalabs: update pending reset flags with new reset requests Date: Thu, 8 Jun 2023 16:38:39 +0300 Message-Id: <20230608133849.2739411-2-ogabbay@kernel.org> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230608133849.2739411-1-ogabbay@kernel.org> References: <20230608133849.2739411-1-ogabbay@kernel.org> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Tomer Tayar Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Tomer Tayar If hl_device_cond_reset() is called while a reset is already pending but hasn't started, the reset request will be dropped. If the flags of the new request are more severe, e.g. a hard reset while the pending reset is a compute reset, the eventual reset won't be suitable for the device status. To prevent such cases, update the pending reset flags with the new requests flags before the requests are dropped. Signed-off-by: Tomer Tayar Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- drivers/accel/habanalabs/common/device.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c index 1e61e79c42e5..993305871292 100644 --- a/drivers/accel/habanalabs/common/device.c +++ b/drivers/accel/habanalabs/common/device.c @@ -1937,8 +1937,10 @@ int hl_device_cond_reset(struct hl_device *hdev, u32 flags, u64 event_mask) goto device_reset; } - if (hdev->reset_info.watchdog_active) + if (hdev->reset_info.watchdog_active) { + hdev->device_release_watchdog_work.flags |= flags; goto out; + } hdev->device_release_watchdog_work.flags = flags; dev_dbg(hdev->dev, "Device is going to be hard-reset in %u sec unless being released\n", From patchwork Thu Jun 8 13:38:40 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13272347 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 3E64DC7EE25 for ; Thu, 8 Jun 2023 13:39:10 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 84AA610E5C0; Thu, 8 Jun 2023 13:39:03 +0000 (UTC) Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by gabe.freedesktop.org (Postfix) with ESMTPS id B15F310E5BF for ; Thu, 8 Jun 2023 13:38:59 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id A20AD615C6; Thu, 8 Jun 2023 13:38:58 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id E1D63C433D2; Thu, 8 Jun 2023 13:38:56 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1686231538; bh=Zh7dsC1p59y4TVBlWYm5MyJz+pqp1ATBMFp5GEAmMQE=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=hSfIwzCvaay9Sls28FqQKs02jwhyN6WQNX0i04oWwFBLGaJG8De5qv9jlFbe+vS7D LbEn8tbzbfWg2LHb4C6OQzamsPQP5+AmYNaIxALehqxve3JFWDcmv09GEU5znWNtel K8QOd/ucVzFhqDeWdIEYDMoTm328qPsCSQqVDlql9Wrb/pb3W6wcxFQm3nqU8f9XWH 80ORwWi1dPmKppf8xfSKt3BBoV9hUathe+zKnm3S46CDU9NSM4sdGpWKZtYjsRULBU 3GevU5rkxWeRMfPPq5Ba2Tue+4u3aV8w7OB69jZfd0n1eTtz0T+sSGvenzx5xK8cSk Dp/A+d3d7tN3Q== From: Oded Gabbay To: dri-devel@lists.freedesktop.org Subject: [PATCH 03/12] accel/habanalabs: notify user about undefined opcode event Date: Thu, 8 Jun 2023 16:38:40 +0300 Message-Id: <20230608133849.2739411-3-ogabbay@kernel.org> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230608133849.2739411-1-ogabbay@kernel.org> References: <20230608133849.2739411-1-ogabbay@kernel.org> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Ofir Bitton Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Ofir Bitton In order for user to be aware of undefined opcode events, we must store all relevant information and notify user about the failure. The user will fetch the stored info via info ioctl. Signed-off-by: Ofir Bitton Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- drivers/accel/habanalabs/gaudi2/gaudi2.c | 142 ++++++++++++++++++++++- 1 file changed, 137 insertions(+), 5 deletions(-) diff --git a/drivers/accel/habanalabs/gaudi2/gaudi2.c b/drivers/accel/habanalabs/gaudi2/gaudi2.c index 20c4583f12b0..ed3b0b6225d2 100644 --- a/drivers/accel/habanalabs/gaudi2/gaudi2.c +++ b/drivers/accel/habanalabs/gaudi2/gaudi2.c @@ -993,6 +993,111 @@ gaudi2_pcie_addr_dec_error_cause[GAUDI2_NUM_OF_PCIE_ADDR_DEC_ERR_CAUSE] = { "TLP is blocked by RR" }; +static const int gaudi2_queue_id_to_engine_id[] = { + [GAUDI2_QUEUE_ID_PDMA_0_0...GAUDI2_QUEUE_ID_PDMA_0_3] = GAUDI2_ENGINE_ID_PDMA_0, + [GAUDI2_QUEUE_ID_PDMA_1_0...GAUDI2_QUEUE_ID_PDMA_1_3] = GAUDI2_ENGINE_ID_PDMA_1, + [GAUDI2_QUEUE_ID_DCORE0_EDMA_0_0...GAUDI2_QUEUE_ID_DCORE0_EDMA_0_3] = + GAUDI2_DCORE0_ENGINE_ID_EDMA_0, + [GAUDI2_QUEUE_ID_DCORE0_EDMA_1_0...GAUDI2_QUEUE_ID_DCORE0_EDMA_1_3] = + GAUDI2_DCORE0_ENGINE_ID_EDMA_1, + [GAUDI2_QUEUE_ID_DCORE1_EDMA_0_0...GAUDI2_QUEUE_ID_DCORE1_EDMA_0_3] = + GAUDI2_DCORE1_ENGINE_ID_EDMA_0, + [GAUDI2_QUEUE_ID_DCORE1_EDMA_1_0...GAUDI2_QUEUE_ID_DCORE1_EDMA_1_3] = + GAUDI2_DCORE1_ENGINE_ID_EDMA_1, + [GAUDI2_QUEUE_ID_DCORE2_EDMA_0_0...GAUDI2_QUEUE_ID_DCORE2_EDMA_0_3] = + GAUDI2_DCORE2_ENGINE_ID_EDMA_0, + [GAUDI2_QUEUE_ID_DCORE2_EDMA_1_0...GAUDI2_QUEUE_ID_DCORE2_EDMA_1_3] = + GAUDI2_DCORE2_ENGINE_ID_EDMA_1, + [GAUDI2_QUEUE_ID_DCORE3_EDMA_0_0...GAUDI2_QUEUE_ID_DCORE3_EDMA_0_3] = + GAUDI2_DCORE3_ENGINE_ID_EDMA_0, + [GAUDI2_QUEUE_ID_DCORE3_EDMA_1_0...GAUDI2_QUEUE_ID_DCORE3_EDMA_1_3] = + GAUDI2_DCORE3_ENGINE_ID_EDMA_1, + [GAUDI2_QUEUE_ID_DCORE0_MME_0_0...GAUDI2_QUEUE_ID_DCORE0_MME_0_3] = + GAUDI2_DCORE0_ENGINE_ID_MME, + [GAUDI2_QUEUE_ID_DCORE1_MME_0_0...GAUDI2_QUEUE_ID_DCORE1_MME_0_3] = + GAUDI2_DCORE1_ENGINE_ID_MME, + [GAUDI2_QUEUE_ID_DCORE2_MME_0_0...GAUDI2_QUEUE_ID_DCORE2_MME_0_3] = + GAUDI2_DCORE2_ENGINE_ID_MME, + [GAUDI2_QUEUE_ID_DCORE3_MME_0_0...GAUDI2_QUEUE_ID_DCORE3_MME_0_3] = + GAUDI2_DCORE3_ENGINE_ID_MME, + [GAUDI2_QUEUE_ID_DCORE0_TPC_0_0...GAUDI2_QUEUE_ID_DCORE0_TPC_0_3] = + GAUDI2_DCORE0_ENGINE_ID_TPC_0, + [GAUDI2_QUEUE_ID_DCORE0_TPC_1_0...GAUDI2_QUEUE_ID_DCORE0_TPC_1_3] = + GAUDI2_DCORE0_ENGINE_ID_TPC_1, + [GAUDI2_QUEUE_ID_DCORE0_TPC_2_0...GAUDI2_QUEUE_ID_DCORE0_TPC_2_3] = + GAUDI2_DCORE0_ENGINE_ID_TPC_2, + [GAUDI2_QUEUE_ID_DCORE0_TPC_3_0...GAUDI2_QUEUE_ID_DCORE0_TPC_3_3] = + GAUDI2_DCORE0_ENGINE_ID_TPC_3, + [GAUDI2_QUEUE_ID_DCORE0_TPC_4_0...GAUDI2_QUEUE_ID_DCORE0_TPC_4_3] = + GAUDI2_DCORE0_ENGINE_ID_TPC_4, + [GAUDI2_QUEUE_ID_DCORE0_TPC_5_0...GAUDI2_QUEUE_ID_DCORE0_TPC_5_3] = + GAUDI2_DCORE0_ENGINE_ID_TPC_5, + [GAUDI2_QUEUE_ID_DCORE0_TPC_6_0...GAUDI2_QUEUE_ID_DCORE0_TPC_6_3] = + GAUDI2_DCORE0_ENGINE_ID_TPC_6, + [GAUDI2_QUEUE_ID_DCORE1_TPC_0_0...GAUDI2_QUEUE_ID_DCORE1_TPC_0_3] = + GAUDI2_DCORE1_ENGINE_ID_TPC_0, + [GAUDI2_QUEUE_ID_DCORE1_TPC_1_0...GAUDI2_QUEUE_ID_DCORE1_TPC_1_3] = + GAUDI2_DCORE1_ENGINE_ID_TPC_1, + [GAUDI2_QUEUE_ID_DCORE1_TPC_2_0...GAUDI2_QUEUE_ID_DCORE1_TPC_2_3] = + GAUDI2_DCORE1_ENGINE_ID_TPC_2, + [GAUDI2_QUEUE_ID_DCORE1_TPC_3_0...GAUDI2_QUEUE_ID_DCORE1_TPC_3_3] = + GAUDI2_DCORE1_ENGINE_ID_TPC_3, + [GAUDI2_QUEUE_ID_DCORE1_TPC_4_0...GAUDI2_QUEUE_ID_DCORE1_TPC_4_3] = + GAUDI2_DCORE1_ENGINE_ID_TPC_4, + [GAUDI2_QUEUE_ID_DCORE1_TPC_5_0...GAUDI2_QUEUE_ID_DCORE1_TPC_5_3] = + GAUDI2_DCORE1_ENGINE_ID_TPC_5, + [GAUDI2_QUEUE_ID_DCORE2_TPC_0_0...GAUDI2_QUEUE_ID_DCORE2_TPC_0_3] = + GAUDI2_DCORE2_ENGINE_ID_TPC_0, + [GAUDI2_QUEUE_ID_DCORE2_TPC_1_0...GAUDI2_QUEUE_ID_DCORE2_TPC_1_3] = + GAUDI2_DCORE2_ENGINE_ID_TPC_1, + [GAUDI2_QUEUE_ID_DCORE2_TPC_2_0...GAUDI2_QUEUE_ID_DCORE2_TPC_2_3] = + GAUDI2_DCORE2_ENGINE_ID_TPC_2, + [GAUDI2_QUEUE_ID_DCORE2_TPC_3_0...GAUDI2_QUEUE_ID_DCORE2_TPC_3_3] = + GAUDI2_DCORE2_ENGINE_ID_TPC_3, + [GAUDI2_QUEUE_ID_DCORE2_TPC_4_0...GAUDI2_QUEUE_ID_DCORE2_TPC_4_3] = + GAUDI2_DCORE2_ENGINE_ID_TPC_4, + [GAUDI2_QUEUE_ID_DCORE2_TPC_5_0...GAUDI2_QUEUE_ID_DCORE2_TPC_5_3] = + GAUDI2_DCORE2_ENGINE_ID_TPC_5, + [GAUDI2_QUEUE_ID_DCORE3_TPC_0_0...GAUDI2_QUEUE_ID_DCORE3_TPC_0_3] = + GAUDI2_DCORE3_ENGINE_ID_TPC_0, + [GAUDI2_QUEUE_ID_DCORE3_TPC_1_0...GAUDI2_QUEUE_ID_DCORE3_TPC_1_3] = + GAUDI2_DCORE3_ENGINE_ID_TPC_1, + [GAUDI2_QUEUE_ID_DCORE3_TPC_2_0...GAUDI2_QUEUE_ID_DCORE3_TPC_2_3] = + GAUDI2_DCORE3_ENGINE_ID_TPC_2, + [GAUDI2_QUEUE_ID_DCORE3_TPC_3_0...GAUDI2_QUEUE_ID_DCORE3_TPC_3_3] = + GAUDI2_DCORE3_ENGINE_ID_TPC_3, + [GAUDI2_QUEUE_ID_DCORE3_TPC_4_0...GAUDI2_QUEUE_ID_DCORE3_TPC_4_3] = + GAUDI2_DCORE3_ENGINE_ID_TPC_4, + [GAUDI2_QUEUE_ID_DCORE3_TPC_5_0...GAUDI2_QUEUE_ID_DCORE3_TPC_5_3] = + GAUDI2_DCORE3_ENGINE_ID_TPC_5, + [GAUDI2_QUEUE_ID_NIC_0_0...GAUDI2_QUEUE_ID_NIC_0_3] = GAUDI2_ENGINE_ID_NIC0_0, + [GAUDI2_QUEUE_ID_NIC_1_0...GAUDI2_QUEUE_ID_NIC_1_3] = GAUDI2_ENGINE_ID_NIC0_1, + [GAUDI2_QUEUE_ID_NIC_2_0...GAUDI2_QUEUE_ID_NIC_2_3] = GAUDI2_ENGINE_ID_NIC1_0, + [GAUDI2_QUEUE_ID_NIC_3_0...GAUDI2_QUEUE_ID_NIC_3_3] = GAUDI2_ENGINE_ID_NIC1_1, + [GAUDI2_QUEUE_ID_NIC_4_0...GAUDI2_QUEUE_ID_NIC_4_3] = GAUDI2_ENGINE_ID_NIC2_0, + [GAUDI2_QUEUE_ID_NIC_5_0...GAUDI2_QUEUE_ID_NIC_5_3] = GAUDI2_ENGINE_ID_NIC2_1, + [GAUDI2_QUEUE_ID_NIC_6_0...GAUDI2_QUEUE_ID_NIC_6_3] = GAUDI2_ENGINE_ID_NIC3_0, + [GAUDI2_QUEUE_ID_NIC_7_0...GAUDI2_QUEUE_ID_NIC_7_3] = GAUDI2_ENGINE_ID_NIC3_1, + [GAUDI2_QUEUE_ID_NIC_8_0...GAUDI2_QUEUE_ID_NIC_8_3] = GAUDI2_ENGINE_ID_NIC4_0, + [GAUDI2_QUEUE_ID_NIC_9_0...GAUDI2_QUEUE_ID_NIC_9_3] = GAUDI2_ENGINE_ID_NIC4_1, + [GAUDI2_QUEUE_ID_NIC_10_0...GAUDI2_QUEUE_ID_NIC_10_3] = GAUDI2_ENGINE_ID_NIC5_0, + [GAUDI2_QUEUE_ID_NIC_11_0...GAUDI2_QUEUE_ID_NIC_11_3] = GAUDI2_ENGINE_ID_NIC5_1, + [GAUDI2_QUEUE_ID_NIC_12_0...GAUDI2_QUEUE_ID_NIC_12_3] = GAUDI2_ENGINE_ID_NIC6_0, + [GAUDI2_QUEUE_ID_NIC_13_0...GAUDI2_QUEUE_ID_NIC_13_3] = GAUDI2_ENGINE_ID_NIC6_1, + [GAUDI2_QUEUE_ID_NIC_14_0...GAUDI2_QUEUE_ID_NIC_14_3] = GAUDI2_ENGINE_ID_NIC7_0, + [GAUDI2_QUEUE_ID_NIC_15_0...GAUDI2_QUEUE_ID_NIC_15_3] = GAUDI2_ENGINE_ID_NIC7_1, + [GAUDI2_QUEUE_ID_NIC_16_0...GAUDI2_QUEUE_ID_NIC_16_3] = GAUDI2_ENGINE_ID_NIC8_0, + [GAUDI2_QUEUE_ID_NIC_17_0...GAUDI2_QUEUE_ID_NIC_17_3] = GAUDI2_ENGINE_ID_NIC8_1, + [GAUDI2_QUEUE_ID_NIC_18_0...GAUDI2_QUEUE_ID_NIC_18_3] = GAUDI2_ENGINE_ID_NIC9_0, + [GAUDI2_QUEUE_ID_NIC_19_0...GAUDI2_QUEUE_ID_NIC_19_3] = GAUDI2_ENGINE_ID_NIC9_1, + [GAUDI2_QUEUE_ID_NIC_20_0...GAUDI2_QUEUE_ID_NIC_20_3] = GAUDI2_ENGINE_ID_NIC10_0, + [GAUDI2_QUEUE_ID_NIC_21_0...GAUDI2_QUEUE_ID_NIC_21_3] = GAUDI2_ENGINE_ID_NIC10_1, + [GAUDI2_QUEUE_ID_NIC_22_0...GAUDI2_QUEUE_ID_NIC_22_3] = GAUDI2_ENGINE_ID_NIC11_0, + [GAUDI2_QUEUE_ID_NIC_23_0...GAUDI2_QUEUE_ID_NIC_23_3] = GAUDI2_ENGINE_ID_NIC11_1, + [GAUDI2_QUEUE_ID_ROT_0_0...GAUDI2_QUEUE_ID_ROT_0_3] = GAUDI2_ENGINE_ID_ROT_0, + [GAUDI2_QUEUE_ID_ROT_1_0...GAUDI2_QUEUE_ID_ROT_1_3] = GAUDI2_ENGINE_ID_ROT_1, +}; + const u32 gaudi2_qm_blocks_bases[GAUDI2_QUEUE_ID_SIZE] = { [GAUDI2_QUEUE_ID_PDMA_0_0] = mmPDMA0_QM_BASE, [GAUDI2_QUEUE_ID_PDMA_0_1] = mmPDMA0_QM_BASE, @@ -7753,7 +7858,7 @@ static bool gaudi2_handle_ecc_event(struct hl_device *hdev, u16 event_type, return !!ecc_data->is_critical; } -static void print_lower_qman_data_on_err(struct hl_device *hdev, u64 qman_base) +static void handle_lower_qman_data_on_err(struct hl_device *hdev, u64 qman_base, u64 event_mask) { u32 lo, hi, cq_ptr_size, arc_cq_ptr_size; u64 cq_ptr, arc_cq_ptr, cp_current_inst; @@ -7775,10 +7880,22 @@ static void print_lower_qman_data_on_err(struct hl_device *hdev, u64 qman_base) dev_info(hdev->dev, "LowerQM. CQ: {ptr %#llx, size %u}, ARC_CQ: {ptr %#llx, size %u}, CP: {instruction %#llx}\n", cq_ptr, cq_ptr_size, arc_cq_ptr, arc_cq_ptr_size, cp_current_inst); + + if (event_mask & HL_NOTIFIER_EVENT_UNDEFINED_OPCODE) { + if (arc_cq_ptr) { + hdev->captured_err_info.undef_opcode.cq_addr = arc_cq_ptr; + hdev->captured_err_info.undef_opcode.cq_size = arc_cq_ptr_size; + } else { + hdev->captured_err_info.undef_opcode.cq_addr = cq_ptr; + hdev->captured_err_info.undef_opcode.cq_size = cq_ptr_size; + } + + hdev->captured_err_info.undef_opcode.stream_id = QMAN_STREAMS; + } } static int gaudi2_handle_qman_err_generic(struct hl_device *hdev, u16 event_type, - u64 qman_base, u32 qid_base) + u64 qman_base, u32 qid_base, u64 *event_mask) { u32 i, j, glbl_sts_val, arb_err_val, num_error_causes, error_count = 0; u64 glbl_sts_addr, arb_err_addr; @@ -7812,8 +7929,22 @@ static int gaudi2_handle_qman_err_generic(struct hl_device *hdev, u16 event_type error_count++; } - if (i == QMAN_STREAMS) - print_lower_qman_data_on_err(hdev, qman_base); + if (i == QMAN_STREAMS && error_count) { + /* check for undefined opcode */ + if (glbl_sts_val & PDMA0_QM_GLBL_ERR_STS_CP_UNDEF_CMD_ERR_MASK && + hdev->captured_err_info.undef_opcode.write_enable) { + memset(&hdev->captured_err_info.undef_opcode, 0, + sizeof(hdev->captured_err_info.undef_opcode)); + + hdev->captured_err_info.undef_opcode.write_enable = false; + hdev->captured_err_info.undef_opcode.timestamp = ktime_get(); + hdev->captured_err_info.undef_opcode.engine_id = + gaudi2_queue_id_to_engine_id[qid_base]; + *event_mask |= HL_NOTIFIER_EVENT_UNDEFINED_OPCODE; + } + + handle_lower_qman_data_on_err(hdev, qman_base, *event_mask); + } } arb_err_val = RREG32(arb_err_addr); @@ -8475,7 +8606,8 @@ static int gaudi2_handle_qman_err(struct hl_device *hdev, u16 event_type, u64 *e return 0; } - error_count = gaudi2_handle_qman_err_generic(hdev, event_type, qman_base, qid_base); + error_count = gaudi2_handle_qman_err_generic(hdev, event_type, qman_base, + qid_base, event_mask); /* Handle EDMA QM SEI here because there is no AXI error response event for EDMA */ if (event_type >= GAUDI2_EVENT_HDMA2_QM && event_type <= GAUDI2_EVENT_HDMA5_QM) { From patchwork Thu Jun 8 13:38:41 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13272346 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 480BAC7EE23 for ; Thu, 8 Jun 2023 13:39:07 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 3439610E5BE; Thu, 8 Jun 2023 13:39:03 +0000 (UTC) Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by gabe.freedesktop.org (Postfix) with ESMTPS id 3772C10E5BE for ; Thu, 8 Jun 2023 13:39:00 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id A928264DA2; Thu, 8 Jun 2023 13:38:59 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 95782C4339C; Thu, 8 Jun 2023 13:38:58 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1686231539; bh=m20rZvEi8AuqMG/dcDfgEVMYA5IkHyR0lJGdvBJvcGc=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=UUV3+wrggVBmrTesOw0SLSijlAjdKnJ/HbCPcAc/NYbpZAl2r4Tr5j69lE1BnwMWB rwOhU6UMPpwMdLIeWCf3K18U/NyqUDeMrQs4Z+jckABItDJMjv8d5oVJIRULBUZZ1/ lt98H/lUhRT3Zhce6R6+CWXzJzEKLHNV0HhskshYzxC/X9aWFoPUuAVc0weT3Pr3qk Q9xXRdSunXgWNoydTcysFG1aoaUiyqs474J59kLCsXr7XGM/6jBm3IoYiSMgbaBpvt s4SmpuqSAJNrGukrxlVgjo+ScgXqpg2PvCT8fB3v9wI87PiDbVgpmtNlNjvvlJJL+w nzwRL0gHFDCIw== From: Oded Gabbay To: dri-devel@lists.freedesktop.org Subject: [PATCH 04/12] accel/habanalabs: print task name and request code upon ioctl failure Date: Thu, 8 Jun 2023 16:38:41 +0300 Message-Id: <20230608133849.2739411-4-ogabbay@kernel.org> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230608133849.2739411-1-ogabbay@kernel.org> References: <20230608133849.2739411-1-ogabbay@kernel.org> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Tomer Tayar Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Tomer Tayar When an ioctl fails, it is useful to know what is the task command name and the full ioctl request code, in addition to the task pid and the ioctl number. Add the additional information to the relevant debug error prints. Signed-off-by: Tomer Tayar Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- .../habanalabs/common/habanalabs_ioctl.c | 24 +++++++++++++------ 1 file changed, 17 insertions(+), 7 deletions(-) diff --git a/drivers/accel/habanalabs/common/habanalabs_ioctl.c b/drivers/accel/habanalabs/common/habanalabs_ioctl.c index 6a45a92344e9..549b2518fae0 100644 --- a/drivers/accel/habanalabs/common/habanalabs_ioctl.c +++ b/drivers/accel/habanalabs/common/habanalabs_ioctl.c @@ -1194,9 +1194,13 @@ static long _hl_ioctl(struct file *filep, unsigned int cmd, unsigned long arg, retcode = -EFAULT; out_err: - if (retcode) - dev_dbg_ratelimited(dev, "error in ioctl: pid=%d, cmd=0x%02x, nr=0x%02x\n", - task_pid_nr(current), cmd, nr); + if (retcode) { + char task_comm[TASK_COMM_LEN]; + + dev_dbg_ratelimited(dev, + "error in ioctl: pid=%d, comm=\"%s\", cmd=%#010x, nr=%#04x\n", + task_pid_nr(current), get_task_comm(task_comm, current), cmd, nr); + } if (kdata != stack_kdata) kfree(kdata); @@ -1219,8 +1223,11 @@ long hl_ioctl(struct file *filep, unsigned int cmd, unsigned long arg) if ((nr >= HL_COMMAND_START) && (nr < HL_COMMAND_END)) { ioctl = &hl_ioctls[nr]; } else { - dev_dbg_ratelimited(hdev->dev, "invalid ioctl: pid=%d, nr=0x%02x\n", - task_pid_nr(current), nr); + char task_comm[TASK_COMM_LEN]; + + dev_dbg_ratelimited(hdev->dev, + "invalid ioctl: pid=%d, comm=\"%s\", cmd=%#010x, nr=%#04x\n", + task_pid_nr(current), get_task_comm(task_comm, current), cmd, nr); return -ENOTTY; } @@ -1242,8 +1249,11 @@ long hl_ioctl_control(struct file *filep, unsigned int cmd, unsigned long arg) if (nr == _IOC_NR(HL_IOCTL_INFO)) { ioctl = &hl_ioctls_control[nr]; } else { - dev_dbg_ratelimited(hdev->dev_ctrl, "invalid ioctl: pid=%d, nr=0x%02x\n", - task_pid_nr(current), nr); + char task_comm[TASK_COMM_LEN]; + + dev_dbg_ratelimited(hdev->dev_ctrl, + "invalid ioctl: pid=%d, comm=\"%s\", cmd=%#010x, nr=%#04x\n", + task_pid_nr(current), get_task_comm(task_comm, current), cmd, nr); return -ENOTTY; } From patchwork Thu Jun 8 13:38:42 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13272348 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 3ACDFC7EE23 for ; Thu, 8 Jun 2023 13:39:12 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 842D910E5BF; Thu, 8 Jun 2023 13:39:03 +0000 (UTC) Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by gabe.freedesktop.org (Postfix) with ESMTPS id 0407810E5BE for ; Thu, 8 Jun 2023 13:39:02 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 7F876615C6; Thu, 8 Jun 2023 13:39:01 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 0F7B8C433EF; Thu, 8 Jun 2023 13:38:59 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1686231540; bh=uLSxa7ges5FDdOtA4rzheK23+sZW+cK1JCKXn98hFxg=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=QihLOvZggdsfgRXU+Czr3czPjUm9Wx5PokcH9VXuLkYB2pZqYTfqUl2egftx2h888 jCCu33SjNUMhK7pbgyPWU4ztpj6JMo1AFU+eAb5Ru8Ny0qtMMsayZSmef0hp9u7xyx 3+ajmJ5l9E56f9qa8+LWHJ94f0FiMKxqpAAyBa4S043JQuEb+ZGPcJ5jjFgbzjtkyL vi2xfQH8afGorjCd2Q6XjPaCmAyF3bYq/ywR9CgPdei36v1hI1cGHl2cK3ztg8nUcT uqAjpytTUNiWlYHsE1sGItEV1FAttkjVllbnRlT0RZpRBgeeN7jQ6FpLqi3fXJe+/X VMIoXASNeZAhg== From: Oded Gabbay To: dri-devel@lists.freedesktop.org Subject: [PATCH 05/12] accel/habanalabs: print task name upon creation of a user context Date: Thu, 8 Jun 2023 16:38:42 +0300 Message-Id: <20230608133849.2739411-5-ogabbay@kernel.org> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230608133849.2739411-1-ogabbay@kernel.org> References: <20230608133849.2739411-1-ogabbay@kernel.org> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Tomer Tayar Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Tomer Tayar It is useful for debug to know which user process have acquired the device. Add this info to the relevant debug print, in addition to the already printed user context's ASID. Signed-off-by: Tomer Tayar Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- drivers/accel/habanalabs/common/context.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/accel/habanalabs/common/context.c b/drivers/accel/habanalabs/common/context.c index 9c8b1b37b510..0a53f7154739 100644 --- a/drivers/accel/habanalabs/common/context.c +++ b/drivers/accel/habanalabs/common/context.c @@ -102,7 +102,7 @@ static void hl_ctx_fini(struct hl_ctx *ctx) kfree(ctx->cs_pending); if (ctx->asid != HL_KERNEL_ASID_ID) { - dev_dbg(hdev->dev, "closing user context %d\n", ctx->asid); + dev_dbg(hdev->dev, "closing user context, asid=%u\n", ctx->asid); /* The engines are stopped as there is no executing CS, but the * Coresight might be still working by accessing addresses @@ -198,6 +198,7 @@ int hl_ctx_create(struct hl_device *hdev, struct hl_fpriv *hpriv) int hl_ctx_init(struct hl_device *hdev, struct hl_ctx *ctx, bool is_kernel_ctx) { + char task_comm[TASK_COMM_LEN]; int rc = 0, i; ctx->hdev = hdev; @@ -267,7 +268,8 @@ int hl_ctx_init(struct hl_device *hdev, struct hl_ctx *ctx, bool is_kernel_ctx) hl_encaps_sig_mgr_init(&ctx->sig_mgr); - dev_dbg(hdev->dev, "create user context %d\n", ctx->asid); + dev_dbg(hdev->dev, "create user context, comm=\"%s\", asid=%u\n", + get_task_comm(task_comm, current), ctx->asid); } return 0; From patchwork Thu Jun 8 13:38:43 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13272351 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id B3724C7EE37 for ; Thu, 8 Jun 2023 13:39:17 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 15EB110E5C4; Thu, 8 Jun 2023 13:39:09 +0000 (UTC) Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by gabe.freedesktop.org (Postfix) with ESMTPS id ABA3A10E5C1 for ; Thu, 8 Jun 2023 13:39:03 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 317B4615C6; Thu, 8 Jun 2023 13:39:03 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 7BE1DC4339B; Thu, 8 Jun 2023 13:39:01 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1686231542; bh=7o/jqd3nYTliX8pgqNtbexstuTkFQBMWs/7P4HxDNOU=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=iuna0RxBdUBRq45+ePC27905+GSw57mf24Darnrp8DTNgZG03iKVeCa8c8XlpJTc+ rRLw1LtUDLzS7+zPGRe/wtc4oIJj2Q48Glx4SAOXhZsdNY86buZR+bViZEpL2MmmVV 1HbBM0DRV5mDpoTaCF+LhY+DHvUvknMlT5yYNObmwXG2iv0nKrwiaUciPhl2JKMwEG 7D0qDPanM5hlhc2e4SsFFWVKE/21ycMFP/31w05p7lNX6HLtZY9z1YJw/t0l1JAosn mrIXA9ehQr3gvwkMvPmnYUQF8kSyHLojwKikquwmmJoMmxoqCiaubpBpzSh3mTfU/a /Dwla0kyGrZvg== From: Oded Gabbay To: dri-devel@lists.freedesktop.org Subject: [PATCH 06/12] accel/habanalabs: set device status 'malfunction' while in rmmod Date: Thu, 8 Jun 2023 16:38:43 +0300 Message-Id: <20230608133849.2739411-6-ogabbay@kernel.org> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230608133849.2739411-1-ogabbay@kernel.org> References: <20230608133849.2739411-1-ogabbay@kernel.org> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Koby Elbaz Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Koby Elbaz hl_device_status() returns the status of an acquired device. If a device is going down (following an rmmod cmd), it should be marked as an unusable/malfunctioning device, and hence should not be acquired. However, since this was not the case so far (i.e., a device going down would inaccurately return 'in reset' status allowing the user to acquire the device) it introduced a bug where as part of a reset flow, the driver could not kill processes that have not run yet, and since those processes aren't blocked from reacquiring a device, we get eventually a new flow of a driver attempting to kill all processes in a list that can't be ever really empty. Signed-off-by: Koby Elbaz Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- drivers/accel/habanalabs/common/device.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c index 993305871292..5973e4d64e19 100644 --- a/drivers/accel/habanalabs/common/device.c +++ b/drivers/accel/habanalabs/common/device.c @@ -315,7 +315,9 @@ enum hl_device_status hl_device_status(struct hl_device *hdev) { enum hl_device_status status; - if (hdev->reset_info.in_reset) { + if (hdev->device_fini_pending) { + status = HL_DEVICE_STATUS_MALFUNCTION; + } else if (hdev->reset_info.in_reset) { if (hdev->reset_info.in_compute_reset) status = HL_DEVICE_STATUS_IN_RESET_AFTER_DEVICE_RELEASE; else @@ -343,9 +345,9 @@ bool hl_device_operational(struct hl_device *hdev, *status = current_status; switch (current_status) { + case HL_DEVICE_STATUS_MALFUNCTION: case HL_DEVICE_STATUS_IN_RESET: case HL_DEVICE_STATUS_IN_RESET_AFTER_DEVICE_RELEASE: - case HL_DEVICE_STATUS_MALFUNCTION: case HL_DEVICE_STATUS_NEEDS_RESET: return false; case HL_DEVICE_STATUS_OPERATIONAL: From patchwork Thu Jun 8 13:38:44 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13272350 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 06835C7EE23 for ; Thu, 8 Jun 2023 13:39:16 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id CBA8D10E5C2; Thu, 8 Jun 2023 13:39:08 +0000 (UTC) Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by gabe.freedesktop.org (Postfix) with ESMTPS id 07F3710E5C1 for ; Thu, 8 Jun 2023 13:39:05 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 60B6B64D9A; Thu, 8 Jun 2023 13:39:04 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id E6753C433D2; Thu, 8 Jun 2023 13:39:02 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1686231543; bh=4V9m4mnF2OraU5/DuHOxAsUUy7WLcLlakAg0H8UcBek=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=QBAjWrw9t0VZ7bJupREs/xh4lIrVcpc01LO4OajXOBycteW92tbfuulth/PDozHUy BeHPq7iwkdFvbxVjfICVntevvwnSLjDDOFUxdhLVx3YUjWa0Csz56LvdQ0aaKlsgLh mF5zGjVaz1584IhkscRuH0lNq3dHr9kxbrt4mt4vzPdDDTLIEzfmc8c/RboYH2jk3J 0AjYC1S4RfbsOxseHrCRS8kFi8iQewfWkIUBu7EVZ8WNwimYm90qG1QKlvx4PeVnHC QWiV5PnpRfV6DkphnqQFemH0teveu30wM+Jyt9z9IN6PLD067uwZcEeaWQbIBhcz5v LyslruGqf2Nfw== From: Oded Gabbay To: dri-devel@lists.freedesktop.org Subject: [PATCH 07/12] accel/habanalabs: stop fetching MME SBTE error cause Date: Thu, 8 Jun 2023 16:38:44 +0300 Message-Id: <20230608133849.2739411-7-ogabbay@kernel.org> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230608133849.2739411-1-ogabbay@kernel.org> References: <20230608133849.2739411-1-ogabbay@kernel.org> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Ofir Bitton Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Ofir Bitton Because in this case we have only a single possible cause, we can safely stop fetching the cause from firmware. Signed-off-by: Ofir Bitton Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- drivers/accel/habanalabs/gaudi2/gaudi2.c | 31 ++++++------------------ 1 file changed, 8 insertions(+), 23 deletions(-) diff --git a/drivers/accel/habanalabs/gaudi2/gaudi2.c b/drivers/accel/habanalabs/gaudi2/gaudi2.c index ed3b0b6225d2..899b1c4b53f6 100644 --- a/drivers/accel/habanalabs/gaudi2/gaudi2.c +++ b/drivers/accel/habanalabs/gaudi2/gaudi2.c @@ -66,7 +66,6 @@ #define GAUDI2_NUM_OF_TPC_INTR_CAUSE 31 #define GAUDI2_NUM_OF_DEC_ERR_CAUSE 25 #define GAUDI2_NUM_OF_MME_ERR_CAUSE 16 -#define GAUDI2_NUM_OF_MME_SBTE_ERR_CAUSE 5 #define GAUDI2_NUM_OF_MME_WAP_ERR_CAUSE 7 #define GAUDI2_NUM_OF_DMA_CORE_INTR_CAUSE 8 #define GAUDI2_NUM_OF_MMU_SPI_SEI_CAUSE 19 @@ -916,14 +915,6 @@ static const char * const guadi2_mme_error_cause[GAUDI2_NUM_OF_MME_ERR_CAUSE] = "sbte_prtn_intr_4", }; -static const char * const guadi2_mme_sbte_error_cause[GAUDI2_NUM_OF_MME_SBTE_ERR_CAUSE] = { - "i0", - "i1", - "i2", - "i3", - "i4", -}; - static const char * const guadi2_mme_wap_error_cause[GAUDI2_NUM_OF_MME_WAP_ERR_CAUSE] = { "WBC ERR RESP_0", "WBC ERR RESP_1", @@ -8781,21 +8772,16 @@ static int gaudi2_handle_mme_err(struct hl_device *hdev, u8 mme_index, u16 event return error_count; } -static int gaudi2_handle_mme_sbte_err(struct hl_device *hdev, u16 event_type, - u64 intr_cause_data) +static int gaudi2_handle_mme_sbte_err(struct hl_device *hdev, u16 event_type) { - int i, error_count = 0; - - for (i = 0 ; i < GAUDI2_NUM_OF_MME_SBTE_ERR_CAUSE ; i++) - if (intr_cause_data & BIT(i)) { - gaudi2_print_event(hdev, event_type, true, - "err cause: %s", guadi2_mme_sbte_error_cause[i]); - error_count++; - } - + /* + * We have a single error cause here but the report mechanism is + * buggy. Hence there is no good reason to fetch the cause so we + * just check for glbl_errors and exit. + */ hl_check_for_glbl_errors(hdev); - return error_count; + return GAUDI2_NA_EVENT_CAUSE; } static int gaudi2_handle_mme_wap_err(struct hl_device *hdev, u8 mme_index, u16 event_type, @@ -9856,8 +9842,7 @@ static void gaudi2_handle_eqe(struct hl_device *hdev, struct hl_eq_entry *eq_ent case GAUDI2_EVENT_MME1_SBTE0_AXI_ERR_RSP ... GAUDI2_EVENT_MME1_SBTE4_AXI_ERR_RSP: case GAUDI2_EVENT_MME2_SBTE0_AXI_ERR_RSP ... GAUDI2_EVENT_MME2_SBTE4_AXI_ERR_RSP: case GAUDI2_EVENT_MME3_SBTE0_AXI_ERR_RSP ... GAUDI2_EVENT_MME3_SBTE4_AXI_ERR_RSP: - error_count = gaudi2_handle_mme_sbte_err(hdev, event_type, - le64_to_cpu(eq_entry->intr_cause.intr_cause_data)); + error_count = gaudi2_handle_mme_sbte_err(hdev, event_type); event_mask |= HL_NOTIFIER_EVENT_USER_ENGINE_ERR; break; case GAUDI2_EVENT_VM0_ALARM_A ... GAUDI2_EVENT_VM3_ALARM_B: From patchwork Thu Jun 8 13:38:45 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13272349 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 57D07C7EE25 for ; Thu, 8 Jun 2023 13:39:14 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id CF3B810E5C1; Thu, 8 Jun 2023 13:39:07 +0000 (UTC) Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by gabe.freedesktop.org (Postfix) with ESMTPS id 5DF5610E5C1 for ; Thu, 8 Jun 2023 13:39:06 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id D3476615C6; Thu, 8 Jun 2023 13:39:05 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 5D1FDC4339B; Thu, 8 Jun 2023 13:39:04 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1686231545; bh=7ZHCiTls9MnkNYtCwMXWTA4PmUUGgVRiDIYYGQ+f64k=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=kAjJgsfWNY00ZPYiae20gv7GZyxbGVqaJTFmIylH30Ija2+1gkhb2uUEpZ/E7HkmN +9ivbLoebCwr+j3w+lhwwzS+K0EFDR4+ZqAfWATD7NuP7dxpmAFXhE1N1dzzF5ccrO 2cIft5/JSHHwKjJsSlWSt+8rLeHWA9vVsuJs0dtjHCZnPXpHMQxAdFiOwDeJeBNArj xBvLvcrjmlOZirD8AqD7nt9FvAt/79qDkK7K0B6IiFP5RQikki2P7R4LUeZwcy9odM NPmDLgvS213aWf91G1/gdhsyocrR5eYJZFUrw+O/vNxT7e6IgTTRQjxl2zRS2BpmVm asJw8SoMv45XA== From: Oded Gabbay To: dri-devel@lists.freedesktop.org Subject: [PATCH 08/12] accel/habanalabs: handle arc farm razwi Date: Thu, 8 Jun 2023 16:38:45 +0300 Message-Id: <20230608133849.2739411-8-ogabbay@kernel.org> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230608133849.2739411-1-ogabbay@kernel.org> References: <20230608133849.2739411-1-ogabbay@kernel.org> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Dani Liberman Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Dani Liberman Implement razwi handling for arc farm and add it to arc farm sei event handler. Signed-off-by: Dani Liberman Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- drivers/accel/habanalabs/gaudi2/gaudi2.c | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/drivers/accel/habanalabs/gaudi2/gaudi2.c b/drivers/accel/habanalabs/gaudi2/gaudi2.c index 899b1c4b53f6..1085215ac51d 100644 --- a/drivers/accel/habanalabs/gaudi2/gaudi2.c +++ b/drivers/accel/habanalabs/gaudi2/gaudi2.c @@ -2097,7 +2097,8 @@ enum razwi_event_sources { RAZWI_PDMA, RAZWI_NIC, RAZWI_DEC, - RAZWI_ROT + RAZWI_ROT, + RAZWI_ARC_FARM }; struct hbm_mc_error_causes { @@ -8049,6 +8050,9 @@ static enum gaudi2_engine_id gaudi2_razwi_calc_engine_id(struct hl_device *hdev, case RAZWI_ROT: return GAUDI2_ENGINE_ID_ROT_0 + module_idx; + case RAZWI_ARC_FARM: + return GAUDI2_ENGINE_ID_ARC_FARM; + default: return GAUDI2_ENGINE_ID_SIZE; } @@ -8158,6 +8162,11 @@ static void gaudi2_ack_module_razwi_event_handler(struct hl_device *hdev, lbw_rtr_id = gaudi2_rot_initiator_lbw_rtr_id[module_idx]; sprintf(initiator_name, "ROT_%u", module_idx); break; + case RAZWI_ARC_FARM: + lbw_rtr_id = DCORE1_RTR5; + hbw_rtr_id = DCORE1_RTR7; + sprintf(initiator_name, "ARC_FARM_%u", module_idx); + break; default: return; } @@ -8611,7 +8620,7 @@ static int gaudi2_handle_qman_err(struct hl_device *hdev, u16 event_type, u64 *e return error_count; } -static int gaudi2_handle_arc_farm_sei_err(struct hl_device *hdev, u16 event_type) +static int gaudi2_handle_arc_farm_sei_err(struct hl_device *hdev, u16 event_type, u64 *event_mask) { u32 i, sts_val, sts_clr_val, error_count = 0, arc_farm; @@ -8633,6 +8642,7 @@ static int gaudi2_handle_arc_farm_sei_err(struct hl_device *hdev, u16 event_type sts_clr_val); } + gaudi2_ack_module_razwi_event_handler(hdev, RAZWI_ARC_FARM, 0, 0, event_mask); hl_check_for_glbl_errors(hdev); return error_count; @@ -9619,7 +9629,7 @@ static void gaudi2_handle_eqe(struct hl_device *hdev, struct hl_eq_entry *eq_ent break; case GAUDI2_EVENT_ARC_AXI_ERROR_RESPONSE_0: - error_count = gaudi2_handle_arc_farm_sei_err(hdev, event_type); + error_count = gaudi2_handle_arc_farm_sei_err(hdev, event_type, &event_mask); event_mask |= HL_NOTIFIER_EVENT_USER_ENGINE_ERR; break; From patchwork Thu Jun 8 13:38:46 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13272353 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 52B08C7EE25 for ; Thu, 8 Jun 2023 13:39:20 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 1AB7F10E5C7; Thu, 8 Jun 2023 13:39:15 +0000 (UTC) Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by gabe.freedesktop.org (Postfix) with ESMTPS id E332310E5C2 for ; Thu, 8 Jun 2023 13:39:07 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 650DD64DA2; Thu, 8 Jun 2023 13:39:07 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id C7C5CC433D2; Thu, 8 Jun 2023 13:39:05 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1686231546; bh=IiOWk2xtuYJLqnX1FQC+Z12sSSxJkYiPc4dMNUBLizk=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=QWXtx0oEzsqO7K89qlP3KS4Kd6HFCvijBMRcG6jVOaMXfDCcbihHpGqwnN4QDUs0H UhTONYoksIEqIGbL/NyM7BzzlRjw3KK6GjvraEJ6BYQL0HMQpO0h/AFlBg3eEBg5pd H1VLH0WaJr5BNvt40kSKLjwbdy8RQ/oYETjhPP/+JBeMIAm/SvV9N8mRCVMwH29Eeu ZapaN2ikmVTefa61nf/RWCafTjWSCFKcwqCfneCMqETKFtEW145YVFDsOg+nGsRSoX mmJhQq8M5He7MyTdjO8uQUfAzYw5dxkcADwCxs7pCE5GYho35F6j7mJyM2hKF2KOLZ dqM6tSuzx9CZw== From: Oded Gabbay To: dri-devel@lists.freedesktop.org Subject: [PATCH 09/12] accel/habanalabs: fix standalone preboot descriptor request Date: Thu, 8 Jun 2023 16:38:46 +0300 Message-Id: <20230608133849.2739411-9-ogabbay@kernel.org> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230608133849.2739411-1-ogabbay@kernel.org> References: <20230608133849.2739411-1-ogabbay@kernel.org> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: farah kassabri Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: farah kassabri The preboot used to statically allocate memory for the comms descriptor on the device memory when driver requested the descriptor information. Now preboot moved to dynamic memory allocation where it wants to check the size the driver expects vs. what the f/w expects. Note there are no backward compatibility issues as older f/w versions simply ignore this value. Signed-off-by: farah kassabri Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- drivers/accel/habanalabs/common/firmware_if.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/accel/habanalabs/common/firmware_if.c b/drivers/accel/habanalabs/common/firmware_if.c index acbc1a6b5cb1..370508e98854 100644 --- a/drivers/accel/habanalabs/common/firmware_if.c +++ b/drivers/accel/habanalabs/common/firmware_if.c @@ -2743,7 +2743,8 @@ static int hl_fw_dynamic_init_cpu(struct hl_device *hdev, if (!(hdev->fw_components & FW_TYPE_BOOT_CPU)) { struct lkd_fw_binning_info *binning_info; - rc = hl_fw_dynamic_request_descriptor(hdev, fw_loader, 0); + rc = hl_fw_dynamic_request_descriptor(hdev, fw_loader, + sizeof(struct lkd_msg_comms)); if (rc) goto protocol_err; From patchwork Thu Jun 8 13:38:47 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13272355 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 8346AC7EE23 for ; Thu, 8 Jun 2023 13:39:22 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 3DB6610E5C6; Thu, 8 Jun 2023 13:39:16 +0000 (UTC) Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by gabe.freedesktop.org (Postfix) with ESMTPS id 37FBA10E5C5 for ; Thu, 8 Jun 2023 13:39:09 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id B380B63C24; Thu, 8 Jun 2023 13:39:08 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 3F0BFC433A1; Thu, 8 Jun 2023 13:39:07 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1686231548; bh=tPgS4W6BVvhcsvRzGf8flPuxmFCvKCtSMb1q1vTZ9rI=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=CI0C1u4sOVhSkbAlwQP2xniWuFTxhUruT01V8LMSa/7+QuNrgmVhOI+6CPfrJKXt3 mtt+c1FOxV/knUgnJ/rjh/R09L8k0vioy64ppWkzPc0fMNmr6+Z3kk8BYe8/7cgOI7 Z2A+Ke6dpNJYr9/JguZcsU2tMAHbwYzxtv8y/lwupQPpkPvRqsc4inphZJZ4NkfSYo i5cVD0P9O+noXEeFz04caL2ZSwqSrHxpjXMfZnrW8d4GyFB1YUWdr4D0zVfHJScsH2 4QQYY/JsUBm6+53kflBNMT+ltLdTKpgRRXYQAGPDHEpvXa43O6jfPWg0xGa7mFV7jh crznnwkoI9ZOA== From: Oded Gabbay To: dri-devel@lists.freedesktop.org Subject: [PATCH 10/12] accel/habanalabs: print return code when process termination fails Date: Thu, 8 Jun 2023 16:38:47 +0300 Message-Id: <20230608133849.2739411-10-ogabbay@kernel.org> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230608133849.2739411-1-ogabbay@kernel.org> References: <20230608133849.2739411-1-ogabbay@kernel.org> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Koby Elbaz Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Koby Elbaz As part of driver teardown, we attempt to kill all user processes. It shouldn't fail, but if it does we want to print the error code that the kapi returned to us. Signed-off-by: Koby Elbaz Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- drivers/accel/habanalabs/common/device.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c index 5973e4d64e19..764d40c0d666 100644 --- a/drivers/accel/habanalabs/common/device.c +++ b/drivers/accel/habanalabs/common/device.c @@ -1352,7 +1352,8 @@ static int device_kill_open_processes(struct hl_device *hdev, u32 timeout, bool * we don't need to kill it. */ dev_dbg(hdev->dev, - "Can't get task struct for user process, assuming process was killed from outside the driver\n"); + "Can't get task struct for user process %d, assuming process was killed from outside the driver\n", + pid_nr(hpriv->taskpid)); } } @@ -2438,14 +2439,14 @@ void hl_device_fini(struct hl_device *hdev) hdev->process_kill_trial_cnt = 0; rc = device_kill_open_processes(hdev, HL_WAIT_PROCESS_KILL_ON_DEVICE_FINI, false); if (rc) { - dev_crit(hdev->dev, "Failed to kill all open processes\n"); + dev_crit(hdev->dev, "Failed to kill all open processes (%d)\n", rc); device_disable_open_processes(hdev, false); } hdev->process_kill_trial_cnt = 0; rc = device_kill_open_processes(hdev, 0, true); if (rc) { - dev_crit(hdev->dev, "Failed to kill all control device open processes\n"); + dev_crit(hdev->dev, "Failed to kill all control device open processes (%d)\n", rc); device_disable_open_processes(hdev, true); } From patchwork Thu Jun 8 13:38:48 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13272352 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 24879C7EE23 for ; Thu, 8 Jun 2023 13:39:19 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 3702410E5C3; Thu, 8 Jun 2023 13:39:13 +0000 (UTC) Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by gabe.freedesktop.org (Postfix) with ESMTPS id 2B75210E5C3 for ; Thu, 8 Jun 2023 13:39:11 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 990E364DA5; Thu, 8 Jun 2023 13:39:10 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id AD758C433A4; Thu, 8 Jun 2023 13:39:08 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1686231549; bh=Zpq/Y9o+eInggKNe8WO7TUg/6TozBHClv8sQmA9lFIg=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=RO8bTPCVN5s7yQf4Il5GoOb2rNzQyuZKpKA9lzvj1W7fIxAKBo3GbJYrrIMOdpyOF CUA/barHBnEDNx77TeW4jp1biYmXZ3mEF0jdjpUMe9kn+vLrrP1ENSuH7GVSB5HJQB lB3BE46GhbibX73kM4nouKd6y1wwYYflnwa795fPPvMRz9vLWpCfxlk6TSbcd7/34/ NtQHYbpSYoywR01t8aYwJkLLbopv6DRTfVm4prhSjqqfe8/0x3ecS/Pp/3vEx2nWjN CbjfzJ5QMLpmivCQM1brS3fKt2j76rJwfmGsIpB3TjJlbLHk4uUZwL8Qr6zNOZUkst ZYqI3mkjOv6Kw== From: Oded Gabbay To: dri-devel@lists.freedesktop.org Subject: [PATCH 11/12] accel/habanalabs: call put_pid after hpriv list is updated Date: Thu, 8 Jun 2023 16:38:48 +0300 Message-Id: <20230608133849.2739411-11-ogabbay@kernel.org> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230608133849.2739411-1-ogabbay@kernel.org> References: <20230608133849.2739411-1-ogabbay@kernel.org> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Koby Elbaz Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Koby Elbaz Because we might still be using related resources, decrementing PID's reference count should be done at later stages of the device release. A good place is right after the representing private structure is removed from LKD's list. Signed-off-by: Koby Elbaz Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- drivers/accel/habanalabs/common/device.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c index 764d40c0d666..c61a58a2e622 100644 --- a/drivers/accel/habanalabs/common/device.c +++ b/drivers/accel/habanalabs/common/device.c @@ -408,8 +408,6 @@ static void hpriv_release(struct kref *ref) hdev->asic_funcs->send_device_activity(hdev, false); - put_pid(hpriv->taskpid); - hl_debugfs_remove_file(hpriv); mutex_destroy(&hpriv->ctx_lock); @@ -448,6 +446,8 @@ static void hpriv_release(struct kref *ref) list_del(&hpriv->dev_node); mutex_unlock(&hdev->fpriv_list_lock); + put_pid(hpriv->taskpid); + if (reset_device) { hl_device_reset(hdev, HL_DRV_RESET_DEV_RELEASE); } else { From patchwork Thu Jun 8 13:38:49 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oded Gabbay X-Patchwork-Id: 13272354 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 76358C7EE37 for ; Thu, 8 Jun 2023 13:39:21 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id D0AEA10E5C8; Thu, 8 Jun 2023 13:39:15 +0000 (UTC) Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by gabe.freedesktop.org (Postfix) with ESMTPS id 2890310E5C5 for ; Thu, 8 Jun 2023 13:39:12 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 9CF5564D9C; Thu, 8 Jun 2023 13:39:11 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 2530FC433D2; Thu, 8 Jun 2023 13:39:09 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1686231551; bh=VbBq693Ipzqt8mL4Z59q8eNMFa08uSDB2Eggc0CMs5M=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=ggXTGUjag7aYcUELYV3Zu6qN9B3NPGZDdHb0LxUtfoMN6/RwxfVuaRtm6nKv8ppco bp9IoQX6Tz+Om8rDEDHvL7fQjbHVB5WBxBUnPVlR95ubUHmOY9W1EsvYsRIA1xMM7N eJSXgKpoNwJ2m2XlxT0KNY25Kjs66EVtUU8wl9ngC9wO40Xulf5qzGoiXyzWGlJtFt t9RDC4KemiES0Gkf75BZilBR3KInVX+CJWIomUSlqdGzv1MMOJBQdJlpZtdBpXFoZE Fp1yz5vZGk9CsVtBOqxo2Iudv3NX+HIBA3KnCN9azXtSfLk54o0HL31oMLRvOFK/wY FZjH/CnhpYcLQ== From: Oded Gabbay To: dri-devel@lists.freedesktop.org Subject: [PATCH 12/12] accel/habanalabs: rename fd_list to hpriv_list Date: Thu, 8 Jun 2023 16:38:49 +0300 Message-Id: <20230608133849.2739411-12-ogabbay@kernel.org> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230608133849.2739411-1-ogabbay@kernel.org> References: <20230608133849.2739411-1-ogabbay@kernel.org> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Koby Elbaz Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Koby Elbaz Every time an FD is returned to the user, the driver adds a corresponding private structure to the list. Yet, it's still a list of private structures rather than of FDs. Remove, as well, an unnecessary comment. Signed-off-by: Koby Elbaz Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- drivers/accel/habanalabs/common/device.c | 43 +++++++++++------------- 1 file changed, 19 insertions(+), 24 deletions(-) diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c index c61a58a2e622..0d02f1f7b994 100644 --- a/drivers/accel/habanalabs/common/device.c +++ b/drivers/accel/habanalabs/common/device.c @@ -1304,18 +1304,18 @@ int hl_device_resume(struct hl_device *hdev) static int device_kill_open_processes(struct hl_device *hdev, u32 timeout, bool control_dev) { struct task_struct *task = NULL; - struct list_head *fd_list; - struct hl_fpriv *hpriv; - struct mutex *fd_lock; + struct list_head *hpriv_list; + struct hl_fpriv *hpriv; + struct mutex *hpriv_lock; u32 pending_cnt; - fd_lock = control_dev ? &hdev->fpriv_ctrl_list_lock : &hdev->fpriv_list_lock; - fd_list = control_dev ? &hdev->fpriv_ctrl_list : &hdev->fpriv_list; + hpriv_lock = control_dev ? &hdev->fpriv_ctrl_list_lock : &hdev->fpriv_list_lock; + hpriv_list = control_dev ? &hdev->fpriv_ctrl_list : &hdev->fpriv_list; /* Giving time for user to close FD, and for processes that are inside * hl_device_open to finish */ - if (!list_empty(fd_list)) + if (!list_empty(hpriv_list)) ssleep(1); if (timeout) { @@ -1331,12 +1331,12 @@ static int device_kill_open_processes(struct hl_device *hdev, u32 timeout, bool } } - mutex_lock(fd_lock); + mutex_lock(hpriv_lock); /* This section must be protected because we are dereferencing * pointers that are freed if the process exits */ - list_for_each_entry(hpriv, fd_list, dev_node) { + list_for_each_entry(hpriv, hpriv_list, dev_node) { task = get_pid_task(hpriv->taskpid, PIDTYPE_PID); if (task) { dev_info(hdev->dev, "Killing user process pid=%d\n", @@ -1346,18 +1346,13 @@ static int device_kill_open_processes(struct hl_device *hdev, u32 timeout, bool put_task_struct(task); } else { - /* - * If we got here, it means that process was killed from outside the driver - * right after it started looping on fd_list and before get_pid_task, thus - * we don't need to kill it. - */ dev_dbg(hdev->dev, - "Can't get task struct for user process %d, assuming process was killed from outside the driver\n", + "Can't get task struct for user process %d, process was killed from outside the driver\n", pid_nr(hpriv->taskpid)); } } - mutex_unlock(fd_lock); + mutex_unlock(hpriv_lock); /* * We killed the open users, but that doesn't mean they are closed. @@ -1369,7 +1364,7 @@ static int device_kill_open_processes(struct hl_device *hdev, u32 timeout, bool */ wait_for_processes: - while ((!list_empty(fd_list)) && (pending_cnt)) { + while ((!list_empty(hpriv_list)) && (pending_cnt)) { dev_dbg(hdev->dev, "Waiting for all unmap operations to finish before hard reset\n"); @@ -1379,7 +1374,7 @@ static int device_kill_open_processes(struct hl_device *hdev, u32 timeout, bool } /* All processes exited successfully */ - if (list_empty(fd_list)) + if (list_empty(hpriv_list)) return 0; /* Give up waiting for processes to exit */ @@ -1393,17 +1388,17 @@ static int device_kill_open_processes(struct hl_device *hdev, u32 timeout, bool static void device_disable_open_processes(struct hl_device *hdev, bool control_dev) { - struct list_head *fd_list; + struct list_head *hpriv_list; struct hl_fpriv *hpriv; - struct mutex *fd_lock; + struct mutex *hpriv_lock; - fd_lock = control_dev ? &hdev->fpriv_ctrl_list_lock : &hdev->fpriv_list_lock; - fd_list = control_dev ? &hdev->fpriv_ctrl_list : &hdev->fpriv_list; + hpriv_lock = control_dev ? &hdev->fpriv_ctrl_list_lock : &hdev->fpriv_list_lock; + hpriv_list = control_dev ? &hdev->fpriv_ctrl_list : &hdev->fpriv_list; - mutex_lock(fd_lock); - list_for_each_entry(hpriv, fd_list, dev_node) + mutex_lock(hpriv_lock); + list_for_each_entry(hpriv, hpriv_list, dev_node) hpriv->hdev = NULL; - mutex_unlock(fd_lock); + mutex_unlock(hpriv_lock); } static void send_disable_pci_access(struct hl_device *hdev, u32 flags)