[2/3] accel/habanalabs: reset device if scrubbing failed

Message ID 20230612120733.3079507-2-ogabbay@kernel.org (mailing list archive)
State New, archived
Series [1/3] accel/habanalabs: remove pdev check on idle check

Commit Message

Oded Gabbay June 12, 2023, 12:07 p.m. UTC
If scrubbing memory fails after the user has released the device, the
device is in a bad state and should be reset.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/accel/habanalabs/common/device.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Ofir Bitton June 13, 2023, 7:54 a.m. UTC | #1
On 12/06/2023 15:07, Oded Gabbay wrote:

If scrubbing memory after user released device has failed it means
the device is in a bad state and should be reset.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/accel/habanalabs/common/device.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index 5e61761b8c11..d7d9198b2103 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -454,8 +454,10 @@ static void hpriv_release(struct kref *ref)
                /* Scrubbing is handled within hl_device_reset(), so here need to do it directly */
                int rc = hdev->asic_funcs->scrub_device_mem(hdev);

-               if (rc)
+               if (rc) {
                        dev_err(hdev->dev, "failed to scrub memory from hpriv release (%d)\n", rc);
+                       hl_device_reset(hdev, HL_DRV_RESET_HARD);
+               }
        }

        /* Now we can mark the compute_ctx as not active. Even if a reset is running in a different


Reviewed-by: Ofir Bitton <obitton@habana.ai>
Patch

diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index 5e61761b8c11..d7d9198b2103 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -454,8 +454,10 @@  static void hpriv_release(struct kref *ref)
 		/* Scrubbing is handled within hl_device_reset(), so here need to do it directly */
 		int rc = hdev->asic_funcs->scrub_device_mem(hdev);
 
-		if (rc)
+		if (rc) {
 			dev_err(hdev->dev, "failed to scrub memory from hpriv release (%d)\n", rc);
+			hl_device_reset(hdev, HL_DRV_RESET_HARD);
+		}
 	}
 
 	/* Now we can mark the compute_ctx as not active. Even if a reset is running in a different
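
The control flow this patch introduces can be tried outside the kernel with a small user-space mock. This is only a sketch under stated assumptions: struct mock_device, mock_scrub_device_mem(), mock_hard_reset() and mock_release() are hypothetical stand-ins invented for illustration, not habanalabs driver code. Unlike the real hpriv_release(), which returns void, the mock returns the scrub result so that the behaviour is easy to check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the driver's device structure. */
struct mock_device {
	int scrub_result;	/* value the mocked scrub step returns (0 = success) */
	int hard_resets;	/* number of hard resets requested so far */
};

/* Mocked scrub step; the driver calls hdev->asic_funcs->scrub_device_mem(hdev). */
static int mock_scrub_device_mem(struct mock_device *dev)
{
	return dev->scrub_result;
}

/* Mocked reset; the driver calls hl_device_reset(hdev, HL_DRV_RESET_HARD). */
static void mock_hard_reset(struct mock_device *dev)
{
	dev->hard_resets++;
}

/*
 * Release path as it looks after the patch: a scrub failure is logged
 * and is then followed by a hard reset instead of being ignored.
 * Returning rc is an addition made for this sketch only.
 */
static int mock_release(struct mock_device *dev)
{
	int rc;

	assert(dev != NULL);

	rc = mock_scrub_device_mem(dev);
	if (rc) {
		fprintf(stderr, "failed to scrub memory from release (%d)\n", rc);
		mock_hard_reset(dev);
	}

	return rc;
}
```

Calling mock_release() on a device whose scrub_result is 0 leaves hard_resets untouched. A non-zero scrub_result prints the error and increments hard_resets exactly once.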