[13/13] accel/habanalabs: modify pci health check

Message ID 20240220160129.909714-13-ogabbay@kernel.org (mailing list archive)
State New, archived
Series [01/13] accel/habanalabs/gaudi2: use single function to compare FW versions

Commit Message

Oded Gabbay Feb. 20, 2024, 4:01 p.m. UTC
From: Ofir Bitton <obitton@habana.ai>

Today we read the PCI VENDOR-ID to make sure the PCI link is
healthy. Apparently the VENDOR-ID might be cached on the host, so
reading it might not actually access the PCI bus.
To make the PCI health check reliable, start checking the
DEVICE-ID instead.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/accel/habanalabs/common/device.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

Carl Vanderlip Feb. 23, 2024, 11:32 p.m. UTC | #1
On 2/20/2024 8:01 AM, Oded Gabbay wrote:
> From: Ofir Bitton <obitton@habana.ai>
> 
> Today we read PCI VENDOR-ID in order to make sure PCI link is
> healthy. Apparently the VENDOR-ID might be stored on host and
> hence, when we read it we might not access the PCI bus.
> In order to make sure PCI health check is reliable, we will start
> checking the DEVICE-ID instead.

What's keeping some system from caching that as well?

Since this is checking for PCIe link health, the read will return all
ones (0xFFFF for a 16-bit read) when the link is bad.
Checking some part of Config Space that is writable would be more reliable.

-Carl V.
Ofir Bitton Feb. 24, 2024, 7:20 p.m. UTC | #2
On 24/02/2024 1:32, Carl Vanderlip wrote:
> 
> On 2/20/2024 8:01 AM, Oded Gabbay wrote:
>> From: Ofir Bitton <obitton@habana.ai>
>>
>> Today we read PCI VENDOR-ID in order to make sure PCI link is
>> healthy. Apparently the VENDOR-ID might be stored on host and
>> hence, when we read it we might not access the PCI bus.
>> In order to make sure PCI health check is reliable, we will start
>> checking the DEVICE-ID instead.
> 
> What's keeping some system from caching that as well?

The PCI Controllers/switches we use in Gaudi family products might cache 
only the VENDOR-ID and not the DEVICE-ID.

> 
> Since this is checking for PCIe link health, it will be 0xFF when bad.
> Checking some part of Config Space that is writable would be more reliable.

Generally speaking I agree, but no product in the Gaudi family has a
0xFFFF DEVICE-ID (nor will there be one), so I think this approach is
good enough for our use case.

--
Ofir

> 
> -Carl V.
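
The caching concern discussed in this thread can be illustrated with a
small userspace sketch. It is not driver code: struct mock_dev,
mock_read_config_word() and the MOCK_* IDs are hypothetical names and
values invented for the illustration, and the model simply assumes that
the host side caches the VENDOR-ID while every other config read goes
over the link and returns all ones when the link is down.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Standard config-space offsets of the two ID registers. */
#define CFG_VENDOR_ID 0x00
#define CFG_DEVICE_ID 0x02

/* Made-up IDs for this sketch; not real Habana Labs values. */
#define MOCK_VENDOR 0x1234
#define MOCK_DEVICE 0x5678

/* Hypothetical model of a device behind a bridge that caches the Vendor ID. */
struct mock_dev {
	bool link_up;             /* is the simulated link alive? */
	uint16_t expected_device; /* ID recorded at probe time, like pdev->device */
};

/*
 * Mock 16-bit config read. Assumption: the Vendor ID is served from a
 * host-side cache; everything else crosses the link and reads as all
 * ones when the link is down.
 */
static void mock_read_config_word(const struct mock_dev *dev, int where,
				  uint16_t *val)
{
	assert(dev && val);

	if (where == CFG_VENDOR_ID)
		*val = MOCK_VENDOR;	/* cached: never touches the link */
	else if (!dev->link_up)
		*val = 0xFFFF;		/* failed read returns all ones */
	else if (where == CFG_DEVICE_ID)
		*val = MOCK_DEVICE;
	else
		*val = 0;
}

/* Old-style check: compares the (possibly cached) Vendor ID. */
static bool healthy_by_vendor_id(const struct mock_dev *dev)
{
	uint16_t vendor_id;

	mock_read_config_word(dev, CFG_VENDOR_ID, &vendor_id);
	return vendor_id == MOCK_VENDOR;
}

/* New-style check: compares the Device ID with the one seen at probe. */
static bool healthy_by_device_id(const struct mock_dev *dev)
{
	uint16_t device_id;

	mock_read_config_word(dev, CFG_DEVICE_ID, &device_id);
	return device_id == dev->expected_device;
}
```

Under these assumptions, the VENDOR-ID check keeps reporting a healthy
link after the link goes down, while the DEVICE-ID check fails as long
as the expected ID is not 0xFFFF. If a platform also cached the
DEVICE-ID, the second check would have the same blind spot, which is the
point raised in comment #1.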

Patch

diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index 3b9e8a21d7df..8f92445c5a90 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -1035,14 +1035,14 @@  static void device_early_fini(struct hl_device *hdev)
 
 static bool is_pci_link_healthy(struct hl_device *hdev)
 {
-	u16 vendor_id;
+	u16 device_id;
 
 	if (!hdev->pdev)
 		return false;
 
-	pci_read_config_word(hdev->pdev, PCI_VENDOR_ID, &vendor_id);
+	pci_read_config_word(hdev->pdev, PCI_DEVICE_ID, &device_id);
 
-	return (vendor_id == PCI_VENDOR_ID_HABANALABS);
+	return (device_id == hdev->pdev->device);
 }
 
 static int hl_device_eq_heartbeat_check(struct hl_device *hdev)