Message ID | 20240220160129.909714-13-ogabbay@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Series | [01/13] accel/habanalabs/gaudi2: use single function to compare FW versions |
On 2/20/2024 8:01 AM, Oded Gabbay wrote:
> From: Ofir Bitton <obitton@habana.ai>
>
> Today we read PCI VENDOR-ID in order to make sure PCI link is
> healthy. Apparently the VENDOR-ID might be stored on host and
> hence, when we read it we might not access the PCI bus.
> In order to make sure PCI health check is reliable, we will start
> checking the DEVICE-ID instead.

What's keeping some system from caching that as well?

Since this is checking for PCIe link health, it will be 0xFF when bad.
Checking some part of Config Space that is writable would be more reliable.

-Carl V.
On 24/02/2024 1:32, Carl Vanderlip wrote:
>
> On 2/20/2024 8:01 AM, Oded Gabbay wrote:
>> From: Ofir Bitton <obitton@habana.ai>
>>
>> Today we read PCI VENDOR-ID in order to make sure PCI link is
>> healthy. Apparently the VENDOR-ID might be stored on host and
>> hence, when we read it we might not access the PCI bus.
>> In order to make sure PCI health check is reliable, we will start
>> checking the DEVICE-ID instead.
>
> What's keeping some system from caching that as well?

The PCI Controllers/switches we use in Gaudi family products might cache
only the VENDOR-ID and not the DEVICE-ID.

>
> Since this is checking for PCIe link health, it will be 0xFF when bad.
> Checking some part of Config Space that is writable would be more reliable.

Generally speaking I agree but there is no product in the Gaudi family
with a 0xFF DEVICE-ID (nor there will be), so I think this approach is
good enough for our use-case.

--
Ofir

>
> -Carl V.
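For illustration only, the write-and-read-back idea raised in the review could look roughly like the userspace sketch below. It is not driver code: `struct mock_cfg`, `mock_write_byte()`, `mock_read_byte()` and `probe_by_writeback()` are hypothetical names, the "scratch" byte stands in for whatever writable config register a real device might offer, and the model simply assumes that writes are dropped and reads return all ones while the link is down.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a device's config space; not a kernel type. */
struct mock_cfg {
	bool link_up;    /* simulated link state */
	uint8_t scratch; /* assumed writable config byte */
};

/* Simulated config write: silently dropped when the link is down. */
static void mock_write_byte(struct mock_cfg *c, uint8_t val)
{
	if (c->link_up)
		c->scratch = val;
}

/* Simulated config read: returns all ones when the link is down. */
static uint8_t mock_read_byte(const struct mock_cfg *c)
{
	return c->link_up ? c->scratch : 0xFF;
}

/*
 * Write two complementary patterns and check that each one reads back,
 * then restore the value that was read before the probe started.
 */
static bool probe_by_writeback(struct mock_cfg *c)
{
	uint8_t saved = mock_read_byte(c);
	bool ok;

	mock_write_byte(c, 0x5A);
	ok = (mock_read_byte(c) == 0x5A);

	mock_write_byte(c, 0xA5);
	ok = ok && (mock_read_byte(c) == 0xA5);

	mock_write_byte(c, saved);
	return ok;
}
```

Using two complementary patterns means a register that happens to already hold one of them, or a read that returns a stale constant, cannot pass both comparisons.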
diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index 3b9e8a21d7df..8f92445c5a90 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -1035,14 +1035,14 @@ static void device_early_fini(struct hl_device *hdev)
 
 static bool is_pci_link_healthy(struct hl_device *hdev)
 {
-	u16 vendor_id;
+	u16 device_id;
 
 	if (!hdev->pdev)
 		return false;
 
-	pci_read_config_word(hdev->pdev, PCI_VENDOR_ID, &vendor_id);
+	pci_read_config_word(hdev->pdev, PCI_DEVICE_ID, &device_id);
 
-	return (vendor_id == PCI_VENDOR_ID_HABANALABS);
+	return (device_id == hdev->pdev->device);
 }
 
 static int hl_device_eq_heartbeat_check(struct hl_device *hdev)
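To experiment with the patched check outside the kernel, a minimal userspace sketch follows. Everything prefixed `mock_` or `MOCK_` is hypothetical and only mimics the shape of the kernel code: the mock read returns all ones (0xFFFF for a 16-bit read) when the simulated link is down, which is the usual result of a config read that gets no response, and the device ID value used in any caller is arbitrary.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Device ID lives at offset 0x02 of the standard PCI config header. */
#define MOCK_CFG_DEVICE_ID 0x02

/* Hypothetical stand-in for struct pci_dev; not a kernel type. */
struct mock_pdev {
	bool link_up;    /* simulated link state */
	uint16_t device; /* ID recorded at enumeration, like pdev->device */
};

/* Simulated 16-bit config read: all ones if the device does not answer. */
static void mock_read_config_word(const struct mock_pdev *p, int where,
				  uint16_t *val)
{
	if (!p->link_up || where != MOCK_CFG_DEVICE_ID) {
		*val = 0xFFFF;
		return;
	}
	*val = p->device;
}

/* Same structure as the patched is_pci_link_healthy(), on the mock types. */
static bool mock_is_link_healthy(const struct mock_pdev *p)
{
	uint16_t device_id;

	if (!p)
		return false;

	mock_read_config_word(p, MOCK_CFG_DEVICE_ID, &device_id);

	return device_id == p->device;
}
```

Comparing against the ID recorded at enumeration, rather than a compile-time constant, lets one check cover every device in the family, and it fails on an all-ones read as long as no device uses 0xFFFF as its ID.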