diff mbox series

ethtool fails to read some QSFP+ modules.

Message ID 54b537c6-aca4-45be-9df0-53c80a046930@dan.merillat.org (mailing list archive)
State New
Delegated to: Michal Kubecek
Headers show
Series ethtool fails to read some QSFP+ modules. | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Dan Merillat June 30, 2024, 5:27 p.m. UTC
I was testing an older Kaiam XQX2502 40G-LR4 and ethtool -m failed with netlink error.  It's treating a failure to read
the optional page3 data as a hard failure.

This patch allows ethtool to read qsfp modules that don't implement the voltage/temperature alarm data.

Comments

Ido Schimmel July 1, 2024, 7:28 a.m. UTC | #1
On Sun, Jun 30, 2024 at 01:27:07PM -0400, Dan Merillat wrote:
> 
> I was testing an older Kaiam XQX2502 40G-LR4 and ethtool -m failed with netlink error.  It's treating a failure to read
> the optional page3 data as a hard failure.
> 
> This patch allows ethtool to read qsfp modules that don't implement the voltage/temperature alarm data.

Thanks for the report and the patch. Krzysztof Olędzki reported the same
issue earlier this year:
https://lore.kernel.org/netdev/9e757616-0396-4573-9ea9-3cb5ef5c901a@ans.pl/

Krzysztof, are you going to submit the ethtool and mlx4 patches?

> From 3144fbfc08fbfb90ecda4848fc9356bde8933d4a Mon Sep 17 00:00:00 2001
> From: Dan Merillat <git@dan.eginity.com>
> Date: Sun, 30 Jun 2024 13:11:51 -0400
> Subject: [PATCH] Some qsfp modules do not support page 3
> 
> Tested on an older Kaiam XQX2502 40G-LR4 module.
> ethtool -m aborts with netlink error due to page 3
> not existing on the module. Ignore the error and
> leave map->page_03h NULL.

User space only tries to read this page because the module advertised it
as supported. It is more likely that the NIC driver does not return all
the pages. Which driver is it?
Ido Schimmel July 1, 2024, 7:41 a.m. UTC | #2
Forgot to add Krzysztof :p

On Mon, Jul 01, 2024 at 10:28:39AM +0300, Ido Schimmel wrote:
> On Sun, Jun 30, 2024 at 01:27:07PM -0400, Dan Merillat wrote:
> > 
> > I was testing an older Kaiam XQX2502 40G-LR4 and ethtool -m failed with netlink error.  It's treating a failure to read
> > the optional page3 data as a hard failure.
> > 
> > This patch allows ethtool to read qsfp modules that don't implement the voltage/temperature alarm data.
> 
> Thanks for the report and the patch. Krzysztof Olędzki reported the same
> issue earlier this year:
> https://lore.kernel.org/netdev/9e757616-0396-4573-9ea9-3cb5ef5c901a@ans.pl/
> 
> Krzysztof, are you going to submit the ethtool and mlx4 patches?
> 
> > From 3144fbfc08fbfb90ecda4848fc9356bde8933d4a Mon Sep 17 00:00:00 2001
> > From: Dan Merillat <git@dan.eginity.com>
> > Date: Sun, 30 Jun 2024 13:11:51 -0400
> > Subject: [PATCH] Some qsfp modules do not support page 3
> > 
> > Tested on an older Kaiam XQX2502 40G-LR4 module.
> > ethtool -m aborts with netlink error due to page 3
> > not existing on the module. Ignore the error and
> > leave map->page_03h NULL.
> 
> User space only tries to read this page because the module advertised it
> as supported. It is more likely that the NIC driver does not return all
> the pages. Which driver is it?
Krzysztof Olędzki July 3, 2024, 2:36 p.m. UTC | #3
Good morning,

On 01.07.2024 at 00:41, Ido Schimmel wrote:
> Forgot to add Krzysztof :p
> 
> On Mon, Jul 01, 2024 at 10:28:39AM +0300, Ido Schimmel wrote:
>> On Sun, Jun 30, 2024 at 01:27:07PM -0400, Dan Merillat wrote:
>>>
>>> I was testing an older Kaiam XQX2502 40G-LR4 and ethtool -m failed with netlink error.  It's treating a failure to read
>>> the optional page3 data as a hard failure.
>>>
>>> This patch allows ethtool to read qsfp modules that don't implement the voltage/temperature alarm data.
>>
>> Thanks for the report and the patch. Krzysztof Olędzki reported the same
>> issue earlier this year:
>> https://lore.kernel.org/netdev/9e757616-0396-4573-9ea9-3cb5ef5c901a@ans.pl/
>>
>> Krzysztof, are you going to submit the ethtool and mlx4 patches?

Yes, and I apologize for the delay - I have been traveling with my family and was unable to get into it.

I should be able to work on the patches later this week, so please expect something from me around the weekend.

>>> From 3144fbfc08fbfb90ecda4848fc9356bde8933d4a Mon Sep 17 00:00:00 2001
>>> From: Dan Merillat <git@dan.eginity.com>
>>> Date: Sun, 30 Jun 2024 13:11:51 -0400
>>> Subject: [PATCH] Some qsfp modules do not support page 3
>>>
>>> Tested on an older Kaiam XQX2502 40G-LR4 module.
>>> ethtool -m aborts with netlink error due to page 3
>>> not existing on the module. Ignore the error and
>>> leave map->page_03h NULL.
>>
>> User space only tries to read this page because the module advertised it
>> as supported. It is more likely that the NIC driver does not return all
>> the pages. Which driver is it?
> 

Thanks,
 Krzysztof
Krzysztof Olędzki July 4, 2024, 12:27 p.m. UTC | #4
On 03.07.2024 at 07:36, Krzysztof Olędzki wrote:
> Good morning,
> 
> On 01.07.2024 at 00:41, Ido Schimmel wrote:
>> Forgot to add Krzysztof :p
>>
>> On Mon, Jul 01, 2024 at 10:28:39AM +0300, Ido Schimmel wrote:
>>> On Sun, Jun 30, 2024 at 01:27:07PM -0400, Dan Merillat wrote:
>>>>
>>>> I was testing an older Kaiam XQX2502 40G-LR4 and ethtool -m failed with netlink error.  It's treating a failure to read
>>>> the optional page3 data as a hard failure.
>>>>
>>>> This patch allows ethtool to read qsfp modules that don't implement the voltage/temperature alarm data.
>>>
>>> Thanks for the report and the patch. Krzysztof Olędzki reported the same
>>> issue earlier this year:
>>> https://lore.kernel.org/netdev/9e757616-0396-4573-9ea9-3cb5ef5c901a@ans.pl/
>>>
>>> Krzysztof, are you going to submit the ethtool and mlx4 patches?
> 
> Yes, and I apologize for the delay - I have been traveling with my family and was unable to get into it.
> 
> I should be able to work on the patches later this week, so please expect something from me around the weekend.
> 
>>>> From 3144fbfc08fbfb90ecda4848fc9356bde8933d4a Mon Sep 17 00:00:00 2001
>>>> From: Dan Merillat <git@dan.eginity.com>
>>>> Date: Sun, 30 Jun 2024 13:11:51 -0400
>>>> Subject: [PATCH] Some qsfp modules do not support page 3
>>>>
>>>> Tested on an older Kaiam XQX2502 40G-LR4 module.
>>>> ethtool -m aborts with netlink error due to page 3
>>>> not existing on the module. Ignore the error and
>>>> leave map->page_03h NULL.

BTW - the code change does what we have discussed, but I think the comment
may be incorrect? Before we call nl_get_eeprom_page, there is a check to
verify that Page 03h is present:

        /* Page 03h is only present when the module memory model is paged and
         * not flat.
         */
        if (map->lower_memory[SFF8636_STATUS_2_OFFSET] &
            SFF8636_STATUS_PAGE_3_PRESENT)
                return 0;

In this case, it seems to me that the failure can only be caused by either
HW issues (NIC or SFP) *or* a bug in the driver. Assuming we want to provide
some details in the code, maybe something like this may be better?

+	/* Page 03h is not available due to either HW issue or a bug
+	 * in the driver. This is a non-fatal error and sff8636_dom_parse()
+	 * handles this correctly.
+	 */

We were also discussing if printing a warning in such situation may make sense.
As I was thinking about this more, I wonder if we can just use the same check
in sff8636_show_dom() and if map->page_03h is NULL print for
 "Alarm/warning flags implemented"
something like:
 "Failed (Page 03h access error, HW issue or kernel driver bug?)"

We would get it in addition to "netlink error: Invalid argument" that comes from:
./netlink/nlsock.c:             perror("netlink error");


>>> User space only tries to read this page because the module advertised it
>>> as supported. It is more likely that the NIC driver does not return all
>>> the pages. Which driver is it?
>>

Krzysztof
diff mbox series

Patch

From 3144fbfc08fbfb90ecda4848fc9356bde8933d4a Mon Sep 17 00:00:00 2001
From: Dan Merillat <git@dan.eginity.com>
Date: Sun, 30 Jun 2024 13:11:51 -0400
Subject: [PATCH] Some qsfp modules do not support page 3

Tested on an older Kaiam XQX2502 40G-LR4 module.
ethtool -m aborts with netlink error due to page 3
not existing on the module. Ignore the error and
leave map->page_03h NULL.
---
 qsfp.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/qsfp.c b/qsfp.c
index a2921fb..0a16b42 100644
--- a/qsfp.c
+++ b/qsfp.c
@@ -1037,9 +1037,15 @@  sff8636_memory_map_init_pages(struct cmd_context *ctx,
 		return 0;
 
 	sff8636_request_init(&request, 0x3, SFF8636_PAGE_SIZE);
+
+	/* Some modules are paged but do not have page 03h.  This
+	 * is a non-fatal error, and sff8636_dom_parse() handles this
+	 * correctly.
+	 */
 	ret = nl_get_eeprom_page(ctx, &request);
 	if (ret < 0)
-		return ret;
+		return 0;
+
 	map->page_03h = request.data - SFF8636_PAGE_SIZE;
 
 	return 0;
-- 
2.45.1