Message ID | 54b537c6-aca4-45be-9df0-53c80a046930@dan.merillat.org (mailing list archive) |
---|---|
State | New |
Delegated to: | Michal Kubecek |
Headers | show |
Series | ethtool fails to read some QSFP+ modules. | expand |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Not a local patch |
On Sun, Jun 30, 2024 at 01:27:07PM -0400, Dan Merillat wrote: > > I was testing an older Kaiam XQX2502 40G-LR4 and ethtool -m failed with netlink error. It's treating a failure to read > the optional page3 data as a hard failure. > > This patch allows ethtool to read qsfp modules that don't implement the voltage/temperature alarm data. Thanks for the report and the patch. Krzysztof Olędzki reported the same issue earlier this year: https://lore.kernel.org/netdev/9e757616-0396-4573-9ea9-3cb5ef5c901a@ans.pl/ Krzysztof, are you going to submit the ethtool and mlx4 patches? > From 3144fbfc08fbfb90ecda4848fc9356bde8933d4a Mon Sep 17 00:00:00 2001 > From: Dan Merillat <git@dan.eginity.com> > Date: Sun, 30 Jun 2024 13:11:51 -0400 > Subject: [PATCH] Some qsfp modules do not support page 3 > > Tested on an older Kaiam XQX2502 40G-LR4 module. > ethtool -m aborts with netlink error due to page 3 > not existing on the module. Ignore the error and > leave map->page_03h NULL. User space only tries to read this page because the module advertised it as supported. It is more likely that the NIC driver does not return all the pages. Which driver is it?
Forgot to add Krzysztof :p On Mon, Jul 01, 2024 at 10:28:39AM +0300, Ido Schimmel wrote: > On Sun, Jun 30, 2024 at 01:27:07PM -0400, Dan Merillat wrote: > > > > I was testing an older Kaiam XQX2502 40G-LR4 and ethtool -m failed with netlink error. It's treating a failure to read > > the optional page3 data as a hard failure. > > > > This patch allows ethtool to read qsfp modules that don't implement the voltage/temperature alarm data. > > Thanks for the report and the patch. Krzysztof Olędzki reported the same > issue earlier this year: > https://lore.kernel.org/netdev/9e757616-0396-4573-9ea9-3cb5ef5c901a@ans.pl/ > > Krzysztof, are you going to submit the ethtool and mlx4 patches? > > > From 3144fbfc08fbfb90ecda4848fc9356bde8933d4a Mon Sep 17 00:00:00 2001 > > From: Dan Merillat <git@dan.eginity.com> > > Date: Sun, 30 Jun 2024 13:11:51 -0400 > > Subject: [PATCH] Some qsfp modules do not support page 3 > > > > Tested on an older Kaiam XQX2502 40G-LR4 module. > > ethtool -m aborts with netlink error due to page 3 > > not existing on the module. Ignore the error and > > leave map->page_03h NULL. > > User space only tries to read this page because the module advertised it > as supported. It is more likely that the NIC driver does not return all > the pages. Which driver is it?
Good morning, On 01.07.2024 at 00:41, Ido Schimmel wrote: > Forgot to add Krzysztof :p > > On Mon, Jul 01, 2024 at 10:28:39AM +0300, Ido Schimmel wrote: >> On Sun, Jun 30, 2024 at 01:27:07PM -0400, Dan Merillat wrote: >>> >>> I was testing an older Kaiam XQX2502 40G-LR4 and ethtool -m failed with netlink error. It's treating a failure to read >>> the optional page3 data as a hard failure. >>> >>> This patch allows ethtool to read qsfp modules that don't implement the voltage/temperature alarm data. >> >> Thanks for the report and the patch. Krzysztof Olędzki reported the same >> issue earlier this year: >> https://lore.kernel.org/netdev/9e757616-0396-4573-9ea9-3cb5ef5c901a@ans.pl/ >> >> Krzysztof, are you going to submit the ethtool and mlx4 patches? Yes, and I apologize for the delay - I have been traveling with my family and was unable to get into it. I should be able to work on the patches later this week, so please expect something from me around the weekend. >>> From 3144fbfc08fbfb90ecda4848fc9356bde8933d4a Mon Sep 17 00:00:00 2001 >>> From: Dan Merillat <git@dan.eginity.com> >>> Date: Sun, 30 Jun 2024 13:11:51 -0400 >>> Subject: [PATCH] Some qsfp modules do not support page 3 >>> >>> Tested on an older Kaiam XQX2502 40G-LR4 module. >>> ethtool -m aborts with netlink error due to page 3 >>> not existing on the module. Ignore the error and >>> leave map->page_03h NULL. >> >> User space only tries to read this page because the module advertised it >> as supported. It is more likely that the NIC driver does not return all >> the pages. Which driver is it? > Thanks, Krzysztof
On 03.07.2024 at 07:36, Krzysztof Olędzki wrote: > Good morning, > > On 01.07.2024 at 00:41, Ido Schimmel wrote: >> Forgot to add Krzysztof :p >> >> On Mon, Jul 01, 2024 at 10:28:39AM +0300, Ido Schimmel wrote: >>> On Sun, Jun 30, 2024 at 01:27:07PM -0400, Dan Merillat wrote: >>>> >>>> I was testing an older Kaiam XQX2502 40G-LR4 and ethtool -m failed with netlink error. It's treating a failure to read >>>> the optional page3 data as a hard failure. >>>> >>>> This patch allows ethtool to read qsfp modules that don't implement the voltage/temperature alarm data. >>> >>> Thanks for the report and the patch. Krzysztof Olędzki reported the same >>> issue earlier this year: >>> https://lore.kernel.org/netdev/9e757616-0396-4573-9ea9-3cb5ef5c901a@ans.pl/ >>> >>> Krzysztof, are you going to submit the ethtool and mlx4 patches? > > Yes, and I apologize for the delay - I have been traveling with my family and was unable to get into it. > > I should be able to work on the patches later this week, so please expect something from me around the weekend. > >>>> From 3144fbfc08fbfb90ecda4848fc9356bde8933d4a Mon Sep 17 00:00:00 2001 >>>> From: Dan Merillat <git@dan.eginity.com> >>>> Date: Sun, 30 Jun 2024 13:11:51 -0400 >>>> Subject: [PATCH] Some qsfp modules do not support page 3 >>>> >>>> Tested on an older Kaiam XQX2502 40G-LR4 module. >>>> ethtool -m aborts with netlink error due to page 3 >>>> not existing on the module. Ignore the error and >>>> leave map->page_03h NULL. BTW - the code change does what we have discussed, but I think the comment may be incorrect? Before we call nl_get_eeprom_page, there is a check to verify that Page 03h is present: /* Page 03h is only present when the module memory model is paged and * not flat. */ if (map->lower_memory[SFF8636_STATUS_2_OFFSET] & SFF8636_STATUS_PAGE_3_PRESENT) return 0; In this case, it seems to me that the failure can only be caused by either HW issues (NIC or SFP) *or* a bug in the driver. Assuming we want to provide some details in the code, maybe something like this may be better? + /* Page 03h is not available due to either HW issue or a bug + * in the driver. This is a non-fatal error and sff8636_dom_parse() + * handles this correctly. + */ We were also discussing if printing a warning in such situation may make sense. As I was thinking about this more, I wonder if we can just use the same check in sff8636_show_dom() and if map->page_03h is NULL print for "Alarm/warning flags implemented" something like: "Failed (Page 03h access error, HW issue or kernel driver bug?)" We would get it in addition to "netlink error: Invalid argument" that comes from: ./netlink/nlsock.c: perror("netlink error"); >>> User space only tries to read this page because the module advertised it >>> as supported. It is more likely that the NIC driver does not return all >>> the pages. Which driver is it? >> Krzysztof
On Thu, Jul 04, 2024 at 05:27:49AM -0700, Krzysztof Olędzki wrote: > On 03.07.2024 at 07:36, Krzysztof Olędzki wrote: > > Good morning, > > > > On 01.07.2024 at 00:41, Ido Schimmel wrote: > >> Forgot to add Krzysztof :p > >> > >> On Mon, Jul 01, 2024 at 10:28:39AM +0300, Ido Schimmel wrote: > >>> On Sun, Jun 30, 2024 at 01:27:07PM -0400, Dan Merillat wrote: > >>>> > >>>> I was testing an older Kaiam XQX2502 40G-LR4 and ethtool -m failed with netlink error. It's treating a failure to read > >>>> the optional page3 data as a hard failure. > >>>> > >>>> This patch allows ethtool to read qsfp modules that don't implement the voltage/temperature alarm data. > >>> > >>> Thanks for the report and the patch. Krzysztof Olędzki reported the same > >>> issue earlier this year: > >>> https://lore.kernel.org/netdev/9e757616-0396-4573-9ea9-3cb5ef5c901a@ans.pl/ > >>> > >>> Krzysztof, are you going to submit the ethtool and mlx4 patches? > > > > Yes, and I apologize for the delay - I have been traveling with my family and was unable to get into it. > > > > I should be able to work on the patches later this week, so please expect something from me around the weekend. > > > >>>> From 3144fbfc08fbfb90ecda4848fc9356bde8933d4a Mon Sep 17 00:00:00 2001 > >>>> From: Dan Merillat <git@dan.eginity.com> > >>>> Date: Sun, 30 Jun 2024 13:11:51 -0400 > >>>> Subject: [PATCH] Some qsfp modules do not support page 3 > >>>> > >>>> Tested on an older Kaiam XQX2502 40G-LR4 module. > >>>> ethtool -m aborts with netlink error due to page 3 > >>>> not existing on the module. Ignore the error and > >>>> leave map->page_03h NULL. > > BTW - the code change does what we have discussed, but I think the comment > may be incorrect? Before we call nl_get_eeprom_page, there is a check to > verify that Page 03h is present: > > /* Page 03h is only present when the module memory model is paged and > * not flat. > */ > if (map->lower_memory[SFF8636_STATUS_2_OFFSET] & > SFF8636_STATUS_PAGE_3_PRESENT) > return 0; > > In this case, it seems to me that the failure can only be caused by either > HW issues (NIC or SFP) *or* a bug in the driver. Assuming we want to provide > some details in the code, maybe something like this may be better? > > + /* Page 03h is not available due to either HW issue or a bug > + * in the driver. This is a non-fatal error and sff8636_dom_parse() > + * handles this correctly. > + */ Looks fine to me although I believe an error in this case will always be returned because of a driver bug. Reading a page that does not exist should not result in an error, but in the module returning Upper Page 00h. Yet to encounter a module that works in a different way. From SFF-8636 Section 6.1: "Writing the value of a non-supported page shall not be accepted by the slave. The Page Select byte shall revert to 0h and read/write operations shall be to Upper Page 00h. Because Upper Page 00h is read-only, this scheme prevents the inadvertent corruption of module memory by a host attempting to write to a non-supported location." > > We were also discussing if printing a warning in such situation may make sense. > As I was thinking about this more, I wonder if we can just use the same check > in sff8636_show_dom() and if map->page_03h is NULL print for > "Alarm/warning flags implemented" > something like: > "Failed (Page 03h access error, HW issue or kernel driver bug?)" > > We would get it in addition to "netlink error: Invalid argument" that comes from: > ./netlink/nlsock.c: perror("netlink error"); I think it's better to print it in sff8636_memory_map_init_pages() as that way the user can more easily understand the reason for "netlink error: Invalid argument": # ./ethtool -m swp13 netlink error: Invalid argument Failed to read Upper Page 03h Identifier : 0x11 (QSFP28) Extended identifier : 0xcf [...] BTW, just to be sure, you are going to post patches for both ethtool and mlx4, right?
From 3144fbfc08fbfb90ecda4848fc9356bde8933d4a Mon Sep 17 00:00:00 2001 From: Dan Merillat <git@dan.eginity.com> Date: Sun, 30 Jun 2024 13:11:51 -0400 Subject: [PATCH] Some qsfp modules do not support page 3 Tested on an older Kaiam XQX2502 40G-LR4 module. ethtool -m aborts with netlink error due to page 3 not existing on the module. Ignore the error and leave map->page_03h NULL. --- qsfp.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/qsfp.c b/qsfp.c index a2921fb..0a16b42 100644 --- a/qsfp.c +++ b/qsfp.c @@ -1037,9 +1037,15 @@ sff8636_memory_map_init_pages(struct cmd_context *ctx, return 0; sff8636_request_init(&request, 0x3, SFF8636_PAGE_SIZE); + + /* Some modules are paged but do not have page 03h. This + * is a non-fatal error, and sff8636_dom_parse() handles this + * correctly. + */ ret = nl_get_eeprom_page(ctx, &request); if (ret < 0) - return ret; + return 0; + map->page_03h = request.data - SFF8636_PAGE_SIZE; return 0; -- 2.45.1