[net-next,v2,0/2] Add support for DSFP transceiver type

Message ID	1606123198-6230-1-git-send-email-moshe@mellanox.com (mailing list archive)
Headers	show Return-Path: <netdev-owner@kernel.org> From: Moshe Shemesh <moshe@mellanox.com> To: "David S. Miller" <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org>, Andrew Lunn <andrew@lunn.ch>, Adrian Pop <pop.adrian61@gmail.com>, Michal Kubecek <mkubecek@suse.cz> Cc: netdev@vger.kernel.org, Vladyslav Tarasiuk <vladyslavt@nvidia.com>, Moshe Shemesh <moshe@nvidia.com>, Moshe Shemesh <moshe@mellanox.com> Subject: [PATCH net-next v2 0/2] Add support for DSFP transceiver type Date: Mon, 23 Nov 2020 11:19:56 +0200 Message-Id: <1606123198-6230-1-git-send-email-moshe@mellanox.com> Precedence: bulk
Series	Add support for DSFP transceiver type \| expand [net-next,v2,0/2] Add support for DSFP transceiver type [net-next,v2,1/2] ethtool: Add CMIS 4.0 module type to UAPI [net-next,v2,2/2] net/mlx5e: Add DSFP EEPROM dump support to ethtool

Moshe Shemesh Nov. 23, 2020, 9:19 a.m. UTC

Add support for new cable module type DSFP (Dual Small Form-Factor Pluggable
transceiver). DSFP EEPROM memory layout is compatible with CMIS 4.0 spec. Add
CMIS 4.0 module type to UAPI and implement DSFP EEPROM dump in mlx5.

Change log:
v1 -> v2
- Added comments on accessing only the mandatory part of passive and
  active cables.

Vladyslav Tarasiuk (2):
  ethtool: Add CMIS 4.0 module type to UAPI
  net/mlx5e: Add DSFP EEPROM dump support to ethtool

 .../ethernet/mellanox/mlx5/core/en_ethtool.c  | 12 ++++-
 .../net/ethernet/mellanox/mlx5/core/port.c    | 52 ++++++++++++++++---
 include/linux/mlx5/port.h                     |  1 +
 include/uapi/linux/ethtool.h                  |  3 ++
 4 files changed, 60 insertions(+), 8 deletions(-)

Andrew Lunn Nov. 24, 2020, 1:14 a.m. UTC | #1

On Mon, Nov 23, 2020 at 11:19:56AM +0200, Moshe Shemesh wrote:
> Add support for new cable module type DSFP (Dual Small Form-Factor Pluggable
> transceiver). DSFP EEPROM memory layout is compatible with CMIS 4.0 spec. Add
> CMIS 4.0 module type to UAPI and implement DSFP EEPROM dump in mlx5.

So the patches themselves look O.K.

But we are yet again kicking the can down the road and not fixing the
underlying inflexibility of the API.

Do we want to keep kicking the can, or is now the time to do the work
on this API?

   Andrew

Jakub Kicinski Nov. 24, 2020, 9:16 p.m. UTC | #2

On Tue, 24 Nov 2020 02:14:59 +0100 Andrew Lunn wrote:
> On Mon, Nov 23, 2020 at 11:19:56AM +0200, Moshe Shemesh wrote:
> > Add support for new cable module type DSFP (Dual Small Form-Factor Pluggable
> > transceiver). DSFP EEPROM memory layout is compatible with CMIS 4.0 spec. Add
> > CMIS 4.0 module type to UAPI and implement DSFP EEPROM dump in mlx5.  
> 
> So the patches themselves look O.K.
> 
> But we are yet again kicking the can down the road and not fixing the
> underlying inflexibility of the API.
> 
> Do we want to keep kicking the can, or is now the time to do the work
> on this API?

This is hardly rocket science. Let's do it right.

Moshe Shemesh Nov. 25, 2020, 10:40 a.m. UTC | #3

On 11/24/2020 11:16 PM, Jakub Kicinski wrote:
> On Tue, 24 Nov 2020 02:14:59 +0100 Andrew Lunn wrote:
>> On Mon, Nov 23, 2020 at 11:19:56AM +0200, Moshe Shemesh wrote:
>>> Add support for new cable module type DSFP (Dual Small Form-Factor Pluggable
>>> transceiver). DSFP EEPROM memory layout is compatible with CMIS 4.0 spec. Add
>>> CMIS 4.0 module type to UAPI and implement DSFP EEPROM dump in mlx5.
>> So the patches themselves look O.K.
>>
>> But we are yet again kicking the can down the road and not fixing the
>> underlying inflexibility of the API.
>>
>> Do we want to keep kicking the can, or is now the time to do the work
>> on this API?
> This is hardly rocket science. Let's do it right.

OK, we will add API options to select bank and page to read any specific 
page the user selects. So advanced user will use it get the optional 
pages he needs, but what about non advanced user who wants to use the 
current API with a current script for DSFP EEPROM. Isn't it better that 
he will get the 5 mandatory pages then keep it not supported ?

Andrew Lunn Nov. 25, 2020, 2:18 p.m. UTC | #4

> OK, we will add API options to select bank and page to read any specific
> page the user selects. So advanced user will use it get the optional pages
> he needs, but what about non advanced user who wants to use the current API
> with a current script for DSFP EEPROM. Isn't it better that he will get the
> 5 mandatory pages then keep it not supported ?

Users using ethtool will not see a difference. They get a dump of what
ethtool knows how to decode. It should try the netlink API first, and
then fall back to the old ioctl interface.

If i was implementing the ethtool side of it, i would probably do some
sort of caching system. We know page 0 should always exist, so
pre-load that into the cache. Try the netlink API first. If that
fails, use the ioctl interface. If the ioctl is used, put everything
returned into the cache. The decoder can then start decoding, see what
bits are set indicating other pages should be available. Ask for them
from the cache. The netlink API can go fetch them and load them into
the cache. If they cannot be loaded return ENODEV, and the decoder has
to skip what it wanted to decode. If you do it correctly, the decoder
should not care about ioctl vs netlink.

I can do a follow up patch for the generic SFP code in
drivers/net/phy, once you have done the first implementation. But i
only have a limited number of SFPs and most are 1G only. Russell King
can hopefully test with his collection.

     Andrew

Moshe Shemesh Nov. 26, 2020, 2:23 p.m. UTC | #5

On 11/25/2020 4:18 PM, Andrew Lunn wrote:
> External email: Use caution opening links or attachments
>
>
>> OK, we will add API options to select bank and page to read any specific
>> page the user selects. So advanced user will use it get the optional pages
>> he needs, but what about non advanced user who wants to use the current API
>> with a current script for DSFP EEPROM. Isn't it better that he will get the
>> 5 mandatory pages then keep it not supported ?
> Users using ethtool will not see a difference. They get a dump of what
> ethtool knows how to decode. It should try the netlink API first, and
> then fall back to the old ioctl interface.
Yes, it makes sense that whenever command not supported by netlink API 
it falls back to old ioctl interface. As I see it we want here to add 
bank and page options to netlink API  to get data from specific page.
>
> If i was implementing the ethtool side of it, i would probably do some
> sort of caching system. We know page 0 should always exist, so
> pre-load that into the cache. Try the netlink API first. If that
> fails, use the ioctl interface. If the ioctl is used, put everything
> returned into the cache.
I am not sure what you mean by cache here. Don't you want to read page 0 
once you got the ethtool command to read from the module ? If not, then 
at what stage ?
>   The decoder can then start decoding, see what
> bits are set indicating other pages should be available. Ask for them
> from the cache. The netlink API can go fetch them and load them into
> the cache. If they cannot be loaded return ENODEV, and the decoder has
> to skip what it wanted to decode.

So the decoder should read page 0 and check according to page 0 and 
specification which pages should be present, right ?

What about the global offset that we currently got when user doesn't 
specify a page, do you mean that this global offset goes through the 
optional and non optional pages that exist and skip the ones that are 
missing according to the specific EEPROM ?

>   If you do it correctly, the decoder
> should not care about ioctl vs netlink.
>
> I can do a follow up patch for the generic SFP code in
> drivers/net/phy, once you have done the first implementation. But i
> only have a limited number of SFPs and most are 1G only. Russell King
> can hopefully test with his collection.
>
>       Andrew

Andrew Lunn Nov. 26, 2020, 3:21 p.m. UTC | #6

> > If i was implementing the ethtool side of it, i would probably do some
> > sort of caching system. We know page 0 should always exist, so
> > pre-load that into the cache. Try the netlink API first. If that
> > fails, use the ioctl interface. If the ioctl is used, put everything
> > returned into the cache.
> I am not sure what you mean by cache here. Don't you want to read page 0
> once you got the ethtool command to read from the module ? If not, then at
> what stage ?

At the beginning, you try the netlink API and ask for pager 0, bytes
0-127. If you get a page, put it into the cache. If not, use the ioctl
interface, which could return one page, or multiple pages. Put them
all into the cache.

> >   The decoder can then start decoding, see what
> > bits are set indicating other pages should be available. Ask for them
> > from the cache. The netlink API can go fetch them and load them into
> > the cache. If they cannot be loaded return ENODEV, and the decoder has
> > to skip what it wanted to decode.
> 
> So the decoder should read page 0 and check according to page 0 and
> specification which pages should be present, right ?

Yes. It ask the cache, give me a pointer to page 0, bytes 0-127. It
then decodes that, looking at the enumeration data to indicate what
other pages should be available. Maybe it decides, page 0, bytes
128-255 should exist, so it asks the cache for a pointer to that. If
using netlink, it would ask the kernel for that data, put it into the
cache, and return a pointer. If using ioctl, it already knows if it
has that data, so it just returns a pointer, so says sorry, not
available.

> What about the global offset that we currently got when user doesn't specify
> a page, do you mean that this global offset goes through the optional and
> non optional pages that exist and skip the ones that are missing according
> to the specific EEPROM ?

ethtool -m|--dump-module-eeprom|--module-info devname [raw on|off] [hex on|off] [offset N] [length N]

So you mean [offset N] [length N].

That is going to be hard, but the API is broken for complex SFPs with
optional pages. And it is not well defined exactly what offset means.
You can keep backwards compatibility by identifying the SFP from page
0, and then reading the pages in the order the ioctl would do. Let
user space handle it, for those SFPs which the kernel already
supports. For SFPs which the kernel does not support, i would just
return not supported. You can do the same for raw. However, for new
SFPs, for raw you can run the decoder but output to /dev/null. That
loads into the cache all the pages which the decoder knows about. You
can then dump the cache. You probably need a new format, to give an
indication of what each page actually is.

Maybe you want to add new options [page N] [ bank N] to allow
arbitrary queries to be made? Again, you can answer these from the
cache, so the old ioctl interface could work if asked for pages which
the old API had.

    Andrew

Moshe Shemesh Nov. 27, 2020, 3:12 p.m. UTC | #7

On 11/26/2020 5:21 PM, Andrew Lunn wrote:
>>> If i was implementing the ethtool side of it, i would probably do some
>>> sort of caching system. We know page 0 should always exist, so
>>> pre-load that into the cache. Try the netlink API first. If that
>>> fails, use the ioctl interface. If the ioctl is used, put everything
>>> returned into the cache.
>> I am not sure what you mean by cache here. Don't you want to read page 0
>> once you got the ethtool command to read from the module ? If not, then at
>> what stage ?
> At the beginning, you try the netlink API and ask for pager 0, bytes
> 0-127. If you get a page, put it into the cache. If not, use the ioctl
> interface, which could return one page, or multiple pages. Put them
> all into the cache.


OK, but if the caching system is checking one time netlink and one time 
ioctl, it means this cache should be in user space, or did you mean to 
have this cache in kernel ?

>>>    The decoder can then start decoding, see what
>>> bits are set indicating other pages should be available. Ask for them
>>> from the cache. The netlink API can go fetch them and load them into
>>> the cache. If they cannot be loaded return ENODEV, and the decoder has
>>> to skip what it wanted to decode.
>> So the decoder should read page 0 and check according to page 0 and
>> specification which pages should be present, right ?
> Yes. It ask the cache, give me a pointer to page 0, bytes 0-127. It
> then decodes that, looking at the enumeration data to indicate what
> other pages should be available. Maybe it decides, page 0, bytes
> 128-255 should exist, so it asks the cache for a pointer to that. If
> using netlink, it would ask the kernel for that data, put it into the
> cache, and return a pointer. If using ioctl, it already knows if it
> has that data, so it just returns a pointer, so says sorry, not
> available.
>
>> What about the global offset that we currently got when user doesn't specify
>> a page, do you mean that this global offset goes through the optional and
>> non optional pages that exist and skip the ones that are missing according
>> to the specific EEPROM ?
> ethtool -m|--dump-module-eeprom|--module-info devname [raw on|off] [hex on|off] [offset N] [length N]
>
> So you mean [offset N] [length N].


Yes, that's the current options and we can either try coding new 
implementation for that or just call the current ioctl implementation. 
The new code can be triggered once options [bank N] and [Page N] are used.

>
> That is going to be hard, but the API is broken for complex SFPs with
> optional pages. And it is not well defined exactly what offset means.
> You can keep backwards compatibility by identifying the SFP from page
> 0, and then reading the pages in the order the ioctl would do. Let
> user space handle it, for those SFPs which the kernel already
> supports. For SFPs which the kernel does not support, i would just
> return not supported. You can do the same for raw. However, for new
> SFPs, for raw you can run the decoder but output to /dev/null. That
> loads into the cache all the pages which the decoder knows about. You
> can then dump the cache. You probably need a new format, to give an
> indication of what each page actually is.
OK, if I got it right on current API [offset N] [length N] just call 
ioctl current implementation, while using the option [raw on] will call 
new implementation for new SFPs (CMIS 4). Also using [bank N] and [page 
N] will call new implementation for new SFPs.
> Maybe you want to add new options [page N] [ bank N] to allow
> arbitrary queries to be made?
Exactly what I meant, I actually thought of letting the user ask for the 
page he wants, he should know what he needs.
> Again, you can answer these from the
> cache, so the old ioctl interface could work if asked for pages which
> the old API had.
Yes, for the simple EEPROM types that have 1 or 4 pages, ioctl read 
should be enough to get the data.
>
>      Andrew

Andrew Lunn Nov. 27, 2020, 3:56 p.m. UTC | #8

> OK, but if the caching system is checking one time netlink and one time
> ioctl, it means this cache should be in user space, or did you mean to have
> this cache in kernel ?

This is all in userspace, in the ethtool code.

> > > What about the global offset that we currently got when user doesn't specify
> > > a page, do you mean that this global offset goes through the optional and
> > > non optional pages that exist and skip the ones that are missing according
> > > to the specific EEPROM ?
> > ethtool -m|--dump-module-eeprom|--module-info devname [raw on|off] [hex on|off] [offset N] [length N]
> > 
> > So you mean [offset N] [length N].
> 
> 
> Yes, that's the current options and we can either try coding new
> implementation for that or just call the current ioctl implementation. The
> new code can be triggered once options [bank N] and [Page N] are used.

You cannot rely on the ioctl being available. New drivers won't
implement it, if they have the netlink code. Drivers will convert from
get_module_info to whatever new ndo call you add for netlink.

> OK, if I got it right on current API [offset N] [length N] just call ioctl
> current implementation, while using the option [raw on] will call new
> implementation for new SFPs (CMIS 4). Also using [bank N] and [page N] will
> call new implementation for new SFPs.

Not just CMIS. All SFPs.

    Andrew

Vladyslav Tarasiuk Dec. 29, 2020, 10:23 a.m. UTC | #9

On 27-Nov-20 17:56, Andrew Lunn wrote:
>> OK, but if the caching system is checking one time netlink and one time
>> ioctl, it means this cache should be in user space, or did you mean to have
>> this cache in kernel ?
> This is all in userspace, in the ethtool code.
>
>>>> What about the global offset that we currently got when user doesn't specify
>>>> a page, do you mean that this global offset goes through the optional and
>>>> non optional pages that exist and skip the ones that are missing according
>>>> to the specific EEPROM ?
>>> ethtool -m|--dump-module-eeprom|--module-info devname [raw on|off] [hex on|off] [offset N] [length N]
>>>
>>> So you mean [offset N] [length N].
>>
>> Yes, that's the current options and we can either try coding new
>> implementation for that or just call the current ioctl implementation. The
>> new code can be triggered once options [bank N] and [Page N] are used.
> You cannot rely on the ioctl being available. New drivers won't
> implement it, if they have the netlink code. Drivers will convert from
> get_module_info to whatever new ndo call you add for netlink.
>
>> OK, if I got it right on current API [offset N] [length N] just call ioctl
>> current implementation, while using the option [raw on] will call new
>> implementation for new SFPs (CMIS 4). Also using [bank N] and [page N] will
>> call new implementation for new SFPs.
> Not just CMIS. All SFPs.
>
>      Andrew

Hi Andrew,

Following this conversation, I wrote some pseudocode checking if I'm on 
right path here.
Please review:

struct eeprom_page {
         u8 page_number;
         u8 bank_number;
         u16 offset;
         u16 data_length;
         u8 *data;
}

Check every requested page, offset and length availability in userspace,
depending on module standard and EEPROM data.
================================================================================

cache_fetch_add(page_number, bank_number, offset, length)
{
         ethnla_put_u8 (msgbuff, ETHTOOL_A_EEPROM_PAGE_NUMBER, page_number);
         ethnla_put_u8 (msgbuff, ETHTOOL_A_EEPROM_BANK_NUMBER, bank_number);
         ethnla_put_u16(msgbuff, ETHTOOL_A_EEPROM_PAGE_OFFSET, offset);
         ethnla_put_u16(msgbuff, ETHTOOL_A_EEPROM_PAGE_LENGTH, length);
         err = nlsock_send_get_request(eeprom_page_reply_cb) // caches a 
page
         if (err < 0)
                 return err;
}

// Any call should be a pair of cache_fetch_add() with error check and only
// then a cache_get(), but for pseudocode this will do
cache_get_page(page_number, bank_number, offset, length)
{
         if (!cache_contains(page_number, bank_number, offset, length))
                 cache_fetch_add(page_number, bank_number, offset, length);
         return cache_get(page_number, bank_number);
}

print_cmis_y()
{
         page_data = cache_get_page(0, 0, 0, 256)->data;
         print_page_zero(page_data);
         page_data = cache_get_page(1, 0, 128, 128)->data;
         print_page_one(page_data);
}

print_human_readable()
{
         spec_id = cache_get_page(0, 0, 0, 128)->data[0];
         switch (spec_id) {
         case sff_xxxx:
                 print_sff_xxxx();
         case cmis_y:
                 print_cmis_y();
         default:
                 print_hex();
         }
}

getmodule_reply_cb()
{
         if (offset || hex || bank_number || page number)
                 print_hex();
         else
                 // if _human_readable() decoder needs more than page 
00, it will
                 // fetch it on demand
                 print_human_readable();
}

eeprom_page_reply_cb()
{
         cache_add(new_page);
}

page_available(page_number)
{
         spec_id = cache_get_page(0, 0, 0, 128)->data[0];

         // check bits indicating page availability
         switch (spec_id) {
         case sff_xxxx:
                 return is_page_available_sff_xxxx(page_number);
         case cmis_y:
                 return is_page_available_cmis_y(page_number, bank_number);
         default:
                 return -EINVAL;
         }
}

nl_getmodule()
{
         msg_init();

         // request first low page
         ret = cache_fetch_add(0, 0, 0, 128);

         if (ret < 0) // netlink unsupported, fall back to ioctl
                 return ret;

         netlink_cmd_check();
         nl_parser()
         // check bits indicating page availability according to used spec
         if (page_available(page_number, bank_number)) {
                 if (cache_contains(page_number, bank_number, offset, 
length))
                         print_page(page_number, bank_number, offset, 
length)
                 else {
                         ret = cache_fetch_add();
                         if (ret < 0)
                                 return ret;
                         print_page(page_number, bank_number, offset, 
length);
                 }
         } else {
                 return -EINVAL;
         }
}


Or just proceed to request any page and depend on kernel parameter 
validation
================================================================================

getmodule_reply_cb()
{
         if (offset || hex || bank_number || page_number)
                 print_hex();
         else
                 // if _human_readable() decoder needs more than page 
00, it will
                 // fetch it on demand
                 print_human_readable();
}

nl_getmodule()
{
         // request page straight away
         netlink_cmd_check();
         nl_parser();
         /*
          * nl_parser() sets page_number, page_bank, offset and length and
          * prepares to pass them to the kernel. See 
eeprom_page_parse_request()
          */
         ret = nlsock_sendmsg(getmodule_reply_cb);

         if (ret < 0)
                 retrun ret;
}



Kernel
This part is the same for both userspace variants
================================================================================

kernel_fallback(request, data)
{
         struct ethtool_eeprom eeprom;

         if (request->bank_number)
                 return -EINVAL;
         // also take into account page 00 size
         eeprom.offset = request->offset + page_size * request->page_number;
         eeprom.len = request->length;
         eeprom.cmd = ETHTOOL_GMODULEEEPROM;
         eeprom.data = data;

         err = ops->get_module_eeprom(dev, &eeprom, eeprom.data)
         if (err < 0)
                 return err;

         // prepare data for response via netlink like for supported ndo
}

eeprom_page_parse_request()
{
         struct eeprom_page_req_info *request;

         request->offset = nla_get_u16(tb[ETHTOOL_A_EEPROM_PAGE_OFFSET]);
         request->length = nla_get_u16(tb[ETHTOOL_A_EEPROM_PAGE_LENGTH]);
         request->page_number = 
nla_get_u8(tb[ETHTOOL_A_EEPROM_PAGE_NUMBER]);
         request->bank_number = 
nla_get_u8(tb[ETHTOOL_A_EEPROM_PAGE_BANK_NUMBER]);
}

eeprom_page_prepare_data(request)
{
         u8 data[128];
         u8 page_zero_data[128];

         if (!ops->get_module_eeprom_page) {
                 if (ops->get_module_info && ops->get_module_eeprom)
                         return kernel_fallback(request, &data);
                 else
                         return -EOPNOTSUPP;
         }
         // fetch page 00 for further checks
         err = ops->get_module_eeprom_page(dev, 0, 128, 0, 0, 
&page_zero_data);
         if (err < 0)
                 return err;

         // Checks according to spec, similar to page_available() in 
userspace.
         // May be ommited and passed directly to the driver for it to 
validate
         if (is_request_valid(request, &page_zero_data))
                 err = ops->get_module_eeprom_page(dev, request->offset,
request->length,
request->page_number,
request->bank_number, &data);
         else
                 return -EINVAL;

}

Driver
It is required to implement get_module_eeprom_page() ndo, where it queries
its EEPROM and copies to u8 *data array allocated by the kernel previously.
The ndo has the following prototype:
int get_module_eeprom_page(struct net_device *, u16 offset, u16 length,
                            u8 page_number, u8 bank_number, u8 *data);

The driver may test whatever it may need in this ndo implementation.
Its parameters may also be passed using a special struct.
===============================================================================

driver_get_module_eeprom_page()
{
         vendor_query_module_eeprom(page_number, bank_number, offset, 
length, data);
}


Thanks,
Vlad

Andrew Lunn Dec. 29, 2020, 4:25 p.m. UTC | #10

> Hi Andrew,
> 
> Following this conversation, I wrote some pseudocode checking if I'm on
> right path here.
> Please review:
> 
> struct eeprom_page {
>         u8 page_number;
>         u8 bank_number;
>         u16 offset;
>         u16 data_length;
>         u8 *data;
> }

I'm wondering about offset and data_length, in this context. I would
expect you always ask the kernel for the full page, not part of
it. Even when user space asks for just part of a page. That keeps you
cache management simpler. But maybe some indicator of low/high is
needed, since many pages are actually 1/2 pages?

The other thing to consider is SFF-8472 and its use of two different
i2c addresses, A0h and A2h. These are different to pages and banks.

> print_human_readable()
> {
>         spec_id = cache_get_page(0, 0, 0, 128)->data[0];
>         switch (spec_id) {
>         case sff_xxxx:
>                 print_sff_xxxx();
>         case cmis_y:
>                 print_cmis_y();
>         default:
>                 print_hex();
>         }
> }

You want to keep as much of the existing ethtool code as you can, but
the basic idea looks O.K.

> getmodule_reply_cb()
> {
>         if (offset || hex || bank_number || page number)
>                 print_hex();
>         else
>                 // if _human_readable() decoder needs more than page 00, it
> will
>                 // fetch it on demand
>                 print_human_readable();
> }

Things get interesting here. Say this is page 0, and
print_human_readable() finds a bit indicating page 1 is valid. So it
requests page 1. We go recursive. While deep down in
print_human_readable(), we send the next netlink message and call
getmodule_reply_cb() when the answer appears. I've had problems with
some of the netlink code not liking recursive calls.

So i suggest you try to find a different structure for the code. Try
to complete the netlink call before doing the decoding. So add the
page to the cache and then return. Do the decode after
nlsock_sendmsg() has returned.

> Driver
> It is required to implement get_module_eeprom_page() ndo, where it queries
> its EEPROM and copies to u8 *data array allocated by the kernel previously.
> The ndo has the following prototype:
> int get_module_eeprom_page(struct net_device *, u16 offset, u16 length,
>                            u8 page_number, u8 bank_number, u8 *data);


I would include extack here, so we can get better error messages.

  Andrew

Vladyslav Tarasiuk Dec. 30, 2020, 1:55 p.m. UTC | #11

On 29-Dec-20 18:25, Andrew Lunn wrote:
>> Hi Andrew,
>>
>> Following this conversation, I wrote some pseudocode checking if I'm on
>> right path here.
>> Please review:
>>
>> struct eeprom_page {
>>          u8 page_number;
>>          u8 bank_number;
>>          u16 offset;
>>          u16 data_length;
>>          u8 *data;
>> }
> I'm wondering about offset and data_length, in this context. I would
> expect you always ask the kernel for the full page, not part of
> it. Even when user space asks for just part of a page. That keeps you
> cache management simpler.
As far as I know, there may be bytes, which may change on read.
For example, clear on read values in CMIS 4.0.
Then, retrieving whole page every time may be incorrect.
So I kept these for cases, when user asks for specific few bytes.
> But maybe some indicator of low/high is
> needed, since many pages are actually 1/2 pages?
I was planning to use offset and data_length fields to indicate the
available page. For example, high page will have offset 128 and
data_length 128.
> The other thing to consider is SFF-8472 and its use of two different
> i2c addresses, A0h and A2h. These are different to pages and banks.

I wasn't aware of that. It complicates things a bit, should we add a
parameter of i2c address? So in this case page 0 will be with i2c
address A0h. And if user needs page 0 from i2c address A2h, he will
specify it in command line. And for other specs, this parameter will
not be supported.

>> print_human_readable()
>> {
>>          spec_id = cache_get_page(0, 0, 0, 128)->data[0];
>>          switch (spec_id) {
>>          case sff_xxxx:
>>                  print_sff_xxxx();
>>          case cmis_y:
>>                  print_cmis_y();
>>          default:
>>                  print_hex();
>>          }
>> }
> You want to keep as much of the existing ethtool code as you can, but
> the basic idea looks O.K.
Yes, under print_sff_xxxx(), for example, I meant using existing functions.
Either as is, or refactoring according to cache requirements.
>> getmodule_reply_cb()
>> {
>>          if (offset || hex || bank_number || page number)
>>                  print_hex();
>>          else
>>                  // if _human_readable() decoder needs more than page 00, it
>> will
>>                  // fetch it on demand
>>                  print_human_readable();
>> }
> Things get interesting here. Say this is page 0, and
> print_human_readable() finds a bit indicating page 1 is valid. So it
> requests page 1. We go recursive. While deep down in
> print_human_readable(), we send the next netlink message and call
> getmodule_reply_cb() when the answer appears. I've had problems with
> some of the netlink code not liking recursive calls.
>
> So i suggest you try to find a different structure for the code. Try
> to complete the netlink call before doing the decoding. So add the
> page to the cache and then return. Do the decode after
> nlsock_sendmsg() has returned.
I'm thinking about a standard-specific function, which will prefetch pages
needed by print_human_readable(). It will check the standard ID, and go 
request
pages and add them to the cache. Then, decoder kicks with already cached
pages. This will eliminate recursive netlink calls.
>> Driver
>> It is required to implement get_module_eeprom_page() ndo, where it queries
>> its EEPROM and copies to u8 *data array allocated by the kernel previously.
>> The ndo has the following prototype:
>> int get_module_eeprom_page(struct net_device *, u16 offset, u16 length,
>>                             u8 page_number, u8 bank_number, u8 *data);
>
> I would include extack here, so we can get better error messages.

OK, I will add extack.

Thanks,
Vlad

Andrew Lunn Dec. 30, 2020, 3:36 p.m. UTC | #12

On Wed, Dec 30, 2020 at 03:55:02PM +0200, Vladyslav Tarasiuk wrote:
> 
> On 29-Dec-20 18:25, Andrew Lunn wrote:
> > > Hi Andrew,
> > > 
> > > Following this conversation, I wrote some pseudocode checking if I'm on
> > > right path here.
> > > Please review:
> > > 
> > > struct eeprom_page {
> > >          u8 page_number;
> > >          u8 bank_number;
> > >          u16 offset;
> > >          u16 data_length;
> > >          u8 *data;
> > > }
> > I'm wondering about offset and data_length, in this context. I would
> > expect you always ask the kernel for the full page, not part of
> > it. Even when user space asks for just part of a page. That keeps you
> > cache management simpler.
> As far as I know, there may be bytes, which may change on read.
> For example, clear on read values in CMIS 4.0.

Ah, i did not know there were such bits. I will go read the spec. But
it should not really matter. If the SFP driver is interested in these
bits, it will have to intercept the read and act on the values.

> I wasn't aware of that. It complicates things a bit, should we add a
> parameter of i2c address? So in this case page 0 will be with i2c
> address A0h. And if user needs page 0 from i2c address A2h, he will
> specify it in command line.

Not on the command line. You should be able to determine from reading
page 0 at A0h is the diagnostics are at A2h or a page of A0h. That is
the whole point of this API, we decode the first page, and that tells
us what other pages should be available. So adding the i2c address to
the netlink message would be sensible. And i would not be too
surprised if there are SFPs with proprietary registers on other
addresses, which could be interesting to dump, if you can get access
to the needed datasheets.

   Andrew

Vladyslav Tarasiuk Jan. 4, 2021, 3:24 p.m. UTC | #13

On 30-Dec-20 17:36, Andrew Lunn wrote:
> On Wed, Dec 30, 2020 at 03:55:02PM +0200, Vladyslav Tarasiuk wrote:
>> On 29-Dec-20 18:25, Andrew Lunn wrote:
>>>> Hi Andrew,
>>>>
>>>> Following this conversation, I wrote some pseudocode checking if I'm on
>>>> right path here.
>>>> Please review:
>>>>
>>>> struct eeprom_page {
>>>>           u8 page_number;
>>>>           u8 bank_number;
>>>>           u16 offset;
>>>>           u16 data_length;
>>>>           u8 *data;
>>>> }
>>> I'm wondering about offset and data_length, in this context. I would
>>> expect you always ask the kernel for the full page, not part of
>>> it. Even when user space asks for just part of a page. That keeps you
>>> cache management simpler.
>> As far as I know, there may be bytes, which may change on read.
>> For example, clear on read values in CMIS 4.0.
> Ah, i did not know there were such bits. I will go read the spec. But
> it should not really matter. If the SFP driver is interested in these
> bits, it will have to intercept the read and act on the values.

But in case user requests a few bytes from a page with clear-on-read
values, reading full page will clear all such bytesfrom user perspective
even if they were not requested. Driver may intercept the read, but for
user it will look like those bytes were not set.

Current user interface allows arbitrary reads, so I wanted to keep this
behavior pretty much exactly like it is now - request specific part of
a page, get this part without any extra data.

>> I wasn't aware of that. It complicates things a bit, should we add a
>> parameter of i2c address? So in this case page 0 will be with i2c
>> address A0h. And if user needs page 0 from i2c address A2h, he will
>> specify it in command line.
> Not on the command line. You should be able to determine from reading
> page 0 at A0h is the diagnostics are at A2h or a page of A0h. That is
> the whole point of this API, we decode the first page, and that tells
> us what other pages should be available. So adding the i2c address to
> the netlink message would be sensible. And i would not be too
> surprised if there are SFPs with proprietary registers on other
> addresses, which could be interesting to dump, if you can get access
> to the needed datasheets.
Without command line argument user will not be able to request a single
A2h page, for example. He will see it only in some kind of general dump -
with human-readable decoder usage or multiple page dump.

And same goes forpages on other i2c addresses. How to know what to dump,
if user does not provide i2c address and there is no way to know what to
request from proprietary SFPs?

Thanks,
Vlad

Andrew Lunn Jan. 4, 2021, 3:48 p.m. UTC | #14

On Mon, Jan 04, 2021 at 05:24:11PM +0200, Vladyslav Tarasiuk wrote:
> 
> On 30-Dec-20 17:36, Andrew Lunn wrote:
> > On Wed, Dec 30, 2020 at 03:55:02PM +0200, Vladyslav Tarasiuk wrote:
> > > On 29-Dec-20 18:25, Andrew Lunn wrote:
> > > > > Hi Andrew,
> > > > > 
> > > > > Following this conversation, I wrote some pseudocode checking if I'm on
> > > > > right path here.
> > > > > Please review:
> > > > > 
> > > > > struct eeprom_page {
> > > > >           u8 page_number;
> > > > >           u8 bank_number;
> > > > >           u16 offset;
> > > > >           u16 data_length;
> > > > >           u8 *data;
> > > > > }
> > > > I'm wondering about offset and data_length, in this context. I would
> > > > expect you always ask the kernel for the full page, not part of
> > > > it. Even when user space asks for just part of a page. That keeps you
> > > > cache management simpler.
> > > As far as I know, there may be bytes, which may change on read.
> > > For example, clear on read values in CMIS 4.0.
> > Ah, i did not know there were such bits. I will go read the spec. But
> > it should not really matter. If the SFP driver is interested in these
> > bits, it will have to intercept the read and act on the values.
> 
> But in case user requests a few bytes from a page with clear-on-read
> values, reading full page will clear all such bytesfrom user perspective
> even if they were not requested. Driver may intercept the read, but for
> user it will look like those bytes were not set.

Yes, O.K. Reading individual words does make sense.

> Without command line argument user will not be able to request a single
> A2h page, for example. He will see it only in some kind of general dump -
> with human-readable decoder usage or multiple page dump.
> 
> And same goes forpages on other i2c addresses. How to know what to dump,
> if user does not provide i2c address and there is no way to know what to
> request from proprietary SFPs?

So we should look at this from the perspective of use cases. Currently
we have:

ethtool -m|--dump-module-eeprom|--module-info devname [raw on|off] [hex on|off] [offset N] [length N]

If you use it without any of [raw on|off] [hex on|off] [offset N]
[length N] it decodes what it finds. As soon as you pass any of these
options, the decoding is disabled and is just dumps values, either raw
or hex.

I would say, i2c address as a parameter can be added, but again,
passing it disables decoding, is just dumps raw or hex.

When not passed, and decoding is enabled, the decoder should decide if
A2 is available based on what is finds in page 0, and ask for it.

We also need to clearly define what offset and length means in this
context. It has to be within a specific page if page, bank or i2c
address has been passed, unlike what it currently means which is
offset into the current blob returned by the kernel.

I also took a look at CMIS. It has interesting semantics for address
wrap around when doing multiple byte reads. A read which reaches 127
wraps around to 0. A read which reached 255 wraps around to 128. So
for the kernel API, we probably do not want to allow offset/length to
cause a wrap around. You can only read within the low 128 bytes, or
the upper 128 bytes.

   Andrew

[net-next,v2,0/2] Add support for DSFP transceiver type

Message

Comments