[net-next,RFC,00/10] introduce line card support for modular switch

Message ID	20210113121222.733517-1-jiri@resnulli.us (mailing list archive)
Headers	show Return-Path: <netdev-owner@kernel.org> From: Jiri Pirko <jiri@resnulli.us> To: netdev@vger.kernel.org Cc: davem@davemloft.net, kuba@kernel.org, jacob.e.keller@intel.com, roopa@nvidia.com, mlxsw@nvidia.com Subject: [patch net-next RFC 00/10] introduce line card support for modular switch Date: Wed, 13 Jan 2021 13:12:12 +0100 Message-Id: <20210113121222.733517-1-jiri@resnulli.us> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	introduce line card support for modular switch \| expand [net-next,RFC,00/10] introduce line card support for modular switch [net-next,RFC,01/10] devlink: add support to create line card and expose to user [net-next,RFC,02/10] devlink: implement line card provisioning [net-next,RFC,03/10] devlink: implement line card active state [net-next,RFC,04/10] devlink: append split port number to the port name [net-next,RFC,05/10] devlink: add port to line card relationship set [net-next,RFC,06/10] netdevsim: introduce line card support [net-next,RFC,07/10] netdevsim: allow port objects to be linked with line cards [net-next,RFC,08/10] netdevsim: create devlink line card object and implement provisioning [net-next,RFC,09/10] netdevsim: implement line card activation [net-next,RFC,10/10] selftests: add netdevsim devlink lc test

Jiri Pirko Jan. 13, 2021, 12:12 p.m. UTC

From: Jiri Pirko <jiri@nvidia.com>

This patchset introduces support for modular switch systems.
NVIDIA Mellanox SN4800 is an example of such. It contains 8 slots
to accomodate line cards. Available line cards include:
16X 100GbE (QSFP28)
8X 200GbE (QSFP56)
4X 400GbE (QSFP-DD)

Similar to split cabels, it is essencial for the correctness of
configuration and funcionality to treat the line card entities
in the same way, no matter the line card is inserted or not.
Meaning, the netdevice of a line card port cannot just disappear
when line card is removed. Also, system admin needs to be able
to apply configuration on netdevices belonging to line card port
even before the linecard gets inserted.

To resolve this, a concept of "provisioning" is introduced.
The user may "provision" certain slot with a line card type.
Driver then creates all instances (devlink ports, netdevices, etc)
related to this line card type. The carrier of netdevices stays down.
Once the line card is inserted and activated, the carrier of the
related netdevices goes up.

Once user does not want to use the line card related instances
anymore, he can "unprovision" the slot. Driver then removes the
instances.

Patches 1-5 are extending devlink driver API and UAPI in order to
register, show, dump, provision and activate the line card.
Patches 6-9 are implementing the introduced API in netdevsim

Example:

# Create a new netdevsim device, with no ports and 2 line cards:
$ echo "10 0 2" >/sys/bus/netdevsim/new_device
$ devlink port # No ports are listed
$ devlink lc
netdevsim/netdevsim10:
  lc 0 state unprovisioned
    supported_types:
       card1port card2ports card4ports
  lc 1 state unprovisioned
    supported_types:
       card1port card2ports card4ports

# Note that driver advertizes supported line card types. In case of
# netdevsim, these are 3.

$ devlink lc provision netdevsim/netdevsim10 lc 0 type card4ports
$ devlink lc
netdevsim/netdevsim10:
  lc 0 state provisioned type card4ports
    supported_types:
       card1port card2ports card4ports
  lc 1 state unprovisioned
    supported_types:
       card1port card2ports card4ports
$ devlink port
netdevsim/netdevsim10/1000: type eth netdev eni10nl0p1 flavour physical lc 0 port 1 splittable false
netdevsim/netdevsim10/1001: type eth netdev eni10nl0p2 flavour physical lc 0 port 2 splittable false
netdevsim/netdevsim10/1002: type eth netdev eni10nl0p3 flavour physical lc 0 port 3 splittable false
netdevsim/netdevsim10/1003: type eth netdev eni10nl0p4 flavour physical lc 0 port 4 splittable false
#                                                 ^^                    ^^^^
#                                     netdev name adjusted          index of a line card this port belongs to

$ ip link set eni10nl0p1 up 
$ ip link show eni10nl0p1   
165: eni10nl0p1: <NO-CARRIER,BROADCAST,NOARP,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff

# Now activate the line card using debugfs. That emulates plug-in event
# on real hardware:
$ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active
$ ip link show eni10nl0p1
165: eni10nl0p1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff
# The carrier is UP now.

Jiri Pirko (10):
  devlink: add support to create line card and expose to user
  devlink: implement line card provisioning
  devlink: implement line card active state
  devlink: append split port number to the port name
  devlink: add port to line card relationship set
  netdevsim: introduce line card support
  netdevsim: allow port objects to be linked with line cards
  netdevsim: create devlink line card object and implement provisioning
  netdevsim: implement line card activation
  selftests: add netdevsim devlink lc test

 drivers/net/netdevsim/bus.c                   |  21 +-
 drivers/net/netdevsim/dev.c                   | 370 ++++++++++++++-
 drivers/net/netdevsim/netdev.c                |   2 +
 drivers/net/netdevsim/netdevsim.h             |  23 +
 include/net/devlink.h                         |  44 ++
 include/uapi/linux/devlink.h                  |  25 +
 net/core/devlink.c                            | 443 +++++++++++++++++-
 .../drivers/net/netdevsim/devlink.sh          |  62 ++-
 8 files changed, 964 insertions(+), 26 deletions(-)

Andrew Lunn Jan. 14, 2021, 2:07 a.m. UTC | #1

> $ devlink lc provision netdevsim/netdevsim10 lc 0 type card4ports
> $ devlink lc
> netdevsim/netdevsim10:
>   lc 0 state provisioned type card4ports
>     supported_types:
>        card1port card2ports card4ports
>   lc 1 state unprovisioned
>     supported_types:
>        card1port card2ports card4ports

Hi Jiri

> # Now activate the line card using debugfs. That emulates plug-in event
> # on real hardware:
> $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active
> $ ip link show eni10nl0p1
> 165: eni10nl0p1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
>     link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff
> # The carrier is UP now.

What is missing from the devlink lc view is what line card is actually
in the slot. Say if i provision for a card4port, but actually insert a
card2port. It would be nice to have something like:

 $ devlink lc
 netdevsim/netdevsim10:
   lc 0 state provisioned type card4ports
     supported_types:
        card1port card2ports card4ports
     inserted_type:
        card2ports;
   lc 1 state unprovisioned
     supported_types:
        card1port card2ports card4ports
     inserted_type:
        None

I assume if i prevision for card4ports but actually install a
card2ports, all the interfaces stay down?

Maybe

> $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active

should actually be
    echo "card2ports" > /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active

so you can emulate somebody putting the wrong card in the slot?

    Andrew

Jakub Kicinski Jan. 14, 2021, 2:27 a.m. UTC | #2

On Wed, 13 Jan 2021 13:12:12 +0100 Jiri Pirko wrote:
> This patchset introduces support for modular switch systems.
> NVIDIA Mellanox SN4800 is an example of such. It contains 8 slots
> to accomodate line cards. Available line cards include:
> 16X 100GbE (QSFP28)
> 8X 200GbE (QSFP56)
> 4X 400GbE (QSFP-DD)
> 
> Similar to split cabels, it is essencial for the correctness of
> configuration and funcionality to treat the line card entities
> in the same way, no matter the line card is inserted or not.
> Meaning, the netdevice of a line card port cannot just disappear
> when line card is removed. Also, system admin needs to be able
> to apply configuration on netdevices belonging to line card port
> even before the linecard gets inserted.

I don't understand why that would be. Please provide reasoning, 
e.g. what the FW/HW limitation is.

> To resolve this, a concept of "provisioning" is introduced.
> The user may "provision" certain slot with a line card type.
> Driver then creates all instances (devlink ports, netdevices, etc)
> related to this line card type. The carrier of netdevices stays down.
> Once the line card is inserted and activated, the carrier of the
> related netdevices goes up.

Dunno what "line card" means for Mellovidia but I don't think 
the analogy of port splitting works. To my knowledge traditional
line cards often carry processors w/ full MACs etc. so I'd say 
plugging in a line card is much more like plugging in a new NIC.

There is no way to tell a breakout cable from normal one, so the
system has no chance to magically configure itself. Besides SFP
is just plugging a cable, not a module of the system..

Jiri Pirko Jan. 14, 2021, 7:39 a.m. UTC | #3

Thu, Jan 14, 2021 at 03:07:18AM CET, andrew@lunn.ch wrote:
>> $ devlink lc provision netdevsim/netdevsim10 lc 0 type card4ports
>> $ devlink lc
>> netdevsim/netdevsim10:
>>   lc 0 state provisioned type card4ports
>>     supported_types:
>>        card1port card2ports card4ports
>>   lc 1 state unprovisioned
>>     supported_types:
>>        card1port card2ports card4ports
>
>Hi Jiri
>
>> # Now activate the line card using debugfs. That emulates plug-in event
>> # on real hardware:
>> $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active
>> $ ip link show eni10nl0p1
>> 165: eni10nl0p1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
>>     link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff
>> # The carrier is UP now.
>
>What is missing from the devlink lc view is what line card is actually
>in the slot. Say if i provision for a card4port, but actually insert a
>card2port. It would be nice to have something like:
>
> $ devlink lc
> netdevsim/netdevsim10:
>   lc 0 state provisioned type card4ports
>     supported_types:
>        card1port card2ports card4ports
>     inserted_type:
>        card2ports;
>   lc 1 state unprovisioned
>     supported_types:
>        card1port card2ports card4ports
>     inserted_type:
>        None

I see. Yes, that might be doable. I'm noting this down.


>
>I assume if i prevision for card4ports but actually install a
>card2ports, all the interfaces stay down?

Yes, the card won't get activated in case or provision mismatch.


>
>Maybe
>
>> $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active
>
>should actually be
>    echo "card2ports" > /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active
>
>so you can emulate somebody putting the wrong card in the slot?

Got you.

Thanks!

>
>    Andrew

Jiri Pirko Jan. 14, 2021, 7:48 a.m. UTC | #4

Thu, Jan 14, 2021 at 03:27:16AM CET, kuba@kernel.org wrote:
>On Wed, 13 Jan 2021 13:12:12 +0100 Jiri Pirko wrote:
>> This patchset introduces support for modular switch systems.
>> NVIDIA Mellanox SN4800 is an example of such. It contains 8 slots
>> to accomodate line cards. Available line cards include:
>> 16X 100GbE (QSFP28)
>> 8X 200GbE (QSFP56)
>> 4X 400GbE (QSFP-DD)
>> 
>> Similar to split cabels, it is essencial for the correctness of
>> configuration and funcionality to treat the line card entities
>> in the same way, no matter the line card is inserted or not.
>> Meaning, the netdevice of a line card port cannot just disappear
>> when line card is removed. Also, system admin needs to be able
>> to apply configuration on netdevices belonging to line card port
>> even before the linecard gets inserted.
>
>I don't understand why that would be. Please provide reasoning, 
>e.g. what the FW/HW limitation is.

Well, for split cable, you need to be able to say:
port 2, split into 4. And you will have 4 netdevices. These netdevices
you can use to put into bridge, configure mtu, speeds, routes, etc.
These will exist no matter if the splitter cable is actually inserted or
not.

With linecards, this is very similar. By provisioning, you also create
certain number of ports, according to the linecard that you plan to
insert. And similarly to the splitter, the netdevices are created.

You may combine the linecard/splitter config when splitter cable is
connected to a linecard port. Then you provision a linecard,
port is going to appear and you will split this port.

>
>> To resolve this, a concept of "provisioning" is introduced.
>> The user may "provision" certain slot with a line card type.
>> Driver then creates all instances (devlink ports, netdevices, etc)
>> related to this line card type. The carrier of netdevices stays down.
>> Once the line card is inserted and activated, the carrier of the
>> related netdevices goes up.
>
>Dunno what "line card" means for Mellovidia but I don't think 
>the analogy of port splitting works. To my knowledge traditional
>line cards often carry processors w/ full MACs etc. so I'd say 
>plugging in a line card is much more like plugging in a new NIC.

No. It is basically a phy gearbox. The mac is not there. The interface
between asic and linecard are lanes. The linecards is basically an
attachable phy.

>
>There is no way to tell a breakout cable from normal one, so the
>system has no chance to magically configure itself. Besides SFP
>is just plugging a cable, not a module of the system..

Jacob Keller Jan. 14, 2021, 10:56 p.m. UTC | #5

On 1/13/2021 11:39 PM, Jiri Pirko wrote:
> Thu, Jan 14, 2021 at 03:07:18AM CET, andrew@lunn.ch wrote:
>>
>> I assume if i prevision for card4ports but actually install a
>> card2ports, all the interfaces stay down?
> 
> Yes, the card won't get activated in case or provision mismatch.
> 

If you're able to detect the line card type when plugging it in, I don't
understand why you need system administrator to pre-provision it using
such an interface? Wouldn't it make more sense to simply detect the
case? Or is it that you expect these things to be moved around and want
to make sure that you can configure the associated netdevices before the
card is plugged in?

Jacob Keller Jan. 14, 2021, 10:58 p.m. UTC | #6

On 1/13/2021 6:27 PM, Jakub Kicinski wrote:
> On Wed, 13 Jan 2021 13:12:12 +0100 Jiri Pirko wrote:
>> This patchset introduces support for modular switch systems.
>> NVIDIA Mellanox SN4800 is an example of such. It contains 8 slots
>> to accomodate line cards. Available line cards include:
>> 16X 100GbE (QSFP28)
>> 8X 200GbE (QSFP56)
>> 4X 400GbE (QSFP-DD)
>>
>> Similar to split cabels, it is essencial for the correctness of
>> configuration and funcionality to treat the line card entities
>> in the same way, no matter the line card is inserted or not.
>> Meaning, the netdevice of a line card port cannot just disappear
>> when line card is removed. Also, system admin needs to be able
>> to apply configuration on netdevices belonging to line card port
>> even before the linecard gets inserted.
> 
> I don't understand why that would be. Please provide reasoning, 
> e.g. what the FW/HW limitation is.
> 

I agree, I wouldn't imagine that plugging or unplugging line cards is
expected to be done on a regular basis?

>> To resolve this, a concept of "provisioning" is introduced.
>> The user may "provision" certain slot with a line card type.
>> Driver then creates all instances (devlink ports, netdevices, etc)
>> related to this line card type. The carrier of netdevices stays down.
>> Once the line card is inserted and activated, the carrier of the
>> related netdevices goes up.
> 
> Dunno what "line card" means for Mellovidia but I don't think 
> the analogy of port splitting works. To my knowledge traditional
> line cards often carry processors w/ full MACs etc. so I'd say 
> plugging in a line card is much more like plugging in a new NIC.
> 

Even if they didn't...

> There is no way to tell a breakout cable from normal one, so the
> system has no chance to magically configure itself. Besides SFP
> is just plugging a cable, not a module of the system.. 
> 
If you're able to tell what is plugged in, why would we want to force
user to provision ahead of time? Wouldn't it make more sense to just
instantiate them as the card is plugged in? I guess it might be useful
to allow programming the netdevices before the cable is actually
inserted... I guess I don't see why that is valuable.

It would be sort of like if you provision a PCI slot before a device is
plugged into it..

Jakub Kicinski Jan. 14, 2021, 11:20 p.m. UTC | #7

On Thu, 14 Jan 2021 14:58:33 -0800 Jacob Keller wrote:
> > There is no way to tell a breakout cable from normal one, so the
> > system has no chance to magically configure itself. Besides SFP
> > is just plugging a cable, not a module of the system.. 
> >   
> If you're able to tell what is plugged in, why would we want to force
> user to provision ahead of time? Wouldn't it make more sense to just
> instantiate them as the card is plugged in? I guess it might be useful
> to allow programming the netdevices before the cable is actually
> inserted... I guess I don't see why that is valuable.
> 
> It would be sort of like if you provision a PCI slot before a device is
> plugged into it..

Yup, that's pretty much my thinking as well.

Jakub Kicinski Jan. 14, 2021, 11:30 p.m. UTC | #8

On Thu, 14 Jan 2021 08:48:04 +0100 Jiri Pirko wrote:
> Thu, Jan 14, 2021 at 03:27:16AM CET, kuba@kernel.org wrote:
> >On Wed, 13 Jan 2021 13:12:12 +0100 Jiri Pirko wrote:  
> >> This patchset introduces support for modular switch systems.
> >> NVIDIA Mellanox SN4800 is an example of such. It contains 8 slots
> >> to accomodate line cards. Available line cards include:
> >> 16X 100GbE (QSFP28)
> >> 8X 200GbE (QSFP56)
> >> 4X 400GbE (QSFP-DD)
> >> 
> >> Similar to split cabels, it is essencial for the correctness of
> >> configuration and funcionality to treat the line card entities
> >> in the same way, no matter the line card is inserted or not.
> >> Meaning, the netdevice of a line card port cannot just disappear
> >> when line card is removed. Also, system admin needs to be able
> >> to apply configuration on netdevices belonging to line card port
> >> even before the linecard gets inserted.  
> >
> >I don't understand why that would be. Please provide reasoning, 
> >e.g. what the FW/HW limitation is.  
> 
> Well, for split cable, you need to be able to say:
> port 2, split into 4. And you will have 4 netdevices. These netdevices
> you can use to put into bridge, configure mtu, speeds, routes, etc.
> These will exist no matter if the splitter cable is actually inserted or
> not.

The difference is that the line card is more detectable (I hope).

I'm not a SFP experts so maybe someone will correct me but AFAIU
the QSFP (for optics) is the same regardless of breakout. It's the
passive optical strands that are either bundled or not. So there is 
no way for the system to detect the cable type (AFAIK).

Or to put it differently IMO the netdev should be provisioned if the
system has a port into which user can plug in a cable. When there is 
a line card-sized hole in the chassis, I'd be surprised to see ports.

That said I never worked with real world routers so maybe that's what
they do. Maybe some with a Cisco router in the basement can tell us? :)

> With linecards, this is very similar. By provisioning, you also create
> certain number of ports, according to the linecard that you plan to
> insert. And similarly to the splitter, the netdevices are created.
> 
> You may combine the linecard/splitter config when splitter cable is
> connected to a linecard port. Then you provision a linecard,
> port is going to appear and you will split this port.
> 
> >> To resolve this, a concept of "provisioning" is introduced.
> >> The user may "provision" certain slot with a line card type.
> >> Driver then creates all instances (devlink ports, netdevices, etc)
> >> related to this line card type. The carrier of netdevices stays down.
> >> Once the line card is inserted and activated, the carrier of the
> >> related netdevices goes up.  
> >
> >Dunno what "line card" means for Mellovidia but I don't think 
> >the analogy of port splitting works. To my knowledge traditional
> >line cards often carry processors w/ full MACs etc. so I'd say 
> >plugging in a line card is much more like plugging in a new NIC.  
> 
> No. It is basically a phy gearbox. The mac is not there. The interface
> between asic and linecard are lanes. The linecards is basically an
> attachable phy.

If the device really needs this configuration / can't detect things
automatically, then we gotta do something like what you have.
The only question is do we still want to call it a line card.
Sounds more like a front panel module. At Netronome we called 
those phymods.

Jiri Pirko Jan. 15, 2021, 2:19 p.m. UTC | #9

Thu, Jan 14, 2021 at 11:56:15PM CET, jacob.e.keller@intel.com wrote:
>
>
>On 1/13/2021 11:39 PM, Jiri Pirko wrote:
>> Thu, Jan 14, 2021 at 03:07:18AM CET, andrew@lunn.ch wrote:
>>>
>>> I assume if i prevision for card4ports but actually install a
>>> card2ports, all the interfaces stay down?
>> 
>> Yes, the card won't get activated in case or provision mismatch.
>> 
>
>If you're able to detect the line card type when plugging it in, I don't
>understand why you need system administrator to pre-provision it using
>such an interface? Wouldn't it make more sense to simply detect the
>case? Or is it that you expect these things to be moved around and want
>to make sure that you can configure the associated netdevices before the
>card is plugged in?

That is what I wrote in the cover letter.

Jiri Pirko Jan. 15, 2021, 2:39 p.m. UTC | #10

Fri, Jan 15, 2021 at 12:30:13AM CET, kuba@kernel.org wrote:
>On Thu, 14 Jan 2021 08:48:04 +0100 Jiri Pirko wrote:
>> Thu, Jan 14, 2021 at 03:27:16AM CET, kuba@kernel.org wrote:
>> >On Wed, 13 Jan 2021 13:12:12 +0100 Jiri Pirko wrote:  
>> >> This patchset introduces support for modular switch systems.
>> >> NVIDIA Mellanox SN4800 is an example of such. It contains 8 slots
>> >> to accomodate line cards. Available line cards include:
>> >> 16X 100GbE (QSFP28)
>> >> 8X 200GbE (QSFP56)
>> >> 4X 400GbE (QSFP-DD)
>> >> 
>> >> Similar to split cabels, it is essencial for the correctness of
>> >> configuration and funcionality to treat the line card entities
>> >> in the same way, no matter the line card is inserted or not.
>> >> Meaning, the netdevice of a line card port cannot just disappear
>> >> when line card is removed. Also, system admin needs to be able
>> >> to apply configuration on netdevices belonging to line card port
>> >> even before the linecard gets inserted.  
>> >
>> >I don't understand why that would be. Please provide reasoning, 
>> >e.g. what the FW/HW limitation is.  
>> 
>> Well, for split cable, you need to be able to say:
>> port 2, split into 4. And you will have 4 netdevices. These netdevices
>> you can use to put into bridge, configure mtu, speeds, routes, etc.
>> These will exist no matter if the splitter cable is actually inserted or
>> not.
>
>The difference is that the line card is more detectable (I hope).
>
>I'm not a SFP experts so maybe someone will correct me but AFAIU
>the QSFP (for optics) is the same regardless of breakout. It's the
>passive optical strands that are either bundled or not. So there is 
>no way for the system to detect the cable type (AFAIK).

For SFP module, you are able to detect those.

>
>Or to put it differently IMO the netdev should be provisioned if the
>system has a port into which user can plug in a cable. When there is 

Not really. For slit cables, the ports are provisioned not matter which
cable is connected, slitter 1->2/1->4 or 1->1 cable.


>a line card-sized hole in the chassis, I'd be surprised to see ports.
>
>That said I never worked with real world routers so maybe that's what
>they do. Maybe some with a Cisco router in the basement can tell us? :)

The need for provision/pre-configure splitter/linecard is that the
ports/netdevices do not disapper/reappear when you replace
splitter/linecard. Consider a faulty linecard with one port burned. You
just want to replace it with new one. And in that case, you really don't
want kernel to remove netdevices and possibly mess up routing for
example.


>
>> With linecards, this is very similar. By provisioning, you also create
>> certain number of ports, according to the linecard that you plan to
>> insert. And similarly to the splitter, the netdevices are created.
>> 
>> You may combine the linecard/splitter config when splitter cable is
>> connected to a linecard port. Then you provision a linecard,
>> port is going to appear and you will split this port.
>> 
>> >> To resolve this, a concept of "provisioning" is introduced.
>> >> The user may "provision" certain slot with a line card type.
>> >> Driver then creates all instances (devlink ports, netdevices, etc)
>> >> related to this line card type. The carrier of netdevices stays down.
>> >> Once the line card is inserted and activated, the carrier of the
>> >> related netdevices goes up.  
>> >
>> >Dunno what "line card" means for Mellovidia but I don't think 
>> >the analogy of port splitting works. To my knowledge traditional
>> >line cards often carry processors w/ full MACs etc. so I'd say 
>> >plugging in a line card is much more like plugging in a new NIC.  
>> 
>> No. It is basically a phy gearbox. The mac is not there. The interface
>> between asic and linecard are lanes. The linecards is basically an
>> attachable phy.
>
>If the device really needs this configuration / can't detect things
>automatically, then we gotta do something like what you have.
>The only question is do we still want to call it a line card.
>Sounds more like a front panel module. At Netronome we called 
>those phymods.

Sure, the name is up to the discussion. We call it "linecard"
internally. I don't care about the name.

Jiri Pirko Jan. 15, 2021, 2:40 p.m. UTC | #11

Fri, Jan 15, 2021 at 12:20:58AM CET, kuba@kernel.org wrote:
>On Thu, 14 Jan 2021 14:58:33 -0800 Jacob Keller wrote:
>> > There is no way to tell a breakout cable from normal one, so the
>> > system has no chance to magically configure itself. Besides SFP
>> > is just plugging a cable, not a module of the system.. 
>> >   
>> If you're able to tell what is plugged in, why would we want to force
>> user to provision ahead of time? Wouldn't it make more sense to just
>> instantiate them as the card is plugged in? I guess it might be useful
>> to allow programming the netdevices before the cable is actually
>> inserted... I guess I don't see why that is valuable.
>> 
>> It would be sort of like if you provision a PCI slot before a device is
>> plugged into it..
>
>Yup, that's pretty much my thinking as well.

Please see my reply in the other sub-tread of this thread. Thanks!

Ido Schimmel Jan. 15, 2021, 3:43 p.m. UTC | #12

On Wed, Jan 13, 2021 at 01:12:12PM +0100, Jiri Pirko wrote:
> # Create a new netdevsim device, with no ports and 2 line cards:
> $ echo "10 0 2" >/sys/bus/netdevsim/new_device
> $ devlink port # No ports are listed
> $ devlink lc
> netdevsim/netdevsim10:
>   lc 0 state unprovisioned
>     supported_types:
>        card1port card2ports card4ports
>   lc 1 state unprovisioned
>     supported_types:
>        card1port card2ports card4ports
> 
> # Note that driver advertizes supported line card types. In case of
> # netdevsim, these are 3.
> 
> $ devlink lc provision netdevsim/netdevsim10 lc 0 type card4ports

Why do we need a separate command for that? You actually introduced
'DEVLINK_CMD_LINECARD_SET' in patch #1, but it's never used.

I prefer:

devlink lc set netdevsim/netdevsim10 index 0 state provision type card4ports
devlink lc set netdevsim/netdevsim10 index 0 state unprovision

It is consistent with the GET/SET/NEW/DEL pattern used by other
commands.

> $ devlink lc
> netdevsim/netdevsim10:
>   lc 0 state provisioned type card4ports
>     supported_types:
>        card1port card2ports card4ports
>   lc 1 state unprovisioned
>     supported_types:
>        card1port card2ports card4ports
> $ devlink port
> netdevsim/netdevsim10/1000: type eth netdev eni10nl0p1 flavour physical lc 0 port 1 splittable false
> netdevsim/netdevsim10/1001: type eth netdev eni10nl0p2 flavour physical lc 0 port 2 splittable false
> netdevsim/netdevsim10/1002: type eth netdev eni10nl0p3 flavour physical lc 0 port 3 splittable false
> netdevsim/netdevsim10/1003: type eth netdev eni10nl0p4 flavour physical lc 0 port 4 splittable false
> #                                                 ^^                    ^^^^
> #                                     netdev name adjusted          index of a line card this port belongs to
> 
> $ ip link set eni10nl0p1 up 
> $ ip link show eni10nl0p1   
> 165: eni10nl0p1: <NO-CARRIER,BROADCAST,NOARP,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
>     link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff
> 
> # Now activate the line card using debugfs. That emulates plug-in event
> # on real hardware:
> $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active
> $ ip link show eni10nl0p1
> 165: eni10nl0p1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
>     link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff
> # The carrier is UP now.
> 
> Jiri Pirko (10):
>   devlink: add support to create line card and expose to user
>   devlink: implement line card provisioning
>   devlink: implement line card active state
>   devlink: append split port number to the port name
>   devlink: add port to line card relationship set
>   netdevsim: introduce line card support
>   netdevsim: allow port objects to be linked with line cards
>   netdevsim: create devlink line card object and implement provisioning
>   netdevsim: implement line card activation
>   selftests: add netdevsim devlink lc test
> 
>  drivers/net/netdevsim/bus.c                   |  21 +-
>  drivers/net/netdevsim/dev.c                   | 370 ++++++++++++++-
>  drivers/net/netdevsim/netdev.c                |   2 +
>  drivers/net/netdevsim/netdevsim.h             |  23 +
>  include/net/devlink.h                         |  44 ++
>  include/uapi/linux/devlink.h                  |  25 +
>  net/core/devlink.c                            | 443 +++++++++++++++++-
>  .../drivers/net/netdevsim/devlink.sh          |  62 ++-
>  8 files changed, 964 insertions(+), 26 deletions(-)
> 
> -- 
> 2.26.2
>

Jiri Pirko Jan. 15, 2021, 4:55 p.m. UTC | #13

Fri, Jan 15, 2021 at 04:43:57PM CET, idosch@idosch.org wrote:
>On Wed, Jan 13, 2021 at 01:12:12PM +0100, Jiri Pirko wrote:
>> # Create a new netdevsim device, with no ports and 2 line cards:
>> $ echo "10 0 2" >/sys/bus/netdevsim/new_device
>> $ devlink port # No ports are listed
>> $ devlink lc
>> netdevsim/netdevsim10:
>>   lc 0 state unprovisioned
>>     supported_types:
>>        card1port card2ports card4ports
>>   lc 1 state unprovisioned
>>     supported_types:
>>        card1port card2ports card4ports
>> 
>> # Note that driver advertizes supported line card types. In case of
>> # netdevsim, these are 3.
>> 
>> $ devlink lc provision netdevsim/netdevsim10 lc 0 type card4ports
>
>Why do we need a separate command for that? You actually introduced
>'DEVLINK_CMD_LINECARD_SET' in patch #1, but it's never used.
>
>I prefer:
>
>devlink lc set netdevsim/netdevsim10 index 0 state provision type card4ports

This is misleading. This is actually not setting state. The state gets
changed upon successful provisioning process. Also, one may think that
he can set other states, but he can't. I don't like this at all :/


>devlink lc set netdevsim/netdevsim10 index 0 state unprovision
>
>It is consistent with the GET/SET/NEW/DEL pattern used by other
>commands.

Not really, see split port for example. This is similar to that.

>
>> $ devlink lc
>> netdevsim/netdevsim10:
>>   lc 0 state provisioned type card4ports
>>     supported_types:
>>        card1port card2ports card4ports
>>   lc 1 state unprovisioned
>>     supported_types:
>>        card1port card2ports card4ports
>> $ devlink port
>> netdevsim/netdevsim10/1000: type eth netdev eni10nl0p1 flavour physical lc 0 port 1 splittable false
>> netdevsim/netdevsim10/1001: type eth netdev eni10nl0p2 flavour physical lc 0 port 2 splittable false
>> netdevsim/netdevsim10/1002: type eth netdev eni10nl0p3 flavour physical lc 0 port 3 splittable false
>> netdevsim/netdevsim10/1003: type eth netdev eni10nl0p4 flavour physical lc 0 port 4 splittable false
>> #                                                 ^^                    ^^^^
>> #                                     netdev name adjusted          index of a line card this port belongs to
>> 
>> $ ip link set eni10nl0p1 up 
>> $ ip link show eni10nl0p1   
>> 165: eni10nl0p1: <NO-CARRIER,BROADCAST,NOARP,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
>>     link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff
>> 
>> # Now activate the line card using debugfs. That emulates plug-in event
>> # on real hardware:
>> $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active
>> $ ip link show eni10nl0p1
>> 165: eni10nl0p1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
>>     link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff
>> # The carrier is UP now.
>> 
>> Jiri Pirko (10):
>>   devlink: add support to create line card and expose to user
>>   devlink: implement line card provisioning
>>   devlink: implement line card active state
>>   devlink: append split port number to the port name
>>   devlink: add port to line card relationship set
>>   netdevsim: introduce line card support
>>   netdevsim: allow port objects to be linked with line cards
>>   netdevsim: create devlink line card object and implement provisioning
>>   netdevsim: implement line card activation
>>   selftests: add netdevsim devlink lc test
>> 
>>  drivers/net/netdevsim/bus.c                   |  21 +-
>>  drivers/net/netdevsim/dev.c                   | 370 ++++++++++++++-
>>  drivers/net/netdevsim/netdev.c                |   2 +
>>  drivers/net/netdevsim/netdevsim.h             |  23 +
>>  include/net/devlink.h                         |  44 ++
>>  include/uapi/linux/devlink.h                  |  25 +
>>  net/core/devlink.c                            | 443 +++++++++++++++++-
>>  .../drivers/net/netdevsim/devlink.sh          |  62 ++-
>>  8 files changed, 964 insertions(+), 26 deletions(-)
>> 
>> -- 
>> 2.26.2
>>

Ido Schimmel Jan. 15, 2021, 6:01 p.m. UTC | #14

On Fri, Jan 15, 2021 at 05:55:59PM +0100, Jiri Pirko wrote:
> Fri, Jan 15, 2021 at 04:43:57PM CET, idosch@idosch.org wrote:
> >On Wed, Jan 13, 2021 at 01:12:12PM +0100, Jiri Pirko wrote:
> >> # Create a new netdevsim device, with no ports and 2 line cards:
> >> $ echo "10 0 2" >/sys/bus/netdevsim/new_device
> >> $ devlink port # No ports are listed
> >> $ devlink lc
> >> netdevsim/netdevsim10:
> >>   lc 0 state unprovisioned
> >>     supported_types:
> >>        card1port card2ports card4ports
> >>   lc 1 state unprovisioned
> >>     supported_types:
> >>        card1port card2ports card4ports
> >> 
> >> # Note that driver advertizes supported line card types. In case of
> >> # netdevsim, these are 3.
> >> 
> >> $ devlink lc provision netdevsim/netdevsim10 lc 0 type card4ports
> >
> >Why do we need a separate command for that? You actually introduced
> >'DEVLINK_CMD_LINECARD_SET' in patch #1, but it's never used.
> >
> >I prefer:
> >
> >devlink lc set netdevsim/netdevsim10 index 0 state provision type card4ports
> 
> This is misleading. This is actually not setting state. The state gets
> changed upon successful provisioning process. Also, one may think that
> he can set other states, but he can't. I don't like this at all :/

So make state a read-only attribute. You really only care about setting
the type.

To provision:

# devlink lc set netdevsim/netdevsim10 index 0 type card4ports

To unprovsion:

# devlink lc set netdevsim/netdevsim10 index 0 type none

Or:

# devlink lc set netdevsim/netdevsim10 index 0 notype

> 
> 
> >devlink lc set netdevsim/netdevsim10 index 0 state unprovision
> >
> >It is consistent with the GET/SET/NEW/DEL pattern used by other
> >commands.
> 
> Not really, see split port for example. This is similar to that.

It's not. The split command creates new objects whereas this command
modifies an existing object.

> 
> >
> >> $ devlink lc
> >> netdevsim/netdevsim10:
> >>   lc 0 state provisioned type card4ports
> >>     supported_types:
> >>        card1port card2ports card4ports
> >>   lc 1 state unprovisioned
> >>     supported_types:
> >>        card1port card2ports card4ports
> >> $ devlink port
> >> netdevsim/netdevsim10/1000: type eth netdev eni10nl0p1 flavour physical lc 0 port 1 splittable false
> >> netdevsim/netdevsim10/1001: type eth netdev eni10nl0p2 flavour physical lc 0 port 2 splittable false
> >> netdevsim/netdevsim10/1002: type eth netdev eni10nl0p3 flavour physical lc 0 port 3 splittable false
> >> netdevsim/netdevsim10/1003: type eth netdev eni10nl0p4 flavour physical lc 0 port 4 splittable false
> >> #                                                 ^^                    ^^^^
> >> #                                     netdev name adjusted          index of a line card this port belongs to
> >> 
> >> $ ip link set eni10nl0p1 up 
> >> $ ip link show eni10nl0p1   
> >> 165: eni10nl0p1: <NO-CARRIER,BROADCAST,NOARP,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
> >>     link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff
> >> 
> >> # Now activate the line card using debugfs. That emulates plug-in event
> >> # on real hardware:
> >> $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active
> >> $ ip link show eni10nl0p1
> >> 165: eni10nl0p1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
> >>     link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff
> >> # The carrier is UP now.
> >> 
> >> Jiri Pirko (10):
> >>   devlink: add support to create line card and expose to user
> >>   devlink: implement line card provisioning
> >>   devlink: implement line card active state
> >>   devlink: append split port number to the port name
> >>   devlink: add port to line card relationship set
> >>   netdevsim: introduce line card support
> >>   netdevsim: allow port objects to be linked with line cards
> >>   netdevsim: create devlink line card object and implement provisioning
> >>   netdevsim: implement line card activation
> >>   selftests: add netdevsim devlink lc test
> >> 
> >>  drivers/net/netdevsim/bus.c                   |  21 +-
> >>  drivers/net/netdevsim/dev.c                   | 370 ++++++++++++++-
> >>  drivers/net/netdevsim/netdev.c                |   2 +
> >>  drivers/net/netdevsim/netdevsim.h             |  23 +
> >>  include/net/devlink.h                         |  44 ++
> >>  include/uapi/linux/devlink.h                  |  25 +
> >>  net/core/devlink.c                            | 443 +++++++++++++++++-
> >>  .../drivers/net/netdevsim/devlink.sh          |  62 ++-
> >>  8 files changed, 964 insertions(+), 26 deletions(-)
> >> 
> >> -- 
> >> 2.26.2
> >>

Jakub Kicinski Jan. 15, 2021, 7:26 p.m. UTC | #15

On Fri, 15 Jan 2021 15:39:06 +0100 Jiri Pirko wrote:
> >I'm not a SFP experts so maybe someone will correct me but AFAIU
> >the QSFP (for optics) is the same regardless of breakout. It's the
> >passive optical strands that are either bundled or not. So there is 
> >no way for the system to detect the cable type (AFAIK).  
> 
> For SFP module, you are able to detect those.

Not sure you understand what I'm saying. Maybe you're thinking about
DACs? This is a optical cable for breakout:

https://www.fs.com/products/68048.html

There is no electronics in it to "detect" things AFAIU. Same QSFP can
be used with this cable or a non-breakout.

> >Or to put it differently IMO the netdev should be provisioned if the
> >system has a port into which user can plug in a cable. When there is   
> 
> Not really. For slit cables, the ports are provisioned not matter which
> cable is connected, slitter 1->2/1->4 or 1->1 cable.
> 
> 
> >a line card-sized hole in the chassis, I'd be surprised to see ports.
> >
> >That said I never worked with real world routers so maybe that's what
> >they do. Maybe some with a Cisco router in the basement can tell us? :)  
> 
> The need for provision/pre-configure splitter/linecard is that the
> ports/netdevices do not disapper/reappear when you replace
> splitter/linecard. Consider a faulty linecard with one port burned. You
> just want to replace it with new one. And in that case, you really don't
> want kernel to remove netdevices and possibly mess up routing for
> example.

Having a single burned port sounds like a relatively rare scenario.
Reconfiguring routing is not the end of the world.

> >If the device really needs this configuration / can't detect things
> >automatically, then we gotta do something like what you have.
> >The only question is do we still want to call it a line card.
> >Sounds more like a front panel module. At Netronome we called 
> >those phymods.  
> 
> Sure, the name is up to the discussion. We call it "linecard"
> internally. I don't care about the name.

Yeah, let's call it something more appropriate to indicate its
breakout/retimer/gearbox nature, and we'll be good :)

Jiri Pirko Jan. 18, 2021, 1 p.m. UTC | #16

Fri, Jan 15, 2021 at 08:26:17PM CET, kuba@kernel.org wrote:
>On Fri, 15 Jan 2021 15:39:06 +0100 Jiri Pirko wrote:
>> >I'm not a SFP experts so maybe someone will correct me but AFAIU
>> >the QSFP (for optics) is the same regardless of breakout. It's the
>> >passive optical strands that are either bundled or not. So there is 
>> >no way for the system to detect the cable type (AFAIK).  
>> 
>> For SFP module, you are able to detect those.
>
>Not sure you understand what I'm saying. Maybe you're thinking about
>DACs? This is a optical cable for breakout:
>
>https://www.fs.com/products/68048.html
>
>There is no electronics in it to "detect" things AFAIU. Same QSFP can
>be used with this cable or a non-breakout.

Ah, got you.


>
>> >Or to put it differently IMO the netdev should be provisioned if the
>> >system has a port into which user can plug in a cable. When there is   
>> 
>> Not really. For slit cables, the ports are provisioned not matter which
>> cable is connected, slitter 1->2/1->4 or 1->1 cable.
>> 
>> 
>> >a line card-sized hole in the chassis, I'd be surprised to see ports.
>> >
>> >That said I never worked with real world routers so maybe that's what
>> >they do. Maybe some with a Cisco router in the basement can tell us? :)  
>> 
>> The need for provision/pre-configure splitter/linecard is that the
>> ports/netdevices do not disapper/reappear when you replace
>> splitter/linecard. Consider a faulty linecard with one port burned. You
>> just want to replace it with new one. And in that case, you really don't
>> want kernel to remove netdevices and possibly mess up routing for
>> example.
>
>Having a single burned port sounds like a relatively rare scenario.

Hmm, rare in scale is common...


>Reconfiguring routing is not the end of the world.

Well, yes, but you don't really want netdevices to come and go then you
plug in/out cables/modules. That's why we have split implemented as we
do. I don't understand why do you think linecards are different.

Plus, I'm not really sure that our hw can report the type, will check.
One way or another, I think that both configuration flows have valid
usecase. Some user may want pre-configuration, some user may want auto.
Btw, it is possible to implement splitter cable in auto mode as well.


>
>> >If the device really needs this configuration / can't detect things
>> >automatically, then we gotta do something like what you have.
>> >The only question is do we still want to call it a line card.
>> >Sounds more like a front panel module. At Netronome we called 
>> >those phymods.  
>> 
>> Sure, the name is up to the discussion. We call it "linecard"
>> internally. I don't care about the name.
>
>Yeah, let's call it something more appropriate to indicate its
>breakout/retimer/gearbox nature, and we'll be good :)

Well, it can contain much more. It can contain a smartnic/fpga/whatever
for example. Not sure we can find something that fits to all cases.
I was thinking about it in the past, I think that the linecard is quite
appropriate. It connects with lines/lanes, and it does something,
either phy/gearbox, or just interconnects the lanes using smartnic/fpga
for example.

Jiri Pirko Jan. 18, 2021, 1:03 p.m. UTC | #17

Fri, Jan 15, 2021 at 07:01:45PM CET, idosch@idosch.org wrote:
>On Fri, Jan 15, 2021 at 05:55:59PM +0100, Jiri Pirko wrote:
>> Fri, Jan 15, 2021 at 04:43:57PM CET, idosch@idosch.org wrote:
>> >On Wed, Jan 13, 2021 at 01:12:12PM +0100, Jiri Pirko wrote:
>> >> # Create a new netdevsim device, with no ports and 2 line cards:
>> >> $ echo "10 0 2" >/sys/bus/netdevsim/new_device
>> >> $ devlink port # No ports are listed
>> >> $ devlink lc
>> >> netdevsim/netdevsim10:
>> >>   lc 0 state unprovisioned
>> >>     supported_types:
>> >>        card1port card2ports card4ports
>> >>   lc 1 state unprovisioned
>> >>     supported_types:
>> >>        card1port card2ports card4ports
>> >> 
>> >> # Note that driver advertizes supported line card types. In case of
>> >> # netdevsim, these are 3.
>> >> 
>> >> $ devlink lc provision netdevsim/netdevsim10 lc 0 type card4ports
>> >
>> >Why do we need a separate command for that? You actually introduced
>> >'DEVLINK_CMD_LINECARD_SET' in patch #1, but it's never used.
>> >
>> >I prefer:
>> >
>> >devlink lc set netdevsim/netdevsim10 index 0 state provision type card4ports
>> 
>> This is misleading. This is actually not setting state. The state gets
>> changed upon successful provisioning process. Also, one may think that
>> he can set other states, but he can't. I don't like this at all :/
>
>So make state a read-only attribute. You really only care about setting
>the type.
>
>To provision:
>
># devlink lc set netdevsim/netdevsim10 index 0 type card4ports
>
>To unprovsion:
>
># devlink lc set netdevsim/netdevsim10 index 0 type none
>
>Or:
>
># devlink lc set netdevsim/netdevsim10 index 0 notype

Hmm, okay, that might work. And I can add state "FAILED_PROVISION" what
would indicate that after the type was set by the user, driver was not
able to successfully provision. The the user has to set "notype" & "type
x" again. Sounds good?


>
>> 
>> 
>> >devlink lc set netdevsim/netdevsim10 index 0 state unprovision
>> >
>> >It is consistent with the GET/SET/NEW/DEL pattern used by other
>> >commands.
>> 
>> Not really, see split port for example. This is similar to that.
>
>It's not. The split command creates new objects whereas this command
>modifies an existing object.

You are right.


>
>> 
>> >
>> >> $ devlink lc
>> >> netdevsim/netdevsim10:
>> >>   lc 0 state provisioned type card4ports
>> >>     supported_types:
>> >>        card1port card2ports card4ports
>> >>   lc 1 state unprovisioned
>> >>     supported_types:
>> >>        card1port card2ports card4ports
>> >> $ devlink port
>> >> netdevsim/netdevsim10/1000: type eth netdev eni10nl0p1 flavour physical lc 0 port 1 splittable false
>> >> netdevsim/netdevsim10/1001: type eth netdev eni10nl0p2 flavour physical lc 0 port 2 splittable false
>> >> netdevsim/netdevsim10/1002: type eth netdev eni10nl0p3 flavour physical lc 0 port 3 splittable false
>> >> netdevsim/netdevsim10/1003: type eth netdev eni10nl0p4 flavour physical lc 0 port 4 splittable false
>> >> #                                                 ^^                    ^^^^
>> >> #                                     netdev name adjusted          index of a line card this port belongs to
>> >> 
>> >> $ ip link set eni10nl0p1 up 
>> >> $ ip link show eni10nl0p1   
>> >> 165: eni10nl0p1: <NO-CARRIER,BROADCAST,NOARP,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
>> >>     link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff
>> >> 
>> >> # Now activate the line card using debugfs. That emulates plug-in event
>> >> # on real hardware:
>> >> $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active
>> >> $ ip link show eni10nl0p1
>> >> 165: eni10nl0p1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
>> >>     link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff
>> >> # The carrier is UP now.
>> >> 
>> >> Jiri Pirko (10):
>> >>   devlink: add support to create line card and expose to user
>> >>   devlink: implement line card provisioning
>> >>   devlink: implement line card active state
>> >>   devlink: append split port number to the port name
>> >>   devlink: add port to line card relationship set
>> >>   netdevsim: introduce line card support
>> >>   netdevsim: allow port objects to be linked with line cards
>> >>   netdevsim: create devlink line card object and implement provisioning
>> >>   netdevsim: implement line card activation
>> >>   selftests: add netdevsim devlink lc test
>> >> 
>> >>  drivers/net/netdevsim/bus.c                   |  21 +-
>> >>  drivers/net/netdevsim/dev.c                   | 370 ++++++++++++++-
>> >>  drivers/net/netdevsim/netdev.c                |   2 +
>> >>  drivers/net/netdevsim/netdevsim.h             |  23 +
>> >>  include/net/devlink.h                         |  44 ++
>> >>  include/uapi/linux/devlink.h                  |  25 +
>> >>  net/core/devlink.c                            | 443 +++++++++++++++++-
>> >>  .../drivers/net/netdevsim/devlink.sh          |  62 ++-
>> >>  8 files changed, 964 insertions(+), 26 deletions(-)
>> >> 
>> >> -- 
>> >> 2.26.2
>> >>

Jakub Kicinski Jan. 18, 2021, 5:59 p.m. UTC | #18

On Mon, 18 Jan 2021 14:00:09 +0100 Jiri Pirko wrote:
> >> >Or to put it differently IMO the netdev should be provisioned if the
> >> >system has a port into which user can plug in a cable. When there is     
> >> 
> >> Not really. For slit cables, the ports are provisioned not matter which
> >> cable is connected, slitter 1->2/1->4 or 1->1 cable.
> >> 
> >>   
> >> >a line card-sized hole in the chassis, I'd be surprised to see ports.
> >> >
> >> >That said I never worked with real world routers so maybe that's what
> >> >they do. Maybe some with a Cisco router in the basement can tell us? :)    
> >> 
> >> The need for provision/pre-configure splitter/linecard is that the
> >> ports/netdevices do not disapper/reappear when you replace
> >> splitter/linecard. Consider a faulty linecard with one port burned. You
> >> just want to replace it with new one. And in that case, you really don't
> >> want kernel to remove netdevices and possibly mess up routing for
> >> example.  
> >
> >Having a single burned port sounds like a relatively rare scenario.  
> 
> Hmm, rare in scale is common...

Sure but at a scale of million switches it doesn't matter if a couple
are re-configuring their routing.

> >Reconfiguring routing is not the end of the world.  
> 
> Well, yes, but you don't really want netdevices to come and go then you
> plug in/out cables/modules. That's why we have split implemented as we
> do. I don't understand why do you think linecards are different.

If I have an unused port it will still show up as a netdev.
If I have an unused phymod slot w/ a slot cover in it, why would there
be a netdev? Our definition of a physical port is something like "a
socket for a networking cable on the outside of the device". With your
code I can "provision" a phymod and there is no whole to plug in a
cable. If we follow the same logic, if I have a server with PCIe
hotplug, why can't I "provision" some netdevs for a NIC that I will
plug in later?

> Plus, I'm not really sure that our hw can report the type, will check.

I think that's key.

> One way or another, I think that both configuration flows have valid
> usecase. Some user may want pre-configuration, some user may want auto.
> Btw, it is possible to implement splitter cable in auto mode as well.

Auto as in iterate over possible configs until link up? That's nasty.

> >> >If the device really needs this configuration / can't detect things
> >> >automatically, then we gotta do something like what you have.
> >> >The only question is do we still want to call it a line card.
> >> >Sounds more like a front panel module. At Netronome we called 
> >> >those phymods.    
> >> 
> >> Sure, the name is up to the discussion. We call it "linecard"
> >> internally. I don't care about the name.  
> >
> >Yeah, let's call it something more appropriate to indicate its
> >breakout/retimer/gearbox nature, and we'll be good :)  
> 
> Well, it can contain much more. It can contain a smartnic/fpga/whatever
> for example. Not sure we can find something that fits to all cases.
> I was thinking about it in the past, I think that the linecard is quite
> appropriate. It connects with lines/lanes, and it does something,
> either phy/gearbox, or just interconnects the lanes using smartnic/fpga
> for example.

If it has a FPGA / NPU in it, it's definitely auto-discoverable. 
I don't understand why you think that it's okay to "provision" NICs
which aren't there but only for this particular use case.

Edwin Peer Jan. 18, 2021, 6:01 p.m. UTC | #19

On Wed, Jan 13, 2021 at 4:14 AM Jiri Pirko <jiri@resnulli.us> wrote:

> To resolve this, a concept of "provisioning" is introduced.
> The user may "provision" certain slot with a line card type.
> Driver then creates all instances (devlink ports, netdevices, etc)
> related to this line card type. The carrier of netdevices stays down.
> Once the line card is inserted and activated, the carrier of the
> related netdevices goes up.

Do we need to start distinguishing different reasons for carrier down,
or have some kind of device not ready state instead?

I'm facing a similar issue with NIC firmware that isn't yet ready by
device open time, but have been resisting the urge to lie to the stack
about the state of the device and use link state as the next gate.
Sure, most things will just work most of the time, but the problems
with this approach are manifold. Firstly, at least in the NIC case,
the user may confuse this state for some kind of cable issue and go
looking in the wrong place for a solution. But, there are also several
ways the initialization can fail after this point and now the device
is administratively UP, but can never be UP, with no sanctioned way to
communicate the failure. Aren't the issues here similar?

Regards,
Edwin Peer

David Ahern Jan. 18, 2021, 10:55 p.m. UTC | #20

On 1/18/21 6:00 AM, Jiri Pirko wrote:
> 
>> Reconfiguring routing is not the end of the world.
> Well, yes, but you don't really want netdevices to come and go then you
> plug in/out cables/modules. That's why we have split implemented as we

And you don't want a routing daemon to use netdevices which are not
valid due to non-existence. Best case with what you want is carrier down
on the LC's netdevices and that destroys routing.

> do. I don't understand why do you think linecards are different.

I still don't get why you expect linecards to be different than any
other hotplug device.

David Ahern Jan. 18, 2021, 10:57 p.m. UTC | #21

On 1/18/21 11:01 AM, Edwin Peer wrote:
> I'm facing a similar issue with NIC firmware that isn't yet ready by
> device open time, but have been resisting the urge to lie to the stack

why not have the ndo_open return -EBUSY or -EAGAIN to tell S/W to try
again 'later'?

> about the state of the device and use link state as the next gate.
> Sure, most things will just work most of the time, but the problems
> with this approach are manifold. Firstly, at least in the NIC case,
> the user may confuse this state for some kind of cable issue and go
> looking in the wrong place for a solution. But, there are also several
> ways the initialization can fail after this point and now the device
> is administratively UP, but can never be UP, with no sanctioned way to
> communicate the failure. Aren't the issues here similar?

Edwin Peer Jan. 18, 2021, 11:40 p.m. UTC | #22

On Mon, Jan 18, 2021 at 2:57 PM David Ahern <dsahern@gmail.com> wrote:

> On 1/18/21 11:01 AM, Edwin Peer wrote:
> > I'm facing a similar issue with NIC firmware that isn't yet ready by
> > device open time, but have been resisting the urge to lie to the stack
>
> why not have the ndo_open return -EBUSY or -EAGAIN to tell S/W to try
> again 'later'?

Indeed, this is what we ended up doing, although we still need to
confirm Network Manager, systemd and whatever else our customers might
use do the necessary to satisfy the user requirement to handle the
delayed init.

Only reason I piped up is that this line card thing seems to introduce
a similar issue.

Regards,
Edwin Peer

David Ahern Jan. 19, 2021, 2:39 a.m. UTC | #23

On 1/18/21 4:40 PM, Edwin Peer wrote:
> On Mon, Jan 18, 2021 at 2:57 PM David Ahern <dsahern@gmail.com> wrote:
> 
>> On 1/18/21 11:01 AM, Edwin Peer wrote:
>>> I'm facing a similar issue with NIC firmware that isn't yet ready by
>>> device open time, but have been resisting the urge to lie to the stack
>>
>> why not have the ndo_open return -EBUSY or -EAGAIN to tell S/W to try
>> again 'later'?
> 
> Indeed, this is what we ended up doing, although we still need to
> confirm Network Manager, systemd and whatever else our customers might
> use do the necessary to satisfy the user requirement to handle the
> delayed init.

I am not surprised about the issue - boot times have been improved and
devices have gotten more complicated. And I was wondering how network
managers (add ifupdown{2} to that list) would handle an EAGAIN. You
could have an event sent -- e.g., IFLA_EVENT_FW_READY -- to allow
managers to avoid polling. Redundant for multiple netdev's per device,
but makes it event driven.

> 
> Only reason I piped up is that this line card thing seems to introduce
> a similar issue.

Seems reasonable.

Edwin Peer Jan. 19, 2021, 5:06 a.m. UTC | #24

On Mon, Jan 18, 2021 at 6:39 PM David Ahern <dsahern@gmail.com> wrote:

> > Indeed, this is what we ended up doing, although we still need to
> > confirm Network Manager, systemd and whatever else our customers might
> > use do the necessary to satisfy the user requirement to handle the
> > delayed init.
>
> I am not surprised about the issue - boot times have been improved and
> devices have gotten more complicated. And I was wondering how network
> managers (add ifupdown{2} to that list) would handle an EAGAIN. You
> could have an event sent -- e.g., IFLA_EVENT_FW_READY -- to allow
> managers to avoid polling. Redundant for multiple netdev's per device,
> but makes it event driven.

This is what I was hinting at by talking about another device state.
For that, there would necessarily need to be an event to inform user
space about the transition out of said state into normal open/up. Of
course, the tools would need to be updated to know about such a new
event too. EAGAIN does seem simpler. In our case, we don't expect to
be polling too long, or frequently for that matter. It is
exceptionally rare that the firmware wouldn't be ready, but it can
happen.

Regards,
Edwin Peer

Jiri Pirko Jan. 19, 2021, 11:51 a.m. UTC | #25

Mon, Jan 18, 2021 at 06:59:28PM CET, kuba@kernel.org wrote:
>On Mon, 18 Jan 2021 14:00:09 +0100 Jiri Pirko wrote:
>> >> >Or to put it differently IMO the netdev should be provisioned if the
>> >> >system has a port into which user can plug in a cable. When there is     
>> >> 
>> >> Not really. For slit cables, the ports are provisioned not matter which
>> >> cable is connected, slitter 1->2/1->4 or 1->1 cable.
>> >> 
>> >>   
>> >> >a line card-sized hole in the chassis, I'd be surprised to see ports.
>> >> >
>> >> >That said I never worked with real world routers so maybe that's what
>> >> >they do. Maybe some with a Cisco router in the basement can tell us? :)    
>> >> 
>> >> The need for provision/pre-configure splitter/linecard is that the
>> >> ports/netdevices do not disapper/reappear when you replace
>> >> splitter/linecard. Consider a faulty linecard with one port burned. You
>> >> just want to replace it with new one. And in that case, you really don't
>> >> want kernel to remove netdevices and possibly mess up routing for
>> >> example.  
>> >
>> >Having a single burned port sounds like a relatively rare scenario.  
>> 
>> Hmm, rare in scale is common...
>
>Sure but at a scale of million switches it doesn't matter if a couple
>are re-configuring their routing.
>
>> >Reconfiguring routing is not the end of the world.  
>> 
>> Well, yes, but you don't really want netdevices to come and go then you
>> plug in/out cables/modules. That's why we have split implemented as we
>> do. I don't understand why do you think linecards are different.
>
>If I have an unused port it will still show up as a netdev.
>If I have an unused phymod slot w/ a slot cover in it, why would there
>be a netdev? Our definition of a physical port is something like "a
>socket for a networking cable on the outside of the device". With your
>code I can "provision" a phymod and there is no whole to plug in a
>cable. If we follow the same logic, if I have a server with PCIe
>hotplug, why can't I "provision" some netdevs for a NIC that I will
>plug in later?
>
>> Plus, I'm not really sure that our hw can report the type, will check.
>
>I think that's key.

So, it can't. The driver is only aware of "activation" of the linecard
being successful or not.


>
>> One way or another, I think that both configuration flows have valid
>> usecase. Some user may want pre-configuration, some user may want auto.
>> Btw, it is possible to implement splitter cable in auto mode as well.
>
>Auto as in iterate over possible configs until link up? That's nasty.
>
>> >> >If the device really needs this configuration / can't detect things
>> >> >automatically, then we gotta do something like what you have.
>> >> >The only question is do we still want to call it a line card.
>> >> >Sounds more like a front panel module. At Netronome we called 
>> >> >those phymods.    
>> >> 
>> >> Sure, the name is up to the discussion. We call it "linecard"
>> >> internally. I don't care about the name.  
>> >
>> >Yeah, let's call it something more appropriate to indicate its
>> >breakout/retimer/gearbox nature, and we'll be good :)  
>> 
>> Well, it can contain much more. It can contain a smartnic/fpga/whatever
>> for example. Not sure we can find something that fits to all cases.
>> I was thinking about it in the past, I think that the linecard is quite
>> appropriate. It connects with lines/lanes, and it does something,
>> either phy/gearbox, or just interconnects the lanes using smartnic/fpga
>> for example.
>
>If it has a FPGA / NPU in it, it's definitely auto-discoverable. 
>I don't understand why you think that it's okay to "provision" NICs
>which aren't there but only for this particular use case.

Jiri Pirko Jan. 19, 2021, 11:56 a.m. UTC | #26

Thu, Jan 14, 2021 at 03:07:18AM CET, andrew@lunn.ch wrote:
>> $ devlink lc provision netdevsim/netdevsim10 lc 0 type card4ports
>> $ devlink lc
>> netdevsim/netdevsim10:
>>   lc 0 state provisioned type card4ports
>>     supported_types:
>>        card1port card2ports card4ports
>>   lc 1 state unprovisioned
>>     supported_types:
>>        card1port card2ports card4ports
>
>Hi Jiri
>
>> # Now activate the line card using debugfs. That emulates plug-in event
>> # on real hardware:
>> $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active
>> $ ip link show eni10nl0p1
>> 165: eni10nl0p1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
>>     link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff
>> # The carrier is UP now.
>
>What is missing from the devlink lc view is what line card is actually
>in the slot. Say if i provision for a card4port, but actually insert a
>card2port. It would be nice to have something like:

I checked, our hw does not support that. Only provides info that
linecard activation was/wasn't successful.


>
> $ devlink lc
> netdevsim/netdevsim10:
>   lc 0 state provisioned type card4ports
>     supported_types:
>        card1port card2ports card4ports
>     inserted_type:
>        card2ports;
>   lc 1 state unprovisioned
>     supported_types:
>        card1port card2ports card4ports
>     inserted_type:
>        None
>
>I assume if i prevision for card4ports but actually install a
>card2ports, all the interfaces stay down?
>
>Maybe
>
>> $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active
>
>should actually be
>    echo "card2ports" > /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active
>
>so you can emulate somebody putting the wrong card in the slot?
>
>    Andrew

Andrew Lunn Jan. 19, 2021, 2:51 p.m. UTC | #27

On Tue, Jan 19, 2021 at 12:56:10PM +0100, Jiri Pirko wrote:
> Thu, Jan 14, 2021 at 03:07:18AM CET, andrew@lunn.ch wrote:
> >> $ devlink lc provision netdevsim/netdevsim10 lc 0 type card4ports
> >> $ devlink lc
> >> netdevsim/netdevsim10:
> >>   lc 0 state provisioned type card4ports
> >>     supported_types:
> >>        card1port card2ports card4ports
> >>   lc 1 state unprovisioned
> >>     supported_types:
> >>        card1port card2ports card4ports
> >
> >Hi Jiri
> >
> >> # Now activate the line card using debugfs. That emulates plug-in event
> >> # on real hardware:
> >> $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active
> >> $ ip link show eni10nl0p1
> >> 165: eni10nl0p1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
> >>     link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff
> >> # The carrier is UP now.
> >
> >What is missing from the devlink lc view is what line card is actually
> >in the slot. Say if i provision for a card4port, but actually insert a
> >card2port. It would be nice to have something like:
> 
> I checked, our hw does not support that. Only provides info that
> linecard activation was/wasn't successful.

Hi Jiri

Is this a firmware limitation? There is no API to extract the
information from the firmware to the host? The firmware itself knows
there is a mismatch and refuses to configure the line card, and
prevents the MAC going up?

Even if you cannot do this now, it seems likely in future firmware
versions you will be able to, so maybe at least define the netlink
attributes now? As well as attributes indicating activation was
successful.

	Andrew

David Ahern Jan. 19, 2021, 4:23 p.m. UTC | #28

On 1/19/21 4:56 AM, Jiri Pirko wrote:
> Thu, Jan 14, 2021 at 03:07:18AM CET, andrew@lunn.ch wrote:
>>> $ devlink lc provision netdevsim/netdevsim10 lc 0 type card4ports
>>> $ devlink lc
>>> netdevsim/netdevsim10:
>>>   lc 0 state provisioned type card4ports
>>>     supported_types:
>>>        card1port card2ports card4ports
>>>   lc 1 state unprovisioned
>>>     supported_types:
>>>        card1port card2ports card4ports
>>
>> Hi Jiri
>>
>>> # Now activate the line card using debugfs. That emulates plug-in event
>>> # on real hardware:
>>> $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active
>>> $ ip link show eni10nl0p1
>>> 165: eni10nl0p1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
>>>     link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff
>>> # The carrier is UP now.
>>
>> What is missing from the devlink lc view is what line card is actually
>> in the slot. Say if i provision for a card4port, but actually insert a
>> card2port. It would be nice to have something like:
> 
> I checked, our hw does not support that. Only provides info that
> linecard activation was/wasn't successful.
> 

There is no way for the supervisor / management card to probe and see
what card is actually inserted in a given slot? That seems like a
serious design deficiency. What about some agent running on the line
card talking to an agent on the supervisor to provide that information?

Jiri Pirko Jan. 20, 2021, 8:36 a.m. UTC | #29

Tue, Jan 19, 2021 at 03:51:49PM CET, andrew@lunn.ch wrote:
>On Tue, Jan 19, 2021 at 12:56:10PM +0100, Jiri Pirko wrote:
>> Thu, Jan 14, 2021 at 03:07:18AM CET, andrew@lunn.ch wrote:
>> >> $ devlink lc provision netdevsim/netdevsim10 lc 0 type card4ports
>> >> $ devlink lc
>> >> netdevsim/netdevsim10:
>> >>   lc 0 state provisioned type card4ports
>> >>     supported_types:
>> >>        card1port card2ports card4ports
>> >>   lc 1 state unprovisioned
>> >>     supported_types:
>> >>        card1port card2ports card4ports
>> >
>> >Hi Jiri
>> >
>> >> # Now activate the line card using debugfs. That emulates plug-in event
>> >> # on real hardware:
>> >> $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active
>> >> $ ip link show eni10nl0p1
>> >> 165: eni10nl0p1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
>> >>     link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff
>> >> # The carrier is UP now.
>> >
>> >What is missing from the devlink lc view is what line card is actually
>> >in the slot. Say if i provision for a card4port, but actually insert a
>> >card2port. It would be nice to have something like:
>> 
>> I checked, our hw does not support that. Only provides info that
>> linecard activation was/wasn't successful.
>
>Hi Jiri
>
>Is this a firmware limitation? There is no API to extract the
>information from the firmware to the host? The firmware itself knows
>there is a mismatch and refuses to configure the line card, and
>prevents the MAC going up?

No, the FW does not know. The ASIC is not physically able to get the
linecard type. Yes, it is odd, I agree. The linecard type is known to
the driver which operates on i2c. This driver takes care of power
management of the linecard, among other tasks.


>
>Even if you cannot do this now, it seems likely in future firmware
>versions you will be able to, so maybe at least define the netlink

Sure, for netdevsim that is not problem. Our current hw does not support
it, the future may.


>attributes now? As well as attributes indicating activation was
>successful.

State "ACTIVATED" is that indication. It is in this RFC.


>
>	Andrew

Jiri Pirko Jan. 20, 2021, 8:37 a.m. UTC | #30

Tue, Jan 19, 2021 at 05:23:19PM CET, dsahern@gmail.com wrote:
>On 1/19/21 4:56 AM, Jiri Pirko wrote:
>> Thu, Jan 14, 2021 at 03:07:18AM CET, andrew@lunn.ch wrote:
>>>> $ devlink lc provision netdevsim/netdevsim10 lc 0 type card4ports
>>>> $ devlink lc
>>>> netdevsim/netdevsim10:
>>>>   lc 0 state provisioned type card4ports
>>>>     supported_types:
>>>>        card1port card2ports card4ports
>>>>   lc 1 state unprovisioned
>>>>     supported_types:
>>>>        card1port card2ports card4ports
>>>
>>> Hi Jiri
>>>
>>>> # Now activate the line card using debugfs. That emulates plug-in event
>>>> # on real hardware:
>>>> $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active
>>>> $ ip link show eni10nl0p1
>>>> 165: eni10nl0p1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
>>>>     link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff
>>>> # The carrier is UP now.
>>>
>>> What is missing from the devlink lc view is what line card is actually
>>> in the slot. Say if i provision for a card4port, but actually insert a
>>> card2port. It would be nice to have something like:
>> 
>> I checked, our hw does not support that. Only provides info that
>> linecard activation was/wasn't successful.
>> 
>
>There is no way for the supervisor / management card to probe and see
>what card is actually inserted in a given slot? That seems like a
>serious design deficiency. What about some agent running on the line
>card talking to an agent on the supervisor to provide that information?

The ASIC does not have this info. The linecard type is exposed over i2c
interface, different driver sits on top of it.
I agree it is odd, but that is how it is for our hw, unfortunatelly.

Andrew Lunn Jan. 20, 2021, 1:56 p.m. UTC | #31

> No, the FW does not know. The ASIC is not physically able to get the
> linecard type. Yes, it is odd, I agree. The linecard type is known to
> the driver which operates on i2c. This driver takes care of power
> management of the linecard, among other tasks.

So what does activated actually mean for your hardware? It seems to
mean something like: Some random card has been plugged in, we have no
idea what, but it has power, and we have enabled the MACs as
provisioned, which if you are lucky might match the hardware?

The foundations of this feature seems dubious.

    Andrew

Jakub Kicinski Jan. 20, 2021, 11:41 p.m. UTC | #32

On Wed, 20 Jan 2021 14:56:46 +0100 Andrew Lunn wrote:
> > No, the FW does not know. The ASIC is not physically able to get the
> > linecard type. Yes, it is odd, I agree. The linecard type is known to
> > the driver which operates on i2c. This driver takes care of power
> > management of the linecard, among other tasks.  
> 
> So what does activated actually mean for your hardware? It seems to
> mean something like: Some random card has been plugged in, we have no
> idea what, but it has power, and we have enabled the MACs as
> provisioned, which if you are lucky might match the hardware?
> 
> The foundations of this feature seems dubious.

But Jiri also says "The linecard type is known to the driver which
operates on i2c." which sounds like there is some i2c driver (in user
space?) which talks to the card and _does_ have the info? Maybe I'm
misreading it. What's the i2c driver?

Andrew Lunn Jan. 21, 2021, 12:01 a.m. UTC | #33

On Wed, Jan 20, 2021 at 03:41:58PM -0800, Jakub Kicinski wrote:
> On Wed, 20 Jan 2021 14:56:46 +0100 Andrew Lunn wrote:
> > > No, the FW does not know. The ASIC is not physically able to get the
> > > linecard type. Yes, it is odd, I agree. The linecard type is known to
> > > the driver which operates on i2c. This driver takes care of power
> > > management of the linecard, among other tasks.  
> > 
> > So what does activated actually mean for your hardware? It seems to
> > mean something like: Some random card has been plugged in, we have no
> > idea what, but it has power, and we have enabled the MACs as
> > provisioned, which if you are lucky might match the hardware?
> > 
> > The foundations of this feature seems dubious.
> 
> But Jiri also says "The linecard type is known to the driver which
> operates on i2c." which sounds like there is some i2c driver (in user
> space?) which talks to the card and _does_ have the info? Maybe I'm
> misreading it. What's the i2c driver?

Hi Jakub

A complete guess, but i think it will be the BMC, not the ASIC. There
have been patches from Mellanox in the past for a BMC, i think sent to
arm-soc, for the ASPEED devices often used as BMCs. And the BMC is
often the device doing power management. So what might be missing is
an interface between the driver and the BMC. But that then makes the
driver system specific. A OEM who buys ASICs and makes their own board
could have their own BMC running there own BMC firmware.

All speculation...

      Andrew

Jakub Kicinski Jan. 21, 2021, 12:16 a.m. UTC | #34

On Thu, 21 Jan 2021 01:01:21 +0100 Andrew Lunn wrote:
> On Wed, Jan 20, 2021 at 03:41:58PM -0800, Jakub Kicinski wrote:
> > On Wed, 20 Jan 2021 14:56:46 +0100 Andrew Lunn wrote:  
> > > > No, the FW does not know. The ASIC is not physically able to get the
> > > > linecard type. Yes, it is odd, I agree. The linecard type is known to
> > > > the driver which operates on i2c. This driver takes care of power
> > > > management of the linecard, among other tasks.    
> > > 
> > > So what does activated actually mean for your hardware? It seems to
> > > mean something like: Some random card has been plugged in, we have no
> > > idea what, but it has power, and we have enabled the MACs as
> > > provisioned, which if you are lucky might match the hardware?
> > > 
> > > The foundations of this feature seems dubious.  
> > 
> > But Jiri also says "The linecard type is known to the driver which
> > operates on i2c." which sounds like there is some i2c driver (in user
> > space?) which talks to the card and _does_ have the info? Maybe I'm
> > misreading it. What's the i2c driver?  
> 
> Hi Jakub
> 
> A complete guess, but i think it will be the BMC, not the ASIC. There
> have been patches from Mellanox in the past for a BMC, i think sent to
> arm-soc, for the ASPEED devices often used as BMCs. And the BMC is
> often the device doing power management. So what might be missing is
> an interface between the driver and the BMC. But that then makes the
> driver system specific. A OEM who buys ASICs and makes their own board
> could have their own BMC running there own BMC firmware.
> 
> All speculation...

I see that does make sense

Jiri Pirko Jan. 21, 2021, 3:32 p.m. UTC | #35

Thu, Jan 21, 2021 at 12:41:58AM CET, kuba@kernel.org wrote:
>On Wed, 20 Jan 2021 14:56:46 +0100 Andrew Lunn wrote:
>> > No, the FW does not know. The ASIC is not physically able to get the
>> > linecard type. Yes, it is odd, I agree. The linecard type is known to
>> > the driver which operates on i2c. This driver takes care of power
>> > management of the linecard, among other tasks.  
>> 
>> So what does activated actually mean for your hardware? It seems to
>> mean something like: Some random card has been plugged in, we have no
>> idea what, but it has power, and we have enabled the MACs as
>> provisioned, which if you are lucky might match the hardware?
>> 
>> The foundations of this feature seems dubious.
>
>But Jiri also says "The linecard type is known to the driver which
>operates on i2c." which sounds like there is some i2c driver (in user
>space?) which talks to the card and _does_ have the info? Maybe I'm
>misreading it. What's the i2c driver?

That is Vadim's i2c kernel driver, this is going to upstream.

Jiri Pirko Jan. 21, 2021, 3:34 p.m. UTC | #36

Thu, Jan 21, 2021 at 01:01:21AM CET, andrew@lunn.ch wrote:
>On Wed, Jan 20, 2021 at 03:41:58PM -0800, Jakub Kicinski wrote:
>> On Wed, 20 Jan 2021 14:56:46 +0100 Andrew Lunn wrote:
>> > > No, the FW does not know. The ASIC is not physically able to get the
>> > > linecard type. Yes, it is odd, I agree. The linecard type is known to
>> > > the driver which operates on i2c. This driver takes care of power
>> > > management of the linecard, among other tasks.  
>> > 
>> > So what does activated actually mean for your hardware? It seems to
>> > mean something like: Some random card has been plugged in, we have no
>> > idea what, but it has power, and we have enabled the MACs as
>> > provisioned, which if you are lucky might match the hardware?
>> > 
>> > The foundations of this feature seems dubious.
>> 
>> But Jiri also says "The linecard type is known to the driver which
>> operates on i2c." which sounds like there is some i2c driver (in user
>> space?) which talks to the card and _does_ have the info? Maybe I'm
>> misreading it. What's the i2c driver?
>
>Hi Jakub
>
>A complete guess, but i think it will be the BMC, not the ASIC. There
>have been patches from Mellanox in the past for a BMC, i think sent to
>arm-soc, for the ASPEED devices often used as BMCs. And the BMC is
>often the device doing power management. So what might be missing is
>an interface between the driver and the BMC. But that then makes the
>driver system specific. A OEM who buys ASICs and makes their own board
>could have their own BMC running there own BMC firmware.
>
>All speculation...

Basically all correct.

The thing is mlxsw and the i2c driver cannot talk to each other:
1) It would be ugly
2) They may likely be on a different host


>
>      Andrew

David Ahern Jan. 21, 2021, 4:38 p.m. UTC | #37

On 1/21/21 8:32 AM, Jiri Pirko wrote:
> Thu, Jan 21, 2021 at 12:41:58AM CET, kuba@kernel.org wrote:
>> On Wed, 20 Jan 2021 14:56:46 +0100 Andrew Lunn wrote:
>>>> No, the FW does not know. The ASIC is not physically able to get the
>>>> linecard type. Yes, it is odd, I agree. The linecard type is known to
>>>> the driver which operates on i2c. This driver takes care of power
>>>> management of the linecard, among other tasks.  
>>>
>>> So what does activated actually mean for your hardware? It seems to
>>> mean something like: Some random card has been plugged in, we have no
>>> idea what, but it has power, and we have enabled the MACs as
>>> provisioned, which if you are lucky might match the hardware?
>>>
>>> The foundations of this feature seems dubious.
>>
>> But Jiri also says "The linecard type is known to the driver which
>> operates on i2c." which sounds like there is some i2c driver (in user
>> space?) which talks to the card and _does_ have the info? Maybe I'm
>> misreading it. What's the i2c driver?
> 
> That is Vadim's i2c kernel driver, this is going to upstream.
> 

This pre-provisioning concept makes a fragile design to work around h/w
shortcomings. You really need a way for the management card to know
exactly what was plugged in to a slot so the control plane S/W can
respond accordingly. Surely there is a way for processes on the LC to
communicate with a process on the management card - even if it is inband
packets with special headers.

Jiri Pirko Jan. 22, 2021, 7:28 a.m. UTC | #38

Thu, Jan 21, 2021 at 05:38:40PM CET, dsahern@gmail.com wrote:
>On 1/21/21 8:32 AM, Jiri Pirko wrote:
>> Thu, Jan 21, 2021 at 12:41:58AM CET, kuba@kernel.org wrote:
>>> On Wed, 20 Jan 2021 14:56:46 +0100 Andrew Lunn wrote:
>>>>> No, the FW does not know. The ASIC is not physically able to get the
>>>>> linecard type. Yes, it is odd, I agree. The linecard type is known to
>>>>> the driver which operates on i2c. This driver takes care of power
>>>>> management of the linecard, among other tasks.  
>>>>
>>>> So what does activated actually mean for your hardware? It seems to
>>>> mean something like: Some random card has been plugged in, we have no
>>>> idea what, but it has power, and we have enabled the MACs as
>>>> provisioned, which if you are lucky might match the hardware?
>>>>
>>>> The foundations of this feature seems dubious.
>>>
>>> But Jiri also says "The linecard type is known to the driver which
>>> operates on i2c." which sounds like there is some i2c driver (in user
>>> space?) which talks to the card and _does_ have the info? Maybe I'm
>>> misreading it. What's the i2c driver?
>> 
>> That is Vadim's i2c kernel driver, this is going to upstream.
>> 
>
>This pre-provisioning concept makes a fragile design to work around h/w
>shortcomings. You really need a way for the management card to know
>exactly what was plugged in to a slot so the control plane S/W can
>respond accordingly. Surely there is a way for processes on the LC to
>communicate with a process on the management card - even if it is inband
>packets with special headers.

I don't see any way. The userspace is the one who can get the info, from
the i2c driver. The mlxsw driver has no means to get that info itself.

Jiri Pirko Jan. 22, 2021, 8:01 a.m. UTC | #39

Mon, Jan 18, 2021 at 11:55:45PM CET, dsahern@gmail.com wrote:
>On 1/18/21 6:00 AM, Jiri Pirko wrote:
>> 
>>> Reconfiguring routing is not the end of the world.
>> Well, yes, but you don't really want netdevices to come and go then you
>> plug in/out cables/modules. That's why we have split implemented as we
>
>And you don't want a routing daemon to use netdevices which are not
>valid due to non-existence. Best case with what you want is carrier down
>on the LC's netdevices and that destroys routing.

There are other things. The user may configure the netdev parameters in
advance, like mtu, put it in a bridge, setup TC filters on it etc.
The linecard unplug/plug does not destroy the settings. This is the same
thing with split ports and that is why we have implemented split ports
in "provision" mode as well.

>
>> do. I don't understand why do you think linecards are different.
>
>I still don't get why you expect linecards to be different than any
>other hotplug device.

It it not a device, does not have "struct device" related to it.
It is just a phy part of another device.

Jiri Pirko Jan. 22, 2021, 8:05 a.m. UTC | #40

Thu, Jan 21, 2021 at 05:38:40PM CET, dsahern@gmail.com wrote:
>On 1/21/21 8:32 AM, Jiri Pirko wrote:
>> Thu, Jan 21, 2021 at 12:41:58AM CET, kuba@kernel.org wrote:
>>> On Wed, 20 Jan 2021 14:56:46 +0100 Andrew Lunn wrote:
>>>>> No, the FW does not know. The ASIC is not physically able to get the
>>>>> linecard type. Yes, it is odd, I agree. The linecard type is known to
>>>>> the driver which operates on i2c. This driver takes care of power
>>>>> management of the linecard, among other tasks.  
>>>>
>>>> So what does activated actually mean for your hardware? It seems to
>>>> mean something like: Some random card has been plugged in, we have no
>>>> idea what, but it has power, and we have enabled the MACs as
>>>> provisioned, which if you are lucky might match the hardware?
>>>>
>>>> The foundations of this feature seems dubious.
>>>
>>> But Jiri also says "The linecard type is known to the driver which
>>> operates on i2c." which sounds like there is some i2c driver (in user
>>> space?) which talks to the card and _does_ have the info? Maybe I'm
>>> misreading it. What's the i2c driver?
>> 
>> That is Vadim's i2c kernel driver, this is going to upstream.
>> 
>
>This pre-provisioning concept makes a fragile design to work around h/w
>shortcomings. You really need a way for the management card to know

Not really. As I replied to you in the other part of this thread, the
linecard is basically very similar to a splitter cable. In a way, it is
a splitter cable. And should be threated in a similar way. As a phy. Not
as a device. Cables are replaceble without netdevice reappearing. This
linecards are the same. Therefore, the concept of provisioning makes
sense for them, as it does for splitter cable.


>exactly what was plugged in to a slot so the control plane S/W can
>respond accordingly. Surely there is a way for processes on the LC to
>communicate with a process on the management card - even if it is inband
>packets with special headers.

If a device is capable of splitter cable/linecard hotplug, sure, that
may be implemented. But the user has to configure it as such, to be
aware that "cable change" may move netdevices around.

Andrew Lunn Jan. 22, 2021, 2:13 p.m. UTC | #41

> I don't see any way. The userspace is the one who can get the info, from
> the i2c driver. The mlxsw driver has no means to get that info itself.

Hi Jiri

Please can you tell us more about this i2c driver. Do you have any
architecture pictures?

It is not unknown for one driver to embed another driver inside it. So
the i2c driver could be inside the mlxsw. It is also possible to link
drivers together, the mlxsw could go find the i2c driver and make use
of its services.

   Andrew

Jiri Pirko Jan. 26, 2021, 11:33 a.m. UTC | #42

Fri, Jan 22, 2021 at 03:13:12PM CET, andrew@lunn.ch wrote:
>> I don't see any way. The userspace is the one who can get the info, from
>> the i2c driver. The mlxsw driver has no means to get that info itself.
>
>Hi Jiri
>
>Please can you tell us more about this i2c driver. Do you have any
>architecture pictures?

Quoting Vadim Pasternak:
"
Not upstreamed yet.
It will be mlxreg-lc driver for line card in drivers/platfrom/mellanox and
additional mlxreg-pm for line card powering on/off, setting enable/disable
and handling power off upon thermal shutdown event.
"

>
>It is not unknown for one driver to embed another driver inside it. So
>the i2c driver could be inside the mlxsw. It is also possible to link
>drivers together, the mlxsw could go find the i2c driver and make use
>of its services.

Okay. Do you have examples? How could the kernel figure out the relation
of the instances?

>
>   Andrew

Andrew Lunn Jan. 26, 2021, 1:56 p.m. UTC | #43

On Tue, Jan 26, 2021 at 12:33:26PM +0100, Jiri Pirko wrote:
> Fri, Jan 22, 2021 at 03:13:12PM CET, andrew@lunn.ch wrote:
> >> I don't see any way. The userspace is the one who can get the info, from
> >> the i2c driver. The mlxsw driver has no means to get that info itself.
> >
> >Hi Jiri
> >
> >Please can you tell us more about this i2c driver. Do you have any
> >architecture pictures?
> 
> Quoting Vadim Pasternak:
> "
> Not upstreamed yet.
> It will be mlxreg-lc driver for line card in drivers/platfrom/mellanox and
> additional mlxreg-pm for line card powering on/off, setting enable/disable
> and handling power off upon thermal shutdown event.
> "
> 
> 
> >
> >It is not unknown for one driver to embed another driver inside it. So
> >the i2c driver could be inside the mlxsw. It is also possible to link
> >drivers together, the mlxsw could go find the i2c driver and make use
> >of its services.
> 
> Okay. Do you have examples? How could the kernel figure out the relation
> of the instances?

Hi Jiri

One driver, embedded into another? You actually submitted an example:

commit 6882b0aee180f2797b8803bdf699aa45c2e5f2d6
Author: Vadim Pasternak <vadimp@mellanox.com>
Date:   Wed Nov 16 15:20:44 2016 +0100

    mlxsw: Introduce support for I2C bus

    Add I2C bus implementation for Mellanox Technologies Switch ASICs.
    This includes command interface implementation using input / out mailboxes,
    whose location is retrieved from the firmware during probe time.

    Signed-off-by: Vadim Pasternak <vadimp@mellanox.com>
    Reviewed-by: Ido Schimmel <idosch@mellanox.com>
    Signed-off-by: Jiri Pirko <jiri@mellanox.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

There are Linux standard APIs for controlling the power to devices,
the regulator API. So i assume mlxreg-pm will make use of that. There
are also standard APIs for thermal management, which again, mlxreg-pm
should be using. The regulator API allows you to find regulators by
name. So just define a sensible naming convention, and the switch
driver can lookup the regulator, and turn it on/off as needed.

I'm guessing there are no standard Linux API which mlxreg-lc fits. I'm
also not sure it offers anything useful standalone. So i would
actually embed it inside the switchdev driver, and have internal APIs
to get information about the line card.

But i'm missing big picture architecture knowledge here, there could
be reasons why these suggestions don't work.

   Andrew

Jiri Pirko Jan. 27, 2021, 7:57 a.m. UTC | #44

Tue, Jan 26, 2021 at 02:56:08PM CET, andrew@lunn.ch wrote:
>On Tue, Jan 26, 2021 at 12:33:26PM +0100, Jiri Pirko wrote:
>> Fri, Jan 22, 2021 at 03:13:12PM CET, andrew@lunn.ch wrote:
>> >> I don't see any way. The userspace is the one who can get the info, from
>> >> the i2c driver. The mlxsw driver has no means to get that info itself.
>> >
>> >Hi Jiri
>> >
>> >Please can you tell us more about this i2c driver. Do you have any
>> >architecture pictures?
>> 
>> Quoting Vadim Pasternak:
>> "
>> Not upstreamed yet.
>> It will be mlxreg-lc driver for line card in drivers/platfrom/mellanox and
>> additional mlxreg-pm for line card powering on/off, setting enable/disable
>> and handling power off upon thermal shutdown event.
>> "
>> 
>> 
>> >
>> >It is not unknown for one driver to embed another driver inside it. So
>> >the i2c driver could be inside the mlxsw. It is also possible to link
>> >drivers together, the mlxsw could go find the i2c driver and make use
>> >of its services.
>> 
>> Okay. Do you have examples? How could the kernel figure out the relation
>> of the instances?
>
>Hi Jiri
>
>One driver, embedded into another? You actually submitted an example:
>
>commit 6882b0aee180f2797b8803bdf699aa45c2e5f2d6
>Author: Vadim Pasternak <vadimp@mellanox.com>
>Date:   Wed Nov 16 15:20:44 2016 +0100
>
>    mlxsw: Introduce support for I2C bus
>    
>    Add I2C bus implementation for Mellanox Technologies Switch ASICs.
>    This includes command interface implementation using input / out mailboxes,
>    whose location is retrieved from the firmware during probe time.
>    
>    Signed-off-by: Vadim Pasternak <vadimp@mellanox.com>
>    Reviewed-by: Ido Schimmel <idosch@mellanox.com>
>    Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>    Signed-off-by: David S. Miller <davem@davemloft.net>
>
>There are Linux standard APIs for controlling the power to devices,
>the regulator API. So i assume mlxreg-pm will make use of that. There
>are also standard APIs for thermal management, which again, mlxreg-pm
>should be using. The regulator API allows you to find regulators by
>name. So just define a sensible naming convention, and the switch
>driver can lookup the regulator, and turn it on/off as needed.


I don't think it would apply. The thing is, i2c driver has a channel to
the linecard eeprom, from where it can read info about the linecard. The
i2c driver also knows when the linecard is plugged in, unlike mlxsw.
It acts as a standalone driver. Mlxsw has no way to directly find if the
card was plugged in (unpowered) and which type it is.

Not sure how to "embed" it. I don't think any existing API could help.
Basicall mlxsw would have to register a callback to the i2c driver
called every time card is inserted to do auto-provision.
Now consider a case when there are multiple instances of the ASIC on the
system. How to assemble a relationship between mlxsw instance and i2c
driver instance?

But again, auto-provision is only one usecase. Manual provisioning is
needed anyway. And that is exactly what my patchset is aiming to
introduce. Auto-provision can be added when/if needed later on.


>
>I'm guessing there are no standard Linux API which mlxreg-lc fits. I'm
>also not sure it offers anything useful standalone. So i would
>actually embed it inside the switchdev driver, and have internal APIs
>to get information about the line card.
>
>But i'm missing big picture architecture knowledge here, there could
>be reasons why these suggestions don't work.
>
>   Andrew

Andrew Lunn Jan. 27, 2021, 2:14 p.m. UTC | #45

> >There are Linux standard APIs for controlling the power to devices,
> >the regulator API. So i assume mlxreg-pm will make use of that. There
> >are also standard APIs for thermal management, which again, mlxreg-pm
> >should be using. The regulator API allows you to find regulators by
> >name. So just define a sensible naming convention, and the switch
> >driver can lookup the regulator, and turn it on/off as needed.
> 
> 
> I don't think it would apply. The thing is, i2c driver has a channel to
> the linecard eeprom, from where it can read info about the linecard. The
> i2c driver also knows when the linecard is plugged in, unlike mlxsw.
> It acts as a standalone driver. Mlxsw has no way to directly find if the
> card was plugged in (unpowered) and which type it is.
> 
> Not sure how to "embed" it. I don't think any existing API could help.
> Basicall mlxsw would have to register a callback to the i2c driver
> called every time card is inserted to do auto-provision.
> Now consider a case when there are multiple instances of the ASIC on the
> system. How to assemble a relationship between mlxsw instance and i2c
> driver instance?

You have that knowledge already, otherwise you cannot solve this
problem at all. The switch is an PCIe device right? So when the bus is
enumerated, the driver loads. How do you bind the i2c driver to the
i2c bus? You cannot enumerate i2c, so you must have some hard coded
knowledge somewhere? You just need to get that knowledge into the
mlxsw driver so it can bind its internal i2c client driver to the i2c
bus. That way you avoid user space, i guess maybe udev rules, or some
daemon monitoring propriety /sys files?

> But again, auto-provision is only one usecase. Manual provisioning is
> needed anyway. And that is exactly what my patchset is aiming to
> introduce. Auto-provision can be added when/if needed later on.

I still don't actually get this use case. Why would i want to manually
provision?

	Andrew

David Ahern Jan. 27, 2021, 2:57 p.m. UTC | #46

On 1/27/21 7:14 AM, Andrew Lunn wrote:
>> I don't think it would apply. The thing is, i2c driver has a channel to
>> the linecard eeprom, from where it can read info about the linecard. The
>> i2c driver also knows when the linecard is plugged in, unlike mlxsw.
>> It acts as a standalone driver. Mlxsw has no way to directly find if the
>> card was plugged in (unpowered) and which type it is.
>>
>> Not sure how to "embed" it. I don't think any existing API could help.
>> Basicall mlxsw would have to register a callback to the i2c driver
>> called every time card is inserted to do auto-provision.
>> Now consider a case when there are multiple instances of the ASIC on the
>> system. How to assemble a relationship between mlxsw instance and i2c
>> driver instance?
> 
> You have that knowledge already, otherwise you cannot solve this
> problem at all. The switch is an PCIe device right? So when the bus is
> enumerated, the driver loads. How do you bind the i2c driver to the
> i2c bus? You cannot enumerate i2c, so you must have some hard coded
> knowledge somewhere? You just need to get that knowledge into the
> mlxsw driver so it can bind its internal i2c client driver to the i2c
> bus. That way you avoid user space, i guess maybe udev rules, or some
> daemon monitoring propriety /sys files?
> 
>> But again, auto-provision is only one usecase. Manual provisioning is
>> needed anyway. And that is exactly what my patchset is aiming to
>> introduce. Auto-provision can be added when/if needed later on.
> 
> I still don't actually get this use case. Why would i want to manually
> provision?
> 


+1.

Jiri Pirko Jan. 28, 2021, 8:14 a.m. UTC | #47

Wed, Jan 27, 2021 at 03:14:34PM CET, andrew@lunn.ch wrote:
>> >There are Linux standard APIs for controlling the power to devices,
>> >the regulator API. So i assume mlxreg-pm will make use of that. There
>> >are also standard APIs for thermal management, which again, mlxreg-pm
>> >should be using. The regulator API allows you to find regulators by
>> >name. So just define a sensible naming convention, and the switch
>> >driver can lookup the regulator, and turn it on/off as needed.
>> 
>> 
>> I don't think it would apply. The thing is, i2c driver has a channel to
>> the linecard eeprom, from where it can read info about the linecard. The
>> i2c driver also knows when the linecard is plugged in, unlike mlxsw.
>> It acts as a standalone driver. Mlxsw has no way to directly find if the
>> card was plugged in (unpowered) and which type it is.
>> 
>> Not sure how to "embed" it. I don't think any existing API could help.
>> Basicall mlxsw would have to register a callback to the i2c driver
>> called every time card is inserted to do auto-provision.
>> Now consider a case when there are multiple instances of the ASIC on the
>> system. How to assemble a relationship between mlxsw instance and i2c
>> driver instance?
>
>You have that knowledge already, otherwise you cannot solve this

No I don't have it. I'm not sure why do you say so. The mlxsw and i2c
driver act independently.


>problem at all. The switch is an PCIe device right? So when the bus is
>enumerated, the driver loads. How do you bind the i2c driver to the
>i2c bus? You cannot enumerate i2c, so you must have some hard coded
>knowledge somewhere? You just need to get that knowledge into the
>mlxsw driver so it can bind its internal i2c client driver to the i2c

There is no internal i2c client driver for this.


>bus. That way you avoid user space, i guess maybe udev rules, or some
>daemon monitoring propriety /sys files?
>
>> But again, auto-provision is only one usecase. Manual provisioning is
>> needed anyway. And that is exactly what my patchset is aiming to
>> introduce. Auto-provision can be added when/if needed later on.
>
>I still don't actually get this use case. Why would i want to manually
>provision?

Because user might want to see the system with all netdevices, configure
them, change the linecard if they got broken and all config, like
bridge, tc, etc will stay on the netdevices. Again, this is the same we
do for split port. This is important requirement, user don't want to see
netdevices come and go when he is plugging/unplugging cables. Linecards
are the same in this matter. Basically is is a "splitter module",
replacing the "splitter cable"


>
>	Andrew

Andrew Lunn Jan. 28, 2021, 2:17 p.m. UTC | #48

On Thu, Jan 28, 2021 at 09:14:34AM +0100, Jiri Pirko wrote:
> Wed, Jan 27, 2021 at 03:14:34PM CET, andrew@lunn.ch wrote:
> >> >There are Linux standard APIs for controlling the power to devices,
> >> >the regulator API. So i assume mlxreg-pm will make use of that. There
> >> >are also standard APIs for thermal management, which again, mlxreg-pm
> >> >should be using. The regulator API allows you to find regulators by
> >> >name. So just define a sensible naming convention, and the switch
> >> >driver can lookup the regulator, and turn it on/off as needed.
> >> 
> >> 
> >> I don't think it would apply. The thing is, i2c driver has a channel to
> >> the linecard eeprom, from where it can read info about the linecard. The
> >> i2c driver also knows when the linecard is plugged in, unlike mlxsw.
> >> It acts as a standalone driver. Mlxsw has no way to directly find if the
> >> card was plugged in (unpowered) and which type it is.
> >> 
> >> Not sure how to "embed" it. I don't think any existing API could help.
> >> Basicall mlxsw would have to register a callback to the i2c driver
> >> called every time card is inserted to do auto-provision.
> >> Now consider a case when there are multiple instances of the ASIC on the
> >> system. How to assemble a relationship between mlxsw instance and i2c
> >> driver instance?
> >
> >You have that knowledge already, otherwise you cannot solve this
> 
> No I don't have it. I'm not sure why do you say so. The mlxsw and i2c
> driver act independently.

Ah, so you just export some information in /sys from the i2c driver?
And you expect the poor user to look at the values, and copy paste
them to the correct mlxsw instance? 50/50 guess if you have two
switches, and hope they don't make a typO?

> >I still don't actually get this use case. Why would i want to manually
> >provision?
> 
> Because user might want to see the system with all netdevices, configure
> them, change the linecard if they got broken and all config, like
> bridge, tc, etc will stay on the netdevices. Again, this is the same we
> do for split port. This is important requirement, user don't want to see
> netdevices come and go when he is plugging/unplugging cables. Linecards
> are the same in this matter. Basically is is a "splitter module",
> replacing the "splitter cable"

So, what is the real use case here? Why might the user want to do
this?

Is it: The magic smoke has escaped. The user takes a spare switch, and
wants to put it on her desk to configure it where she has a comfy chair
and piece and quiet, unlike in the data centre, which is very noise,
only has hard plastic chair, no coffee allowed. She makes her best
guess at the configuration, up/downs the interfaces, reboots, to make
sure it is permanent, and only then moves to the data centre to swap
the dead router for the new one, and fix up whatever configuration
errors there are, while sat on the hard chair?

So this feature is about comfy chair vs hard chair?

I'm also wondering about the splitter port use case. At what point do
you tell the user that it is physically impossible to split the port
because the SFP simply does not support it? You say the netdevs don't
come/go. I assume the link never goes up, but how does the user know
the configuration is FUBAR, not the SFP? To me, it seems a lot more
intuitive that when i remove an SFP which has been split into 4, and
pop in an SFP which only supports a single stream, the 3 extra netdevs
would just vanish.

   Andrew

Jiri Pirko Jan. 29, 2021, 7:20 a.m. UTC | #49

Thu, Jan 28, 2021 at 03:17:13PM CET, andrew@lunn.ch wrote:
>On Thu, Jan 28, 2021 at 09:14:34AM +0100, Jiri Pirko wrote:
>> Wed, Jan 27, 2021 at 03:14:34PM CET, andrew@lunn.ch wrote:
>> >> >There are Linux standard APIs for controlling the power to devices,
>> >> >the regulator API. So i assume mlxreg-pm will make use of that. There
>> >> >are also standard APIs for thermal management, which again, mlxreg-pm
>> >> >should be using. The regulator API allows you to find regulators by
>> >> >name. So just define a sensible naming convention, and the switch
>> >> >driver can lookup the regulator, and turn it on/off as needed.
>> >> 
>> >> 
>> >> I don't think it would apply. The thing is, i2c driver has a channel to
>> >> the linecard eeprom, from where it can read info about the linecard. The
>> >> i2c driver also knows when the linecard is plugged in, unlike mlxsw.
>> >> It acts as a standalone driver. Mlxsw has no way to directly find if the
>> >> card was plugged in (unpowered) and which type it is.
>> >> 
>> >> Not sure how to "embed" it. I don't think any existing API could help.
>> >> Basicall mlxsw would have to register a callback to the i2c driver
>> >> called every time card is inserted to do auto-provision.
>> >> Now consider a case when there are multiple instances of the ASIC on the
>> >> system. How to assemble a relationship between mlxsw instance and i2c
>> >> driver instance?
>> >
>> >You have that knowledge already, otherwise you cannot solve this
>> 
>> No I don't have it. I'm not sure why do you say so. The mlxsw and i2c
>> driver act independently.
>
>Ah, so you just export some information in /sys from the i2c driver?
>And you expect the poor user to look at the values, and copy paste
>them to the correct mlxsw instance? 50/50 guess if you have two
>switches, and hope they don't make a typO?

Which values are you talking about here exactly?


>
>> >I still don't actually get this use case. Why would i want to manually
>> >provision?
>> 
>> Because user might want to see the system with all netdevices, configure
>> them, change the linecard if they got broken and all config, like
>> bridge, tc, etc will stay on the netdevices. Again, this is the same we
>> do for split port. This is important requirement, user don't want to see
>> netdevices come and go when he is plugging/unplugging cables. Linecards
>> are the same in this matter. Basically is is a "splitter module",
>> replacing the "splitter cable"
>
>So, what is the real use case here? Why might the user want to do
>this?
>
>Is it: The magic smoke has escaped. The user takes a spare switch, and
>wants to put it on her desk to configure it where she has a comfy chair
>and piece and quiet, unlike in the data centre, which is very noise,
>only has hard plastic chair, no coffee allowed. She makes her best
>guess at the configuration, up/downs the interfaces, reboots, to make
>sure it is permanent, and only then moves to the data centre to swap
>the dead router for the new one, and fix up whatever configuration
>errors there are, while sat on the hard chair?
>
>So this feature is about comfy chair vs hard chair?

I don't really get the question, but configuring switch w/o any linecard
and plug the linecards in later on is definitelly a usecase.


>
>I'm also wondering about the splitter port use case. At what point do
>you tell the user that it is physically impossible to split the port
>because the SFP simply does not support it? You say the netdevs don't
>come/go. I assume the link never goes up, but how does the user know
>the configuration is FUBAR, not the SFP? To me, it seems a lot more
>intuitive that when i remove an SFP which has been split into 4, and
>pop in an SFP which only supports a single stream, the 3 extra netdevs
>would just vanish.

As I wrote easlier in this thread, for hw that supports it, there should
be possibility to turn on "autosplit" mode that would do exactly what
you describe. But depends on a usecase. User should be in power to
configure "autosplit" for split cables and "autodetect" for linecards.
Both should be treated in the same way I believe.


>
>   Andrew
>

Vadim Pasternak Jan. 29, 2021, 4:45 p.m. UTC | #50

> -----Original Message-----
> From: Andrew Lunn <andrew@lunn.ch>
> Sent: Friday, January 29, 2021 5:50 PM
> To: Jiri Pirko <jiri@resnulli.us>
> Cc: David Ahern <dsahern@gmail.com>; Jakub Kicinski <kuba@kernel.org>;
> netdev@vger.kernel.org; davem@davemloft.net; jacob.e.keller@intel.com;
> Roopa Prabhu <roopa@nvidia.com>; mlxsw <mlxsw@nvidia.com>; Vadim
> Pasternak <vadimp@nvidia.com>
> Subject: Re: [patch net-next RFC 00/10] introduce line card support for
> modular switch
> 
> On Fri, Jan 29, 2021 at 08:20:15AM +0100, Jiri Pirko wrote:
> > Thu, Jan 28, 2021 at 03:17:13PM CET, andrew@lunn.ch wrote:
> > >On Thu, Jan 28, 2021 at 09:14:34AM +0100, Jiri Pirko wrote:
> > >> Wed, Jan 27, 2021 at 03:14:34PM CET, andrew@lunn.ch wrote:
> > >> >> >There are Linux standard APIs for controlling the power to
> > >> >> >devices, the regulator API. So i assume mlxreg-pm will make use
> > >> >> >of that. There are also standard APIs for thermal management,
> > >> >> >which again, mlxreg-pm should be using. The regulator API
> > >> >> >allows you to find regulators by name. So just define a
> > >> >> >sensible naming convention, and the switch driver can lookup the
> regulator, and turn it on/off as needed.
> > >> >>
> > >> >>
> > >> >> I don't think it would apply. The thing is, i2c driver has a
> > >> >> channel to the linecard eeprom, from where it can read info
> > >> >> about the linecard. The i2c driver also knows when the linecard is
> plugged in, unlike mlxsw.
> > >> >> It acts as a standalone driver. Mlxsw has no way to directly
> > >> >> find if the card was plugged in (unpowered) and which type it is.
> > >> >>
> > >> >> Not sure how to "embed" it. I don't think any existing API could help.
> > >> >> Basicall mlxsw would have to register a callback to the i2c
> > >> >> driver called every time card is inserted to do auto-provision.
> > >> >> Now consider a case when there are multiple instances of the
> > >> >> ASIC on the system. How to assemble a relationship between mlxsw
> > >> >> instance and i2c driver instance?
> > >> >
> > >> >You have that knowledge already, otherwise you cannot solve this
> > >>
> > >> No I don't have it. I'm not sure why do you say so. The mlxsw and
> > >> i2c driver act independently.
> > >
> > >Ah, so you just export some information in /sys from the i2c driver?
> > >And you expect the poor user to look at the values, and copy paste
> > >them to the correct mlxsw instance? 50/50 guess if you have two
> > >switches, and hope they don't make a typO?
> >
> > Which values are you talking about here exactly?
> 
> The i2c driver tells you what line card is actually inserted.
> Hopefully it interprets the EEPROM and gives the user a nice string. You then
> need to use this string to provision the switch, so it knows what line card has
> been inserted. Or the user can pre-prevision, make a guess as to what card will
> actually be inserted sometime in the future, tell the switch, and hope that
> actually happens.

Hi Andrew,

mlxsw I2C driver is BMC side driver. Its purpose to provide hwmon,
thermal, QSFP info for the chassis management at BMC side.
It works on top of PRM interface and it is associated with the chip I2C
slave device.
It doesn't aware of system topology, it knows nothing about system I2C
tree, what is EEPROM, where it located and so on. This is not a scope of
this driver.

Platform line card driver is aware of line card I2C topology, its
responsibility is to detect line card basic hardware type, create I2C
topology (mux), connect all the necessary I2C devices, like hotswap
devices, voltage and power regulators devices, iio/a2d devices and line
card EEPROMs, creates LED instances for LED located on a line card, exposes
line card related attributes, like CPLD and FPGA versions, reset causes,
required powered through line card hwmon interface.

> 
> > >> >I still don't actually get this use case. Why would i want to
> > >> >manually provision?
> > >>
> > >> Because user might want to see the system with all netdevices,
> > >> configure them, change the linecard if they got broken and all
> > >> config, like bridge, tc, etc will stay on the netdevices. Again,
> > >> this is the same we do for split port. This is important
> > >> requirement, user don't want to see netdevices come and go when he
> > >> is plugging/unplugging cables. Linecards are the same in this
> > >> matter. Basically is is a "splitter module", replacing the "splitter cable"
> > >
> > >So, what is the real use case here? Why might the user want to do
> > >this?
> > >
> > >Is it: The magic smoke has escaped. The user takes a spare switch,
> > >and wants to put it on her desk to configure it where she has a comfy
> > >chair and piece and quiet, unlike in the data centre, which is very
> > >noise, only has hard plastic chair, no coffee allowed. She makes her
> > >best guess at the configuration, up/downs the interfaces, reboots, to
> > >make sure it is permanent, and only then moves to the data centre to
> > >swap the dead router for the new one, and fix up whatever
> > >configuration errors there are, while sat on the hard chair?
> > >
> > >So this feature is about comfy chair vs hard chair?
> >
> > I don't really get the question, but configuring switch w/o any
> > linecard and plug the linecards in later on is definitelly a usecase.
> 
> It is a requirement, not a use case.
> 
> A use case is the big picture, what is the user doing, at the big picture level. In
> the somewhat absurd example given above, the user case is, the router chassis
> has died, but they think the line cards are O.K. They want to do as much
> configuration and testing as possible before going into the data center to
> actually replace the chassis.  By reusing the existing line cards, they reduce the
> risk of getting the cables plugged into the wrong port.
> 
> From the use case, you can derive the requirements. In order to test that ifup --
> all puts the IP addresses in the right places, it needs to have the netdevs with
> the correct names. Either they need line cards in the router, or they need to be
> able to fake line cards. There is also a requirement that the line cards are easy
> to exchange. They do not need to be fully hot-plugable, since one router is
> dead, the other can be powered off. But ideally you want simple thumb
> screws, not a Philips screwdriver or an allan key. There is also a requirement
> that this provision is persistent, since the user is likely to reboot the system in
> order to test the configurations files actually work at boot time. Either the
> switch driver needs to write the information to FLASH, or user space needs to
> tell it on every boot, a systemd service file or similar.
> 
> But since you have not been able to answer my question, i wonder if
> everything is backwards around here. Your architecture is broken, you cannot
> easily determine what line card is inserted, so you need a workaround,
> provisioning. But provision might actually be a useful feature, so lets try to sell
> the feature, and gloss over that the architecture is broken.
> 
> So, i would like to see the architecture fixed first. The switch driver somehow
> talks to the i2c driver to find out what card is in the slot, and configures itself.
> My guess is, every other switch does this, this is what the user expects as a
> base feature, it is what we want Linux to do by default.
> 
> You can later add provisioning, where if the slot is empty, you can fake a line
> card, to fulfil the use cases. And when the slot is actually filled, you can verify
> what is plugged in matches what was expected, and be very noise if not.
> 
> 	  Andrew

Andrew Lunn Jan. 29, 2021, 5:31 p.m. UTC | #51

> Platform line card driver is aware of line card I2C topology, its
> responsibility is to detect line card basic hardware type, create I2C
> topology (mux), connect all the necessary I2C devices, like hotswap
> devices, voltage and power regulators devices, iio/a2d devices and line
> card EEPROMs, creates LED instances for LED located on a line card, exposes
> line card related attributes, like CPLD and FPGA versions, reset causes,
> required powered through line card hwmon interface.

So this driver, and the switch driver need to talk to each other, so
the switch driver actually knows what, if anything, is in the slot.

    Andrew

Jiri Pirko Jan. 30, 2021, 2:19 p.m. UTC | #52

Fri, Jan 29, 2021 at 06:31:59PM CET, andrew@lunn.ch wrote:
>> Platform line card driver is aware of line card I2C topology, its
>> responsibility is to detect line card basic hardware type, create I2C
>> topology (mux), connect all the necessary I2C devices, like hotswap
>> devices, voltage and power regulators devices, iio/a2d devices and line
>> card EEPROMs, creates LED instances for LED located on a line card, exposes
>> line card related attributes, like CPLD and FPGA versions, reset causes,
>> required powered through line card hwmon interface.
>
>So this driver, and the switch driver need to talk to each other, so
>the switch driver actually knows what, if anything, is in the slot.

Not possible in case the BMC is a different host, which is common
scenario.

Andrew Lunn Feb. 1, 2021, 1:43 a.m. UTC | #53

> Platform line card driver is aware of line card I2C topology, its
> responsibility is to detect line card basic hardware type, create I2C
> topology (mux), connect all the necessary I2C devices, like hotswap
> devices, voltage and power regulators devices, iio/a2d devices and line
> card EEPROMs, creates LED instances for LED located on a line card, exposes
> line card related attributes, like CPLD and FPGA versions, reset causes,
> required powered through line card hwmon interface.

Jiri says the hardware is often connected to the BMC. But you do
expose much of this to the host as well? You want devlink dev info to
show the version information. Use devlink dev flash to upgrade the
bitfile in the CPD and FPGA. The hwmon instances are pretty pointless
on the BMC where nobody can see them. Are there temperature sensors
involved? The host is where the thermal policy is running, deciding
what to throttle, or shut down when it gets too hot. LEDs can be
controlled via /sys/class/led as expected?

So exporting what the line card actually is to the host is not really
a problem, it is just one more bit of information amongst everything
else already exposed to it.

	Andrew

Jiri Pirko Feb. 1, 2021, 8:16 a.m. UTC | #54

Sun, Jan 31, 2021 at 06:09:24PM CET, dsahern@gmail.com wrote:
>On 1/30/21 7:19 AM, Jiri Pirko wrote:
>> Fri, Jan 29, 2021 at 06:31:59PM CET, andrew@lunn.ch wrote:
>>>> Platform line card driver is aware of line card I2C topology, its
>>>> responsibility is to detect line card basic hardware type, create I2C
>>>> topology (mux), connect all the necessary I2C devices, like hotswap
>>>> devices, voltage and power regulators devices, iio/a2d devices and line
>>>> card EEPROMs, creates LED instances for LED located on a line card, exposes
>>>> line card related attributes, like CPLD and FPGA versions, reset causes,
>>>> required powered through line card hwmon interface.
>>>
>>> So this driver, and the switch driver need to talk to each other, so
>>> the switch driver actually knows what, if anything, is in the slot.
>> 
>> Not possible in case the BMC is a different host, which is common
>> scenario.
>> 
>
>User provisions a 4 port card, but a 2 port card is inserted. How is
>this detected and the user told the wrong card is inserted?

The card won't get activated.
The user won't see the type of inserted linecard. Again, it is not
possible for ASIC to access the linecard eeprom. See Vadim's reply.


>
>If it is not detected that's a serious problem, no?

That is how it is, unfortunatelly.


>
>If it is detected why can't the same mechanism be used for auto
>provisioning?

Again, not possible to detect.

Andrew Lunn Feb. 1, 2021, 1:41 p.m. UTC | #55

On Mon, Feb 01, 2021 at 09:16:41AM +0100, Jiri Pirko wrote:
> Sun, Jan 31, 2021 at 06:09:24PM CET, dsahern@gmail.com wrote:
> >On 1/30/21 7:19 AM, Jiri Pirko wrote:
> >> Fri, Jan 29, 2021 at 06:31:59PM CET, andrew@lunn.ch wrote:
> >>>> Platform line card driver is aware of line card I2C topology, its
> >>>> responsibility is to detect line card basic hardware type, create I2C
> >>>> topology (mux), connect all the necessary I2C devices, like hotswap
> >>>> devices, voltage and power regulators devices, iio/a2d devices and line
> >>>> card EEPROMs, creates LED instances for LED located on a line card, exposes
> >>>> line card related attributes, like CPLD and FPGA versions, reset causes,
> >>>> required powered through line card hwmon interface.
> >>>
> >>> So this driver, and the switch driver need to talk to each other, so
> >>> the switch driver actually knows what, if anything, is in the slot.
> >> 
> >> Not possible in case the BMC is a different host, which is common
> >> scenario.
> >> 
> >
> >User provisions a 4 port card, but a 2 port card is inserted. How is
> >this detected and the user told the wrong card is inserted?
> 
> The card won't get activated.
> The user won't see the type of inserted linecard. Again, it is not
> possible for ASIC to access the linecard eeprom. See Vadim's reply.
> 
> 
> >
> >If it is not detected that's a serious problem, no?
> 
> That is how it is, unfortunatelly.
> 
> 
> >
> >If it is detected why can't the same mechanism be used for auto
> >provisioning?
> 
> Again, not possible to detect.

If the platform line card driver is running in the host, you can
detect it. From your wording, it sounds like some systems do have this
driver in the host. So please add the needed code.

When the platform line card driver is on the BMC, you need a proxy in
between. Isn't this what IPMI and Redfish is all about? The proxy
driver can offer the same interface as the platform line card driver.

    Andrew

Jiri Pirko Feb. 3, 2021, 2:57 p.m. UTC | #56

Mon, Feb 01, 2021 at 02:41:07PM CET, andrew@lunn.ch wrote:
>On Mon, Feb 01, 2021 at 09:16:41AM +0100, Jiri Pirko wrote:
>> Sun, Jan 31, 2021 at 06:09:24PM CET, dsahern@gmail.com wrote:
>> >On 1/30/21 7:19 AM, Jiri Pirko wrote:
>> >> Fri, Jan 29, 2021 at 06:31:59PM CET, andrew@lunn.ch wrote:
>> >>>> Platform line card driver is aware of line card I2C topology, its
>> >>>> responsibility is to detect line card basic hardware type, create I2C
>> >>>> topology (mux), connect all the necessary I2C devices, like hotswap
>> >>>> devices, voltage and power regulators devices, iio/a2d devices and line
>> >>>> card EEPROMs, creates LED instances for LED located on a line card, exposes
>> >>>> line card related attributes, like CPLD and FPGA versions, reset causes,
>> >>>> required powered through line card hwmon interface.
>> >>>
>> >>> So this driver, and the switch driver need to talk to each other, so
>> >>> the switch driver actually knows what, if anything, is in the slot.
>> >> 
>> >> Not possible in case the BMC is a different host, which is common
>> >> scenario.
>> >> 
>> >
>> >User provisions a 4 port card, but a 2 port card is inserted. How is
>> >this detected and the user told the wrong card is inserted?
>> 
>> The card won't get activated.
>> The user won't see the type of inserted linecard. Again, it is not
>> possible for ASIC to access the linecard eeprom. See Vadim's reply.
>> 
>> 
>> >
>> >If it is not detected that's a serious problem, no?
>> 
>> That is how it is, unfortunatelly.
>> 
>> 
>> >
>> >If it is detected why can't the same mechanism be used for auto
>> >provisioning?
>> 
>> Again, not possible to detect.
>
>If the platform line card driver is running in the host, you can
>detect it. From your wording, it sounds like some systems do have this
>driver in the host. So please add the needed code.

But if not, it cannot. We still need the provisioning then.


>
>When the platform line card driver is on the BMC, you need a proxy in
>between. Isn't this what IPMI and Redfish is all about? The proxy
>driver can offer the same interface as the platform line card driver.

Do you have any example of kernel driver which is doing some thing like
that?


>
>    Andrew

Andrew Lunn Feb. 3, 2021, 4:26 p.m. UTC | #57

> >When the platform line card driver is on the BMC, you need a proxy in
> >between. Isn't this what IPMI and Redfish is all about? The proxy
> >driver can offer the same interface as the platform line card driver.
> 
> Do you have any example of kernel driver which is doing some thing like
> that?

drivers/hwmon/ibmaem.c is a pretty normal looking HWMON driver, for
temperature/power/energy sensors which are connected to the BMC and
accessed over IPMI.

char/ipmi/ipmi_watchdog.c as the name suggests is a watchdog. At first
glance its API to user space follows the standard API, even if it does
not make use of the watchdog subsystem core.

These two should give you examples of how you talk to the BMC from a
kernel driver.

	 Andrew

[net-next,RFC,00/10] introduce line card support for modular switch

Message

Comments