Message ID | 20210113121222.733517-1-jiri@resnulli.us (mailing list archive) |
---|---|
Headers | show |
Series | introduce line card support for modular switch | expand |
> $ devlink lc provision netdevsim/netdevsim10 lc 0 type card4ports > $ devlink lc > netdevsim/netdevsim10: > lc 0 state provisioned type card4ports > supported_types: > card1port card2ports card4ports > lc 1 state unprovisioned > supported_types: > card1port card2ports card4ports Hi Jiri > # Now activate the line card using debugfs. That emulates plug-in event > # on real hardware: > $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active > $ ip link show eni10nl0p1 > 165: eni10nl0p1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 > link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff > # The carrier is UP now. What is missing from the devlink lc view is what line card is actually in the slot. Say if i provision for a card4port, but actually insert a card2port. It would be nice to have something like: $ devlink lc netdevsim/netdevsim10: lc 0 state provisioned type card4ports supported_types: card1port card2ports card4ports inserted_type: card2ports; lc 1 state unprovisioned supported_types: card1port card2ports card4ports inserted_type: None I assume if i prevision for card4ports but actually install a card2ports, all the interfaces stay down? Maybe > $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active should actually be echo "card2ports" > /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active so you can emulate somebody putting the wrong card in the slot? Andrew
On Wed, 13 Jan 2021 13:12:12 +0100 Jiri Pirko wrote: > This patchset introduces support for modular switch systems. > NVIDIA Mellanox SN4800 is an example of such. It contains 8 slots > to accomodate line cards. Available line cards include: > 16X 100GbE (QSFP28) > 8X 200GbE (QSFP56) > 4X 400GbE (QSFP-DD) > > Similar to split cabels, it is essencial for the correctness of > configuration and funcionality to treat the line card entities > in the same way, no matter the line card is inserted or not. > Meaning, the netdevice of a line card port cannot just disappear > when line card is removed. Also, system admin needs to be able > to apply configuration on netdevices belonging to line card port > even before the linecard gets inserted. I don't understand why that would be. Please provide reasoning, e.g. what the FW/HW limitation is. > To resolve this, a concept of "provisioning" is introduced. > The user may "provision" certain slot with a line card type. > Driver then creates all instances (devlink ports, netdevices, etc) > related to this line card type. The carrier of netdevices stays down. > Once the line card is inserted and activated, the carrier of the > related netdevices goes up. Dunno what "line card" means for Mellovidia but I don't think the analogy of port splitting works. To my knowledge traditional line cards often carry processors w/ full MACs etc. so I'd say plugging in a line card is much more like plugging in a new NIC. There is no way to tell a breakout cable from normal one, so the system has no chance to magically configure itself. Besides SFP is just plugging a cable, not a module of the system..
Thu, Jan 14, 2021 at 03:07:18AM CET, andrew@lunn.ch wrote: >> $ devlink lc provision netdevsim/netdevsim10 lc 0 type card4ports >> $ devlink lc >> netdevsim/netdevsim10: >> lc 0 state provisioned type card4ports >> supported_types: >> card1port card2ports card4ports >> lc 1 state unprovisioned >> supported_types: >> card1port card2ports card4ports > >Hi Jiri > >> # Now activate the line card using debugfs. That emulates plug-in event >> # on real hardware: >> $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active >> $ ip link show eni10nl0p1 >> 165: eni10nl0p1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 >> link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff >> # The carrier is UP now. > >What is missing from the devlink lc view is what line card is actually >in the slot. Say if i provision for a card4port, but actually insert a >card2port. It would be nice to have something like: > > $ devlink lc > netdevsim/netdevsim10: > lc 0 state provisioned type card4ports > supported_types: > card1port card2ports card4ports > inserted_type: > card2ports; > lc 1 state unprovisioned > supported_types: > card1port card2ports card4ports > inserted_type: > None I see. Yes, that might be doable. I'm noting this down. > >I assume if i prevision for card4ports but actually install a >card2ports, all the interfaces stay down? Yes, the card won't get activated in case or provision mismatch. > >Maybe > >> $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active > >should actually be > echo "card2ports" > /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active > >so you can emulate somebody putting the wrong card in the slot? Got you. Thanks! > > Andrew
Thu, Jan 14, 2021 at 03:27:16AM CET, kuba@kernel.org wrote: >On Wed, 13 Jan 2021 13:12:12 +0100 Jiri Pirko wrote: >> This patchset introduces support for modular switch systems. >> NVIDIA Mellanox SN4800 is an example of such. It contains 8 slots >> to accomodate line cards. Available line cards include: >> 16X 100GbE (QSFP28) >> 8X 200GbE (QSFP56) >> 4X 400GbE (QSFP-DD) >> >> Similar to split cabels, it is essencial for the correctness of >> configuration and funcionality to treat the line card entities >> in the same way, no matter the line card is inserted or not. >> Meaning, the netdevice of a line card port cannot just disappear >> when line card is removed. Also, system admin needs to be able >> to apply configuration on netdevices belonging to line card port >> even before the linecard gets inserted. > >I don't understand why that would be. Please provide reasoning, >e.g. what the FW/HW limitation is. Well, for split cable, you need to be able to say: port 2, split into 4. And you will have 4 netdevices. These netdevices you can use to put into bridge, configure mtu, speeds, routes, etc. These will exist no matter if the splitter cable is actually inserted or not. With linecards, this is very similar. By provisioning, you also create certain number of ports, according to the linecard that you plan to insert. And similarly to the splitter, the netdevices are created. You may combine the linecard/splitter config when splitter cable is connected to a linecard port. Then you provision a linecard, port is going to appear and you will split this port. > >> To resolve this, a concept of "provisioning" is introduced. >> The user may "provision" certain slot with a line card type. >> Driver then creates all instances (devlink ports, netdevices, etc) >> related to this line card type. The carrier of netdevices stays down. >> Once the line card is inserted and activated, the carrier of the >> related netdevices goes up. > >Dunno what "line card" means for Mellovidia but I don't think >the analogy of port splitting works. To my knowledge traditional >line cards often carry processors w/ full MACs etc. so I'd say >plugging in a line card is much more like plugging in a new NIC. No. It is basically a phy gearbox. The mac is not there. The interface between asic and linecard are lanes. The linecards is basically an attachable phy. > >There is no way to tell a breakout cable from normal one, so the >system has no chance to magically configure itself. Besides SFP >is just plugging a cable, not a module of the system..
On 1/13/2021 11:39 PM, Jiri Pirko wrote: > Thu, Jan 14, 2021 at 03:07:18AM CET, andrew@lunn.ch wrote: >> >> I assume if i prevision for card4ports but actually install a >> card2ports, all the interfaces stay down? > > Yes, the card won't get activated in case or provision mismatch. > If you're able to detect the line card type when plugging it in, I don't understand why you need system administrator to pre-provision it using such an interface? Wouldn't it make more sense to simply detect the case? Or is it that you expect these things to be moved around and want to make sure that you can configure the associated netdevices before the card is plugged in?
On 1/13/2021 6:27 PM, Jakub Kicinski wrote: > On Wed, 13 Jan 2021 13:12:12 +0100 Jiri Pirko wrote: >> This patchset introduces support for modular switch systems. >> NVIDIA Mellanox SN4800 is an example of such. It contains 8 slots >> to accomodate line cards. Available line cards include: >> 16X 100GbE (QSFP28) >> 8X 200GbE (QSFP56) >> 4X 400GbE (QSFP-DD) >> >> Similar to split cabels, it is essencial for the correctness of >> configuration and funcionality to treat the line card entities >> in the same way, no matter the line card is inserted or not. >> Meaning, the netdevice of a line card port cannot just disappear >> when line card is removed. Also, system admin needs to be able >> to apply configuration on netdevices belonging to line card port >> even before the linecard gets inserted. > > I don't understand why that would be. Please provide reasoning, > e.g. what the FW/HW limitation is. > I agree, I wouldn't imagine that plugging or unplugging line cards is expected to be done on a regular basis? >> To resolve this, a concept of "provisioning" is introduced. >> The user may "provision" certain slot with a line card type. >> Driver then creates all instances (devlink ports, netdevices, etc) >> related to this line card type. The carrier of netdevices stays down. >> Once the line card is inserted and activated, the carrier of the >> related netdevices goes up. > > Dunno what "line card" means for Mellovidia but I don't think > the analogy of port splitting works. To my knowledge traditional > line cards often carry processors w/ full MACs etc. so I'd say > plugging in a line card is much more like plugging in a new NIC. > Even if they didn't... > There is no way to tell a breakout cable from normal one, so the > system has no chance to magically configure itself. Besides SFP > is just plugging a cable, not a module of the system.. > If you're able to tell what is plugged in, why would we want to force user to provision ahead of time? Wouldn't it make more sense to just instantiate them as the card is plugged in? I guess it might be useful to allow programming the netdevices before the cable is actually inserted... I guess I don't see why that is valuable. It would be sort of like if you provision a PCI slot before a device is plugged into it..
On Thu, 14 Jan 2021 14:58:33 -0800 Jacob Keller wrote: > > There is no way to tell a breakout cable from normal one, so the > > system has no chance to magically configure itself. Besides SFP > > is just plugging a cable, not a module of the system.. > > > If you're able to tell what is plugged in, why would we want to force > user to provision ahead of time? Wouldn't it make more sense to just > instantiate them as the card is plugged in? I guess it might be useful > to allow programming the netdevices before the cable is actually > inserted... I guess I don't see why that is valuable. > > It would be sort of like if you provision a PCI slot before a device is > plugged into it.. Yup, that's pretty much my thinking as well.
On Thu, 14 Jan 2021 08:48:04 +0100 Jiri Pirko wrote: > Thu, Jan 14, 2021 at 03:27:16AM CET, kuba@kernel.org wrote: > >On Wed, 13 Jan 2021 13:12:12 +0100 Jiri Pirko wrote: > >> This patchset introduces support for modular switch systems. > >> NVIDIA Mellanox SN4800 is an example of such. It contains 8 slots > >> to accomodate line cards. Available line cards include: > >> 16X 100GbE (QSFP28) > >> 8X 200GbE (QSFP56) > >> 4X 400GbE (QSFP-DD) > >> > >> Similar to split cabels, it is essencial for the correctness of > >> configuration and funcionality to treat the line card entities > >> in the same way, no matter the line card is inserted or not. > >> Meaning, the netdevice of a line card port cannot just disappear > >> when line card is removed. Also, system admin needs to be able > >> to apply configuration on netdevices belonging to line card port > >> even before the linecard gets inserted. > > > >I don't understand why that would be. Please provide reasoning, > >e.g. what the FW/HW limitation is. > > Well, for split cable, you need to be able to say: > port 2, split into 4. And you will have 4 netdevices. These netdevices > you can use to put into bridge, configure mtu, speeds, routes, etc. > These will exist no matter if the splitter cable is actually inserted or > not. The difference is that the line card is more detectable (I hope). I'm not a SFP experts so maybe someone will correct me but AFAIU the QSFP (for optics) is the same regardless of breakout. It's the passive optical strands that are either bundled or not. So there is no way for the system to detect the cable type (AFAIK). Or to put it differently IMO the netdev should be provisioned if the system has a port into which user can plug in a cable. When there is a line card-sized hole in the chassis, I'd be surprised to see ports. That said I never worked with real world routers so maybe that's what they do. Maybe some with a Cisco router in the basement can tell us? :) > With linecards, this is very similar. By provisioning, you also create > certain number of ports, according to the linecard that you plan to > insert. And similarly to the splitter, the netdevices are created. > > You may combine the linecard/splitter config when splitter cable is > connected to a linecard port. Then you provision a linecard, > port is going to appear and you will split this port. > > >> To resolve this, a concept of "provisioning" is introduced. > >> The user may "provision" certain slot with a line card type. > >> Driver then creates all instances (devlink ports, netdevices, etc) > >> related to this line card type. The carrier of netdevices stays down. > >> Once the line card is inserted and activated, the carrier of the > >> related netdevices goes up. > > > >Dunno what "line card" means for Mellovidia but I don't think > >the analogy of port splitting works. To my knowledge traditional > >line cards often carry processors w/ full MACs etc. so I'd say > >plugging in a line card is much more like plugging in a new NIC. > > No. It is basically a phy gearbox. The mac is not there. The interface > between asic and linecard are lanes. The linecards is basically an > attachable phy. If the device really needs this configuration / can't detect things automatically, then we gotta do something like what you have. The only question is do we still want to call it a line card. Sounds more like a front panel module. At Netronome we called those phymods.
Thu, Jan 14, 2021 at 11:56:15PM CET, jacob.e.keller@intel.com wrote: > > >On 1/13/2021 11:39 PM, Jiri Pirko wrote: >> Thu, Jan 14, 2021 at 03:07:18AM CET, andrew@lunn.ch wrote: >>> >>> I assume if i prevision for card4ports but actually install a >>> card2ports, all the interfaces stay down? >> >> Yes, the card won't get activated in case or provision mismatch. >> > >If you're able to detect the line card type when plugging it in, I don't >understand why you need system administrator to pre-provision it using >such an interface? Wouldn't it make more sense to simply detect the >case? Or is it that you expect these things to be moved around and want >to make sure that you can configure the associated netdevices before the >card is plugged in? That is what I wrote in the cover letter.
Fri, Jan 15, 2021 at 12:30:13AM CET, kuba@kernel.org wrote: >On Thu, 14 Jan 2021 08:48:04 +0100 Jiri Pirko wrote: >> Thu, Jan 14, 2021 at 03:27:16AM CET, kuba@kernel.org wrote: >> >On Wed, 13 Jan 2021 13:12:12 +0100 Jiri Pirko wrote: >> >> This patchset introduces support for modular switch systems. >> >> NVIDIA Mellanox SN4800 is an example of such. It contains 8 slots >> >> to accomodate line cards. Available line cards include: >> >> 16X 100GbE (QSFP28) >> >> 8X 200GbE (QSFP56) >> >> 4X 400GbE (QSFP-DD) >> >> >> >> Similar to split cabels, it is essencial for the correctness of >> >> configuration and funcionality to treat the line card entities >> >> in the same way, no matter the line card is inserted or not. >> >> Meaning, the netdevice of a line card port cannot just disappear >> >> when line card is removed. Also, system admin needs to be able >> >> to apply configuration on netdevices belonging to line card port >> >> even before the linecard gets inserted. >> > >> >I don't understand why that would be. Please provide reasoning, >> >e.g. what the FW/HW limitation is. >> >> Well, for split cable, you need to be able to say: >> port 2, split into 4. And you will have 4 netdevices. These netdevices >> you can use to put into bridge, configure mtu, speeds, routes, etc. >> These will exist no matter if the splitter cable is actually inserted or >> not. > >The difference is that the line card is more detectable (I hope). > >I'm not a SFP experts so maybe someone will correct me but AFAIU >the QSFP (for optics) is the same regardless of breakout. It's the >passive optical strands that are either bundled or not. So there is >no way for the system to detect the cable type (AFAIK). For SFP module, you are able to detect those. > >Or to put it differently IMO the netdev should be provisioned if the >system has a port into which user can plug in a cable. When there is Not really. For slit cables, the ports are provisioned not matter which cable is connected, slitter 1->2/1->4 or 1->1 cable. >a line card-sized hole in the chassis, I'd be surprised to see ports. > >That said I never worked with real world routers so maybe that's what >they do. Maybe some with a Cisco router in the basement can tell us? :) The need for provision/pre-configure splitter/linecard is that the ports/netdevices do not disapper/reappear when you replace splitter/linecard. Consider a faulty linecard with one port burned. You just want to replace it with new one. And in that case, you really don't want kernel to remove netdevices and possibly mess up routing for example. > >> With linecards, this is very similar. By provisioning, you also create >> certain number of ports, according to the linecard that you plan to >> insert. And similarly to the splitter, the netdevices are created. >> >> You may combine the linecard/splitter config when splitter cable is >> connected to a linecard port. Then you provision a linecard, >> port is going to appear and you will split this port. >> >> >> To resolve this, a concept of "provisioning" is introduced. >> >> The user may "provision" certain slot with a line card type. >> >> Driver then creates all instances (devlink ports, netdevices, etc) >> >> related to this line card type. The carrier of netdevices stays down. >> >> Once the line card is inserted and activated, the carrier of the >> >> related netdevices goes up. >> > >> >Dunno what "line card" means for Mellovidia but I don't think >> >the analogy of port splitting works. To my knowledge traditional >> >line cards often carry processors w/ full MACs etc. so I'd say >> >plugging in a line card is much more like plugging in a new NIC. >> >> No. It is basically a phy gearbox. The mac is not there. The interface >> between asic and linecard are lanes. The linecards is basically an >> attachable phy. > >If the device really needs this configuration / can't detect things >automatically, then we gotta do something like what you have. >The only question is do we still want to call it a line card. >Sounds more like a front panel module. At Netronome we called >those phymods. Sure, the name is up to the discussion. We call it "linecard" internally. I don't care about the name.
Fri, Jan 15, 2021 at 12:20:58AM CET, kuba@kernel.org wrote: >On Thu, 14 Jan 2021 14:58:33 -0800 Jacob Keller wrote: >> > There is no way to tell a breakout cable from normal one, so the >> > system has no chance to magically configure itself. Besides SFP >> > is just plugging a cable, not a module of the system.. >> > >> If you're able to tell what is plugged in, why would we want to force >> user to provision ahead of time? Wouldn't it make more sense to just >> instantiate them as the card is plugged in? I guess it might be useful >> to allow programming the netdevices before the cable is actually >> inserted... I guess I don't see why that is valuable. >> >> It would be sort of like if you provision a PCI slot before a device is >> plugged into it.. > >Yup, that's pretty much my thinking as well. Please see my reply in the other sub-tread of this thread. Thanks!
On Wed, Jan 13, 2021 at 01:12:12PM +0100, Jiri Pirko wrote: > # Create a new netdevsim device, with no ports and 2 line cards: > $ echo "10 0 2" >/sys/bus/netdevsim/new_device > $ devlink port # No ports are listed > $ devlink lc > netdevsim/netdevsim10: > lc 0 state unprovisioned > supported_types: > card1port card2ports card4ports > lc 1 state unprovisioned > supported_types: > card1port card2ports card4ports > > # Note that driver advertizes supported line card types. In case of > # netdevsim, these are 3. > > $ devlink lc provision netdevsim/netdevsim10 lc 0 type card4ports Why do we need a separate command for that? You actually introduced 'DEVLINK_CMD_LINECARD_SET' in patch #1, but it's never used. I prefer: devlink lc set netdevsim/netdevsim10 index 0 state provision type card4ports devlink lc set netdevsim/netdevsim10 index 0 state unprovision It is consistent with the GET/SET/NEW/DEL pattern used by other commands. > $ devlink lc > netdevsim/netdevsim10: > lc 0 state provisioned type card4ports > supported_types: > card1port card2ports card4ports > lc 1 state unprovisioned > supported_types: > card1port card2ports card4ports > $ devlink port > netdevsim/netdevsim10/1000: type eth netdev eni10nl0p1 flavour physical lc 0 port 1 splittable false > netdevsim/netdevsim10/1001: type eth netdev eni10nl0p2 flavour physical lc 0 port 2 splittable false > netdevsim/netdevsim10/1002: type eth netdev eni10nl0p3 flavour physical lc 0 port 3 splittable false > netdevsim/netdevsim10/1003: type eth netdev eni10nl0p4 flavour physical lc 0 port 4 splittable false > # ^^ ^^^^ > # netdev name adjusted index of a line card this port belongs to > > $ ip link set eni10nl0p1 up > $ ip link show eni10nl0p1 > 165: eni10nl0p1: <NO-CARRIER,BROADCAST,NOARP,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 > link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff > > # Now activate the line card using debugfs. That emulates plug-in event > # on real hardware: > $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active > $ ip link show eni10nl0p1 > 165: eni10nl0p1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 > link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff > # The carrier is UP now. > > Jiri Pirko (10): > devlink: add support to create line card and expose to user > devlink: implement line card provisioning > devlink: implement line card active state > devlink: append split port number to the port name > devlink: add port to line card relationship set > netdevsim: introduce line card support > netdevsim: allow port objects to be linked with line cards > netdevsim: create devlink line card object and implement provisioning > netdevsim: implement line card activation > selftests: add netdevsim devlink lc test > > drivers/net/netdevsim/bus.c | 21 +- > drivers/net/netdevsim/dev.c | 370 ++++++++++++++- > drivers/net/netdevsim/netdev.c | 2 + > drivers/net/netdevsim/netdevsim.h | 23 + > include/net/devlink.h | 44 ++ > include/uapi/linux/devlink.h | 25 + > net/core/devlink.c | 443 +++++++++++++++++- > .../drivers/net/netdevsim/devlink.sh | 62 ++- > 8 files changed, 964 insertions(+), 26 deletions(-) > > -- > 2.26.2 >
Fri, Jan 15, 2021 at 04:43:57PM CET, idosch@idosch.org wrote: >On Wed, Jan 13, 2021 at 01:12:12PM +0100, Jiri Pirko wrote: >> # Create a new netdevsim device, with no ports and 2 line cards: >> $ echo "10 0 2" >/sys/bus/netdevsim/new_device >> $ devlink port # No ports are listed >> $ devlink lc >> netdevsim/netdevsim10: >> lc 0 state unprovisioned >> supported_types: >> card1port card2ports card4ports >> lc 1 state unprovisioned >> supported_types: >> card1port card2ports card4ports >> >> # Note that driver advertizes supported line card types. In case of >> # netdevsim, these are 3. >> >> $ devlink lc provision netdevsim/netdevsim10 lc 0 type card4ports > >Why do we need a separate command for that? You actually introduced >'DEVLINK_CMD_LINECARD_SET' in patch #1, but it's never used. > >I prefer: > >devlink lc set netdevsim/netdevsim10 index 0 state provision type card4ports This is misleading. This is actually not setting state. The state gets changed upon successful provisioning process. Also, one may think that he can set other states, but he can't. I don't like this at all :/ >devlink lc set netdevsim/netdevsim10 index 0 state unprovision > >It is consistent with the GET/SET/NEW/DEL pattern used by other >commands. Not really, see split port for example. This is similar to that. > >> $ devlink lc >> netdevsim/netdevsim10: >> lc 0 state provisioned type card4ports >> supported_types: >> card1port card2ports card4ports >> lc 1 state unprovisioned >> supported_types: >> card1port card2ports card4ports >> $ devlink port >> netdevsim/netdevsim10/1000: type eth netdev eni10nl0p1 flavour physical lc 0 port 1 splittable false >> netdevsim/netdevsim10/1001: type eth netdev eni10nl0p2 flavour physical lc 0 port 2 splittable false >> netdevsim/netdevsim10/1002: type eth netdev eni10nl0p3 flavour physical lc 0 port 3 splittable false >> netdevsim/netdevsim10/1003: type eth netdev eni10nl0p4 flavour physical lc 0 port 4 splittable false >> # ^^ ^^^^ >> # netdev name adjusted index of a line card this port belongs to >> >> $ ip link set eni10nl0p1 up >> $ ip link show eni10nl0p1 >> 165: eni10nl0p1: <NO-CARRIER,BROADCAST,NOARP,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 >> link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff >> >> # Now activate the line card using debugfs. That emulates plug-in event >> # on real hardware: >> $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active >> $ ip link show eni10nl0p1 >> 165: eni10nl0p1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 >> link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff >> # The carrier is UP now. >> >> Jiri Pirko (10): >> devlink: add support to create line card and expose to user >> devlink: implement line card provisioning >> devlink: implement line card active state >> devlink: append split port number to the port name >> devlink: add port to line card relationship set >> netdevsim: introduce line card support >> netdevsim: allow port objects to be linked with line cards >> netdevsim: create devlink line card object and implement provisioning >> netdevsim: implement line card activation >> selftests: add netdevsim devlink lc test >> >> drivers/net/netdevsim/bus.c | 21 +- >> drivers/net/netdevsim/dev.c | 370 ++++++++++++++- >> drivers/net/netdevsim/netdev.c | 2 + >> drivers/net/netdevsim/netdevsim.h | 23 + >> include/net/devlink.h | 44 ++ >> include/uapi/linux/devlink.h | 25 + >> net/core/devlink.c | 443 +++++++++++++++++- >> .../drivers/net/netdevsim/devlink.sh | 62 ++- >> 8 files changed, 964 insertions(+), 26 deletions(-) >> >> -- >> 2.26.2 >>
On Fri, Jan 15, 2021 at 05:55:59PM +0100, Jiri Pirko wrote: > Fri, Jan 15, 2021 at 04:43:57PM CET, idosch@idosch.org wrote: > >On Wed, Jan 13, 2021 at 01:12:12PM +0100, Jiri Pirko wrote: > >> # Create a new netdevsim device, with no ports and 2 line cards: > >> $ echo "10 0 2" >/sys/bus/netdevsim/new_device > >> $ devlink port # No ports are listed > >> $ devlink lc > >> netdevsim/netdevsim10: > >> lc 0 state unprovisioned > >> supported_types: > >> card1port card2ports card4ports > >> lc 1 state unprovisioned > >> supported_types: > >> card1port card2ports card4ports > >> > >> # Note that driver advertizes supported line card types. In case of > >> # netdevsim, these are 3. > >> > >> $ devlink lc provision netdevsim/netdevsim10 lc 0 type card4ports > > > >Why do we need a separate command for that? You actually introduced > >'DEVLINK_CMD_LINECARD_SET' in patch #1, but it's never used. > > > >I prefer: > > > >devlink lc set netdevsim/netdevsim10 index 0 state provision type card4ports > > This is misleading. This is actually not setting state. The state gets > changed upon successful provisioning process. Also, one may think that > he can set other states, but he can't. I don't like this at all :/ So make state a read-only attribute. You really only care about setting the type. To provision: # devlink lc set netdevsim/netdevsim10 index 0 type card4ports To unprovsion: # devlink lc set netdevsim/netdevsim10 index 0 type none Or: # devlink lc set netdevsim/netdevsim10 index 0 notype > > > >devlink lc set netdevsim/netdevsim10 index 0 state unprovision > > > >It is consistent with the GET/SET/NEW/DEL pattern used by other > >commands. > > Not really, see split port for example. This is similar to that. It's not. The split command creates new objects whereas this command modifies an existing object. > > > > >> $ devlink lc > >> netdevsim/netdevsim10: > >> lc 0 state provisioned type card4ports > >> supported_types: > >> card1port card2ports card4ports > >> lc 1 state unprovisioned > >> supported_types: > >> card1port card2ports card4ports > >> $ devlink port > >> netdevsim/netdevsim10/1000: type eth netdev eni10nl0p1 flavour physical lc 0 port 1 splittable false > >> netdevsim/netdevsim10/1001: type eth netdev eni10nl0p2 flavour physical lc 0 port 2 splittable false > >> netdevsim/netdevsim10/1002: type eth netdev eni10nl0p3 flavour physical lc 0 port 3 splittable false > >> netdevsim/netdevsim10/1003: type eth netdev eni10nl0p4 flavour physical lc 0 port 4 splittable false > >> # ^^ ^^^^ > >> # netdev name adjusted index of a line card this port belongs to > >> > >> $ ip link set eni10nl0p1 up > >> $ ip link show eni10nl0p1 > >> 165: eni10nl0p1: <NO-CARRIER,BROADCAST,NOARP,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 > >> link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff > >> > >> # Now activate the line card using debugfs. That emulates plug-in event > >> # on real hardware: > >> $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active > >> $ ip link show eni10nl0p1 > >> 165: eni10nl0p1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 > >> link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff > >> # The carrier is UP now. > >> > >> Jiri Pirko (10): > >> devlink: add support to create line card and expose to user > >> devlink: implement line card provisioning > >> devlink: implement line card active state > >> devlink: append split port number to the port name > >> devlink: add port to line card relationship set > >> netdevsim: introduce line card support > >> netdevsim: allow port objects to be linked with line cards > >> netdevsim: create devlink line card object and implement provisioning > >> netdevsim: implement line card activation > >> selftests: add netdevsim devlink lc test > >> > >> drivers/net/netdevsim/bus.c | 21 +- > >> drivers/net/netdevsim/dev.c | 370 ++++++++++++++- > >> drivers/net/netdevsim/netdev.c | 2 + > >> drivers/net/netdevsim/netdevsim.h | 23 + > >> include/net/devlink.h | 44 ++ > >> include/uapi/linux/devlink.h | 25 + > >> net/core/devlink.c | 443 +++++++++++++++++- > >> .../drivers/net/netdevsim/devlink.sh | 62 ++- > >> 8 files changed, 964 insertions(+), 26 deletions(-) > >> > >> -- > >> 2.26.2 > >>
On Fri, 15 Jan 2021 15:39:06 +0100 Jiri Pirko wrote: > >I'm not a SFP experts so maybe someone will correct me but AFAIU > >the QSFP (for optics) is the same regardless of breakout. It's the > >passive optical strands that are either bundled or not. So there is > >no way for the system to detect the cable type (AFAIK). > > For SFP module, you are able to detect those. Not sure you understand what I'm saying. Maybe you're thinking about DACs? This is a optical cable for breakout: https://www.fs.com/products/68048.html There is no electronics in it to "detect" things AFAIU. Same QSFP can be used with this cable or a non-breakout. > >Or to put it differently IMO the netdev should be provisioned if the > >system has a port into which user can plug in a cable. When there is > > Not really. For slit cables, the ports are provisioned not matter which > cable is connected, slitter 1->2/1->4 or 1->1 cable. > > > >a line card-sized hole in the chassis, I'd be surprised to see ports. > > > >That said I never worked with real world routers so maybe that's what > >they do. Maybe some with a Cisco router in the basement can tell us? :) > > The need for provision/pre-configure splitter/linecard is that the > ports/netdevices do not disapper/reappear when you replace > splitter/linecard. Consider a faulty linecard with one port burned. You > just want to replace it with new one. And in that case, you really don't > want kernel to remove netdevices and possibly mess up routing for > example. Having a single burned port sounds like a relatively rare scenario. Reconfiguring routing is not the end of the world. > >If the device really needs this configuration / can't detect things > >automatically, then we gotta do something like what you have. > >The only question is do we still want to call it a line card. > >Sounds more like a front panel module. At Netronome we called > >those phymods. > > Sure, the name is up to the discussion. We call it "linecard" > internally. I don't care about the name. Yeah, let's call it something more appropriate to indicate its breakout/retimer/gearbox nature, and we'll be good :)
Fri, Jan 15, 2021 at 08:26:17PM CET, kuba@kernel.org wrote: >On Fri, 15 Jan 2021 15:39:06 +0100 Jiri Pirko wrote: >> >I'm not a SFP experts so maybe someone will correct me but AFAIU >> >the QSFP (for optics) is the same regardless of breakout. It's the >> >passive optical strands that are either bundled or not. So there is >> >no way for the system to detect the cable type (AFAIK). >> >> For SFP module, you are able to detect those. > >Not sure you understand what I'm saying. Maybe you're thinking about >DACs? This is a optical cable for breakout: > >https://www.fs.com/products/68048.html > >There is no electronics in it to "detect" things AFAIU. Same QSFP can >be used with this cable or a non-breakout. Ah, got you. > >> >Or to put it differently IMO the netdev should be provisioned if the >> >system has a port into which user can plug in a cable. When there is >> >> Not really. For slit cables, the ports are provisioned not matter which >> cable is connected, slitter 1->2/1->4 or 1->1 cable. >> >> >> >a line card-sized hole in the chassis, I'd be surprised to see ports. >> > >> >That said I never worked with real world routers so maybe that's what >> >they do. Maybe some with a Cisco router in the basement can tell us? :) >> >> The need for provision/pre-configure splitter/linecard is that the >> ports/netdevices do not disapper/reappear when you replace >> splitter/linecard. Consider a faulty linecard with one port burned. You >> just want to replace it with new one. And in that case, you really don't >> want kernel to remove netdevices and possibly mess up routing for >> example. > >Having a single burned port sounds like a relatively rare scenario. Hmm, rare in scale is common... >Reconfiguring routing is not the end of the world. Well, yes, but you don't really want netdevices to come and go then you plug in/out cables/modules. That's why we have split implemented as we do. I don't understand why do you think linecards are different. Plus, I'm not really sure that our hw can report the type, will check. One way or another, I think that both configuration flows have valid usecase. Some user may want pre-configuration, some user may want auto. Btw, it is possible to implement splitter cable in auto mode as well. > >> >If the device really needs this configuration / can't detect things >> >automatically, then we gotta do something like what you have. >> >The only question is do we still want to call it a line card. >> >Sounds more like a front panel module. At Netronome we called >> >those phymods. >> >> Sure, the name is up to the discussion. We call it "linecard" >> internally. I don't care about the name. > >Yeah, let's call it something more appropriate to indicate its >breakout/retimer/gearbox nature, and we'll be good :) Well, it can contain much more. It can contain a smartnic/fpga/whatever for example. Not sure we can find something that fits to all cases. I was thinking about it in the past, I think that the linecard is quite appropriate. It connects with lines/lanes, and it does something, either phy/gearbox, or just interconnects the lanes using smartnic/fpga for example.
Fri, Jan 15, 2021 at 07:01:45PM CET, idosch@idosch.org wrote: >On Fri, Jan 15, 2021 at 05:55:59PM +0100, Jiri Pirko wrote: >> Fri, Jan 15, 2021 at 04:43:57PM CET, idosch@idosch.org wrote: >> >On Wed, Jan 13, 2021 at 01:12:12PM +0100, Jiri Pirko wrote: >> >> # Create a new netdevsim device, with no ports and 2 line cards: >> >> $ echo "10 0 2" >/sys/bus/netdevsim/new_device >> >> $ devlink port # No ports are listed >> >> $ devlink lc >> >> netdevsim/netdevsim10: >> >> lc 0 state unprovisioned >> >> supported_types: >> >> card1port card2ports card4ports >> >> lc 1 state unprovisioned >> >> supported_types: >> >> card1port card2ports card4ports >> >> >> >> # Note that driver advertizes supported line card types. In case of >> >> # netdevsim, these are 3. >> >> >> >> $ devlink lc provision netdevsim/netdevsim10 lc 0 type card4ports >> > >> >Why do we need a separate command for that? You actually introduced >> >'DEVLINK_CMD_LINECARD_SET' in patch #1, but it's never used. >> > >> >I prefer: >> > >> >devlink lc set netdevsim/netdevsim10 index 0 state provision type card4ports >> >> This is misleading. This is actually not setting state. The state gets >> changed upon successful provisioning process. Also, one may think that >> he can set other states, but he can't. I don't like this at all :/ > >So make state a read-only attribute. You really only care about setting >the type. > >To provision: > ># devlink lc set netdevsim/netdevsim10 index 0 type card4ports > >To unprovsion: > ># devlink lc set netdevsim/netdevsim10 index 0 type none > >Or: > ># devlink lc set netdevsim/netdevsim10 index 0 notype Hmm, okay, that might work. And I can add state "FAILED_PROVISION" what would indicate that after the type was set by the user, driver was not able to successfully provision. The the user has to set "notype" & "type x" again. Sounds good? > >> >> >> >devlink lc set netdevsim/netdevsim10 index 0 state unprovision >> > >> >It is consistent with the GET/SET/NEW/DEL pattern used by other >> >commands. >> >> Not really, see split port for example. This is similar to that. > >It's not. The split command creates new objects whereas this command >modifies an existing object. You are right. > >> >> > >> >> $ devlink lc >> >> netdevsim/netdevsim10: >> >> lc 0 state provisioned type card4ports >> >> supported_types: >> >> card1port card2ports card4ports >> >> lc 1 state unprovisioned >> >> supported_types: >> >> card1port card2ports card4ports >> >> $ devlink port >> >> netdevsim/netdevsim10/1000: type eth netdev eni10nl0p1 flavour physical lc 0 port 1 splittable false >> >> netdevsim/netdevsim10/1001: type eth netdev eni10nl0p2 flavour physical lc 0 port 2 splittable false >> >> netdevsim/netdevsim10/1002: type eth netdev eni10nl0p3 flavour physical lc 0 port 3 splittable false >> >> netdevsim/netdevsim10/1003: type eth netdev eni10nl0p4 flavour physical lc 0 port 4 splittable false >> >> # ^^ ^^^^ >> >> # netdev name adjusted index of a line card this port belongs to >> >> >> >> $ ip link set eni10nl0p1 up >> >> $ ip link show eni10nl0p1 >> >> 165: eni10nl0p1: <NO-CARRIER,BROADCAST,NOARP,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 >> >> link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff >> >> >> >> # Now activate the line card using debugfs. That emulates plug-in event >> >> # on real hardware: >> >> $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active >> >> $ ip link show eni10nl0p1 >> >> 165: eni10nl0p1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 >> >> link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff >> >> # The carrier is UP now. >> >> >> >> Jiri Pirko (10): >> >> devlink: add support to create line card and expose to user >> >> devlink: implement line card provisioning >> >> devlink: implement line card active state >> >> devlink: append split port number to the port name >> >> devlink: add port to line card relationship set >> >> netdevsim: introduce line card support >> >> netdevsim: allow port objects to be linked with line cards >> >> netdevsim: create devlink line card object and implement provisioning >> >> netdevsim: implement line card activation >> >> selftests: add netdevsim devlink lc test >> >> >> >> drivers/net/netdevsim/bus.c | 21 +- >> >> drivers/net/netdevsim/dev.c | 370 ++++++++++++++- >> >> drivers/net/netdevsim/netdev.c | 2 + >> >> drivers/net/netdevsim/netdevsim.h | 23 + >> >> include/net/devlink.h | 44 ++ >> >> include/uapi/linux/devlink.h | 25 + >> >> net/core/devlink.c | 443 +++++++++++++++++- >> >> .../drivers/net/netdevsim/devlink.sh | 62 ++- >> >> 8 files changed, 964 insertions(+), 26 deletions(-) >> >> >> >> -- >> >> 2.26.2 >> >>
On Mon, 18 Jan 2021 14:00:09 +0100 Jiri Pirko wrote: > >> >Or to put it differently IMO the netdev should be provisioned if the > >> >system has a port into which user can plug in a cable. When there is > >> > >> Not really. For slit cables, the ports are provisioned not matter which > >> cable is connected, slitter 1->2/1->4 or 1->1 cable. > >> > >> > >> >a line card-sized hole in the chassis, I'd be surprised to see ports. > >> > > >> >That said I never worked with real world routers so maybe that's what > >> >they do. Maybe some with a Cisco router in the basement can tell us? :) > >> > >> The need for provision/pre-configure splitter/linecard is that the > >> ports/netdevices do not disapper/reappear when you replace > >> splitter/linecard. Consider a faulty linecard with one port burned. You > >> just want to replace it with new one. And in that case, you really don't > >> want kernel to remove netdevices and possibly mess up routing for > >> example. > > > >Having a single burned port sounds like a relatively rare scenario. > > Hmm, rare in scale is common... Sure but at a scale of million switches it doesn't matter if a couple are re-configuring their routing. > >Reconfiguring routing is not the end of the world. > > Well, yes, but you don't really want netdevices to come and go then you > plug in/out cables/modules. That's why we have split implemented as we > do. I don't understand why do you think linecards are different. If I have an unused port it will still show up as a netdev. If I have an unused phymod slot w/ a slot cover in it, why would there be a netdev? Our definition of a physical port is something like "a socket for a networking cable on the outside of the device". With your code I can "provision" a phymod and there is no whole to plug in a cable. If we follow the same logic, if I have a server with PCIe hotplug, why can't I "provision" some netdevs for a NIC that I will plug in later? > Plus, I'm not really sure that our hw can report the type, will check. I think that's key. > One way or another, I think that both configuration flows have valid > usecase. Some user may want pre-configuration, some user may want auto. > Btw, it is possible to implement splitter cable in auto mode as well. Auto as in iterate over possible configs until link up? That's nasty. > >> >If the device really needs this configuration / can't detect things > >> >automatically, then we gotta do something like what you have. > >> >The only question is do we still want to call it a line card. > >> >Sounds more like a front panel module. At Netronome we called > >> >those phymods. > >> > >> Sure, the name is up to the discussion. We call it "linecard" > >> internally. I don't care about the name. > > > >Yeah, let's call it something more appropriate to indicate its > >breakout/retimer/gearbox nature, and we'll be good :) > > Well, it can contain much more. It can contain a smartnic/fpga/whatever > for example. Not sure we can find something that fits to all cases. > I was thinking about it in the past, I think that the linecard is quite > appropriate. It connects with lines/lanes, and it does something, > either phy/gearbox, or just interconnects the lanes using smartnic/fpga > for example. If it has a FPGA / NPU in it, it's definitely auto-discoverable. I don't understand why you think that it's okay to "provision" NICs which aren't there but only for this particular use case.
On Wed, Jan 13, 2021 at 4:14 AM Jiri Pirko <jiri@resnulli.us> wrote: > To resolve this, a concept of "provisioning" is introduced. > The user may "provision" certain slot with a line card type. > Driver then creates all instances (devlink ports, netdevices, etc) > related to this line card type. The carrier of netdevices stays down. > Once the line card is inserted and activated, the carrier of the > related netdevices goes up. Do we need to start distinguishing different reasons for carrier down, or have some kind of device not ready state instead? I'm facing a similar issue with NIC firmware that isn't yet ready by device open time, but have been resisting the urge to lie to the stack about the state of the device and use link state as the next gate. Sure, most things will just work most of the time, but the problems with this approach are manifold. Firstly, at least in the NIC case, the user may confuse this state for some kind of cable issue and go looking in the wrong place for a solution. But, there are also several ways the initialization can fail after this point and now the device is administratively UP, but can never be UP, with no sanctioned way to communicate the failure. Aren't the issues here similar? Regards, Edwin Peer
On 1/18/21 6:00 AM, Jiri Pirko wrote: > >> Reconfiguring routing is not the end of the world. > Well, yes, but you don't really want netdevices to come and go then you > plug in/out cables/modules. That's why we have split implemented as we And you don't want a routing daemon to use netdevices which are not valid due to non-existence. Best case with what you want is carrier down on the LC's netdevices and that destroys routing. > do. I don't understand why do you think linecards are different. I still don't get why you expect linecards to be different than any other hotplug device.
On 1/18/21 11:01 AM, Edwin Peer wrote: > I'm facing a similar issue with NIC firmware that isn't yet ready by > device open time, but have been resisting the urge to lie to the stack why not have the ndo_open return -EBUSY or -EAGAIN to tell S/W to try again 'later'? > about the state of the device and use link state as the next gate. > Sure, most things will just work most of the time, but the problems > with this approach are manifold. Firstly, at least in the NIC case, > the user may confuse this state for some kind of cable issue and go > looking in the wrong place for a solution. But, there are also several > ways the initialization can fail after this point and now the device > is administratively UP, but can never be UP, with no sanctioned way to > communicate the failure. Aren't the issues here similar?
On Mon, Jan 18, 2021 at 2:57 PM David Ahern <dsahern@gmail.com> wrote: > On 1/18/21 11:01 AM, Edwin Peer wrote: > > I'm facing a similar issue with NIC firmware that isn't yet ready by > > device open time, but have been resisting the urge to lie to the stack > > why not have the ndo_open return -EBUSY or -EAGAIN to tell S/W to try > again 'later'? Indeed, this is what we ended up doing, although we still need to confirm Network Manager, systemd and whatever else our customers might use do the necessary to satisfy the user requirement to handle the delayed init. Only reason I piped up is that this line card thing seems to introduce a similar issue. Regards, Edwin Peer
On 1/18/21 4:40 PM, Edwin Peer wrote: > On Mon, Jan 18, 2021 at 2:57 PM David Ahern <dsahern@gmail.com> wrote: > >> On 1/18/21 11:01 AM, Edwin Peer wrote: >>> I'm facing a similar issue with NIC firmware that isn't yet ready by >>> device open time, but have been resisting the urge to lie to the stack >> >> why not have the ndo_open return -EBUSY or -EAGAIN to tell S/W to try >> again 'later'? > > Indeed, this is what we ended up doing, although we still need to > confirm Network Manager, systemd and whatever else our customers might > use do the necessary to satisfy the user requirement to handle the > delayed init. I am not surprised about the issue - boot times have been improved and devices have gotten more complicated. And I was wondering how network managers (add ifupdown{2} to that list) would handle an EAGAIN. You could have an event sent -- e.g., IFLA_EVENT_FW_READY -- to allow managers to avoid polling. Redundant for multiple netdev's per device, but makes it event driven. > > Only reason I piped up is that this line card thing seems to introduce > a similar issue. Seems reasonable.
On Mon, Jan 18, 2021 at 6:39 PM David Ahern <dsahern@gmail.com> wrote: > > Indeed, this is what we ended up doing, although we still need to > > confirm Network Manager, systemd and whatever else our customers might > > use do the necessary to satisfy the user requirement to handle the > > delayed init. > > I am not surprised about the issue - boot times have been improved and > devices have gotten more complicated. And I was wondering how network > managers (add ifupdown{2} to that list) would handle an EAGAIN. You > could have an event sent -- e.g., IFLA_EVENT_FW_READY -- to allow > managers to avoid polling. Redundant for multiple netdev's per device, > but makes it event driven. This is what I was hinting at by talking about another device state. For that, there would necessarily need to be an event to inform user space about the transition out of said state into normal open/up. Of course, the tools would need to be updated to know about such a new event too. EAGAIN does seem simpler. In our case, we don't expect to be polling too long, or frequently for that matter. It is exceptionally rare that the firmware wouldn't be ready, but it can happen. Regards, Edwin Peer
Mon, Jan 18, 2021 at 06:59:28PM CET, kuba@kernel.org wrote: >On Mon, 18 Jan 2021 14:00:09 +0100 Jiri Pirko wrote: >> >> >Or to put it differently IMO the netdev should be provisioned if the >> >> >system has a port into which user can plug in a cable. When there is >> >> >> >> Not really. For slit cables, the ports are provisioned not matter which >> >> cable is connected, slitter 1->2/1->4 or 1->1 cable. >> >> >> >> >> >> >a line card-sized hole in the chassis, I'd be surprised to see ports. >> >> > >> >> >That said I never worked with real world routers so maybe that's what >> >> >they do. Maybe some with a Cisco router in the basement can tell us? :) >> >> >> >> The need for provision/pre-configure splitter/linecard is that the >> >> ports/netdevices do not disapper/reappear when you replace >> >> splitter/linecard. Consider a faulty linecard with one port burned. You >> >> just want to replace it with new one. And in that case, you really don't >> >> want kernel to remove netdevices and possibly mess up routing for >> >> example. >> > >> >Having a single burned port sounds like a relatively rare scenario. >> >> Hmm, rare in scale is common... > >Sure but at a scale of million switches it doesn't matter if a couple >are re-configuring their routing. > >> >Reconfiguring routing is not the end of the world. >> >> Well, yes, but you don't really want netdevices to come and go then you >> plug in/out cables/modules. That's why we have split implemented as we >> do. I don't understand why do you think linecards are different. > >If I have an unused port it will still show up as a netdev. >If I have an unused phymod slot w/ a slot cover in it, why would there >be a netdev? Our definition of a physical port is something like "a >socket for a networking cable on the outside of the device". With your >code I can "provision" a phymod and there is no whole to plug in a >cable. If we follow the same logic, if I have a server with PCIe >hotplug, why can't I "provision" some netdevs for a NIC that I will >plug in later? > >> Plus, I'm not really sure that our hw can report the type, will check. > >I think that's key. So, it can't. The driver is only aware of "activation" of the linecard being successful or not. > >> One way or another, I think that both configuration flows have valid >> usecase. Some user may want pre-configuration, some user may want auto. >> Btw, it is possible to implement splitter cable in auto mode as well. > >Auto as in iterate over possible configs until link up? That's nasty. > >> >> >If the device really needs this configuration / can't detect things >> >> >automatically, then we gotta do something like what you have. >> >> >The only question is do we still want to call it a line card. >> >> >Sounds more like a front panel module. At Netronome we called >> >> >those phymods. >> >> >> >> Sure, the name is up to the discussion. We call it "linecard" >> >> internally. I don't care about the name. >> > >> >Yeah, let's call it something more appropriate to indicate its >> >breakout/retimer/gearbox nature, and we'll be good :) >> >> Well, it can contain much more. It can contain a smartnic/fpga/whatever >> for example. Not sure we can find something that fits to all cases. >> I was thinking about it in the past, I think that the linecard is quite >> appropriate. It connects with lines/lanes, and it does something, >> either phy/gearbox, or just interconnects the lanes using smartnic/fpga >> for example. > >If it has a FPGA / NPU in it, it's definitely auto-discoverable. >I don't understand why you think that it's okay to "provision" NICs >which aren't there but only for this particular use case.
Thu, Jan 14, 2021 at 03:07:18AM CET, andrew@lunn.ch wrote: >> $ devlink lc provision netdevsim/netdevsim10 lc 0 type card4ports >> $ devlink lc >> netdevsim/netdevsim10: >> lc 0 state provisioned type card4ports >> supported_types: >> card1port card2ports card4ports >> lc 1 state unprovisioned >> supported_types: >> card1port card2ports card4ports > >Hi Jiri > >> # Now activate the line card using debugfs. That emulates plug-in event >> # on real hardware: >> $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active >> $ ip link show eni10nl0p1 >> 165: eni10nl0p1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 >> link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff >> # The carrier is UP now. > >What is missing from the devlink lc view is what line card is actually >in the slot. Say if i provision for a card4port, but actually insert a >card2port. It would be nice to have something like: I checked, our hw does not support that. Only provides info that linecard activation was/wasn't successful. > > $ devlink lc > netdevsim/netdevsim10: > lc 0 state provisioned type card4ports > supported_types: > card1port card2ports card4ports > inserted_type: > card2ports; > lc 1 state unprovisioned > supported_types: > card1port card2ports card4ports > inserted_type: > None > >I assume if i prevision for card4ports but actually install a >card2ports, all the interfaces stay down? > >Maybe > >> $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active > >should actually be > echo "card2ports" > /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active > >so you can emulate somebody putting the wrong card in the slot? > > Andrew
On Tue, Jan 19, 2021 at 12:56:10PM +0100, Jiri Pirko wrote: > Thu, Jan 14, 2021 at 03:07:18AM CET, andrew@lunn.ch wrote: > >> $ devlink lc provision netdevsim/netdevsim10 lc 0 type card4ports > >> $ devlink lc > >> netdevsim/netdevsim10: > >> lc 0 state provisioned type card4ports > >> supported_types: > >> card1port card2ports card4ports > >> lc 1 state unprovisioned > >> supported_types: > >> card1port card2ports card4ports > > > >Hi Jiri > > > >> # Now activate the line card using debugfs. That emulates plug-in event > >> # on real hardware: > >> $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active > >> $ ip link show eni10nl0p1 > >> 165: eni10nl0p1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 > >> link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff > >> # The carrier is UP now. > > > >What is missing from the devlink lc view is what line card is actually > >in the slot. Say if i provision for a card4port, but actually insert a > >card2port. It would be nice to have something like: > > I checked, our hw does not support that. Only provides info that > linecard activation was/wasn't successful. Hi Jiri Is this a firmware limitation? There is no API to extract the information from the firmware to the host? The firmware itself knows there is a mismatch and refuses to configure the line card, and prevents the MAC going up? Even if you cannot do this now, it seems likely in future firmware versions you will be able to, so maybe at least define the netlink attributes now? As well as attributes indicating activation was successful. Andrew
On 1/19/21 4:56 AM, Jiri Pirko wrote: > Thu, Jan 14, 2021 at 03:07:18AM CET, andrew@lunn.ch wrote: >>> $ devlink lc provision netdevsim/netdevsim10 lc 0 type card4ports >>> $ devlink lc >>> netdevsim/netdevsim10: >>> lc 0 state provisioned type card4ports >>> supported_types: >>> card1port card2ports card4ports >>> lc 1 state unprovisioned >>> supported_types: >>> card1port card2ports card4ports >> >> Hi Jiri >> >>> # Now activate the line card using debugfs. That emulates plug-in event >>> # on real hardware: >>> $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active >>> $ ip link show eni10nl0p1 >>> 165: eni10nl0p1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 >>> link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff >>> # The carrier is UP now. >> >> What is missing from the devlink lc view is what line card is actually >> in the slot. Say if i provision for a card4port, but actually insert a >> card2port. It would be nice to have something like: > > I checked, our hw does not support that. Only provides info that > linecard activation was/wasn't successful. > There is no way for the supervisor / management card to probe and see what card is actually inserted in a given slot? That seems like a serious design deficiency. What about some agent running on the line card talking to an agent on the supervisor to provide that information?
Tue, Jan 19, 2021 at 03:51:49PM CET, andrew@lunn.ch wrote: >On Tue, Jan 19, 2021 at 12:56:10PM +0100, Jiri Pirko wrote: >> Thu, Jan 14, 2021 at 03:07:18AM CET, andrew@lunn.ch wrote: >> >> $ devlink lc provision netdevsim/netdevsim10 lc 0 type card4ports >> >> $ devlink lc >> >> netdevsim/netdevsim10: >> >> lc 0 state provisioned type card4ports >> >> supported_types: >> >> card1port card2ports card4ports >> >> lc 1 state unprovisioned >> >> supported_types: >> >> card1port card2ports card4ports >> > >> >Hi Jiri >> > >> >> # Now activate the line card using debugfs. That emulates plug-in event >> >> # on real hardware: >> >> $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active >> >> $ ip link show eni10nl0p1 >> >> 165: eni10nl0p1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 >> >> link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff >> >> # The carrier is UP now. >> > >> >What is missing from the devlink lc view is what line card is actually >> >in the slot. Say if i provision for a card4port, but actually insert a >> >card2port. It would be nice to have something like: >> >> I checked, our hw does not support that. Only provides info that >> linecard activation was/wasn't successful. > >Hi Jiri > >Is this a firmware limitation? There is no API to extract the >information from the firmware to the host? The firmware itself knows >there is a mismatch and refuses to configure the line card, and >prevents the MAC going up? No, the FW does not know. The ASIC is not physically able to get the linecard type. Yes, it is odd, I agree. The linecard type is known to the driver which operates on i2c. This driver takes care of power management of the linecard, among other tasks. > >Even if you cannot do this now, it seems likely in future firmware >versions you will be able to, so maybe at least define the netlink Sure, for netdevsim that is not problem. Our current hw does not support it, the future may. >attributes now? As well as attributes indicating activation was >successful. State "ACTIVATED" is that indication. It is in this RFC. > > Andrew
Tue, Jan 19, 2021 at 05:23:19PM CET, dsahern@gmail.com wrote: >On 1/19/21 4:56 AM, Jiri Pirko wrote: >> Thu, Jan 14, 2021 at 03:07:18AM CET, andrew@lunn.ch wrote: >>>> $ devlink lc provision netdevsim/netdevsim10 lc 0 type card4ports >>>> $ devlink lc >>>> netdevsim/netdevsim10: >>>> lc 0 state provisioned type card4ports >>>> supported_types: >>>> card1port card2ports card4ports >>>> lc 1 state unprovisioned >>>> supported_types: >>>> card1port card2ports card4ports >>> >>> Hi Jiri >>> >>>> # Now activate the line card using debugfs. That emulates plug-in event >>>> # on real hardware: >>>> $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active >>>> $ ip link show eni10nl0p1 >>>> 165: eni10nl0p1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 >>>> link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff >>>> # The carrier is UP now. >>> >>> What is missing from the devlink lc view is what line card is actually >>> in the slot. Say if i provision for a card4port, but actually insert a >>> card2port. It would be nice to have something like: >> >> I checked, our hw does not support that. Only provides info that >> linecard activation was/wasn't successful. >> > >There is no way for the supervisor / management card to probe and see >what card is actually inserted in a given slot? That seems like a >serious design deficiency. What about some agent running on the line >card talking to an agent on the supervisor to provide that information? The ASIC does not have this info. The linecard type is exposed over i2c interface, different driver sits on top of it. I agree it is odd, but that is how it is for our hw, unfortunatelly.
> No, the FW does not know. The ASIC is not physically able to get the > linecard type. Yes, it is odd, I agree. The linecard type is known to > the driver which operates on i2c. This driver takes care of power > management of the linecard, among other tasks. So what does activated actually mean for your hardware? It seems to mean something like: Some random card has been plugged in, we have no idea what, but it has power, and we have enabled the MACs as provisioned, which if you are lucky might match the hardware? The foundations of this feature seems dubious. Andrew
On Wed, 20 Jan 2021 14:56:46 +0100 Andrew Lunn wrote: > > No, the FW does not know. The ASIC is not physically able to get the > > linecard type. Yes, it is odd, I agree. The linecard type is known to > > the driver which operates on i2c. This driver takes care of power > > management of the linecard, among other tasks. > > So what does activated actually mean for your hardware? It seems to > mean something like: Some random card has been plugged in, we have no > idea what, but it has power, and we have enabled the MACs as > provisioned, which if you are lucky might match the hardware? > > The foundations of this feature seems dubious. But Jiri also says "The linecard type is known to the driver which operates on i2c." which sounds like there is some i2c driver (in user space?) which talks to the card and _does_ have the info? Maybe I'm misreading it. What's the i2c driver?
On Wed, Jan 20, 2021 at 03:41:58PM -0800, Jakub Kicinski wrote: > On Wed, 20 Jan 2021 14:56:46 +0100 Andrew Lunn wrote: > > > No, the FW does not know. The ASIC is not physically able to get the > > > linecard type. Yes, it is odd, I agree. The linecard type is known to > > > the driver which operates on i2c. This driver takes care of power > > > management of the linecard, among other tasks. > > > > So what does activated actually mean for your hardware? It seems to > > mean something like: Some random card has been plugged in, we have no > > idea what, but it has power, and we have enabled the MACs as > > provisioned, which if you are lucky might match the hardware? > > > > The foundations of this feature seems dubious. > > But Jiri also says "The linecard type is known to the driver which > operates on i2c." which sounds like there is some i2c driver (in user > space?) which talks to the card and _does_ have the info? Maybe I'm > misreading it. What's the i2c driver? Hi Jakub A complete guess, but i think it will be the BMC, not the ASIC. There have been patches from Mellanox in the past for a BMC, i think sent to arm-soc, for the ASPEED devices often used as BMCs. And the BMC is often the device doing power management. So what might be missing is an interface between the driver and the BMC. But that then makes the driver system specific. A OEM who buys ASICs and makes their own board could have their own BMC running there own BMC firmware. All speculation... Andrew
On Thu, 21 Jan 2021 01:01:21 +0100 Andrew Lunn wrote: > On Wed, Jan 20, 2021 at 03:41:58PM -0800, Jakub Kicinski wrote: > > On Wed, 20 Jan 2021 14:56:46 +0100 Andrew Lunn wrote: > > > > No, the FW does not know. The ASIC is not physically able to get the > > > > linecard type. Yes, it is odd, I agree. The linecard type is known to > > > > the driver which operates on i2c. This driver takes care of power > > > > management of the linecard, among other tasks. > > > > > > So what does activated actually mean for your hardware? It seems to > > > mean something like: Some random card has been plugged in, we have no > > > idea what, but it has power, and we have enabled the MACs as > > > provisioned, which if you are lucky might match the hardware? > > > > > > The foundations of this feature seems dubious. > > > > But Jiri also says "The linecard type is known to the driver which > > operates on i2c." which sounds like there is some i2c driver (in user > > space?) which talks to the card and _does_ have the info? Maybe I'm > > misreading it. What's the i2c driver? > > Hi Jakub > > A complete guess, but i think it will be the BMC, not the ASIC. There > have been patches from Mellanox in the past for a BMC, i think sent to > arm-soc, for the ASPEED devices often used as BMCs. And the BMC is > often the device doing power management. So what might be missing is > an interface between the driver and the BMC. But that then makes the > driver system specific. A OEM who buys ASICs and makes their own board > could have their own BMC running there own BMC firmware. > > All speculation... I see that does make sense
Thu, Jan 21, 2021 at 12:41:58AM CET, kuba@kernel.org wrote: >On Wed, 20 Jan 2021 14:56:46 +0100 Andrew Lunn wrote: >> > No, the FW does not know. The ASIC is not physically able to get the >> > linecard type. Yes, it is odd, I agree. The linecard type is known to >> > the driver which operates on i2c. This driver takes care of power >> > management of the linecard, among other tasks. >> >> So what does activated actually mean for your hardware? It seems to >> mean something like: Some random card has been plugged in, we have no >> idea what, but it has power, and we have enabled the MACs as >> provisioned, which if you are lucky might match the hardware? >> >> The foundations of this feature seems dubious. > >But Jiri also says "The linecard type is known to the driver which >operates on i2c." which sounds like there is some i2c driver (in user >space?) which talks to the card and _does_ have the info? Maybe I'm >misreading it. What's the i2c driver? That is Vadim's i2c kernel driver, this is going to upstream.
Thu, Jan 21, 2021 at 01:01:21AM CET, andrew@lunn.ch wrote: >On Wed, Jan 20, 2021 at 03:41:58PM -0800, Jakub Kicinski wrote: >> On Wed, 20 Jan 2021 14:56:46 +0100 Andrew Lunn wrote: >> > > No, the FW does not know. The ASIC is not physically able to get the >> > > linecard type. Yes, it is odd, I agree. The linecard type is known to >> > > the driver which operates on i2c. This driver takes care of power >> > > management of the linecard, among other tasks. >> > >> > So what does activated actually mean for your hardware? It seems to >> > mean something like: Some random card has been plugged in, we have no >> > idea what, but it has power, and we have enabled the MACs as >> > provisioned, which if you are lucky might match the hardware? >> > >> > The foundations of this feature seems dubious. >> >> But Jiri also says "The linecard type is known to the driver which >> operates on i2c." which sounds like there is some i2c driver (in user >> space?) which talks to the card and _does_ have the info? Maybe I'm >> misreading it. What's the i2c driver? > >Hi Jakub > >A complete guess, but i think it will be the BMC, not the ASIC. There >have been patches from Mellanox in the past for a BMC, i think sent to >arm-soc, for the ASPEED devices often used as BMCs. And the BMC is >often the device doing power management. So what might be missing is >an interface between the driver and the BMC. But that then makes the >driver system specific. A OEM who buys ASICs and makes their own board >could have their own BMC running there own BMC firmware. > >All speculation... Basically all correct. The thing is mlxsw and the i2c driver cannot talk to each other: 1) It would be ugly 2) They may likely be on a different host > > Andrew
On 1/21/21 8:32 AM, Jiri Pirko wrote: > Thu, Jan 21, 2021 at 12:41:58AM CET, kuba@kernel.org wrote: >> On Wed, 20 Jan 2021 14:56:46 +0100 Andrew Lunn wrote: >>>> No, the FW does not know. The ASIC is not physically able to get the >>>> linecard type. Yes, it is odd, I agree. The linecard type is known to >>>> the driver which operates on i2c. This driver takes care of power >>>> management of the linecard, among other tasks. >>> >>> So what does activated actually mean for your hardware? It seems to >>> mean something like: Some random card has been plugged in, we have no >>> idea what, but it has power, and we have enabled the MACs as >>> provisioned, which if you are lucky might match the hardware? >>> >>> The foundations of this feature seems dubious. >> >> But Jiri also says "The linecard type is known to the driver which >> operates on i2c." which sounds like there is some i2c driver (in user >> space?) which talks to the card and _does_ have the info? Maybe I'm >> misreading it. What's the i2c driver? > > That is Vadim's i2c kernel driver, this is going to upstream. > This pre-provisioning concept makes a fragile design to work around h/w shortcomings. You really need a way for the management card to know exactly what was plugged in to a slot so the control plane S/W can respond accordingly. Surely there is a way for processes on the LC to communicate with a process on the management card - even if it is inband packets with special headers.
Thu, Jan 21, 2021 at 05:38:40PM CET, dsahern@gmail.com wrote: >On 1/21/21 8:32 AM, Jiri Pirko wrote: >> Thu, Jan 21, 2021 at 12:41:58AM CET, kuba@kernel.org wrote: >>> On Wed, 20 Jan 2021 14:56:46 +0100 Andrew Lunn wrote: >>>>> No, the FW does not know. The ASIC is not physically able to get the >>>>> linecard type. Yes, it is odd, I agree. The linecard type is known to >>>>> the driver which operates on i2c. This driver takes care of power >>>>> management of the linecard, among other tasks. >>>> >>>> So what does activated actually mean for your hardware? It seems to >>>> mean something like: Some random card has been plugged in, we have no >>>> idea what, but it has power, and we have enabled the MACs as >>>> provisioned, which if you are lucky might match the hardware? >>>> >>>> The foundations of this feature seems dubious. >>> >>> But Jiri also says "The linecard type is known to the driver which >>> operates on i2c." which sounds like there is some i2c driver (in user >>> space?) which talks to the card and _does_ have the info? Maybe I'm >>> misreading it. What's the i2c driver? >> >> That is Vadim's i2c kernel driver, this is going to upstream. >> > >This pre-provisioning concept makes a fragile design to work around h/w >shortcomings. You really need a way for the management card to know >exactly what was plugged in to a slot so the control plane S/W can >respond accordingly. Surely there is a way for processes on the LC to >communicate with a process on the management card - even if it is inband >packets with special headers. I don't see any way. The userspace is the one who can get the info, from the i2c driver. The mlxsw driver has no means to get that info itself.
Mon, Jan 18, 2021 at 11:55:45PM CET, dsahern@gmail.com wrote: >On 1/18/21 6:00 AM, Jiri Pirko wrote: >> >>> Reconfiguring routing is not the end of the world. >> Well, yes, but you don't really want netdevices to come and go then you >> plug in/out cables/modules. That's why we have split implemented as we > >And you don't want a routing daemon to use netdevices which are not >valid due to non-existence. Best case with what you want is carrier down >on the LC's netdevices and that destroys routing. There are other things. The user may configure the netdev parameters in advance, like mtu, put it in a bridge, setup TC filters on it etc. The linecard unplug/plug does not destroy the settings. This is the same thing with split ports and that is why we have implemented split ports in "provision" mode as well. > >> do. I don't understand why do you think linecards are different. > >I still don't get why you expect linecards to be different than any >other hotplug device. It it not a device, does not have "struct device" related to it. It is just a phy part of another device.
Thu, Jan 21, 2021 at 05:38:40PM CET, dsahern@gmail.com wrote: >On 1/21/21 8:32 AM, Jiri Pirko wrote: >> Thu, Jan 21, 2021 at 12:41:58AM CET, kuba@kernel.org wrote: >>> On Wed, 20 Jan 2021 14:56:46 +0100 Andrew Lunn wrote: >>>>> No, the FW does not know. The ASIC is not physically able to get the >>>>> linecard type. Yes, it is odd, I agree. The linecard type is known to >>>>> the driver which operates on i2c. This driver takes care of power >>>>> management of the linecard, among other tasks. >>>> >>>> So what does activated actually mean for your hardware? It seems to >>>> mean something like: Some random card has been plugged in, we have no >>>> idea what, but it has power, and we have enabled the MACs as >>>> provisioned, which if you are lucky might match the hardware? >>>> >>>> The foundations of this feature seems dubious. >>> >>> But Jiri also says "The linecard type is known to the driver which >>> operates on i2c." which sounds like there is some i2c driver (in user >>> space?) which talks to the card and _does_ have the info? Maybe I'm >>> misreading it. What's the i2c driver? >> >> That is Vadim's i2c kernel driver, this is going to upstream. >> > >This pre-provisioning concept makes a fragile design to work around h/w >shortcomings. You really need a way for the management card to know Not really. As I replied to you in the other part of this thread, the linecard is basically very similar to a splitter cable. In a way, it is a splitter cable. And should be threated in a similar way. As a phy. Not as a device. Cables are replaceble without netdevice reappearing. This linecards are the same. Therefore, the concept of provisioning makes sense for them, as it does for splitter cable. >exactly what was plugged in to a slot so the control plane S/W can >respond accordingly. Surely there is a way for processes on the LC to >communicate with a process on the management card - even if it is inband >packets with special headers. If a device is capable of splitter cable/linecard hotplug, sure, that may be implemented. But the user has to configure it as such, to be aware that "cable change" may move netdevices around.
> I don't see any way. The userspace is the one who can get the info, from > the i2c driver. The mlxsw driver has no means to get that info itself. Hi Jiri Please can you tell us more about this i2c driver. Do you have any architecture pictures? It is not unknown for one driver to embed another driver inside it. So the i2c driver could be inside the mlxsw. It is also possible to link drivers together, the mlxsw could go find the i2c driver and make use of its services. Andrew
Fri, Jan 22, 2021 at 03:13:12PM CET, andrew@lunn.ch wrote: >> I don't see any way. The userspace is the one who can get the info, from >> the i2c driver. The mlxsw driver has no means to get that info itself. > >Hi Jiri > >Please can you tell us more about this i2c driver. Do you have any >architecture pictures? Quoting Vadim Pasternak: " Not upstreamed yet. It will be mlxreg-lc driver for line card in drivers/platfrom/mellanox and additional mlxreg-pm for line card powering on/off, setting enable/disable and handling power off upon thermal shutdown event. " > >It is not unknown for one driver to embed another driver inside it. So >the i2c driver could be inside the mlxsw. It is also possible to link >drivers together, the mlxsw could go find the i2c driver and make use >of its services. Okay. Do you have examples? How could the kernel figure out the relation of the instances? > > Andrew
On Tue, Jan 26, 2021 at 12:33:26PM +0100, Jiri Pirko wrote: > Fri, Jan 22, 2021 at 03:13:12PM CET, andrew@lunn.ch wrote: > >> I don't see any way. The userspace is the one who can get the info, from > >> the i2c driver. The mlxsw driver has no means to get that info itself. > > > >Hi Jiri > > > >Please can you tell us more about this i2c driver. Do you have any > >architecture pictures? > > Quoting Vadim Pasternak: > " > Not upstreamed yet. > It will be mlxreg-lc driver for line card in drivers/platfrom/mellanox and > additional mlxreg-pm for line card powering on/off, setting enable/disable > and handling power off upon thermal shutdown event. > " > > > > > >It is not unknown for one driver to embed another driver inside it. So > >the i2c driver could be inside the mlxsw. It is also possible to link > >drivers together, the mlxsw could go find the i2c driver and make use > >of its services. > > Okay. Do you have examples? How could the kernel figure out the relation > of the instances? Hi Jiri One driver, embedded into another? You actually submitted an example: commit 6882b0aee180f2797b8803bdf699aa45c2e5f2d6 Author: Vadim Pasternak <vadimp@mellanox.com> Date: Wed Nov 16 15:20:44 2016 +0100 mlxsw: Introduce support for I2C bus Add I2C bus implementation for Mellanox Technologies Switch ASICs. This includes command interface implementation using input / out mailboxes, whose location is retrieved from the firmware during probe time. Signed-off-by: Vadim Pasternak <vadimp@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> There are Linux standard APIs for controlling the power to devices, the regulator API. So i assume mlxreg-pm will make use of that. There are also standard APIs for thermal management, which again, mlxreg-pm should be using. The regulator API allows you to find regulators by name. So just define a sensible naming convention, and the switch driver can lookup the regulator, and turn it on/off as needed. I'm guessing there are no standard Linux API which mlxreg-lc fits. I'm also not sure it offers anything useful standalone. So i would actually embed it inside the switchdev driver, and have internal APIs to get information about the line card. But i'm missing big picture architecture knowledge here, there could be reasons why these suggestions don't work. Andrew
Tue, Jan 26, 2021 at 02:56:08PM CET, andrew@lunn.ch wrote: >On Tue, Jan 26, 2021 at 12:33:26PM +0100, Jiri Pirko wrote: >> Fri, Jan 22, 2021 at 03:13:12PM CET, andrew@lunn.ch wrote: >> >> I don't see any way. The userspace is the one who can get the info, from >> >> the i2c driver. The mlxsw driver has no means to get that info itself. >> > >> >Hi Jiri >> > >> >Please can you tell us more about this i2c driver. Do you have any >> >architecture pictures? >> >> Quoting Vadim Pasternak: >> " >> Not upstreamed yet. >> It will be mlxreg-lc driver for line card in drivers/platfrom/mellanox and >> additional mlxreg-pm for line card powering on/off, setting enable/disable >> and handling power off upon thermal shutdown event. >> " >> >> >> > >> >It is not unknown for one driver to embed another driver inside it. So >> >the i2c driver could be inside the mlxsw. It is also possible to link >> >drivers together, the mlxsw could go find the i2c driver and make use >> >of its services. >> >> Okay. Do you have examples? How could the kernel figure out the relation >> of the instances? > >Hi Jiri > >One driver, embedded into another? You actually submitted an example: > >commit 6882b0aee180f2797b8803bdf699aa45c2e5f2d6 >Author: Vadim Pasternak <vadimp@mellanox.com> >Date: Wed Nov 16 15:20:44 2016 +0100 > > mlxsw: Introduce support for I2C bus > > Add I2C bus implementation for Mellanox Technologies Switch ASICs. > This includes command interface implementation using input / out mailboxes, > whose location is retrieved from the firmware during probe time. > > Signed-off-by: Vadim Pasternak <vadimp@mellanox.com> > Reviewed-by: Ido Schimmel <idosch@mellanox.com> > Signed-off-by: Jiri Pirko <jiri@mellanox.com> > Signed-off-by: David S. Miller <davem@davemloft.net> > >There are Linux standard APIs for controlling the power to devices, >the regulator API. So i assume mlxreg-pm will make use of that. There >are also standard APIs for thermal management, which again, mlxreg-pm >should be using. The regulator API allows you to find regulators by >name. So just define a sensible naming convention, and the switch >driver can lookup the regulator, and turn it on/off as needed. I don't think it would apply. The thing is, i2c driver has a channel to the linecard eeprom, from where it can read info about the linecard. The i2c driver also knows when the linecard is plugged in, unlike mlxsw. It acts as a standalone driver. Mlxsw has no way to directly find if the card was plugged in (unpowered) and which type it is. Not sure how to "embed" it. I don't think any existing API could help. Basicall mlxsw would have to register a callback to the i2c driver called every time card is inserted to do auto-provision. Now consider a case when there are multiple instances of the ASIC on the system. How to assemble a relationship between mlxsw instance and i2c driver instance? But again, auto-provision is only one usecase. Manual provisioning is needed anyway. And that is exactly what my patchset is aiming to introduce. Auto-provision can be added when/if needed later on. > >I'm guessing there are no standard Linux API which mlxreg-lc fits. I'm >also not sure it offers anything useful standalone. So i would >actually embed it inside the switchdev driver, and have internal APIs >to get information about the line card. > >But i'm missing big picture architecture knowledge here, there could >be reasons why these suggestions don't work. > > Andrew
> >There are Linux standard APIs for controlling the power to devices, > >the regulator API. So i assume mlxreg-pm will make use of that. There > >are also standard APIs for thermal management, which again, mlxreg-pm > >should be using. The regulator API allows you to find regulators by > >name. So just define a sensible naming convention, and the switch > >driver can lookup the regulator, and turn it on/off as needed. > > > I don't think it would apply. The thing is, i2c driver has a channel to > the linecard eeprom, from where it can read info about the linecard. The > i2c driver also knows when the linecard is plugged in, unlike mlxsw. > It acts as a standalone driver. Mlxsw has no way to directly find if the > card was plugged in (unpowered) and which type it is. > > Not sure how to "embed" it. I don't think any existing API could help. > Basicall mlxsw would have to register a callback to the i2c driver > called every time card is inserted to do auto-provision. > Now consider a case when there are multiple instances of the ASIC on the > system. How to assemble a relationship between mlxsw instance and i2c > driver instance? You have that knowledge already, otherwise you cannot solve this problem at all. The switch is an PCIe device right? So when the bus is enumerated, the driver loads. How do you bind the i2c driver to the i2c bus? You cannot enumerate i2c, so you must have some hard coded knowledge somewhere? You just need to get that knowledge into the mlxsw driver so it can bind its internal i2c client driver to the i2c bus. That way you avoid user space, i guess maybe udev rules, or some daemon monitoring propriety /sys files? > But again, auto-provision is only one usecase. Manual provisioning is > needed anyway. And that is exactly what my patchset is aiming to > introduce. Auto-provision can be added when/if needed later on. I still don't actually get this use case. Why would i want to manually provision? Andrew
On 1/27/21 7:14 AM, Andrew Lunn wrote: >> I don't think it would apply. The thing is, i2c driver has a channel to >> the linecard eeprom, from where it can read info about the linecard. The >> i2c driver also knows when the linecard is plugged in, unlike mlxsw. >> It acts as a standalone driver. Mlxsw has no way to directly find if the >> card was plugged in (unpowered) and which type it is. >> >> Not sure how to "embed" it. I don't think any existing API could help. >> Basicall mlxsw would have to register a callback to the i2c driver >> called every time card is inserted to do auto-provision. >> Now consider a case when there are multiple instances of the ASIC on the >> system. How to assemble a relationship between mlxsw instance and i2c >> driver instance? > > You have that knowledge already, otherwise you cannot solve this > problem at all. The switch is an PCIe device right? So when the bus is > enumerated, the driver loads. How do you bind the i2c driver to the > i2c bus? You cannot enumerate i2c, so you must have some hard coded > knowledge somewhere? You just need to get that knowledge into the > mlxsw driver so it can bind its internal i2c client driver to the i2c > bus. That way you avoid user space, i guess maybe udev rules, or some > daemon monitoring propriety /sys files? > >> But again, auto-provision is only one usecase. Manual provisioning is >> needed anyway. And that is exactly what my patchset is aiming to >> introduce. Auto-provision can be added when/if needed later on. > > I still don't actually get this use case. Why would i want to manually > provision? > +1.
Wed, Jan 27, 2021 at 03:14:34PM CET, andrew@lunn.ch wrote: >> >There are Linux standard APIs for controlling the power to devices, >> >the regulator API. So i assume mlxreg-pm will make use of that. There >> >are also standard APIs for thermal management, which again, mlxreg-pm >> >should be using. The regulator API allows you to find regulators by >> >name. So just define a sensible naming convention, and the switch >> >driver can lookup the regulator, and turn it on/off as needed. >> >> >> I don't think it would apply. The thing is, i2c driver has a channel to >> the linecard eeprom, from where it can read info about the linecard. The >> i2c driver also knows when the linecard is plugged in, unlike mlxsw. >> It acts as a standalone driver. Mlxsw has no way to directly find if the >> card was plugged in (unpowered) and which type it is. >> >> Not sure how to "embed" it. I don't think any existing API could help. >> Basicall mlxsw would have to register a callback to the i2c driver >> called every time card is inserted to do auto-provision. >> Now consider a case when there are multiple instances of the ASIC on the >> system. How to assemble a relationship between mlxsw instance and i2c >> driver instance? > >You have that knowledge already, otherwise you cannot solve this No I don't have it. I'm not sure why do you say so. The mlxsw and i2c driver act independently. >problem at all. The switch is an PCIe device right? So when the bus is >enumerated, the driver loads. How do you bind the i2c driver to the >i2c bus? You cannot enumerate i2c, so you must have some hard coded >knowledge somewhere? You just need to get that knowledge into the >mlxsw driver so it can bind its internal i2c client driver to the i2c There is no internal i2c client driver for this. >bus. That way you avoid user space, i guess maybe udev rules, or some >daemon monitoring propriety /sys files? > >> But again, auto-provision is only one usecase. Manual provisioning is >> needed anyway. And that is exactly what my patchset is aiming to >> introduce. Auto-provision can be added when/if needed later on. > >I still don't actually get this use case. Why would i want to manually >provision? Because user might want to see the system with all netdevices, configure them, change the linecard if they got broken and all config, like bridge, tc, etc will stay on the netdevices. Again, this is the same we do for split port. This is important requirement, user don't want to see netdevices come and go when he is plugging/unplugging cables. Linecards are the same in this matter. Basically is is a "splitter module", replacing the "splitter cable" > > Andrew
On Thu, Jan 28, 2021 at 09:14:34AM +0100, Jiri Pirko wrote: > Wed, Jan 27, 2021 at 03:14:34PM CET, andrew@lunn.ch wrote: > >> >There are Linux standard APIs for controlling the power to devices, > >> >the regulator API. So i assume mlxreg-pm will make use of that. There > >> >are also standard APIs for thermal management, which again, mlxreg-pm > >> >should be using. The regulator API allows you to find regulators by > >> >name. So just define a sensible naming convention, and the switch > >> >driver can lookup the regulator, and turn it on/off as needed. > >> > >> > >> I don't think it would apply. The thing is, i2c driver has a channel to > >> the linecard eeprom, from where it can read info about the linecard. The > >> i2c driver also knows when the linecard is plugged in, unlike mlxsw. > >> It acts as a standalone driver. Mlxsw has no way to directly find if the > >> card was plugged in (unpowered) and which type it is. > >> > >> Not sure how to "embed" it. I don't think any existing API could help. > >> Basicall mlxsw would have to register a callback to the i2c driver > >> called every time card is inserted to do auto-provision. > >> Now consider a case when there are multiple instances of the ASIC on the > >> system. How to assemble a relationship between mlxsw instance and i2c > >> driver instance? > > > >You have that knowledge already, otherwise you cannot solve this > > No I don't have it. I'm not sure why do you say so. The mlxsw and i2c > driver act independently. Ah, so you just export some information in /sys from the i2c driver? And you expect the poor user to look at the values, and copy paste them to the correct mlxsw instance? 50/50 guess if you have two switches, and hope they don't make a typO? > >I still don't actually get this use case. Why would i want to manually > >provision? > > Because user might want to see the system with all netdevices, configure > them, change the linecard if they got broken and all config, like > bridge, tc, etc will stay on the netdevices. Again, this is the same we > do for split port. This is important requirement, user don't want to see > netdevices come and go when he is plugging/unplugging cables. Linecards > are the same in this matter. Basically is is a "splitter module", > replacing the "splitter cable" So, what is the real use case here? Why might the user want to do this? Is it: The magic smoke has escaped. The user takes a spare switch, and wants to put it on her desk to configure it where she has a comfy chair and piece and quiet, unlike in the data centre, which is very noise, only has hard plastic chair, no coffee allowed. She makes her best guess at the configuration, up/downs the interfaces, reboots, to make sure it is permanent, and only then moves to the data centre to swap the dead router for the new one, and fix up whatever configuration errors there are, while sat on the hard chair? So this feature is about comfy chair vs hard chair? I'm also wondering about the splitter port use case. At what point do you tell the user that it is physically impossible to split the port because the SFP simply does not support it? You say the netdevs don't come/go. I assume the link never goes up, but how does the user know the configuration is FUBAR, not the SFP? To me, it seems a lot more intuitive that when i remove an SFP which has been split into 4, and pop in an SFP which only supports a single stream, the 3 extra netdevs would just vanish. Andrew
Thu, Jan 28, 2021 at 03:17:13PM CET, andrew@lunn.ch wrote: >On Thu, Jan 28, 2021 at 09:14:34AM +0100, Jiri Pirko wrote: >> Wed, Jan 27, 2021 at 03:14:34PM CET, andrew@lunn.ch wrote: >> >> >There are Linux standard APIs for controlling the power to devices, >> >> >the regulator API. So i assume mlxreg-pm will make use of that. There >> >> >are also standard APIs for thermal management, which again, mlxreg-pm >> >> >should be using. The regulator API allows you to find regulators by >> >> >name. So just define a sensible naming convention, and the switch >> >> >driver can lookup the regulator, and turn it on/off as needed. >> >> >> >> >> >> I don't think it would apply. The thing is, i2c driver has a channel to >> >> the linecard eeprom, from where it can read info about the linecard. The >> >> i2c driver also knows when the linecard is plugged in, unlike mlxsw. >> >> It acts as a standalone driver. Mlxsw has no way to directly find if the >> >> card was plugged in (unpowered) and which type it is. >> >> >> >> Not sure how to "embed" it. I don't think any existing API could help. >> >> Basicall mlxsw would have to register a callback to the i2c driver >> >> called every time card is inserted to do auto-provision. >> >> Now consider a case when there are multiple instances of the ASIC on the >> >> system. How to assemble a relationship between mlxsw instance and i2c >> >> driver instance? >> > >> >You have that knowledge already, otherwise you cannot solve this >> >> No I don't have it. I'm not sure why do you say so. The mlxsw and i2c >> driver act independently. > >Ah, so you just export some information in /sys from the i2c driver? >And you expect the poor user to look at the values, and copy paste >them to the correct mlxsw instance? 50/50 guess if you have two >switches, and hope they don't make a typO? Which values are you talking about here exactly? > >> >I still don't actually get this use case. Why would i want to manually >> >provision? >> >> Because user might want to see the system with all netdevices, configure >> them, change the linecard if they got broken and all config, like >> bridge, tc, etc will stay on the netdevices. Again, this is the same we >> do for split port. This is important requirement, user don't want to see >> netdevices come and go when he is plugging/unplugging cables. Linecards >> are the same in this matter. Basically is is a "splitter module", >> replacing the "splitter cable" > >So, what is the real use case here? Why might the user want to do >this? > >Is it: The magic smoke has escaped. The user takes a spare switch, and >wants to put it on her desk to configure it where she has a comfy chair >and piece and quiet, unlike in the data centre, which is very noise, >only has hard plastic chair, no coffee allowed. She makes her best >guess at the configuration, up/downs the interfaces, reboots, to make >sure it is permanent, and only then moves to the data centre to swap >the dead router for the new one, and fix up whatever configuration >errors there are, while sat on the hard chair? > >So this feature is about comfy chair vs hard chair? I don't really get the question, but configuring switch w/o any linecard and plug the linecards in later on is definitelly a usecase. > >I'm also wondering about the splitter port use case. At what point do >you tell the user that it is physically impossible to split the port >because the SFP simply does not support it? You say the netdevs don't >come/go. I assume the link never goes up, but how does the user know >the configuration is FUBAR, not the SFP? To me, it seems a lot more >intuitive that when i remove an SFP which has been split into 4, and >pop in an SFP which only supports a single stream, the 3 extra netdevs >would just vanish. As I wrote easlier in this thread, for hw that supports it, there should be possibility to turn on "autosplit" mode that would do exactly what you describe. But depends on a usecase. User should be in power to configure "autosplit" for split cables and "autodetect" for linecards. Both should be treated in the same way I believe. > > Andrew >
> -----Original Message----- > From: Andrew Lunn <andrew@lunn.ch> > Sent: Friday, January 29, 2021 5:50 PM > To: Jiri Pirko <jiri@resnulli.us> > Cc: David Ahern <dsahern@gmail.com>; Jakub Kicinski <kuba@kernel.org>; > netdev@vger.kernel.org; davem@davemloft.net; jacob.e.keller@intel.com; > Roopa Prabhu <roopa@nvidia.com>; mlxsw <mlxsw@nvidia.com>; Vadim > Pasternak <vadimp@nvidia.com> > Subject: Re: [patch net-next RFC 00/10] introduce line card support for > modular switch > > On Fri, Jan 29, 2021 at 08:20:15AM +0100, Jiri Pirko wrote: > > Thu, Jan 28, 2021 at 03:17:13PM CET, andrew@lunn.ch wrote: > > >On Thu, Jan 28, 2021 at 09:14:34AM +0100, Jiri Pirko wrote: > > >> Wed, Jan 27, 2021 at 03:14:34PM CET, andrew@lunn.ch wrote: > > >> >> >There are Linux standard APIs for controlling the power to > > >> >> >devices, the regulator API. So i assume mlxreg-pm will make use > > >> >> >of that. There are also standard APIs for thermal management, > > >> >> >which again, mlxreg-pm should be using. The regulator API > > >> >> >allows you to find regulators by name. So just define a > > >> >> >sensible naming convention, and the switch driver can lookup the > regulator, and turn it on/off as needed. > > >> >> > > >> >> > > >> >> I don't think it would apply. The thing is, i2c driver has a > > >> >> channel to the linecard eeprom, from where it can read info > > >> >> about the linecard. The i2c driver also knows when the linecard is > plugged in, unlike mlxsw. > > >> >> It acts as a standalone driver. Mlxsw has no way to directly > > >> >> find if the card was plugged in (unpowered) and which type it is. > > >> >> > > >> >> Not sure how to "embed" it. I don't think any existing API could help. > > >> >> Basicall mlxsw would have to register a callback to the i2c > > >> >> driver called every time card is inserted to do auto-provision. > > >> >> Now consider a case when there are multiple instances of the > > >> >> ASIC on the system. How to assemble a relationship between mlxsw > > >> >> instance and i2c driver instance? > > >> > > > >> >You have that knowledge already, otherwise you cannot solve this > > >> > > >> No I don't have it. I'm not sure why do you say so. The mlxsw and > > >> i2c driver act independently. > > > > > >Ah, so you just export some information in /sys from the i2c driver? > > >And you expect the poor user to look at the values, and copy paste > > >them to the correct mlxsw instance? 50/50 guess if you have two > > >switches, and hope they don't make a typO? > > > > Which values are you talking about here exactly? > > The i2c driver tells you what line card is actually inserted. > Hopefully it interprets the EEPROM and gives the user a nice string. You then > need to use this string to provision the switch, so it knows what line card has > been inserted. Or the user can pre-prevision, make a guess as to what card will > actually be inserted sometime in the future, tell the switch, and hope that > actually happens. Hi Andrew, mlxsw I2C driver is BMC side driver. Its purpose to provide hwmon, thermal, QSFP info for the chassis management at BMC side. It works on top of PRM interface and it is associated with the chip I2C slave device. It doesn't aware of system topology, it knows nothing about system I2C tree, what is EEPROM, where it located and so on. This is not a scope of this driver. Platform line card driver is aware of line card I2C topology, its responsibility is to detect line card basic hardware type, create I2C topology (mux), connect all the necessary I2C devices, like hotswap devices, voltage and power regulators devices, iio/a2d devices and line card EEPROMs, creates LED instances for LED located on a line card, exposes line card related attributes, like CPLD and FPGA versions, reset causes, required powered through line card hwmon interface. > > > >> >I still don't actually get this use case. Why would i want to > > >> >manually provision? > > >> > > >> Because user might want to see the system with all netdevices, > > >> configure them, change the linecard if they got broken and all > > >> config, like bridge, tc, etc will stay on the netdevices. Again, > > >> this is the same we do for split port. This is important > > >> requirement, user don't want to see netdevices come and go when he > > >> is plugging/unplugging cables. Linecards are the same in this > > >> matter. Basically is is a "splitter module", replacing the "splitter cable" > > > > > >So, what is the real use case here? Why might the user want to do > > >this? > > > > > >Is it: The magic smoke has escaped. The user takes a spare switch, > > >and wants to put it on her desk to configure it where she has a comfy > > >chair and piece and quiet, unlike in the data centre, which is very > > >noise, only has hard plastic chair, no coffee allowed. She makes her > > >best guess at the configuration, up/downs the interfaces, reboots, to > > >make sure it is permanent, and only then moves to the data centre to > > >swap the dead router for the new one, and fix up whatever > > >configuration errors there are, while sat on the hard chair? > > > > > >So this feature is about comfy chair vs hard chair? > > > > I don't really get the question, but configuring switch w/o any > > linecard and plug the linecards in later on is definitelly a usecase. > > It is a requirement, not a use case. > > A use case is the big picture, what is the user doing, at the big picture level. In > the somewhat absurd example given above, the user case is, the router chassis > has died, but they think the line cards are O.K. They want to do as much > configuration and testing as possible before going into the data center to > actually replace the chassis. By reusing the existing line cards, they reduce the > risk of getting the cables plugged into the wrong port. > > From the use case, you can derive the requirements. In order to test that ifup -- > all puts the IP addresses in the right places, it needs to have the netdevs with > the correct names. Either they need line cards in the router, or they need to be > able to fake line cards. There is also a requirement that the line cards are easy > to exchange. They do not need to be fully hot-plugable, since one router is > dead, the other can be powered off. But ideally you want simple thumb > screws, not a Philips screwdriver or an allan key. There is also a requirement > that this provision is persistent, since the user is likely to reboot the system in > order to test the configurations files actually work at boot time. Either the > switch driver needs to write the information to FLASH, or user space needs to > tell it on every boot, a systemd service file or similar. > > But since you have not been able to answer my question, i wonder if > everything is backwards around here. Your architecture is broken, you cannot > easily determine what line card is inserted, so you need a workaround, > provisioning. But provision might actually be a useful feature, so lets try to sell > the feature, and gloss over that the architecture is broken. > > So, i would like to see the architecture fixed first. The switch driver somehow > talks to the i2c driver to find out what card is in the slot, and configures itself. > My guess is, every other switch does this, this is what the user expects as a > base feature, it is what we want Linux to do by default. > > You can later add provisioning, where if the slot is empty, you can fake a line > card, to fulfil the use cases. And when the slot is actually filled, you can verify > what is plugged in matches what was expected, and be very noise if not. > > Andrew
> Platform line card driver is aware of line card I2C topology, its > responsibility is to detect line card basic hardware type, create I2C > topology (mux), connect all the necessary I2C devices, like hotswap > devices, voltage and power regulators devices, iio/a2d devices and line > card EEPROMs, creates LED instances for LED located on a line card, exposes > line card related attributes, like CPLD and FPGA versions, reset causes, > required powered through line card hwmon interface. So this driver, and the switch driver need to talk to each other, so the switch driver actually knows what, if anything, is in the slot. Andrew
Fri, Jan 29, 2021 at 06:31:59PM CET, andrew@lunn.ch wrote: >> Platform line card driver is aware of line card I2C topology, its >> responsibility is to detect line card basic hardware type, create I2C >> topology (mux), connect all the necessary I2C devices, like hotswap >> devices, voltage and power regulators devices, iio/a2d devices and line >> card EEPROMs, creates LED instances for LED located on a line card, exposes >> line card related attributes, like CPLD and FPGA versions, reset causes, >> required powered through line card hwmon interface. > >So this driver, and the switch driver need to talk to each other, so >the switch driver actually knows what, if anything, is in the slot. Not possible in case the BMC is a different host, which is common scenario.
> Platform line card driver is aware of line card I2C topology, its > responsibility is to detect line card basic hardware type, create I2C > topology (mux), connect all the necessary I2C devices, like hotswap > devices, voltage and power regulators devices, iio/a2d devices and line > card EEPROMs, creates LED instances for LED located on a line card, exposes > line card related attributes, like CPLD and FPGA versions, reset causes, > required powered through line card hwmon interface. Jiri says the hardware is often connected to the BMC. But you do expose much of this to the host as well? You want devlink dev info to show the version information. Use devlink dev flash to upgrade the bitfile in the CPD and FPGA. The hwmon instances are pretty pointless on the BMC where nobody can see them. Are there temperature sensors involved? The host is where the thermal policy is running, deciding what to throttle, or shut down when it gets too hot. LEDs can be controlled via /sys/class/led as expected? So exporting what the line card actually is to the host is not really a problem, it is just one more bit of information amongst everything else already exposed to it. Andrew
Sun, Jan 31, 2021 at 06:09:24PM CET, dsahern@gmail.com wrote: >On 1/30/21 7:19 AM, Jiri Pirko wrote: >> Fri, Jan 29, 2021 at 06:31:59PM CET, andrew@lunn.ch wrote: >>>> Platform line card driver is aware of line card I2C topology, its >>>> responsibility is to detect line card basic hardware type, create I2C >>>> topology (mux), connect all the necessary I2C devices, like hotswap >>>> devices, voltage and power regulators devices, iio/a2d devices and line >>>> card EEPROMs, creates LED instances for LED located on a line card, exposes >>>> line card related attributes, like CPLD and FPGA versions, reset causes, >>>> required powered through line card hwmon interface. >>> >>> So this driver, and the switch driver need to talk to each other, so >>> the switch driver actually knows what, if anything, is in the slot. >> >> Not possible in case the BMC is a different host, which is common >> scenario. >> > >User provisions a 4 port card, but a 2 port card is inserted. How is >this detected and the user told the wrong card is inserted? The card won't get activated. The user won't see the type of inserted linecard. Again, it is not possible for ASIC to access the linecard eeprom. See Vadim's reply. > >If it is not detected that's a serious problem, no? That is how it is, unfortunatelly. > >If it is detected why can't the same mechanism be used for auto >provisioning? Again, not possible to detect.
On Mon, Feb 01, 2021 at 09:16:41AM +0100, Jiri Pirko wrote: > Sun, Jan 31, 2021 at 06:09:24PM CET, dsahern@gmail.com wrote: > >On 1/30/21 7:19 AM, Jiri Pirko wrote: > >> Fri, Jan 29, 2021 at 06:31:59PM CET, andrew@lunn.ch wrote: > >>>> Platform line card driver is aware of line card I2C topology, its > >>>> responsibility is to detect line card basic hardware type, create I2C > >>>> topology (mux), connect all the necessary I2C devices, like hotswap > >>>> devices, voltage and power regulators devices, iio/a2d devices and line > >>>> card EEPROMs, creates LED instances for LED located on a line card, exposes > >>>> line card related attributes, like CPLD and FPGA versions, reset causes, > >>>> required powered through line card hwmon interface. > >>> > >>> So this driver, and the switch driver need to talk to each other, so > >>> the switch driver actually knows what, if anything, is in the slot. > >> > >> Not possible in case the BMC is a different host, which is common > >> scenario. > >> > > > >User provisions a 4 port card, but a 2 port card is inserted. How is > >this detected and the user told the wrong card is inserted? > > The card won't get activated. > The user won't see the type of inserted linecard. Again, it is not > possible for ASIC to access the linecard eeprom. See Vadim's reply. > > > > > >If it is not detected that's a serious problem, no? > > That is how it is, unfortunatelly. > > > > > >If it is detected why can't the same mechanism be used for auto > >provisioning? > > Again, not possible to detect. If the platform line card driver is running in the host, you can detect it. From your wording, it sounds like some systems do have this driver in the host. So please add the needed code. When the platform line card driver is on the BMC, you need a proxy in between. Isn't this what IPMI and Redfish is all about? The proxy driver can offer the same interface as the platform line card driver. Andrew
Mon, Feb 01, 2021 at 02:41:07PM CET, andrew@lunn.ch wrote: >On Mon, Feb 01, 2021 at 09:16:41AM +0100, Jiri Pirko wrote: >> Sun, Jan 31, 2021 at 06:09:24PM CET, dsahern@gmail.com wrote: >> >On 1/30/21 7:19 AM, Jiri Pirko wrote: >> >> Fri, Jan 29, 2021 at 06:31:59PM CET, andrew@lunn.ch wrote: >> >>>> Platform line card driver is aware of line card I2C topology, its >> >>>> responsibility is to detect line card basic hardware type, create I2C >> >>>> topology (mux), connect all the necessary I2C devices, like hotswap >> >>>> devices, voltage and power regulators devices, iio/a2d devices and line >> >>>> card EEPROMs, creates LED instances for LED located on a line card, exposes >> >>>> line card related attributes, like CPLD and FPGA versions, reset causes, >> >>>> required powered through line card hwmon interface. >> >>> >> >>> So this driver, and the switch driver need to talk to each other, so >> >>> the switch driver actually knows what, if anything, is in the slot. >> >> >> >> Not possible in case the BMC is a different host, which is common >> >> scenario. >> >> >> > >> >User provisions a 4 port card, but a 2 port card is inserted. How is >> >this detected and the user told the wrong card is inserted? >> >> The card won't get activated. >> The user won't see the type of inserted linecard. Again, it is not >> possible for ASIC to access the linecard eeprom. See Vadim's reply. >> >> >> > >> >If it is not detected that's a serious problem, no? >> >> That is how it is, unfortunatelly. >> >> >> > >> >If it is detected why can't the same mechanism be used for auto >> >provisioning? >> >> Again, not possible to detect. > >If the platform line card driver is running in the host, you can >detect it. From your wording, it sounds like some systems do have this >driver in the host. So please add the needed code. But if not, it cannot. We still need the provisioning then. > >When the platform line card driver is on the BMC, you need a proxy in >between. Isn't this what IPMI and Redfish is all about? The proxy >driver can offer the same interface as the platform line card driver. Do you have any example of kernel driver which is doing some thing like that? > > Andrew
> >When the platform line card driver is on the BMC, you need a proxy in > >between. Isn't this what IPMI and Redfish is all about? The proxy > >driver can offer the same interface as the platform line card driver. > > Do you have any example of kernel driver which is doing some thing like > that? drivers/hwmon/ibmaem.c is a pretty normal looking HWMON driver, for temperature/power/energy sensors which are connected to the BMC and accessed over IPMI. char/ipmi/ipmi_watchdog.c as the name suggests is a watchdog. At first glance its API to user space follows the standard API, even if it does not make use of the watchdog subsystem core. These two should give you examples of how you talk to the BMC from a kernel driver. Andrew
From: Jiri Pirko <jiri@nvidia.com> This patchset introduces support for modular switch systems. NVIDIA Mellanox SN4800 is an example of such. It contains 8 slots to accomodate line cards. Available line cards include: 16X 100GbE (QSFP28) 8X 200GbE (QSFP56) 4X 400GbE (QSFP-DD) Similar to split cabels, it is essencial for the correctness of configuration and funcionality to treat the line card entities in the same way, no matter the line card is inserted or not. Meaning, the netdevice of a line card port cannot just disappear when line card is removed. Also, system admin needs to be able to apply configuration on netdevices belonging to line card port even before the linecard gets inserted. To resolve this, a concept of "provisioning" is introduced. The user may "provision" certain slot with a line card type. Driver then creates all instances (devlink ports, netdevices, etc) related to this line card type. The carrier of netdevices stays down. Once the line card is inserted and activated, the carrier of the related netdevices goes up. Once user does not want to use the line card related instances anymore, he can "unprovision" the slot. Driver then removes the instances. Patches 1-5 are extending devlink driver API and UAPI in order to register, show, dump, provision and activate the line card. Patches 6-9 are implementing the introduced API in netdevsim Example: # Create a new netdevsim device, with no ports and 2 line cards: $ echo "10 0 2" >/sys/bus/netdevsim/new_device $ devlink port # No ports are listed $ devlink lc netdevsim/netdevsim10: lc 0 state unprovisioned supported_types: card1port card2ports card4ports lc 1 state unprovisioned supported_types: card1port card2ports card4ports # Note that driver advertizes supported line card types. In case of # netdevsim, these are 3. $ devlink lc provision netdevsim/netdevsim10 lc 0 type card4ports $ devlink lc netdevsim/netdevsim10: lc 0 state provisioned type card4ports supported_types: card1port card2ports card4ports lc 1 state unprovisioned supported_types: card1port card2ports card4ports $ devlink port netdevsim/netdevsim10/1000: type eth netdev eni10nl0p1 flavour physical lc 0 port 1 splittable false netdevsim/netdevsim10/1001: type eth netdev eni10nl0p2 flavour physical lc 0 port 2 splittable false netdevsim/netdevsim10/1002: type eth netdev eni10nl0p3 flavour physical lc 0 port 3 splittable false netdevsim/netdevsim10/1003: type eth netdev eni10nl0p4 flavour physical lc 0 port 4 splittable false # ^^ ^^^^ # netdev name adjusted index of a line card this port belongs to $ ip link set eni10nl0p1 up $ ip link show eni10nl0p1 165: eni10nl0p1: <NO-CARRIER,BROADCAST,NOARP,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff # Now activate the line card using debugfs. That emulates plug-in event # on real hardware: $ echo "Y"> /sys/kernel/debug/netdevsim/netdevsim10/linecards/0/active $ ip link show eni10nl0p1 165: eni10nl0p1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether 7e:2d:05:93:d3:d1 brd ff:ff:ff:ff:ff:ff # The carrier is UP now. Jiri Pirko (10): devlink: add support to create line card and expose to user devlink: implement line card provisioning devlink: implement line card active state devlink: append split port number to the port name devlink: add port to line card relationship set netdevsim: introduce line card support netdevsim: allow port objects to be linked with line cards netdevsim: create devlink line card object and implement provisioning netdevsim: implement line card activation selftests: add netdevsim devlink lc test drivers/net/netdevsim/bus.c | 21 +- drivers/net/netdevsim/dev.c | 370 ++++++++++++++- drivers/net/netdevsim/netdev.c | 2 + drivers/net/netdevsim/netdevsim.h | 23 + include/net/devlink.h | 44 ++ include/uapi/linux/devlink.h | 25 + net/core/devlink.c | 443 +++++++++++++++++- .../drivers/net/netdevsim/devlink.sh | 62 ++- 8 files changed, 964 insertions(+), 26 deletions(-)