Message ID | 304963d62ed1fa5f75437d1f832830d7970f9919.1658943678.git.ecree.xilinx@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Series | sfc: VF representors for EF100 - RX side |
On Wed, 27 Jul 2022 18:46:02 +0100 ecree@xilinx.com wrote:
> From: Edward Cree <ecree.xilinx@gmail.com>
>
> When setting the VF rep's MAC address, set the provisioned MAC address
> for the VF through MC_CMD_SET_CLIENT_MAC_ADDRESSES.

Wait.. hm? The VF rep is not the VF. It's the other side of the wire.
Are you passing the VF rep's MAC on the VF? Ethernet packets between
the hypervisor and the VF would have the same SA and DA.
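The objection can be made concrete with a small sketch. It assumes, as the patch description implies, that the VF and its representor end up holding the same MAC address. The names `eth_hdr`, `build_hdr` and `hdr_is_self_addressed` are hypothetical and come from neither the sfc driver nor the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/* Just the two address fields of an Ethernet header (hypothetical type). */
struct eth_hdr {
	uint8_t dst[ETH_ALEN];	/* DA */
	uint8_t src[ETH_ALEN];	/* SA */
};

/* Address a frame sent from one end of the virtual wire to the other. */
static void build_hdr(struct eth_hdr *h, const uint8_t *local,
		      const uint8_t *peer)
{
	assert(h && local && peer);
	memcpy(h->src, local, ETH_ALEN);
	memcpy(h->dst, peer, ETH_ALEN);
}

/* True when SA and DA are identical, the case the review objects to. */
static bool hdr_is_self_addressed(const struct eth_hdr *h)
{
	assert(h);
	return memcmp(h->src, h->dst, ETH_ALEN) == 0;
}
```

If the representor and the VF share one address, any frame built with `build_hdr(&h, rep_mac, vf_mac)` is self-addressed. With independent addresses it is not.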
On 28/07/2022 04:10, Jakub Kicinski wrote:
> On Wed, 27 Jul 2022 18:46:02 +0100 ecree@xilinx.com wrote:
>> From: Edward Cree <ecree.xilinx@gmail.com>
>>
>> When setting the VF rep's MAC address, set the provisioned MAC address
>> for the VF through MC_CMD_SET_CLIENT_MAC_ADDRESSES.
>
> Wait.. hm? The VF rep is not the VF. It's the other side of the wire.
> Are you passing the VF rep's MAC on the VF? Ethernet packets between
> the hypervisor and the VF would have the same SA and DA.
>

Yes (but only if there's an IP stack on the repr; I think it's fine if
the repr is plugged straight into a bridge so any ARP picks up a
different DA?).
I thought that was weird but I also thought that was 'how it's done'
with reps — properties of the VF are set by applying them to the rep.
Is there some other way to configure VF MAC? (Are we supposed to still
be using the legacy SR-IOV interface, .ndo_set_vf_mac()? I thought
that was deprecated in favour of more switchdev-flavoured stuff…)

-ed
On Thu, 28 Jul 2022 16:47:36 +0100 Edward Cree wrote:
> On 28/07/2022 04:10, Jakub Kicinski wrote:
> > On Wed, 27 Jul 2022 18:46:02 +0100 ecree@xilinx.com wrote:
> >> When setting the VF rep's MAC address, set the provisioned MAC address
> >> for the VF through MC_CMD_SET_CLIENT_MAC_ADDRESSES.
> >
> > Wait.. hm? The VF rep is not the VF. It's the other side of the wire.
> > Are you passing the VF rep's MAC on the VF? Ethernet packets between
> > the hypervisor and the VF would have the same SA and DA.
>
> Yes (but only if there's an IP stack on the repr; I think it's fine if
> the repr is plugged straight into a bridge so any ARP picks up a
> different DA?).
> I thought that was weird but I also thought that was 'how it's done'
> with reps — properties of the VF are set by applying them to the rep.
> Is there some other way to configure VF MAC? (Are we supposed to still
> be using the legacy SR-IOV interface, .ndo_set_vf_mac()? I thought
> that was deprecated in favour of more switchdev-flavoured stuff…)

It's set thru

    devlink port function set DEV/PORT_INDEX hw_addr ADDR

"port functions" is a weird object representing something
in Mellanox FW. Hopefully it makes more sense to you than
it does to me.
On 28/07/2022 17:20, Jakub Kicinski wrote:
> It's set thru
>
> devlink port function set DEV/PORT_INDEX hw_addr ADDR
>
> "port functions" is a weird object representing something
> in Mellanox FW. Hopefully it makes more sense to you than
> it does to me.

Hmm that does look weird, looks like it acts on a PCI device
(DEV is a PCI address) and then I'm not sure what PORT_INDEX
is meant to mean (the man page doesn't describe it at all).
Possibly it doesn't have semantics as such and is just a
synthetic index into a list of ports…
I can't say it makes sense to me either :shrug:

We did take a look at what nfp does, as well; they use the
old .ndo_set_vf_mac(), but they appear to support it both on
the PF and on the VF reprs — meaning that (AFAICT) it allows
to set the MAC address of VF 0 through the repr for VF 1.
(There is no check that I can see in nfp_app_set_vf_mac()
that the value of `int vf` matches the caller.)

Our (SN1000) approach to the problem of configuring 'remote'
functions (VFs in VMs, PFs on the embedded SoC) is to use
representors for them all (VF reps as added in this & prev
series, PF reps coming in the future. Similarly, if we
were ever to add Subfunctions, each SF would have a
corresponding SF representor that would work in much the
same way as VF reps). At which point you should always be
able to configure an object through its associated rep,
and there should never be a need for an 'index' parameter
(be that 'VF index' or 'port index').

While .ndo_set_mac_address() might be the Wrong Thing (if
we want to be able to set VF and VF-rep addresses
independently to different things), the Right Thing ought
to have the same signature (i.e. just taking a netdev and
a hwaddr). Devlink seems to me like a needless
complication here.

Anyway, since the proper direction is unclear, I'll respin
the series without patches 10-13 in the hope of getting
the rest of it in before the merge window.

-ed
On Thu, 28 Jul 2022 19:12:34 +0100 Edward Cree wrote:
> On 28/07/2022 17:20, Jakub Kicinski wrote:
> > It's set thru
> >
> > devlink port function set DEV/PORT_INDEX hw_addr ADDR
> >
> > "port functions" is a weird object representing something
> > in Mellanox FW. Hopefully it makes more sense to you than
> > it does to me.
> Hmm that does look weird, looks like it acts on a PCI device
> (DEV is a PCI address) and then I'm not sure what PORT_INDEX
> is meant to mean (the man page doesn't describe it at all).
> Possibly it doesn't have semantics as such and is just a
> synthetic index into a list of ports…
> I can't say it makes sense to me either :shrug:
>
> We did take a look at what nfp does, as well; they use the
> old .ndo_set_vf_mac(), but they appear to support it both on
> the PF and on the VF reprs — meaning that (AFAICT) it allows
> to set the MAC address of VF 0 through the repr for VF 1.
> (There is no check that I can see in nfp_app_set_vf_mac()
> that the value of `int vf` matches the caller.)

IIRC the reprs are all linked to the PCI device of the PF in sysfs,
and OpenStack would pick a device linked to the PCI parent almost
at random. So the VF reprs needed the legacy NDOs. At least that's
what I remember being told.

I think the legacy NDOs are acceptable, devlink way is preferred
(devlink way did not exist when NFP code was written).

> Our (SN1000) approach to the problem of configuring 'remote'
> functions (VFs in VMs, PFs on the embedded SoC) is to use
> representors for them all (VF reps as added in this & prev
> series, PF reps coming in the future. Similarly, if we
> were ever to add Subfunctions, each SF would have a
> corresponding SF representor that would work in much the
> same way as VF reps). At which point you should always be
> able to configure an object through its associated rep,
> and there should never be a need for an 'index' parameter
> (be that 'VF index' or 'port index').

How do you map reprs to VFs? The PCI devices of the VF may be on
a different system.

> While .ndo_set_mac_address() might be the Wrong Thing (if
> we want to be able to set VF and VF-rep addresses
> independently to different things), the Right Thing ought
> to have the same signature (i.e. just taking a netdev and
> a hwaddr). Devlink seems to me like a needless
> complication here.

But reps are like switch ports in a switch ASIC, and the PCI
device is the other side of the virtual wire. You would not be
configuring the MAC address of a peer to peer link by setting
the local address.

> Anyway, since the proper direction is unclear, I'll respin
> the series without patches 10-13 in the hope of getting
> the rest of it in before the merge window.

SG
On 28/07/2022 19:32, Jakub Kicinski wrote:
> IIRC the reprs are all linked to the PCI device of the PF in sysfs,

You mean /sys/class/net/$VFREP/device? On sfc that's (intentionally)
nonexistent; the only visible connection between repr and its owning
PF is /sys/class/net/$VFREP/phys_switch_id which holds the same value
as /sys/class/net/$PF/phys_port_id.

> How do you map reprs to VFs? The PCI devices of the VF may be on
> a different system.

That's what the client ID from patch #10 is for. We ask the FW for
a handle to "caller's PCIe controller → caller's PF → VF number
efv->idx", and that handle is what we store in efv->clid, and later
pass to MC_CMD_SET_CLIENT_MAC_ADDRESSES in patch #12.

The user determines which repr corresponds to which VF by looking in
/sys/class/net/$VFREP/phys_port_name (e.g. "p0pf0vf0").

> But reps are like switch ports in a switch ASIC, and the PCI
> device is the other side of the virtual wire. You would not be
> configuring the MAC address of a peer to peer link by setting
> the local address.

Indeed. I agree that .ndo_set_mac_address() is the wrong interface.
But the interface I have in mind would be something like
    int (*ndo_set_partner_mac_address)(struct net_device *, void *);
and would only be implemented by representor netdevs.
Idk what the uAPI/UI for that would be; probably a new `ip link set`
parameter.

-ed
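A userspace tool could recover the VF number from phys_port_name roughly as follows. This is only a sketch: it assumes the name always follows the pattern suggested by the single example "p0pf0vf0" (port, then PF, then VF), which the thread does not spell out. `vf_rep_id` and `parse_vf_rep_name` are hypothetical names, and `sscanf` is lenient (it tolerates whitespace or a sign before a number), so this is not a strict validator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Indices encoded in a VF representor's phys_port_name (hypothetical type). */
struct vf_rep_id {
	unsigned int port;
	unsigned int pf;
	unsigned int vf;
};

/*
 * Parse a name such as "p0pf0vf0". A single trailing newline is accepted
 * because sysfs attribute reads normally end with one.
 */
static bool parse_vf_rep_name(const char *name, struct vf_rep_id *id)
{
	int consumed = 0;

	assert(name && id);
	if (sscanf(name, "p%upf%uvf%u%n",
		   &id->port, &id->pf, &id->vf, &consumed) != 3)
		return false;
	if (name[consumed] == '\n')
		consumed++;
	/* Reject anything left over after the VF number. */
	return name[consumed] == '\0';
}
```

For the example from the thread, `parse_vf_rep_name("p0pf0vf0", &id)` succeeds with all three fields set to 0.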
On Thu, 28 Jul 2022 19:54:21 +0100 Edward Cree wrote:
> On 28/07/2022 19:32, Jakub Kicinski wrote:
> > How do you map reprs to VFs? The PCI devices of the VF may be on
> > a different system.
> That's what the client ID from patch #10 is for. We ask the FW for
> a handle to "caller's PCIe controller → caller's PF → VF number
> efv->idx", and that handle is what we store in efv->clid, and later
> pass to MC_CMD_SET_CLIENT_MAC_ADDRESSES in patch #12.
>
> The user determines which repr corresponds to which VF by looking in
> /sys/class/net/$VFREP/phys_port_name (e.g. "p0pf0vf0").

.. and that would also most likely be what the devlink port ID would be.

> > But reps are like switch ports in a switch ASIC, and the PCI
> > device is the other side of the virtual wire. You would not be
> > configuring the MAC address of a peer to peer link by setting
> > the local address.
> Indeed. I agree that .ndo_set_mac_address() is the wrong interface.
> But the interface I have in mind would be something like
> int (*ndo_set_partner_mac_address)(struct net_device *, void *);
> and would only be implemented by representor netdevs.
> Idk what the uAPI/UI for that would be; probably a new `ip link set`
> parameter.

Yup... If only you were there during the fight over this uAPI.
Now it's the devlink "port function" thing.
On 28/07/2022 20:27, Jakub Kicinski wrote:
> On Thu, 28 Jul 2022 19:54:21 +0100 Edward Cree wrote:
>> The user determines which repr corresponds to which VF by looking in
>> /sys/class/net/$VFREP/phys_port_name (e.g. "p0pf0vf0").
>
> .. and that would also most likely be what the devlink port ID would be.

AFAICT the devlink port index is just an integer. The example in the
man page is
    devlink port function set pci/0000:01:00.0/1 hw_addr 00:00:00:11:22:33
Moreover, struct devlink_port has `unsigned int index`. Though it does
also have `struct devlink_port_attrs` which appears to encode the PF
and VF numbers; I think those can be read with `devlink port show`.
But the whole devlink port abstraction is unnecessary when we already
*have* an object to represent the port.

>> Indeed. I agree that .ndo_set_mac_address() is the wrong interface.
>> But the interface I have in mind would be something like
>> int (*ndo_set_partner_mac_address)(struct net_device *, void *);
>> and would only be implemented by representor netdevs.
>> Idk what the uAPI/UI for that would be; probably a new `ip link set`
>> parameter.
>
> Yup... If only you were there during the fight over this uAPI.
> Now it's the devlink "port function" thing.

Sadly I was too busy with EF100 bring-up, and naïvely assumed that I
could safely ignore devlink port stuff as it was so obviously going
to be a classic Mellanox design: tasteless, overweight, and not
cleanly mappable onto any other vendor. Which seems to have been
true but they've managed to make it the standard anyway by virtue
of being there first, as usual :'(
(Yeah, I probably shouldn't publicly say things like that about
another vendor's devs. But I'm getting frustrated at this recurring
pattern.)

Devlink port function *would* be useful for administering functions
that don't have a representor. I just can't see any good reason
why such things should ever exist.

Maybe it's not too late to introduce my API to exist alongside it…
though I have no idea how much work it would take to teach the
orchestration frameworks to use it :/

-ed
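For readers trying to picture the proposed callback, here is a userspace mock of the idea: a separate "partner" operation that only representors implement, so the representor's own address and the address of the function behind it can differ. Everything here is hypothetical (`mock_netdev`, `mock_netdev_ops`, `rep_ops`, `plain_ops`, `set_partner_mac`). No `ndo_set_partner_mac_address` exists in the kernel, and the thread does not define its semantics beyond the signature.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

struct mock_netdev;

/* Stand-in for net_device_ops with the proposed extra callback. */
struct mock_netdev_ops {
	int (*set_mac_address)(struct mock_netdev *dev, const uint8_t *addr);
	/* Only representors would fill this in. */
	int (*set_partner_mac_address)(struct mock_netdev *dev,
				       const uint8_t *addr);
};

struct mock_netdev {
	const struct mock_netdev_ops *ops;
	uint8_t dev_addr[ETH_ALEN];	/* this end of the virtual wire */
	uint8_t partner_addr[ETH_ALEN];	/* the function at the other end */
};

static int mock_set_mac(struct mock_netdev *dev, const uint8_t *addr)
{
	memcpy(dev->dev_addr, addr, ETH_ALEN);
	return 0;
}

static int rep_set_partner_mac(struct mock_netdev *dev, const uint8_t *addr)
{
	/* A real driver would issue a firmware request here instead. */
	memcpy(dev->partner_addr, addr, ETH_ALEN);
	return 0;
}

/* A representor implements both callbacks; an ordinary netdev only one. */
static const struct mock_netdev_ops rep_ops = {
	.set_mac_address = mock_set_mac,
	.set_partner_mac_address = rep_set_partner_mac,
};

static const struct mock_netdev_ops plain_ops = {
	.set_mac_address = mock_set_mac,
};

/* Core-side dispatch: refuse on devices that are not representors. */
static int set_partner_mac(struct mock_netdev *dev, const uint8_t *addr)
{
	assert(dev && dev->ops && addr);
	if (!dev->ops->set_partner_mac_address)
		return -EOPNOTSUPP;
	return dev->ops->set_partner_mac_address(dev, addr);
}
```

In this sketch, setting the partner address leaves `dev_addr` untouched, which is the independence the discussion asks for. Devices without the callback report `-EOPNOTSUPP`.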
On Thu, 28 Jul 2022 21:23:23 +0100 Edward Cree wrote:
> Sadly I was too busy with EF100 bring-up, and naïvely assumed that I
> could safely ignore devlink port stuff as it was so obviously going
> to be a classic Mellanox design: tasteless, overweight, and not
> cleanly mappable onto any other vendor. Which seems to have been
> true but they've managed to make it the standard anyway by virtue
> of being there first, as usual :'(
> (Yeah, I probably shouldn't publicly say things like that about
> another vendor's devs. But I'm getting frustrated at this recurring
> pattern.)

I spend an unhealthy amount of time thinking about the problem of
vendors not paying attention when new uAPIs are forged. Happy to
try things.

> Devlink port function *would* be useful for administering functions
> that don't have a representor. I just can't see any good reason
> why such things should ever exist.

The SmartNIC/DPU/IPU/isolated hv+IO CPU can expose storage functions
to the peer. nVidia is working on extending the devlink rate limit API
to cover such cases.
On 29/07/2022 02:45, Jakub Kicinski wrote:
>> Devlink port function *would* be useful for administering functions
>> that don't have a representor. I just can't see any good reason
>> why such things should ever exist.
>
> The SmartNIC/DPU/IPU/isolated hv+IO CPU can expose storage functions
> to the peer. nVidia is working on extending the devlink rate limit API
> to cover such cases.

All the storage-on-SmartNIC setups I can imagine involve the storage
function (e.g. a virtio-blk PF) being connected to the network switch,
either to access remote network storage or to export the local storage
over the network. (I'm not quite sure why you'd bother combining
storage and networking functionality onto a single device if they
_weren't_ connected in this way.)
Which means that your storage function has a v-switch port, and thus
should have a representor netdevice so you can e.g. use tc rules to
define its access to the physical network.
Arguably any network rate limiting you then want to apply to that
function's v-switch port should be in the form of a tc police action.
(Which is far more flexible than devlink rate, because you can have
different policers for traffic matching different tc filters, e.g.
separate rate limits for control and data traffic of the dFS.)

Idk, maybe I'm being crazy in assuming that hardware has sane design
semantics. But the obvious way to build a SmartNIC maps very cleanly
onto representors, without any need for devlink port function, and I
think it makes more sense to say that maybe some weird device might
end up having representors to control some objects that don't have
network access, than that everyone has to implement this whole
parallel structure of devlink objects for things that already have
representors.

-ed
diff --git a/drivers/net/ethernet/sfc/ef100_rep.c b/drivers/net/ethernet/sfc/ef100_rep.c
index ebab4579e63b..58365a4c7c6a 100644
--- a/drivers/net/ethernet/sfc/ef100_rep.c
+++ b/drivers/net/ethernet/sfc/ef100_rep.c
@@ -107,6 +107,33 @@ static int efx_ef100_rep_get_phys_port_name(struct net_device *dev,
 	return 0;
 }
 
+static int efx_ef100_rep_set_mac_address(struct net_device *net_dev, void *data)
+{
+	MCDI_DECLARE_BUF(inbuf, MC_CMD_SET_CLIENT_MAC_ADDRESSES_IN_LEN(1));
+	struct efx_rep *efv = netdev_priv(net_dev);
+	struct efx_nic *efx = efv->parent;
+	struct sockaddr *addr = data;
+	const u8 *new_addr = addr->sa_data;
+	int rc;
+
+	if (efv->clid == CLIENT_HANDLE_NULL) {
+		netif_info(efx, drv, net_dev, "Unable to set representee MAC address (client ID is null)\n");
+	} else {
+		BUILD_BUG_ON(MC_CMD_SET_CLIENT_MAC_ADDRESSES_OUT_LEN);
+		MCDI_SET_DWORD(inbuf, SET_CLIENT_MAC_ADDRESSES_IN_CLIENT_HANDLE,
+			       efv->clid);
+		ether_addr_copy(MCDI_PTR(inbuf, SET_CLIENT_MAC_ADDRESSES_IN_MAC_ADDRS),
+				new_addr);
+		rc = efx_mcdi_rpc(efx, MC_CMD_SET_CLIENT_MAC_ADDRESSES, inbuf,
+				  sizeof(inbuf), NULL, 0, NULL);
+		if (rc)
+			return rc;
+	}
+
+	eth_hw_addr_set(net_dev, new_addr);
+	return 0;
+}
+
 static void efx_ef100_rep_get_stats64(struct net_device *dev,
 				      struct rtnl_link_stats64 *stats)
 {
@@ -126,6 +153,7 @@ static const struct net_device_ops efx_ef100_rep_netdev_ops = {
 	.ndo_start_xmit		= efx_ef100_rep_xmit,
 	.ndo_get_port_parent_id	= efx_ef100_rep_get_port_parent_id,
 	.ndo_get_phys_port_name	= efx_ef100_rep_get_phys_port_name,
+	.ndo_set_mac_address	= efx_ef100_rep_set_mac_address,
 	.ndo_get_stats64	= efx_ef100_rep_get_stats64,
 };