
[net-next,v2,12/14] sfc: set EF100 VF MAC address through representor

Message ID 304963d62ed1fa5f75437d1f832830d7970f9919.1658943678.git.ecree.xilinx@gmail.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series sfc: VF representors for EF100 - RX side

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 2 maintainers not CCed: edumazet@google.com habetsm.xilinx@gmail.com
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch warning WARNING: line length of 87 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

ecree@xilinx.com July 27, 2022, 5:46 p.m. UTC
From: Edward Cree <ecree.xilinx@gmail.com>

When setting the VF rep's MAC address, set the provisioned MAC address
 for the VF through MC_CMD_SET_CLIENT_MAC_ADDRESSES.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
---
 drivers/net/ethernet/sfc/ef100_rep.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

Comments

Jakub Kicinski July 28, 2022, 3:10 a.m. UTC | #1
On Wed, 27 Jul 2022 18:46:02 +0100 ecree@xilinx.com wrote:
> From: Edward Cree <ecree.xilinx@gmail.com>
> 
> When setting the VF rep's MAC address, set the provisioned MAC address
>  for the VF through MC_CMD_SET_CLIENT_MAC_ADDRESSES.

Wait.. hm? The VF rep is not the VF. It's the other side of the wire.
Are you passing the VF rep's MAC on the VF? Ethernet packets between
the hypervisor and the VF would have the same SA and DA.

Edward Cree July 28, 2022, 3:47 p.m. UTC | #2
On 28/07/2022 04:10, Jakub Kicinski wrote:
> On Wed, 27 Jul 2022 18:46:02 +0100 ecree@xilinx.com wrote:
>> From: Edward Cree <ecree.xilinx@gmail.com>
>>
>> When setting the VF rep's MAC address, set the provisioned MAC address
>>  for the VF through MC_CMD_SET_CLIENT_MAC_ADDRESSES.
> 
> Wait.. hm? The VF rep is not the VF. It's the other side of the wire.
> Are you passing the VF rep's MAC on the VF? Ethernet packets between
> the hypervisor and the VF would have the same SA and DA.
> 

Yes (but only if there's an IP stack on the repr; I think it's fine if
 the repr is plugged straight into a bridge so any ARP picks up a
 different DA?).
I thought that was weird but I also thought that was 'how it's done'
 with reps — properties of the VF are set by applying them to the rep.
Is there some other way to configure VF MAC?  (Are we supposed to still
 be using the legacy SR-IOV interface, .ndo_set_vf_mac()?  I thought
 that was deprecated in favour of more switchdev-flavoured stuff…)

-ed

Jakub Kicinski July 28, 2022, 4:20 p.m. UTC | #3
On Thu, 28 Jul 2022 16:47:36 +0100 Edward Cree wrote:
> On 28/07/2022 04:10, Jakub Kicinski wrote:
> > On Wed, 27 Jul 2022 18:46:02 +0100 ecree@xilinx.com wrote:  
> >> When setting the VF rep's MAC address, set the provisioned MAC address
> >>  for the VF through MC_CMD_SET_CLIENT_MAC_ADDRESSES.  
> > 
> > Wait.. hm? The VF rep is not the VF. It's the other side of the wire.
> > Are you passing the VF rep's MAC on the VF? Ethernet packets between
> > the hypervisor and the VF would have the same SA and DA.
> 
> Yes (but only if there's an IP stack on the repr; I think it's fine if
>  the repr is plugged straight into a bridge so any ARP picks up a
>  different DA?).
> I thought that was weird but I also thought that was 'how it's done'
>  with reps — properties of the VF are set by applying them to the rep.
> Is there some other way to configure VF MAC?  (Are we supposed to still
>  be using the legacy SR-IOV interface, .ndo_set_vf_mac()?  I thought
>  that was deprecated in favour of more switchdev-flavoured stuff…)

It's set thru

 devlink port function set DEV/PORT_INDEX hw_addr ADDR

"port functions" is a weird object representing something 
in Mellanox FW. Hopefully it makes more sense to you than
it does to me.

Edward Cree July 28, 2022, 6:12 p.m. UTC | #4
On 28/07/2022 17:20, Jakub Kicinski wrote:
> It's set thru
> 
>  devlink port function set DEV/PORT_INDEX hw_addr ADDR
> 
> "port functions" is a weird object representing something 
> in Mellanox FW. Hopefully it makes more sense to you than
> it does to me.
Hmm that does look weird, looks like it acts on a PCI device
 (DEV is a PCI address) and then I'm not sure what PORT_INDEX
 is meant to mean (the man page doesn't describe it at all).
 Possibly it doesn't have semantics as such and is just a
 synthetic index into a list of ports…
I can't say it makes sense to me either :shrug:

We did take a look at what nfp does, as well; they use the
 old .ndo_set_vf_mac(), but they appear to support it both on
 the PF and on the VF reprs — meaning that (AFAICT) it allows
 to set the MAC address of VF 0 through the repr for VF 1.
(There is no check that I can see in nfp_app_set_vf_mac()
 that the value of `int vf` matches the caller.)
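For illustration, the kind of ownership check described as missing could look like the following userspace sketch. It is not nfp code: `struct mock_caller` and `mock_check_vf_owner()` are hypothetical names, and the policy (PF may configure any VF, a VF representor only its own) is an assumption drawn from the paragraph above.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative model only; these are not nfp's actual data structures. */
struct mock_caller {
	bool is_vf_rep;		/* false: the request arrived on the PF netdev */
	int rep_vf_idx;		/* VF this representor stands for, if is_vf_rep */
};

/*
 * Return 0 if 'caller' may configure VF number 'vf' (out of 'num_vfs'),
 * or a negative errno value otherwise.
 */
static int mock_check_vf_owner(const struct mock_caller *caller, int vf,
			       int num_vfs)
{
	if (vf < 0 || vf >= num_vfs)
		return -EINVAL;
	/* Assumed policy: the PF may administer any of its VFs. */
	if (!caller->is_vf_rep)
		return 0;
	/* Assumed policy: a VF representor may only touch its own VF. */
	return caller->rep_vf_idx == vf ? 0 : -EPERM;
}
```

With such a check, a request to set the MAC address of VF 0 made through the repr for VF 1 would be refused rather than silently applied.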

Our (SN1000) approach to the problem of configuring 'remote'
 functions (VFs in VMs, PFs on the embedded SoC) is to use
 representors for them all (VF reps as added in this & prev
 series, PF reps coming in the future.  Similarly, if we
 were ever to add Subfunctions, each SF would have a
 corresponding SF representor that would work in much the
 same way as VF reps).  At which point you should always be
 able to configure an object through its associated rep,
 and there should never be a need for an 'index' parameter
 (be that 'VF index' or 'port index').
While .ndo_set_mac_address() might be the Wrong Thing (if
 we want to be able to set VF and VF-rep addresses
 independently to different things), the Right Thing ought
 to have the same signature (i.e. just taking a netdev and
 a hwaddr).  Devlink seems to me like a needless
 complication here.

Anyway, since the proper direction is unclear, I'll respin
 the series without patches 10-13 in the hope of getting
 the rest of it in before the merge window.

-ed

Jakub Kicinski July 28, 2022, 6:32 p.m. UTC | #5
On Thu, 28 Jul 2022 19:12:34 +0100 Edward Cree wrote:
> On 28/07/2022 17:20, Jakub Kicinski wrote:
> > It's set thru
> > 
> >  devlink port function set DEV/PORT_INDEX hw_addr ADDR
> > 
> > "port functions" is a weird object representing something 
> > in Mellanox FW. Hopefully it makes more sense to you than
> > it does to me.  
> Hmm that does look weird, looks like it acts on a PCI device
>  (DEV is a PCI address) and then I'm not sure what PORT_INDEX
>  is meant to mean (the man page doesn't describe it at all).
>  Possibly it doesn't have semantics as such and is just a
>  synthetic index into a list of ports…
> I can't say it makes sense to me either :shrug:
> 
> We did take a look at what nfp does, as well; they use the
>  old .ndo_set_vf_mac(), but they appear to support it both on
>  the PF and on the VF reprs — meaning that (AFAICT) it allows
>  to set the MAC address of VF 0 through the repr for VF 1.
> (There is no check that I can see in nfp_app_set_vf_mac()
>  that the value of `int vf` matches the caller.)

IIRC the reprs are all linked to the PCI device of the PF in sysfs,
and OpenStack would pick a device linked to the PCI parent almost
at random. So the VF reprs needed the legacy NDOs. At least that's
what I remember being told.

I think the legacy NDOs are acceptable, devlink way is preferred
(devlink way did not exist when NFP code was written).

> Our (SN1000) approach to the problem of configuring 'remote'
>  functions (VFs in VMs, PFs on the embedded SoC) is to use
>  representors for them all (VF reps as added in this & prev
>  series, PF reps coming in the future.  Similarly, if we
>  were ever to add Subfunctions, each SF would have a
>  corresponding SF representor that would work in much the
>  same way as VF reps).  At which point you should always be
>  able to configure an object through its associated rep,
>  and there should never be a need for an 'index' parameter
>  (be that 'VF index' or 'port index').

How do you map reprs to VFs? The PCI devices of the VF may be on 
a different system.

> While .ndo_set_mac_address() might be the Wrong Thing (if
>  we want to be able to set VF and VF-rep addresses
>  independently to different things), the Right Thing ought
>  to have the same signature (i.e. just taking a netdev and
>  a hwaddr).  Devlink seems to me like a needless
>  complication here.

But reps are like switch ports in a switch ASIC, and the PCI
device is the other side of the virtual wire. You would not be
configuring the MAC address of a peer to peer link by setting 
the local address.

> Anyway, since the proper direction is unclear, I'll respin
>  the series without patches 10-13 in the hope of getting
>  the rest of it in before the merge window.

SG

Edward Cree July 28, 2022, 6:54 p.m. UTC | #6
On 28/07/2022 19:32, Jakub Kicinski wrote:
> IIRC the reprs are all linked to the PCI device of the PF in sysfs,
You mean /sys/class/net/$VFREP/device?  On sfc that's (intentionally)
 nonexistent; the only visible connection between repr and its owning
 PF is /sys/class/net/$VFREP/phys_switch_id which holds the same
 value as /sys/class/net/$PF/phys_port_id.

> How do you map reprs to VFs? The PCI devices of the VF may be on 
> a different system.
That's what the client ID from patch #10 is for.  We ask the FW for
 a handle to "caller's PCIe controller → caller's PF → VF number
 efv->idx", and that handle is what we store in efv->clid, and later
 pass to MC_CMD_SET_CLIENT_MAC_ADDRESSES in patch #12.

The user determines which repr corresponds to which VF by looking in
 /sys/class/net/$VFREP/phys_port_name (e.g. "p0pf0vf0").
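To make the naming convention concrete, here is a minimal userspace sketch that splits such a name into its port, PF and VF numbers. It assumes every representor name follows the pattern of the single example "p0pf0vf0"; `struct rep_port_name` and `parse_rep_port_name()` are hypothetical names, not part of sfc or the kernel.

```c
#include <stdio.h>

/* Hypothetical container for the fields encoded in a name like "p0pf0vf0". */
struct rep_port_name {
	unsigned int port;	/* physical port, the "p0" part */
	unsigned int pf;	/* PF number, the "pf0" part */
	unsigned int vf;	/* VF number, the "vf0" part */
};

/*
 * Parse 'name' into 'out'.  Returns 0 on success, -1 if the string does
 * not match the assumed "p<N>pf<N>vf<N>" form.  Note that sscanf's %u is
 * lenient about leading whitespace and signs; this sketch does not try
 * to be stricter than that.
 */
static int parse_rep_port_name(const char *name, struct rep_port_name *out)
{
	int consumed = 0;

	if (sscanf(name, "p%upf%uvf%u%n", &out->port, &out->pf, &out->vf,
		   &consumed) != 3)
		return -1;
	/* Reject trailing characters after the VF number. */
	if (name[consumed] != '\0')
		return -1;
	return 0;
}
```

A management tool could read /sys/class/net/$VFREP/phys_port_name, strip the trailing newline, and pass the result to this helper to find out which VF a given repr stands for.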

> But reps are like switch ports in a switch ASIC, and the PCI
> device is the other side of the virtual wire. You would not be
> configuring the MAC address of a peer to peer link by setting 
> the local address.
Indeed.  I agree that .ndo_set_mac_address() is the wrong interface.
But the interface I have in mind would be something like
    int (*ndo_set_partner_mac_address)(struct net_device *, void *);
 and would only be implemented by representor netdevs.
Idk what the uAPI/UI for that would be; probably a new `ip link set`
 parameter.

-ed
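As an illustration of the distinction drawn in the message above, the following userspace model keeps the representor's own address and the representee's address as two separate fields, each with its own setter. It is only a sketch: `struct mock_rep` and both `mock_*` functions are hypothetical, `ndo_set_partner_mac_address` does not exist in the kernel, and no firmware call is modelled. By contrast, the posted patch applies one address to both ends from a single `.ndo_set_mac_address()` callback.

```c
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6

/* Userspace model only; this is not a kernel structure. */
struct mock_rep {
	uint8_t rep_addr[MAC_LEN];	/* address of the representor netdev */
	uint8_t partner_addr[MAC_LEN];	/* address provisioned for the VF */
};

/* Models .ndo_set_mac_address(): changes the local end only. */
static int mock_set_mac_address(struct mock_rep *rep, const uint8_t *addr)
{
	memcpy(rep->rep_addr, addr, MAC_LEN);
	return 0;
}

/* Models the proposed ndo_set_partner_mac_address(): far end only. */
static int mock_set_partner_mac_address(struct mock_rep *rep,
					const uint8_t *addr)
{
	memcpy(rep->partner_addr, addr, MAC_LEN);
	return 0;
}
```

Keeping the two setters separate means the two ends of the virtual wire can hold different addresses, which avoids the identical SA and DA problem raised at the start of the thread.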

Jakub Kicinski July 28, 2022, 7:27 p.m. UTC | #7
On Thu, 28 Jul 2022 19:54:21 +0100 Edward Cree wrote:
> On 28/07/2022 19:32, Jakub Kicinski wrote:
> > How do you map reprs to VFs? The PCI devices of the VF may be on 
> > a different system.  
> That's what the client ID from patch #10 is for.  We ask the FW for
>  a handle to "caller's PCIe controller → caller's PF → VF number
>  efv->idx", and that handle is what we store in efv->clid, and later
>  pass to MC_CMD_SET_CLIENT_MAC_ADDRESSES in patch #12.
> 
> The user determines which repr corresponds to which VF by looking in
>  /sys/class/net/$VFREP/phys_port_name (e.g. "p0pf0vf0").

.. and that would also most likely be what the devlink port ID would be.

> > But reps are like switch ports in a switch ASIC, and the PCI
> > device is the other side of the virtual wire. You would not be
> > configuring the MAC address of a peer to peer link by setting 
> > the local address.  
> Indeed.  I agree that .ndo_set_mac_address() is the wrong interface.
> But the interface I have in mind would be something like
>     int (*ndo_set_partner_mac_address)(struct net_device *, void *);
>  and would only be implemented by representor netdevs.
> Idk what the uAPI/UI for that would be; probably a new `ip link set`
>  parameter.

Yup... If only you were there during the fight over this uAPI.
Now it's the devlink "port function" thing.

Edward Cree July 28, 2022, 8:23 p.m. UTC | #8
On 28/07/2022 20:27, Jakub Kicinski wrote:
> On Thu, 28 Jul 2022 19:54:21 +0100 Edward Cree wrote:
>> The user determines which repr corresponds to which VF by looking in
>>  /sys/class/net/$VFREP/phys_port_name (e.g. "p0pf0vf0").
> 
> .. and that would also most likely be what the devlink port ID would be.
AFAICT the devlink port index is just an integer.  The example in the
 man page is
    devlink port function set pci/0000:01:00.0/1 hw_addr 00:00:00:11:22:33
Moreover, struct devlink_port has `unsigned int index`.
Though it does also have `struct devlink_port_attrs` which appears to
 encode the PF and VF numbers; I think those can be read with `devlink
 port show`.
But the whole devlink port abstraction is unnecessary when we already
 *have* an object to represent the port.

>> Indeed.  I agree that .ndo_set_mac_address() is the wrong interface.
>> But the interface I have in mind would be something like
>>     int (*ndo_set_partner_mac_address)(struct net_device *, void *);
>>  and would only be implemented by representor netdevs.
>> Idk what the uAPI/UI for that would be; probably a new `ip link set`
>>  parameter.
> 
> Yup... If only you were there during the fight over this uAPI.
> Now it's the devlink "port function" thing.
Sadly I was too busy with EF100 bring-up, and naïvely assumed that I
 could safely ignore devlink port stuff as it was so obviously going
 to be a classic Mellanox design: tasteless, overweight, and not
 cleanly mappable onto any other vendor.  Which seems to have been
 true but they've managed to make it the standard anyway by virtue
 of being there first, as usual :'(
(Yeah, I probably shouldn't publicly say things like that about
 another vendor's devs.  But I'm getting frustrated at this recurring
 pattern.)

Devlink port function *would* be useful for administering functions
 that don't have a representor.  I just can't see any good reason
 why such things should ever exist.
Maybe it's not too late to introduce my API to exist alongside it…
 though I have no idea how much work it would take to teach the
 orchestration frameworks to use it :/

-ed

Jakub Kicinski July 29, 2022, 1:45 a.m. UTC | #9
On Thu, 28 Jul 2022 21:23:23 +0100 Edward Cree wrote:
> Sadly I was too busy with EF100 bring-up, and naïvely assumed that I
>  could safely ignore devlink port stuff as it was so obviously going
>  to be a classic Mellanox design: tasteless, overweight, and not
>  cleanly mappable onto any other vendor.  Which seems to have been
>  true but they've managed to make it the standard anyway by virtue
>  of being there first, as usual :'(
> (Yeah, I probably shouldn't publicly say things like that about
>  another vendor's devs.  But I'm getting frustrated at this recurring
>  pattern.)

I spend an unhealthy amount of time thinking about the problem 
of vendors not paying attention when new uAPIs are forged.
Happy to try things.

> Devlink port function *would* be useful for administering functions
>  that don't have a representor.  I just can't see any good reason
>  why such things should ever exist.

The SmartNIC/DPU/IPU/isolated hv+IO CPU can expose storage functions
to the peer. nVidia is working on extending the devlink rate limit API
to cover such cases.

Edward Cree July 29, 2022, 3:17 p.m. UTC | #10
On 29/07/2022 02:45, Jakub Kicinski wrote:
>> Devlink port function *would* be useful for administering functions
>>  that don't have a representor.  I just can't see any good reason
>>  why such things should ever exist.
> 
> The SmartNIC/DPU/IPU/isolated hv+IO CPU can expose storage functions
> to the peer. nVidia is working on extending the devlink rate limit API
> to cover such cases.

All the storage-on-SmartNIC setups I can imagine involve the storage
 function (e.g. a virtio-blk PF) being connected to the network switch,
 either to access remote network storage or to export the local storage
 over the network.  (I'm not quite sure why you'd bother combining
 storage and networking functionality onto a single device if they
 _weren't_ connected in this way.)
Which means that your storage function has a v-switch port, and thus
 should have a representor netdevice so you can e.g. use tc rules to
 define its access to the physical network.
Arguably any network rate limiting you then want to apply to that
 function's v-switch port should be in the form of a tc police action.
(Which is far more flexible than devlink rate, because you can have
 different policers for traffic matching different tc filters, e.g.
 separate rate limits for control and data traffic of the dFS.)

Idk, maybe I'm being crazy in assuming that hardware has sane design
 semantics.  But the obvious way to build a SmartNIC maps very cleanly
 onto representors, without any need for devlink port function, and I
 think it makes more sense to say that maybe some weird device might
 end up having representors to control some objects that don't have
 network access, than that everyone has to implement this whole
 parallel structure of devlink objects for things that already have
 representors.

-ed

Patch

diff --git a/drivers/net/ethernet/sfc/ef100_rep.c b/drivers/net/ethernet/sfc/ef100_rep.c
index ebab4579e63b..58365a4c7c6a 100644
--- a/drivers/net/ethernet/sfc/ef100_rep.c
+++ b/drivers/net/ethernet/sfc/ef100_rep.c
@@ -107,6 +107,33 @@  static int efx_ef100_rep_get_phys_port_name(struct net_device *dev,
 	return 0;
 }
 
+static int efx_ef100_rep_set_mac_address(struct net_device *net_dev, void *data)
+{
+	MCDI_DECLARE_BUF(inbuf, MC_CMD_SET_CLIENT_MAC_ADDRESSES_IN_LEN(1));
+	struct efx_rep *efv = netdev_priv(net_dev);
+	struct efx_nic *efx = efv->parent;
+	struct sockaddr *addr = data;
+	const u8 *new_addr = addr->sa_data;
+	int rc;
+
+	if (efv->clid == CLIENT_HANDLE_NULL) {
+		netif_info(efx, drv, net_dev, "Unable to set representee MAC address (client ID is null)\n");
+	} else {
+		BUILD_BUG_ON(MC_CMD_SET_CLIENT_MAC_ADDRESSES_OUT_LEN);
+		MCDI_SET_DWORD(inbuf, SET_CLIENT_MAC_ADDRESSES_IN_CLIENT_HANDLE,
+			       efv->clid);
+		ether_addr_copy(MCDI_PTR(inbuf, SET_CLIENT_MAC_ADDRESSES_IN_MAC_ADDRS),
+				new_addr);
+		rc = efx_mcdi_rpc(efx, MC_CMD_SET_CLIENT_MAC_ADDRESSES, inbuf,
+				  sizeof(inbuf), NULL, 0, NULL);
+		if (rc)
+			return rc;
+	}
+
+	eth_hw_addr_set(net_dev, new_addr);
+	return 0;
+}
+
 static void efx_ef100_rep_get_stats64(struct net_device *dev,
 				      struct rtnl_link_stats64 *stats)
 {
@@ -126,6 +153,7 @@  static const struct net_device_ops efx_ef100_rep_netdev_ops = {
 	.ndo_start_xmit		= efx_ef100_rep_xmit,
 	.ndo_get_port_parent_id	= efx_ef100_rep_get_port_parent_id,
 	.ndo_get_phys_port_name	= efx_ef100_rep_get_phys_port_name,
+	.ndo_set_mac_address    = efx_ef100_rep_set_mac_address,
 	.ndo_get_stats64	= efx_ef100_rep_get_stats64,
 };