Message ID: 20201112192424.2742-1-parav@nvidia.com (mailing list archive)
Series: Add mlx5 subfunction support
On Thu, 12 Nov 2020 21:24:10 +0200 Parav Pandit wrote:
> This series introduces support for mlx5 subfunction (SF).
> A subfunction is a portion of a PCI device that supports multiple
> classes of devices such as netdev, RDMA and more.
>
> This patchset is based on Leon's series [3].
> It is a third user of proposed auxiliary bus [4].
>
> Subfunction support is discussed in detail in RFC [1] and [2].
> RFC [1] and extension [2] describes requirements, design, and proposed
> plumbing using devlink, auxiliary bus and sysfs for systemd/udev
> support.

So we're going to have two ways of adding subdevs? Via devlink and via
the new vdpa netlink thing?

Question number two - is this supposed to be ready to be applied to
net-next? It seems there is a conflict.

Also could you please wrap your code at 80 chars?

Thanks.
On Mon, 2020-11-16 at 14:52 -0800, Jakub Kicinski wrote:
> On Thu, 12 Nov 2020 21:24:10 +0200 Parav Pandit wrote:
> > This series introduces support for mlx5 subfunction (SF).
> > A subfunction is a portion of a PCI device that supports multiple
> > classes of devices such as netdev, RDMA and more.
> >
> > This patchset is based on Leon's series [3].
> > It is a third user of proposed auxiliary bus [4].
> >
> > Subfunction support is discussed in detail in RFC [1] and [2].
> > RFC [1] and extension [2] describes requirements, design, and
> > proposed plumbing using devlink, auxiliary bus and sysfs for
> > systemd/udev support.
>
> So we're going to have two ways of adding subdevs? Via devlink and via
> the new vdpa netlink thing?

Via devlink you add the sub-function bus device - think of it as
spawning a new VF - but it has no actual characteristics
(netdev/vdpa/rdma) "yet", until the user/admin decides to load an
interface on it via aux sysfs.

Basically devlink adds a new eswitch port (the SF port); loading the
drivers and the interfaces is done via the auxbus subsystem only after
the SF is spawned by FW.

> Question number two - is this supposed to be ready to be applied to
> net-next? It seems there is a conflict.

This series requires other mlx5 and auxbus infrastructure dependencies
that were already submitted by Leon 2-3 weeks ago and are pending Greg's
review. Once finalized they will be merged into mlx5-next; then I will
ask you to pull mlx5-next, and only after that can this series be
applied cleanly to net-next. Sorry for the mess, but we had to move
forward and show how the auxdev subsystem is actually being used.

Leon's series:
https://patchwork.ozlabs.org/project/netdev/cover/20201101201542.2027568-1-leon@kernel.org/

> Also could you please wrap your code at 80 chars?

I prefer not to do this in mlx5; in mlx5 we follow a 95 chars rule.
But if you insist :) ..

Thanks,
Saeed.
On Mon, 16 Nov 2020 16:06:02 -0800 Saeed Mahameed wrote: > On Mon, 2020-11-16 at 14:52 -0800, Jakub Kicinski wrote: > > On Thu, 12 Nov 2020 21:24:10 +0200 Parav Pandit wrote: > > > This series introduces support for mlx5 subfunction (SF). > > > A subfunction is a portion of a PCI device that supports multiple > > > classes of devices such as netdev, RDMA and more. > > > > > > This patchset is based on Leon's series [3]. > > > It is a third user of proposed auxiliary bus [4]. > > > > > > Subfunction support is discussed in detail in RFC [1] and [2]. > > > RFC [1] and extension [2] describes requirements, design, and > > > proposed > > > plumbing using devlink, auxiliary bus and sysfs for systemd/udev > > > support. > > > > So we're going to have two ways of adding subdevs? Via devlink and > > via the new vdpa netlink thing? > > Via devlink you add the Sub-function bus device - think of it as > spawning a new VF - but has no actual characteristics > (netdev/vpda/rdma) "yet" until user admin decides to load an interface > on it via aux sysfs. By which you mean it doesn't get probed or the device type is not set (IOW it can still become a block device or netdev depending on the vdpa request)? > Basically devlink adds a new eswitch port (the SF port) and loading the > drivers and the interfaces is done via the auxbus subsystem only after > the SF is spawned by FW. But why? Is this for the SmartNIC / bare metal case? The flow for spawning on the local host gets highly convoluted. > > Also could you please wrap your code at 80 chars? > > I prefer no to do this in mlx5, in mlx5 we follow a 95 chars rule. > But if you insist :) .. Oh yeah, I meant the devlink patches!
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Tuesday, November 17, 2020 7:28 AM
>
> On Mon, 16 Nov 2020 16:06:02 -0800 Saeed Mahameed wrote:
> > > > Subfunction support is discussed in detail in RFC [1] and [2].
> > > > RFC [1] and extension [2] describes requirements, design, and
> > > > proposed plumbing using devlink, auxiliary bus and sysfs for
> > > > systemd/udev support.
> > >
> > > So we're going to have two ways of adding subdevs? Via devlink and
> > > via the new vdpa netlink thing?

Nope. Subfunctions (subdevs) are added only one way, i.e. via devlink
port, as settled in RFC [1].

Just to refresh all our memory, we discussed and settled on the flow in
[2]; RFC [1] followed this discussion.

The vdpa tool of [3] can add one or more vdpa device(s) on top of an
already spawned PF, VF or SF device.

> >
> > Via devlink you add the sub-function bus device - think of it as
> > spawning a new VF - but it has no actual characteristics
> > (netdev/vdpa/rdma) "yet", until the user/admin decides to load an
> > interface on it via aux sysfs.
>
> By which you mean it doesn't get probed or the device type is not set
> (IOW it can still become a block device or netdev depending on the
> vdpa request)?
>
> > Basically devlink adds a new eswitch port (the SF port); loading the
> > drivers and the interfaces is done via the auxbus subsystem only
> > after the SF is spawned by FW.
>
> But why?
>
> Is this for the SmartNIC / bare metal case? The flow for spawning on
> the local host gets highly convoluted.

The flow of spawning for (a) the local host or (b) an external host
controller from the smartnic is the same:

$ devlink port add ...
[..]
followed by
$ devlink port function set state ...

The only change would be to specify the destination where to spawn it
(controller number, pf, sf num etc). Please refer to the detailed
examples in the individual patches; patches 12 and 13 mostly cover the
complete view.

> > > Also could you please wrap your code at 80 chars?
> >
> > I prefer not to do this in mlx5; in mlx5 we follow a 95 chars rule.
> > But if you insist :) ..
>
> Oh yeah, I meant the devlink patches!

May I ask why? The past few devlink patches [4] followed a 100 chars
rule. When did we revert back to 80? If so, any pointers to the thread
for 80? checkpatch.pl with --strict mode didn't complain when I prepared
the patches.

[1] https://lore.kernel.org/netdev/20200519092258.GF4655@nanopsycho/
[2] https://lore.kernel.org/netdev/20200324132044.GI20941@ziepe.ca/
[3] https://lists.linuxfoundation.org/pipermail/virtualization/2020-November/050623.html
[4] commits dc64cc7c6310, 77069ba2e3ad, a1e8ae907c8d, 2a916ecc4056, ba356c90985d
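For illustration, a minimal sketch of the same two-step flow when the
subfunction is spawned for an external host controller from the smartnic
side. The controller/pfnum/sfnum values and the "controller" attribute are
assumptions based on the description above, not commands taken from the
patches; the representor name is a placeholder:

$ devlink port add pci/0000:06:00.0 flavour pcisf controller 1 pfnum 0 sfnum 77
$ devlink port function set <sf-representor> hw_addr 00:00:00:00:77:77
$ devlink port function set <sf-representor> state active

The only difference from the local-host case is the destination
(controller/pf/sf numbers); the port add / port function set sequence itself
is unchanged.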
On Tue, 17 Nov 2020 04:08:57 +0000 Parav Pandit wrote: > > On Mon, 16 Nov 2020 16:06:02 -0800 Saeed Mahameed wrote: > > > > > Subfunction support is discussed in detail in RFC [1] and [2]. > > > > > RFC [1] and extension [2] describes requirements, design, and > > > > > proposed plumbing using devlink, auxiliary bus and sysfs for > > > > > systemd/udev support. > > > > > > > > So we're going to have two ways of adding subdevs? Via devlink and > > > > via the new vdpa netlink thing? > Nop. > Subfunctions (subdevs) are added only one way, > i.e. devlink port as settled in RFC [1]. > > Just to refresh all our memory, we discussed and settled on the flow > in [2]; RFC [1] followed this discussion. > > vdpa tool of [3] can add one or more vdpa device(s) on top of already > spawned PF, VF, SF device. Nack for the networking part of that. It'd basically be VMDq. > > > Via devlink you add the Sub-function bus device - think of it as > > > spawning a new VF - but has no actual characteristics > > > (netdev/vpda/rdma) "yet" until user admin decides to load an > > > interface on it via aux sysfs. > > > > By which you mean it doesn't get probed or the device type is not > > set (IOW it can still become a block device or netdev depending on > > the vdpa request)? > > > Basically devlink adds a new eswitch port (the SF port) and > > > loading the drivers and the interfaces is done via the auxbus > > > subsystem only after the SF is spawned by FW. > > > > But why? > > > > Is this for the SmartNIC / bare metal case? The flow for spawning > > on the local host gets highly convoluted. > > The flow of spawning for (a) local host or (b) for external host > controller from smartnic is same. > > $ devlink port add.. > [..] > Followed by > $ devlink port function set state... > > Only change would be to specify the destination where to spawn it. > (controller number, pf, sf num etc) Please refer to the detailed > examples in individual patch. Patch 12 and 13 mostly covers the > complete view. Please share full examples of the workflow. I'm asking how the vdpa API fits in with this, and you're showing me the two devlink commands we already talked about in the past.
On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote: > > Just to refresh all our memory, we discussed and settled on the flow > > in [2]; RFC [1] followed this discussion. > > > > vdpa tool of [3] can add one or more vdpa device(s) on top of already > > spawned PF, VF, SF device. > > Nack for the networking part of that. It'd basically be VMDq. What are you NAK'ing? It is consistent with the multi-subsystem device sharing model we've had for ages now. The physical ethernet port is shared between multiple accelerator subsystems. netdev gets its slice of traffic, so does RDMA, iSCSI, VDPA, etc. Jason
Hi Jakub,

> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Tuesday, November 17, 2020 10:41 PM
>
> On Tue, 17 Nov 2020 04:08:57 +0000 Parav Pandit wrote:
> > > On Mon, 16 Nov 2020 16:06:02 -0800 Saeed Mahameed wrote:
> > > > > > Subfunction support is discussed in detail in RFC [1] and [2].
> > > > > > RFC [1] and extension [2] describes requirements, design, and
> > > > > > proposed plumbing using devlink, auxiliary bus and sysfs for
> > > > > > systemd/udev support.
> > > > >
> > > > > So we're going to have two ways of adding subdevs? Via devlink
> > > > > and via the new vdpa netlink thing?
> > Nope.
> > Subfunctions (subdevs) are added only one way, i.e. devlink port as
> > settled in RFC [1].
> >
> > Just to refresh all our memory, we discussed and settled on the flow
> > in [2]; RFC [1] followed this discussion.
> >
> > vdpa tool of [3] can add one or more vdpa device(s) on top of already
> > spawned PF, VF, SF device.
>
> Nack for the networking part of that. It'd basically be VMDq.

Can you please clarify which networking part you mean? Which patches
exactly in this patchset?

> > > > Via devlink you add the sub-function bus device - think of it as
> > > > spawning a new VF - but it has no actual characteristics
> > > > (netdev/vdpa/rdma) "yet", until the user/admin decides to load an
> > > > interface on it via aux sysfs.
> > >
> > > By which you mean it doesn't get probed or the device type is not
> > > set (IOW it can still become a block device or netdev depending on
> > > the vdpa request)?
> > > > Basically devlink adds a new eswitch port (the SF port); loading
> > > > the drivers and the interfaces is done via the auxbus subsystem
> > > > only after the SF is spawned by FW.
> > >
> > > But why?
> > >
> > > Is this for the SmartNIC / bare metal case? The flow for spawning on
> > > the local host gets highly convoluted.
> >
> > The flow of spawning for (a) the local host or (b) an external host
> > controller from the smartnic is the same:
> >
> > $ devlink port add ...
> > [..]
> > followed by
> > $ devlink port function set state ...
> >
> > The only change would be to specify the destination where to spawn it
> > (controller number, pf, sf num etc). Please refer to the detailed
> > examples in the individual patches; patches 12 and 13 mostly cover
> > the complete view.
>
> Please share full examples of the workflow.

Please find the full example sequence below, taken from this cover
letter and from the respective patches 12 and 13.

Change the device to switchdev mode:
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

Add a devlink port of subfunction flavour:
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88

Configure the mac address of the port function (existing API):
$ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88

Now activate the function:
$ devlink port function set ens2f0npf0sf88 state active

Now use the auxiliary device and class devices:
$ devlink dev show
pci/0000:06:00.0
auxiliary/mlx5_core.sf.4

$ devlink port show auxiliary/mlx5_core.sf.4/1
auxiliary/mlx5_core.sf.4/1: type eth netdev p0sf88 flavour virtual port 0 splittable false

$ ip link show
127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
    altname enp6s0f0np0
129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff

$ rdma dev show
43: rdmap6s0f0: node_type ca fw 16.28.1002 node_guid 248a:0703:00b3:d112 sys_image_guid 248a:0703:00b3:d112
44: mlx5_0: node_type ca fw 16.28.1002 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112

At this point the vdpa tool of [1] can create one or more vdpa net
devices on this subfunction device, in the below sequence:

$ vdpa parentdev list
auxiliary/mlx5_core.sf.4
  supported_classes
    net

$ vdpa dev add parentdev auxiliary/mlx5_core.sf.4 type net name foo0

$ vdpa dev show foo0
foo0: parentdev auxiliary/mlx5_core.sf.4 type network parentdev vdpasim vendor_id 0 max_vqs 2 max_vq_size 256

> I'm asking how the vdpa API fits in with this, and you're showing me
> the two devlink commands we already talked about in the past.

Oh ok, sorry, my bad. I understand your question now about the relation
of the vdpa commands to this. Please look at the above example sequence,
which covers the vdpa example also.

[1] https://lore.kernel.org/netdev/20201112064005.349268-1-parav@nvidia.com/
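The teardown path is the mirror of the sequence above. A minimal sketch,
assuming the subfunction port was assigned a devlink port index at add time
(the index shown is an example, not taken from the patches):

$ devlink port function set ens2f0npf0sf88 state inactive
$ devlink port del pci/0000:06:00.0/32769

Once the function is inactive, the auxiliary device
(auxiliary/mlx5_core.sf.4) and the netdev/rdma/vdpa interfaces on top of it
go away, and the port del removes the eswitch port together with its
representor.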
On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote: > On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote: > > > > Just to refresh all our memory, we discussed and settled on the flow > > > in [2]; RFC [1] followed this discussion. > > > > > > vdpa tool of [3] can add one or more vdpa device(s) on top of already > > > spawned PF, VF, SF device. > > > > Nack for the networking part of that. It'd basically be VMDq. > > What are you NAK'ing? Spawning multiple netdevs from one device by slicing up its queues. > It is consistent with the multi-subsystem device sharing model we've > had for ages now. > > The physical ethernet port is shared between multiple accelerator > subsystems. netdev gets its slice of traffic, so does RDMA, iSCSI, > VDPA, etc. Right, devices of other subsystems are fine, I don't care. Sorry for not being crystal clear but quite frankly IDK what else can be expected from me given the submissions have little to no context and documentation. This comes up every damn time with the SF patches, I'm tired of having to ask for a basic workflow.
On Tue, 17 Nov 2020 18:50:57 +0000 Parav Pandit wrote: > At this point vdpa tool of [1] can create one or more vdpa net devices on this subfunction device in below sequence. > > $ vdpa parentdev list > auxiliary/mlx5_core.sf.4 > supported_classes > net > > $ vdpa dev add parentdev auxiliary/mlx5_core.sf.4 type net name foo0 > > $ vdpa dev show foo0 > foo0: parentdev auxiliary/mlx5_core.sf.4 type network parentdev vdpasim vendor_id 0 max_vqs 2 max_vq_size 256 > > > I'm asking how the vdpa API fits in with this, and you're showing me the two > > devlink commands we already talked about in the past. > Oh ok, sorry, my bad. I understood your question now about relation of vdpa commands with this. > Please look at the above example sequence that covers the vdpa example also. > > [1] https://lore.kernel.org/netdev/20201112064005.349268-1-parav@nvidia.com/ I think the biggest missing piece in my understanding is what's the technical difference between an SF and a VDPA device. Isn't a VDPA device an SF with a particular descriptor format for the queues?
On 11/18/20 7:14 PM, Jakub Kicinski wrote: > On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote: >> On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote: >> >>>> Just to refresh all our memory, we discussed and settled on the flow >>>> in [2]; RFC [1] followed this discussion. >>>> >>>> vdpa tool of [3] can add one or more vdpa device(s) on top of already >>>> spawned PF, VF, SF device. >>> >>> Nack for the networking part of that. It'd basically be VMDq. >> >> What are you NAK'ing? > > Spawning multiple netdevs from one device by slicing up its queues. Why do you object to that? Slicing up h/w resources for virtual what ever has been common practice for a long time.
On Wed, 2020-11-18 at 21:35 -0700, David Ahern wrote:
> On 11/18/20 7:14 PM, Jakub Kicinski wrote:
> > On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:
> > > On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:
> > > >
> > > > > Just to refresh all our memory, we discussed and settled on
> > > > > the flow in [2]; RFC [1] followed this discussion.
> > > > >
> > > > > vdpa tool of [3] can add one or more vdpa device(s) on top of
> > > > > already spawned PF, VF, SF device.
> > > >
> > > > Nack for the networking part of that. It'd basically be VMDq.
> > >
> > > What are you NAK'ing?
> >
> > Spawning multiple netdevs from one device by slicing up its queues.
>
> Why do you object to that? Slicing up h/w resources for virtual what
> ever has been common practice for a long time.

We are not slicing up any queues. From our HW and FW perspective
SF == VF literally: a full blown HW slice (function), with isolated
control and data plane of its own. This is very different from VMDq,
and more generic and secure. An SF device is exactly like a VF; it
doesn't steal or share any HW resources or control/data path with
others. SF is basically SRIOV done right.

This series has nothing to do with netdev. If you look at the list of
files Parav is touching, there is 0 change in our netdev stack :) ..
All Parav is doing is adding the API to create/destroy SFs and
representing the low level SF function to devlink as a device, just
like a VF.
On Wed, 2020-11-18 at 18:14 -0800, Jakub Kicinski wrote:
> On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:
> >
> > It is consistent with the multi-subsystem device sharing model we've
> > had for ages now.
> >
> > The physical ethernet port is shared between multiple accelerator
> > subsystems. netdev gets its slice of traffic, so does RDMA, iSCSI,
> > VDPA, etc.

Not just a slice of traffic, a whole HW domain.

> Right, devices of other subsystems are fine, I don't care.

But a netdev will be loaded on an SF automatically just through the
current driver design and modularity, since SF == VF and our netdev is
abstract and doesn't know if it runs on a PF/VF/SF .. we would literally
have to add code to not load a netdev on an SF. Why? :/

> Sorry for not being crystal clear but quite frankly IDK what else can
> be expected from me given the submissions have little to no context
> and documentation. This comes up every damn time with the SF patches,
> I'm tired of having to ask for a basic workflow.

From how this discussion is going, I think you are right: we need to
clarify what we are doing in some more high level, simplified and
generic documentation to give some initial context. Parav, let's add
the missing documentation. We can also add some comments regarding how
this is very different from VMDq, but I would like to avoid that, since
it is different in almost every way :) ..
On Wed, 2020-11-18 at 18:23 -0800, Jakub Kicinski wrote:
> On Tue, 17 Nov 2020 18:50:57 +0000 Parav Pandit wrote:
> > At this point vdpa tool of [1] can create one or more vdpa net
> > devices on this subfunction device in below sequence.
> >
> > $ vdpa parentdev list
> > auxiliary/mlx5_core.sf.4
> >   supported_classes
> >     net
> >
> > $ vdpa dev add parentdev auxiliary/mlx5_core.sf.4 type net name foo0
> >
> > $ vdpa dev show foo0
> > foo0: parentdev auxiliary/mlx5_core.sf.4 type network parentdev
> > vdpasim vendor_id 0 max_vqs 2 max_vq_size 256
> >
> > > I'm asking how the vdpa API fits in with this, and you're showing
> > > me the two devlink commands we already talked about in the past.
> > Oh ok, sorry, my bad. I understand your question now about the
> > relation of the vdpa commands to this. Please look at the above
> > example sequence, which covers the vdpa example also.
> >
> > [1] https://lore.kernel.org/netdev/20201112064005.349268-1-parav@nvidia.com/
>
> I think the biggest missing piece in my understanding is what's the
> technical difference between an SF and a VDPA device.

Same difference as between a VF and a netdev.
SF == VF, so a full HW function.
VDPA/RDMA/netdev/SCSI/nvme/etc. are just interfaces (ULPs) sharing the
same functions, as it has always been; nothing new about this.

Today on a VF we load RDMA/VDPA/netdev interfaces; an SF will do exactly
the same, and the ULPs will simply load - we don't need to modify them.

> Isn't a VDPA device an SF with a particular descriptor format for the
> queues?

No :/, I hope the above answer clarifies things a bit.
An SF is a device function that provides all kinds of queues.
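To make the "ULPs simply load" point concrete, here is a minimal sketch of
what the host side might show once the SF is activated and mlx5_core has
spawned its per-ULP auxiliary devices. The exact auxiliary device names are
assumptions for illustration, not output taken from the patches:

$ ls /sys/bus/auxiliary/devices/
mlx5_core.sf.4  mlx5_core.eth.2  mlx5_core.rdma.2

Each ULP (netdev, RDMA, vdpa) binds to its own auxiliary device exactly as
it does on a PF or VF, which is why no ULP changes are needed for SFs.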
> From: Saeed Mahameed <saeed@kernel.org> > Sent: Thursday, November 19, 2020 11:42 AM > > From how this discussion is going, i think you are right, we need to clarify > what we are doing in a more high level simplified and generic documentation > to give some initial context, Parav, let's add the missing documentation, we > can also add some comments regarding how this is very different from > VMDq, but i would like to avoid that, since it is different in almost every way:) Sure I will add Documentation/networking/subfunction.rst in v2 describing subfunction details.
On Wed, Nov 18, 2020 at 10:22:51PM -0800, Saeed Mahameed wrote:
> > I think the biggest missing piece in my understanding is what's the
> > technical difference between an SF and a VDPA device.
>
> Same difference as between a VF and a netdev.
> SF == VF, so a full HW function.
> VDPA/RDMA/netdev/SCSI/nvme/etc. are just interfaces (ULPs) sharing the
> same functions, as it has always been; nothing new about this.

All the implementation details are very different, but this white paper
from Intel goes into some detail on the basic elements and rationale for
the SF concept:

https://software.intel.com/content/dam/develop/public/us/en/documents/intel-scalable-io-virtualization-technical-specification.pdf

What we are calling a sub-function here is a close cousin to what Intel
calls an Assignable Device Interface. I expect to see other drivers
following this general pattern eventually.

A SF will eventually be assignable to a VM and the VM won't be able to
tell the difference between a VF or SF providing the assignable PCI
resources.

VDPA is also assignable to a guest, but the key difference between
mlx5's SF and VDPA is what guest driver binds to the virtual PCI
function. For a SF the guest will bind mlx5_core, for VDPA the guest
will bind virtio-net.

So, the driver stack for a VM using VDPA might be

 Physical device [pci] -> mlx5_core -> [aux] -> SF -> [aux] ->
   mlx5_core -> [aux] -> mlx5_vdpa -> QEMU -> |VM| -> [pci] -> virtio_net

When Parav is talking about creating VDPA devices he means attaching the
VDPA accelerator subsystem to a mlx5_core, wherever that mlx5_core might
be attached to.

To your other remark:

> > What are you NAK'ing?
> Spawning multiple netdevs from one device by slicing up its queues.

This is a bit vague. In SRIOV a device spawns multiple netdevs for a
physical port by "slicing up its physical queues" - where do you see the
cross over between VMDq (bad) and SRIOV (ok)?

I thought the issue with VMDq was more the horrid management needed to
configure the traffic splitting, not the actual splitting itself?

In classic SRIOV the traffic is split by a simple non-configurable HW
switch based on the MAC address of the VF.

mlx5 already has the extended version of that idea: we can run in
switchdev mode and use switchdev to configure the HW switch. Now
configurable switchdev rules split the traffic for VFs.

This SF step replaces the VF in the above, but everything else is the
same. The switchdev still splits the traffic, it still ends up in the
same nested netdev queue structure & RSS a VF/PF would use, etc, etc. No
queues are "stolen" to create the nested netdev.

From the driver perspective there is no significant difference between
sticking a netdev on a mlx5 VF or sticking a netdev on a mlx5 SF. A SF
netdev is not going in and doing deep surgery on the PF netdev to steal
queues or something.

Both VF and SF will eventually be assignable to guests, both can support
all the accelerator subsystems - VDPA, RDMA, etc. Both can support
netdev.

Compared to VMDq, I think it is really no comparison. SF/ADI is an
evolution of a SRIOV VF from something PCI-SIG controlled to something
device specific and lighter weight. SF/ADI come with an architectural
security boundary suitable for assignment to an untrusted guest. It is
not just a jumble of queues. VMDq is .. not that.

Actually it has been one of the open debates in the virtualization
userspace world. The approach of using switchdev to control the traffic
splitting to VMs is elegant, but many drivers are not following this
design. :(

Finally, in the mlx5 model VDPA is just an "application". It asks the
device to create a 'RDMA' raw ethernet packet QP that uses rings formed
per the virtio-net specification. We can create it in the kernel using
mlx5_vdpa, and we can create it in userspace through the RDMA subsystem.
Like any "RDMA" application it is contained by the security boundary of
the PF/VF/SF the mlx5_core is running on.

Jason
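As a concrete illustration of the "VDPA is just an application" point, a
minimal sketch of instantiating the in-kernel variant on top of an
already-active SF; mlx5_vdpa and the vdpa tool are the ones referenced in
this thread, while the vdpa device name is a placeholder:

$ modprobe mlx5_vdpa
$ vdpa dev add parentdev auxiliary/mlx5_core.sf.4 type net name vdpa0

The resulting vdpa device is exposed to a guest through
vhost-vdpa/virtio-vdpa and QEMU, while the userspace path mentioned above
goes through the RDMA subsystem instead; in both cases the traffic is still
confined by the SF's representor and the switch rules.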
On Wed, 18 Nov 2020 21:35:29 -0700 David Ahern wrote: > On 11/18/20 7:14 PM, Jakub Kicinski wrote: > > On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote: > >> On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote: > >> > >>>> Just to refresh all our memory, we discussed and settled on the flow > >>>> in [2]; RFC [1] followed this discussion. > >>>> > >>>> vdpa tool of [3] can add one or more vdpa device(s) on top of already > >>>> spawned PF, VF, SF device. > >>> > >>> Nack for the networking part of that. It'd basically be VMDq. > >> > >> What are you NAK'ing? > > > > Spawning multiple netdevs from one device by slicing up its queues. > > Why do you object to that? Slicing up h/w resources for virtual what > ever has been common practice for a long time. My memory of the VMDq debate is hazy, let me rope in Alex into this. I believe the argument was that we should offload software constructs, not create HW-specific APIs which depend on HW availability and implementation. So the path we took was offloading macvlan.
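For context, the offloaded-macvlan path referred to here looks roughly like
the sketch below; "eth0" is a placeholder for the PF netdev and the offload
only takes effect on hardware that implements L2 forwarding offload:

$ ethtool -K eth0 l2-fwd-offload on
$ ip link add link eth0 name macvlan0 type macvlan mode bridge
$ ip link set macvlan0 up

The macvlan netdev is a software construct; with the offload enabled the
driver backs it with a dedicated hardware queue set (the VMDq-style slice),
which is the model being contrasted with subfunctions in this thread.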
On Wed, 18 Nov 2020 21:57:57 -0800 Saeed Mahameed wrote: > On Wed, 2020-11-18 at 21:35 -0700, David Ahern wrote: > > On 11/18/20 7:14 PM, Jakub Kicinski wrote: > > > On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote: > > > > On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote: > > > > > > > > > > Just to refresh all our memory, we discussed and settled on > > > > > > the flow > > > > > > in [2]; RFC [1] followed this discussion. > > > > > > > > > > > > vdpa tool of [3] can add one or more vdpa device(s) on top of > > > > > > already > > > > > > spawned PF, VF, SF device. > > > > > > > > > > Nack for the networking part of that. It'd basically be VMDq. > > > > > > > > What are you NAK'ing? > > > > > > Spawning multiple netdevs from one device by slicing up its queues. > > > > Why do you object to that? Slicing up h/w resources for virtual what > > ever has been common practice for a long time. > > We are not slicing up any queues, from our HW and FW perspective SF == > VF literally, a full blown HW slice (Function), with isolated control > and data plane of its own, this is very different from VMDq and more > generic and secure. an SF device is exactly like a VF, doesn't steal or > share any HW resources or control/data path with others. SF is > basically SRIOV done right. > > this series has nothing to do with netdev, if you look at the list of > files Parav is touching, there is 0 change in our netdev stack :) .. > all Parav is doing is adding the API to create/destroy SFs and > represents the low level SF function to devlink as a device, just > like a VF. Ack, the concern is about the vdpa, not SF. So not really this patch set.
On Wed, 18 Nov 2020 22:12:22 -0800 Saeed Mahameed wrote: > > Right, devices of other subsystems are fine, I don't care. > > But a netdev will be loaded on SF automatically just through the > current driver design and modularity, since SF == VF and our netdev is > abstract and doesn't know if it runs on a PF/VF/SF .. we literally have > to add code to not load a netdev on a SF. why ? :/ A netdev is fine, but the examples so far don't make it clear (to me) if it's expected/supported to spawn _multiple_ netdevs from a single "vdpa parentdev".
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Friday, November 20, 2020 7:05 AM
>
> On Wed, 18 Nov 2020 22:12:22 -0800 Saeed Mahameed wrote:
> > > Right, devices of other subsystems are fine, I don't care.
> >
> > But a netdev will be loaded on an SF automatically just through the
> > current driver design and modularity, since SF == VF and our netdev
> > is abstract and doesn't know if it runs on a PF/VF/SF .. we would
> > literally have to add code to not load a netdev on an SF. Why? :/
>
> A netdev is fine, but the examples so far don't make it clear (to me)
> if it's expected/supported to spawn _multiple_ netdevs from a single
> "vdpa parentdev".

We do not create a netdev from the vdpa parentdev. From the vdpa
parentdev, only vdpa device(s) are created, each of which is a
'struct device' residing in /sys/bus/vdpa/<device>.

Currently such a vdpa device is already created on mlx5_vdpa.ko driver
load; however, the user has no way to inspect it, read its stats, or
get/set its features - hence the vdpa tool.
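For reference, a minimal sketch of what this looks like from sysfs once a
vdpa device exists; the device name is an assumption for illustration, and
the bus drivers shown depend on which modules are loaded:

$ ls /sys/bus/vdpa/devices/
vdpa0
$ ls /sys/bus/vdpa/drivers/
vhost_vdpa  virtio_vdpa

The vdpa tool adds a netlink interface on top of this bus so such devices
can be created, inspected and configured explicitly, rather than appearing
implicitly at driver load.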
On Thu, 19 Nov 2020 10:00:17 -0400 Jason Gunthorpe wrote: > Finally, in the mlx5 model VDPA is just an "application". It asks the > device to create a 'RDMA' raw ethernet packet QP that is uses rings > formed in the virtio-net specification. We can create it in the kernel > using mlx5_vdpa, and we can create it in userspace through the RDMA > subsystem. Like any "RDMA" application it is contained by the security > boundary of the PF/VF/SF the mlx5_core is running on. Thanks for the write up! The SF part is pretty clear to me, it is what it is. DPDK camp has been pretty excited about ADI/PASID for a while now. The part that's blurry to me is VDPA. I was under the impression that for VDPA the device is supposed to support native virtio 2.0 (or whatever the "HW friendly" spec was). I believe that's what the early patches from Intel did. You're saying it's a client application like any other - do I understand it right that the hypervisor driver will be translating descriptors between virtio and device-native then? The vdpa parent is in the hypervisor correct? Can a VDPA device have multiple children of the same type? Why do we have a representor for a SF, if the interface is actually VDPA? Block and net traffic can't reasonably be treated the same by the switch. Also I'm confused how block device can bind to mlx5_core - in that case I'm assuming the QP is bound 1:1 with a QP on the SmartNIC side, and that QP is plugged into an appropriate backend?
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Friday, November 20, 2020 9:05 AM
>
> On Thu, 19 Nov 2020 10:00:17 -0400 Jason Gunthorpe wrote:
> > Finally, in the mlx5 model VDPA is just an "application". It asks the
> > device to create a 'RDMA' raw ethernet packet QP that uses rings
> > formed per the virtio-net specification. We can create it in the
> > kernel using mlx5_vdpa, and we can create it in userspace through the
> > RDMA subsystem. Like any "RDMA" application it is contained by the
> > security boundary of the PF/VF/SF the mlx5_core is running on.
>
> Thanks for the write up!
>
> The SF part is pretty clear to me, it is what it is. DPDK camp has been
> pretty excited about ADI/PASID for a while now.
>
> The part that's blurry to me is VDPA.
>
> I was under the impression that for VDPA the device is supposed to
> support native virtio 2.0 (or whatever the "HW friendly" spec was).
>
> I believe that's what the early patches from Intel did.
>
> You're saying it's a client application like any other - do I
> understand it right that the hypervisor driver will be translating
> descriptors between virtio and device-native then?

The mlx5 device supports virtio descriptors natively, so no translation
is needed.

> The vdpa parent is in the hypervisor correct?

Yep.

> Can a VDPA device have multiple children of the same type?

I guess you mean the vdpa parentdev? If so, yes; however at present we
see only a one-to-one mapping of vdpa device to parent dev.

> Why do we have a representor for a SF, if the interface is actually
> VDPA?

Because vdpa is just one client out of multiple. At the moment there is
a one-to-one relation of vdpa device to a SF/VF.

> Block and net traffic can't reasonably be treated the same by the
> switch.
>
> Also I'm confused how block device can bind to mlx5_core - in that case
> I'm assuming the QP is bound 1:1 with a QP on the SmartNIC side, and
> that QP is plugged into an appropriate backend?

So far there is no mlx5 vdpa block driver, nor a plan to do block. But
yes, in the future, for block, it would need to bind to a QP in a
backend on the smartnic.
On Thu, Nov 19, 2020 at 07:35:26PM -0800, Jakub Kicinski wrote:
> On Thu, 19 Nov 2020 10:00:17 -0400 Jason Gunthorpe wrote:
> > Finally, in the mlx5 model VDPA is just an "application". It asks the
> > device to create a 'RDMA' raw ethernet packet QP that uses rings
> > formed per the virtio-net specification. We can create it in the
> > kernel using mlx5_vdpa, and we can create it in userspace through the
> > RDMA subsystem. Like any "RDMA" application it is contained by the
> > security boundary of the PF/VF/SF the mlx5_core is running on.
>
> Thanks for the write up!

No problem!

> The part that's blurry to me is VDPA.

Okay, I think I see where the gap is; I'm going to elaborate below so we
are clear.

> I was under the impression that for VDPA the device is supposed to
> support native virtio 2.0 (or whatever the "HW friendly" spec was).

I think VDPA covers a wide range of things. The basic idea is that,
starting with the all-SW virtio-net implementation, we can move parts to
HW. Each implementation will probably be a little different here. The
kernel vdpa subsystem is a toolbox to mix the required emulation and HW
capability to build a virtio-net PCI interface.

The key question to ask of any VDPA design is "what does the VDPA FW do
with the packet once the HW accelerator has parsed the virtio-net
descriptor?". The VDPA world has refused to agree on this due to vendor
squabbling, but mlx5 has a clear answer:

  VDPA Tx generates an ethernet packet and sends it out the SF/VF port
  through a tunnel to the representor and then on to the switchdev.

Other VDPA designs have a different answer!!

This concept is so innate to how Mellanox views the world that it is not
surprising the cover letters and patch descriptions don't belabor this
point much :)

I'm going to deep dive through this answer below. I think you'll see
this is the most sane and coherent architecture with the tools available
in netdev.. Mellanox thinks the VDPA world should standardize on this
design so we can have a standard control plane.

> You're saying it's a client application like any other - do I
> understand it right that the hypervisor driver will be translating
> descriptors between virtio and device-native then?

No, the hypervisor creates a QP and tells the HW that this QP's
descriptor format follows virtio-net. The QP processes those descriptors
in HW and generates ethernet packets.

A "client application like any other" means that the ethernet packets
VDPA forms are identical to the ones netdev or RDMA forms. They are all
delivered into the tunnel on the SF/VF to the representor and on to the
switch. See below.

> The vdpa parent is in the hypervisor correct?
>
> Can a VDPA device have multiple children of the same type?

I'm not sure parent/child are good words here. The VDPA emulation runs
in the hypervisor, and the virtio-net netdev driver runs in the guest.
The VDPA is attached to a switchdev port and representor tunnel by
virtue of its QPs being created under a SF/VF.

If we imagine a virtio-rdma, then you might have a SF/VF hosting both
VDPA and VDPA-RDMA, which emulate two PCI devices assigned to a VM. Both
of these peer virtios would generate ethernet packets for TX on the
SF/VF port into the tunnel through the representor and to the switch.

> Why do we have a representor for a SF, if the interface is actually
> VDPA? Block and net traffic can't reasonably be treated the same by
> the switch.

I think you are focusing on queues; the architecture at PF/SF/VF is not
queue based, it is packet based.

At the physical mlx5 device the netdev has a switchdev. On that switch I
can create a *switch port*. The switch port is composed of a representor
and a SF/VF. They form a tunnel for packets.

The representor is the hypervisor side of the tunnel and carries all
packets coming out of and into the SF/VF.

The SF/VF is the guest side of the tunnel and has a full NIC. The SF/VF
can be:
 - Used in the same OS as the switch
 - Assigned to a guest VM as a PCI device
 - Assigned to another processor in the SmartNIC case

In all cases, if I use a queue on a SF/VF to generate an ethernet packet
then that packet *always* goes into the tunnel to the representor and
goes into a switch. It is always contained by any rules on the switch
side. If the switch is set so the representor is VLAN tagged then a
queue on a SF/VF *cannot* escape the VLAN tag.

Similarly a SF/VF cannot Rx any packets that are not sent into the
tunnel, meaning the switch controls what packets go into the
representor, through the tunnel and to the SF.

Yes, block and net traffic are all reduced to ethernet packets, sent
through the tunnel to the representor and treated by the switch. It is
no different than a physical switch. If there is to be some net/block
difference it has to be represented in the ethernet packets, eg with
vlan or something.

This is the fundamental security boundary of the architecture. The SF/VF
is a security domain and the only exchange of information from that
security domain to the hypervisor security domain is the tunnel to the
representor. The exchange across the boundary is only *packets*, not
queues.

Essentially it exactly models the physical world. If I physically plug a
NIC into a switch then the "representor" is the switch port in the
physical switch OS and the "SF/VF" is the NIC in the server. The switch
OS does not know or care what the NIC is doing. It does not know or care
if the NIC is doing VDPA, or if the packets are "block" or "net" - they
are all just packets by the time it gets to switching.

> Also I'm confused how block device can bind to mlx5_core - in that case
> I'm assuming the QP is bound 1:1 with a QP on the SmartNIC side, and
> that QP is plugged into an appropriate backend?

Every mlx5_core is a full multi-queue instance. It can have a huge
number of queues with no problems. Do not focus on the queues; *queues*
are irrelevant here.

Queues always have two ends. In this model one end is at the CPU and the
other is just ethernet packets. The purpose of the queue is to convert
CPU stuff into ethernet packets and vice versa. A mlx5 device has a wide
range of accelerators built into the queues that can do all sorts of
transformations between CPU and packets.

A queue can only be attached to a single mlx5_core, meaning all the
ethernet packets the queue sources/sinks must come from the PF/SF/VF
port. For SF/VF this port is connected to a tunnel to a representor to
the switch. Thus every queue has its packet side connected to the
switch.

However, the *queue* is an opaque detail of how the ethernet packets are
created from CPU data. It doesn't matter if the queue is running VDPA,
RDMA, netdev, or block traffic - all of these things inherently result
in ethernet packets, and the hypervisor can't tell how the packet was
created.

The architecture is *not* like virtio. virtio queues are individual
tunnels between hypervisor and guest.

This is the key detail: a VDPA queue is *not a tunnel*. It is an engine
to convert CPU data in virtio-net format to ethernet packets and deliver
those packets to the SF/VF end of the tunnel, to the representor and
then to the switch. The tunnel is the SF/VF and representor pairing, NOT
the VDPA queue.

Looking at the logical life of a Tx packet from a VM doing VDPA:
 - The VM's netdev builds the skb and writes a virtio-net formatted
   descriptor to a send queue
 - The VM triggers a doorbell via a write to a BAR. In mlx5 this write
   goes to the device - qemu mmaps part of the device BAR to the guest
 - The HW begins processing a queue. The queue is in virtio-net format,
   so it fetches the descriptor and now has the skb data
 - The HW forms the skb into an ethernet packet and delivers it to the
   representor through the tunnel, which immediately sends it to the HW
   switch. The VDPA QP in the SF/VF is now done.
 - In the switch the HW determines the packet is an exception. It
   applies RSS rules/etc and dynamically identifies, on a per-packet
   basis, what hypervisor queue the packet should be delivered to. This
   queue is in the hypervisor, and is in mlx5 native format.
 - The chosen hypervisor queue receives this packet and begins
   processing. It gets a receive buffer, writes the packet and triggers
   an interrupt. This queue is now done.
 - The hypervisor netdev now has the packet. It does the exception path
   in netdev and puts the SKB back on another queue for TX to the
   physical port. This queue is in mlx5 native format; the packet goes
   to the physical port.

It traversed three queues. The HW dynamically selected the hypervisor
queue the VDPA packet is delivered to based *entirely* on switch rules.
The originating queue only informs the switch of what SF/VF (and thus
switch port) generated the packet. At no point does the hypervisor know
the packet originated from a VDPA QP.

The RX side is similar: each PF/SF/VF port has a selector that chooses
which queue each packet goes to. That chooses how the packet is
converted to CPU. Each PF/SF/VF can have a huge number of selectors, and
SF/VFs source their packets from the logical tunnel attached to a
representor, which receives packets from the switch.

The selector is how the cross-subsystem sharing of the ethernet port
works, regardless of PF/SF/VF. Again, the hypervisor side has *no idea*
what queue the packet will be selected to when it delivers the packet to
the representor side of the tunnel.

Jason
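To make the "contained by rules on the switch side" point concrete, a
minimal sketch of pinning an SF to a VLAN purely from the hypervisor side
with tc on the representor; the representor (ens2f0npf0sf88) and uplink
(ens2f0np0) names are reused from the earlier examples, and the VLAN id is a
placeholder:

$ tc qdisc add dev ens2f0npf0sf88 ingress
$ tc qdisc add dev ens2f0np0 ingress
$ tc filter add dev ens2f0npf0sf88 parent ffff: prio 1 flower \
    action vlan push id 10 action mirred egress redirect dev ens2f0np0
$ tc filter add dev ens2f0np0 parent ffff: prio 1 protocol 802.1q flower vlan_id 10 \
    action vlan pop action mirred egress redirect dev ens2f0npf0sf88

Every packet a queue on the SF generates - whether it came from the netdev,
RDMA or the VDPA QP - hits these rules at the representor, so the guest
cannot escape the VLAN regardless of which accelerator produced the packet.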
On Thu, Nov 19, 2020 at 5:29 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 18 Nov 2020 21:35:29 -0700 David Ahern wrote:
> > On 11/18/20 7:14 PM, Jakub Kicinski wrote:
> > > On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:
> > >> On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:
> > >>
> > >>>> Just to refresh all our memory, we discussed and settled on the flow
> > >>>> in [2]; RFC [1] followed this discussion.
> > >>>>
> > >>>> vdpa tool of [3] can add one or more vdpa device(s) on top of already
> > >>>> spawned PF, VF, SF device.
> > >>>
> > >>> Nack for the networking part of that. It'd basically be VMDq.
> > >>
> > >> What are you NAK'ing?
> > >
> > > Spawning multiple netdevs from one device by slicing up its queues.
> >
> > Why do you object to that? Slicing up h/w resources for virtual what
> > ever has been common practice for a long time.
>
> My memory of the VMDq debate is hazy, let me rope in Alex into this.
> I believe the argument was that we should offload software constructs,
> not create HW-specific APIs which depend on HW availability and
> implementation. So the path we took was offloading macvlan.

I think it somewhat depends on the type of interface we are talking
about. What we were wanting to avoid was drivers spawning their own
unique VMDq netdevs, each having a different way of doing it. The
approach Intel went with was to use a MACVLAN offload, although I would
imagine many would argue that approach is somewhat dated and limiting,
since you cannot do many offloads on a MACVLAN interface.

With the VDPA case I believe there is a set of predefined virtio devices
that are being emulated and presented, so it isn't as if they are
creating a totally new interface for this.

What I would be interested in seeing is whether any other vendors have
reviewed this and signed off on this approach. What we don't want to see
is Nvidia/Mellanox do this one way, and then Broadcom or Intel come
along later and have yet another way of doing it. We need an interface
and feature set that will work for everyone in terms of how this will
look going forward.
On 11/20/2020 9:58 AM, Alexander Duyck wrote: > On Thu, Nov 19, 2020 at 5:29 PM Jakub Kicinski <kuba@kernel.org> wrote: >> >> On Wed, 18 Nov 2020 21:35:29 -0700 David Ahern wrote: >>> On 11/18/20 7:14 PM, Jakub Kicinski wrote: >>>> On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote: >>>>> On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote: >>>>> >>>>>>> Just to refresh all our memory, we discussed and settled on the flow >>>>>>> in [2]; RFC [1] followed this discussion. >>>>>>> >>>>>>> vdpa tool of [3] can add one or more vdpa device(s) on top of already >>>>>>> spawned PF, VF, SF device. >>>>>> >>>>>> Nack for the networking part of that. It'd basically be VMDq. >>>>> >>>>> What are you NAK'ing? >>>> >>>> Spawning multiple netdevs from one device by slicing up its queues. >>> >>> Why do you object to that? Slicing up h/w resources for virtual what >>> ever has been common practice for a long time. >> >> My memory of the VMDq debate is hazy, let me rope in Alex into this. >> I believe the argument was that we should offload software constructs, >> not create HW-specific APIs which depend on HW availability and >> implementation. So the path we took was offloading macvlan. > > I think it somewhat depends on the type of interface we are talking > about. What we were wanting to avoid was drivers spawning their own > unique VMDq netdevs and each having a different way of doing it. The > approach Intel went with was to use a MACVLAN offload to approach it. > Although I would imagine many would argue the approach is somewhat > dated and limiting since you cannot do many offloads on a MACVLAN > interface. Yes. We talked about this at netdev 0x14 and the limitations of macvlan based offloads. https://netdevconf.info/0x14/session.html?talk-hardware-acceleration-of-container-networking-interfaces Subfunction seems to be a good model to expose VMDq VSI or SIOV ADI as a netdev for kernel containers. AF_XDP ZC in a container is one of the usecase this would address. Today we have to pass the entire PF/VF to a container to do AF_XDP. Looks like the current model is to create a subfunction of a specific type on auxiliary bus, do some configuration to assign resources and then activate the subfunction. > > With the VDPA case I believe there is a set of predefined virtio > devices that are being emulated and presented so it isn't as if they > are creating a totally new interface for this. > > What I would be interested in seeing is if there are any other vendors > that have reviewed this and sign off on this approach. What we don't > want to see is Nivida/Mellanox do this one way, then Broadcom or Intel > come along later and have yet another way of doing this. We need an > interface and feature set that will work for everyone in terms of how > this will look going forward.
On Fri, 2020-11-20 at 11:04 -0800, Samudrala, Sridhar wrote:
> On 11/20/2020 9:58 AM, Alexander Duyck wrote:
> > On Thu, Nov 19, 2020 at 5:29 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > On Wed, 18 Nov 2020 21:35:29 -0700 David Ahern wrote:
> > > > On 11/18/20 7:14 PM, Jakub Kicinski wrote:
> > > > > On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:
> > > > > > On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:
> > > > > >
> > > > > > > > Just to refresh all our memory, we discussed and settled
> > > > > > > > on the flow in [2]; RFC [1] followed this discussion.
> > > > > > > >
> > > > > > > > vdpa tool of [3] can add one or more vdpa device(s) on
> > > > > > > > top of already spawned PF, VF, SF device.
> > > > > > >
> > > > > > > Nack for the networking part of that. It'd basically be VMDq.
> > > > > >
> > > > > > What are you NAK'ing?
> > > > >
> > > > > Spawning multiple netdevs from one device by slicing up its queues.
> > > >
> > > > Why do you object to that? Slicing up h/w resources for virtual
> > > > what ever has been common practice for a long time.
> > >
> > > My memory of the VMDq debate is hazy, let me rope in Alex into this.
> > > I believe the argument was that we should offload software
> > > constructs, not create HW-specific APIs which depend on HW
> > > availability and implementation. So the path we took was offloading
> > > macvlan.
> >
> > I think it somewhat depends on the type of interface we are talking
> > about. What we were wanting to avoid was drivers spawning their own
> > unique VMDq netdevs and each having a different way of doing it.

Agreed, but SF netdevs are not VMDq netdevs; they are available in the
switchdev model, where they correspond to a full blown port (HW domain).

> > The approach Intel went with was to use a MACVLAN offload to approach
> > it. Although I would imagine many would argue the approach is somewhat
> > dated and limiting since you cannot do many offloads on a MACVLAN
> > interface.
>
> Yes. We talked about this at netdev 0x14 and the limitations of macvlan
> based offloads.
> https://netdevconf.info/0x14/session.html?talk-hardware-acceleration-of-container-networking-interfaces
>
> Subfunction seems to be a good model to expose VMDq VSI or SIOV ADI as a

Exactly. Subfunctions are the most generic model to overcome any SW
model limitations, e.g. macvtap offload. All HW vendors are already
creating netdevs on a given PF/VF .. all we need is to model the SF and
all the rest is the same! Most likely everything else comes for free,
like in the mlx5 model where the netdev/rdma interfaces are abstracted
from the underlying HW; the same netdev loads on a PF/VF/SF or even an
embedded function!

> netdev for kernel containers. AF_XDP ZC in a container is one of the
> usecase this would address. Today we have to pass the entire PF/VF to a
> container to do AF_XDP.

This will be supported out of the box for free with SFs.

> Looks like the current model is to create a subfunction of a specific
> type on auxiliary bus, do some configuration to assign resources and
> then activate the subfunction.
>
> > With the VDPA case I believe there is a set of predefined virtio
> > devices that are being emulated and presented so it isn't as if they
> > are creating a totally new interface for this.
> >
> > What I would be interested in seeing is if there are any other vendors
> > that have reviewed this and signed off on this approach. What we don't
> > want to see is Nvidia/Mellanox do this one way, then Broadcom or Intel
> > come along later and have yet another way of doing this. We need an
> > interface and feature set that will work for everyone in terms of how
> > this will look going forward.

Well, the vdpa interface was created by the virtio community and
especially Red Hat; I am not sure Mellanox was even involved in the
initial development stages :-)

Anyway, historically speaking, vDPA was originally created for DPDK, but
the same API applies to device drivers that can deliver the same set of
queues and API while bypassing the whole DPDK stack. Enter kernel vDPA,
which was created to overcome some of the userspace limitations and
complexity and to leverage some of the kernel's great features such as
eBPF.

https://www.redhat.com/en/blog/introduction-vdpa-kernel-framework
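To illustrate the container use case discussed above, a minimal sketch of
handing an SF netdev to a container network namespace instead of passing the
whole PF/VF; the sfnum, netdev and netns names are placeholders, and the
add/activate steps are the ones from the cover letter:

$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 99
$ devlink port function set ens2f0npf0sf99 state active
$ ip netns add container1
$ ip link set p0sf99 netns container1
$ ip netns exec container1 ip link set p0sf99 up

The container then owns a full function-backed netdev (so AF_XDP zero-copy
can be used inside it), while the representor stays in the host for
switching and tc rules.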
On 2020/11/21 上午3:04, Samudrala, Sridhar wrote: > > > On 11/20/2020 9:58 AM, Alexander Duyck wrote: >> On Thu, Nov 19, 2020 at 5:29 PM Jakub Kicinski <kuba@kernel.org> wrote: >>> >>> On Wed, 18 Nov 2020 21:35:29 -0700 David Ahern wrote: >>>> On 11/18/20 7:14 PM, Jakub Kicinski wrote: >>>>> On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote: >>>>>> On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote: >>>>>> >>>>>>>> Just to refresh all our memory, we discussed and settled on the >>>>>>>> flow >>>>>>>> in [2]; RFC [1] followed this discussion. >>>>>>>> >>>>>>>> vdpa tool of [3] can add one or more vdpa device(s) on top of >>>>>>>> already >>>>>>>> spawned PF, VF, SF device. >>>>>>> >>>>>>> Nack for the networking part of that. It'd basically be VMDq. >>>>>> >>>>>> What are you NAK'ing? >>>>> >>>>> Spawning multiple netdevs from one device by slicing up its queues. >>>> >>>> Why do you object to that? Slicing up h/w resources for virtual what >>>> ever has been common practice for a long time. >>> >>> My memory of the VMDq debate is hazy, let me rope in Alex into this. >>> I believe the argument was that we should offload software constructs, >>> not create HW-specific APIs which depend on HW availability and >>> implementation. So the path we took was offloading macvlan. >> >> I think it somewhat depends on the type of interface we are talking >> about. What we were wanting to avoid was drivers spawning their own >> unique VMDq netdevs and each having a different way of doing it. The >> approach Intel went with was to use a MACVLAN offload to approach it. >> Although I would imagine many would argue the approach is somewhat >> dated and limiting since you cannot do many offloads on a MACVLAN >> interface. > > Yes. We talked about this at netdev 0x14 and the limitations of > macvlan based offloads. > https://netdevconf.info/0x14/session.html?talk-hardware-acceleration-of-container-networking-interfaces > > > Subfunction seems to be a good model to expose VMDq VSI or SIOV ADI as > a netdev for kernel containers. AF_XDP ZC in a container is one of the > usecase this would address. Today we have to pass the entire PF/VF to > a container to do AF_XDP. > > Looks like the current model is to create a subfunction of a specific > type on auxiliary bus, do some configuration to assign resources and > then activate the subfunction. > >> >> With the VDPA case I believe there is a set of predefined virtio >> devices that are being emulated and presented so it isn't as if they >> are creating a totally new interface for this. vDPA doesn't have any limitation of how the devices is created or implemented. It could be predefined or created dynamically. vDPA leaves all of those to the parent device with the help of a unified management API[1]. E.g It could be a PCI device (PF or VF), sub-function or software emulated devices. >> >> What I would be interested in seeing is if there are any other vendors >> that have reviewed this and sign off on this approach. For "this approach" do you mean vDPA subfucntion? My understanding is that it's totally vendor specific, vDPA subsystem don't want to be limited by a specific type of device. >> What we don't >> want to see is Nivida/Mellanox do this one way, then Broadcom or Intel >> come along later and have yet another way of doing this. We need an >> interface and feature set that will work for everyone in terms of how >> this will look going forward. 
For feature set, it would be hard to force (we can have a recommendation set of features) vendors to implement a common set of features consider they can be negotiated. So the management interface is expected to implement features like cpu clusters in order to make sure the migration compatibility, or qemu can assist for the missing feature with performance lose. Thanks
On 2020/11/24 下午3:01, Jason Wang wrote: > > On 2020/11/21 上午3:04, Samudrala, Sridhar wrote: >> >> >> On 11/20/2020 9:58 AM, Alexander Duyck wrote: >>> On Thu, Nov 19, 2020 at 5:29 PM Jakub Kicinski <kuba@kernel.org> wrote: >>>> >>>> On Wed, 18 Nov 2020 21:35:29 -0700 David Ahern wrote: >>>>> On 11/18/20 7:14 PM, Jakub Kicinski wrote: >>>>>> On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote: >>>>>>> On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote: >>>>>>> >>>>>>>>> Just to refresh all our memory, we discussed and settled on >>>>>>>>> the flow >>>>>>>>> in [2]; RFC [1] followed this discussion. >>>>>>>>> >>>>>>>>> vdpa tool of [3] can add one or more vdpa device(s) on top of >>>>>>>>> already >>>>>>>>> spawned PF, VF, SF device. >>>>>>>> >>>>>>>> Nack for the networking part of that. It'd basically be VMDq. >>>>>>> >>>>>>> What are you NAK'ing? >>>>>> >>>>>> Spawning multiple netdevs from one device by slicing up its queues. >>>>> >>>>> Why do you object to that? Slicing up h/w resources for virtual what >>>>> ever has been common practice for a long time. >>>> >>>> My memory of the VMDq debate is hazy, let me rope in Alex into this. >>>> I believe the argument was that we should offload software constructs, >>>> not create HW-specific APIs which depend on HW availability and >>>> implementation. So the path we took was offloading macvlan. >>> >>> I think it somewhat depends on the type of interface we are talking >>> about. What we were wanting to avoid was drivers spawning their own >>> unique VMDq netdevs and each having a different way of doing it. The >>> approach Intel went with was to use a MACVLAN offload to approach it. >>> Although I would imagine many would argue the approach is somewhat >>> dated and limiting since you cannot do many offloads on a MACVLAN >>> interface. >> >> Yes. We talked about this at netdev 0x14 and the limitations of >> macvlan based offloads. >> https://netdevconf.info/0x14/session.html?talk-hardware-acceleration-of-container-networking-interfaces >> >> >> Subfunction seems to be a good model to expose VMDq VSI or SIOV ADI >> as a netdev for kernel containers. AF_XDP ZC in a container is one of >> the usecase this would address. Today we have to pass the entire >> PF/VF to a container to do AF_XDP. >> >> Looks like the current model is to create a subfunction of a specific >> type on auxiliary bus, do some configuration to assign resources and >> then activate the subfunction. >> >>> >>> With the VDPA case I believe there is a set of predefined virtio >>> devices that are being emulated and presented so it isn't as if they >>> are creating a totally new interface for this. > > > vDPA doesn't have any limitation of how the devices is created or > implemented. It could be predefined or created dynamically. vDPA > leaves all of those to the parent device with the help of a unified > management API[1]. E.g It could be a PCI device (PF or VF), > sub-function or software emulated devices. Miss the link, https://www.spinics.net/lists/netdev/msg699374.html. Thanks > > >>> >>> What I would be interested in seeing is if there are any other vendors >>> that have reviewed this and sign off on this approach. > > > For "this approach" do you mean vDPA subfucntion? My understanding is > that it's totally vendor specific, vDPA subsystem don't want to be > limited by a specific type of device. > > >>> What we don't >>> want to see is Nivida/Mellanox do this one way, then Broadcom or Intel >>> come along later and have yet another way of doing this. 
>>> We need an interface and feature set that will work for everyone in terms
>>> of how this will look going forward.
>
> For the feature set, it would be hard to force vendors to implement a common
> set of features (we can have a recommended set), considering features can be
> negotiated. So the management interface is expected to implement features
> like CPU clusters in order to ensure migration compatibility, or QEMU can
> compensate for a missing feature at some performance loss.
>
> Thanks
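As background for the unified management API referenced above, here is a minimal sketch of how a vDPA device would be added on top of an already-spawned PF/VF/SF from userspace, assuming the iproute2 vdpa tool that implements this API; the management-device and device names below are purely illustrative and not taken from this series.

List the management devices (parents) that can spawn vDPA devices:
$ vdpa mgmtdev show

Create a vDPA device on top of one of them, then inspect or remove it:
$ vdpa dev add name vdpa0 mgmtdev pci/0000:06:00.0
$ vdpa dev show vdpa0
$ vdpa dev del vdpa0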
On 11/18/20 10:57 PM, Saeed Mahameed wrote:
>
> We are not slicing up any queues, from our HW and FW perspective SF ==
> VF literally, a full blown HW slice (Function), with isolated control
> and data plane of its own, this is very different from VMDq and more
> generic and secure. an SF device is exactly like a VF, doesn't steal or
> share any HW resources or control/data path with others. SF is
> basically SRIOV done right.

What does that mean with respect to mac filtering and ntuple rules?

Also, Tx is fairly easy to imagine, but how does hardware know how to direct
packets for the Rx path? As an example, consider 2 VMs or containers with the
same destination ip both using subfunction devices. How does the nic know how
to direct the ingress flows to the right queues for the subfunction?
Hi David,

> From: David Ahern <dsahern@gmail.com>
> Sent: Wednesday, November 25, 2020 11:04 AM
>
> On 11/18/20 10:57 PM, Saeed Mahameed wrote:
>>
>> We are not slicing up any queues, from our HW and FW perspective SF ==
>> VF literally, a full blown HW slice (Function), with isolated control
>> and data plane of its own, this is very different from VMDq and more
>> generic and secure. an SF device is exactly like a VF, doesn't steal
>> or share any HW resources or control/data path with others. SF is
>> basically SRIOV done right.
>
> What does that mean with respect to mac filtering and ntuple rules?
>
> Also, Tx is fairly easy to imagine, but how does hardware know how to direct
> packets for the Rx path? As an example, consider 2 VMs or containers with the
> same destination ip both using subfunction devices.

Since both VMs/containers have the same IP, it is better to place them in
different L2 domains via vlan, vxlan etc.

> How does the nic know how to direct the ingress flows to the right queues for
> the subfunction?

Rx steering occurs through tc filters via the representor netdev of the SF,
exactly the same way as for VF representor netdevs.

When the devlink eswitch port is created as shown in the example in the cover
letter, and also in patch-12, it creates the representor netdevice. Below is a
snippet of it.

Add a devlink port of subfunction flavour:
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88

Configure the mac address of the port function:
$ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88
                            ^^^^^^^^^^^^^^
This is the representor netdevice. It is created by the port add command.
This name is set up by systemd/udev v245 and higher by utilizing the existing
phys_port_name infrastructure that already exists for PF and VF representors.

Now the user can add a unicast rx tc rule, for example:

$ tc filter add dev ens2f0np0 parent ffff: prio 1 flower dst_mac 00:00:00:00:88:88 action mirred egress redirect dev ens2f0npf0sf88

I didn't cover this tc example in the cover letter, to keep it short. But I
had a one line description as below in the 'detail' section of the cover
letter. Hope it helps.

- A SF supports eswitch representation and tc offload support similar
  to existing PF and VF representors.

The above portion answers how to forward the packet to the subfunction. But
how to forward to the right rx queue out of multiple rx queues? This is done
by the rss configuration set by the user, and the number of channels from
ethtool, just like for a VF and PF. The driver defaults are similar to a VF,
which the user can change via ethtool.
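Putting the steps above together, a minimal end-to-end sketch looks roughly as
follows. The representor names, pfnum/sfnum values and queue counts are the
illustrative ones from this thread, and the 'state active' step is assumed
from the flow described in the cover letter rather than quoted from it.

Add the SF port and set the MAC address of its function:
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
$ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88

Activate the SF so its auxiliary device is created and can be probed:
$ devlink port function set ens2f0npf0sf88 state active

Steer traffic destined to the SF's MAC from the uplink representor to the SF
representor:
$ tc filter add dev ens2f0np0 parent ffff: prio 1 flower dst_mac 00:00:00:00:88:88 action mirred egress redirect dev ens2f0npf0sf88

Finally, choose the rx queue spread on the SF netdev itself, from wherever the
SF is consumed (e.g. inside the container); the netdev name is illustrative:
$ ethtool -L eth0 combined 4
$ ethtool -X eth0 equal 4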
On 11/24/20 11:00 PM, Parav Pandit wrote:
> Hi David,
>
>> From: David Ahern <dsahern@gmail.com>
>> Sent: Wednesday, November 25, 2020 11:04 AM
>>
>> On 11/18/20 10:57 PM, Saeed Mahameed wrote:
>>>
>>> We are not slicing up any queues, from our HW and FW perspective SF ==
>>> VF literally, a full blown HW slice (Function), with isolated control
>>> and data plane of its own, this is very different from VMDq and more
>>> generic and secure. an SF device is exactly like a VF, doesn't steal
>>> or share any HW resources or control/data path with others. SF is
>>> basically SRIOV done right.
>>
>> What does that mean with respect to mac filtering and ntuple rules?
>>
>> Also, Tx is fairly easy to imagine, but how does hardware know how to direct
>> packets for the Rx path? As an example, consider 2 VMs or containers with the
>> same destination ip both using subfunction devices.
> Since both VMs/containers have the same IP, it is better to place them in
> different L2 domains via vlan, vxlan etc.

ok, so relying on <vlan, dmac> pairs.

>> How does the nic know how to direct the ingress flows to the right queues for
>> the subfunction?
>
> Rx steering occurs through tc filters via the representor netdev of the SF,
> exactly the same way as for VF representor netdevs.
>
> When the devlink eswitch port is created as shown in the example in the cover
> letter, and also in patch-12, it creates the representor netdevice.
> Below is a snippet of it.
>
> Add a devlink port of subfunction flavour:
> $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
>
> Configure the mac address of the port function:
> $ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88
>                             ^^^^^^^^^^^^^^
> This is the representor netdevice. It is created by the port add command.
> This name is set up by systemd/udev v245 and higher by utilizing the existing
> phys_port_name infrastructure that already exists for PF and VF representors.

hardware ensures only packets with that dmac are sent to the subfunction
device.

> Now the user can add a unicast rx tc rule, for example:
>
> $ tc filter add dev ens2f0np0 parent ffff: prio 1 flower dst_mac 00:00:00:00:88:88 action mirred egress redirect dev ens2f0npf0sf88
>
> I didn't cover this tc example in the cover letter, to keep it short.
> But I had a one line description as below in the 'detail' section of the
> cover letter. Hope it helps.
>
> - A SF supports eswitch representation and tc offload support similar
>   to existing PF and VF representors.
>
> The above portion answers how to forward the packet to the subfunction.
> But how to forward to the right rx queue out of multiple rx queues?
> This is done by the rss configuration set by the user, and the number of
> channels from ethtool, just like for a VF and PF.
> The driver defaults are similar to a VF, which the user can change via ethtool.

so users can add flow steering or drop rules to SF devices.

thanks,
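To illustrate that last point, a minimal sketch of the kind of steering and
drop rules a user could attach, reusing the illustrative representor names and
MAC address from earlier in the thread; the specific matches are arbitrary.

Drop traffic destined to the SF instead of redirecting it, by matching its MAC
on the uplink representor:
$ tc filter add dev ens2f0np0 parent ffff: prio 2 flower dst_mac 00:00:00:00:88:88 action drop

Drop, say, outbound DNS traffic generated by the SF; packets sent by the SF
appear on the ingress side of its representor:
$ tc filter add dev ens2f0npf0sf88 parent ffff: protocol ip prio 1 flower ip_proto udp dst_port 53 action drop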