Message ID: 20201214214352.198172-1-saeed@kernel.org (mailing list archive)
Series: Add mlx5 subfunction support
On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org> wrote:
>
> Hi Dave, Jakub, Jason,
>
> This series from Parav was the theme of this mlx5 release cycle; we've been waiting anxiously for the auxbus infrastructure to make it into the kernel, and now that auxbus is in and all the stars are aligned, I can finally submit this V2 of the devlink and mlx5 subfunction support.
>
> Subfunctions came to solve the scaling issue of virtualization and switchdev environments, where SRIOV failed to deliver and users ran out of VFs very quickly, as SRIOV demands a huge amount of physical resources in both the servers and the NIC.
>
> Subfunctions provide the same functionality as SRIOV but in a very lightweight manner; please see the thorough and detailed documentation from Parav below, in the commit messages and the networking documentation patches at the end of this series.

Just to clarify a few things for myself. You mention virtualization and SR-IOV in your patch description, but you cannot support direct assignment with this, correct? The idea here is simply logical partitioning of an existing network interface, correct? So this isn't so much a solution for virtualization, but may work better for containers. I view this as an important distinction to make, as the first thing that came to mind when I read this was mediated devices, which is similar but focused only on the virtualization case:
https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-device.html

> Parav Pandit Says:
> =================
>
> This patchset introduces support for mlx5 subfunction (SF).
>
> A subfunction is a lightweight function that has a parent PCI function on which it is deployed. An mlx5 subfunction has its own function capabilities and its own resources. This means a subfunction has its own dedicated queues (txq, rxq, cq, eq). These queues are neither shared with nor stolen from the parent PCI function.

Rather than calling this a subfunction, would it make more sense to call it something such as a queue set? It seems like this is exposing some of the same functionality we did in the Intel drivers such as ixgbe and i40e via the macvlan offload interface. However, the ixgbe/i40e hardware was somewhat limited in that we were only able to expose Ethernet interfaces via this sort of VMQ/VMDQ feature, and even with that we have seen some limitations to the interface. It sounds like you are able to break out RDMA capable devices this way as well. So in terms of ways to go I would argue this is likely better. However, one downside is that we are going to end up seeing each subfunction being different from driver to driver and vendor to vendor, which I would argue was also one of the problems with SR-IOV, as you end up with a bit of vendor lock-in as a result of this feature since each vendor will be providing a different interface.

> When a subfunction is RDMA capable, it has its own QP1, GID table and rdma resources, neither shared with nor stolen from the parent PCI function.
>
> A subfunction has a dedicated window in PCI BAR space that is not shared with the other subfunctions or the parent PCI function. This ensures that all class devices of the subfunction access only the assigned PCI BAR space.
>
> A subfunction supports eswitch representation through which it supports tc offloads. The user must configure the eswitch to send/receive packets from/to the subfunction port.
>
> Subfunctions share PCI level resources such as PCI MSI-X IRQs with other subfunctions and/or with their parent PCI function.
This piece of the architecture has me somewhat concerned. If all your resources are shared and you are allowing devices to be created incrementally, you either have to pre-partition the entire function, which usually results in limited resources for your base setup, or free resources from existing interfaces and redistribute them as things change. I would be curious which approach you are taking here. So for example, if you hit a certain threshold will you need to reset the port and rebalance the IRQs between the various functions?

> Patch summary:
> --------------
> Patch 1 to 4 prepares devlink
> Patch 5 to 7 mlx5 adds SF device support
> Patch 8 to 11 mlx5 adds SF devlink port support
> Patch 12 to 14 adds documentation
>
> Patch-1 prepares code to handle multiple port function attributes
> Patch-2 introduces devlink pcisf port flavour similar to pcipf and pcivf
> Patch-3 adds port add and delete driver callbacks
> Patch-4 adds port function state get and set callbacks
> Patch-5 mlx5 vhca event notifier support to distribute subfunction state change notification
> Patch-6 adds SF auxiliary device
> Patch-7 adds SF auxiliary driver
> Patch-8 prepares eswitch to handle SF vport
> Patch-9 adds eswitch helpers to add/remove SF vport
> Patch-10 implements devlink port add/del callbacks
> Patch-11 implements devlink port function get/set callbacks
> Patch-12 to 14 adds documentation
> Patch-12 adds mlx5 port function documentation
> Patch-13 adds subfunction documentation
> Patch-14 adds mlx5 subfunction documentation
>
> Subfunction support is discussed in detail in RFC [1] and [2]. RFC [1] and extension [2] describe the requirements, design and proposed plumbing using devlink, auxiliary bus and sysfs for systemd/udev support. The functionality of this patchset is best explained using real examples further below.
>
> overview:
> --------
> A subfunction can be created and deleted by a user using the devlink port add/delete interface.
>
> A subfunction can be configured using devlink port function attributes before it's activated.
>
> When a subfunction is activated, it results in an auxiliary device on the host PCI device where it is deployed. A driver binds to the auxiliary device that further creates supported class devices.
>
> example subfunction usage sequence:
> -----------------------------------
> Change device to switchdev mode:
> $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
>
> Add a devlink port of subfunction flaovur:
> $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88

Typo in your description. Also I don't know if you want to stick with "flavour" or just shorten it to the U.S. spelling which is "flavor".

> Configure mac address of the port function:
> $ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88
>
> Now activate the function:
> $ devlink port function set ens2f0npf0sf88 state active
>
> Now use the auxiliary device and class devices:
> $ devlink dev show
> pci/0000:06:00.0
> auxiliary/mlx5_core.sf.4
>
> $ ip link show
> 127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>     link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
>     altname enp6s0f0np0
> 129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>     link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff

I assume that p0sf88 is supposed to be the newly created subfunction.
However, I thought the naming was supposed to be the same as what you are referring to in devlink, or did I miss something?

> $ rdma dev show
> 43: rdmap6s0f0: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d112 sys_image_guid 248a:0703:00b3:d112
> 44: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112
>
> After use, inactivate the function:
> $ devlink port function set ens2f0npf0sf88 state inactive
>
> Now delete the subfunction port:
> $ devlink port del ens2f0npf0sf88

This seems wrong to me as it breaks the symmetry with the port add command and assumes you have ownership of the interface in the host. I would much prefer to see the same arguments that were passed to the add command being used to do the teardown, as that would allow for the parent function to create the object, assign it to a container namespace, and not need to pull it back in order to destroy it.
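Something along these lines, to be concrete (purely hypothetical syntax that mirrors the add; devlink does not accept these arguments for del today):

# hypothetical teardown form: repeat the add arguments instead of naming the port
$ devlink port del pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88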
On 12/14/20 6:53 PM, Alexander Duyck wrote: >> example subfunction usage sequence: >> ----------------------------------- >> Change device to switchdev mode: >> $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev >> >> Add a devlink port of subfunction flaovur: >> $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88 > > Typo in your description. Also I don't know if you want to stick with > "flavour" or just shorten it to the U.S. spelling which is "flavor". The term exists in devlink today (since 2018). When support was added to iproute2 I decided there was no reason to require the US spelling over the British spelling, so I accepted the patch.
> From: Alexander Duyck <alexander.duyck@gmail.com>
> Sent: Tuesday, December 15, 2020 7:24 AM
>
> On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org> wrote:
> >
> > Hi Dave, Jakub, Jason,
> >
> Just to clarify a few things for myself. You mention virtualization and SR-IOV in your patch description but you cannot support direct assignment with this correct?

Correct. It cannot be directly assigned.

> The idea here is simply logical partitioning of an existing network interface, correct?

No. The idea is to spawn multiple functions from a single PCI device. These functions are not born in the PCI device or in the OS until they are created by the user. Jason and Saeed explained this in great detail a few weeks back in the v0 version of the patchset at [1], [2] and [3]. I'd better not repeat all of it here; please go through it. If you want to read a precursor to it, the RFC from Jiri at [4] also explains this in great detail.

> So this isn't so much a solution for virtualization, but may work better for containers. I view this as an important distinction to make as the first thing that came to mind when I read this was mediated devices which is similar, but focused only on the virtualization case:
> https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-device.html
>
Managing subfunctions using mediated devices was already ruled out last year at [5], as it is an abuse of the mdev bus for this purpose and has severe limitations in managing the subfunction device. We are not going back to it anymore. It would duplicate a lot of the plumbing which exists in devlink, netlink, the auxiliary bus and more.

> Rather than calling this a subfunction, would it make more sense to call it something such as a queue set?

No, queues are just one way to send and receive data/packets. Jason and Saeed explained and discussed this piece with you and others during v0 a few weeks back at [1], [2], [3]. Please take a look.

> So in terms of ways to go I would argue this is likely better. However one downside is that we are going to end up seeing each subfunction being different from driver to driver and vendor to vendor which I would argue was also one of the problems with SR-IOV as you end up with a bit of vendor lock-in as a result of this feature since each vendor will be providing a different interface.
>
Several vendors provided a unified interface for managing VFs, i.e.
(a) enable/disable was via vendor-neutral sysfs
(b) the sriov capability is exposed via the standard pci capability and sysfs
(c) sriov vf config (mac, vlan, rss, tx rate, spoof check, trust) uses vendor-agnostic netlink
even though the drivers' internal implementations largely differ on how trust, spoof check, mac, vlan, rate etc. are enforced.

So subfunction features/attributes/functionality will be implemented differently internally in the driver matching each vendor's device, for the reasonably abstract concept of a 'subfunction'.

> > A subfunction supports eswitch representation through which it
> > supports tc offloads. User must configure eswitch to send/receive
> > packets from/to subfunction port.
> >
> > Subfunctions share PCI level resources such as PCI MSI-X IRQs with
> > their other subfunctions and/or with its parent PCI function.
>
> This piece of the architecture has me somewhat concerned. If all your
> resources are shared and

All resources are not shared.
> you are allowing devices to be created incrementally you either have to pre-partition the entire function which usually results in limited resources for your base setup, or free resources from existing interfaces and redistribute them as things change. I would be curious which approach you are taking here? So for example if you hit a certain threshold will you need to reset the port and rebalance the IRQs between the various functions?

No. It works a bit differently for the mlx5 device. When the base function is started, it starts as if it doesn't have any subfunctions. When a subfunction is instantiated, it spawns new resources in the device (hw, fw, memory) depending on how much the function wants.

For example, the PCI PF uses BAR 0, while subfunctions use BAR 2.

For IRQs, a subfunction instance shares the IRQs of its parent/hosting PCI PF. In the future, yes, dedicated IRQs per SF are likely desired.

Sridhar also talked about limiting the number of queues of a subfunction. I believe there will be resources/attributes of the function to be controlled. devlink already provides a rich interface to achieve that using devlink resources [8].

[..]

> > $ ip link show
> > 127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
> >     link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
> >     altname enp6s0f0np0
> > 129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
> >     link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff
>
> I assume that p0sf88 is supposed to be the newly created subfunction. However I thought the naming was supposed to be the same as what you are referring to in the devlink, or did I miss something?
>
I believe you are confusing the representor netdevice of the subfunction with the devices of the subfunction (netdev, rdma, vdpa etc.). I suggest referring to the diagram in patch 15 in [7] to see the stack, modules and objects. Hope the description below clarifies a bit.

There are two netdevices:
(a) the representor netdevice, attached to the devlink port of the eswitch
(b) the netdevice of the SF used by the end application (in your example, this is assigned to the container).

Both netdevices obviously follow different naming schemes. The representor netdevice follows the naming scheme well defined in the kernel + systemd/udev v245 and higher, based on the phys_port_name sysfs attribute. This is the same scheme the existing PF and VF representors have used for a year+ now, and it is further used by the subfunction representors.

For the subfunction netdevice (p0sf88), systemd/udev will be extended. I put the example together based on a few lines of udev rule that read phys_port_name and the user-supplied sfnum, so that the user knows exactly which interface to assign to the container.

> > After use, inactivate the function:
> > $ devlink port function set ens2f0npf0sf88 state inactive
> >
> > Now delete the subfunction port:
> > $ devlink port del ens2f0npf0sf88
>
> This seems wrong to me as it breaks the symmetry with the port add
> command and

The example using the representor device name is only to make life easier for the user. The devlink port del command works based on the devlink port index, just like the existing devlink port commands (get, set, split, unsplit). I explained this in a thread with Sridhar at [6]. In short, 'devlink port del <bus/device_name>/<port_index>' is just fine. The port index is a unique handle within the devlink instance that the user refers to in order to delete the port and to get/set port and port function attributes after its creation.
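For example, assuming the SF port got port index 32768 (the real value is whatever 'devlink port show' reports; 32768 here is only illustrative), deletion by index looks like:

$ devlink port show pci/0000:06:00.0/32768
$ devlink port del pci/0000:06:00.0/32768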
I chose the representor netdev example because it is more intuitive to relate to, but the port index is equally fine and supported.

> assumes you have ownership of the interface in the host. I
> would much prefer to see the same arguments that were passed to the
> add command being used to do the teardown as that would allow for the
> parent function to create the object, assign it to a container namespace, and
> not need to pull it back in order to destroy it.

The parent function will not have the same netdevice name as that of the representor netdevice, because both devices exist in a single system for a large part of the use cases. So the port delete command works on the port index. The host doesn't need to pull it back to destroy it; it is destroyed via the port del command.

[1] https://lore.kernel.org/netdev/20201112192424.2742-1-parav@nvidia.com/
[2] https://lore.kernel.org/netdev/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/
[3] https://lore.kernel.org/netdev/20201120161659.GE917484@nvidia.com/
[4] https://lore.kernel.org/netdev/20200501091449.GA25211@nanopsycho.orion/
[5] https://lore.kernel.org/netdev/20191107160448.20962-1-parav@mellanox.com/
[6] https://lore.kernel.org/netdev/BY5PR12MB43227784BB34D929CA64E315DCCA0@BY5PR12MB4322.namprd12.prod.outlook.com/
[7] https://lore.kernel.org/netdev/20201214214352.198172-16-saeed@kernel.org/T/#u
[8] https://man7.org/linux/man-pages/man8/devlink-resource.8.html
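A minimal sketch of the kind of udev rule mentioned above (the attribute match and resulting name are only illustrative; the real naming policy comes from the systemd/udev extension):

# illustrative only: rename SF netdevs based on their phys_port_name attribute
SUBSYSTEM=="net", ACTION=="add", ATTR{phys_port_name}=="p?sf*", NAME="$attr{phys_port_name}"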
On Mon, 2020-12-14 at 17:53 -0800, Alexander Duyck wrote: > On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org> > wrote: > > Hi Dave, Jakub, Jason, > > > > This series form Parav was the theme of this mlx5 release cycle, > > we've been waiting anxiously for the auxbus infrastructure to make > > it into > > the kernel, and now as the auxbus is in and all the stars are > > aligned, I > > can finally submit this V2 of the devlink and mlx5 subfunction > > support. > > > > Subfunctions came to solve the scaling issue of virtualization > > and switchdev environments, where SRIOV failed to deliver and users > > ran > > out of VFs very quickly as SRIOV demands huge amount of physical > > resources > > in both of the servers and the NIC. > > > > Subfunction provide the same functionality as SRIOV but in a very > > lightweight manner, please see the thorough and detailed > > documentation from Parav below, in the commit messages and the > > Networking documentation patches at the end of this series. > > > > Just to clarify a few things for myself. You mention virtualization > and SR-IOV in your patch description but you cannot support direct > assignment with this correct? The idea here is simply logical > partitioning of an existing network interface, correct? So this isn't > so much a solution for virtualization, but may work better for > containers. I view this as an important distinction to make as the at the current state yes, but the SF solution can be extended to support direct assignment, so this is why i think SF solution can do better and eventually replace SRIOV. also many customers are currently using SRIOV with containers to get the performance and isolation features since there was no other options. > first thing that came to mind when I read this was mediated devices > which is similar, but focused only on the virtualization case: > https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-device.html > > > Parav Pandit Says: > > ================= > > > > This patchset introduces support for mlx5 subfunction (SF). > > > > A subfunction is a lightweight function that has a parent PCI > > function on > > which it is deployed. mlx5 subfunction has its own function > > capabilities > > and its own resources. This means a subfunction has its own > > dedicated > > queues(txq, rxq, cq, eq). These queues are neither shared nor > > stealed from > > the parent PCI function. > > Rather than calling this a subfunction, would it make more sense to > call it something such as a queue set? It seems like this is exposing > some of the same functionality we did in the Intel drivers such as > ixgbe and i40e via the macvlan offload interface. However the > ixgbe/i40e hardware was somewhat limited in that we were only able to > expose Ethernet interfaces via this sort of VMQ/VMDQ feature, and > even > with that we have seen some limitations to the interface. It sounds > like you are able to break out RDMA capable devices this way as well. > So in terms of ways to go I would argue this is likely better. We've discussed this thoroughly on V0, the SF solutions is closer to a VF than a VMDQ, this is not just a set of queues. 
https://lore.kernel.org/linux-rdma/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/

> However one downside is that we are going to end up seeing each subfunction being different from driver to driver and vendor to vendor which I would argue was also one of the problems with SR-IOV as you end up with a bit of vendor lock-in as a result of this feature since each vendor will be providing a different interface.
>
I disagree. SFs are tightly coupled with the switchdev model and devlink port functions; they are backed by a well-defined model. I can say the same about sriov with switchdev mode: this sort of vendor lock-in issue is eliminated when you migrate to switchdev mode.

> > When subfunction is RDMA capable, it has its own QP1, GID table and rdma resources neither shared nor stolen from the parent PCI function.
> >
> > A subfunction has a dedicated window in PCI BAR space that is not shared with the other subfunctions or parent PCI function. This ensures that all class devices of the subfunction access only assigned PCI BAR space.
> >
> > A subfunction supports eswitch representation through which it supports tc offloads. User must configure eswitch to send/receive packets from/to subfunction port.
> >
> > Subfunctions share PCI level resources such as PCI MSI-X IRQs with their other subfunctions and/or with its parent PCI function.
>
> This piece of the architecture has me somewhat concerned. If all your resources are shared and you are allowing devices to be

Not all, only PCI MSI-X, for now...

> created incrementally you either have to pre-partition the entire function which usually results in limited resources for your base setup, or free resources from existing interfaces and redistribute them as things change. I would be curious which approach you are taking here? So for example if you hit a certain threshold will you need to reset the port and rebalance the IRQs between the various functions?
>
Currently SFs will use whatever IRQs the PF has pre-allocated for itself, so there is no IRQ limit issue at the moment. We are considering a dynamic IRQ pool with dynamic balancing, or even better, using the IMS approach, which perfectly fits the SF architecture:
https://patchwork.kernel.org/project/linux-pci/cover/1568338328-22458-1-git-send-email-megha.dey@linux.intel.com/

The internal resources are fully isolated (not shared) and they are internally managed by FW exactly like a VF's internal resources.
On Mon, Dec 14, 2020 at 6:44 PM David Ahern <dsahern@gmail.com> wrote: > > On 12/14/20 6:53 PM, Alexander Duyck wrote: > >> example subfunction usage sequence: > >> ----------------------------------- > >> Change device to switchdev mode: > >> $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev > >> > >> Add a devlink port of subfunction flaovur: > >> $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88 > > > > Typo in your description. Also I don't know if you want to stick with > > "flavour" or just shorten it to the U.S. spelling which is "flavor". > > The term exists in devlink today (since 2018). When support was added to > iproute2 I decided there was no reason to require the US spelling over > the British spelling, so I accepted the patch. Okay. The only reason why I noticed is because "flaovur" is definitely a wrong spelling. If it is already in the interface then no need to change it.
> From: Alexander Duyck <alexander.duyck@gmail.com>
> Sent: Tuesday, December 15, 2020 9:47 PM
>
> On Mon, Dec 14, 2020 at 6:44 PM David Ahern <dsahern@gmail.com> wrote:
> >
> > On 12/14/20 6:53 PM, Alexander Duyck wrote:
> > >> example subfunction usage sequence:
> > >> -----------------------------------
> > >> Change device to switchdev mode:
> > >> $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
> > >>
> > >> Add a devlink port of subfunction flaovur:
> > >> $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
> > >
> > > Typo in your description. Also I don't know if you want to stick
> > > with "flavour" or just shorten it to the U.S. spelling which is "flavor".
> >
> > The term exists in devlink today (since 2018). When support was added to
> > iproute2 I decided there was no reason to require the US spelling over
> > the British spelling, so I accepted the patch.
>
> Okay. The only reason why I noticed is because "flaovur" is definitely a wrong
> spelling. If it is already in the interface then no need to change it.

I am used to writing "flavor", and I realized that I should probably say "flavour" because in devlink it is spelled that way. So I added the 'u' but the typo put it in the wrong location. :-)
Thanks for catching it. Saeed sent the v5 fixing this along with a few more English corrections.
On Mon, Dec 14, 2020 at 9:48 PM Parav Pandit <parav@nvidia.com> wrote: > > > > From: Alexander Duyck <alexander.duyck@gmail.com> > > Sent: Tuesday, December 15, 2020 7:24 AM > > > > On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org> > > wrote: > > > > > > Hi Dave, Jakub, Jason, > > > > > > > Just to clarify a few things for myself. You mention virtualization and SR-IOV > > in your patch description but you cannot support direct assignment with this > > correct? > Correct. it cannot be directly assigned. > > > The idea here is simply logical partitioning of an existing network > > interface, correct? > No. Idea is to spawn multiple functions from a single PCI device. > These functions are not born in PCI device and in OS until they are created by user. That is the definition of logical partitioning. You are essentially taking one physical PCIe function and splitting up the resources over multiple logical devices. With something like an MFD driver you would partition the device as soon as the driver loads, but with this you are peeling our resources and defining the devices on demand. > Jason and Saeed explained this in great detail few weeks back in v0 version of the patchset at [1], [2] and [3]. > I better not repeat all of it here again. Please go through it. > If you may want to read precursor to it, RFC from Jiri at [4] is also explains this in great detail. I think I have a pretty good idea of how the feature works. My concern is more the use of marketing speak versus actual functionality. The way this is being setup it sounds like it is useful for virtualization and it is not, at least in its current state. It may be at some point in the future but I worry that it is really going to muddy the waters as we end up with yet another way to partition devices. > > So this isn't so much a solution for virtualization, but may > > work better for containers. I view this as an important distinction to make as > > the first thing that came to mind when I read this was mediated devices > > which is similar, but focused only on the virtualization case: > > https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated- > > device.html > > > Managing subfunction using medicated device is already ruled out last year at [5] as it is the abuse of the mdev bus for this purpose + has severe limitations of managing the subfunction device. I agree with you on that. My thought was more the fact that the two can be easily confused. If we are going to do this we need to define that for networking devices perhaps that using the mdev interface would be deprecated and we would need to go through devlink. However before we do that we need to make sure we have this completely standardized. > We are not going back to it anymore. > It will be duplicating lot of the plumbing which exists in devlink, netlink, auxiliary bus and more. That is kind of my point. It is already in the kernel. What you are adding is the stuff that is duplicating it. I'm assuming that in order to be able to virtualize your interfaces in the future you are going to have to make use of the same vfio plumbing that the mediated devices do. > > Rather than calling this a subfunction, would it make more sense to call it > > something such as a queue set? > No, queue is just one way to send and receive data/packets. > Jason and Saeed explained and discussed this piece to you and others during v0 few weeks back at [1], [2], [3]. > Please take a look. Yeah, I recall that. However I feel like it is being oversold. 
It isn't "SR-IOV done right" it seems more like "VMDq done better". The fact that interrupts are shared between the subfunctions is telling. That is exactly how things work for Intel parts when they do VMDq as well. The queues are split up into pools and a block of queues belongs to a specific queue. From what I can can tell the only difference is that there is isolation of the pool into specific pages in the BAR. Which is essentially a requirement for mediated devices so that they can be direct assigned. > > So in terms of ways to go I would argue this is likely better. However one > > downside is that we are going to end up seeing each subfunction being > > different from driver to driver and vendor to vendor which I would argue > > was also one of the problems with SR-IOV as you end up with a bit of vendor > > lock-in as a result of this feature since each vendor will be providing a > > different interface. > > > Each and several vendors provided unified interface for managing VFs. i.e. > (a) enable/disable was via vendor neutral sysfs > (b) sriov capability exposed via standard pci capability and sysfs > (c) sriov vf config (mac, vlan, rss, tx rate, spoof check trust) are using vendor agnostic netlink > Even though the driver's internal implementation largely differs on how trust, spoof, mac, vlan rate etc are enforced. > > So subfunction feature/attribute/functionality will be implemented differently internally in the driver matching vendor's device, for reasonably abstract concept of 'subfunction'. I think you are missing the point. The biggest issue with SR-IOV adoption was the fact that the drivers all created different VF interfaces and those interfaces didn't support migration. That was the two biggest drawbacks for SR-IOV. I don't really see this approach resolving either of those and so that is one of the reasons why I say this is closer to "VMDq done better" rather than "SR-IOV done right". Assuming at some point one of the flavours is a virtio-net style interface you could eventually get to the point of something similar to what seems to have been the goal of mdev which was meant to address these two points. > > > A Subfunction supports eswitch representation through which it > > > supports tc offloads. User must configure eswitch to send/receive > > > packets from/to subfunction port. > > > > > > Subfunctions share PCI level resources such as PCI MSI-X IRQs with > > > their other subfunctions and/or with its parent PCI function. > > > > This piece to the architecture for this has me somewhat concerned. If all your > > resources are shared and > All resources are not shared. Just to clarify, when I say "shared" I mean that they are all coming from the same function. So if for example I were to direct-assign the PF then all the resources for the subfunctions would go with it. > > you are allowing devices to be created > > incrementally you either have to pre-partition the entire function which > > usually results in limited resources for your base setup, or free resources > > from existing interfaces and redistribute them as things change. I would be > > curious which approach you are taking here? So for example if you hit a > > certain threshold will you need to reset the port and rebalance the IRQs > > between the various functions? > No. Its works bit differently for mlx5 device. > When base function is started, it started as if it doesn't have any subfunctions. 
> When subfunction is instantiated, it spawns new resources in device (hw, fw, memory) depending on how much a function wants. > > For example, PCI PF uses BAR 0, while subfunctions uses BAR 2. In the grand scheme BAR doesn't really matter much. The assumption here is that resources are page aligned so that you can map the pages into a guest eventually. > For IRQs, subfunction instance shares the IRQ with its parent/hosting PCI PF. > In future, yes, a dedicated IRQs per SF is likely desired. > Sridhar also talked about limiting number of queues to a subfunction. > I believe there will be resources/attributes of the function to be controlled. > devlink already provides rich interface to achieve that using devlink resources [8]. > > [..] So it sounds like the device firmware is pre-partitioining the resources and splitting them up between the PCI PF and your subfunctions. Although in your case it sounds like there are significantly more resources than you might find in an ixgbe interface for instance. :) The point is that we should probably define some sort of standard and/or expectations on what should happen when you spawn a new interface. Would it be acceptable for the PF and existing subfunctions to have to reset if you need to rebalance the IRQ distribution, or should they not be disrupted when you spawn a new interface? One of the things I prefer about the mediated device setup is the fact that it has essentially pre-partitioned things beforehand and you know how many of each type of interface you can spawn. Is there any information like that provided by this interface? > > > $ ip link show > > > 127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state > > DOWN mode DEFAULT group default qlen 1000 > > > link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff > > > altname enp6s0f0np0 > > > 129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN > > mode DEFAULT group default qlen 1000 > > > link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff> > > > > I assume that p0sf88 is supposed to be the newly created subfunction. > > However I thought the naming was supposed to be the same as what you are > > referring to in the devlink, or did I miss something? > > > I believe you are confused with the representor netdevice of subfuction with devices of subfunction. (netdev, rdma, vdpa etc). > I suggest that please refer to the diagram in patch_15 in [7] to see the stack, modules, objects. > Hope below description clarifies a bit. > There are two netdevices. > (a) representor netdevice, attached to the devlink port of the eswitch > (b) netdevice of the SF used by the end application (in your example, this is assigned to container). Sorry, that wasn't clear from your example. So in this case you started in a namespace and the new device you created via devlink was spawned in the root namespace? > Both netdevice follow obviously a different naming scheme. > Representor netdevice follows naming scheme well defined in kernel + systemd/udev v245 and higher. > It is based on phys_port_name sysfs attribute. > This is same for existing PF and SF representors exist for year+ now. Further used by subfunction. > > For subfunction netdevice (p0s88), system/udev will be extended. I put example based on my few lines of udev rule that reads > phys_port_name and user supplied sfnum, so that user exactly knows which interface to assign to container. Admittedly I have been out of the loop for the last couple years since I had switched over to memory management work for a while. 
It would be useful to include something that shows your created network interface in the example in addition to your switchdev port. > > > After use inactivate the function: > > > $ devlink port function set ens2f0npf0sf88 state inactive > > > > > > Now delete the subfunction port: > > > $ devlink port del ens2f0npf0sf88 > > > > This seems wrong to me as it breaks the symmetry with the port add > > command and > Example of the representor device is only to make life easier for the user. > Devlink port del command works based on the devlink port index, just like existing devlink port commands (get,set,split,unsplit). > I explained this in a thread with Sridhar at [6]. > In short devlink port del <bus/device_name/port_index command is just fine. > Port index is unique handle for the devlink instance that user refers to delete, get, set port and port function attributes post its creation. > I choose the representor netdev example because it is more intuitive to related to, but port index is equally fine and supported. Okay then, that addresses my concern. I just wanted to make sure we weren't in some situation where you had to have the interface in order to remove it. > > assumes you have ownership of the interface in the host. I > > would much prefer to to see the same arguments that were passed to the > > add command being used to do the teardown as that would allow for the > > parent function to create the object, assign it to a container namespace, and > > not need to pull it back in order to destroy it. > Parent function will not have same netdevice name as that of representor netdevice, because both devices exist in single system for large part of the use cases. > So port delete command works on the port index. > Host doesn't need to pull it back to destroy it. It is destroyed via port del command. > > [1] https://lore.kernel.org/netdev/20201112192424.2742-1-parav@nvidia.com/ > [2] https://lore.kernel.org/netdev/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/ > [3] https://lore.kernel.org/netdev/20201120161659.GE917484@nvidia.com/ > [4] https://lore.kernel.org/netdev/20200501091449.GA25211@nanopsycho.orion/ > [5] https://lore.kernel.org/netdev/20191107160448.20962-1-parav@mellanox.com/ > [6] https://lore.kernel.org/netdev/BY5PR12MB43227784BB34D929CA64E315DCCA0@BY5PR12MB4322.namprd12.prod.outlook.com/ > [7] https://lore.kernel.org/netdev/20201214214352.198172-16-saeed@kernel.org/T/#u > [8] https://man7.org/linux/man-pages/man8/devlink-resource.8.html >
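For reference, the devlink resource interface Parav points to in [8] looks roughly like this (the resource path and size below are made-up placeholders; the real paths are whatever the driver registers, and size changes typically take effect on a devlink reload):

$ devlink resource show pci/0000:06:00.0
$ devlink resource set pci/0000:06:00.0 path /some_resource size 32
$ devlink dev reload pci/0000:06:00.0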
On Mon, Dec 14, 2020 at 10:15 PM Saeed Mahameed <saeed@kernel.org> wrote: > > On Mon, 2020-12-14 at 17:53 -0800, Alexander Duyck wrote: > > On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org> > > wrote: > > > Hi Dave, Jakub, Jason, > > > > > > This series form Parav was the theme of this mlx5 release cycle, > > > we've been waiting anxiously for the auxbus infrastructure to make > > > it into > > > the kernel, and now as the auxbus is in and all the stars are > > > aligned, I > > > can finally submit this V2 of the devlink and mlx5 subfunction > > > support. > > > > > > Subfunctions came to solve the scaling issue of virtualization > > > and switchdev environments, where SRIOV failed to deliver and users > > > ran > > > out of VFs very quickly as SRIOV demands huge amount of physical > > > resources > > > in both of the servers and the NIC. > > > > > > Subfunction provide the same functionality as SRIOV but in a very > > > lightweight manner, please see the thorough and detailed > > > documentation from Parav below, in the commit messages and the > > > Networking documentation patches at the end of this series. > > > > > > > Just to clarify a few things for myself. You mention virtualization > > and SR-IOV in your patch description but you cannot support direct > > assignment with this correct? The idea here is simply logical > > partitioning of an existing network interface, correct? So this isn't > > so much a solution for virtualization, but may work better for > > containers. I view this as an important distinction to make as the > > at the current state yes, but the SF solution can be extended to > support direct assignment, so this is why i think SF solution can do > better and eventually replace SRIOV. My only real concern is that this and mediated devices are essentially the same thing. When you start making this work for direct-assignment the only real difference becomes the switchdev and devlink interfaces. Basically this is netdev specific mdev versus the PCIe specific mdev. > also many customers are currently using SRIOV with containers to get > the performance and isolation features since there was no other > options. There were, but you hadn't implemented them. The fact is the approach Intel had taken for that was offloaded macvlan. I think the big thing we really should do if we are going to go this route is to look at standardizing what the flavours are that get created by the parent netdevice. Otherwise we are just creating the same mess we had with SRIOV all over again and muddying the waters of mediated devices. > > first thing that came to mind when I read this was mediated devices > > which is similar, but focused only on the virtualization case: > > https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-device.html > > > > > Parav Pandit Says: > > > ================= > > > > > > This patchset introduces support for mlx5 subfunction (SF). > > > > > > A subfunction is a lightweight function that has a parent PCI > > > function on > > > which it is deployed. mlx5 subfunction has its own function > > > capabilities > > > and its own resources. This means a subfunction has its own > > > dedicated > > > queues(txq, rxq, cq, eq). These queues are neither shared nor > > > stealed from > > > the parent PCI function. > > > > Rather than calling this a subfunction, would it make more sense to > > call it something such as a queue set? 
It seems like this is exposing > > some of the same functionality we did in the Intel drivers such as > > ixgbe and i40e via the macvlan offload interface. However the > > ixgbe/i40e hardware was somewhat limited in that we were only able to > > expose Ethernet interfaces via this sort of VMQ/VMDQ feature, and > > even > > with that we have seen some limitations to the interface. It sounds > > like you are able to break out RDMA capable devices this way as well. > > So in terms of ways to go I would argue this is likely better. > > We've discussed this thoroughly on V0, the SF solutions is closer to a > VF than a VMDQ, this is not just a set of queues. > > https://lore.kernel.org/linux-rdma/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/ VMDq is more than just a set of queues. The fact is it is a pool of resources that get created to handle the requests for a specific VM. The extra bits that are added here are essentially stuff that was required to support mediated devices. > > However > > one downside is that we are going to end up seeing each subfunction > > being different from driver to driver and vendor to vendor which I > > would argue was also one of the problems with SR-IOV as you end up > > with a bit of vendor lock-in as a result of this feature since each > > vendor will be providing a different interface. > > > > I disagree, SFs are tightly coupled with switchdev model and devlink > functions port, they are backed with the a well defined model, i can > say the same about sriov with switchdev mode, this sort of vendor lock- > in issues is eliminated when you migrate to switchdev mode. What you are talking about is the backend. I am talking about what is exposed to the user. The user is going to see a Mellanox device having to be placed into their container in order to support this. One of the advantages of the Intel approach was that the macvlan interface was generic so you could have an offloaded interface or not and the user wouldn't necessarily know. The offload could be disabled and the user would be none the wiser as it is moved from one interface to another. I see that as a big thing that is missing in this solution. > > > When subfunction is RDMA capable, it has its own QP1, GID table and > > > rdma > > > resources neither shared nor stealed from the parent PCI function. > > > > > > A subfunction has dedicated window in PCI BAR space that is not > > > shared > > > with ther other subfunctions or parent PCI function. This ensures > > > that all > > > class devices of the subfunction accesses only assigned PCI BAR > > > space. > > > > > > A Subfunction supports eswitch representation through which it > > > supports tc > > > offloads. User must configure eswitch to send/receive packets > > > from/to > > > subfunction port. > > > > > > Subfunctions share PCI level resources such as PCI MSI-X IRQs with > > > their other subfunctions and/or with its parent PCI function. > > > > This piece to the architecture for this has me somewhat concerned. If > > all your resources are shared and you are allowing devices to be > > not all, only PCI MSIX, for now.. They aren't shared after you partition them but they are coming from the same device. Basically you are subdividing the BAR2 in order to generate the subfunctions. BAR2 is a shared resource in my point of view. 
> > created incrementally you either have to pre-partition the entire > > function which usually results in limited resources for your base > > setup, or free resources from existing interfaces and redistribute > > them as things change. I would be curious which approach you are > > taking here? So for example if you hit a certain threshold will you > > need to reset the port and rebalance the IRQs between the various > > functions? > > > > Currently SFs will use whatever IRQs the PF has pre-allocated for > itself, so there is no IRQ limit issue at the moment, we are > considering a dynamic IRQ pool with dynamic balancing, or even better > us the IMS approach, which perfectly fits the SF architecture. > https://patchwork.kernel.org/project/linux-pci/cover/1568338328-22458-1-git-send-email-megha.dey@linux.intel.com/ When you say you are using the PF's interrupts are you just using that as a pool of resources or having the interrupt process interrupts for both the PF and SFs? Without IMS you are limited to 2048 interrupts. Moving over to that would make sense since SF is similar to mdev in the way it partitions up the device and resources. > for internal resources the are fully isolated (not shared) and > they are internally managed by FW exactly like a VF internal resources. I assume by isolated you mean they are contained within page aligned blocks like what was required for mdev?
On Tue, 2020-12-15 at 10:47 -0800, Alexander Duyck wrote: > On Mon, Dec 14, 2020 at 9:48 PM Parav Pandit <parav@nvidia.com> > wrote: > > > > > From: Alexander Duyck <alexander.duyck@gmail.com> > > > Sent: Tuesday, December 15, 2020 7:24 AM > > > > > > On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org> > > > wrote: > > > > Hi Dave, Jakub, Jason, > > > > > > > > > > Just to clarify a few things for myself. You mention > > > virtualization and SR-IOV > > > in your patch description but you cannot support direct > > > assignment with this > > > correct? > > Correct. it cannot be directly assigned. > > > > > The idea here is simply logical partitioning of an existing > > > network > > > interface, correct? > > No. Idea is to spawn multiple functions from a single PCI device. > > These functions are not born in PCI device and in OS until they are > > created by user. > > That is the definition of logical partitioning. You are essentially > taking one physical PCIe function and splitting up the resources over > multiple logical devices. With something like an MFD driver you would > partition the device as soon as the driver loads, but with this you > are peeling our resources and defining the devices on demand. > Sure, same for SRIOV and same for VMDQ. they are all logical partitioning and they are all sharing the same resources of the system. our point here is that the SF mechanisms are more similar to SRIOV than VMDQ, other than sharing the MSIX vectors and slicing up the BARs, everything else works exactly like SRIOV. > > Jason and Saeed explained this in great detail few weeks back in v0 > > version of the patchset at [1], [2] and [3]. > > I better not repeat all of it here again. Please go through it. > > If you may want to read precursor to it, RFC from Jiri at [4] is > > also explains this in great detail. > > I think I have a pretty good idea of how the feature works. My > concern > is more the use of marketing speak versus actual functionality. The > way this is being setup it sounds like it is useful for > virtualization > and it is not, at least in its current state. It may be at some point > in the future but I worry that it is really going to muddy the waters > as we end up with yet another way to partition devices. > Ok, maybe we have different views on the feature and use cases, this is useful for visualization for various reasons, take vdpa for an instance, but anyway, we can improve documentation to address your concerns and at some point in the future when we add the direct assignment we can add the necessary documentation. > > > So this isn't so much a solution for virtualization, but may > > > work better for containers. I view this as an important > > > distinction to make as > > > the first thing that came to mind when I read this was mediated > > > devices > > > which is similar, but focused only on the virtualization case: > > > https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated- > > > device.html > > > > > Managing subfunction using medicated device is already ruled out > > last year at [5] as it is the abuse of the mdev bus for this > > purpose + has severe limitations of managing the subfunction > > device. > > I agree with you on that. My thought was more the fact that the two > can be easily confused. If we are going to do this we need to define > that for networking devices perhaps that using the mdev interface > would be deprecated and we would need to go through devlink. 
However > before we do that we need to make sure we have this completely > standardized. > Then lets keep this discussion for later, when we add the direct assignment support :) > > We are not going back to it anymore. > > It will be duplicating lot of the plumbing which exists in devlink, > > netlink, auxiliary bus and more. > > That is kind of my point. It is already in the kernel. What you are > adding is the stuff that is duplicating it. I'm assuming that in > order > to be able to virtualize your interfaces in the future you are going > to have to make use of the same vfio plumbing that the mediated > devices do. > > > > Rather than calling this a subfunction, would it make more sense > > > to call it > > > something such as a queue set? > > No, queue is just one way to send and receive data/packets. > > Jason and Saeed explained and discussed this piece to you and > > others during v0 few weeks back at [1], [2], [3]. > > Please take a look. > > Yeah, I recall that. However I feel like it is being oversold. It > isn't "SR-IOV done right" it seems more like "VMDq done better". The Ok then will improve documentation to not oversell this as "SRIOV done right". > fact that interrupts are shared between the subfunctions is telling. > That is exactly how things work for Intel parts when they do VMDq as > well. The queues are split up into pools and a block of queues > belongs > to a specific queue. From what I can can tell the only difference is > that there is isolation of the pool into specific pages in the BAR. > Which is essentially a requirement for mediated devices so that they > can be direct assigned. > I disagree, this is very dissimilar to VDMQ. SF is a VF without a unique PCI function and bar, the BAR is split up to give the "SF" access to its own partition in the HW, and then it will access the HW exactly like a VF would do, there are no "pools" involved here or any PF resource/queue management, from our HW architecture perspective there is no difference between SF and VF.. really. > > > So in terms of ways to go I would argue this is likely better. > > > However one > > > downside is that we are going to end up seeing each subfunction > > > being > > > different from driver to driver and vendor to vendor which I > > > would argue > > > was also one of the problems with SR-IOV as you end up with a bit > > > of vendor > > > lock-in as a result of this feature since each vendor will be > > > providing a > > > different interface. > > > > > Each and several vendors provided unified interface for managing > > VFs. i.e. > > (a) enable/disable was via vendor neutral sysfs > > (b) sriov capability exposed via standard pci capability and sysfs > > (c) sriov vf config (mac, vlan, rss, tx rate, spoof check trust) > > are using vendor agnostic netlink > > Even though the driver's internal implementation largely differs on > > how trust, spoof, mac, vlan rate etc are enforced. > > > > So subfunction feature/attribute/functionality will be implemented > > differently internally in the driver matching vendor's device, for > > reasonably abstract concept of 'subfunction'. > > I think you are missing the point. The biggest issue with SR-IOV > adoption was the fact that the drivers all created different VF > interfaces and those interfaces didn't support migration. That was > the > two biggest drawbacks for SR-IOV. I don't really see this approach > resolving either of those and so that is one of the reasons why I say > this is closer to "VMDq done better" rather than "SR-IOV done > right". 
> Assuming at some point one of the flavours is a virtio-net style > interface you could eventually get to the point of something similar > to what seems to have been the goal of mdev which was meant to > address > these two points. > Improving documentation will address these concerns ? but to be clear it is much easier to solve sriov live migration when SFs are involved. > > > > A Subfunction supports eswitch representation through which it > > > > supports tc offloads. User must configure eswitch to > > > > send/receive > > > > packets from/to subfunction port. > > > > > > > > Subfunctions share PCI level resources such as PCI MSI-X IRQs > > > > with > > > > their other subfunctions and/or with its parent PCI function. > > > > > > This piece to the architecture for this has me somewhat > > > concerned. If all your > > > resources are shared and > > All resources are not shared. > > Just to clarify, when I say "shared" I mean that they are all coming > from the same function. So if for example I were to direct-assign the > PF then all the resources for the subfunctions would go with it. > > > > you are allowing devices to be created > > > incrementally you either have to pre-partition the entire > > > function which > > > usually results in limited resources for your base setup, or free > > > resources > > > from existing interfaces and redistribute them as things change. > > > I would be > > > curious which approach you are taking here? So for example if you > > > hit a > > > certain threshold will you need to reset the port and rebalance > > > the IRQs > > > between the various functions? > > No. Its works bit differently for mlx5 device. > > When base function is started, it started as if it doesn't have any > > subfunctions. > > When subfunction is instantiated, it spawns new resources in device > > (hw, fw, memory) depending on how much a function wants. > > > > For example, PCI PF uses BAR 0, while subfunctions uses BAR 2. > > In the grand scheme BAR doesn't really matter much. The assumption > here is that resources are page aligned so that you can map the pages > into a guest eventually. > > > For IRQs, subfunction instance shares the IRQ with its > > parent/hosting PCI PF. > > In future, yes, a dedicated IRQs per SF is likely desired. > > Sridhar also talked about limiting number of queues to a > > subfunction. > > I believe there will be resources/attributes of the function to be > > controlled. > > devlink already provides rich interface to achieve that using > > devlink resources [8]. > > > > [..] > > So it sounds like the device firmware is pre-partitioining the > resources and splitting them up between the PCI PF and your > subfunctions. Although in your case it sounds like there are > significantly more resources than you might find in an ixgbe > interface > for instance. :) > > The point is that we should probably define some sort of standard > and/or expectations on what should happen when you spawn a new > interface. Would it be acceptable for the PF and existing > subfunctions > to have to reset if you need to rebalance the IRQ distribution, or > should they not be disrupted when you spawn a new interface? > > One of the things I prefer about the mediated device setup is the > fact > that it has essentially pre-partitioned things beforehand and you > know > how many of each type of interface you can spawn. Is there any > information like that provided by this interface? 
> > > > > $ ip link show > > > > 127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state > > > DOWN mode DEFAULT group default qlen 1000 > > > > link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff > > > > altname enp6s0f0np0 > > > > 129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state > > > > DOWN > > > mode DEFAULT group default qlen 1000 > > > > link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff> > > > > > > I assume that p0sf88 is supposed to be the newly created > > > subfunction. > > > However I thought the naming was supposed to be the same as what > > > you are > > > referring to in the devlink, or did I miss something? > > > > > I believe you are confused with the representor netdevice of > > subfuction with devices of subfunction. (netdev, rdma, vdpa etc). > > I suggest that please refer to the diagram in patch_15 in [7] to > > see the stack, modules, objects. > > Hope below description clarifies a bit. > > There are two netdevices. > > (a) representor netdevice, attached to the devlink port of the > > eswitch > > (b) netdevice of the SF used by the end application (in your > > example, this is assigned to container). > > Sorry, that wasn't clear from your example. So in this case you > started in a namespace and the new device you created via devlink was > spawned in the root namespace? > > > Both netdevice follow obviously a different naming scheme. > > Representor netdevice follows naming scheme well defined in kernel > > + systemd/udev v245 and higher. > > It is based on phys_port_name sysfs attribute. > > This is same for existing PF and SF representors exist for year+ > > now. Further used by subfunction. > > > > For subfunction netdevice (p0s88), system/udev will be extended. I > > put example based on my few lines of udev rule that reads > > phys_port_name and user supplied sfnum, so that user exactly knows > > which interface to assign to container. > > Admittedly I have been out of the loop for the last couple years > since > I had switched over to memory management work for a while. It would > be > useful to include something that shows your created network interface > in the example in addition to your switchdev port. > > > > > After use inactivate the function: > > > > $ devlink port function set ens2f0npf0sf88 state inactive > > > > > > > > Now delete the subfunction port: > > > > $ devlink port del ens2f0npf0sf88 > > > > > > This seems wrong to me as it breaks the symmetry with the port > > > add > > > command and > > Example of the representor device is only to make life easier for > > the user. > > Devlink port del command works based on the devlink port index, > > just like existing devlink port commands (get,set,split,unsplit). > > I explained this in a thread with Sridhar at [6]. > > In short devlink port del <bus/device_name/port_index command is > > just fine. > > Port index is unique handle for the devlink instance that user > > refers to delete, get, set port and port function attributes post > > its creation. > > I choose the representor netdev example because it is more > > intuitive to related to, but port index is equally fine and > > supported. > > Okay then, that addresses my concern. I just wanted to make sure we > weren't in some situation where you had to have the interface in > order > to remove it. > > > > assumes you have ownership of the interface in the host. 
I > > > would much prefer to to see the same arguments that were passed > > > to the > > > add command being used to do the teardown as that would allow for > > > the > > > parent function to create the object, assign it to a container > > > namespace, and > > > not need to pull it back in order to destroy it. > > Parent function will not have same netdevice name as that of > > representor netdevice, because both devices exist in single system > > for large part of the use cases. > > So port delete command works on the port index. > > Host doesn't need to pull it back to destroy it. It is destroyed > > via port del command. > > > > [1] > > https://lore.kernel.org/netdev/20201112192424.2742-1-parav@nvidia.com/ > > [2] > > https://lore.kernel.org/netdev/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/ > > [3] > > https://lore.kernel.org/netdev/20201120161659.GE917484@nvidia.com/ > > [4] > > https://lore.kernel.org/netdev/20200501091449.GA25211@nanopsycho.orion/ > > [5] > > https://lore.kernel.org/netdev/20191107160448.20962-1-parav@mellanox.com/ > > [6] > > https://lore.kernel.org/netdev/BY5PR12MB43227784BB34D929CA64E315DCCA0@BY5PR12MB4322.namprd12.prod.outlook.com/ > > [7] > > https://lore.kernel.org/netdev/20201214214352.198172-16-saeed@kernel.org/T/#u > > [8] https://man7.org/linux/man-pages/man8/devlink-resource.8.html > >
On Tue, 2020-12-15 at 11:12 -0800, Alexander Duyck wrote: > On Mon, Dec 14, 2020 at 10:15 PM Saeed Mahameed <saeed@kernel.org> > wrote: > > On Mon, 2020-12-14 at 17:53 -0800, Alexander Duyck wrote: > > > On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org> > > > wrote: > > > > Hi Dave, Jakub, Jason, > > > > > > > > This series form Parav was the theme of this mlx5 release > > > > cycle, > > > > we've been waiting anxiously for the auxbus infrastructure to > > > > make > > > > it into > > > > the kernel, and now as the auxbus is in and all the stars are > > > > aligned, I > > > > can finally submit this V2 of the devlink and mlx5 subfunction > > > > support. > > > > > > > > Subfunctions came to solve the scaling issue of virtualization > > > > and switchdev environments, where SRIOV failed to deliver and > > > > users > > > > ran > > > > out of VFs very quickly as SRIOV demands huge amount of > > > > physical > > > > resources > > > > in both of the servers and the NIC. > > > > > > > > Subfunction provide the same functionality as SRIOV but in a > > > > very > > > > lightweight manner, please see the thorough and detailed > > > > documentation from Parav below, in the commit messages and the > > > > Networking documentation patches at the end of this series. > > > > > > > > > > Just to clarify a few things for myself. You mention > > > virtualization > > > and SR-IOV in your patch description but you cannot support > > > direct > > > assignment with this correct? The idea here is simply logical > > > partitioning of an existing network interface, correct? So this > > > isn't > > > so much a solution for virtualization, but may work better for > > > containers. I view this as an important distinction to make as > > > the > > > > at the current state yes, but the SF solution can be extended to > > support direct assignment, so this is why i think SF solution can > > do > > better and eventually replace SRIOV. > > My only real concern is that this and mediated devices are > essentially > the same thing. When you start making this work for direct-assignment > the only real difference becomes the switchdev and devlink > interfaces. not just devlink and switchdev, auxbus was also introduced to standardize some of the interfaces. > Basically this is netdev specific mdev versus the PCIe specific mdev. > SF is not netdev specific mdev .. :/ > > also many customers are currently using SRIOV with containers to > > get > > the performance and isolation features since there was no other > > options. > > There were, but you hadn't implemented them. The fact is the approach > Intel had taken for that was offloaded macvlan. > offloaded macvlan is just a macvlan with checksum/tso and gro. macvlan can't provide RDMA, TC offloads, ethtool steering, PTP, vdpa ... our SF provides the same set of features a VF can provide > I think the big thing we really should do if we are going to go this > route is to look at standardizing what the flavours are that get > created by the parent netdevice. Otherwise we are just creating the > same mess we had with SRIOV all over again and muddying the waters of > mediated devices. > yes in the near future we will be working on auxbus interfaces for auto-probing and user flavor selection, this is a must have feature for us. 
> > > first thing that came to mind when I read this was mediated > > > devices > > > which is similar, but focused only on the virtualization case: > > > https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-device.html > > > > > > > Parav Pandit Says: > > > > ================= > > > > > > > > This patchset introduces support for mlx5 subfunction (SF). > > > > > > > > A subfunction is a lightweight function that has a parent PCI > > > > function on > > > > which it is deployed. mlx5 subfunction has its own function > > > > capabilities > > > > and its own resources. This means a subfunction has its own > > > > dedicated > > > > queues(txq, rxq, cq, eq). These queues are neither shared nor > > > > stealed from > > > > the parent PCI function. > > > > > > Rather than calling this a subfunction, would it make more sense > > > to > > > call it something such as a queue set? It seems like this is > > > exposing > > > some of the same functionality we did in the Intel drivers such > > > as > > > ixgbe and i40e via the macvlan offload interface. However the > > > ixgbe/i40e hardware was somewhat limited in that we were only > > > able to > > > expose Ethernet interfaces via this sort of VMQ/VMDQ feature, and > > > even > > > with that we have seen some limitations to the interface. It > > > sounds > > > like you are able to break out RDMA capable devices this way as > > > well. > > > So in terms of ways to go I would argue this is likely better. > > > > We've discussed this thoroughly on V0, the SF solutions is closer > > to a > > VF than a VMDQ, this is not just a set of queues. > > > > https://lore.kernel.org/linux-rdma/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/ > > VMDq is more than just a set of queues. The fact is it is a pool of > resources that get created to handle the requests for a specific VM. > The extra bits that are added here are essentially stuff that was > required to support mediated devices. > VMDq pools are managed by the driver and only logically isolated in the kernel, SFs has no shared pool for transport resources (queues), SFs have their own isolated steering domains, processing engines, and HW objects, exactly like a VF. > > > However > > > one downside is that we are going to end up seeing each > > > subfunction > > > being different from driver to driver and vendor to vendor which > > > I > > > would argue was also one of the problems with SR-IOV as you end > > > up > > > with a bit of vendor lock-in as a result of this feature since > > > each > > > vendor will be providing a different interface. > > > > > > > I disagree, SFs are tightly coupled with switchdev model and > > devlink > > functions port, they are backed with the a well defined model, i > > can > > say the same about sriov with switchdev mode, this sort of vendor > > lock- > > in issues is eliminated when you migrate to switchdev mode. > > What you are talking about is the backend. I am talking about what is > exposed to the user. The user is going to see a Mellanox device > having > to be placed into their container in order to support this. One of > the > advantages of the Intel approach was that the macvlan interface was > generic so you could have an offloaded interface or not and the user > wouldn't necessarily know. The offload could be disabled and the user > would be none the wiser as it is moved from one interface to another. > I see that as a big thing that is missing in this solution. 
> You are talking about the basic netdev users, Sure there are users who would want a more generic netdev, so yes. but most of my customers are not like that, they want vdpa/rdma and heavy netdev offload such as encap/decap/crypto and driver xdp in their containers, the SF approach will make more sense to them than sriov and VMDq. > > > > When subfunction is RDMA capable, it has its own QP1, GID table > > > > and > > > > rdma > > > > resources neither shared nor stealed from the parent PCI > > > > function. > > > > > > > > A subfunction has dedicated window in PCI BAR space that is not > > > > shared > > > > with ther other subfunctions or parent PCI function. This > > > > ensures > > > > that all > > > > class devices of the subfunction accesses only assigned PCI BAR > > > > space. > > > > > > > > A Subfunction supports eswitch representation through which it > > > > supports tc > > > > offloads. User must configure eswitch to send/receive packets > > > > from/to > > > > subfunction port. > > > > > > > > Subfunctions share PCI level resources such as PCI MSI-X IRQs > > > > with > > > > their other subfunctions and/or with its parent PCI function. > > > > > > This piece to the architecture for this has me somewhat > > > concerned. If > > > all your resources are shared and you are allowing devices to be > > > > not all, only PCI MSIX, for now.. > > They aren't shared after you partition them but they are coming from > the same device. Basically you are subdividing the BAR2 in order to > generate the subfunctions. BAR2 is a shared resource in my point of > view. > Sure, but it doesn't host any actual resources, only the communication channel with the HW partition, so other than the BAR and the msix the actual HW resources, steering pipelines, offloads and queues are totlay isolated and separated. > > > created incrementally you either have to pre-partition the entire > > > function which usually results in limited resources for your base > > > setup, or free resources from existing interfaces and > > > redistribute > > > them as things change. I would be curious which approach you are > > > taking here? So for example if you hit a certain threshold will > > > you > > > need to reset the port and rebalance the IRQs between the various > > > functions? > > > > > > > Currently SFs will use whatever IRQs the PF has pre-allocated for > > itself, so there is no IRQ limit issue at the moment, we are > > considering a dynamic IRQ pool with dynamic balancing, or even > > better > > us the IMS approach, which perfectly fits the SF architecture. > > https://patchwork.kernel.org/project/linux-pci/cover/1568338328-22458-1-git-send-email-megha.dey@linux.intel.com/ > > When you say you are using the PF's interrupts are you just using > that > as a pool of resources or having the interrupt process interrupts for > both the PF and SFs? Without IMS you are limited to 2048 interrupts. > Moving over to that would make sense since SF is similar to mdev in > the way it partitions up the device and resources. > Yes moving to IMS is on the top of our priorities. > > for internal resources the are fully isolated (not shared) and > > they are internally managed by FW exactly like a VF internal > > resources. > > I assume by isolated you mean they are contained within page aligned > blocks like what was required for mdev? I mean they are isolated and abstracted in the FW, we don't really expose any resource directly to the BAR. 
The BAR is only used for communicating with the device, so a VF and an SF will work exactly the same; the only difference is where they get their BAR and offsets from. Everything else is just similar.
On 12/14/20 10:48 PM, Parav Pandit wrote: > >> From: Alexander Duyck <alexander.duyck@gmail.com> >> Sent: Tuesday, December 15, 2020 7:24 AM >> >> On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org> >> wrote: >>> >>> Hi Dave, Jakub, Jason, >>> >> >> Just to clarify a few things for myself. You mention virtualization and SR-IOV >> in your patch description but you cannot support direct assignment with this >> correct? > Correct. it cannot be directly assigned. > >> The idea here is simply logical partitioning of an existing network >> interface, correct? > No. Idea is to spawn multiple functions from a single PCI device. > These functions are not born in PCI device and in OS until they are created by user. > Jason and Saeed explained this in great detail few weeks back in v0 version of the patchset at [1], [2] and [3]. > I better not repeat all of it here again. Please go through it. > If you may want to read precursor to it, RFC from Jiri at [4] is also explains this in great detail. > >> So this isn't so much a solution for virtualization, but may >> work better for containers. I view this as an important distinction to make as >> the first thing that came to mind when I read this was mediated devices >> which is similar, but focused only on the virtualization case: >> https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated- >> device.html >> > Managing subfunction using medicated device is already ruled out last year at [5] as it is the abuse of the mdev bus for this purpose + has severe limitations of managing the subfunction device. > We are not going back to it anymore. > It will be duplicating lot of the plumbing which exists in devlink, netlink, auxiliary bus and more. > >> Rather than calling this a subfunction, would it make more sense to call it >> something such as a queue set? > No, queue is just one way to send and receive data/packets. > Jason and Saeed explained and discussed this piece to you and others during v0 few weeks back at [1], [2], [3]. > Please take a look. > >> So in terms of ways to go I would argue this is likely better. However one >> downside is that we are going to end up seeing each subfunction being >> different from driver to driver and vendor to vendor which I would argue >> was also one of the problems with SR-IOV as you end up with a bit of vendor >> lock-in as a result of this feature since each vendor will be providing a >> different interface. >> > Each and several vendors provided unified interface for managing VFs. i.e. > (a) enable/disable was via vendor neutral sysfs > (b) sriov capability exposed via standard pci capability and sysfs > (c) sriov vf config (mac, vlan, rss, tx rate, spoof check trust) are using vendor agnostic netlink > Even though the driver's internal implementation largely differs on how trust, spoof, mac, vlan rate etc are enforced. > > So subfunction feature/attribute/functionality will be implemented differently internally in the driver matching vendor's device, for reasonably abstract concept of 'subfunction'. > >>> A Subfunction supports eswitch representation through which it >>> supports tc offloads. User must configure eswitch to send/receive >>> packets from/to subfunction port. >>> >>> Subfunctions share PCI level resources such as PCI MSI-X IRQs with >>> their other subfunctions and/or with its parent PCI function. >> >> This piece to the architecture for this has me somewhat concerned. If all your >> resources are shared and > All resources are not shared. 
> >> you are allowing devices to be created >> incrementally you either have to pre-partition the entire function which >> usually results in limited resources for your base setup, or free resources >> from existing interfaces and redistribute them as things change. I would be >> curious which approach you are taking here? So for example if you hit a >> certain threshold will you need to reset the port and rebalance the IRQs >> between the various functions? > No. Its works bit differently for mlx5 device. > When base function is started, it started as if it doesn't have any subfunctions. > When subfunction is instantiated, it spawns new resources in device (hw, fw, memory) depending on how much a function wants. > > For example, PCI PF uses BAR 0, while subfunctions uses BAR 2. > For IRQs, subfunction instance shares the IRQ with its parent/hosting PCI PF. > In future, yes, a dedicated IRQs per SF is likely desired. > Sridhar also talked about limiting number of queues to a subfunction. > I believe there will be resources/attributes of the function to be controlled. > devlink already provides rich interface to achieve that using devlink resources [8]. > > [..] > >>> $ ip link show >>> 127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state >> DOWN mode DEFAULT group default qlen 1000 >>> link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff >>> altname enp6s0f0np0 >>> 129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN >> mode DEFAULT group default qlen 1000 >>> link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff> >> >> I assume that p0sf88 is supposed to be the newly created subfunction. >> However I thought the naming was supposed to be the same as what you are >> referring to in the devlink, or did I miss something? >> > I believe you are confused with the representor netdevice of subfuction with devices of subfunction. (netdev, rdma, vdpa etc). > I suggest that please refer to the diagram in patch_15 in [7] to see the stack, modules, objects. > Hope below description clarifies a bit. > There are two netdevices. > (a) representor netdevice, attached to the devlink port of the eswitch > (b) netdevice of the SF used by the end application (in your example, this is assigned to container). > > Both netdevice follow obviously a different naming scheme. > Representor netdevice follows naming scheme well defined in kernel + systemd/udev v245 and higher. > It is based on phys_port_name sysfs attribute. > This is same for existing PF and SF representors exist for year+ now. Further used by subfunction. > > For subfunction netdevice (p0s88), system/udev will be extended. I put example based on my few lines of udev rule that reads > phys_port_name and user supplied sfnum, so that user exactly knows which interface to assign to container. > >>> After use inactivate the function: >>> $ devlink port function set ens2f0npf0sf88 state inactive >>> >>> Now delete the subfunction port: >>> $ devlink port del ens2f0npf0sf88 >> >> This seems wrong to me as it breaks the symmetry with the port add >> command and > Example of the representor device is only to make life easier for the user. > Devlink port del command works based on the devlink port index, just like existing devlink port commands (get,set,split,unsplit). > I explained this in a thread with Sridhar at [6]. > In short devlink port del <bus/device_name/port_index command is just fine. > Port index is unique handle for the devlink instance that user refers to delete, get, set port and port function attributes post its creation. 
> I choose the representor netdev example because it is more intuitive to related to, but port index is equally fine and supported. > >> assumes you have ownership of the interface in the host. I >> would much prefer to to see the same arguments that were passed to the >> add command being used to do the teardown as that would allow for the >> parent function to create the object, assign it to a container namespace, and >> not need to pull it back in order to destroy it. > Parent function will not have same netdevice name as that of representor netdevice, because both devices exist in single system for large part of the use cases. > So port delete command works on the port index. > Host doesn't need to pull it back to destroy it. It is destroyed via port del command. > > [1] https://lore.kernel.org/netdev/20201112192424.2742-1-parav@nvidia.com/ > [2] https://lore.kernel.org/netdev/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/ > [3] https://lore.kernel.org/netdev/20201120161659.GE917484@nvidia.com/ > [4] https://lore.kernel.org/netdev/20200501091449.GA25211@nanopsycho.orion/ > [5] https://lore.kernel.org/netdev/20191107160448.20962-1-parav@mellanox.com/ > [6] https://lore.kernel.org/netdev/BY5PR12MB43227784BB34D929CA64E315DCCA0@BY5PR12MB4322.namprd12.prod.outlook.com/ > [7] https://lore.kernel.org/netdev/20201214214352.198172-16-saeed@kernel.org/T/#u > [8] https://man7.org/linux/man-pages/man8/devlink-resource.8.html > Seems to be a repeated line of questions. You might want to add these FAQs, responses and references to the subfunction document once this set gets merged.
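The udev extension referred to above is not part of this series; a hypothetical sketch of what such a rule could look like, where the phys_port_name value "p0sf88" and the resulting interface name are assumptions for illustration only:

SUBSYSTEM=="net", ACTION=="add", ATTR{phys_port_name}=="p0sf88", NAME="p0sf88"

With a rule along these lines the SF netdevice comes up with a predictable name that encodes the user-supplied sfnum, so the operator knows which interface to hand to the container, e.g. $ ip link set p0sf88 netns mycontainer (the namespace name is made up).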
On Tue, Dec 15, 2020 at 10:47:36AM -0800, Alexander Duyck wrote: > > Jason and Saeed explained this in great detail few weeks back in v0 version of the patchset at [1], [2] and [3]. > > I better not repeat all of it here again. Please go through it. > > If you may want to read precursor to it, RFC from Jiri at [4] is also explains this in great detail. > > I think I have a pretty good idea of how the feature works. My concern > is more the use of marketing speak versus actual functionality. The > way this is being setup it sounds like it is useful for virtualization > and it is not, at least in its current state. It may be at some point > in the future but I worry that it is really going to muddy the waters > as we end up with yet another way to partition devices. If we do a virtualization version then it will take a SF and instead of loading a mlx5_core on the SF aux device, we will load some vfio_mdev_mlx5 driver which will convert the SF aux device into a /dev/vfio/* This is essentially the same as how you'd take a PCI VF and replace mlx5_core with vfio-pci to get /dev/vfio/*. It has to be a special mdev driver because it sits on the SF aux device, not on the VF PCI device. The vfio_mdev_mlx5 driver will create what Intel calls an SIOV ADI from the SF, in other words the SF is already a superset of what a SIOV ADI should be. This matches very nicely the driver model in Linux, and I don't think it becomes more muddied as we go along. If anything it is becoming more clear and sane as things progress. > I agree with you on that. My thought was more the fact that the two > can be easily confused. If we are going to do this we need to define > that for networking devices perhaps that using the mdev interface > would be deprecated and we would need to go through devlink. However > before we do that we need to make sure we have this completely > standardized. mdev is for creating /dev/vfio/* interfaces in userspace. Using it for anything else is a bad abuse of the driver model. We had this debate endlessly already. AFAIK, there is nothing to deprecate, there are no mdev_drivers in drivers/net, and none should ever be added. The only mdev_driver that should ever exists is in vfio_mdev.c If someone is using a mdev_driver in drivers/net out of tree then they will need to convert to an aux driver for in-tree. > Yeah, I recall that. However I feel like it is being oversold. It > isn't "SR-IOV done right" it seems more like "VMDq done better". The > fact that interrupts are shared between the subfunctions is telling. The interrupt sharing is a consequence of having an ADI-like model without relying on IMS. When IMS works then shared interrupts won't be very necessary. Otherwise there is no choice but to share the MSI table of the function. > That is exactly how things work for Intel parts when they do VMDq as > well. The queues are split up into pools and a block of queues belongs > to a specific queue. From what I can can tell the only difference is > that there is isolation of the pool into specific pages in the BAR. > Which is essentially a requirement for mediated devices so that they > can be direct assigned. No, I said this to Jakub, mlx5 SFs have very little to do with queues. There is no some 'queue' HW element that needs partitioning. The SF is a hardware security boundary that wraps every operation a mlx5 device can do. This is why it is an ADI. It is not a crappy ADI that relies on hypervisor emulation, it is the real thing, just like a SRIOV VF. 
You stick it in the VM and the guest can directly talk to the HW. The HW provides the security. I can't put focus on this enough: A mlx5 SF can run a *full RDMA stack*. This means the driver can create all the RDMA HW objects and resources under the SF. This is *not* just steering some ethernet traffic to a few different ethernet queues like VMDq is. The Intel analog to a SF is a *full virtual function* on one of the Intel iWarp capable NICs, not VMDq. > Assuming at some point one of the flavours is a virtio-net style > interface you could eventually get to the point of something similar > to what seems to have been the goal of mdev which was meant to address > these two points. mlx5 already supports VDPA virtio-net on PF/VF and with this series SF too. ie you can take a SF, bind the vdpa_mlx5 driver, and get a fully HW accelerated "ADI" that does virtio-net. This can be assigned to a guest and shows up as a PCI virtio-net netdev. With VT-d guest packet tx/rx on this netdev never uses the hypervisor CPU. > The point is that we should probably define some sort of standard > and/or expectations on what should happen when you spawn a new > interface. Would it be acceptable for the PF and existing subfunctions > to have to reset if you need to rebalance the IRQ distribution, or > should they not be disrupted when you spawn a new interface? It is best to think of the SF as an ADI, so if you change something in the PF and that causes the driver attached to the ADI in a VM to reset, is that OK? I'd say no. Jason
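The flavour model described here is just the auxiliary bus driver model: each class driver (mlx5_core netdev, rdma, vdpa, or a future vfio/mdev flavour) binds to the SF auxiliary device it knows how to drive. A minimal sketch of such an auxiliary driver follows; the "mlx5_core.sf" match string, the driver name and the callback bodies are assumptions for illustration, not the in-tree implementation:

#include <linux/module.h>
#include <linux/device.h>
#include <linux/auxiliary_bus.h>

/* Called when the auxiliary bus matches this driver against an SF aux device. */
static int demo_sf_probe(struct auxiliary_device *adev,
                         const struct auxiliary_device_id *id)
{
        /* A real flavour driver would set up its netdev/rdma/vdpa device here. */
        dev_info(&adev->dev, "bound to SF auxiliary device (%s)\n", id->name);
        return 0;
}

static void demo_sf_remove(struct auxiliary_device *adev)
{
        /* Tear down whatever probe created. */
}

/* Match string is "<parent module>.<device name>"; the exact value is assumed. */
static const struct auxiliary_device_id demo_sf_id_table[] = {
        { .name = "mlx5_core.sf" },
        {}
};
MODULE_DEVICE_TABLE(auxiliary, demo_sf_id_table);

static struct auxiliary_driver demo_sf_driver = {
        .name = "demo_sf",
        .probe = demo_sf_probe,
        .remove = demo_sf_remove,
        .id_table = demo_sf_id_table,
};
module_auxiliary_driver(demo_sf_driver);

MODULE_LICENSE("GPL");

Which flavour an SF ends up with is then a matter of which of these drivers is loaded and, later, of the user flavour selection interface mentioned elsewhere in the thread.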
On Tue, 15 Dec 2020 12:35:20 -0800 Saeed Mahameed wrote: > > I think the big thing we really should do if we are going to go this > > route is to look at standardizing what the flavours are that get > > created by the parent netdevice. Otherwise we are just creating the > > same mess we had with SRIOV all over again and muddying the waters of > > mediated devices. > > yes in the near future we will be working on auxbus interfaces for > auto-probing and user flavor selection, this is a must have feature for > us. Can you elaborate? I thought config would be via devlink.
On Tue, Dec 15, 2020 at 12:35 PM Saeed Mahameed <saeed@kernel.org> wrote: > > On Tue, 2020-12-15 at 11:12 -0800, Alexander Duyck wrote: > > On Mon, Dec 14, 2020 at 10:15 PM Saeed Mahameed <saeed@kernel.org> > > wrote: > > > On Mon, 2020-12-14 at 17:53 -0800, Alexander Duyck wrote: > > > > On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org> > > > > wrote: > > > > > Hi Dave, Jakub, Jason, > > > > > > > > > > This series form Parav was the theme of this mlx5 release > > > > > cycle, > > > > > we've been waiting anxiously for the auxbus infrastructure to > > > > > make > > > > > it into > > > > > the kernel, and now as the auxbus is in and all the stars are > > > > > aligned, I > > > > > can finally submit this V2 of the devlink and mlx5 subfunction > > > > > support. > > > > > > > > > > Subfunctions came to solve the scaling issue of virtualization > > > > > and switchdev environments, where SRIOV failed to deliver and > > > > > users > > > > > ran > > > > > out of VFs very quickly as SRIOV demands huge amount of > > > > > physical > > > > > resources > > > > > in both of the servers and the NIC. > > > > > > > > > > Subfunction provide the same functionality as SRIOV but in a > > > > > very > > > > > lightweight manner, please see the thorough and detailed > > > > > documentation from Parav below, in the commit messages and the > > > > > Networking documentation patches at the end of this series. > > > > > > > > > > > > > Just to clarify a few things for myself. You mention > > > > virtualization > > > > and SR-IOV in your patch description but you cannot support > > > > direct > > > > assignment with this correct? The idea here is simply logical > > > > partitioning of an existing network interface, correct? So this > > > > isn't > > > > so much a solution for virtualization, but may work better for > > > > containers. I view this as an important distinction to make as > > > > the > > > > > > at the current state yes, but the SF solution can be extended to > > > support direct assignment, so this is why i think SF solution can > > > do > > > better and eventually replace SRIOV. > > > > My only real concern is that this and mediated devices are > > essentially > > the same thing. When you start making this work for direct-assignment > > the only real difference becomes the switchdev and devlink > > interfaces. > > not just devlink and switchdev, auxbus was also introduced to > standardize some of the interfaces. The auxbus is just there to make up for the fact that there isn't another bus type for this though. I would imagine otherwise this would be on some sort of platform bus. > > Basically this is netdev specific mdev versus the PCIe specific mdev. > > > > SF is not netdev specific mdev .. :/ I agree it is not. However there are just a few extensions to it. What I would really like to see is a solid standardization of what this is. Otherwise the comparison is going to be made. Especially since a year ago Mellanox was pushing this as an mdev type interface. There is more here than just mdev, however my concern is that we may be also losing some of the advantages of mdev. It would be much easier for me to go along with this if we had more than one vendor pushing it. My concern is that this is becoming something that may end up being vendor specific. > > > also many customers are currently using SRIOV with containers to > > > get > > > the performance and isolation features since there was no other > > > options. > > > > There were, but you hadn't implemented them. 
The fact is the approach > > Intel had taken for that was offloaded macvlan. > > > > offloaded macvlan is just a macvlan with checksum/tso and gro. > > macvlan can't provide RDMA, TC offloads, ethtool steering, PTP, vdpa Agreed. I have already acknowledged that macvlan couldn't meet the needs for all use cases. However at the same time it provides a consistent interface regardless of vendors. If we decide to go with the vendor specific drivers for subfunctions that is fine, however I see that going down the same path as SR-IOV and ultimately ending in obscurity since I don't see many being willing to adopt it. > ... > our SF provides the same set of features a VF can provide That is all well and good. However if we agree that SR-IOV wasn't done right saying that you are spinning up something that works just like SR-IOV isn't all that appealing, is it? > > I think the big thing we really should do if we are going to go this > > route is to look at standardizing what the flavours are that get > > created by the parent netdevice. Otherwise we are just creating the > > same mess we had with SRIOV all over again and muddying the waters of > > mediated devices. > > > > yes in the near future we will be working on auxbus interfaces for > auto-probing and user flavor selection, this is a must have feature for > us. I would take this one step further. If we are going to have flavours maybe we should have interfaces pre-defined that are vendor agnostic that can represent the possible flavors. Basically an Ethernet interface for that case, and RDMA interface for that case and so on. It limits what functionality can be exposed, however it frees up the containers and/or guests to work on whatever NIC you want as long as it supports that interface. > > > > first thing that came to mind when I read this was mediated > > > > devices > > > > which is similar, but focused only on the virtualization case: > > > > https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-device.html > > > > > > > > > Parav Pandit Says: > > > > > ================= > > > > > > > > > > This patchset introduces support for mlx5 subfunction (SF). > > > > > > > > > > A subfunction is a lightweight function that has a parent PCI > > > > > function on > > > > > which it is deployed. mlx5 subfunction has its own function > > > > > capabilities > > > > > and its own resources. This means a subfunction has its own > > > > > dedicated > > > > > queues(txq, rxq, cq, eq). These queues are neither shared nor > > > > > stealed from > > > > > the parent PCI function. > > > > > > > > Rather than calling this a subfunction, would it make more sense > > > > to > > > > call it something such as a queue set? It seems like this is > > > > exposing > > > > some of the same functionality we did in the Intel drivers such > > > > as > > > > ixgbe and i40e via the macvlan offload interface. However the > > > > ixgbe/i40e hardware was somewhat limited in that we were only > > > > able to > > > > expose Ethernet interfaces via this sort of VMQ/VMDQ feature, and > > > > even > > > > with that we have seen some limitations to the interface. It > > > > sounds > > > > like you are able to break out RDMA capable devices this way as > > > > well. > > > > So in terms of ways to go I would argue this is likely better. > > > > > > We've discussed this thoroughly on V0, the SF solutions is closer > > > to a > > > VF than a VMDQ, this is not just a set of queues. 
> > > > > > https://lore.kernel.org/linux-rdma/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/ > > > > VMDq is more than just a set of queues. The fact is it is a pool of > > resources that get created to handle the requests for a specific VM. > > The extra bits that are added here are essentially stuff that was > > required to support mediated devices. > > > > VMDq pools are managed by the driver and only logically isolated in the > kernel, SFs has no shared pool for transport resources (queues), SFs > have their own isolated steering domains, processing engines, and HW > objects, exactly like a VF. You are describing your specific implementation. That may not apply to others. What you are defining as the differences of VMDq and SR-IOV are not the same as other vendors. You are essentially arguing implementation semantics, if it is configured by the driver or firmware it doesn't really make any difference. Being fully isolated versus only logically isolated only really matters in terms of direct assignment. In the grand scheme of things the only real difference between SR-IOV and VMDq is the spawning of the PCIe device with its own BAR to access the resources. Isolating the queues to their own 4K bounded subset of a BAR is pretty straightforward and I assume that and the firmware is what is giving you most of your isolation in this case. > > > > However > > > > one downside is that we are going to end up seeing each > > > > subfunction > > > > being different from driver to driver and vendor to vendor which > > > > I > > > > would argue was also one of the problems with SR-IOV as you end > > > > up > > > > with a bit of vendor lock-in as a result of this feature since > > > > each > > > > vendor will be providing a different interface. > > > > > > > > > > I disagree, SFs are tightly coupled with switchdev model and > > > devlink > > > functions port, they are backed with the a well defined model, i > > > can > > > say the same about sriov with switchdev mode, this sort of vendor > > > lock- > > > in issues is eliminated when you migrate to switchdev mode. > > > > What you are talking about is the backend. I am talking about what is > > exposed to the user. The user is going to see a Mellanox device > > having > > to be placed into their container in order to support this. One of > > the > > advantages of the Intel approach was that the macvlan interface was > > generic so you could have an offloaded interface or not and the user > > wouldn't necessarily know. The offload could be disabled and the user > > would be none the wiser as it is moved from one interface to another. > > I see that as a big thing that is missing in this solution. > > > > You are talking about the basic netdev users, Sure there are users who > would want a more generic netdev, so yes. but most of my customers are > not like that, they want vdpa/rdma and heavy netdev offload such as > encap/decap/crypto and driver xdp in their containers, the SF approach > will make more sense to them than sriov and VMDq. I am talking about my perspective. From what I have seen, one-off features that are only available from specific vendors are a pain to deal with and difficult to enable when you have to support multiple vendors within your ecosystem. What you end up going for is usually the lowest common denominator because you ideally want to be able to configure all your devices the same and have one recipe for setup. I'm not saying you cannot enable those features. 
However at the same time I am saying it would be nice to have a vendor neutral way of dealing with those if we are going to support SF, ideally with some sort of software fallback that may not perform as well but will at least get us the same functionality. I'm trying to remember which netdev conference it was. I referred to this as a veth switchdev offload when something like this was first brought up. The more I think about it now it would almost make more sense to have something like that as a flavor. The way I view it we have a few different use cases floating around which will have different needs. My thought is having a standardized interface that could address those needs would be a good way to go for this as it would force everyone to come together and define a standardized feature set that all of the vendors would want to expose. > > > > > When subfunction is RDMA capable, it has its own QP1, GID table > > > > > and > > > > > rdma > > > > > resources neither shared nor stealed from the parent PCI > > > > > function. > > > > > > > > > > A subfunction has dedicated window in PCI BAR space that is not > > > > > shared > > > > > with ther other subfunctions or parent PCI function. This > > > > > ensures > > > > > that all > > > > > class devices of the subfunction accesses only assigned PCI BAR > > > > > space. > > > > > > > > > > A Subfunction supports eswitch representation through which it > > > > > supports tc > > > > > offloads. User must configure eswitch to send/receive packets > > > > > from/to > > > > > subfunction port. > > > > > > > > > > Subfunctions share PCI level resources such as PCI MSI-X IRQs > > > > > with > > > > > their other subfunctions and/or with its parent PCI function. > > > > > > > > This piece to the architecture for this has me somewhat > > > > concerned. If > > > > all your resources are shared and you are allowing devices to be > > > > > > not all, only PCI MSIX, for now.. > > > > They aren't shared after you partition them but they are coming from > > the same device. Basically you are subdividing the BAR2 in order to > > generate the subfunctions. BAR2 is a shared resource in my point of > > view. > > > > Sure, but it doesn't host any actual resources, only the communication > channel with the HW partition, so other than the BAR and the msix the > actual HW resources, steering pipelines, offloads and queues are totlay > isolated and separated. I understand what you are trying to say, however this is semantics specific to the implementation. Ultimately you are having to share the function. > > > > created incrementally you either have to pre-partition the entire > > > > function which usually results in limited resources for your base > > > > setup, or free resources from existing interfaces and > > > > redistribute > > > > them as things change. I would be curious which approach you are > > > > taking here? So for example if you hit a certain threshold will > > > > you > > > > need to reset the port and rebalance the IRQs between the various > > > > functions? > > > > > > > > > > Currently SFs will use whatever IRQs the PF has pre-allocated for > > > itself, so there is no IRQ limit issue at the moment, we are > > > considering a dynamic IRQ pool with dynamic balancing, or even > > > better > > > us the IMS approach, which perfectly fits the SF architecture. 
> > > https://patchwork.kernel.org/project/linux-pci/cover/1568338328-22458-1-git-send-email-megha.dey@linux.intel.com/ > > > > When you say you are using the PF's interrupts are you just using > > that > > as a pool of resources or having the interrupt process interrupts for > > both the PF and SFs? Without IMS you are limited to 2048 interrupts. > > Moving over to that would make sense since SF is similar to mdev in > > the way it partitions up the device and resources. > > > > Yes moving to IMS is on the top of our priorities. > > > > for internal resources the are fully isolated (not shared) and > > > they are internally managed by FW exactly like a VF internal > > > resources. > > > > I assume by isolated you mean they are contained within page aligned > > blocks like what was required for mdev? > > I mean they are isolated and abstracted in the FW, we don't really > expose any resource directly to the BAR. the BAR is only used for > communicating with the device, so VF and SF will work exactly the same > the only difference is where they get their BAR and offsets from, > everything else is just similar. I think where you and I differ is our use of the term "resource". I would consider the address space a "resource", while you argue that the resources are hidden behind the BAR. I agree with you that the firmware should be managing most of the resources in the device. So it isn't surprising that it would be splitting them up and then dolling out pieces as needed to put together a subfunction.
On Tue, Dec 15, 2020 at 01:41:04PM -0800, Alexander Duyck wrote: > > not just devlink and switchdev, auxbus was also introduced to > > standardize some of the interfaces. > > The auxbus is just there to make up for the fact that there isn't > another bus type for this though. I would imagine otherwise this would > be on some sort of platform bus. Please lets not start this again. This was gone over with Greg for literally a year and a half and he explicitly NAK'd platform bus for this purpose. Aux bus exists to connect different kernel subsystems that touch the same HW block together. Here we have the mlx5_core subsystem, vdpa, rdma, and netdev all being linked together using auxbus. It is kind of like what MFD does, but again, using MFD for this was also NAK'd by Greg. At the very worst we might sometime find out there is some common stuff between ADIs that we might get an ADI bus, but I'm not optimistic. So far it looks like there is no commonality. Aux bus has at least 4 users already in various stages of submission, and many other target areas that should be replaced by it. > I would really like to see is a solid standardization of what this is. > Otherwise the comparison is going to be made. Especially since a year > ago Mellanox was pushing this as an mdev type interface. mdev was NAK'd too. mdev is only for creating /dev/vfio/*. > That is all well and good. However if we agree that SR-IOV wasn't done > right saying that you are spinning up something that works just like > SR-IOV isn't all that appealing, is it? Fitting into some universal least-common-denominator was never a goal for SR-IOV, so I wouldn't agree it was done wrong. > I am talking about my perspective. From what I have seen, one-off > features that are only available from specific vendors are a pain to > deal with and difficult to enable when you have to support multiple > vendors within your ecosystem. What you end up going for is usually > the lowest common denominator because you ideally want to be able to > configure all your devices the same and have one recipe for setup. So encourage other vendors to support the switchdev model for managing VFs and ADIs! > I'm not saying you cannot enable those features. However at the same > time I am saying it would be nice to have a vendor neutral way of > dealing with those if we are going to support SF, ideally with some > sort of software fallback that may not perform as well but will at > least get us the same functionality. Is it really true there is no way to create a software device on a switchdev today? I looked for a while and couldn't find anything. openvswitch can do this, so it does seem like a gap, but this has nothing to do with this series. A software switchdev path should still end up with the representor and user facing netdev, and the behavior of the two netdevs should be identical to the VF switchdev flow we already have today. SF doesn't change any of this, it just shines a light that, yes, people actually have been using VFs with netdevs in containers and switchdev, as part of their operations. FWIW, I view this as a positive because it shows the switchdev model is working very well and seeing adoption beyond the original idea of controlling VMs with SRIOV. > I'm trying to remember which netdev conference it was. I referred to > this as a veth switchdev offload when something like this was first > brought up. Sure, though I think the way you'd create such a thing might be different. 
These APIs are really about creating an ADI that might be assigned to a VM and never have a netdev. It would be nonsense to create a veth-switchdev thing without a netdev, and there have been various past attempts already NAK'd to transform a netdev into an ADI. Anyhow, if such a thing exists someday it could make sense to automatically substitute the HW version using a SF, if available. > could address those needs would be a good way to go for this as it > would force everyone to come together and define a standardized > feature set that all of the vendors would want to expose. I would say switchdev is already the standard feature set. Jason
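For completeness, the switchdev model being referred to as the standard feature set is selected per device through devlink; the PCI address below is an example value:

$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
$ devlink dev eswitch show pci/0000:06:00.0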
On Tue, Dec 15, 2020 at 10:49 AM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> It isn't "SR-IOV done right" it seems more like "VMDq done better".
I don't think I agree with that assertion. The fact that VMDq can talk
to a common driver still makes VMDq preferable in some respects. Thus,
subfunctions do appear to be more of a better SR-IOV than a better
VMDq, but I'm similarly not sold on whether a better SR-IOV is
sufficient benefit to warrant the additional complexity this
introduces. If I understand correctly, subfunctions buy two things:
1) More than 256 SFs are possible: Maybe it's about time PCI-SIG
addresses this limit for VFs? If that were the only problem with VFs,
then fixing it once there would be cleaner. The devlink interface for
configuring a SF is certainly more sexy than legacy SR-IOV, but it
shouldn't be fundamentally impossible to zhuzh up VFs either. One can
also imagine possibilities around remapping multiple PFs (and their
VFs) in a clever way to get around the limited number of PCI resources
exposed to the host.
2) More flexible division of resources: It's not clear that device
firmware can't perform smarter allocation than N/<num VFs>, but
subfunctions also appear to allow sharing of certain resources by the
PF driver, if desirable. To the extent that resources are shared, how
are workloads isolated from each other?
I'm not sure I like the idea of having to support another resource
allocation model in our driver just to support this, at least not
without a clearer understanding of what is being gained.
Like you, I would also prefer a more common infrastructure for
exposing something based on VirtIO/VMDq as the container/VM facing
netdevs. Is the lowest common denominator that a VMDq based interface
would constrain things to really unsuitable for container use cases?
Is the added complexity and confusion around VF vs SF vs VMDq really
warranted? I also don't see how this tackles container/VF portability,
migration of workloads, kernel network stack bypass, or any of the
other legacy limitations regarding SR-IOV VFs when we have vendor
specific aux bus drivers talking directly to vendor specific backend
hardware resources. In this regard, don't subfunctions, by definition,
have most of the same limitations as SR-IOV VFs?
Regards,
Edwin Peer
On Tue, Dec 15, 2020 at 4:20 PM Jason Gunthorpe <jgg@nvidia.com> wrote: > > On Tue, Dec 15, 2020 at 01:41:04PM -0800, Alexander Duyck wrote: > > > > not just devlink and switchdev, auxbus was also introduced to > > > standardize some of the interfaces. > > > > The auxbus is just there to make up for the fact that there isn't > > another bus type for this though. I would imagine otherwise this would > > be on some sort of platform bus. > > Please lets not start this again. This was gone over with Greg for > literally a year and a half and he explicitly NAK'd platform bus for > this purpose. > > Aux bus exists to connect different kernel subsystems that touch the > same HW block together. Here we have the mlx5_core subsystem, vdpa, > rdma, and netdev all being linked together using auxbus. > > It is kind of like what MFD does, but again, using MFD for this was > also NAK'd by Greg. Sorry I wasn't saying we needed to use MFD or platform bus. I'm aware of the discussions I was just saying there are some parallels, not that we needed to go through and use that. > At the very worst we might sometime find out there is some common > stuff between ADIs that we might get an ADI bus, but I'm not > optimistic. So far it looks like there is no commonality. > > Aux bus has at least 4 users already in various stages of submission, > and many other target areas that should be replaced by it. Aux bus is fine and I am happy with it. I was just saying that auxbus isn't something that should really be used to say that this is significantly different from mdev as they both rely on a bus topology. > > I would really like to see is a solid standardization of what this is. > > Otherwise the comparison is going to be made. Especially since a year > > ago Mellanox was pushing this as an mdev type interface. > > mdev was NAK'd too. > > mdev is only for creating /dev/vfio/*. Agreed. However my worry is that as we start looking to make this support virtualization it will still end up swinging more toward mdev. I would much prefer to make certain that any mdev is a flavor and doesn't end up coloring the entire interface. > > That is all well and good. However if we agree that SR-IOV wasn't done > > right saying that you are spinning up something that works just like > > SR-IOV isn't all that appealing, is it? > > Fitting into some universal least-common-denominator was never a goal > for SR-IOV, so I wouldn't agree it was done wrong. It isn't so much about right or wrong but he use cases. My experience has been that SR-IOV ends up being used for very niche use cases where you are direct assigning it into either DPDK or some NFV VM and you are essentially building the application around the NIC. It is all well and good, but for general virtualization it never really caught on. My thought is that if we are going to do this we need to do this in a way that improves the usability, otherwise we are looking at more niche use cases. > > I am talking about my perspective. From what I have seen, one-off > > features that are only available from specific vendors are a pain to > > deal with and difficult to enable when you have to support multiple > > vendors within your ecosystem. What you end up going for is usually > > the lowest common denominator because you ideally want to be able to > > configure all your devices the same and have one recipe for setup. > > So encourage other vendors to support the switchdev model for managing > VFs and ADIs! Ugh, don't get me started on switchdev. 
The biggest issue as I see it with switchev is that you have to have a true switch in order to really be able to use it. As such dumbed down hardware like the ixgbe for instance cannot use it since it defaults to outputting anything that doesn't have an existing rule to the external port. If we could tweak the design to allow for more dumbed down hardware it would probably be much easier to get wider adoption. Honestly, the switchdev interface isn't what I was talking about. I was talking about the SF interface, not the switchdev side of it. In my mind you can place your complexity into the switchdev side of the interface, but keep the SF interface simple. Then you can back it with whatever you want, but without having to have a vendor specific version of the interface being plugged into the guest or container. One of the reasons why virtio-net is being pushed as a common interface for vendors is for this reason. It is an interface that can be emulated by software or hardware and it allows the guest to run on any arbitrary hardware. > > I'm not saying you cannot enable those features. However at the same > > time I am saying it would be nice to have a vendor neutral way of > > dealing with those if we are going to support SF, ideally with some > > sort of software fallback that may not perform as well but will at > > least get us the same functionality. > > Is it really true there is no way to create a software device on a > switchdev today? I looked for a while and couldn't find > anything. openvswitch can do this, so it does seem like a gap, but > this has nothing to do with this series. It has plenty to do with this series. This topic has been under discussion since something like 2017 when Mellanox first brought it up at Netdev 2.1. At the time I told them they should implement this as a veth offload. Then it becomes obvious what the fallback becomes as you can place packets into one end of a veth and it comes out the other, just like a switchdev representor and the SF in this case. It would make much more sense to do it this way rather than setting up yet another vendor proprietary interface pair. > A software switchdev path should still end up with the representor and > user facing netdev, and the behavior of the two netdevs should be > identical to the VF switchdev flow we already have today. > > SF doesn't change any of this, it just shines a light that, yes, > people actually have been using VFs with netdevs in containers and > switchdev, as part of their operations. > > FWIW, I view this as a positive because it shows the switchdev model > is working very well and seeing adoption beyond the original idea of > controlling VMs with SRIOV. PF/VF isolation is a given. So the existing switchdev behavior is fine in that regard. I wouldn't expect that to change. The fact is we actually support something similar in the macvlan approach we put in the Intel drivers since the macvlan itself provided an option for isolation or to hairpin the traffic if you wanted to allow the VMDq instances to be bypassed. That was another thing I view as a huge feature as broadcast/multicast traffic can really get ugly when the devices are all separate pipelines and being able to switch that off and just do software replication and hairpinning can be extremely useful. > > I'm trying to remember which netdev conference it was. I referred to > > this as a veth switchdev offload when something like this was first > > brought up. > > Sure, though I think the way you'd create such a thing might be > different. 
These APIs are really about creating an ADI that might be > assigned to a VM and never have a netdev. > > It would be nonsense to create a veth-switchdev thing with out a > netdev, and there have been various past attempts already NAK'd to > transform a netdev into an ADI. > > Anyhow, if such a thing exists someday it could make sense to > automatically substitute the HW version using a SF, if available. The main problem as I see it is the fact that the SF interface is bound too tightly to the hardware. The direct pass-thru with no hairpin is always an option but if we are going to have an interface where both ends are in the same system there are many cases where always pushing all the packets off to the hardware don't necessarily make sense. > > could address those needs would be a good way to go for this as it > > would force everyone to come together and define a standardized > > feature set that all of the vendors would want to expose. > > I would say switchdev is already the standard feature set. Yes, it is a standard feature set for the control plane. However for the data-path it is somewhat limited as I feel it only describes what goes through the switch. Not the interfaces that are exposed as the endpoints. It is the problem of that last bit and how it is handled that can make things ugly. For example the multicast/broadcast replication problem that just occurred to me while writing this up. The fact is for east-west traffic there has always been a problem with the switchdev model as it limits everything to PCIe/DMA so there are cases where software switches can outperform the hardware ones.
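The veth analogy above can be sketched directly; the interface and namespace names are made up, and this is purely a software stand-in for the SF/representor pair, with no offload implied:

$ ip link add p0sf88sim type veth peer name p0sf88sim_rep
$ ip link set p0sf88sim netns mycontainer
$ ip link set p0sf88sim_rep up

One end plays the role of the device handed to the container; the peer stays on the host, where a software switch (or, with an offload someday, the eswitch) would forward its traffic.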
On Tue, Dec 15, 2020 at 05:12:33PM -0800, Edwin Peer wrote: > 1) More than 256 SFs are possible: Maybe it's about time PCI-SIG > addresses this limit for VFs? They can't, the Bus/Device/Function is limited by protocol and changing that would upend the entire PCI world. Instead PCI-SIG said PASID is the way forward. > If that were the only problem with VFs, then fixing it once there > would be cleaner. Maybe, but half the problem with VFs is how HW expensive they are. The mlx5 SF version is not such a good example, but Intel has shown in other recent patches, like for their idxd, that the HW side of an ADI can be very simple and hypervisor emulation can build a simple HW capability into a full ADI for assignment to a guest. A lot of the trappings that PCI-SIG requires to be implemented in HW for a VF, like PCI config space, MSI tables, BAR space, etc. are all just dead weight when scaling up to 1000's of VFs. The ADI scheme is not that bad, the very simplest HW is just a queue that can have all DMA contained by a PASID and can trigger an addr/data interrupt message. Much less HW costly than a SRIOV VF. Regardless, Intel kicked this path off years ago when they published their SIOV cookbook and everyone started integrating PASID support into their IOMMUs and working on ADIs. The mlx5 SFs are kind of early because the HW is flexible enough to avoid the parts of SIOV that are not ready or widely deployed yet, like IMS and PASID. > Like you, I would also prefer a more common infrastructure for > exposing something based on VirtIO/VMDq as the container/VM facing > netdevs. A major point is to get switchdev. > I also don't see how this tackles container/VF portability, > migration of workloads, kernel network stack bypass, or any of the > other legacy limitations regarding SR-IOV VFs It isn't meant to. SF/ADI are just a way to have more VFs than PCI SIG can support. Jason
On Tue, Dec 15, 2020 at 06:19:18PM -0800, Alexander Duyck wrote: > > > I would really like to see is a solid standardization of what this is. > > > Otherwise the comparison is going to be made. Especially since a year > > > ago Mellanox was pushing this as an mdev type interface. > > > > mdev was NAK'd too. > > > > mdev is only for creating /dev/vfio/*. > > Agreed. However my worry is that as we start looking to make this > support virtualization it will still end up swinging more toward > mdev. Of course. mdev is also the only way to create a /dev/vfio/* :) So all paths that want to use vfio must end up creating an mdev. Here we would choose to create the mdev on top of the SF aux device. There isn't really anything mlx5 specific about that decision. The SF models the vendor specific ADI in the driver model. > It isn't so much about right or wrong but the use cases. My experience > has been that SR-IOV ends up being used for very niche use cases where > you are direct assigning it into either DPDK or some NFV VM and you > are essentially building the application around the NIC. It is all > well and good, but for general virtualization it never really caught > on. Sure > > So encourage other vendors to support the switchdev model for managing > > VFs and ADIs! > > Ugh, don't get me started on switchdev. The biggest issue as I see it > with switchdev is that you have to have a true switch in order to > really be able to use it. That cuts both ways, suggesting HW with a true switch model itself with VMDq is equally problematic. > As such dumbed down hardware like the ixgbe for instance cannot use > it since it defaults to outputting anything that doesn't have an > existing rule to the external port. If we could tweak the design to > allow for more dumbed down hardware it would probably be much easier > to get wider adoption. I'd agree with this > interface, but keep the SF interface simple. Then you can back it with > whatever you want, but without having to have a vendor specific > version of the interface being plugged into the guest or container. The entire point *is* to create the vendor version because that serves the niche cases where SRIOV assignment is already being used. Having a general solution that can't do vendor SRIOV is useful for other applications, but doesn't eliminate the need for the SRIOV case. > One of the reasons why virtio-net is being pushed as a common > interface for vendors is for this reason. It is an interface that can > be emulated by software or hardware and it allows the guest to run on > any arbitrary hardware. Yes, and there is mlx5_vdpa to support this use case, and it binds to the SF. Of course all of that is vendor specific too; the driver to convert HW specific register programming into a virtio-net ADI has to live *somewhere* > It has plenty to do with this series. This topic has been under > discussion since something like 2017 when Mellanox first brought it up > at Netdev 2.1. At the time I told them they should implement this as a > veth offload. veth doesn't give an ADI; it is useless for these niche cases. veth offload might be interesting for some container case, but feels like writing an enormous amount of code to accomplish nothing new... > Then it becomes obvious what the fallback becomes as you can place > packets into one end of a veth and it comes out the other, just like > a switchdev representor and the SF in this case. It would make much > more sense to do it this way rather than setting up yet another > vendor proprietary interface pair.
I agree it makes sense to have an all SW veth-like option, but I wouldn't try to make that the entry point for all the HW acceleration or to serve the niche SRIOV use cases, or to represent an ADI. It just can't do that and it would make a huge mess if you tried to force it. Didn't Intel already try this once with trying to use the macvlan netdev and its queue offload to build an ADI? > > Anyhow, if such a thing exists someday it could make sense to > > automatically substitute the HW version using a SF, if available. > > The main problem as I see it is the fact that the SF interface is > bound too tightly to the hardware. That is the goal here. This is not about creating just a netdev, this is about the whole kit: rdma, netdev, vdpa virtio-net, virtio-mdev. The SF has to support all of that completely. Focusing only on the one use case of netdevs in containers misses the bigger picture. Yes, lots of this stuff is niche, but niche stuff needs to be supported too. > Yes, it is a standard feature set for the control plane. However for > the data-path it is somewhat limited as I feel it only describes what > goes through the switch. Sure, I think that is its main point. > Not the interfaces that are exposed as the endpoints. It came from modeling physical HW, so the endpoints are 'physical' things like actual HW switch ports, or SRIOV VFs, ADI, etc. > It is the problem of that last bit and how it is handled that can > make things ugly. For example the multicast/broadcast replication > problem that just occurred to me while writing this up. The fact is > for east-west traffic there has always been a problem with the > switchdev model as it limits everything to PCIe/DMA so there are > cases where software switches can outperform the hardware ones. Yes, but mixing CPU and DMA in the same packet delivery scheme is very complicated :) Jason
On Tue, Dec 15, 2020 at 5:13 PM Edwin Peer <edwin.peer@broadcom.com> wrote: > > On Tue, Dec 15, 2020 at 10:49 AM Alexander Duyck > <alexander.duyck@gmail.com> wrote: > > > It isn't "SR-IOV done right" it seems more like "VMDq done better". > > I don't think I agree with that assertion. The fact that VMDq can talk > to a common driver still makes VMDq preferable in some respects. Thus, > subfunctions do appear to be more of a better SR-IOV than a better > VMDq, but I'm similarly not sold on whether a better SR-IOV is > sufficient benefit to warrant the additional complexity this > introduces. If I understand correctly, subfunctions buy two things: > > 1) More than 256 SFs are possible: Maybe it's about time PCI-SIG > addresses this limit for VFs? If that were the only problem with VFs, > then fixing it once there would be cleaner. The devlink interface for > configuring a SF is certainly more sexy than legacy SR-IOV, but it > shouldn't be fundamentally impossible to zhuzh up VFs either. One can > also imagine possibilities around remapping multiple PFs (and their > VFs) in a clever way to get around the limited number of PCI resources > exposed to the host. The fact is SR-IOV just wasn't designed to scale well. I think we are probably going to see most vendors move away from it. The fact is what we are talking about now is the future of all this and how to implement Scalable I/O Virtualization (https://software.intel.com/content/www/us/en/develop/download/intel-scalable-io-virtualization-technical-specification.html). The document is a good primer to many of the features we are discussing as it describes how to compose a device. The problem is as it was with SR-IOV that the S-IOV specification is very PCIe centric and doesn't do a good job explaining how to deal with the network as it relates to all this. Then to complicate things the S-IOV expected this to be used with direct assigned devices for guests/applications and instead we are talking about using the devices in the host which makes things a bit messier. > 2) More flexible division of resources: It's not clear that device > firmwarre can't perform smarter allocation than N/<num VFs>, but > subfunctions also appear to allow sharing of certain resources by the > PF driver, if desirable. To the extent that resources are shared, how > are workloads isolated from each other? > > I'm not sure I like the idea of having to support another resource > allocation model in our driver just to support this, at least not > without a clearer understanding of what is being gained. I view this as the future alternative to SR-IOV. It is just a matter of how we define it. Eventually we probably would be dropping the SR-IOV implementation and instead moving over to S-IOV as an alternative instead. As such if this is done right I don't see this as a thing where we need to support both. Really we should be able to drop support for one if we have the other.
On Tue, Dec 15, 2020 at 7:04 PM Jason Gunthorpe <jgg@nvidia.com> wrote: > > On Tue, Dec 15, 2020 at 06:19:18PM -0800, Alexander Duyck wrote: > > > > > I would really like to see is a solid standardization of what this is. > > > > Otherwise the comparison is going to be made. Especially since a year > > > > ago Mellanox was pushing this as an mdev type interface. > > > > > > mdev was NAK'd too. > > > > > > mdev is only for creating /dev/vfio/*. > > > > Agreed. However my worry is that as we start looking to make this > > support virtualization it will still end up swinging more toward > > mdev. > > Of course. mdev is also the only way to create a /dev/vfio/* :) > > So all paths that want to use vfio must end up creating a mdev. > > Here we would choose to create the mdev on top of the SF aux device. > There isn't really anything mlx5 specific about that decision. > > The SF models the vendor specific ADI in the driver model. > > > It isn't so much about right or wrong but he use cases. My experience > > has been that SR-IOV ends up being used for very niche use cases where > > you are direct assigning it into either DPDK or some NFV VM and you > > are essentially building the application around the NIC. It is all > > well and good, but for general virtualization it never really caught > > on. > > Sure > > > > So encourage other vendors to support the switchdev model for managing > > > VFs and ADIs! > > > > Ugh, don't get me started on switchdev. The biggest issue as I see it > > with switchev is that you have to have a true switch in order to > > really be able to use it. > > That cuts both ways, suggesting HW with a true switch model itself > with VMDq is equally problematic. Yes and no. For example the macvlan offload I had setup could be configured both ways and it made use of VMDq. I'm not necessarily arguing that we need to do VMDq here, however at the same time saying that this is only meant to replace SR-IOV becomes problematic since we already have SR-IOV so why replace it with something that has many of the same limitations? > > As such dumbed down hardware like the ixgbe for instance cannot use > > it since it defaults to outputting anything that doesn't have an > > existing rule to the external port. If we could tweak the design to > > allow for more dumbed down hardware it would probably be much easier > > to get wider adoption. > > I'd agree with this > > > interface, but keep the SF interface simple. Then you can back it with > > whatever you want, but without having to have a vendor specific > > version of the interface being plugged into the guest or container. > > The entire point *is* to create the vendor version because that serves > the niche cases where SRIOV assignment is already being used. > > Having a general solution that can't do vendor SRIOV is useful for > other application, but doesn't eliminate the need for the SRIOV case. So part of the problem here is we already have SR-IOV. So we don't need to repeat the mistakes. Rather, we need to have a solution to the existing problems and then we can look at eliminating it. That said I understand your argument, however I view the elimination of SR-IOV to be something we do after we get this interface right and can justify doing so. I don't have a problem necessarily with vendor specific instances, unless we are only able to get vendor specific instances. Thus I would prefer that we have a solution in place before we allow the switch over. 
> > One of the reasons why virtio-net is being pushed as a common > > interface for vendors is for this reason. It is an interface that can > > be emulated by software or hardware and it allows the guest to run on > > any arbitrary hardware. > > Yes, and there is mlx5_vdpa to support this usecase, and it binds to > the SF. Of course all of that is vendor specific too, the driver to > convert HW specifc register programming into a virio-net ADI has to > live *somewhere* Right, but this is more the model I am in favor of. The backend is hidden from the guest and lives somewhere on the host. Also it might be useful to call out the flavours and planned flavours in the cover page. Admittedly the description is somewhat lacking in that regard. > > It has plenty to do with this series. This topic has been under > > discussion since something like 2017 when Mellanox first brought it up > > at Netdev 2.1. At the time I told them they should implement this as a > > veth offload. > > veth doesn't give an ADI, it is useless for these niche cases. > > veth offload might be interesting for some container case, but feels > like writing an enormous amount of code to accomplish nothing new... My concern is if we are going to start partitioning up a PF on the host we might as well make the best use of it. I would argue that it would make more sense to have some standardized mechanism in place for the PF to communicate and interact with the SFs. I would argue that is one of the reasons why this keeps being compared to either VMDq or VMQ as it is something that SR-IOV has yet to fully replace and has many features that would be useful in an interface that is a subpartition of an existing interface. > > Then it becomes obvious what the fallback becomes as you can place > > packets into one end of a veth and it comes out the other, just like > > a switchdev representor and the SF in this case. It would make much > > more sense to do it this way rather than setting up yet another > > vendor proprietary interface pair. > > I agree it makes sense to have an all SW veth-like option, but I > wouldn't try to make that as the entry point for all the HW > acceleration or to serve the niche SRIOV use cases, or to represent an > ADI. > > It just can't do that and it would make a huge mess if you tried to > force it. Didn't Intel already try this once with trying to use the > macvlan netdev and its queue offload to build an ADI? The Intel drivers still have the macvlan as the assignable ADI and make use of VMDq to enable it. Actually I would consider it an example of the kind of thing I am talking about. It is capable of doing software switching between interfaces, broadcast/multicast replication in software, and makes use of the hardware interfaces to allow for receiving directly from the driver into the macvlan interface. The limitation as I see it is that the macvlan interface doesn't allow for much in the way of custom offloads and the Intel hardware doesn't support switchdev. As such it is good for a basic interface, but doesn't really do well in terms of supporting advanced vendor-specific features. > > > Anyhow, if such a thing exists someday it could make sense to > > > automatically substitute the HW version using a SF, if available. > > > > The main problem as I see it is the fact that the SF interface is > > bound too tightly to the hardware. > > That is goal here. This is not about creating just a netdev, this is > about the whole kit: rdma, netdev, vdpa virtio-net, virtio-mdev. 
One issue is right now we are only seeing the rdma and netdev. It is kind of backwards as it is using the ADIs on the host when this was really meant to be used for things like mdev. > The SF has to support all of that completely. Focusing only on the > one use case of netdevs in containers misses the bigger picture. > > Yes, lots of this stuff is niche, but niche stuff needs to be > supported too. I have no problem with niche stuff, however we need to address the basics before we move on to the niche stuff. > > Yes, it is a standard feature set for the control plane. However for > > the data-path it is somewhat limited as I feel it only describes what > > goes through the switch. > > Sure, I think that is its main point. > > > Not the interfaces that are exposed as the endpoints. > > It came from modeling physical HW so the endports are 'physical' > things like actual HW switch ports, or SRIOV VFs, ADI, etc. The problem is the "physical things" such as the SRIOV VFs and ADI aren't really defined in the specification and are left up to the implementers interpretation. These specs have always been fuzzy since they are essentially PCI specifications and don't explain anything about how the network on such a device should be configured or expected to work. The swtichdev API puts some restrictions in place but there still ends up being parts without any definition. > > It is the problem of that last bit and how it is handled that can > > make things ugly. For example the multicast/broadcast replication > > problem that just occurred to me while writing this up. The fact is > > for east-west traffic there has always been a problem with the > > switchdev model as it limits everything to PCIe/DMA so there are > > cases where software switches can outperform the hardware ones. > > Yes, but, mixing CPU and DMA in the same packet delivery scheme is > very complicated :) I'm not necessarily saying we need to mix the two. However there are cases such as multicast/broadcast where it would make much more sense to avoid the duplication of packets and instead simply send one copy and have it replicated by the software. What would probably make sense for now would be to look at splitting the netdev into two pieces. The frontend which would provide the netdev and be a common driver for subfunction netdevs in this case, and a backend which would be a common point for all the subfunctions that are being used directly on the host. This is essentially what we have with the macvlan model. The idea is that if we wanted to do software switching or duplication of traffic we could, but if not then we wouldn't.
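As a rough sketch of the frontend/backend split being proposed just above (all of the structures and names below are hypothetical; nothing like this exists in the kernel today), a vendor-neutral subfunction frontend netdev could delegate its datapath to an ops table supplied by the backend, so the same container-facing netdev could be backed by hardware queues or by a software path when switching or replication is wanted:

#include <linux/netdevice.h>

/* hypothetical backend contract a vendor driver would implement */
struct sf_backend_ops {
	netdev_tx_t (*xmit)(void *ctx, struct sk_buff *skb);
	int (*open)(void *ctx);
	int (*stop)(void *ctx);
};

/* hypothetical vendor-neutral frontend, the netdev handed to a container */
struct sf_frontend {
	const struct sf_backend_ops *ops;	/* HW queues or SW switch */
	void *ctx;				/* backend private state */
};

static netdev_tx_t sf_frontend_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct sf_frontend *fe = netdev_priv(dev);

	/* the frontend never changes; only the backend behind it does */
	return fe->ops->xmit(fe->ctx, skb);
}

Swapping the backend (for failover, hotplug, or to insert software replication) would then not disturb the netdev that was already pushed into the container, which is the property being asked for here.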
> From: Alexander Duyck <alexander.duyck@gmail.com> > Sent: Wednesday, December 16, 2020 9:43 AM > > > > > That is the goal here. This is not about creating just a netdev, this is > > about the whole kit: rdma, netdev, vdpa virtio-net, virtio-mdev. > > One issue is right now we are only seeing the rdma and netdev. It is kind of > backwards as it is using the ADIs on the host when this was really meant to > be used for things like mdev. > mdev is just yet another _use_ of a subfunction. There are users of a subfunction who want to use it without mapping the subfunction as an mdev to a guest VM, i.e. as netdev, rdma, or vdpa. In the future, direct assignment can be done via an mdev of the subfunction, like the rest of the devices above. Creating a subfunction for a non-vfio purpose via a vfio mdev is just not right. If I understand you correctly, I hope you are not suggesting that.
On Tue, Dec 15, 2020 at 01:28:05PM -0800, Jakub Kicinski wrote: > On Tue, 15 Dec 2020 12:35:20 -0800 Saeed Mahameed wrote: > > > I think the big thing we really should do if we are going to go this > > > route is to look at standardizing what the flavours are that get > > > created by the parent netdevice. Otherwise we are just creating the > > > same mess we had with SRIOV all over again and muddying the waters of > > > mediated devices. > > > > yes in the near future we will be working on auxbus interfaces for > > auto-probing and user flavor selection, this is a must have feature for > > us. > > Can you elaborate? I thought config would be via devlink. Yes, everything continues to be done through devlink. One of the immediate features is an ability to disable/enable creation of specific SF types. For example, if user doesn't want RDMA, the SF RDMA won't be created. Thanks
On Tue, Dec 15, 2020 at 08:13:21PM -0800, Alexander Duyck wrote: > > > Ugh, don't get me started on switchdev. The biggest issue as I see it > > > with switchdev is that you have to have a true switch in order to > > > really be able to use it. > > > > That cuts both ways, suggesting HW with a true switch model itself > > with VMDq is equally problematic. > > Yes and no. For example the macvlan offload I had setup could be > configured both ways and it made use of VMDq. I'm not necessarily > arguing that we need to do VMDq here, however at the same time saying > that this is only meant to replace SR-IOV becomes problematic since we > already have SR-IOV so why replace it with something that has many of > the same limitations? Why? Because SR-IOV is the *only* option for many use cases. Still. I said this already: something more generic does not magically eliminate SR-IOV. The SIOV ADI model is a small refinement to the existing VF scheme; it is completely parallel to making more generic things. It is not "repeating mistakes", it is accepting the limitations of SR-IOV because benefits exist and applications need those benefits. > That said I understand your argument, however I view the elimination > of SR-IOV to be something we do after we get this interface right and > can justify doing so. Elimination of SR-IOV isn't even a goal here! > Also it might be useful to call out the flavours and planned flavours > in the cover page. Admittedly the description is somewhat lacking in > that regard. This is more of a general switchdev remark though. In the switchdev model you have the switch and a switch port. Each port has a switchdev representor on the switch side and a "user port" of some kind.

It can be a physical thing:
 - SFP
 - QSFP
 - WiFi Antennae

It could be a semi-physical thing outside the view of the kernel:
 - SmartNIC VF/SF attached to another CPU

It can be a semi-physical thing in view of this kernel:
 - SRIOV VF (struct pci device)
 - SF (struct aux device)

It could be a SW construct in this kernel:
 - netdev (struct net device)

*all* of these different port types are needed. Probably more down the road! Notice I don't have VDPA, VF/SF netdev, or virtio-mdev as a "user port" type here. Instead creating the user port pci or aux device allows the user to use the Linux driver model to control what happens to the pci/aux device next. > I would argue that is one of the reasons why this keeps being > compared to either VMDq or VMQ as it is something that SR-IOV has > yet to fully replace and has many features that would be useful in > an interface that is a subpartition of an existing interface. In what sense do switchdev and a VF not fully replace macvlan VMDq? > The Intel drivers still have the macvlan as the assignable ADI and > make use of VMDq to enable it. Is this in-tree or only in the proprietary driver? AFAIK there is no in-tree way to extract the DMA queue from the macvlan netdev into userspace. Remember all this VF/SF/VDPA stuff results in a HW dataplane, not a SW one. It doesn't really make sense to compare a SW dataplane to a HW one. HW dataplanes come with limitations and require special driver code. > The limitation as I see it is that the macvlan interface doesn't allow > for much in the way of custom offloads and the Intel hardware doesn't > support switchdev. As such it is good for a basic interface, but > doesn't really do well in terms of supporting advanced vendor-specific > features.
I don't know what it is that prevents Intel from modeling their selector HW in switchdev, but I think it is on them to work with the switchdev folks to figure something out. I'm a bit surprised HW that can do macvlan can't be modeled with switchdev? What is missing? > > That is the goal here. This is not about creating just a netdev, this is > > about the whole kit: rdma, netdev, vdpa virtio-net, virtio-mdev. > > One issue is right now we are only seeing the rdma and netdev. It is > kind of backwards as it is using the ADIs on the host when this was > really meant to be used for things like mdev. This is the second 15-patch series on this path already. It is not possible to pack every single thing into this series. This is the micro step of introducing the SF idea and using SF==VF to show how the driver stack works. The minimal changes to the existing drivers imply this can support an ADI as well. Further, this does already show an ADI! vdpa_mlx5 will run on the VF/SF and eventually causes qemu to build a virtio-net ADI that directly passes HW DMA rings into the guest. Isn't this exactly the kind of generic SRIOV replacement option you have been asking for? Doesn't this completely supersede stuff built on macvlan? > expected to work. The switchdev API puts some restrictions in place > but there still ends up being parts without any definition. I'm curious what you see as needing definition here? In the SRIOV model the HW register programming API is device specific. The switchdev model is: no matter what HW register programming is done on the VF/SF, all the packets tx/rx'd will flow through the switchdev. The purpose of switchdev/SRIOV/SIOV has never been to define a single "one register set to rule them all". That is the area that VDPA virtio-net and others are covering. Jason
On Wed, Dec 16, 2020 at 5:33 AM Jason Gunthorpe <jgg@nvidia.com> wrote: > > On Tue, Dec 15, 2020 at 08:13:21PM -0800, Alexander Duyck wrote: > > > > > Ugh, don't get me started on switchdev. The biggest issue as I see it > > > > with switchev is that you have to have a true switch in order to > > > > really be able to use it. > > > > > > That cuts both ways, suggesting HW with a true switch model itself > > > with VMDq is equally problematic. > > > > Yes and no. For example the macvlan offload I had setup could be > > configured both ways and it made use of VMDq. I'm not necessarily > > arguing that we need to do VMDq here, however at the same time saying > > that this is only meant to replace SR-IOV becomes problematic since we > > already have SR-IOV so why replace it with something that has many of > > the same limitations? > > Why? Because SR-IOV is the *only* option for many use cases. Still. I > said this already, something more generic does not magicaly eliminate > SR-IOV. > > The SIOV ADI model is a small refinement to the existing VF scheme, it > is completely parallel to making more generic things. > > It is not "repeating mistakes" it is accepting the limitations of > SR-IOV because benefits exist and applications need those benefits. If we have two interfaces, both with pretty much the same limitations then many would view it as "repeating mistakes". The fact is we already have SR-IOV. Why introduce yet another interface that has the same functionality? You say this will scale better but I am not even sure about that. The fact is SR-IOV could scale to 256 VFs, but for networking I kind of doubt the limitation would have been the bus number and would more likely be issues with packet replication and PCIe throughput, especially when you start dealing with east-west traffic within the same system. > > That said I understand your argument, however I view the elimination > > of SR-IOV to be something we do after we get this interface right and > > can justify doing so. > > Elimination of SR-IOV isn't even a goal here! Sorry you used the word "replace", and my assumption here was that the goal is to get something in place that can take the place of SR-IOV so that you wouldn't be maintaining the two systems at the same time. That is my concern as I don't want us having SR-IOV, and then several flavors of SIOV. We need to decide on one thing that will be the way forward. > > Also it might be useful to call out the flavours and planned flavours > > in the cover page. Admittedly the description is somewhat lacking in > > that regard. > > This is more of a general switchdev remark though. In the swithdev > model you have a the switch and a switch port. Each port has a > swichdev representor on the switch side and a "user port" of some > kind. > > It can be a physical thing: > - SFP > - QSFP > - WiFi Antennae > > It could be a semi-physical thing outside the view of the kernel: > - SmartNIC VF/SF attached to another CPU > > It can be a semi-physical thing in view of this kernel: > - SRIOV VF (struct pci device) > - SF (struct aux device) > > It could be a SW construct in this kernel: > - netdev (struct net device) > > *all* of these different port types are needed. Probably more down the > road! > > Notice I don't have VPDA, VF/SF netdev, or virtio-mdev as a "user > port" type here. Instead creating the user port pci or aux device > allows the user to use the Linux driver model to control what happens > to the pci/aux device next. I get that. 
That is why I said switchdev isn't a standard for the endpoint. One of the biggest issues with SR-IOV that I have seen is the fact that the last piece isn't really defined. We never did a good job of defining how the ADI should look to the guest and as a result it kind of stalled in adoption. > > I would argue that is one of the reasons why this keeps being > > compared to either VMDq or VMQ as it is something that SR-IOV has > > yet to fully replace and has many features that would be useful in > > an interface that is a subpartition of an existing interface. > > In what sense does switchdev and a VF not fully replace macvlan VMDq? One of the biggest is east-west traffic. You quickly run up against the PCIe bandwidth bottleneck and then the performance tanks. I have seen a number of cases where peer-to-peer on the same host swamps the network interface. > > The Intel drivers still have the macvlan as the assignable ADI and > > make use of VMDq to enable it. > > Is this in-tree or only in the proprietary driver? AFAIK there is no > in-tree way to extract the DMA queue from the macvlan netdev into > userspace.. > > Remeber all this VF/SF/VDPA stuff results in a HW dataplane, not a SW > one. It doesn't really make sense to compare a SW dataplane to a HW > one. HW dataplanes come with limitations and require special driver > code. I get that. At the same time we can mask some of those limitations by allowing for the backend to be somewhat abstract so you have the possibility of augmenting the hardware dataplane with a software one if needed. > > The limitation as I see it is that the macvlan interface doesn't allow > > for much in the way of custom offloads and the Intel hardware doesn't > > support switchdev. As such it is good for a basic interface, but > > doesn't really do well in terms of supporting advanced vendor-specific > > features. > > I don't know what it is that prevents Intel from modeling their > selector HW in switchdev, but I think it is on them to work with the > switchdev folks to figure something out. They tried for the ixgbe and i40e. The problem is the hardware couldn't conform to what was asked for if I recall. It has been a few years since I worked in the Ethernet group at intel so I don't recall the exact details. > I'm a bit surprised HW that can do macvlan can't be modeled with > switchdev? What is missing? If I recall it was the fact that the hardware defaults to transmitting everything that doesn't match an existing rule to the external port unless it comes from the external port. > > > That is goal here. This is not about creating just a netdev, this is > > > about the whole kit: rdma, netdev, vdpa virtio-net, virtio-mdev. > > > > One issue is right now we are only seeing the rdma and netdev. It is > > kind of backwards as it is using the ADIs on the host when this was > > really meant to be used for things like mdev. > > This is second 15 patch series on this path already. It is not > possible to pack every single thing into this series. This is the > micro step of introducing the SF idea and using SF==VF to show how the > driver stack works. The minimal changing to the existing drivers > implies this can support an ADI as well. > > Further, this does already show an ADI! vdpa_mlx5 will run on the > VF/SF and eventually causes qemu to build a virtio-net ADI that > directly passes HW DMA rings into the guest. > > Isn't this exactly the kind of generic SRIOV replacement option you > have been asking for? Doesn't this completely supersede stuff built on > macvlan? 
Something like the vdpa model is more like what I had in mind. Only vdpa only works for the userspace networking case. Basically the idea is to have an assignable device interface that isn't directly tied to the hardware. Instead it is making use of a slice of it and referencing the PF as the parent leaving the PF as the owner of the slice. If at some point in the future we could make changes to allow for software to step in and do some switching if needed. The key bit is the abstraction of the assignable interface so that it is vendor agnostic and could be switched over to pure software backing if needed. > > expected to work. The swtichdev API puts some restrictions in place > > but there still ends up being parts without any definition. > > I'm curious what you see as needing definition here? > > The SRIOV model has the HW register programming API is device > specific. > > The switchdev model is: no matter what HW register programing is done > on the VF/SF all the packets tx/rx'd will flow through the switchdev. > > The purpose of switchdev/SRIOV/SIOV has never been to define a single > "one register set to rule them all". > > That is the area that VDPA virtio-net and others are covering. That is fine and that covers it for direct assigned devices. However that doesn't cover the container case. My thought is if we are going to partition a PF into multiple netdevices we should have some generic interface that can be provided to represent the netdevs so that if they are pushed into containers you don't have to rip them out if for some reason you need to change the network configuration. For the Intel NICs we did that with macvlan in the VMDq case. I see no reason why you couldn't do something like that here with the subfunction case.
On Wed, Dec 16, 2020 at 08:31:44AM -0800, Alexander Duyck wrote: > You say this will scale better but I am not even sure about that. The > fact is SR-IOV could scale to 256 VFs, but for networking I kind of > doubt the limitation would have been the bus number and would more > likely be issues with packet replication and PCIe throughput, > especially when you start dealing with east-west traffic within the > same system. We have been seeing deployments already hitting the 256 limit. This is not a "theoretical use" patch set. There are already VM and container farms with SW networking that support much more than 256 VM/containers per server. The optimization here is to reduce the hypervisor workload and free up CPU cycles for the VMs/containers to consume. This means less handling of packets in the CPU, especially for VM cases w/ SRIOV or VDPA. Even the extra DMA on the NIC is not really a big deal. These are 400G NICs with big fast PCI. If you top them out you are already doing an aggregate of 400G of network traffic. That is a big number for a single sever, it is OK. Someone might feel differently if they did this on a 10/40G NIC, in which case this is not the solution for their application. > Sorry you used the word "replace", and my assumption here was that the > goal is to get something in place that can take the place of SR-IOV so > that you wouldn't be maintaining the two systems at the same time. > That is my concern as I don't want us having SR-IOV, and then several > flavors of SIOV. We need to decide on one thing that will be the way > forward. SRIOV has to continue until the PASID and IMS platform features are widely available and mature. It will probably be 10 years before we see most people able to use SIOV for everything they want. I think we will see lots of SIOV varients, I know Intel is already pushing SIOV parts outside netdev. > I get that. That is why I said switchdev isn't a standard for the > endpoint. One of the biggest issues with SR-IOV that I have seen is > the fact that the last piece isn't really defined. We never did a good > job of defining how the ADI should look to the guest and as a result > it kind of stalled in adoption. The ADI is supposed to present the HW programming API that is desired. It is always up to the implementation. SIOV was never a project to standardize HW programming models like virtio-net, NVMe, etc. > > I'm a bit surprised HW that can do macvlan can't be modeled with > > switchdev? What is missing? > > If I recall it was the fact that the hardware defaults to transmitting > everything that doesn't match an existing rule to the external port > unless it comes from the external port. That seems small enough it should be resolvable, IMHO. eg some new switch rule that matches that specific HW behavior? > Something like the vdpa model is more like what I had in mind. Only > vdpa only works for the userspace networking case. That's because making a driver that converts the native HW to VDPA and then running a generic netdev on the resulting virtio-net is a pretty wild thing to do. I can't really think of an actual use case. > Basically the idea is to have an assignable device interface that > isn't directly tied to the hardware. The switchdev model is to create a switch port. As I explained in Linux we see "pci device" and "aux device" as being some "user port" options to access to that switch. If you want a "generic device" that is fine, but what exactly is that programming interface in Linux? Sketch out an API, where does the idea go? 
What does the driver that implements it look like? What consumes it? Should this be a great idea, then an mlx5 version of this will still be to create an SF aux device, bind mlx5_core, then bind "generic device" on top of that. This is simply a reflection of how the mlx5 HW/SW layering works. Squashing all of this into a single layer is work with a bad ROI. > they are pushed into containers you don't have to rip them out if for > some reason you need to change the network configuration. Why would you need to rip them out? Jason
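For context, the layering described here maps onto the auxiliary bus API this series depends on: the parent PCI driver creates an SF auxiliary device, and a driver binds to it and instantiates the class devices (netdev, rdma, vdpa, ...) from its probe routine. A minimal, hedged sketch of such a binding (the match string follows the auxiliary bus "modname.devname" convention; the "mlx5_core.sf" value and example_* names are illustrative):

#include <linux/auxiliary_bus.h>
#include <linux/module.h>

static int example_sf_probe(struct auxiliary_device *adev,
			    const struct auxiliary_device_id *id)
{
	/* instantiate the SF's class devices here, much as a VF driver
	 * would do from its PCI probe */
	return 0;
}

static void example_sf_remove(struct auxiliary_device *adev)
{
	/* tear down whatever probe created */
}

static const struct auxiliary_device_id example_sf_id_table[] = {
	{ .name = "mlx5_core.sf" },	/* illustrative match string */
	{}
};
MODULE_DEVICE_TABLE(auxiliary, example_sf_id_table);

static struct auxiliary_driver example_sf_driver = {
	.name = "example_sf",
	.probe = example_sf_probe,
	.remove = example_sf_remove,
	.id_table = example_sf_id_table,
};
module_auxiliary_driver(example_sf_driver);

MODULE_LICENSE("GPL");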
On Wed, 2020-12-16 at 08:50 +0200, Leon Romanovsky wrote: > On Tue, Dec 15, 2020 at 01:28:05PM -0800, Jakub Kicinski wrote: > > On Tue, 15 Dec 2020 12:35:20 -0800 Saeed Mahameed wrote: > > > > I think the big thing we really should do if we are going to go > > > > this > > > > route is to look at standardizing what the flavours are that > > > > get > > > > created by the parent netdevice. Otherwise we are just creating > > > > the > > > > same mess we had with SRIOV all over again and muddying the > > > > waters of > > > > mediated devices. > > > > > > yes in the near future we will be working on auxbus interfaces > > > for > > > auto-probing and user flavor selection, this is a must have > > > feature for > > > us. > > > > Can you elaborate? I thought config would be via devlink. > > Yes, everything continues to be done through devlink. > > One of the immediate features is an ability to disable/enable > creation > of specific SF types. > > For example, if user doesn't want RDMA, the SF RDMA won't be created. > Devlink is an option too; we still don't have our minds set on a specific API. We are considering both as valuable solutions, since devlink makes sense as the go-to interface for everything SF, but on the other hand auto-probing and device instantiation are done at the auxbus level, so it also makes sense to have some sort of "device type" user selection API in the auxbus. Anyway, this discussion is for a future patch.
On Wed, Dec 16, 2020 at 9:51 AM Jason Gunthorpe <jgg@nvidia.com> wrote: > > On Wed, Dec 16, 2020 at 08:31:44AM -0800, Alexander Duyck wrote: > > > You say this will scale better but I am not even sure about that. The > > fact is SR-IOV could scale to 256 VFs, but for networking I kind of > > doubt the limitation would have been the bus number and would more > > likely be issues with packet replication and PCIe throughput, > > especially when you start dealing with east-west traffic within the > > same system. > > We have been seeing deployments already hitting the 256 limit. This is > not a "theoretical use" patch set. There are already VM and container > farms with SW networking that support much more than 256 VM/containers > per server. That has been the case for a long time. However it had been my experience that SR-IOV never scaled well to meet those needs and so it hadn't been used in such deployments. > The optimization here is to reduce the hypervisor workload and free up > CPU cycles for the VMs/containers to consume. This means less handling > of packets in the CPU, especially for VM cases w/ SRIOV or VDPA. > > Even the extra DMA on the NIC is not really a big deal. These are 400G > NICs with big fast PCI. If you top them out you are already doing an > aggregate of 400G of network traffic. That is a big number for a > single sever, it is OK. Yes, but at a certain point you start bumping up against memory throughput limitations as well. Doubling up the memory footprint by having the device have to write to new pages instead of being able to do something like pinning and zero-copy would be expensive. For something like an NFV use case it might make sense, but in the general high server count case it seems like it is a setup that would be detrimental. > Someone might feel differently if they did this on a 10/40G NIC, in > which case this is not the solution for their application. My past experience was with 10/40G NIC with tens of VFs. When we start talking about hundreds I would imagine the overhead becomes orders of magnitudes worse as the problem becomes more of an n^2 issue since you will have n times more systems sending to n times more systems receiving. As such things like broadcast traffic would end up consuming a fair bit of traffic. > > Sorry you used the word "replace", and my assumption here was that the > > goal is to get something in place that can take the place of SR-IOV so > > that you wouldn't be maintaining the two systems at the same time. > > That is my concern as I don't want us having SR-IOV, and then several > > flavors of SIOV. We need to decide on one thing that will be the way > > forward. > > SRIOV has to continue until the PASID and IMS platform features are > widely available and mature. It will probably be 10 years before we > see most people able to use SIOV for everything they want. > > I think we will see lots of SIOV varients, I know Intel is already > pushing SIOV parts outside netdev. The key bit here is outside of netdev. Like I said, SIOV and SR-IOV tend to be PCIe specific specifications. What we are defining here is how the network interfaces presented by such devices will work. > > I get that. That is why I said switchdev isn't a standard for the > > endpoint. One of the biggest issues with SR-IOV that I have seen is > > the fact that the last piece isn't really defined. We never did a good > > job of defining how the ADI should look to the guest and as a result > > it kind of stalled in adoption. 
> > The ADI is supposed to present the HW programming API that is > desired. It is always up to the implementation. > > SIOV was never a project to standardize HW programming models like > virtio-net, NVMe, etc. I agree. Just like SR-IOV never spelled out that network devices should be using switchdev. That is something we decided on as a community. What I am explaining here is that we should be thinking about the implications of how the network interface is exposed in the host in the case of subfunctions that are associated with a given switchdev device. > > > I'm a bit surprised HW that can do macvlan can't be modeled with > > > switchdev? What is missing? > > > > If I recall it was the fact that the hardware defaults to transmitting > > everything that doesn't match an existing rule to the external port > > unless it comes from the external port. > > That seems small enough it should be resolvable, IMHO. eg some new > switch rule that matches that specific HW behavior? I would have to go digging to find the conversation. It was about 3 or 4 years ago. I seem to recall mentioning the idea of having some static rules but it was a no-go at the time. If we wanted to spin off this conversation and pull in some Intel folks I would be up for us revisiting it. However I'm not with Intel anymore so it would mostly be something I would be working on as a hobby project instead of anything serious. > > Something like the vdpa model is more like what I had in mind. Only > > vdpa only works for the userspace networking case. > > That's because making a driver that converts the native HW to VDPA and > then running a generic netdev on the resulting virtio-net is a pretty > wild thing to do. I can't really think of an actual use case. I'm not talking about us drastically changing existing models. I would still expect the mlx5 driver to be running on top of the aux device. However it may be that the aux device is associated with something like the switchdev port as a parent and the output from the traffic is then going to the subfunction netdev. > > Basically the idea is to have an assignable device interface that > > isn't directly tied to the hardware. > > The switchdev model is to create a switch port. As I explained in > Linux we see "pci device" and "aux device" as being some "user port" > options to access to that switch. > > If you want a "generic device" that is fine, but what exactly is that > programming interface in Linux? Sketch out an API, where does the idea > go? What does the driver that implement it look like? What consumes > it? > > Should this be a great idea, then a mlx5 version of this will still be > to create an SF aux device, bind mlx5_core, then bind "generic device" > on top of that. This is simply a reflection of how the mlx5 HW/SW > layering works. Squashing all of this into a single layer is work with > no bad ROI. In my mind the mlx5_core is still binding to the SF aux device so I am fine with that. The way I view this working is that it would work sort of like the macvlan approach that was taken with the Intel parts. Although instead of binding to the PF it might make more sense to have it bound to the switchdev port associated with the subfunction and looking like a veth pair from the host perspective in terms of behavior. The basic idea is most of the control is still in the switchdev port, but it becomes more visible that the switchdev port and the subfunction netdev are linked with the switchdev port as the parent. 
> > they are pushed into containers you don't have to rip them out if for > > some reason you need to change the network configuration. > > Why would you need to rip them out? Because they are specifically tied to the mlx5 device. So if for example I need to hotplug out the mlx5 and replace it, it would be useful to allow the interface in the containers to stay in place and fail over to some other software backing interface in the switchdev namespace. One of the issues with VFs is that we have always had to push some sort of bond on top, or switch over to a virtio-net interface in order to support fail-over. If we can resolve that in the host case that would go a long way toward solving one of the main issues of SR-IOV.
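For reference, the fail-over pattern being described (a bond or a virtio-net standby stacked on top of a VF) already has a generic helper in the kernel: the net_failover module used by virtio_net's standby feature creates a stable upper netdev and automatically enslaves or releases a primary VF with a matching MAC address as it comes and goes. A hedged sketch of how a paravirtual driver enrolls in it (the example_* wrappers are hypothetical):

#include <net/net_failover.h>

/* The standby netdev (e.g. a virtio-net instance) registers itself; the
 * upper failover netdev it returns is what the user keeps using, while a
 * primary VF/SF with the same MAC is plugged in underneath when present. */
static struct failover *example_enable_failover(struct net_device *standby)
{
	return net_failover_create(standby);
}

static void example_disable_failover(struct failover *fo)
{
	net_failover_destroy(fo);
}

Whether the same pattern can be extended to subfunction netdevs used directly in the host is exactly the open question in this part of the thread.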
On Wed, Dec 16, 2020 at 11:27:32AM -0800, Alexander Duyck wrote: > That has been the case for a long time. However it had been my > experience that SR-IOV never scaled well to meet those needs and so it > hadn't been used in such deployments. Seems to be going quite well here, perhaps the applications are different. > > The optimization here is to reduce the hypervisor workload and free up > > CPU cycles for the VMs/containers to consume. This means less handling > > of packets in the CPU, especially for VM cases w/ SRIOV or VDPA. > > > > Even the extra DMA on the NIC is not really a big deal. These are 400G > > NICs with big fast PCI. If you top them out you are already doing an > > aggregate of 400G of network traffic. That is a big number for a > > single sever, it is OK. > > Yes, but at a certain point you start bumping up against memory > throughput limitations as well. Doubling up the memory footprint by > having the device have to write to new pages instead of being able to > do something like pinning and zero-copy would be expensive. You can't zero-copy when using VMs. And when using containers every skb still has to go through all the switching and encapsulation logic, which is not free in SW. At a certain point the gains of avoiding the DMA copy are lost by the costs of all the extra CPU work. The factor being optimized here is CPU capacity. > > Someone might feel differently if they did this on a 10/40G NIC, in > > which case this is not the solution for their application. > > My past experience was with 10/40G NIC with tens of VFs. When we start > talking about hundreds I would imagine the overhead becomes orders of > magnitudes worse as the problem becomes more of an n^2 issue since you > will have n times more systems sending to n times more systems The traffic demand is application dependent. If an application has an n^2 traffic pattern then it needs a network to sustain that cross sectional bandwidth regardless of how the VMs are packed. It just becomes a design factor of the network and now the network includes that switching component on the PCIe NIC as part of the capacity for cross sectional BW. There is some balance where a VM can only generate so much traffic based on the CPU it has available, and you can design the entire infrastructure to balance the CPU with the NIC with the switches and come to some packing factor of VMs. As CPU constrains VM performance, removing CPU overheads from the system will improve packing density. A HW network data path in the VMs is one such case that can turn to a net win if the CPU bottleneck is bigger than the network bottleneck. It is really over simplifiying to just say PCIe DMA copies are bad. > receiving. As such things like broadcast traffic would end up > consuming a fair bit of traffic. I think you have a lot bigger network problems if your broadcast traffic is so high that you start to worry about DMA copy performance in a 400G NIC. > The key bit here is outside of netdev. Like I said, SIOV and SR-IOV > tend to be PCIe specific specifications. What we are defining here is > how the network interfaces presented by such devices will work. I think we've achieved this.. > > That seems small enough it should be resolvable, IMHO. eg some new > > switch rule that matches that specific HW behavior? > > I would have to go digging to find the conversation. It was about 3 or > 4 years ago. I seem to recall mentioning the idea of having some > static rules but it was a no-go at the time. 
If we wanted to spin off > this conversation and pull in some Intel folks I would be up for us > revisiting it. However I'm not with Intel anymore so it would mostly > be something I would be working on as a hobby project instead of > anything serious. Personally I welcome getting more drivers to implement the switchdev model, I think it is only good for the netdev community as as a whole to understand and standardize on this. > > > Something like the vdpa model is more like what I had in mind. Only > > > vdpa only works for the userspace networking case. > > > > That's because making a driver that converts the native HW to VDPA and > > then running a generic netdev on the resulting virtio-net is a pretty > > wild thing to do. I can't really think of an actual use case. > > I'm not talking about us drastically changing existing models. I would > still expect the mlx5 driver to be running on top of the aux device. > However it may be that the aux device is associated with something > like the switchdev port as a parent ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ That is exactly how this works. The switchdev representor and the aux device are paired and form the analog of the veth tunnel. IIRC this relationship with the aux device is shown in the devlink output for the switchdev ports. I still can't understand what you think should be changed here. We can't get rid of the aux device, it is integral to the software layering and essential to support any device assignment flow. We can't add a mandatory netdev, because that is just pure waste for any device assignment flow. > The basic idea is most of the control is still in the switchdev > port, but it becomes more visible that the switchdev port and the > subfunction netdev are linked with the switchdev port as the parent. If a user netdev is created, say for a container, then it should have as a parent the aux device. If you inspect the port configuration on the switchdev side it will show the aux device associated with the port. Are you just asking that the userspace tools give a little help and show that switchdev port XX is visable as netdev YYY by cross matching the auxbus? > > > they are pushed into containers you don't have to rip them out if for > > > some reason you need to change the network configuration. > > > > Why would you need to rip them out? > > Because they are specifically tied to the mlx5 device. So if for > example I need to hotplug out the mlx5 and replace it, Uhh, I think we are very very far away from being able to hot unplug a switchdev driver, keep the switchdev running, and drop in a different driver. That isn't even on the radar, AFAIK. > namespace. One of the issues with VFs is that we have always had to > push some sort of bond on top, or switch over to a virtio-net > interface in order to support fail-over. If we can resolve that in the > host case that would go a long way toward solving one of the main > issues of SR-IOV. This is all solved already, virtio-net is the answer. qemu can swap back ends under the virtio-net ADI it created on the fly. This means it can go from processing a virtio-net queue in mlx5 HW, to full SW, to some other HW on another machine. All hitlessly and transparently to the guest VM. Direct HW processing of a queue inside a VM without any downsides for VM migration. Check. I have the feeling this stuff you are asking for is already done.. Jason
On Wed, Dec 16, 2020 at 12:35 PM Jason Gunthorpe <jgg@nvidia.com> wrote: > > On Wed, Dec 16, 2020 at 11:27:32AM -0800, Alexander Duyck wrote: > > > That has been the case for a long time. However it had been my > > experience that SR-IOV never scaled well to meet those needs and so it > > hadn't been used in such deployments. > > Seems to be going quite well here, perhaps the applications are > different. > > > > The optimization here is to reduce the hypervisor workload and free up > > > CPU cycles for the VMs/containers to consume. This means less handling > > > of packets in the CPU, especially for VM cases w/ SRIOV or VDPA. > > > > > > Even the extra DMA on the NIC is not really a big deal. These are 400G > > > NICs with big fast PCI. If you top them out you are already doing an > > > aggregate of 400G of network traffic. That is a big number for a > > > single sever, it is OK. > > > > Yes, but at a certain point you start bumping up against memory > > throughput limitations as well. Doubling up the memory footprint by > > having the device have to write to new pages instead of being able to > > do something like pinning and zero-copy would be expensive. > > You can't zero-copy when using VMs. > > And when using containers every skb still has to go through all the > switching and encapsulation logic, which is not free in SW. > > At a certain point the gains of avoiding the DMA copy are lost by the > costs of all the extra CPU work. The factor being optimized here is > CPU capacity. > > > > Someone might feel differently if they did this on a 10/40G NIC, in > > > which case this is not the solution for their application. > > > > My past experience was with 10/40G NIC with tens of VFs. When we start > > talking about hundreds I would imagine the overhead becomes orders of > > magnitudes worse as the problem becomes more of an n^2 issue since you > > will have n times more systems sending to n times more systems > > The traffic demand is application dependent. If an application has an > n^2 traffic pattern then it needs a network to sustain that cross > sectional bandwidth regardless of how the VMs are packed. > > It just becomes a design factor of the network and now the network > includes that switching component on the PCIe NIC as part of the > capacity for cross sectional BW. > > There is some balance where a VM can only generate so much traffic > based on the CPU it has available, and you can design the entire > infrastructure to balance the CPU with the NIC with the switches and > come to some packing factor of VMs. > > As CPU constrains VM performance, removing CPU overheads from the > system will improve packing density. A HW network data path in the VMs > is one such case that can turn to a net win if the CPU bottleneck is > bigger than the network bottleneck. > > It is really over simplifiying to just say PCIe DMA copies are bad. I'm not saying the copies are bad. However they can be limiting. As you said it all depends on the use case. If you have nothing but functions that are performing bump-in-the-wire type operations odds are the PCIe bandwidth won't be the problem. It all depends on the use case and that is why I would prefer the interface to be more flexible rather than just repeating what has been done with SR-IOV. The problem in my case was based on a past experience where east-west traffic became a problem and it was easily shown that bypassing the NIC for traffic was significantly faster. > > receiving. 
As such things like broadcast traffic would end up > > consuming a fair bit of traffic. > > I think you have a lot bigger network problems if your broadcast > traffic is so high that you start to worry about DMA copy performance > in a 400G NIC. Usually the problems were more multicast rather than broadcast, but yeah this typically isn't an issue. However still at 256 VFs you would be talking about a replication rate such that at 1-2Gbps one VF could saturate the entire 400G device. > > The key bit here is outside of netdev. Like I said, SIOV and SR-IOV > > tend to be PCIe specific specifications. What we are defining here is > > how the network interfaces presented by such devices will work. > > I think we've achieved this.. Somewhat. We have it explained for the control plane. What we are defining now is how it will appear in the guest/container. > > > That seems small enough it should be resolvable, IMHO. eg some new > > > switch rule that matches that specific HW behavior? > > > > I would have to go digging to find the conversation. It was about 3 or > > 4 years ago. I seem to recall mentioning the idea of having some > > static rules but it was a no-go at the time. If we wanted to spin off > > this conversation and pull in some Intel folks I would be up for us > > revisiting it. However I'm not with Intel anymore so it would mostly > > be something I would be working on as a hobby project instead of > > anything serious. > > Personally I welcome getting more drivers to implement the switchdev > model, I think it is only good for the netdev community as as a whole > to understand and standardize on this. Agreed. > > > > Something like the vdpa model is more like what I had in mind. Only > > > > vdpa only works for the userspace networking case. > > > > > > That's because making a driver that converts the native HW to VDPA and > > > then running a generic netdev on the resulting virtio-net is a pretty > > > wild thing to do. I can't really think of an actual use case. > > > > I'm not talking about us drastically changing existing models. I would > > still expect the mlx5 driver to be running on top of the aux device. > > However it may be that the aux device is associated with something > > like the switchdev port as a parent > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > That is exactly how this works. The switchdev representor and the aux > device are paired and form the analog of the veth tunnel. IIRC this > relationship with the aux device is shown in the devlink output for > the switchdev ports. > > I still can't understand what you think should be changed here. > > We can't get rid of the aux device, it is integral to the software > layering and essential to support any device assignment flow. > > We can't add a mandatory netdev, because that is just pure waste for > any device assignment flow. I'm not saying to get rid of the netdev. I'm saying the netdev created should be a generic interface that could be reused by other vendors. The whole idea behind using something like macvlan is to hide the underlying device. It becomes the namespace assignable interface. What I would like to see is us get rid of that step and instead just have a generic interface spawned in the first place and push the driver specific bits back to the switchdev port. The problem right now is that the switchdev netdev and the subfunction netdev are treated as peers. In my mind it should work somewhere in between the macvlan and veth. 
Basically you have the switchdev port handling both ends of the traffic and handling the aux device directly, and the subfunction device is floating on top of it, sometimes acting like a macvlan when traffic is going to/from the aux device, and acting like a veth pair in some cases, perhaps such as broadcast/multicast, allowing the switchdev to take care of that locally. > > The basic idea is most of the control is still in the switchdev > > port, but it becomes more visible that the switchdev port and the > > subfunction netdev are linked with the switchdev port as the parent. > > If a user netdev is created, say for a container, then it should have > as a parent the aux device. > > If you inspect the port configuration on the switchdev side it will > show the aux device associated with the port. > > Are you just asking that the userspace tools give a little help and > show that switchdev port XX is visable as netdev YYY by cross matching > the auxbus? It isn't about the association, it is about who is handling the traffic. Going back to the macvlan model what we did is we had a group of rings on the device that would automatically forward unicast packets to the macvlan interface and would be reserved for transmitting packets from the macvlan interface. We took care of multicast and broadcast replication in software. In my mind it should be possible to do something similar for the switchdev case. Basically the packets would be routed from the subfunction netdev, to the switchdev netdev, and could then be transmitted through the aux device. In order to make this work in the case of ixgbe I had come up with the concept of a subordinate device, "sb_dev", which was a pointer used in the Tx case to identify Tx rings and qdiscs that had been given to the macvlan interface. > > > > they are pushed into containers you don't have to rip them out if for > > > > some reason you need to change the network configuration. > > > > > > Why would you need to rip them out? > > > > Because they are specifically tied to the mlx5 device. So if for > > example I need to hotplug out the mlx5 and replace it, > > Uhh, I think we are very very far away from being able to hot unplug a > switchdev driver, keep the switchdev running, and drop in a different > driver. > > That isn't even on the radar, AFAIK. That might be a bad example, I was thinking of the issues we have had with VFs and direct assignment to Qemu-based guests in the past. Essentially what I am getting at is that the setup in the container should be vendor agnostic. The interface exposed shouldn't be specific to any one vendor. So if I want to fire up a container on Mellanox, Broadcom, or some other vendor it shouldn't matter or be visible to the user. They should just see a vendor agnostic subfunction netdevice. Something like that is doable already using something like a macvlan on top of a subfunction interface, but I feel like that is an unnecessary step and creates unnecessary netdevices. > > namespace. One of the issues with VFs is that we have always had to > > push some sort of bond on top, or switch over to a virtio-net > > interface in order to support fail-over. If we can resolve that in the > > host case that would go a long way toward solving one of the main > > issues of SR-IOV. > > This is all solved already, virtio-net is the answer. > > qemu can swap back ends under the virtio-net ADI it created on the > fly. This means it can go from processing a virtio-net queue in mlx5 > HW, to full SW, to some other HW on another machine.
All hitlessly and > transparently to the guest VM. > > Direct HW processing of a queue inside a VM without any downsides for > VM migration. Check. > > I have the feeling this stuff you are asking for is already done.. The case you are describing has essentially solved it for Qemu virtualization and direct assignment. It still doesn't necessarily solve it for the container case though.
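For illustration only, a rough sketch of the subordinate device ("sb_dev") idea described above; this is not code from ixgbe or from this series, all toy_* names are hypothetical, and it assumes the ndo_select_queue() form that is handed the subordinate net_device. The lower driver reserves a block of Tx queues for the upper, macvlan-like device and steers that device's transmits (and their qdiscs) onto the reserved block:

```c
/*
 * Hypothetical sketch of steering transmits from a subordinate device
 * onto a reserved queue block, in the spirit of the sb_dev concept
 * discussed above. Not taken from any real driver.
 */
#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct toy_priv {
	struct net_device *sb_dev;	/* upper device that owns a queue block */
	u16 sb_base_txq;		/* first Tx queue reserved for it */
	u16 sb_num_txq;			/* number of reserved Tx queues */
};

static u16 toy_select_queue(struct net_device *dev, struct sk_buff *skb,
			    struct net_device *sb_dev)
{
	struct toy_priv *priv = netdev_priv(dev);
	u16 normal_txqs = dev->real_num_tx_queues - priv->sb_num_txq;

	if (sb_dev && sb_dev == priv->sb_dev) {
		/* Traffic from the subordinate device stays on the Tx
		 * rings (and qdiscs) that were handed to it.
		 */
		return priv->sb_base_txq +
		       (skb_get_hash(skb) % priv->sb_num_txq);
	}

	/* Everything else hashes over the remaining, regular queues. */
	return skb_get_hash(skb) % normal_txqs;
}
```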
On Wed, Dec 16, 2020 at 02:53:07PM -0800, Alexander Duyck wrote: > It isn't about the association, it is about who is handling the > traffic. Going back to the macvlan model what we did is we had a group > of rings on the device that would automatically forward unicast > packets to the macvlan interface and would be reserved for > transmitting packets from the macvlan interface. We took care of > multicast and broadcast replication in software. Okay, maybe I'm starting to see where you are coming from. First, I think some clarity is needed here: as I see it the devlink infrastructure is all about creating the auxdevice for a switchdev port. What goes into that auxdevice is *completely* up to the driver. mlx5 is doing a SF which == VF, but that is not a requirement of the design at all. If an Intel driver wants to put a queue block into the aux device and that is != VF, it is just fine. The Intel netdev that binds to the auxdevice can transform the queue block and specific switchdev config into a netdev identical to accelerated macvlan. Nothing about that breaks the switchdev model. Essentially think of it as generalizing the acceleration plugin for a netdev. Instead of making something specific to limited macvlan, the driver gets to provide exactly the structure that matches its HW to provide the netdev as the user side of the switchdev port. I see no limitation here so long as the switchdev model for controlling traffic is followed. Let me segue into a short story from RDMA.. We've had a netdev called IPoIB for a long time. It is actually kind of similar to this general thing you are talking about, in that there is a programming layer under the IPoIB netdev called RDMA verbs that generalizes the actual HW. Over the years this became more complicated because every new netdev offload needed mirroring into the RDMA verbs general API. TSO, GSO, checksum offload, endlessly onwards. It became quite dumb in the end. We gave up and said the HW driver should directly implement netdev. Implementing a middle API layer makes zero sense when netdev is already perfectly suited to implement on top of HW. Removing SW layers caused performance to go up something like 2x. The hard-earned lesson I take from that is don't put software layers between a struct net_device and the actual HW. The closest coupling is really the best thing. Provide library code in the kernel to help drivers implement common patterns when making their netdevs, do not provide wrapper netdevs around drivers. IMHO the approach of macvlan acceleration made some sense in 2013, but today I would say it is mashing unrelated layers together and polluting what should be a pure SW implementation with HW hooks. I see from the mailing list comments this was done because creating a device-specific netdev via 'ip link add' was rightly rejected. However here we *can* create a device-specific vmdq *auxdevice*. This is OK because the netdev is controlling and containing the aux device via switchdev. So, Intel can get the "VMDQ link type" that was originally desired more or less directly, so long as the associated switchdev port controls the MAC filter process, not "ip link add". And if you want to make the vmdq auxdevice into an ADI by user DMA to queues, then sure, that model is completely sane too (vs hacking up macvlan to expose user queues) - so long as the kernel controls the selection of traffic into those queues and follows the switchdev model.
I would recommend creating a simple RDMA raw ethernet queue driver over the aux device for something like this :) > That might be a bad example, I was thinking of the issues we have had > with VFs and direct assignment to Qemu based guests in the past. As described, this is solved by VDPA. > Essentially what I am getting at is that the setup in the container > should be vendor agnostic. The interface exposed shouldn't be specific > to any one vendor. So if I want to fire up a container or Mellanox, > Broadcom, or some other vendor it shouldn't matter or be visible to > the user. They should just see a vendor agnostic subfunction > netdevice. Agree. The agnostic container user interface here is 'struct net_device'. > > I have the feeling this stuff you are asking for is already done.. > > The case you are describing has essentially solved it for Qemu > virtualization and direct assignment. It still doesn't necessarily > solve it for the container case though. The container case doesn't need solving. Any scheme I've heard of for container live migration, like CRIU, essentially hot-plugs the entire kernel in/out of a user process. We rely on the kernel providing low leakage of the implementation details of the struct net_device as part of its uAPI contract. When CRIU swaps the kernel the new kernel can have any implementation of the container netdev it wants. I've never heard of a use case to hot-swap the implementation *under* a netdev from a container. macvlan can't do this today. If you have a use case here, it really has nothing to do with this series. Jason
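For illustration, a skeleton of the model described above, where a vendor netdev driver binds to the auxiliary device that devlink/switchdev created for a port. The "vendor.sf" match string and all vendor_* names are hypothetical placeholders, not taken from the mlx5 series; a real driver would wire real netdev_ops to whatever queue block the parent PF driver carved out for this aux device.

```c
/* Hypothetical aux-bus driver binding a netdev to a switchdev-created device. */
#include <linux/module.h>
#include <linux/auxiliary_bus.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>

/* Placeholder ops; a real driver fills in open/stop/start_xmit/etc. */
static const struct net_device_ops vendor_sf_netdev_ops = { };

static int vendor_sf_probe(struct auxiliary_device *adev,
			   const struct auxiliary_device_id *id)
{
	struct net_device *netdev;
	int err;

	netdev = alloc_etherdev(0);
	if (!netdev)
		return -ENOMEM;

	SET_NETDEV_DEV(netdev, &adev->dev);
	netdev->netdev_ops = &vendor_sf_netdev_ops;
	eth_hw_addr_random(netdev);
	/* ... attach the queue block owned by this aux device here ... */

	err = register_netdev(netdev);
	if (err) {
		free_netdev(netdev);
		return err;
	}
	dev_set_drvdata(&adev->dev, netdev);
	return 0;
}

static void vendor_sf_remove(struct auxiliary_device *adev)
{
	struct net_device *netdev = dev_get_drvdata(&adev->dev);

	unregister_netdev(netdev);
	free_netdev(netdev);
}

static const struct auxiliary_device_id vendor_sf_id_table[] = {
	{ .name = "vendor.sf" },	/* "<module>.<device>" match */
	{}
};
MODULE_DEVICE_TABLE(auxiliary, vendor_sf_id_table);

static struct auxiliary_driver vendor_sf_driver = {
	.probe = vendor_sf_probe,
	.remove = vendor_sf_remove,
	.id_table = vendor_sf_id_table,
};
module_auxiliary_driver(vendor_sf_driver);

MODULE_LICENSE("GPL");
```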
On Wed, Dec 16, 2020 at 4:38 PM Jason Gunthorpe <jgg@nvidia.com> wrote: > > On Wed, Dec 16, 2020 at 02:53:07PM -0800, Alexander Duyck wrote: > > > It isn't about the association, it is about who is handling the > > traffic. Going back to the macvlan model what we did is we had a group > > of rings on the device that would automatically forward unicast > > packets to the macvlan interface and would be reserved for > > transmitting packets from the macvlan interface. We took care of > > multicast and broadcast replication in software. > > Okay, maybe I'm starting to see where you are coming from. > > First, I think some clarity here, as I see it the devlink > infrastructure is all about creating the auxdevice for a switchdev > port. > > What goes into that auxdevice is *completely* up to the driver. mlx5 > is doing a SF which == VF, but that is not a requirement of the design > at all. > > If an Intel driver wants to put a queue block into the aux device and > that is != VF, it is just fine. > > The Intel netdev that binds to the auxdevice can transform the queue > block and specific switchdev config into a netdev identical to > accelerated macvlan. Nothing about the breaks the switchdev model. Just to clarify I am not with Intel, nor do I plan to work on any Intel drivers related to this. My concern has more to do with how this is being plumbed and the fact that the basic architecture is somewhat limiting. > Essentially think of it as generalizing the acceleration plugin for a > netdev. Instead of making something specific to limited macvlan, the > driver gets to provide exactly the structure that matches its HW to > provide the netdev as the user side of the switchdev port. I see no > limitation here so long as the switchdev model for controlling traffic > is followed. I see plenty. The problem is it just sets up more vendor lock-in and features that have to be thrown away when you have to settle for least-common denominator in order to maintain functionality across vendors. > Let me segue into a short story from RDMA.. We've had a netdev called > IPoIB for a long time. It is actually kind of similar to this general > thing you are talking about, in that there is a programming layer > under the IPOIB netdev called RDMA verbs that generalizes the actual > HW. Over the years this became more complicated because every new > netdev offloaded needed mirroring into the RDMA verbs general > API. TSO, GSO, checksum offload, endlessly onwards. It became quite > dumb in the end. We gave up and said the HW driver should directly > implement netdev. Implementing a middle API layer makes zero sense > when netdev is already perfectly suited to implement ontop of > HW. Removing SW layers caused performance to go up something like > 2x. > > The hard earned lesson I take from that is don't put software layers > between a struct net_device and the actual HW. The closest coupling is > really the best thing. Provide libary code in the kernel to help > drivers implement common patterns when making their netdevs, do not > provide wrapper netdevs around drivers. > > IMHO the approach of macvlan accleration made some sense in 2013, but > today I would say it is mashing unrelated layers together and > polluting what should be a pure SW implementation with HW hooks. I disagree here. In my mind a design where two interfaces, which both exist in the kernel, have to go to hardware in order to communicate is very limiting. 
The main thing I am wanting to see is the option of being able to pass traffic directly between the switchdev and the SF without the need to touch the hardware. An easy example of such traffic that would likely benefit from this is multicast/broadcast traffic. Instead of having to process each and every broadcast packet in hardware you could very easily process it at the switchdev and then directly hand it off from the switchdev to the SF in this case instead of having to send it to hardware for each switchdev instance. > I see from the mailing list comments this was done because creating a > device specific netdev via 'ip link add' was rightly rejected. However > here we *can* create a device specific vmdq *auxdevice*. This is OK > because the netdev is controlling and containing the aux device via > switchdev. > > So, Intel can get the "VMDQ link type" that was originally desired more > or less directly, so long as the associated switchdev port controls > the MAC filter process, not "ip link add". > > And if you want to make the vmdq auxdevice into an ADI by user DMA to > queues, then sure, that model is completely sane too (vs hacking up > macvlan to expose user queues) - so long as the kernel controls the > selection of traffic into those queues and follows the switchdev > model. I would recommend creating a simple RDMA raw ethernet queue > driver over the aux device for something like this :) You lost me here, I'm not seeing how RDMA and macvlan are connected. > > That might be a bad example, I was thinking of the issues we have had > > with VFs and direct assignment to Qemu based guests in the past. > > As described, this is solved by VDPA. > > > Essentially what I am getting at is that the setup in the container > > should be vendor agnostic. The interface exposed shouldn't be specific > > to any one vendor. So if I want to fire up a container or Mellanox, > > Broadcom, or some other vendor it shouldn't matter or be visible to > > the user. They should just see a vendor agnostic subfunction > > netdevice. > > Agree. The agnostic container user interface here is 'struct > net_device'. I disagree here. The fact is a Mellanox netdev, versus a Broadcom netdev, versus an Intel netdev all have a very different look and feel as the netdev is essentially just the base device you are building around. In addition it still doesn't address my concern as called out above which is the east-west traffic problem. > > > I have the feeling this stuff you are asking for is already done.. > > > > The case you are describing has essentially solved it for Qemu > > virtualization and direct assignment. It still doesn't necessarily > > solve it for the container case though. > > The container case doesn't need solving. I disagree and that is at the heart of where you and I have different views. I view there being two advantages to having the container case solved: 1. A standardized set of features that can be provided regardless of vendor 2. Allowing for the case where east-west traffic can avoid having to touch hardware > Any scheme I've heard for container live migration, like CRIU, > essentially hot plugs the entire kernel in/out of a user process. We > rely on the kernel providing low leakage of the implementation details > of the struct net_device as part of it's uAPI contract. When CRIU > swaps the kernel the new kernel can have any implementation of the > container netdev it wants. I'm not thinking about migration. I am thinking more about the user experience.
In my mind if I set up a container I shouldn't need to know which vendor provided the network interface when I set it up. The problem is most NICs have so many one-off proprietary tweaks needed that it gets annoying. That is why it would make much more sense to have a simple vendor agnostic interface, and that is why I would prefer to avoid the VF model. > I've never heard of a use case to hot swap the implemention *under* a > netdev from a container. macvlan can't do this today. If you have a > use case here, it really has nothing to do with with this series. Again, the hot-swap isn't necessarily what I am talking about. I am talking about setting up a config for a set of containers in a datacenter. What I don't want to do is have to have one set of configs for an mlx5 SF, another for a Broadcom SF, and yet another set for any other vendors out there. I would much rather have all of that dealt with within the namespace that is handling the switchdev setup. In addition, the east-west traffic is the other bit I would like to see addressed. I am okay excusing this in the case of direct assignment since the resources for the SF will not be available to the host. However if the SF will be operating in the same kernel as the PF/switchdev it would make much more sense to enable an east/west channel which would allow for hardware bypass under certain circumstances without having to ever leave the kernel.
On Thu, Dec 17, 2020 at 10:48:48AM -0800, Alexander Duyck wrote: > Just to clarify I am not with Intel, nor do I plan to work on any > Intel drivers related to this. Sure > I disagree here. In my mind a design where two interfaces, which both > exist in the kernel, have to go to hardware in order to communicate is > very limiting. The main thing I am wanting to see is the option of > being able to pass traffic directly between the switchdev and the SF > without the need to touch the hardware. I view the SW bypass path you are talking about similarly to GSO/etc. It should be accessed by the HW driver as an optional service provided by the core netdev, not implemented as some wrapper netdev around a HW implementation. If you feel strongly it is needed then there is nothing standing in the way of implementing it in the switchdev auxdevice model. It is simple enough, the HW driver's tx path would somehow detect east/west and queue it differently, and the rx path would somehow be able to mux in skbs from a SW queue. Not seeing any blockers here. > > model. I would recommend creating a simple RDMA raw ethernet queue > > driver over the aux device for something like this :) > > You lost me here, I'm not seeing how RDMA and macvlan are connected. RDMA is the standard uAPI to get a userspace HW DMA queue for ethernet packets. > > > Essentially what I am getting at is that the setup in the container > > > should be vendor agnostic. The interface exposed shouldn't be specific > > > to any one vendor. So if I want to fire up a container or Mellanox, > > > Broadcom, or some other vendor it shouldn't matter or be visible to > > > the user. They should just see a vendor agnostic subfunction > > > netdevice. > > > > Agree. The agnostic container user interface here is 'struct > > net_device'. > > I disagree here. The fact is a mellanox netdev, versus a broadcom > netdev, versus an intel netdev all have a very different look at feel > as the netdev is essentially just the base device you are building > around. Then fix the lack of standardization of netdev implementations! Adding more abstraction layers isn't going to fix that fundamental problem. Frankly it seems a bit absurd to complain that the very basic element of the common kernel uAPI - struct net_device - is so horribly fragmented and vendor-polluted that we can't rely on it as a stable interface for containers. Even if that is true, I don't believe for a second that adding a different HW abstraction layer is going to somehow undo the mistakes of the last 20 years. > Again, the hot-swap isn't necessarily what I am talking about. I am > talking about setting up a config for a set of containers in a > datacenter. What I don't want to do is have to have one set of configs > for an mlx5 SF, another for a broadcom SF, and yet another set for any > other vendors out there. I would much rather have all of that dealt > with within the namespace that is handling the switchdev setup. If there are real problems here then I very much encourage you to start an effort to push all the vendors to implement a consistent user experience for the HW netdevs. I don't know what your issues are, but it sounds like it would be a very interesting conference presentation. But it has nothing to do with this series. Jason
On Thu, Dec 17, 2020 at 11:40 AM Jason Gunthorpe <jgg@nvidia.com> wrote: > > On Thu, Dec 17, 2020 at 10:48:48AM -0800, Alexander Duyck wrote: > > > Just to clarify I am not with Intel, nor do I plan to work on any > > Intel drivers related to this. > > Sure > > > I disagree here. In my mind a design where two interfaces, which both > > exist in the kernel, have to go to hardware in order to communicate is > > very limiting. The main thing I am wanting to see is the option of > > being able to pass traffic directly between the switchdev and the SF > > without the need to touch the hardware. > > I view the SW bypass path you are talking about similarly to > GSO/etc. It should be accessed by the HW driver as an optional service > provided by the core netdev, not implemented as some wrapper netdev > around a HW implementation. I view it as being something that would be a part of the switchdev API itself. Basically the switchdev and endpoint would need to be able to control something like this because if XDP were enabled on one end or the other you would need to be able to switch it off so that all of the packets followed the same flow and could be scanned by the XDP program. > If you feel strongly it is needed then there is nothing standing in > the way to implement it in the switchdev auxdevice model. > > It is simple enough, the HW driver's tx path would somehow detect > east/west and queue it differently, and the rx path would somehow be > able to mux in skbs from a SW queue. Not seeing any blockers here. In my mind the simple proof of concept for this would be to check for the multicast bit being set in the destination MAC address for packets coming from the subfunction. If it is then shunt to this bypass route, and if not then you transmit to the hardware queues. In the case of packets coming from the switchdev port it would probably depend. The part I am not sure about is if any packets need to be actually transmitted to the hardware in the standard case for packets going from the switchdev port to the subfunction. If there is no XDP or anything like that present in the subfunction it probably wouldn't matter and you could just shunt it straight across and bypass the hardware, however if XDP is present you would need to get the packets into the ring which would force the bypass to be turned off. > > > model. I would recommend creating a simple RDMA raw ethernet queue > > > driver over the aux device for something like this :) > > > > You lost me here, I'm not seeing how RDMA and macvlan are connected. > > RDMA is the standard uAPI to get a userspace HW DMA queue for ethernet > packets. Ah, I think you are talking about device assignment. In my mind I was just talking about the interface assigned to the container which, as you have stated, is basically just a netdev. > > > > Essentially what I am getting at is that the setup in the container > > > > should be vendor agnostic. The interface exposed shouldn't be specific > > > > to any one vendor. So if I want to fire up a container or Mellanox, > > > > Broadcom, or some other vendor it shouldn't matter or be visible to > > > > the user. They should just see a vendor agnostic subfunction > > > > netdevice. > > > > > > Agree. The agnostic container user interface here is 'struct > > > net_device'. > > > > I disagree here. The fact is a mellanox netdev, versus a broadcom > > netdev, versus an intel netdev all have a very different look at feel > > as the netdev is essentially just the base device you are building > > around.
> > Then fix the lack of standardization of netdev implementations! We're trying to work on that, but trying to fix it after the fact is like herding cats. > Adding more abstraction layers isn't going to fix that fundamental > problem. > > Frankly it seems a bit absurd to complain that the very basic element > of the common kernel uAPI - struct net_device - is so horribly > fragmented and vendor polluted that we can't rely on it as a stable > interface for containers. The problem isn't necessarily the net_device, it is more the net_device_ops and the fact that there are so many different ways to get things done. Arguably the flexibility of the net_device is great for allowing vendors to expose their features. However at the same time it allows for features to be left out, so what you end up with is a wide variety of things that are net_devices. > Even if that is true, I don't belive for a second that adding a > different HW abstraction layer is going to somehow undo the mistakes > of the last 20 years. It depends on how it is done. The general idea is to address the biggest limitation that has occurred, which is the fact that in many cases we don't have software offloads to take care of things when the hardware offloads provided by a certain piece of hardware are not present. It would basically allow us to reset the feature set. If something cannot be offloaded in software in a reasonable way, it is not allowed to be present in the interface provided to a container. That way instead of having to do all the custom configuration in the container recipe it can be centralized to one container handling all of the switching and hardware configuration. > > Again, the hot-swap isn't necessarily what I am talking about. I am > > talking about setting up a config for a set of containers in a > > datacenter. What I don't want to do is have to have one set of configs > > for an mlx5 SF, another for a broadcom SF, and yet another set for any > > other vendors out there. I would much rather have all of that dealt > > with within the namespace that is handling the switchdev setup. > > If there is real problems here then I very much encourage you to start > an effort to push all the vendors to implement a consistent user > experience for the HW netdevs. To some extent that has been going on for some time. It is one of the reasons why there are supposed to be software offloads for any datapath features that get added to the hardware, such as GSO to offset TSO. However there is always the occasional thing that ends up getting past, and that is where the frustration comes in. > I don't know what your issues are, but it sounds like it would be a > very interesting conference presentation. > > But it has nothing to do with this series. > > Jason There I disagree. Now I can agree that most of the series is about presenting the aux device and that part I am fine with. However when the aux device is a netdev and that netdev is being loaded into the same kernel as the switchdev port, that is where the red flags start flying, especially when we start talking about how it is the same as a VF. In my mind we are talking about how the switchdev will behave and it makes sense to see about defining if an east-west bypass makes sense and how it could be implemented, rather than saying we won't bother for now and potentially locking in the subfunction to virtual function equality.
In my mind we need more than just the increased count to justify going to subfunctions, and I think being able to solve the east-west problem at least in terms of containers would be such a thing.
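As a toy sketch of the proof of concept described above (the multicast-bit shunt from the subfunction toward the switchdev representor): all of the sf_* names are hypothetical, sf_hw_xmit() stands in for the driver's normal hardware Tx path, and dev_forward_skb() is just one plausible way to inject the frame into the representor's receive path; a real implementation would also have to force the bypass off when XDP is attached, as discussed above.

```c
/* Hypothetical subfunction Tx hook that shunts multicast/broadcast to SW. */
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/if_ether.h>

struct sf_priv {
	struct net_device *repr;	/* paired switchdev representor */
	bool sw_bypass_enabled;		/* e.g. cleared while XDP is attached */
};

/* Hypothetical driver-specific hardware Tx path, defined elsewhere. */
netdev_tx_t sf_hw_xmit(struct sf_priv *priv, struct sk_buff *skb);

static netdev_tx_t sf_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct sf_priv *priv = netdev_priv(dev);
	const struct ethhdr *eth = (const struct ethhdr *)skb->data;

	if (priv->sw_bypass_enabled && is_multicast_ether_addr(eth->h_dest)) {
		/* Let the switchdev side replicate/forward this in software,
		 * avoiding the per-copy PCIe round trip.
		 */
		dev_forward_skb(priv->repr, skb);
		return NETDEV_TX_OK;
	}

	/* Unicast (or bypass disabled): regular hardware queue path. */
	return sf_hw_xmit(priv, skb);
}
```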
On Thu, Dec 17, 2020 at 01:05:03PM -0800, Alexander Duyck wrote: > > I view the SW bypass path you are talking about similarly to > > GSO/etc. It should be accessed by the HW driver as an optional service > > provided by the core netdev, not implemented as some wrapper netdev > > around a HW implementation. > > I view it as being something that would be a part of the switchdev API > itself. Basically the switchev and endpoint would need to be able to > control something like this because if XDP were enabled on one end or > the other you would need to be able to switch it off so that all of > the packets followed the same flow and could be scanned by the XDP > program. To me that still all comes down to being something like an optional offload that the HW driver can trigger if the conditions are met. > > It is simple enough, the HW driver's tx path would somehow detect > > east/west and queue it differently, and the rx path would somehow be > > able to mux in skbs from a SW queue. Not seeing any blockers here. > > In my mind the simple proof of concept for this would be to check for > the multicast bit being set in the destination MAC address for packets > coming from the subfunction. If it is then shunt to this bypass route, > and if not then you transmit to the hardware queues. Sure, I'm not sure a multicast optimization like this isn't incredibly niche too, but it would be an interesting path to explore. But again, there is nothing fundamental about the model here that precludes this optional optimization. > > Even if that is true, I don't belive for a second that adding a > > different HW abstraction layer is going to somehow undo the mistakes > > of the last 20 years. > > It depends on how it is done. The general idea is to address the > biggest limitation that has occured, which is the fact that in many > cases we don't have software offloads to take care of things when the > hardware offloads provided by a certain piece of hardware are not > present. This is really disappointing to hear. Admittedly I don't follow all the twists and turns on the mailing list, but I thought having a SW version of everything was one of the fundamental tenets of netdev that truly distinguished it from something like RDMA. > It would basically allow us to reset the feature set. If something > cannot be offloaded in software in a reasonable way, it is not > allowed to be present in the interface provided to a container. > That way instead of having to do all the custom configuration in the > container recipe it can be centralized to one container handling all > of the switching and hardware configuration. Well, you could start by blocking stuff without a SW fallback.. > There I disagree. Now I can agree that most of the series is about > presenting the aux device and that part I am fine with. However when > the aux device is a netdev and that netdev is being loaded into the > same kernel as the switchdev port is where the red flags start flying, > especially when we start talking about how it is the same as a VF. Well, it happens for the same reason a VF can create a netdev; stopping it would actually be more patches. As I said before, people are already doing this model with VFs. I can agree with some of your points, but this is not the series to argue them. What you want is to start some new thread on optimizing switchdev for the container use case.
> In my mind we are talking about how the switchdev will behave and it > makes sense to see about defining if a east-west bypass makes sense > and how it could be implemented, rather than saying we won't bother > for now and potentially locking in the subfunction to virtual function > equality. At least for mlx5 SF == VF, that is a consequence of the HW. Any SW bypass would need to be specially built in the mlx5 netdev running on a VF/SF attached to a switchdev port. I don't see anything about this part of the model that precludes ever doing that, and I also don't see this optimization as being valuable enough to block things "just to be sure" > In my mind we need more than just the increased count to justify > going to subfunctions, and I think being able to solve the east-west > problem at least in terms of containers would be such a thing. Increased count is pretty important for users with SRIOV. Jason
On 12/16/20 3:53 PM, Alexander Duyck wrote: > The problem in my case was based on a past experience where east-west > traffic became a problem and it was easily shown that bypassing the > NIC for traffic was significantly faster. If a deployment expects a lot of east-west traffic *within a host* why is it using hardware based isolation like a VF. That is a side effect of a design choice that is remedied by other options.
On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@gmail.com> wrote: > > On 12/16/20 3:53 PM, Alexander Duyck wrote: > > The problem in my case was based on a past experience where east-west > > traffic became a problem and it was easily shown that bypassing the > > NIC for traffic was significantly faster. > > If a deployment expects a lot of east-west traffic *within a host* why > is it using hardware based isolation like a VF. That is a side effect of > a design choice that is remedied by other options. I am mostly talking about this from past experience as I had seen a few instances when I was at Intel when it became an issue. Sales and marketing people aren't exactly happy when you tell them "don't sell that" in response to them trying to sell a feature into an area where it doesn't belong. Generally they want a solution. The macvlan offload addressed these issues as the replication and local switching can be handled in software. The problem is PCIe DMA wasn't designed to function as a network switch fabric and when we start talking about a 400Gb NIC trying to handle over 256 subfunctions it will quickly reduce the receive/transmit throughput to gigabit or less speeds when encountering hardware multicast/broadcast replication. With 256 subfunctions a simple 60B ARP could consume more than 19KB of PCIe bandwidth due to the packet having to be duplicated so many times. In my mind it should be simpler to simply clone a single skb 256 times, forward that to the switchdev ports, and have them perform a bypass (if available) to deliver it to the subfunctions. That's why I was thinking it might be a good time to look at addressing it.
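For reference, one way to arrive at a figure in that ballpark (the per-copy overhead here is an assumption for illustration, not something stated in the thread) is to charge each of the 256 replicated copies the 60-byte frame plus roughly 16 bytes of descriptor/TLP overhead:

256 x (60 B + 16 B) = 19,456 B, which is a bit over 19 KB of PCIe traffic for a single broadcast ARP.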
On 12/17/20 8:11 PM, Alexander Duyck wrote: > On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@gmail.com> wrote: >> >> On 12/16/20 3:53 PM, Alexander Duyck wrote: >>> The problem in my case was based on a past experience where east-west >>> traffic became a problem and it was easily shown that bypassing the >>> NIC for traffic was significantly faster. >> >> If a deployment expects a lot of east-west traffic *within a host* why >> is it using hardware based isolation like a VF. That is a side effect of >> a design choice that is remedied by other options. > > I am mostly talking about this from past experience as I had seen a > few instances when I was at Intel when it became an issue. Sales and > marketing people aren't exactly happy when you tell them "don't sell > that" in response to them trying to sell a feature into an area where that's a problem engineers can never solve... > it doesn't belong. Generally they want a solution. The macvlan offload > addressed these issues as the replication and local switching can be > handled in software. well, I guess almost never. :-) > > The problem is PCIe DMA wasn't designed to function as a network > switch fabric and when we start talking about a 400Gb NIC trying to > handle over 256 subfunctions it will quickly reduce the > receive/transmit throughput to gigabit or less speeds when > encountering hardware multicast/broadcast replication. With 256 > subfunctions a simple 60B ARP could consume more than 19KB of PCIe > bandwidth due to the packet having to be duplicated so many times. In > my mind it should be simpler to simply clone a single skb 256 times, > forward that to the switchdev ports, and have them perform a bypass > (if available) to deliver it to the subfunctions. That's why I was > thinking it might be a good time to look at addressing it. > east-west traffic within a host is more than likely the same tenant in which case a proper VPC is a better solution than the s/w stack trying to detect and guess that a bypass is needed. Guesses cost cycles in the fast path which is a net loss - and even more so as speeds increase.
> From: Alexander Duyck <alexander.duyck@gmail.com> > Sent: Friday, December 18, 2020 8:41 AM > > On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@gmail.com> wrote: > > > > On 12/16/20 3:53 PM, Alexander Duyck wrote: > The problem is PCIe DMA wasn't designed to function as a network switch > fabric and when we start talking about a 400Gb NIC trying to handle over 256 > subfunctions it will quickly reduce the receive/transmit throughput to gigabit > or less speeds when encountering hardware multicast/broadcast replication. > With 256 subfunctions a simple 60B ARP could consume more than 19KB of > PCIe bandwidth due to the packet having to be duplicated so many times. In > my mind it should be simpler to simply clone a single skb 256 times, forward > that to the switchdev ports, and have them perform a bypass (if available) to > deliver it to the subfunctions. That's why I was thinking it might be a good > time to look at addressing it. Linux tc framework is rich to address this and already used by openvswich for years now. Today arp broadcasts are not offloaded. They go through software patch and replicated in the L2 domain. It is a solved problem for many years now.
> From: Parav Pandit <parav@nvidia.com> > Sent: Friday, December 18, 2020 10:51 AM > > > From: Alexander Duyck <alexander.duyck@gmail.com> > > Sent: Friday, December 18, 2020 8:41 AM > > > > On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@gmail.com> wrote: > > > > > > On 12/16/20 3:53 PM, Alexander Duyck wrote: > > The problem is PCIe DMA wasn't designed to function as a network > > switch fabric and when we start talking about a 400Gb NIC trying to > > handle over 256 subfunctions it will quickly reduce the > > receive/transmit throughput to gigabit or less speeds when encountering > hardware multicast/broadcast replication. > > With 256 subfunctions a simple 60B ARP could consume more than 19KB of > > PCIe bandwidth due to the packet having to be duplicated so many > > times. In my mind it should be simpler to simply clone a single skb > > 256 times, forward that to the switchdev ports, and have them perform > > a bypass (if available) to deliver it to the subfunctions. That's why > > I was thinking it might be a good time to look at addressing it. > Linux tc framework is rich to address this and already used by openvswich for > years now. > Today arp broadcasts are not offloaded. They go through software patch and s/patch/path > replicated in the L2 domain. > It is a solved problem for many years now.
On Thu, Dec 17, 2020 at 7:55 PM David Ahern <dsahern@gmail.com> wrote: > > On 12/17/20 8:11 PM, Alexander Duyck wrote: > > On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@gmail.com> wrote: > >> > >> On 12/16/20 3:53 PM, Alexander Duyck wrote: > >>> The problem in my case was based on a past experience where east-west > >>> traffic became a problem and it was easily shown that bypassing the > >>> NIC for traffic was significantly faster. > >> > >> If a deployment expects a lot of east-west traffic *within a host* why > >> is it using hardware based isolation like a VF. That is a side effect of > >> a design choice that is remedied by other options. > > > > I am mostly talking about this from past experience as I had seen a > > few instances when I was at Intel when it became an issue. Sales and > > marketing people aren't exactly happy when you tell them "don't sell > > that" in response to them trying to sell a feature into an area where > > that's a problem engineers can never solve... > > > it doesn't belong. Generally they want a solution. The macvlan offload > > addressed these issues as the replication and local switching can be > > handled in software. > > well, I guess almost never. :-) > > > > > The problem is PCIe DMA wasn't designed to function as a network > > switch fabric and when we start talking about a 400Gb NIC trying to > > handle over 256 subfunctions it will quickly reduce the > > receive/transmit throughput to gigabit or less speeds when > > encountering hardware multicast/broadcast replication. With 256 > > subfunctions a simple 60B ARP could consume more than 19KB of PCIe > > bandwidth due to the packet having to be duplicated so many times. In > > my mind it should be simpler to simply clone a single skb 256 times, > > forward that to the switchdev ports, and have them perform a bypass > > (if available) to deliver it to the subfunctions. That's why I was > > thinking it might be a good time to look at addressing it. > > > > east-west traffic within a host is more than likely the same tenant in > which case a proper VPC is a better solution than the s/w stack trying > to detect and guess that a bypass is needed. Guesses cost cycles in the > fast path which is a net loss - and even more so as speeds increase. Yes, but this becomes the hardware limitations deciding the layout of the network. I lean towards more flexibility to allow more configuration options being a good thing rather than us needing to dictate how a network has to be constructed based on the limitations of the hardware and software. For broadcast/multicast it isn't so much a guess. It would be a single bit test. My understanding is the switchdev setup is already making special cases for things like broadcast/multicast due to the extra overhead incurred. I mentioned ARP because in many cases it has to be offloaded specifically due to these sorts of issues.
On Thu, Dec 17, 2020 at 9:20 PM Parav Pandit <parav@nvidia.com> wrote: > > > > From: Alexander Duyck <alexander.duyck@gmail.com> > > Sent: Friday, December 18, 2020 8:41 AM > > > > On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@gmail.com> wrote: > > > > > > On 12/16/20 3:53 PM, Alexander Duyck wrote: > > The problem is PCIe DMA wasn't designed to function as a network switch > > fabric and when we start talking about a 400Gb NIC trying to handle over 256 > > subfunctions it will quickly reduce the receive/transmit throughput to gigabit > > or less speeds when encountering hardware multicast/broadcast replication. > > With 256 subfunctions a simple 60B ARP could consume more than 19KB of > > PCIe bandwidth due to the packet having to be duplicated so many times. In > > my mind it should be simpler to simply clone a single skb 256 times, forward > > that to the switchdev ports, and have them perform a bypass (if available) to > > deliver it to the subfunctions. That's why I was thinking it might be a good > > time to look at addressing it. > Linux tc framework is rich to address this and already used by openvswich for years now. > Today arp broadcasts are not offloaded. They go through software path and replicated in the L2 domain. > It is a solved problem for many years now. When you say they are replicated in the L2 domain I assume you are talking about the software switch connected to the switchdev ports. My question is what are you doing with them after you have replicated them? I'm assuming they are being sent to the other switchdev ports which will require a DMA to transmit them, and another to receive them on the VF/SF, or are you saying something else is going on here? My argument is that this cuts into both the transmit and receive DMA bandwidth of the NIC, and could easily be avoided in the case where SF exists in the same kernel as the switchdev port by identifying the multicast bit being set and simply bypassing the device.
> From: Alexander Duyck <alexander.duyck@gmail.com> > Sent: Friday, December 18, 2020 9:31 PM > > On Thu, Dec 17, 2020 at 9:20 PM Parav Pandit <parav@nvidia.com> wrote: > > > > > > > From: Alexander Duyck <alexander.duyck@gmail.com> > > > Sent: Friday, December 18, 2020 8:41 AM > > > > > > On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@gmail.com> > wrote: > > > > > > > > On 12/16/20 3:53 PM, Alexander Duyck wrote: > > > The problem is PCIe DMA wasn't designed to function as a network > > > switch fabric and when we start talking about a 400Gb NIC trying to > > > handle over 256 subfunctions it will quickly reduce the > > > receive/transmit throughput to gigabit or less speeds when encountering > hardware multicast/broadcast replication. > > > With 256 subfunctions a simple 60B ARP could consume more than 19KB > > > of PCIe bandwidth due to the packet having to be duplicated so many > > > times. In my mind it should be simpler to simply clone a single skb > > > 256 times, forward that to the switchdev ports, and have them > > > perform a bypass (if available) to deliver it to the subfunctions. > > > That's why I was thinking it might be a good time to look at addressing it. > > Linux tc framework is rich to address this and already used by openvswich > for years now. > > Today arp broadcasts are not offloaded. They go through software path > and replicated in the L2 domain. > > It is a solved problem for many years now. > > When you say they are replicated in the L2 domain I assume you are talking > about the software switch connected to the switchdev ports. Yes. > My question is > what are you doing with them after you have replicated them? I'm assuming > they are being sent to the other switchdev ports which will require a DMA to > transmit them, and another to receive them on the VF/SF, or are you saying > something else is going on here? > Yes, that is correct. > My argument is that this cuts into both the transmit and receive DMA > bandwidth of the NIC, and could easily be avoided in the case where SF > exists in the same kernel as the switchdev port by identifying the multicast > bit being set and simply bypassing the device. It probably can be avoided, but it's probably not worth it for occasional ARP packets on a neighbor cache miss. If I am not mistaken, even some recent HW can forward such ARP packets to multiple switchdev ports with commit 7ee3f6d2486e without following the above-described DMA path.
On Fri, Dec 18, 2020 at 10:01 AM Parav Pandit <parav@nvidia.com> wrote: > > > > From: Alexander Duyck <alexander.duyck@gmail.com> > > Sent: Friday, December 18, 2020 9:31 PM > > > > On Thu, Dec 17, 2020 at 9:20 PM Parav Pandit <parav@nvidia.com> wrote: > > > > > > > > > > From: Alexander Duyck <alexander.duyck@gmail.com> > > > > Sent: Friday, December 18, 2020 8:41 AM > > > > > > > > On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@gmail.com> > > wrote: > > > > > > > > > > On 12/16/20 3:53 PM, Alexander Duyck wrote: > > > > The problem is PCIe DMA wasn't designed to function as a network > > > > switch fabric and when we start talking about a 400Gb NIC trying to > > > > handle over 256 subfunctions it will quickly reduce the > > > > receive/transmit throughput to gigabit or less speeds when encountering > > hardware multicast/broadcast replication. > > > > With 256 subfunctions a simple 60B ARP could consume more than 19KB > > > > of PCIe bandwidth due to the packet having to be duplicated so many > > > > times. In my mind it should be simpler to simply clone a single skb > > > > 256 times, forward that to the switchdev ports, and have them > > > > perform a bypass (if available) to deliver it to the subfunctions. > > > > That's why I was thinking it might be a good time to look at addressing it. > > > Linux tc framework is rich to address this and already used by openvswich > > for years now. > > > Today arp broadcasts are not offloaded. They go through software path > > and replicated in the L2 domain. > > > It is a solved problem for many years now. > > > > When you say they are replicated in the L2 domain I assume you are talking > > about the software switch connected to the switchdev ports. > Yes. > > > My question is > > what are you doing with them after you have replicated them? I'm assuming > > they are being sent to the other switchdev ports which will require a DMA to > > transmit them, and another to receive them on the VF/SF, or are you saying > > something else is going on here? > > > Yes, that is correct. > > > My argument is that this cuts into both the transmit and receive DMA > > bandwidth of the NIC, and could easily be avoided in the case where SF > > exists in the same kernel as the switchdev port by identifying the multicast > > bit being set and simply bypassing the device. > It probably can be avoided but its probably not worth for occasional ARP packets on neighbor cache miss. > If I am not mistaken, even some recent HW can forward such ARP packets to multiple switchdev ports with commit 7ee3f6d2486e without following the above described DMA path. Even with that it sounds like it will have to DMA the packet to multiple Rx destinations even if it is only performing the Tx DMA once. The Intel NICs did all this replication in hardware as well so that is what I was thinking of when I was talking about the replication behavior seen with SR-IOV. Basically what I am getting at is that this could be used as an architectural feature for switchdev to avoid creating increased DMA overhead for broadcast, multicast and unknown-unicast traffic. I'm not saying this is anything mandatory, and I would be perfectly okay with something like this being optional and defaulted to off. In my mind the setup only has the interfaces handling traffic to single point destinations so that at most you are only looking at a 2x bump in PCIe bandwidth for those cases where the packet ends up needing to go out the physical port. 
It would essentially be a software offload to avoid saturating the PCIe bus. This setup would only work if both interfaces are present in the same kernel, though, which is why I chose now to bring it up, as historically SR-IOV hasn't normally been associated with containers due to the limited number of interfaces that could be created. Also, as far as the patch count complaints I have seen in a few threads, I would be fine with splitting things up so that the devlink and aux device creation get handled in one set, and then we work out the details of mlx5 attaching to the devices and spawning the SF netdevs in another, since that seems to be where the debate is.
On Fri, Dec 18, 2020 at 11:22:12AM -0800, Alexander Duyck wrote: > Also as far as the patch count complaints I have seen in a few threads > I would be fine with splitting things up so that the devlink and aux > device creation get handled in one set, and then we work out the > details of mlx5 attaching to the devices and spawning of the SF > netdevs in another since that seems to be where the debate is. It doesn't work like that. The aux device creates a mlx5_core and every mlx5_core can run mlx5_en. This really isn't the series to raise this feature request. Adding an optional shortcut path to VF/SF is something that can be done later if up-to-date benchmarks show it has value. There is no blocker in this model to doing that. Jason
On Fri, Dec 18, 2020 at 12:18 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Dec 18, 2020 at 11:22:12AM -0800, Alexander Duyck wrote: > > > > Also as far as the patch count complaints I have seen in a few threads > > I would be fine with splitting things up so that the devlink and aux > > device creation get handled in one set, and then we work out the > > details of mlx5 attaching to the devices and spawning of the SF > > netdevs in another since that seems to be where the debate is. > > It doesn't work like that. The aux device creates a mlx5_core and > every mlx5_core can run mlx5_en. The aux device is still just a device, isn't it? Until you register a driver that latches on to "MLX5_SF_DEV_ID_NAME", the device is there and it should function like an unbound/idle VF. > This really isn't the series to raise this feature request. Adding an > optional short cut path to VF/SF is something that can be done later > if up to date benchmarks show it has value. There is no blocker in > this model to doing that. That is the point of contention that we probably won't solve. I feel like if we are going to put out something that is an alternative to the macvlan and SR-IOV approaches, it should address the issues resolved in both. Right now this just throws up yet another virtualization solution that ends up reintroducing many of the existing problems we already had with SR-IOV, and possibly exacerbating them by allowing for an even greater number of subfunctions. Anyway, I have made my opinion known, and I am not in a position to block the patches since I am not the maintainer.