| Message ID | 20240215030814.451812-16-saeed@kernel.org (mailing list archive) |
|---|---|
| State | Changes Requested |
| Delegated to: | Netdev Maintainers |
| Series | [net-next,V3,01/15] net/mlx5: Add MPIR bit in mcam_access_reg |
On Wed, 14 Feb 2024 19:08:14 -0800 Saeed Mahameed wrote:
> +The advanced Multi-PF NIC technology enables several CPUs within a multi-socket server to

There are multiple devlink instances, right? In that case we should
call out that there may be more than one.

> +Currently the sysfs is kept untouched, letting the netdev sysfs point to its primary PF.
> +Enhancing sysfs to reflect the actual topology is to be discussed and contributed separately.

I don't anticipate it to be particularly hard, let's not merge
half-baked code and force users to grow workarounds that are hard
to remove.

Also could you add examples of how the queue and napis look when listed
via the netdev genl on these devices?
On 16/02/2024 7:23, Jakub Kicinski wrote: > On Wed, 14 Feb 2024 19:08:14 -0800 Saeed Mahameed wrote: >> +The advanced Multi-PF NIC technology enables several CPUs within a multi-socket server to Hi Jakub, > > There are multiple devlink instances, right? Right. > In that case we should call out that there may be more than one. > We are combining the PFs in the netdev level. I did not focus on the parts that we do not touch. That's why I didn't mention the sysfs for example, until you asked. For example, irqns for the two PFs are still reachable as they used to, under two distinct paths: ll /sys/bus/pci/devices/0000\:08\:00.0/msi_irqs/ ll /sys/bus/pci/devices/0000\:09\:00.0/msi_irqs/ >> +Currently the sysfs is kept untouched, letting the netdev sysfs point to its primary PF. >> +Enhancing sysfs to reflect the actual topology is to be discussed and contributed separately. > > I don't anticipate it to be particularly hard, let's not merge > half-baked code and force users to grow workarounds that are hard > to remove. > Changing sysfs to expose queues from multiple PFs under one path might be misleading and break backward compatibility. IMO it should come as an extension to the existing entries. Anyway, the interesting info exposed in sysfs is now available through the netdev genl. Now, is this sysfs part integral to the feature? IMO, no. This in-driver feature is large enough to be completed in stages and not as a one shot. > Also could you add examples of how the queue and napis look when listed > via the netdev genl on these devices? > Sure. Example for a 24-cores system: $ ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml --dump queue-get --json '{"ifindex": 5}' [{'id': 0, 'ifindex': 5, 'napi-id': 539, 'type': 'rx'}, {'id': 1, 'ifindex': 5, 'napi-id': 540, 'type': 'rx'}, {'id': 2, 'ifindex': 5, 'napi-id': 541, 'type': 'rx'}, {'id': 3, 'ifindex': 5, 'napi-id': 542, 'type': 'rx'}, {'id': 4, 'ifindex': 5, 'napi-id': 543, 'type': 'rx'}, {'id': 5, 'ifindex': 5, 'napi-id': 544, 'type': 'rx'}, {'id': 6, 'ifindex': 5, 'napi-id': 545, 'type': 'rx'}, {'id': 7, 'ifindex': 5, 'napi-id': 546, 'type': 'rx'}, {'id': 8, 'ifindex': 5, 'napi-id': 547, 'type': 'rx'}, {'id': 9, 'ifindex': 5, 'napi-id': 548, 'type': 'rx'}, {'id': 10, 'ifindex': 5, 'napi-id': 549, 'type': 'rx'}, {'id': 11, 'ifindex': 5, 'napi-id': 550, 'type': 'rx'}, {'id': 12, 'ifindex': 5, 'napi-id': 551, 'type': 'rx'}, {'id': 13, 'ifindex': 5, 'napi-id': 552, 'type': 'rx'}, {'id': 14, 'ifindex': 5, 'napi-id': 553, 'type': 'rx'}, {'id': 15, 'ifindex': 5, 'napi-id': 554, 'type': 'rx'}, {'id': 16, 'ifindex': 5, 'napi-id': 555, 'type': 'rx'}, {'id': 17, 'ifindex': 5, 'napi-id': 556, 'type': 'rx'}, {'id': 18, 'ifindex': 5, 'napi-id': 557, 'type': 'rx'}, {'id': 19, 'ifindex': 5, 'napi-id': 558, 'type': 'rx'}, {'id': 20, 'ifindex': 5, 'napi-id': 559, 'type': 'rx'}, {'id': 21, 'ifindex': 5, 'napi-id': 560, 'type': 'rx'}, {'id': 22, 'ifindex': 5, 'napi-id': 561, 'type': 'rx'}, {'id': 23, 'ifindex': 5, 'napi-id': 562, 'type': 'rx'}, {'id': 0, 'ifindex': 5, 'napi-id': 539, 'type': 'tx'}, {'id': 1, 'ifindex': 5, 'napi-id': 540, 'type': 'tx'}, {'id': 2, 'ifindex': 5, 'napi-id': 541, 'type': 'tx'}, {'id': 3, 'ifindex': 5, 'napi-id': 542, 'type': 'tx'}, {'id': 4, 'ifindex': 5, 'napi-id': 543, 'type': 'tx'}, {'id': 5, 'ifindex': 5, 'napi-id': 544, 'type': 'tx'}, {'id': 6, 'ifindex': 5, 'napi-id': 545, 'type': 'tx'}, {'id': 7, 'ifindex': 5, 'napi-id': 546, 'type': 'tx'}, {'id': 8, 'ifindex': 5, 'napi-id': 547, 'type': 'tx'}, {'id': 9, 
'ifindex': 5, 'napi-id': 548, 'type': 'tx'}, {'id': 10, 'ifindex': 5, 'napi-id': 549, 'type': 'tx'}, {'id': 11, 'ifindex': 5, 'napi-id': 550, 'type': 'tx'}, {'id': 12, 'ifindex': 5, 'napi-id': 551, 'type': 'tx'}, {'id': 13, 'ifindex': 5, 'napi-id': 552, 'type': 'tx'}, {'id': 14, 'ifindex': 5, 'napi-id': 553, 'type': 'tx'}, {'id': 15, 'ifindex': 5, 'napi-id': 554, 'type': 'tx'}, {'id': 16, 'ifindex': 5, 'napi-id': 555, 'type': 'tx'}, {'id': 17, 'ifindex': 5, 'napi-id': 556, 'type': 'tx'}, {'id': 18, 'ifindex': 5, 'napi-id': 557, 'type': 'tx'}, {'id': 19, 'ifindex': 5, 'napi-id': 558, 'type': 'tx'}, {'id': 20, 'ifindex': 5, 'napi-id': 559, 'type': 'tx'}, {'id': 21, 'ifindex': 5, 'napi-id': 560, 'type': 'tx'}, {'id': 22, 'ifindex': 5, 'napi-id': 561, 'type': 'tx'}, {'id': 23, 'ifindex': 5, 'napi-id': 562, 'type': 'tx'}] $ ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml --dump napi-get --json='{"ifindex": 5}' [{'id': 562, 'ifindex': 5, 'irq': 84}, {'id': 561, 'ifindex': 5, 'irq': 83}, {'id': 560, 'ifindex': 5, 'irq': 82}, {'id': 559, 'ifindex': 5, 'irq': 81}, {'id': 558, 'ifindex': 5, 'irq': 80}, {'id': 557, 'ifindex': 5, 'irq': 79}, {'id': 556, 'ifindex': 5, 'irq': 78}, {'id': 555, 'ifindex': 5, 'irq': 77}, {'id': 554, 'ifindex': 5, 'irq': 76}, {'id': 553, 'ifindex': 5, 'irq': 75}, {'id': 552, 'ifindex': 5, 'irq': 74}, {'id': 551, 'ifindex': 5, 'irq': 73}, {'id': 550, 'ifindex': 5, 'irq': 72}, {'id': 549, 'ifindex': 5, 'irq': 71}, {'id': 548, 'ifindex': 5, 'irq': 70}, {'id': 547, 'ifindex': 5, 'irq': 69}, {'id': 546, 'ifindex': 5, 'irq': 68}, {'id': 545, 'ifindex': 5, 'irq': 67}, {'id': 544, 'ifindex': 5, 'irq': 66}, {'id': 543, 'ifindex': 5, 'irq': 65}, {'id': 542, 'ifindex': 5, 'irq': 64}, {'id': 541, 'ifindex': 5, 'irq': 63}, {'id': 540, 'ifindex': 5, 'irq': 39}, {'id': 539, 'ifindex': 5, 'irq': 36}]
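For reference, the IRQ numbers reported by napi-get above can be tied back to a NUMA node from the shell. A minimal sketch, assuming a NUMA-enabled kernel so that /proc/irq/<N>/node exists, using IRQs 36 and 84 from the dump above; the node and affinity values shown are illustrative, not taken from the thread:

    $ cat /proc/irq/36/node
    0
    $ cat /proc/irq/84/node
    1
    $ cat /proc/irq/84/smp_affinity_list
    12-23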
Thu, Feb 15, 2024 at 04:08:14AM CET, saeed@kernel.org wrote: >From: Tariq Toukan <tariqt@nvidia.com> > >Add documentation for the multi-pf netdev feature. >Describe the mlx5 implementation and design decisions. > >Signed-off-by: Tariq Toukan <tariqt@nvidia.com> >Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> >--- > Documentation/networking/index.rst | 1 + > Documentation/networking/multi-pf-netdev.rst | 157 +++++++++++++++++++ > 2 files changed, 158 insertions(+) > create mode 100644 Documentation/networking/multi-pf-netdev.rst > >diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst >index 69f3d6dcd9fd..473d72c36d61 100644 >--- a/Documentation/networking/index.rst >+++ b/Documentation/networking/index.rst >@@ -74,6 +74,7 @@ Contents: > mpls-sysctl > mptcp-sysctl > multiqueue >+ multi-pf-netdev > napi > net_cachelines/index > netconsole >diff --git a/Documentation/networking/multi-pf-netdev.rst b/Documentation/networking/multi-pf-netdev.rst >new file mode 100644 >index 000000000000..6ef2ac448d1e >--- /dev/null >+++ b/Documentation/networking/multi-pf-netdev.rst >@@ -0,0 +1,157 @@ >+.. SPDX-License-Identifier: GPL-2.0 >+.. include:: <isonum.txt> >+ >+=============== >+Multi-PF Netdev >+=============== >+ >+Contents >+======== >+ >+- `Background`_ >+- `Overview`_ >+- `mlx5 implementation`_ >+- `Channels distribution`_ >+- `Topology`_ >+- `Steering`_ >+- `Mutually exclusive features`_ >+ >+Background >+========== >+ >+The advanced Multi-PF NIC technology enables several CPUs within a multi-socket server to >+connect directly to the network, each through its own dedicated PCIe interface. Through either a >+connection harness that splits the PCIe lanes between two cards or by bifurcating a PCIe slot for a >+single card. This results in eliminating the network traffic traversing over the internal bus >+between the sockets, significantly reducing overhead and latency, in addition to reducing CPU >+utilization and increasing network throughput. >+ >+Overview >+======== >+ >+This feature adds support for combining multiple devices (PFs) of the same port in a Multi-PF >+environment under one netdev instance. Passing traffic through different devices belonging to >+different NUMA sockets saves cross-numa traffic and allows apps running on the same netdev from >+different numas to still feel a sense of proximity to the device and achieve improved performance. >+ >+mlx5 implementation >+=================== >+ >+Multi-PF or Socket-direct in mlx5 is achieved by grouping PFs together which belong to the same >+NIC and has the socket-direct property enabled, once all PFS are probed, we create a single netdev How do you enable this property? >+to represent all of them, symmetrically, we destroy the netdev whenever any of the PFs is removed. >+ >+The netdev network channels are distributed between all devices, a proper configuration would utilize >+the correct close numa node when working on a certain app/cpu. >+ >+We pick one PF to be a primary (leader), and it fills a special role. The other devices >+(secondaries) are disconnected from the network at the chip level (set to silent mode). In silent >+mode, no south <-> north traffic flowing directly through a secondary PF. It needs the assistance of >+the leader PF (east <-> west traffic) to function. All RX/TX traffic is steered through the primary >+to/from the secondaries. >+ >+Currently, we limit the support to PFs only, and up to two PFs (sockets). 
For the record, could you please describe why exactly you didn't use drivers/base/component.c infrastructure for this? I know you told me, but I don't recall. Better to have this written down, I believe. >+ >+Channels distribution >+===================== >+ >+We distribute the channels between the different PFs to achieve local NUMA node performance >+on multiple NUMA nodes. >+ >+Each combined channel works against one specific PF, creating all its datapath queues against it. We distribute >+channels to PFs in a round-robin policy. >+ >+:: >+ >+ Example for 2 PFs and 6 channels: >+ +--------+--------+ >+ | ch idx | PF idx | >+ +--------+--------+ >+ | 0 | 0 | >+ | 1 | 1 | >+ | 2 | 0 | >+ | 3 | 1 | >+ | 4 | 0 | >+ | 5 | 1 | >+ +--------+--------+ >+ >+ >+We prefer this round-robin distribution policy over another suggested intuitive distribution, in >+which we first distribute one half of the channels to PF0 and then the second half to PF1. >+ >+The reason we prefer round-robin is, it is less influenced by changes in the number of channels. The >+mapping between a channel index and a PF is fixed, no matter how many channels the user configures. >+As the channel stats are persistent across channel's closure, changing the mapping every single time >+would turn the accumulative stats less representing of the channel's history. >+ >+This is achieved by using the correct core device instance (mdev) in each channel, instead of them >+all using the same instance under "priv->mdev". >+ >+Topology >+======== >+Currently the sysfs is kept untouched, letting the netdev sysfs point to its primary PF. >+Enhancing sysfs to reflect the actual topology is to be discussed and contributed separately. >+For now, debugfs is being used to reflect the topology: >+ >+.. code-block:: bash >+ >+ $ grep -H . /sys/kernel/debug/mlx5/0000\:08\:00.0/sd/* >+ /sys/kernel/debug/mlx5/0000:08:00.0/sd/group_id:0x00000101 >+ /sys/kernel/debug/mlx5/0000:08:00.0/sd/primary:0000:08:00.0 vhca 0x0 >+ /sys/kernel/debug/mlx5/0000:08:00.0/sd/secondary_0:0000:09:00.0 vhca 0x2 Ugh :/ SD is something that is likely going to stay with us for some time. Can't we have some proper UAPI instead of this? IDK. >+ >+Steering >+======== >+Secondary PFs are set to "silent" mode, meaning they are disconnected from the network. >+ >+In RX, the steering tables belong to the primary PF only, and it is its role to distribute incoming >+traffic to other PFs, via cross-vhca steering capabilities. Nothing special about the RSS table >+content, except that it needs a capable device to point to the receive queues of a different PF. >+ >+In TX, the primary PF creates a new TX flow table, which is aliased by the secondaries, so they can >+go out to the network through it. >+ >+In addition, we set default XPS configuration that, based on the cpu, selects an SQ belonging to the >+PF on the same node as the cpu. >+ >+XPS default config example: >+ >+NUMA node(s): 2 >+NUMA node0 CPU(s): 0-11 >+NUMA node1 CPU(s): 12-23 How can user know which queue is bound to which cpu? >+ >+PF0 on node0, PF1 on node1. 
>+ >+- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001 >+- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000 >+- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002 >+- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000 >+- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004 >+- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000 >+- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008 >+- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000 >+- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010 >+- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000 >+- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020 >+- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000 >+- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040 >+- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000 >+- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080 >+- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000 >+- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100 >+- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000 >+- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200 >+- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000 >+- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400 >+- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000 >+- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800 >+- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000 >+ >+Mutually exclusive features >+=========================== >+ >+The nature of Multi-PF, where different channels work with different PFs, conflicts with >+stateful features where the state is maintained in one of the PFs. >+For example, in the TLS device-offload feature, special context objects are created per connection >+and maintained in the PF. Transitioning between different RQs/SQs would break the feature. Hence, >+we disable this combination for now. >-- >2.43.0 > >
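Regarding the question above about how a user can know which queue is bound to which CPU: the xps_cpus value is a hexadecimal CPU bitmask, so it can be decoded straight from sysfs. A sketch, assuming the interface is eth2 as in the example, at most 64 CPUs, and a single mask word (no comma-separated words):

    for q in /sys/class/net/eth2/queues/tx-*/xps_cpus; do
        mask=$((16#$(cat "$q")))            # hex CPU bitmask -> integer
        cpus=""
        for ((cpu = 0; cpu < 64; cpu++)); do
            (( (mask >> cpu) & 1 )) && cpus+=" $cpu"
        done
        echo "$q ->$cpus"                   # e.g. tx-1/xps_cpus -> 12
    done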
On Mon, 19 Feb 2024 17:26:36 +0200 Tariq Toukan wrote: > > There are multiple devlink instances, right? > > Right. Just to be clear I'm asking you questions about things which need to be covered by the doc :) > > In that case we should call out that there may be more than one. > > > > We are combining the PFs in the netdev level. > I did not focus on the parts that we do not touch. Sure but one of the goals here is to drive convergence. So if another vendor is on the fence let's nudge them towards the same decision. > That's why I didn't mention the sysfs for example, until you asked. > > For example, irqns for the two PFs are still reachable as they used to, > under two distinct paths: > ll /sys/bus/pci/devices/0000\:08\:00.0/msi_irqs/ > ll /sys/bus/pci/devices/0000\:09\:00.0/msi_irqs/ > > >> +Currently the sysfs is kept untouched, letting the netdev sysfs point to its primary PF. > >> +Enhancing sysfs to reflect the actual topology is to be discussed and contributed separately. > > > > I don't anticipate it to be particularly hard, let's not merge > > half-baked code and force users to grow workarounds that are hard > > to remove. > > Changing sysfs to expose queues from multiple PFs under one path might > be misleading and break backward compatibility. IMO it should come as an > extension to the existing entries. I don't know what "multiple PFs under one path" means, links in VFs are one to one, right? :) > Anyway, the interesting info exposed in sysfs is now available through > the netdev genl. Right, that's true. Greg, we have a feature here where a single device of class net has multiple "bus parents". We used to have one attr under class net (device) which is a link to the bus parent. Now we either need to add more or not bother with the linking of the whole device. Is there any precedent / preference for solving this from the device model perspective? > Now, is this sysfs part integral to the feature? IMO, no. This in-driver > feature is large enough to be completed in stages and not as a one shot. It's not a question of size and/or implementing everything. What I want to make sure is that you surveyed the known user space implementations sufficiently to know what looks at those links, and perhaps ethtool -i. Perhaps the answer is indeed "nothing much will care" and given we can link IRQs correctly we put that as a conclusion in the doc. Saying "sysfs is coming soon" is not adding much information :( > > Also could you add examples of how the queue and napis look when listed > > via the netdev genl on these devices? > > > > Sure. Example for a 24-cores system: Could you reconfigure to 5 channels to make the output asymmetric and shorter and include the example in the doc?
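A sketch of the reconfiguration requested above, assuming the interface name is eth2 (as in the XPS example) and that the driver accepts an asymmetric channel count; the resulting dump output would come from the device itself:

    $ ethtool -L eth2 combined 5
    $ ./cli.py --spec Documentation/netlink/specs/netdev.yaml --dump queue-get --json '{"ifindex": 5}'
    $ ./cli.py --spec Documentation/netlink/specs/netdev.yaml --dump napi-get --json '{"ifindex": 5}'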
On 20 Feb 17:33, Jakub Kicinski wrote:
>On Mon, 19 Feb 2024 17:26:36 +0200 Tariq Toukan wrote:
>> > There are multiple devlink instances, right?
>>
>> Right.
>
>Just to be clear I'm asking you questions about things which need to
>be covered by the doc :)
>
>> > In that case we should call out that there may be more than one.
>> >
>>
>> We are combining the PFs in the netdev level.
>> I did not focus on the parts that we do not touch.
>
>> Anyway, the interesting info exposed in sysfs is now available through
>> the netdev genl.
>
>Right, that's true.
>
[...]
>Greg, we have a feature here where a single device of class net has
>multiple "bus parents". We used to have one attr under class net
>(device) which is a link to the bus parent. Now we either need to add
>more or not bother with the linking of the whole device. Is there any
>precedent / preference for solving this from the device model
>perspective?
>
>> Now, is this sysfs part integral to the feature? IMO, no. This in-driver
>> feature is large enough to be completed in stages and not as a one shot.
>
>It's not a question of size and/or implementing everything.
>What I want to make sure is that you surveyed the known user space
>implementations sufficiently to know what looks at those links,
>and perhaps ethtool -i.
>Perhaps the answer is indeed "nothing much will care" and given
>we can link IRQs correctly we put that as a conclusion in the doc.
>
>Saying "sysfs is coming soon" is not adding much information :(
>

Linking multiple parent devices at the netdev subsystem level doesn't add
anything; the netdev abstraction should stop at linking rx/tx channels to
physical IRQs and NUMA nodes. Complicating the sysfs would require a proper
infrastructure to model the multi-PF mode for all vendors to use uniformly,
but for what? There is currently no configuration mechanism for this feature,
and we don't need one at the moment. Once configuration becomes necessary,
I would recommend adding one infrastructure for all vendors to register to
at the parent device level, which would handle the sysfs/devlink abstraction
and leave the netdev abstraction as is (IRQ/NUMA) - and maybe take this a
step further and give the user control of attaching specific channels to
specific IRQs/NUMA nodes.
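A sketch of the "attach specific channels to specific IRQs/NUMA nodes" idea above: IRQ affinity can already be steered from userspace through procfs. Assuming IRQ 84 (taken from the earlier napi-get dump) should be served by the node-1 CPUs 12-23 of the earlier XPS example; the echoed range and output are illustrative:

    # echo 12-23 > /proc/irq/84/smp_affinity_list
    # cat /proc/irq/84/effective_affinity_list
    12-23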
On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote:
> Greg, we have a feature here where a single device of class net has
> multiple "bus parents". We used to have one attr under class net
> (device) which is a link to the bus parent. Now we either need to add
> more or not bother with the linking of the whole device. Is there any
> precedent / preference for solving this from the device model
> perspective?

How, logically, can a netdevice be controlled properly from 2 parent
devices on two different busses? How is that even possible from a
physical point-of-view? What exact bus types are involved here?

This "shouldn't" be possible as in the end, it's usually a PCI device
handling this all, right?

thanks,

greg k-h
On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote:
> On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote:
> > Greg, we have a feature here where a single device of class net has
> > multiple "bus parents". We used to have one attr under class net
> > (device) which is a link to the bus parent. Now we either need to add
> > more or not bother with the linking of the whole device. Is there any
> > precedent / preference for solving this from the device model
> > perspective?
>
> How, logically, can a netdevice be controlled properly from 2 parent
> devices on two different busses? How is that even possible from a
> physical point-of-view? What exact bus types are involved here?

Two PCIe buses, two endpoints, two networking ports. It's one piece
of silicon, tho, so the "slices" can talk to each other internally.
The NVRAM configuration tells both endpoints that the user wants
them "bonded", when the PCI drivers probe they "find each other"
using some cookie or DSN or whatnot. And once they did, they spawn
a single netdev.

> This "shouldn't" be possible as in the end, it's usually a PCI device
> handling this all, right?

It's really a special type of bonding of two netdevs. Like you'd bond
two ports to get twice the bandwidth. With the twist that the balancing
is done on NUMA proximity, rather than traffic hash.

Well, plus, the major twist that it's all done magically "for you"
in the vendor driver, and the two "lower" devices are not visible.
You only see the resulting bond.

I personally think that the magic hides as many problems as it
introduces and we'd be better off creating two separate netdevs.
And then a new type of "device bond" on top. Small win that
the "new device bond on top" can be shared code across vendors.

But there's only so many hours in the day to argue with vendors.
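For reference, the "cookie or DSN" mentioned above can be inspected from userspace, since the Device Serial Number is a PCIe extended capability. A sketch, assuming the two PFs are 08:00.0 and 09:00.0 as earlier in the thread; the capability offset and serial value shown are made up, but the same serial would be expected on both PFs of a pair:

    $ sudo lspci -s 08:00.0 -vv | grep 'Device Serial Number'
            Capabilities: [100 v1] Device Serial Number 0c-42-a1-ff-fe-12-34-56
    $ sudo lspci -s 09:00.0 -vv | grep 'Device Serial Number'
            Capabilities: [100 v1] Device Serial Number 0c-42-a1-ff-fe-12-34-56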
On 2/22/2024 5:00 PM, Jakub Kicinski wrote: > On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote: >> On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote: >>> Greg, we have a feature here where a single device of class net has >>> multiple "bus parents". We used to have one attr under class net >>> (device) which is a link to the bus parent. Now we either need to add >>> more or not bother with the linking of the whole device. Is there any >>> precedent / preference for solving this from the device model >>> perspective? >> >> How, logically, can a netdevice be controlled properly from 2 parent >> devices on two different busses? How is that even possible from a >> physical point-of-view? What exact bus types are involved here? > > Two PCIe buses, two endpoints, two networking ports. It's one piece Isn't it only 1 networking port with multiple PFs? > of silicon, tho, so the "slices" can talk to each other internally. > The NVRAM configuration tells both endpoints that the user wants > them "bonded", when the PCI drivers probe they "find each other" > using some cookie or DSN or whatnot. And once they did, they spawn > a single netdev. > >> This "shouldn't" be possible as in the end, it's usually a PCI device >> handling this all, right? > > It's really a special type of bonding of two netdevs. Like you'd bond > two ports to get twice the bandwidth. With the twist that the balancing > is done on NUMA proximity, rather than traffic hash. > > Well, plus, the major twist that it's all done magically "for you" > in the vendor driver, and the two "lower" devices are not visible. > You only see the resulting bond. > > I personally think that the magic hides as many problems as it > introduces and we'd be better off creating two separate netdevs. > And then a new type of "device bond" on top. Small win that > the "new device bond on top" can be shared code across vendors. Yes. We have been exploring a small extension to bonding driver to enable a single numa-aware multi-threaded application to efficiently utilize multiple NICs across numa nodes. Here is an early version of a patch we have been trying and seems to be working well. ========================================================================= bonding: select tx device based on rx device of a flow If napi_id is cached in the sk associated with skb, use the device associated with napi_id as the transmit device. Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 7a7d584f378a..77e3bf6c4502 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -5146,6 +5146,30 @@ static struct slave *bond_xmit_3ad_xor_slave_get(struct bonding *bond, unsigned int count; u32 hash; + if (skb->sk) { + int napi_id = skb->sk->sk_napi_id; + struct net_device *dev; + int idx; + + rcu_read_lock(); + dev = dev_get_by_napi_id(napi_id); + rcu_read_unlock(); + + if (!dev) + goto hash; + + count = slaves ? READ_ONCE(slaves->count) : 0; + if (unlikely(!count)) + return NULL; + + for (idx = 0; idx < count; idx++) { + slave = slaves->arr[idx]; + if (slave->dev->ifindex == dev->ifindex) + return slave; + } + } + +hash: hash = bond_xmit_hash(bond, skb); count = slaves ? READ_ONCE(slaves->count) : 0; if (unlikely(!count)) ========================================================================= If we make this as a configurable bonding option, would this be an acceptable solution to accelerate numa-aware apps? 
> > But there's only so many hours in the day to argue with vendors. >
Samudrala, Sridhar <sridhar.samudrala@intel.com> wrote: >On 2/22/2024 5:00 PM, Jakub Kicinski wrote: >> On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote: >>> On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote: >>>> Greg, we have a feature here where a single device of class net has >>>> multiple "bus parents". We used to have one attr under class net >>>> (device) which is a link to the bus parent. Now we either need to add >>>> more or not bother with the linking of the whole device. Is there any >>>> precedent / preference for solving this from the device model >>>> perspective? >>> >>> How, logically, can a netdevice be controlled properly from 2 parent >>> devices on two different busses? How is that even possible from a >>> physical point-of-view? What exact bus types are involved here? >> Two PCIe buses, two endpoints, two networking ports. It's one piece > >Isn't it only 1 networking port with multiple PFs? > >> of silicon, tho, so the "slices" can talk to each other internally. >> The NVRAM configuration tells both endpoints that the user wants >> them "bonded", when the PCI drivers probe they "find each other" >> using some cookie or DSN or whatnot. And once they did, they spawn >> a single netdev. >> >>> This "shouldn't" be possible as in the end, it's usually a PCI device >>> handling this all, right? >> It's really a special type of bonding of two netdevs. Like you'd bond >> two ports to get twice the bandwidth. With the twist that the balancing >> is done on NUMA proximity, rather than traffic hash. >> Well, plus, the major twist that it's all done magically "for you" >> in the vendor driver, and the two "lower" devices are not visible. >> You only see the resulting bond. >> I personally think that the magic hides as many problems as it >> introduces and we'd be better off creating two separate netdevs. >> And then a new type of "device bond" on top. Small win that >> the "new device bond on top" can be shared code across vendors. > >Yes. We have been exploring a small extension to bonding driver to enable >a single numa-aware multi-threaded application to efficiently utilize >multiple NICs across numa nodes. Is this referring to something like the multi-pf under discussion, or just generically with two arbitrary network devices installed one each per NUMA node? >Here is an early version of a patch we have been trying and seems to be >working well. > >========================================================================= >bonding: select tx device based on rx device of a flow > >If napi_id is cached in the sk associated with skb, use the >device associated with napi_id as the transmit device. > >Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> > >diff --git a/drivers/net/bonding/bond_main.c >b/drivers/net/bonding/bond_main.c >index 7a7d584f378a..77e3bf6c4502 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -5146,6 +5146,30 @@ static struct slave >*bond_xmit_3ad_xor_slave_get(struct bonding *bond, > unsigned int count; > u32 hash; > >+ if (skb->sk) { >+ int napi_id = skb->sk->sk_napi_id; >+ struct net_device *dev; >+ int idx; >+ >+ rcu_read_lock(); >+ dev = dev_get_by_napi_id(napi_id); >+ rcu_read_unlock(); >+ >+ if (!dev) >+ goto hash; >+ >+ count = slaves ? 
READ_ONCE(slaves->count) : 0; >+ if (unlikely(!count)) >+ return NULL; >+ >+ for (idx = 0; idx < count; idx++) { >+ slave = slaves->arr[idx]; >+ if (slave->dev->ifindex == dev->ifindex) >+ return slave; >+ } >+ } >+ >+hash: > hash = bond_xmit_hash(bond, skb); > count = slaves ? READ_ONCE(slaves->count) : 0; > if (unlikely(!count)) >========================================================================= > >If we make this as a configurable bonding option, would this be an >acceptable solution to accelerate numa-aware apps? Assuming for the moment this is for "regular" network devices installed one per NUMA node, why do this in bonding instead of at a higher layer (multiple subnets or ECMP, for example)? Is the intent here that the bond would aggregate its interfaces via LACP with the peer being some kind of cross-chassis link aggregation (MLAG, et al)? Given that sk_napi_id seems to be associated with CONFIG_NET_RX_BUSY_POLL, am I correct in presuming the target applications are DPDK-style busy poll packet processors? -J --- -Jay Vosburgh, jay.vosburgh@canonical.com
On 2/22/2024 8:05 PM, Jay Vosburgh wrote: > Samudrala, Sridhar <sridhar.samudrala@intel.com> wrote: >> On 2/22/2024 5:00 PM, Jakub Kicinski wrote: >>> On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote: >>>> On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote: >>>>> Greg, we have a feature here where a single device of class net has >>>>> multiple "bus parents". We used to have one attr under class net >>>>> (device) which is a link to the bus parent. Now we either need to add >>>>> more or not bother with the linking of the whole device. Is there any >>>>> precedent / preference for solving this from the device model >>>>> perspective? >>>> >>>> How, logically, can a netdevice be controlled properly from 2 parent >>>> devices on two different busses? How is that even possible from a >>>> physical point-of-view? What exact bus types are involved here? >>> Two PCIe buses, two endpoints, two networking ports. It's one piece >> >> Isn't it only 1 networking port with multiple PFs? >> >>> of silicon, tho, so the "slices" can talk to each other internally. >>> The NVRAM configuration tells both endpoints that the user wants >>> them "bonded", when the PCI drivers probe they "find each other" >>> using some cookie or DSN or whatnot. And once they did, they spawn >>> a single netdev. >>> >>>> This "shouldn't" be possible as in the end, it's usually a PCI device >>>> handling this all, right? >>> It's really a special type of bonding of two netdevs. Like you'd bond >>> two ports to get twice the bandwidth. With the twist that the balancing >>> is done on NUMA proximity, rather than traffic hash. >>> Well, plus, the major twist that it's all done magically "for you" >>> in the vendor driver, and the two "lower" devices are not visible. >>> You only see the resulting bond. >>> I personally think that the magic hides as many problems as it >>> introduces and we'd be better off creating two separate netdevs. >>> And then a new type of "device bond" on top. Small win that >>> the "new device bond on top" can be shared code across vendors. >> >> Yes. We have been exploring a small extension to bonding driver to enable >> a single numa-aware multi-threaded application to efficiently utilize >> multiple NICs across numa nodes. > > Is this referring to something like the multi-pf under > discussion, or just generically with two arbitrary network devices > installed one each per NUMA node? Normal network devices one per NUMA node > >> Here is an early version of a patch we have been trying and seems to be >> working well. >> >> ========================================================================= >> bonding: select tx device based on rx device of a flow >> >> If napi_id is cached in the sk associated with skb, use the >> device associated with napi_id as the transmit device. >> >> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> >> >> diff --git a/drivers/net/bonding/bond_main.c >> b/drivers/net/bonding/bond_main.c >> index 7a7d584f378a..77e3bf6c4502 100644 >> --- a/drivers/net/bonding/bond_main.c >> +++ b/drivers/net/bonding/bond_main.c >> @@ -5146,6 +5146,30 @@ static struct slave >> *bond_xmit_3ad_xor_slave_get(struct bonding *bond, >> unsigned int count; >> u32 hash; >> >> + if (skb->sk) { >> + int napi_id = skb->sk->sk_napi_id; >> + struct net_device *dev; >> + int idx; >> + >> + rcu_read_lock(); >> + dev = dev_get_by_napi_id(napi_id); >> + rcu_read_unlock(); >> + >> + if (!dev) >> + goto hash; >> + >> + count = slaves ? 
READ_ONCE(slaves->count) : 0; >> + if (unlikely(!count)) >> + return NULL; >> + >> + for (idx = 0; idx < count; idx++) { >> + slave = slaves->arr[idx]; >> + if (slave->dev->ifindex == dev->ifindex) >> + return slave; >> + } >> + } >> + >> +hash: >> hash = bond_xmit_hash(bond, skb); >> count = slaves ? READ_ONCE(slaves->count) : 0; >> if (unlikely(!count)) >> ========================================================================= >> >> If we make this as a configurable bonding option, would this be an >> acceptable solution to accelerate numa-aware apps? > > Assuming for the moment this is for "regular" network devices > installed one per NUMA node, why do this in bonding instead of at a > higher layer (multiple subnets or ECMP, for example)? > > Is the intent here that the bond would aggregate its interfaces > via LACP with the peer being some kind of cross-chassis link aggregation > (MLAG, et al)? Yes. basic LACP bonding setup. There could be multiple peers connecting to the server via switch providing LACP based link aggregation. No cross-chassis MLAG. > > Given that sk_napi_id seems to be associated with > CONFIG_NET_RX_BUSY_POLL, am I correct in presuming the target > applications are DPDK-style busy poll packet processors? I am using sk_napi_id to get the incoming interface. Busy poll is not a requirement and this can be used with any socket based apps. In a numa-aware app, the app threads are split into pools of threads aligned to each numa node and the associated NIC. In the rx path, a thread is picked from a pool associated with a numa node using SO_INCOMING_CPU or similar method by setting irq affinity to the local cores. napi id is cached in the sk in the receive path. In the tx path, bonding driver picks the same NIC as the outgoing device using the cached sk->napi_id. This enables numa affinitized data path for an app thread doing network I/O. If we also configure xps based on rx queues, tx and rx of a TCP flow can be aligned to the same queue pair of a NIC even when using bonding. > > -J > > --- > -Jay Vosburgh, jay.vosburgh@canonical.com
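A sketch of the "xps based on rx queues" part mentioned above, assuming an interface eth0 whose IRQs are already affinitized so that each RX queue is processed on a known CPU set; xps_rxqs takes a bitmask of receive queues rather than of CPUs:

    # map tx-0 to rx queue 0 and tx-1 to rx queue 1 (bits 0 and 1 of the mask)
    # echo 1 > /sys/class/net/eth0/queues/tx-0/xps_rxqs
    # echo 2 > /sys/class/net/eth0/queues/tx-1/xps_rxqs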
Fri, Feb 23, 2024 at 02:23:32AM CET, sridhar.samudrala@intel.com wrote: > > >On 2/22/2024 5:00 PM, Jakub Kicinski wrote: >> On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote: >> > On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote: >> > > Greg, we have a feature here where a single device of class net has >> > > multiple "bus parents". We used to have one attr under class net >> > > (device) which is a link to the bus parent. Now we either need to add >> > > more or not bother with the linking of the whole device. Is there any >> > > precedent / preference for solving this from the device model >> > > perspective? >> > >> > How, logically, can a netdevice be controlled properly from 2 parent >> > devices on two different busses? How is that even possible from a >> > physical point-of-view? What exact bus types are involved here? >> >> Two PCIe buses, two endpoints, two networking ports. It's one piece > >Isn't it only 1 networking port with multiple PFs? AFAIK, yes. I have one device in hands like this. One physical port, 2 PCI slots, 2 PFs on PCI bus. > >> of silicon, tho, so the "slices" can talk to each other internally. >> The NVRAM configuration tells both endpoints that the user wants >> them "bonded", when the PCI drivers probe they "find each other" >> using some cookie or DSN or whatnot. And once they did, they spawn >> a single netdev. >> >> > This "shouldn't" be possible as in the end, it's usually a PCI device >> > handling this all, right? >> >> It's really a special type of bonding of two netdevs. Like you'd bond >> two ports to get twice the bandwidth. With the twist that the balancing >> is done on NUMA proximity, rather than traffic hash. >> >> Well, plus, the major twist that it's all done magically "for you" >> in the vendor driver, and the two "lower" devices are not visible. >> You only see the resulting bond. >> >> I personally think that the magic hides as many problems as it >> introduces and we'd be better off creating two separate netdevs. >> And then a new type of "device bond" on top. Small win that >> the "new device bond on top" can be shared code across vendors. > >Yes. We have been exploring a small extension to bonding driver to enable a >single numa-aware multi-threaded application to efficiently utilize multiple >NICs across numa nodes. Bonding was my immediate response when we discussed this internally for the first time. But I had to eventually admit it is probably not that suitable in this case, here's why: 1) there are no 2 physical ports, only one. 2) it is basically a matter of device layout/provisioning that this feature should be enabled, not user configuration. 3) other subsystems like RDMA would benefit the same feature, so this int not netdev specific in general.
Fri, Feb 23, 2024 at 06:00:40AM CET, sridhar.samudrala@intel.com wrote: > > >On 2/22/2024 8:05 PM, Jay Vosburgh wrote: >> Samudrala, Sridhar <sridhar.samudrala@intel.com> wrote: >> > On 2/22/2024 5:00 PM, Jakub Kicinski wrote: >> > > On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote: >> > > > On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote: >> > > > > Greg, we have a feature here where a single device of class net has >> > > > > multiple "bus parents". We used to have one attr under class net >> > > > > (device) which is a link to the bus parent. Now we either need to add >> > > > > more or not bother with the linking of the whole device. Is there any >> > > > > precedent / preference for solving this from the device model >> > > > > perspective? >> > > > >> > > > How, logically, can a netdevice be controlled properly from 2 parent >> > > > devices on two different busses? How is that even possible from a >> > > > physical point-of-view? What exact bus types are involved here? >> > > Two PCIe buses, two endpoints, two networking ports. It's one piece >> > >> > Isn't it only 1 networking port with multiple PFs? >> > >> > > of silicon, tho, so the "slices" can talk to each other internally. >> > > The NVRAM configuration tells both endpoints that the user wants >> > > them "bonded", when the PCI drivers probe they "find each other" >> > > using some cookie or DSN or whatnot. And once they did, they spawn >> > > a single netdev. >> > > >> > > > This "shouldn't" be possible as in the end, it's usually a PCI device >> > > > handling this all, right? >> > > It's really a special type of bonding of two netdevs. Like you'd bond >> > > two ports to get twice the bandwidth. With the twist that the balancing >> > > is done on NUMA proximity, rather than traffic hash. >> > > Well, plus, the major twist that it's all done magically "for you" >> > > in the vendor driver, and the two "lower" devices are not visible. >> > > You only see the resulting bond. >> > > I personally think that the magic hides as many problems as it >> > > introduces and we'd be better off creating two separate netdevs. >> > > And then a new type of "device bond" on top. Small win that >> > > the "new device bond on top" can be shared code across vendors. >> > >> > Yes. We have been exploring a small extension to bonding driver to enable >> > a single numa-aware multi-threaded application to efficiently utilize >> > multiple NICs across numa nodes. >> >> Is this referring to something like the multi-pf under >> discussion, or just generically with two arbitrary network devices >> installed one each per NUMA node? > >Normal network devices one per NUMA node > >> >> > Here is an early version of a patch we have been trying and seems to be >> > working well. >> > >> > ========================================================================= >> > bonding: select tx device based on rx device of a flow >> > >> > If napi_id is cached in the sk associated with skb, use the >> > device associated with napi_id as the transmit device. 
>> > >> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> >> > >> > diff --git a/drivers/net/bonding/bond_main.c >> > b/drivers/net/bonding/bond_main.c >> > index 7a7d584f378a..77e3bf6c4502 100644 >> > --- a/drivers/net/bonding/bond_main.c >> > +++ b/drivers/net/bonding/bond_main.c >> > @@ -5146,6 +5146,30 @@ static struct slave >> > *bond_xmit_3ad_xor_slave_get(struct bonding *bond, >> > unsigned int count; >> > u32 hash; >> > >> > + if (skb->sk) { >> > + int napi_id = skb->sk->sk_napi_id; >> > + struct net_device *dev; >> > + int idx; >> > + >> > + rcu_read_lock(); >> > + dev = dev_get_by_napi_id(napi_id); >> > + rcu_read_unlock(); >> > + >> > + if (!dev) >> > + goto hash; >> > + >> > + count = slaves ? READ_ONCE(slaves->count) : 0; >> > + if (unlikely(!count)) >> > + return NULL; >> > + >> > + for (idx = 0; idx < count; idx++) { >> > + slave = slaves->arr[idx]; >> > + if (slave->dev->ifindex == dev->ifindex) >> > + return slave; >> > + } >> > + } >> > + >> > +hash: >> > hash = bond_xmit_hash(bond, skb); >> > count = slaves ? READ_ONCE(slaves->count) : 0; >> > if (unlikely(!count)) >> > ========================================================================= >> > >> > If we make this as a configurable bonding option, would this be an >> > acceptable solution to accelerate numa-aware apps? >> >> Assuming for the moment this is for "regular" network devices >> installed one per NUMA node, why do this in bonding instead of at a >> higher layer (multiple subnets or ECMP, for example)? >> >> Is the intent here that the bond would aggregate its interfaces >> via LACP with the peer being some kind of cross-chassis link aggregation >> (MLAG, et al)? No. > >Yes. basic LACP bonding setup. There could be multiple peers connecting to >the server via switch providing LACP based link aggregation. No cross-chassis >MLAG. LACP does not make any sense, when you have only a single physical port. That applies to ECMP mentioned above too I believe.
On 2/23/2024 3:40 AM, Jiri Pirko wrote: > Fri, Feb 23, 2024 at 06:00:40AM CET, sridhar.samudrala@intel.com wrote: >> >> >> On 2/22/2024 8:05 PM, Jay Vosburgh wrote: >>> Samudrala, Sridhar <sridhar.samudrala@intel.com> wrote: >>>> On 2/22/2024 5:00 PM, Jakub Kicinski wrote: >>>>> On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote: >>>>>> On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote: >>>>>>> Greg, we have a feature here where a single device of class net has >>>>>>> multiple "bus parents". We used to have one attr under class net >>>>>>> (device) which is a link to the bus parent. Now we either need to add >>>>>>> more or not bother with the linking of the whole device. Is there any >>>>>>> precedent / preference for solving this from the device model >>>>>>> perspective? >>>>>> >>>>>> How, logically, can a netdevice be controlled properly from 2 parent >>>>>> devices on two different busses? How is that even possible from a >>>>>> physical point-of-view? What exact bus types are involved here? >>>>> Two PCIe buses, two endpoints, two networking ports. It's one piece >>>> >>>> Isn't it only 1 networking port with multiple PFs? >>>> >>>>> of silicon, tho, so the "slices" can talk to each other internally. >>>>> The NVRAM configuration tells both endpoints that the user wants >>>>> them "bonded", when the PCI drivers probe they "find each other" >>>>> using some cookie or DSN or whatnot. And once they did, they spawn >>>>> a single netdev. >>>>> >>>>>> This "shouldn't" be possible as in the end, it's usually a PCI device >>>>>> handling this all, right? >>>>> It's really a special type of bonding of two netdevs. Like you'd bond >>>>> two ports to get twice the bandwidth. With the twist that the balancing >>>>> is done on NUMA proximity, rather than traffic hash. >>>>> Well, plus, the major twist that it's all done magically "for you" >>>>> in the vendor driver, and the two "lower" devices are not visible. >>>>> You only see the resulting bond. >>>>> I personally think that the magic hides as many problems as it >>>>> introduces and we'd be better off creating two separate netdevs. >>>>> And then a new type of "device bond" on top. Small win that >>>>> the "new device bond on top" can be shared code across vendors. >>>> >>>> Yes. We have been exploring a small extension to bonding driver to enable >>>> a single numa-aware multi-threaded application to efficiently utilize >>>> multiple NICs across numa nodes. >>> >>> Is this referring to something like the multi-pf under >>> discussion, or just generically with two arbitrary network devices >>> installed one each per NUMA node? >> >> Normal network devices one per NUMA node >> >>> >>>> Here is an early version of a patch we have been trying and seems to be >>>> working well. >>>> >>>> ========================================================================= >>>> bonding: select tx device based on rx device of a flow >>>> >>>> If napi_id is cached in the sk associated with skb, use the >>>> device associated with napi_id as the transmit device. 
>>>> >>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> >>>> >>>> diff --git a/drivers/net/bonding/bond_main.c >>>> b/drivers/net/bonding/bond_main.c >>>> index 7a7d584f378a..77e3bf6c4502 100644 >>>> --- a/drivers/net/bonding/bond_main.c >>>> +++ b/drivers/net/bonding/bond_main.c >>>> @@ -5146,6 +5146,30 @@ static struct slave >>>> *bond_xmit_3ad_xor_slave_get(struct bonding *bond, >>>> unsigned int count; >>>> u32 hash; >>>> >>>> + if (skb->sk) { >>>> + int napi_id = skb->sk->sk_napi_id; >>>> + struct net_device *dev; >>>> + int idx; >>>> + >>>> + rcu_read_lock(); >>>> + dev = dev_get_by_napi_id(napi_id); >>>> + rcu_read_unlock(); >>>> + >>>> + if (!dev) >>>> + goto hash; >>>> + >>>> + count = slaves ? READ_ONCE(slaves->count) : 0; >>>> + if (unlikely(!count)) >>>> + return NULL; >>>> + >>>> + for (idx = 0; idx < count; idx++) { >>>> + slave = slaves->arr[idx]; >>>> + if (slave->dev->ifindex == dev->ifindex) >>>> + return slave; >>>> + } >>>> + } >>>> + >>>> +hash: >>>> hash = bond_xmit_hash(bond, skb); >>>> count = slaves ? READ_ONCE(slaves->count) : 0; >>>> if (unlikely(!count)) >>>> ========================================================================= >>>> >>>> If we make this as a configurable bonding option, would this be an >>>> acceptable solution to accelerate numa-aware apps? >>> >>> Assuming for the moment this is for "regular" network devices >>> installed one per NUMA node, why do this in bonding instead of at a >>> higher layer (multiple subnets or ECMP, for example)? >>> >>> Is the intent here that the bond would aggregate its interfaces >>> via LACP with the peer being some kind of cross-chassis link aggregation >>> (MLAG, et al)? > > No. > >> >> Yes. basic LACP bonding setup. There could be multiple peers connecting to >> the server via switch providing LACP based link aggregation. No cross-chassis >> MLAG. > > LACP does not make any sense, when you have only a single physical port. > That applies to ECMP mentioned above too I believe. I meant for the 2 regular NICs on 2 numa node setup, not for multi-PF 1 port setup.
Sat, Feb 24, 2024 at 12:56:52AM CET, sridhar.samudrala@intel.com wrote: > > >On 2/23/2024 3:40 AM, Jiri Pirko wrote: >> Fri, Feb 23, 2024 at 06:00:40AM CET, sridhar.samudrala@intel.com wrote: >> > >> > >> > On 2/22/2024 8:05 PM, Jay Vosburgh wrote: >> > > Samudrala, Sridhar <sridhar.samudrala@intel.com> wrote: >> > > > On 2/22/2024 5:00 PM, Jakub Kicinski wrote: >> > > > > On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote: >> > > > > > On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote: >> > > > > > > Greg, we have a feature here where a single device of class net has >> > > > > > > multiple "bus parents". We used to have one attr under class net >> > > > > > > (device) which is a link to the bus parent. Now we either need to add >> > > > > > > more or not bother with the linking of the whole device. Is there any >> > > > > > > precedent / preference for solving this from the device model >> > > > > > > perspective? >> > > > > > >> > > > > > How, logically, can a netdevice be controlled properly from 2 parent >> > > > > > devices on two different busses? How is that even possible from a >> > > > > > physical point-of-view? What exact bus types are involved here? >> > > > > Two PCIe buses, two endpoints, two networking ports. It's one piece >> > > > >> > > > Isn't it only 1 networking port with multiple PFs? >> > > > >> > > > > of silicon, tho, so the "slices" can talk to each other internally. >> > > > > The NVRAM configuration tells both endpoints that the user wants >> > > > > them "bonded", when the PCI drivers probe they "find each other" >> > > > > using some cookie or DSN or whatnot. And once they did, they spawn >> > > > > a single netdev. >> > > > > >> > > > > > This "shouldn't" be possible as in the end, it's usually a PCI device >> > > > > > handling this all, right? >> > > > > It's really a special type of bonding of two netdevs. Like you'd bond >> > > > > two ports to get twice the bandwidth. With the twist that the balancing >> > > > > is done on NUMA proximity, rather than traffic hash. >> > > > > Well, plus, the major twist that it's all done magically "for you" >> > > > > in the vendor driver, and the two "lower" devices are not visible. >> > > > > You only see the resulting bond. >> > > > > I personally think that the magic hides as many problems as it >> > > > > introduces and we'd be better off creating two separate netdevs. >> > > > > And then a new type of "device bond" on top. Small win that >> > > > > the "new device bond on top" can be shared code across vendors. >> > > > >> > > > Yes. We have been exploring a small extension to bonding driver to enable >> > > > a single numa-aware multi-threaded application to efficiently utilize >> > > > multiple NICs across numa nodes. >> > > >> > > Is this referring to something like the multi-pf under >> > > discussion, or just generically with two arbitrary network devices >> > > installed one each per NUMA node? >> > >> > Normal network devices one per NUMA node >> > >> > > >> > > > Here is an early version of a patch we have been trying and seems to be >> > > > working well. >> > > > >> > > > ========================================================================= >> > > > bonding: select tx device based on rx device of a flow >> > > > >> > > > If napi_id is cached in the sk associated with skb, use the >> > > > device associated with napi_id as the transmit device. 
>> > > > >> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> >> > > > >> > > > diff --git a/drivers/net/bonding/bond_main.c >> > > > b/drivers/net/bonding/bond_main.c >> > > > index 7a7d584f378a..77e3bf6c4502 100644 >> > > > --- a/drivers/net/bonding/bond_main.c >> > > > +++ b/drivers/net/bonding/bond_main.c >> > > > @@ -5146,6 +5146,30 @@ static struct slave >> > > > *bond_xmit_3ad_xor_slave_get(struct bonding *bond, >> > > > unsigned int count; >> > > > u32 hash; >> > > > >> > > > + if (skb->sk) { >> > > > + int napi_id = skb->sk->sk_napi_id; >> > > > + struct net_device *dev; >> > > > + int idx; >> > > > + >> > > > + rcu_read_lock(); >> > > > + dev = dev_get_by_napi_id(napi_id); >> > > > + rcu_read_unlock(); >> > > > + >> > > > + if (!dev) >> > > > + goto hash; >> > > > + >> > > > + count = slaves ? READ_ONCE(slaves->count) : 0; >> > > > + if (unlikely(!count)) >> > > > + return NULL; >> > > > + >> > > > + for (idx = 0; idx < count; idx++) { >> > > > + slave = slaves->arr[idx]; >> > > > + if (slave->dev->ifindex == dev->ifindex) >> > > > + return slave; >> > > > + } >> > > > + } >> > > > + >> > > > +hash: >> > > > hash = bond_xmit_hash(bond, skb); >> > > > count = slaves ? READ_ONCE(slaves->count) : 0; >> > > > if (unlikely(!count)) >> > > > ========================================================================= >> > > > >> > > > If we make this as a configurable bonding option, would this be an >> > > > acceptable solution to accelerate numa-aware apps? >> > > >> > > Assuming for the moment this is for "regular" network devices >> > > installed one per NUMA node, why do this in bonding instead of at a >> > > higher layer (multiple subnets or ECMP, for example)? >> > > >> > > Is the intent here that the bond would aggregate its interfaces >> > > via LACP with the peer being some kind of cross-chassis link aggregation >> > > (MLAG, et al)? >> >> No. >> >> > >> > Yes. basic LACP bonding setup. There could be multiple peers connecting to >> > the server via switch providing LACP based link aggregation. No cross-chassis >> > MLAG. >> >> LACP does not make any sense, when you have only a single physical port. >> That applies to ECMP mentioned above too I believe. > >I meant for the 2 regular NICs on 2 numa node setup, not for multi-PF 1 port >setup. Okay, not sure how it is related to this thread then :)
On Fri, 23 Feb 2024 10:36:25 +0100 Jiri Pirko wrote: > >> It's really a special type of bonding of two netdevs. Like you'd bond > >> two ports to get twice the bandwidth. With the twist that the balancing > >> is done on NUMA proximity, rather than traffic hash. > >> > >> Well, plus, the major twist that it's all done magically "for you" > >> in the vendor driver, and the two "lower" devices are not visible. > >> You only see the resulting bond. > >> > >> I personally think that the magic hides as many problems as it > >> introduces and we'd be better off creating two separate netdevs. > >> And then a new type of "device bond" on top. Small win that > >> the "new device bond on top" can be shared code across vendors. > > > >Yes. We have been exploring a small extension to bonding driver to enable a > >single numa-aware multi-threaded application to efficiently utilize multiple > >NICs across numa nodes. > > Bonding was my immediate response when we discussed this internally for > the first time. But I had to eventually admit it is probably not that > suitable in this case, here's why: > 1) there are no 2 physical ports, only one. Right, sorry, number of PFs matches number of ports for each bus. But it's not necessarily a deal breaker - it's similar to a multi-host device. We also have multiple netdevs and PCIe links, they just go to different host rather than different NUMA nodes on one host. > 2) it is basically a matter of device layout/provisioning that this > feature should be enabled, not user configuration. We can still auto-instantiate it, not a deal breaker. I'm not sure you're right in that assumption, tho. At Meta, we support container sizes ranging from few CPUs to multiple NUMA nodes. Each NUMA node may have it's own NIC, and the orchestration needs to stitch and un-stitch NICs depending on whether the cores were allocated to small containers or a huge one. So it would be _easier_ to deal with multiple netdevs. Orchestration layer already understands netdev <> NUMA mapping, it does not understand multi-NUMA netdevs, and how to match up queues to nodes. > 3) other subsystems like RDMA would benefit the same feature, so this > int not netdev specific in general. Yes, looks RDMA-centric. RDMA being infamously bonding-challenged. Anyway, back to the initial question - from Greg's reply I'm guessing there's no precedent for doing such things in the device model either. So we're on our own.
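For reference, the netdev <> NUMA mapping the orchestration layer relies on is visible through sysfs. A sketch assuming two NICs named eth0 and eth1, one per node; names and values are illustrative:

    $ cat /sys/class/net/eth0/device/numa_node
    0
    $ cat /sys/class/net/eth1/device/numa_node
    1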
Wed, Feb 28, 2024 at 03:06:19AM CET, kuba@kernel.org wrote: >On Fri, 23 Feb 2024 10:36:25 +0100 Jiri Pirko wrote: >> >> It's really a special type of bonding of two netdevs. Like you'd bond >> >> two ports to get twice the bandwidth. With the twist that the balancing >> >> is done on NUMA proximity, rather than traffic hash. >> >> >> >> Well, plus, the major twist that it's all done magically "for you" >> >> in the vendor driver, and the two "lower" devices are not visible. >> >> You only see the resulting bond. >> >> >> >> I personally think that the magic hides as many problems as it >> >> introduces and we'd be better off creating two separate netdevs. >> >> And then a new type of "device bond" on top. Small win that >> >> the "new device bond on top" can be shared code across vendors. >> > >> >Yes. We have been exploring a small extension to bonding driver to enable a >> >single numa-aware multi-threaded application to efficiently utilize multiple >> >NICs across numa nodes. >> >> Bonding was my immediate response when we discussed this internally for >> the first time. But I had to eventually admit it is probably not that >> suitable in this case, here's why: >> 1) there are no 2 physical ports, only one. > >Right, sorry, number of PFs matches number of ports for each bus. >But it's not necessarily a deal breaker - it's similar to a multi-host >device. We also have multiple netdevs and PCIe links, they just go to >different host rather than different NUMA nodes on one host. That is a different scenario. You have multiple hosts and a switch between them and the physical port. Yeah, it might be invisible switch, but there still is one. On DPU/smartnic, it is visible and configurable. > >> 2) it is basically a matter of device layout/provisioning that this >> feature should be enabled, not user configuration. > >We can still auto-instantiate it, not a deal breaker. "Auto-instantiate" in meating of userspace orchestration deamon, not kernel, that's what you mean? > >I'm not sure you're right in that assumption, tho. At Meta, we support >container sizes ranging from few CPUs to multiple NUMA nodes. Each NUMA >node may have it's own NIC, and the orchestration needs to stitch and >un-stitch NICs depending on whether the cores were allocated to small >containers or a huge one. Yeah, but still, there is one physical port for NIC-numanode pair. Correct? Does the orchestration setup a bond on top of them or some other master device or let the container use them independently? > >So it would be _easier_ to deal with multiple netdevs. Orchestration >layer already understands netdev <> NUMA mapping, it does not understand >multi-NUMA netdevs, and how to match up queues to nodes. > >> 3) other subsystems like RDMA would benefit the same feature, so this >> int not netdev specific in general. > >Yes, looks RDMA-centric. RDMA being infamously bonding-challenged. Not really. It's just needed to consider all usecases, not only netdev. > >Anyway, back to the initial question - from Greg's reply I'm guessing >there's no precedent for doing such things in the device model either. >So we're on our own.
On Wed, 28 Feb 2024 09:13:57 +0100 Jiri Pirko wrote: > >> 2) it is basically a matter of device layout/provisioning that this > >> feature should be enabled, not user configuration. > > > >We can still auto-instantiate it, not a deal breaker. > > "Auto-instantiate" in meating of userspace orchestration deamon, > not kernel, that's what you mean? Either kernel, or pass some hints to a user space agent, like networkd and have it handle the creation. We have precedent for "kernel side bonding" with the VF<>virtio bonding thing. > >I'm not sure you're right in that assumption, tho. At Meta, we support > >container sizes ranging from few CPUs to multiple NUMA nodes. Each NUMA > >node may have it's own NIC, and the orchestration needs to stitch and > >un-stitch NICs depending on whether the cores were allocated to small > >containers or a huge one. > > Yeah, but still, there is one physical port for NIC-numanode pair. Well, today there is. > Correct? Does the orchestration setup a bond on top of them or some other > master device or let the container use them independently? Just multi-nexthop routing and binding sockets to the netdev (with some BPF magic, I think). > >So it would be _easier_ to deal with multiple netdevs. Orchestration > >layer already understands netdev <> NUMA mapping, it does not understand > >multi-NUMA netdevs, and how to match up queues to nodes. > > > >> 3) other subsystems like RDMA would benefit the same feature, so this > >> int not netdev specific in general. > > > >Yes, looks RDMA-centric. RDMA being infamously bonding-challenged. > > Not really. It's just needed to consider all usecases, not only netdev. All use cases or lowest common denominator, depends on priorities.
On Wed, 28 Feb 2024 09:06:04 -0800 Jakub Kicinski wrote: > > >Yes, looks RDMA-centric. RDMA being infamously bonding-challenged. > > > > Not really. It's just needed to consider all usecases, not only netdev. > > All use cases or lowest common denominator, depends on priorities. To be clear, I'm not trying to shut down this proposal, I think both have disadvantages. This one is better for RDMA and iperf, the explicit netdevs are better for more advanced TCP apps. All I want is clear docs so users are not confused, and vendors don't diverge pointlessly.
Wed, Feb 28, 2024 at 06:06:04PM CET, kuba@kernel.org wrote: >On Wed, 28 Feb 2024 09:13:57 +0100 Jiri Pirko wrote: >> >> 2) it is basically a matter of device layout/provisioning that this >> >> feature should be enabled, not user configuration. >> > >> >We can still auto-instantiate it, not a deal breaker. >> >> "Auto-instantiate" in meating of userspace orchestration deamon, >> not kernel, that's what you mean? > >Either kernel, or pass some hints to a user space agent, like networkd >and have it handle the creation. We have precedent for "kernel side >bonding" with the VF<>virtio bonding thing. > >> >I'm not sure you're right in that assumption, tho. At Meta, we support >> >container sizes ranging from few CPUs to multiple NUMA nodes. Each NUMA >> >node may have it's own NIC, and the orchestration needs to stitch and >> >un-stitch NICs depending on whether the cores were allocated to small >> >containers or a huge one. >> >> Yeah, but still, there is one physical port for NIC-numanode pair. > >Well, today there is. > >> Correct? Does the orchestration setup a bond on top of them or some other >> master device or let the container use them independently? > >Just multi-nexthop routing and binding sockets to the netdev (with >some BPF magic, I think). Yeah, so basically 2 independent ports, 2 netdevices working independently. Not sure I see the parallel to the subject we discuss here :/ > >> >So it would be _easier_ to deal with multiple netdevs. Orchestration >> >layer already understands netdev <> NUMA mapping, it does not understand >> >multi-NUMA netdevs, and how to match up queues to nodes. >> > >> >> 3) other subsystems like RDMA would benefit the same feature, so this >> >> int not netdev specific in general. >> > >> >Yes, looks RDMA-centric. RDMA being infamously bonding-challenged. >> >> Not really. It's just needed to consider all usecases, not only netdev. > >All use cases or lowest common denominator, depends on priorities.
On Thu, 29 Feb 2024 09:21:26 +0100 Jiri Pirko wrote: > >> Correct? Does the orchestration setup a bond on top of them or some other > >> master device or let the container use them independently? > > > >Just multi-nexthop routing and binding sockets to the netdev (with > >some BPF magic, I think). > > Yeah, so basically 2 independent ports, 2 netdevices working > independently. Not sure I see the parallel to the subject we discuss > here :/ From the user's perspective it's almost exactly the same. User wants NUMA nodes to have a way to reach the network without crossing the interconnect. Whether you do that with 2 200G NICs or 1 400G NIC connected to two nodes is an implementation detail.
On 28 Feb 09:43, Jakub Kicinski wrote:
>On Wed, 28 Feb 2024 09:06:04 -0800 Jakub Kicinski wrote:
>> > >Yes, looks RDMA-centric. RDMA being infamously bonding-challenged.
>> >
>> > Not really. It's just needed to consider all usecases, not only netdev.
>>
>> All use cases or lowest common denominator, depends on priorities.
>
>To be clear, I'm not trying to shut down this proposal, I think both
>have disadvantages. This one is better for RDMA and iperf, the explicit
>netdevs are better for more advanced TCP apps. All I want is clear docs
>so users are not confused, and vendors don't diverge pointlessly.

Just posted v4 with updated documentation that should cover the basic feature, which we believe is
the minimum that all vendors should implement. The mlx5 implementation won't change much if we
later decide to move to some sort of a "generic netdev" interface. We don't agree it should be a
new kind of bond, as bond was meant for actual link aggregation of multi-port devices. Either way,
the mlx5 implementation will remain the same regardless of any future extension of the feature, and
the defaults are well documented and carefully selected to match user expectations.
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 69f3d6dcd9fd..473d72c36d61 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -74,6 +74,7 @@ Contents:
    mpls-sysctl
    mptcp-sysctl
    multiqueue
+   multi-pf-netdev
    napi
    net_cachelines/index
    netconsole
diff --git a/Documentation/networking/multi-pf-netdev.rst b/Documentation/networking/multi-pf-netdev.rst
new file mode 100644
index 000000000000..6ef2ac448d1e
--- /dev/null
+++ b/Documentation/networking/multi-pf-netdev.rst
@@ -0,0 +1,157 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===============
+Multi-PF Netdev
+===============
+
+Contents
+========
+
+- `Background`_
+- `Overview`_
+- `mlx5 implementation`_
+- `Channels distribution`_
+- `Topology`_
+- `Steering`_
+- `Mutually exclusive features`_
+
+Background
+==========
+
+The advanced Multi-PF NIC technology enables several CPUs within a multi-socket server to
+connect directly to the network, each through its own dedicated PCIe interface, either via a
+connection harness that splits the PCIe lanes between two cards or by bifurcating a PCIe slot
+for a single card. This eliminates the network traffic that would otherwise traverse the internal
+bus between the sockets, significantly reducing overhead and latency, in addition to reducing CPU
+utilization and increasing network throughput.
+
+Overview
+========
+
+This feature adds support for combining multiple devices (PFs) of the same port in a Multi-PF
+environment under one netdev instance. Passing traffic through different devices belonging to
+different NUMA sockets saves cross-NUMA traffic and allows applications running on the same netdev
+from different NUMA nodes to still feel a sense of proximity to the device and achieve improved
+performance.
+
+mlx5 implementation
+===================
+
+Multi-PF, or Socket-Direct, in mlx5 is achieved by grouping together PFs that belong to the same
+NIC and have the socket-direct property enabled. Once all PFs are probed, we create a single netdev
+to represent all of them; symmetrically, we destroy the netdev whenever any of the PFs is removed.
+
+The netdev network channels are distributed between all devices; a proper configuration utilizes
+the closest NUMA node when working with a certain application/CPU.
+
+We pick one PF to be a primary (leader), and it fills a special role. The other devices
+(secondaries) are disconnected from the network at the chip level (set to silent mode). In silent
+mode, no south <-> north traffic flows directly through a secondary PF. It needs the assistance of
+the leader PF (east <-> west traffic) to function. All RX/TX traffic is steered through the primary
+to/from the secondaries.
+
+Currently, we limit the support to PFs only, and up to two PFs (sockets).
+
+Channels distribution
+=====================
+
+We distribute the channels between the different PFs to achieve local NUMA node performance
+on multiple NUMA nodes.
+
+Each combined channel works against one specific PF, creating all its datapath queues against it.
+We distribute channels to PFs using a round-robin policy.
+
+::
+
+  Example for 2 PFs and 6 channels:
+
+  +--------+--------+
+  | ch idx | PF idx |
+  +--------+--------+
+  |   0    |   0    |
+  |   1    |   1    |
+  |   2    |   0    |
+  |   3    |   1    |
+  |   4    |   0    |
+  |   5    |   1    |
+  +--------+--------+
+
+
+We prefer this round-robin distribution policy over another suggested intuitive distribution, in
+which we first distribute one half of the channels to PF0 and then the second half to PF1.
+
+The reason we prefer round-robin is that it is less influenced by changes in the number of
+channels. The mapping between a channel index and a PF is fixed, no matter how many channels the
+user configures. As the channel stats are persistent across channel closure, changing the mapping
+every single time would make the accumulated stats less representative of the channel's history.
+
+This is achieved by using the correct core device instance (mdev) in each channel, instead of all
+of them using the same instance under "priv->mdev".
+
+Topology
+========
+Currently the sysfs is kept untouched, letting the netdev sysfs point to its primary PF.
+Enhancing sysfs to reflect the actual topology is to be discussed and contributed separately.
+For now, debugfs is used to reflect the topology:
+
+.. code-block:: bash
+
+  $ grep -H . /sys/kernel/debug/mlx5/0000\:08\:00.0/sd/*
+  /sys/kernel/debug/mlx5/0000:08:00.0/sd/group_id:0x00000101
+  /sys/kernel/debug/mlx5/0000:08:00.0/sd/primary:0000:08:00.0 vhca 0x0
+  /sys/kernel/debug/mlx5/0000:08:00.0/sd/secondary_0:0000:09:00.0 vhca 0x2
+
+Steering
+========
+Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.
+
+In RX, the steering tables belong to the primary PF only, and it is its role to distribute incoming
+traffic to the other PFs, via cross-vhca steering capabilities. There is nothing special about the
+RSS table content, except that it needs a capable device to point to the receive queues of a
+different PF.
+
+In TX, the primary PF creates a new TX flow table, which is aliased by the secondaries, so they can
+go out to the network through it.
+
+In addition, we set a default XPS configuration that, based on the CPU, selects an SQ belonging to
+the PF on the same NUMA node as that CPU.
+
+XPS default config example:
+
+NUMA node(s): 2
+NUMA node0 CPU(s): 0-11
+NUMA node1 CPU(s): 12-23
+
+PF0 on node0, PF1 on node1.
+ +- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001 +- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000 +- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002 +- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000 +- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004 +- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000 +- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008 +- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000 +- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010 +- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000 +- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020 +- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000 +- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040 +- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000 +- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080 +- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000 +- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100 +- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000 +- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200 +- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000 +- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400 +- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000 +- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800 +- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000 + +Mutually exclusive features +=========================== + +The nature of Multi-PF, where different channels work with different PFs, conflicts with +stateful features where the state is maintained in one of the PFs. +For example, in the TLS device-offload feature, special context objects are created per connection +and maintained in the PF. Transitioning between different RQs/SQs would break the feature. Hence, +we disable this combination for now.
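The Topology and Steering sections above can be checked on a live system with a short sequence like
the sketch below. It assumes the eth2 interface name and the 0000:08:00.0 / 0000:09:00.0 PCI
addresses from the documentation examples, and that numactl and iperf3 are available; the server
address is a placeholder:

    # Which PFs were grouped under the single netdev (primary PF's debugfs view):
    grep -H . /sys/kernel/debug/mlx5/0000:08:00.0/sd/*

    # NUMA node each PF is attached to:
    cat /sys/bus/pci/devices/0000:08:00.0/numa_node
    cat /sys/bus/pci/devices/0000:09:00.0/numa_node

    # Default XPS bitmaps installed by the driver, one per TX queue:
    grep -H . /sys/class/net/eth2/queues/tx-*/xps_cpus

    # Run a workload pinned to NUMA node 1 so XPS selects SQs of the node-local PF:
    numactl --cpunodebind=1 --membind=1 iperf3 -c 203.0.113.1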