
[net-next,V3,15/15] Documentation: networking: Add description for multi-pf netdev

Message ID 20240215030814.451812-16-saeed@kernel.org (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series [net-next,V3,01/15] net/mlx5: Add MPIR bit in mcam_access_reg

Checks

Context Check Description
netdev/series_format success Pull request is its own cover letter
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 8 this patch: 8
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 2 maintainers not CCed: linux-doc@vger.kernel.org corbet@lwn.net
netdev/build_clang success Errors and warnings before: 1007 this patch: 1007
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 8 this patch: 8
netdev/checkpatch warning WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-02-15--18-00 (tests: 1441)

Commit Message

Saeed Mahameed Feb. 15, 2024, 3:08 a.m. UTC
From: Tariq Toukan <tariqt@nvidia.com>

Add documentation for the multi-pf netdev feature.
Describe the mlx5 implementation and design decisions.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 Documentation/networking/index.rst           |   1 +
 Documentation/networking/multi-pf-netdev.rst | 157 +++++++++++++++++++
 2 files changed, 158 insertions(+)
 create mode 100644 Documentation/networking/multi-pf-netdev.rst

Comments

Jakub Kicinski Feb. 16, 2024, 5:23 a.m. UTC | #1
On Wed, 14 Feb 2024 19:08:14 -0800 Saeed Mahameed wrote:
> +The advanced Multi-PF NIC technology enables several CPUs within a multi-socket server to

There are multiple devlink instances, right?
In that case we should call out that there may be more than one.

> +Currently the sysfs is kept untouched, letting the netdev sysfs point to its primary PF.
> +Enhancing sysfs to reflect the actual topology is to be discussed and contributed separately.

I don't anticipate it to be particularly hard, let's not merge
half-baked code and force users to grow workarounds that are hard 
to remove.

Also could you add examples of how the queue and napis look when listed
via the netdev genl on these devices?
Tariq Toukan Feb. 19, 2024, 3:26 p.m. UTC | #2
On 16/02/2024 7:23, Jakub Kicinski wrote:
> On Wed, 14 Feb 2024 19:08:14 -0800 Saeed Mahameed wrote:
>> +The advanced Multi-PF NIC technology enables several CPUs within a multi-socket server to

Hi Jakub,

> 
> There are multiple devlink instances, right?

Right.

> In that case we should call out that there may be more than one.
> 

We are combining the PFs in the netdev level.
I did not focus on the parts that we do not touch.
That's why I didn't mention the sysfs for example, until you asked.

For example, IRQs for the two PFs are still reachable, as they used to be, 
under two distinct paths:
ll /sys/bus/pci/devices/0000\:08\:00.0/msi_irqs/
ll /sys/bus/pci/devices/0000\:09\:00.0/msi_irqs/
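
The NUMA node of each PF can be checked the same way, through the standard 
PCI sysfs attribute (same BDFs as in the example above):
cat /sys/bus/pci/devices/0000\:08\:00.0/numa_node
cat /sys/bus/pci/devices/0000\:09\:00.0/numa_node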

>> +Currently the sysfs is kept untouched, letting the netdev sysfs point to its primary PF.
>> +Enhancing sysfs to reflect the actual topology is to be discussed and contributed separately.
> 
> I don't anticipate it to be particularly hard, let's not merge
> half-baked code and force users to grow workarounds that are hard
> to remove.
> 

Changing sysfs to expose queues from multiple PFs under one path might 
be misleading and break backward compatibility. IMO it should come as an 
extension to the existing entries.

Anyway, the interesting info exposed in sysfs is now available through 
the netdev genl.

Now, is this sysfs part integral to the feature? IMO, no. This in-driver 
feature is large enough to be completed in stages and not as a one shot.

> Also could you add examples of how the queue and napis look when listed
> via the netdev genl on these devices?
> 

Sure. Example for a 24-core system:

$ ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml --dump queue-get --json '{"ifindex": 5}'
[{'id': 0, 'ifindex': 5, 'napi-id': 539, 'type': 'rx'},
  {'id': 1, 'ifindex': 5, 'napi-id': 540, 'type': 'rx'},
  {'id': 2, 'ifindex': 5, 'napi-id': 541, 'type': 'rx'},
  {'id': 3, 'ifindex': 5, 'napi-id': 542, 'type': 'rx'},
  {'id': 4, 'ifindex': 5, 'napi-id': 543, 'type': 'rx'},
  {'id': 5, 'ifindex': 5, 'napi-id': 544, 'type': 'rx'},
  {'id': 6, 'ifindex': 5, 'napi-id': 545, 'type': 'rx'},
  {'id': 7, 'ifindex': 5, 'napi-id': 546, 'type': 'rx'},
  {'id': 8, 'ifindex': 5, 'napi-id': 547, 'type': 'rx'},
  {'id': 9, 'ifindex': 5, 'napi-id': 548, 'type': 'rx'},
  {'id': 10, 'ifindex': 5, 'napi-id': 549, 'type': 'rx'},
  {'id': 11, 'ifindex': 5, 'napi-id': 550, 'type': 'rx'},
  {'id': 12, 'ifindex': 5, 'napi-id': 551, 'type': 'rx'},
  {'id': 13, 'ifindex': 5, 'napi-id': 552, 'type': 'rx'},
  {'id': 14, 'ifindex': 5, 'napi-id': 553, 'type': 'rx'},
  {'id': 15, 'ifindex': 5, 'napi-id': 554, 'type': 'rx'},
  {'id': 16, 'ifindex': 5, 'napi-id': 555, 'type': 'rx'},
  {'id': 17, 'ifindex': 5, 'napi-id': 556, 'type': 'rx'},
  {'id': 18, 'ifindex': 5, 'napi-id': 557, 'type': 'rx'},
  {'id': 19, 'ifindex': 5, 'napi-id': 558, 'type': 'rx'},
  {'id': 20, 'ifindex': 5, 'napi-id': 559, 'type': 'rx'},
  {'id': 21, 'ifindex': 5, 'napi-id': 560, 'type': 'rx'},
  {'id': 22, 'ifindex': 5, 'napi-id': 561, 'type': 'rx'},
  {'id': 23, 'ifindex': 5, 'napi-id': 562, 'type': 'rx'},
  {'id': 0, 'ifindex': 5, 'napi-id': 539, 'type': 'tx'},
  {'id': 1, 'ifindex': 5, 'napi-id': 540, 'type': 'tx'},
  {'id': 2, 'ifindex': 5, 'napi-id': 541, 'type': 'tx'},
  {'id': 3, 'ifindex': 5, 'napi-id': 542, 'type': 'tx'},
  {'id': 4, 'ifindex': 5, 'napi-id': 543, 'type': 'tx'},
  {'id': 5, 'ifindex': 5, 'napi-id': 544, 'type': 'tx'},
  {'id': 6, 'ifindex': 5, 'napi-id': 545, 'type': 'tx'},
  {'id': 7, 'ifindex': 5, 'napi-id': 546, 'type': 'tx'},
  {'id': 8, 'ifindex': 5, 'napi-id': 547, 'type': 'tx'},
  {'id': 9, 'ifindex': 5, 'napi-id': 548, 'type': 'tx'},
  {'id': 10, 'ifindex': 5, 'napi-id': 549, 'type': 'tx'},
  {'id': 11, 'ifindex': 5, 'napi-id': 550, 'type': 'tx'},
  {'id': 12, 'ifindex': 5, 'napi-id': 551, 'type': 'tx'},
  {'id': 13, 'ifindex': 5, 'napi-id': 552, 'type': 'tx'},
  {'id': 14, 'ifindex': 5, 'napi-id': 553, 'type': 'tx'},
  {'id': 15, 'ifindex': 5, 'napi-id': 554, 'type': 'tx'},
  {'id': 16, 'ifindex': 5, 'napi-id': 555, 'type': 'tx'},
  {'id': 17, 'ifindex': 5, 'napi-id': 556, 'type': 'tx'},
  {'id': 18, 'ifindex': 5, 'napi-id': 557, 'type': 'tx'},
  {'id': 19, 'ifindex': 5, 'napi-id': 558, 'type': 'tx'},
  {'id': 20, 'ifindex': 5, 'napi-id': 559, 'type': 'tx'},
  {'id': 21, 'ifindex': 5, 'napi-id': 560, 'type': 'tx'},
  {'id': 22, 'ifindex': 5, 'napi-id': 561, 'type': 'tx'},
  {'id': 23, 'ifindex': 5, 'napi-id': 562, 'type': 'tx'}]

$ ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml --dump napi-get --json='{"ifindex": 5}'
[{'id': 562, 'ifindex': 5, 'irq': 84},
  {'id': 561, 'ifindex': 5, 'irq': 83},
  {'id': 560, 'ifindex': 5, 'irq': 82},
  {'id': 559, 'ifindex': 5, 'irq': 81},
  {'id': 558, 'ifindex': 5, 'irq': 80},
  {'id': 557, 'ifindex': 5, 'irq': 79},
  {'id': 556, 'ifindex': 5, 'irq': 78},
  {'id': 555, 'ifindex': 5, 'irq': 77},
  {'id': 554, 'ifindex': 5, 'irq': 76},
  {'id': 553, 'ifindex': 5, 'irq': 75},
  {'id': 552, 'ifindex': 5, 'irq': 74},
  {'id': 551, 'ifindex': 5, 'irq': 73},
  {'id': 550, 'ifindex': 5, 'irq': 72},
  {'id': 549, 'ifindex': 5, 'irq': 71},
  {'id': 548, 'ifindex': 5, 'irq': 70},
  {'id': 547, 'ifindex': 5, 'irq': 69},
  {'id': 546, 'ifindex': 5, 'irq': 68},
  {'id': 545, 'ifindex': 5, 'irq': 67},
  {'id': 544, 'ifindex': 5, 'irq': 66},
  {'id': 543, 'ifindex': 5, 'irq': 65},
  {'id': 542, 'ifindex': 5, 'irq': 64},
  {'id': 541, 'ifindex': 5, 'irq': 63},
  {'id': 540, 'ifindex': 5, 'irq': 39},
  {'id': 539, 'ifindex': 5, 'irq': 36}]
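
And to go one step further, the standard procfs entries can be used to see 
where each of those IRQs is bound (a sketch using IRQ 84 from the dump above; 
the actual affinity values depend on the system configuration):

$ cat /proc/irq/84/smp_affinity_list
$ cat /proc/irq/84/effective_affinity_list
$ cat /proc/irq/84/node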
Jiri Pirko Feb. 19, 2024, 6:04 p.m. UTC | #3
Thu, Feb 15, 2024 at 04:08:14AM CET, saeed@kernel.org wrote:
>From: Tariq Toukan <tariqt@nvidia.com>
>
>Add documentation for the multi-pf netdev feature.
>Describe the mlx5 implementation and design decisions.
>
>Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
>---
> Documentation/networking/index.rst           |   1 +
> Documentation/networking/multi-pf-netdev.rst | 157 +++++++++++++++++++
> 2 files changed, 158 insertions(+)
> create mode 100644 Documentation/networking/multi-pf-netdev.rst
>
>diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
>index 69f3d6dcd9fd..473d72c36d61 100644
>--- a/Documentation/networking/index.rst
>+++ b/Documentation/networking/index.rst
>@@ -74,6 +74,7 @@ Contents:
>    mpls-sysctl
>    mptcp-sysctl
>    multiqueue
>+   multi-pf-netdev
>    napi
>    net_cachelines/index
>    netconsole
>diff --git a/Documentation/networking/multi-pf-netdev.rst b/Documentation/networking/multi-pf-netdev.rst
>new file mode 100644
>index 000000000000..6ef2ac448d1e
>--- /dev/null
>+++ b/Documentation/networking/multi-pf-netdev.rst
>@@ -0,0 +1,157 @@
>+.. SPDX-License-Identifier: GPL-2.0
>+.. include:: <isonum.txt>
>+
>+===============
>+Multi-PF Netdev
>+===============
>+
>+Contents
>+========
>+
>+- `Background`_
>+- `Overview`_
>+- `mlx5 implementation`_
>+- `Channels distribution`_
>+- `Topology`_
>+- `Steering`_
>+- `Mutually exclusive features`_
>+
>+Background
>+==========
>+
>+The advanced Multi-PF NIC technology enables several CPUs within a multi-socket server to
>+connect directly to the network, each through its own dedicated PCIe interface. This is achieved
>+either through a connection harness that splits the PCIe lanes between two cards or by bifurcating
>+a PCIe slot for a single card. As a result, network traffic no longer traverses the internal bus
>+between the sockets, which significantly reduces overhead and latency, in addition to reducing CPU
>+utilization and increasing network throughput.
>+
>+Overview
>+========
>+
>+This feature adds support for combining multiple devices (PFs) of the same port in a Multi-PF
>+environment under one netdev instance. Passing traffic through different devices belonging to
>+different NUMA sockets avoids cross-NUMA traffic, and allows apps running on the same netdev from
>+different NUMA nodes to still feel a sense of proximity to the device and achieve improved performance.
>+
>+mlx5 implementation
>+===================
>+
>+Multi-PF or Socket-direct in mlx5 is achieved by grouping together PFs that belong to the same
>+NIC and have the socket-direct property enabled. Once all PFs are probed, we create a single netdev

How do you enable this property?


>+to represent all of them. Symmetrically, we destroy the netdev whenever any of the PFs is removed.
>+
>+The netdev network channels are distributed between all devices; a proper configuration utilizes the
>+NUMA node closest to the CPU a given application runs on.
>+
>+We pick one PF to be a primary (leader), and it fills a special role. The other devices
>+(secondaries) are disconnected from the network at the chip level (set to silent mode). In silent
>+mode, no south <-> north traffic flows directly through a secondary PF. It needs the assistance of
>+the leader PF (east <-> west traffic) to function. All RX/TX traffic is steered through the primary
>+to/from the secondaries.
>+
>+Currently, we limit the support to PFs only, and up to two PFs (sockets).

For the record, could you please describe why exactly you didn't use
drivers/base/component.c infrastructure for this? I know you told me,
but I don't recall. Better to have this written down, I believe.


>+
>+Channels distribution
>+=====================
>+
>+We distribute the channels between the different PFs to achieve local NUMA node performance
>+on multiple NUMA nodes.
>+
>+Each combined channel works against one specific PF, creating all its datapath queues against it. We distribute
>+channels to PFs using a round-robin policy.
>+
>+::
>+
>+        Example for 2 PFs and 6 channels:
>+        +--------+--------+
>+        | ch idx | PF idx |
>+        +--------+--------+
>+        |    0   |    0   |
>+        |    1   |    1   |
>+        |    2   |    0   |
>+        |    3   |    1   |
>+        |    4   |    0   |
>+        |    5   |    1   |
>+        +--------+--------+
>+
>+
>+We prefer this round-robin distribution policy over another suggested intuitive distribution, in
>+which we first distribute one half of the channels to PF0 and then the second half to PF1.
>+
>+The reason we prefer round-robin is that it is less influenced by changes in the number of channels. The
>+mapping between a channel index and a PF is fixed, no matter how many channels the user configures.
>+As the channel stats are persistent across a channel's closure, changing the mapping every time the
>+channel count changes would make the accumulated stats less representative of the channel's history.
>+
>+This is achieved by using the correct core device instance (mdev) in each channel, instead of them
>+all using the same instance under "priv->mdev".
>+
>+Topology
>+========
>+Currently the sysfs is kept untouched, letting the netdev sysfs point to its primary PF.
>+Enhancing sysfs to reflect the actual topology is to be discussed and contributed separately.
>+For now, debugfs is being used to reflect the topology:
>+
>+.. code-block:: bash
>+
>+        $ grep -H . /sys/kernel/debug/mlx5/0000\:08\:00.0/sd/*
>+        /sys/kernel/debug/mlx5/0000:08:00.0/sd/group_id:0x00000101
>+        /sys/kernel/debug/mlx5/0000:08:00.0/sd/primary:0000:08:00.0 vhca 0x0
>+        /sys/kernel/debug/mlx5/0000:08:00.0/sd/secondary_0:0000:09:00.0 vhca 0x2

Ugh :/

SD is something that is likely going to stay with us for some time.
Can't we have some proper UAPI instead of this? IDK.


>+
>+Steering
>+========
>+Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.
>+
>+In RX, the steering tables belong to the primary PF only, and it is its role to distribute incoming
>+traffic to other PFs, via cross-vhca steering capabilities. There is nothing special about the RSS table
>+content, except that it needs a capable device to point to the receive queues of a different PF.
>+
>+In TX, the primary PF creates a new TX flow table, which is aliased by the secondaries, so they can
>+go out to the network through it.
>+
>+In addition, we set a default XPS configuration that, based on the CPU, selects an SQ belonging to
>+the PF on the same node as that CPU.
>+
>+XPS default config example:
>+
>+NUMA node(s):          2
>+NUMA node0 CPU(s):     0-11
>+NUMA node1 CPU(s):     12-23

How can user know which queue is bound to which cpu?


>+
>+PF0 on node0, PF1 on node1.
>+
>+- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001
>+- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000
>+- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002
>+- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000
>+- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004
>+- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000
>+- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008
>+- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000
>+- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010
>+- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000
>+- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020
>+- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000
>+- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040
>+- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000
>+- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080
>+- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000
>+- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100
>+- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000
>+- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200
>+- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000
>+- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400
>+- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000
>+- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800
>+- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000
>+
>+Mutually exclusive features
>+===========================
>+
>+The nature of Multi-PF, where different channels work with different PFs, conflicts with
>+stateful features where the state is maintained in one of the PFs.
>+For example, in the TLS device-offload feature, special context objects are created per connection
>+and maintained in the PF.  Transitioning between different RQs/SQs would break the feature. Hence,
>+we disable this combination for now.
>-- 
>2.43.0
>
>
Jakub Kicinski Feb. 21, 2024, 1:33 a.m. UTC | #4
On Mon, 19 Feb 2024 17:26:36 +0200 Tariq Toukan wrote:
> > There are multiple devlink instances, right?  
> 
> Right.

Just to be clear I'm asking you questions about things which need to 
be covered by the doc :)

> > In that case we should call out that there may be more than one.
> >   
> 
> We are combining the PFs in the netdev level.
> I did not focus on the parts that we do not touch.

Sure but one of the goals here is to drive convergence.
So if another vendor is on the fence let's nudge them towards the same
decision.

> That's why I didn't mention the sysfs for example, until you asked.
> 
> For example, IRQs for the two PFs are still reachable, as they used to be, 
> under two distinct paths:
> ll /sys/bus/pci/devices/0000\:08\:00.0/msi_irqs/
> ll /sys/bus/pci/devices/0000\:09\:00.0/msi_irqs/
> 
> >> +Currently the sysfs is kept untouched, letting the netdev sysfs point to its primary PF.
> >> +Enhancing sysfs to reflect the actual topology is to be discussed and contributed separately.  
> > 
> > I don't anticipate it to be particularly hard, let's not merge
> > half-baked code and force users to grow workarounds that are hard
> > to remove.
> 
> Changing sysfs to expose queues from multiple PFs under one path might 
> be misleading and break backward compatibility. IMO it should come as an 
> extension to the existing entries.

I don't know what "multiple PFs under one path" means, links in VFs are
one to one, right? :)

> Anyway, the interesting info exposed in sysfs is now available through 
> the netdev genl.

Right, that's true.

Greg, we have a feature here where a single device of class net has
multiple "bus parents". We used to have one attr under class net
(device) which is a link to the bus parent. Now we either need to add
more or not bother with the linking of the whole device. Is there any
precedent / preference for solving this from the device model
perspective?

> Now, is this sysfs part integral to the feature? IMO, no. This in-driver 
> feature is large enough to be completed in stages and not as a one shot.

It's not a question of size and/or implementing everything.
What I want to make sure is that you surveyed the known user space
implementations sufficiently to know what looks at those links,
and perhaps ethtool -i.
Perhaps the answer is indeed "nothing much will care" and given
we can link IRQs correctly we put that as a conclusion in the doc.

Saying "sysfs is coming soon" is not adding much information :(

> > Also could you add examples of how the queue and napis look when listed
> > via the netdev genl on these devices?
> >   
> 
> Sure. Example for a 24-core system:

Could you reconfigure to 5 channels to make the output asymmetric and
shorter and include the example in the doc?
Saeed Mahameed Feb. 21, 2024, 2:10 a.m. UTC | #5
On 20 Feb 17:33, Jakub Kicinski wrote:
>On Mon, 19 Feb 2024 17:26:36 +0200 Tariq Toukan wrote:
>> > There are multiple devlink instances, right?
>>
>> Right.
>
>Just to be clear I'm asking you questions about things which need to
>be covered by the doc :)
>
>> > In that case we should call out that there may be more than one.
>> >
>>
>> We are combining the PFs in the netdev level.
>> I did not focus on the parts that we do not touch.
>
>> Anyway, the interesting info exposed in sysfs is now available through
>> the netdev genl.
>
>Right, that's true.
>

[...]

>Greg, we have a feature here where a single device of class net has
>multiple "bus parents". We used to have one attr under class net
>(device) which is a link to the bus parent. Now we either need to add
>more or not bother with the linking of the whole device. Is there any
>precedent / preference for solving this from the device model
>perspective?
>
>> Now, is this sysfs part integral to the feature? IMO, no. This in-driver
>> feature is large enough to be completed in stages and not as a one shot.
>
>It's not a question of size and/or implementing everything.
>What I want to make sure is that you surveyed the known user space
>implementations sufficiently to know what looks at those links,
>and perhaps ethtool -i.
>Perhaps the answer is indeed "nothing much will care" and given
>we can link IRQs correctly we put that as a conclusion in the doc.
>
>Saying "sysfs is coming soon" is not adding much information :(
>

Linking multiple parent devices at the netdev subsystem level doesn't add
anything; the netdev abstraction should stop at linking rx/tx channels to
physical IRQs and NUMA nodes. Complicating the sysfs would require a proper
infrastructure to model the multi-PF mode for all vendors to use uniformly,
but for what? Currently there's no configuration mechanism for this feature
yet, and we don't need one at the moment. Once configuration becomes necessary,
I would recommend adding one infrastructure for all vendors to register to
at the parent device level, which would handle the sysfs/devlink abstraction,
leave the netdev abstraction as is (IRQ/NUMA), and maybe take this a step
further and give the user control of attaching specific channels to specific
IRQs/NUMA nodes.
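
Some of that control exists already through the standard IRQ affinity knobs, 
e.g. (IRQ number taken from the napi dump earlier in the thread, purely 
illustrative):

$ echo 12-23 > /proc/irq/84/smp_affinity_list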
Greg KH Feb. 22, 2024, 7:51 a.m. UTC | #6
On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote:
> Greg, we have a feature here where a single device of class net has
> multiple "bus parents". We used to have one attr under class net
> (device) which is a link to the bus parent. Now we either need to add
> more or not bother with the linking of the whole device. Is there any
> precedent / preference for solving this from the device model
> perspective?

How, logically, can a netdevice be controlled properly from 2 parent
devices on two different busses?  How is that even possible from a
physical point-of-view?  What exact bus types are involved here?

This "shouldn't" be possible as in the end, it's usually a PCI device
handling this all, right?

thanks,

greg k-h
Jakub Kicinski Feb. 22, 2024, 11 p.m. UTC | #7
On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote:
> On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote:
> > Greg, we have a feature here where a single device of class net has
> > multiple "bus parents". We used to have one attr under class net
> > (device) which is a link to the bus parent. Now we either need to add
> > more or not bother with the linking of the whole device. Is there any
> > precedent / preference for solving this from the device model
> > perspective?  
> 
> How, logically, can a netdevice be controlled properly from 2 parent
> devices on two different busses?  How is that even possible from a
> physical point-of-view?  What exact bus types are involved here?

Two PCIe buses, two endpoints, two networking ports. It's one piece
of silicon, tho, so the "slices" can talk to each other internally.
The NVRAM configuration tells both endpoints that the user wants
them "bonded", when the PCI drivers probe they "find each other"
using some cookie or DSN or whatnot. And once they did, they spawn
a single netdev.
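
(For reference, the DSN is visible with plain lspci, e.g. on one of the 
endpoints from the examples above, BDF purely illustrative:

$ lspci -s 08:00.0 -vv | grep "Device Serial Number"

so the pairing is discoverable from user space too.)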

> This "shouldn't" be possible as in the end, it's usually a PCI device
> handling this all, right?

It's really a special type of bonding of two netdevs. Like you'd bond
two ports to get twice the bandwidth. With the twist that the balancing
is done on NUMA proximity, rather than traffic hash.

Well, plus, the major twist that it's all done magically "for you"
in the vendor driver, and the two "lower" devices are not visible.
You only see the resulting bond.

I personally think that the magic hides as many problems as it
introduces and we'd be better off creating two separate netdevs.
And then a new type of "device bond" on top. Small win that
the "new device bond on top" can be shared code across vendors.

But there's only so many hours in the day to argue with vendors.
Samudrala, Sridhar Feb. 23, 2024, 1:23 a.m. UTC | #8
On 2/22/2024 5:00 PM, Jakub Kicinski wrote:
> On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote:
>> On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote:
>>> Greg, we have a feature here where a single device of class net has
>>> multiple "bus parents". We used to have one attr under class net
>>> (device) which is a link to the bus parent. Now we either need to add
>>> more or not bother with the linking of the whole device. Is there any
>>> precedent / preference for solving this from the device model
>>> perspective?
>>
>> How, logically, can a netdevice be controlled properly from 2 parent
>> devices on two different busses?  How is that even possible from a
>> physical point-of-view?  What exact bus types are involved here?
> 
> Two PCIe buses, two endpoints, two networking ports. It's one piece

Isn't it only 1 networking port with multiple PFs?

> of silicon, tho, so the "slices" can talk to each other internally.
> The NVRAM configuration tells both endpoints that the user wants
> them "bonded", when the PCI drivers probe they "find each other"
> using some cookie or DSN or whatnot. And once they did, they spawn
> a single netdev.
> 
>> This "shouldn't" be possible as in the end, it's usually a PCI device
>> handling this all, right?
> 
> It's really a special type of bonding of two netdevs. Like you'd bond
> two ports to get twice the bandwidth. With the twist that the balancing
> is done on NUMA proximity, rather than traffic hash.
> 
> Well, plus, the major twist that it's all done magically "for you"
> in the vendor driver, and the two "lower" devices are not visible.
> You only see the resulting bond.
> 
> I personally think that the magic hides as many problems as it
> introduces and we'd be better off creating two separate netdevs.
> And then a new type of "device bond" on top. Small win that
> the "new device bond on top" can be shared code across vendors.

Yes. We have been exploring a small extension to bonding driver to 
enable a single numa-aware multi-threaded application to efficiently 
utilize multiple NICs across numa nodes.

Here is an early version of a patch we have been trying and seems to be 
working well.

=========================================================================
bonding: select tx device based on rx device of a flow

If napi_id is cached in the sk associated with skb, use the
device associated with napi_id as the transmit device.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 7a7d584f378a..77e3bf6c4502 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -5146,6 +5146,30 @@ static struct slave *bond_xmit_3ad_xor_slave_get(struct bonding *bond,
         unsigned int count;
         u32 hash;

+       if (skb->sk) {
+               int napi_id = skb->sk->sk_napi_id;
+               struct net_device *dev;
+               int idx;
+
+               rcu_read_lock();
+               dev = dev_get_by_napi_id(napi_id);
+               rcu_read_unlock();
+
+               if (!dev)
+                       goto hash;
+
+               count = slaves ? READ_ONCE(slaves->count) : 0;
+               if (unlikely(!count))
+                       return NULL;
+
+               for (idx = 0; idx < count; idx++) {
+                       slave = slaves->arr[idx];
+                       if (slave->dev->ifindex == dev->ifindex)
+                               return slave;
+               }
+       }
+
+hash:
         hash = bond_xmit_hash(bond, skb);
         count = slaves ? READ_ONCE(slaves->count) : 0;
         if (unlikely(!count))
=========================================================================

If we make this as a configurable bonding option, would this be an 
acceptable solution to accelerate numa-aware apps?

> 
> But there's only so many hours in the day to argue with vendors.
>
Jay Vosburgh Feb. 23, 2024, 2:05 a.m. UTC | #9
Samudrala, Sridhar <sridhar.samudrala@intel.com> wrote:
>On 2/22/2024 5:00 PM, Jakub Kicinski wrote:
>> On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote:
>>> On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote:
>>>> Greg, we have a feature here where a single device of class net has
>>>> multiple "bus parents". We used to have one attr under class net
>>>> (device) which is a link to the bus parent. Now we either need to add
>>>> more or not bother with the linking of the whole device. Is there any
>>>> precedent / preference for solving this from the device model
>>>> perspective?
>>>
>>> How, logically, can a netdevice be controlled properly from 2 parent
>>> devices on two different busses?  How is that even possible from a
>>> physical point-of-view?  What exact bus types are involved here?
>> Two PCIe buses, two endpoints, two networking ports. It's one piece
>
>Isn't it only 1 networking port with multiple PFs?
>
>> of silicon, tho, so the "slices" can talk to each other internally.
>> The NVRAM configuration tells both endpoints that the user wants
>> them "bonded", when the PCI drivers probe they "find each other"
>> using some cookie or DSN or whatnot. And once they did, they spawn
>> a single netdev.
>> 
>>> This "shouldn't" be possible as in the end, it's usually a PCI device
>>> handling this all, right?
>> It's really a special type of bonding of two netdevs. Like you'd bond
>> two ports to get twice the bandwidth. With the twist that the balancing
>> is done on NUMA proximity, rather than traffic hash.
>> Well, plus, the major twist that it's all done magically "for you"
>> in the vendor driver, and the two "lower" devices are not visible.
>> You only see the resulting bond.
>> I personally think that the magic hides as many problems as it
>> introduces and we'd be better off creating two separate netdevs.
>> And then a new type of "device bond" on top. Small win that
>> the "new device bond on top" can be shared code across vendors.
>
>Yes. We have been exploring a small extension to bonding driver to enable
>a single numa-aware multi-threaded application to efficiently utilize
>multiple NICs across numa nodes.

	Is this referring to something like the multi-pf under
discussion, or just generically with two arbitrary network devices
installed one each per NUMA node?

>Here is an early version of a patch we have been trying and seems to be
>working well.
>
>=========================================================================
>bonding: select tx device based on rx device of a flow
>
>If napi_id is cached in the sk associated with skb, use the
>device associated with napi_id as the transmit device.
>
>Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>
>diff --git a/drivers/net/bonding/bond_main.c
>b/drivers/net/bonding/bond_main.c
>index 7a7d584f378a..77e3bf6c4502 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -5146,6 +5146,30 @@ static struct slave
>*bond_xmit_3ad_xor_slave_get(struct bonding *bond,
>        unsigned int count;
>        u32 hash;
>
>+       if (skb->sk) {
>+               int napi_id = skb->sk->sk_napi_id;
>+               struct net_device *dev;
>+               int idx;
>+
>+               rcu_read_lock();
>+               dev = dev_get_by_napi_id(napi_id);
>+               rcu_read_unlock();
>+
>+               if (!dev)
>+                       goto hash;
>+
>+               count = slaves ? READ_ONCE(slaves->count) : 0;
>+               if (unlikely(!count))
>+                       return NULL;
>+
>+               for (idx = 0; idx < count; idx++) {
>+                       slave = slaves->arr[idx];
>+                       if (slave->dev->ifindex == dev->ifindex)
>+                               return slave;
>+               }
>+       }
>+
>+hash:
>        hash = bond_xmit_hash(bond, skb);
>        count = slaves ? READ_ONCE(slaves->count) : 0;
>        if (unlikely(!count))
>=========================================================================
>
>If we make this as a configurable bonding option, would this be an
>acceptable solution to accelerate numa-aware apps?

	Assuming for the moment this is for "regular" network devices
installed one per NUMA node, why do this in bonding instead of at a
higher layer (multiple subnets or ECMP, for example)?

	Is the intent here that the bond would aggregate its interfaces
via LACP with the peer being some kind of cross-chassis link aggregation
(MLAG, et al)?

	Given that sk_napi_id seems to be associated with
CONFIG_NET_RX_BUSY_POLL, am I correct in presuming the target
applications are DPDK-style busy poll packet processors?

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
Samudrala, Sridhar Feb. 23, 2024, 5 a.m. UTC | #10
On 2/22/2024 8:05 PM, Jay Vosburgh wrote:
> Samudrala, Sridhar <sridhar.samudrala@intel.com> wrote:
>> On 2/22/2024 5:00 PM, Jakub Kicinski wrote:
>>> On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote:
>>>> On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote:
>>>>> Greg, we have a feature here where a single device of class net has
>>>>> multiple "bus parents". We used to have one attr under class net
>>>>> (device) which is a link to the bus parent. Now we either need to add
>>>>> more or not bother with the linking of the whole device. Is there any
>>>>> precedent / preference for solving this from the device model
>>>>> perspective?
>>>>
>>>> How, logically, can a netdevice be controlled properly from 2 parent
>>>> devices on two different busses?  How is that even possible from a
>>>> physical point-of-view?  What exact bus types are involved here?
>>> Two PCIe buses, two endpoints, two networking ports. It's one piece
>>
>> Isn't it only 1 networking port with multiple PFs?
>>
>>> of silicon, tho, so the "slices" can talk to each other internally.
>>> The NVRAM configuration tells both endpoints that the user wants
>>> them "bonded", when the PCI drivers probe they "find each other"
>>> using some cookie or DSN or whatnot. And once they did, they spawn
>>> a single netdev.
>>>
>>>> This "shouldn't" be possible as in the end, it's usually a PCI device
>>>> handling this all, right?
>>> It's really a special type of bonding of two netdevs. Like you'd bond
>>> two ports to get twice the bandwidth. With the twist that the balancing
>>> is done on NUMA proximity, rather than traffic hash.
>>> Well, plus, the major twist that it's all done magically "for you"
>>> in the vendor driver, and the two "lower" devices are not visible.
>>> You only see the resulting bond.
>>> I personally think that the magic hides as many problems as it
>>> introduces and we'd be better off creating two separate netdevs.
>>> And then a new type of "device bond" on top. Small win that
>>> the "new device bond on top" can be shared code across vendors.
>>
>> Yes. We have been exploring a small extension to bonding driver to enable
>> a single numa-aware multi-threaded application to efficiently utilize
>> multiple NICs across numa nodes.
> 
> 	Is this referring to something like the multi-pf under
> discussion, or just generically with two arbitrary network devices
> installed one each per NUMA node?

Normal network devices one per NUMA node

> 
>> Here is an early version of a patch we have been trying and seems to be
>> working well.
>>
>> =========================================================================
>> bonding: select tx device based on rx device of a flow
>>
>> If napi_id is cached in the sk associated with skb, use the
>> device associated with napi_id as the transmit device.
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>
>> diff --git a/drivers/net/bonding/bond_main.c
>> b/drivers/net/bonding/bond_main.c
>> index 7a7d584f378a..77e3bf6c4502 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -5146,6 +5146,30 @@ static struct slave
>> *bond_xmit_3ad_xor_slave_get(struct bonding *bond,
>>         unsigned int count;
>>         u32 hash;
>>
>> +       if (skb->sk) {
>> +               int napi_id = skb->sk->sk_napi_id;
>> +               struct net_device *dev;
>> +               int idx;
>> +
>> +               rcu_read_lock();
>> +               dev = dev_get_by_napi_id(napi_id);
>> +               rcu_read_unlock();
>> +
>> +               if (!dev)
>> +                       goto hash;
>> +
>> +               count = slaves ? READ_ONCE(slaves->count) : 0;
>> +               if (unlikely(!count))
>> +                       return NULL;
>> +
>> +               for (idx = 0; idx < count; idx++) {
>> +                       slave = slaves->arr[idx];
>> +                       if (slave->dev->ifindex == dev->ifindex)
>> +                               return slave;
>> +               }
>> +       }
>> +
>> +hash:
>>         hash = bond_xmit_hash(bond, skb);
>>         count = slaves ? READ_ONCE(slaves->count) : 0;
>>         if (unlikely(!count))
>> =========================================================================
>>
>> If we make this as a configurable bonding option, would this be an
>> acceptable solution to accelerate numa-aware apps?
> 
> 	Assuming for the moment this is for "regular" network devices
> installed one per NUMA node, why do this in bonding instead of at a
> higher layer (multiple subnets or ECMP, for example)?
> 
> 	Is the intent here that the bond would aggregate its interfaces
> via LACP with the peer being some kind of cross-chassis link aggregation
> (MLAG, et al)?

Yes. basic LACP bonding setup. There could be multiple peers connecting 
to the server via switch providing LACP based link aggregation. No 
cross-chassis MLAG.

> 
> 	Given that sk_napi_id seems to be associated with
> CONFIG_NET_RX_BUSY_POLL, am I correct in presuming the target
> applications are DPDK-style busy poll packet processors?

I am using sk_napi_id to get the incoming interface. Busy poll is not a 
requirement and this can be used with any socket-based apps.

In a NUMA-aware app, the app threads are split into pools of threads 
aligned to each NUMA node and the associated NIC. In the rx path, a 
thread is picked from the pool associated with a NUMA node using 
SO_INCOMING_CPU or a similar method, by setting irq affinity to the local 
cores. The napi id is cached in the sk in the receive path. In the tx path, 
the bonding driver picks the same NIC as the outgoing device using the 
cached sk->napi_id.

This enables a NUMA-affinitized data path for an app thread doing network 
I/O. If we also configure XPS based on rx queues, tx and rx of a TCP 
flow can be aligned to the same queue pair of a NIC even when using bonding.
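
The rx-queue-based XPS part is the standard xps_rxqs knob, e.g. a sketch 
mapping tx-0 to rx queue 0 of an assumed eth0:

echo 1 > /sys/class/net/eth0/queues/tx-0/xps_rxqs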

> 
> 	-J
> 
> ---
> 	-Jay Vosburgh, jay.vosburgh@canonical.com
Jiri Pirko Feb. 23, 2024, 9:36 a.m. UTC | #11
Fri, Feb 23, 2024 at 02:23:32AM CET, sridhar.samudrala@intel.com wrote:
>
>
>On 2/22/2024 5:00 PM, Jakub Kicinski wrote:
>> On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote:
>> > On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote:
>> > > Greg, we have a feature here where a single device of class net has
>> > > multiple "bus parents". We used to have one attr under class net
>> > > (device) which is a link to the bus parent. Now we either need to add
>> > > more or not bother with the linking of the whole device. Is there any
>> > > precedent / preference for solving this from the device model
>> > > perspective?
>> > 
>> > How, logically, can a netdevice be controlled properly from 2 parent
>> > devices on two different busses?  How is that even possible from a
>> > physical point-of-view?  What exact bus types are involved here?
>> 
>> Two PCIe buses, two endpoints, two networking ports. It's one piece
>
>Isn't it only 1 networking port with multiple PFs?

AFAIK, yes. I have one device in hands like this. One physical port,
2 PCI slots, 2 PFs on PCI bus.


>
>> of silicon, tho, so the "slices" can talk to each other internally.
>> The NVRAM configuration tells both endpoints that the user wants
>> them "bonded", when the PCI drivers probe they "find each other"
>> using some cookie or DSN or whatnot. And once they did, they spawn
>> a single netdev.
>> 
>> > This "shouldn't" be possible as in the end, it's usually a PCI device
>> > handling this all, right?
>> 
>> It's really a special type of bonding of two netdevs. Like you'd bond
>> two ports to get twice the bandwidth. With the twist that the balancing
>> is done on NUMA proximity, rather than traffic hash.
>> 
>> Well, plus, the major twist that it's all done magically "for you"
>> in the vendor driver, and the two "lower" devices are not visible.
>> You only see the resulting bond.
>> 
>> I personally think that the magic hides as many problems as it
>> introduces and we'd be better off creating two separate netdevs.
>> And then a new type of "device bond" on top. Small win that
>> the "new device bond on top" can be shared code across vendors.
>
>Yes. We have been exploring a small extension to bonding driver to enable a
>single numa-aware multi-threaded application to efficiently utilize multiple
>NICs across numa nodes.

Bonding was my immediate response when we discussed this internally for
the first time. But I had to eventually admit it is probably not that
suitable in this case, here's why:
1) there are no 2 physical ports, only one.
2) it is basically a matter of device layout/provisioning that this
   feature should be enabled, not user configuration.
3) other subsystems like RDMA would benefit from the same feature, so this
   is not netdev specific in general.
Jiri Pirko Feb. 23, 2024, 9:40 a.m. UTC | #12
Fri, Feb 23, 2024 at 06:00:40AM CET, sridhar.samudrala@intel.com wrote:
>
>
>On 2/22/2024 8:05 PM, Jay Vosburgh wrote:
>> Samudrala, Sridhar <sridhar.samudrala@intel.com> wrote:
>> > On 2/22/2024 5:00 PM, Jakub Kicinski wrote:
>> > > On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote:
>> > > > On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote:
>> > > > > Greg, we have a feature here where a single device of class net has
>> > > > > multiple "bus parents". We used to have one attr under class net
>> > > > > (device) which is a link to the bus parent. Now we either need to add
>> > > > > more or not bother with the linking of the whole device. Is there any
>> > > > > precedent / preference for solving this from the device model
>> > > > > perspective?
>> > > > 
>> > > > How, logically, can a netdevice be controlled properly from 2 parent
>> > > > devices on two different busses?  How is that even possible from a
>> > > > physical point-of-view?  What exact bus types are involved here?
>> > > Two PCIe buses, two endpoints, two networking ports. It's one piece
>> > 
>> > Isn't it only 1 networking port with multiple PFs?
>> > 
>> > > of silicon, tho, so the "slices" can talk to each other internally.
>> > > The NVRAM configuration tells both endpoints that the user wants
>> > > them "bonded", when the PCI drivers probe they "find each other"
>> > > using some cookie or DSN or whatnot. And once they did, they spawn
>> > > a single netdev.
>> > > 
>> > > > This "shouldn't" be possible as in the end, it's usually a PCI device
>> > > > handling this all, right?
>> > > It's really a special type of bonding of two netdevs. Like you'd bond
>> > > two ports to get twice the bandwidth. With the twist that the balancing
>> > > is done on NUMA proximity, rather than traffic hash.
>> > > Well, plus, the major twist that it's all done magically "for you"
>> > > in the vendor driver, and the two "lower" devices are not visible.
>> > > You only see the resulting bond.
>> > > I personally think that the magic hides as many problems as it
>> > > introduces and we'd be better off creating two separate netdevs.
>> > > And then a new type of "device bond" on top. Small win that
>> > > the "new device bond on top" can be shared code across vendors.
>> > 
>> > Yes. We have been exploring a small extension to bonding driver to enable
>> > a single numa-aware multi-threaded application to efficiently utilize
>> > multiple NICs across numa nodes.
>> 
>> 	Is this referring to something like the multi-pf under
>> discussion, or just generically with two arbitrary network devices
>> installed one each per NUMA node?
>
>Normal network devices one per NUMA node
>
>> 
>> > Here is an early version of a patch we have been trying and seems to be
>> > working well.
>> > 
>> > =========================================================================
>> > bonding: select tx device based on rx device of a flow
>> > 
>> > If napi_id is cached in the sk associated with skb, use the
>> > device associated with napi_id as the transmit device.
>> > 
>> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > 
>> > diff --git a/drivers/net/bonding/bond_main.c
>> > b/drivers/net/bonding/bond_main.c
>> > index 7a7d584f378a..77e3bf6c4502 100644
>> > --- a/drivers/net/bonding/bond_main.c
>> > +++ b/drivers/net/bonding/bond_main.c
>> > @@ -5146,6 +5146,30 @@ static struct slave
>> > *bond_xmit_3ad_xor_slave_get(struct bonding *bond,
>> >         unsigned int count;
>> >         u32 hash;
>> > 
>> > +       if (skb->sk) {
>> > +               int napi_id = skb->sk->sk_napi_id;
>> > +               struct net_device *dev;
>> > +               int idx;
>> > +
>> > +               rcu_read_lock();
>> > +               dev = dev_get_by_napi_id(napi_id);
>> > +               rcu_read_unlock();
>> > +
>> > +               if (!dev)
>> > +                       goto hash;
>> > +
>> > +               count = slaves ? READ_ONCE(slaves->count) : 0;
>> > +               if (unlikely(!count))
>> > +                       return NULL;
>> > +
>> > +               for (idx = 0; idx < count; idx++) {
>> > +                       slave = slaves->arr[idx];
>> > +                       if (slave->dev->ifindex == dev->ifindex)
>> > +                               return slave;
>> > +               }
>> > +       }
>> > +
>> > +hash:
>> >         hash = bond_xmit_hash(bond, skb);
>> >         count = slaves ? READ_ONCE(slaves->count) : 0;
>> >         if (unlikely(!count))
>> > =========================================================================
>> > 
>> > If we make this as a configurable bonding option, would this be an
>> > acceptable solution to accelerate numa-aware apps?
>> 
>> 	Assuming for the moment this is for "regular" network devices
>> installed one per NUMA node, why do this in bonding instead of at a
>> higher layer (multiple subnets or ECMP, for example)?
>> 
>> 	Is the intent here that the bond would aggregate its interfaces
>> via LACP with the peer being some kind of cross-chassis link aggregation
>> (MLAG, et al)?

No.

>
>Yes. basic LACP bonding setup. There could be multiple peers connecting to
>the server via switch providing LACP based link aggregation. No cross-chassis
>MLAG.

LACP does not make any sense, when you have only a single physical port.
That applies to ECMP mentioned above too I believe.
Samudrala, Sridhar Feb. 23, 2024, 11:56 p.m. UTC | #13
On 2/23/2024 3:40 AM, Jiri Pirko wrote:
> Fri, Feb 23, 2024 at 06:00:40AM CET, sridhar.samudrala@intel.com wrote:
>>
>>
>> On 2/22/2024 8:05 PM, Jay Vosburgh wrote:
>>> Samudrala, Sridhar <sridhar.samudrala@intel.com> wrote:
>>>> On 2/22/2024 5:00 PM, Jakub Kicinski wrote:
>>>>> On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote:
>>>>>> On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote:
>>>>>>> Greg, we have a feature here where a single device of class net has
>>>>>>> multiple "bus parents". We used to have one attr under class net
>>>>>>> (device) which is a link to the bus parent. Now we either need to add
>>>>>>> more or not bother with the linking of the whole device. Is there any
>>>>>>> precedent / preference for solving this from the device model
>>>>>>> perspective?
>>>>>>
>>>>>> How, logically, can a netdevice be controlled properly from 2 parent
>>>>>> devices on two different busses?  How is that even possible from a
>>>>>> physical point-of-view?  What exact bus types are involved here?
>>>>> Two PCIe buses, two endpoints, two networking ports. It's one piece
>>>>
>>>> Isn't it only 1 networking port with multiple PFs?
>>>>
>>>>> of silicon, tho, so the "slices" can talk to each other internally.
>>>>> The NVRAM configuration tells both endpoints that the user wants
>>>>> them "bonded", when the PCI drivers probe they "find each other"
>>>>> using some cookie or DSN or whatnot. And once they did, they spawn
>>>>> a single netdev.
>>>>>
>>>>>> This "shouldn't" be possible as in the end, it's usually a PCI device
>>>>>> handling this all, right?
>>>>> It's really a special type of bonding of two netdevs. Like you'd bond
>>>>> two ports to get twice the bandwidth. With the twist that the balancing
>>>>> is done on NUMA proximity, rather than traffic hash.
>>>>> Well, plus, the major twist that it's all done magically "for you"
>>>>> in the vendor driver, and the two "lower" devices are not visible.
>>>>> You only see the resulting bond.
>>>>> I personally think that the magic hides as many problems as it
>>>>> introduces and we'd be better off creating two separate netdevs.
>>>>> And then a new type of "device bond" on top. Small win that
>>>>> the "new device bond on top" can be shared code across vendors.
>>>>
>>>> Yes. We have been exploring a small extension to bonding driver to enable
>>>> a single numa-aware multi-threaded application to efficiently utilize
>>>> multiple NICs across numa nodes.
>>>
>>> 	Is this referring to something like the multi-pf under
>>> discussion, or just generically with two arbitrary network devices
>>> installed one each per NUMA node?
>>
>> Normal network devices one per NUMA node
>>
>>>
>>>> Here is an early version of a patch we have been trying and seems to be
>>>> working well.
>>>>
>>>> =========================================================================
>>>> bonding: select tx device based on rx device of a flow
>>>>
>>>> If napi_id is cached in the sk associated with skb, use the
>>>> device associated with napi_id as the transmit device.
>>>>
>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>>
>>>> diff --git a/drivers/net/bonding/bond_main.c
>>>> b/drivers/net/bonding/bond_main.c
>>>> index 7a7d584f378a..77e3bf6c4502 100644
>>>> --- a/drivers/net/bonding/bond_main.c
>>>> +++ b/drivers/net/bonding/bond_main.c
>>>> @@ -5146,6 +5146,30 @@ static struct slave
>>>> *bond_xmit_3ad_xor_slave_get(struct bonding *bond,
>>>>          unsigned int count;
>>>>          u32 hash;
>>>>
>>>> +       if (skb->sk) {
>>>> +               int napi_id = skb->sk->sk_napi_id;
>>>> +               struct net_device *dev;
>>>> +               int idx;
>>>> +
>>>> +               rcu_read_lock();
>>>> +               dev = dev_get_by_napi_id(napi_id);
>>>> +               rcu_read_unlock();
>>>> +
>>>> +               if (!dev)
>>>> +                       goto hash;
>>>> +
>>>> +               count = slaves ? READ_ONCE(slaves->count) : 0;
>>>> +               if (unlikely(!count))
>>>> +                       return NULL;
>>>> +
>>>> +               for (idx = 0; idx < count; idx++) {
>>>> +                       slave = slaves->arr[idx];
>>>> +                       if (slave->dev->ifindex == dev->ifindex)
>>>> +                               return slave;
>>>> +               }
>>>> +       }
>>>> +
>>>> +hash:
>>>>          hash = bond_xmit_hash(bond, skb);
>>>>          count = slaves ? READ_ONCE(slaves->count) : 0;
>>>>          if (unlikely(!count))
>>>> =========================================================================
>>>>
>>>> If we make this as a configurable bonding option, would this be an
>>>> acceptable solution to accelerate numa-aware apps?
>>>
>>> 	Assuming for the moment this is for "regular" network devices
>>> installed one per NUMA node, why do this in bonding instead of at a
>>> higher layer (multiple subnets or ECMP, for example)?
>>>
>>> 	Is the intent here that the bond would aggregate its interfaces
>>> via LACP with the peer being some kind of cross-chassis link aggregation
>>> (MLAG, et al)?
> 
> No.
> 
>>
>> Yes. basic LACP bonding setup. There could be multiple peers connecting to
>> the server via switch providing LACP based link aggregation. No cross-chassis
>> MLAG.
> 
> LACP does not make any sense, when you have only a single physical port.
> That applies to ECMP mentioned above too I believe.

I meant for the 2 regular NICs on 2 numa node setup, not for multi-PF 1 
port setup.
Jiri Pirko Feb. 24, 2024, 12:48 p.m. UTC | #14
Sat, Feb 24, 2024 at 12:56:52AM CET, sridhar.samudrala@intel.com wrote:
>
>
>On 2/23/2024 3:40 AM, Jiri Pirko wrote:
>> Fri, Feb 23, 2024 at 06:00:40AM CET, sridhar.samudrala@intel.com wrote:
>> > 
>> > 
>> > On 2/22/2024 8:05 PM, Jay Vosburgh wrote:
>> > > Samudrala, Sridhar <sridhar.samudrala@intel.com> wrote:
>> > > > On 2/22/2024 5:00 PM, Jakub Kicinski wrote:
>> > > > > On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote:
>> > > > > > On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote:
>> > > > > > > Greg, we have a feature here where a single device of class net has
>> > > > > > > multiple "bus parents". We used to have one attr under class net
>> > > > > > > (device) which is a link to the bus parent. Now we either need to add
>> > > > > > > more or not bother with the linking of the whole device. Is there any
>> > > > > > > precedent / preference for solving this from the device model
>> > > > > > > perspective?
>> > > > > > 
>> > > > > > How, logically, can a netdevice be controlled properly from 2 parent
>> > > > > > devices on two different busses?  How is that even possible from a
>> > > > > > physical point-of-view?  What exact bus types are involved here?
>> > > > > Two PCIe buses, two endpoints, two networking ports. It's one piece
>> > > > 
>> > > > Isn't it only 1 networking port with multiple PFs?
>> > > > 
>> > > > > of silicon, tho, so the "slices" can talk to each other internally.
>> > > > > The NVRAM configuration tells both endpoints that the user wants
>> > > > > them "bonded", when the PCI drivers probe they "find each other"
>> > > > > using some cookie or DSN or whatnot. And once they did, they spawn
>> > > > > a single netdev.
>> > > > > 
>> > > > > > This "shouldn't" be possible as in the end, it's usually a PCI device
>> > > > > > handling this all, right?
>> > > > > It's really a special type of bonding of two netdevs. Like you'd bond
>> > > > > two ports to get twice the bandwidth. With the twist that the balancing
>> > > > > is done on NUMA proximity, rather than traffic hash.
>> > > > > Well, plus, the major twist that it's all done magically "for you"
>> > > > > in the vendor driver, and the two "lower" devices are not visible.
>> > > > > You only see the resulting bond.
>> > > > > I personally think that the magic hides as many problems as it
>> > > > > introduces and we'd be better off creating two separate netdevs.
>> > > > > And then a new type of "device bond" on top. Small win that
>> > > > > the "new device bond on top" can be shared code across vendors.
>> > > > 
>> > > > Yes. We have been exploring a small extension to bonding driver to enable
>> > > > a single numa-aware multi-threaded application to efficiently utilize
>> > > > multiple NICs across numa nodes.
>> > > 
>> > > 	Is this referring to something like the multi-pf under
>> > > discussion, or just generically with two arbitrary network devices
>> > > installed one each per NUMA node?
>> > 
>> > Normal network devices one per NUMA node
>> > 
>> > > 
>> > > > Here is an early version of a patch we have been trying and seems to be
>> > > > working well.
>> > > > 
>> > > > =========================================================================
>> > > > bonding: select tx device based on rx device of a flow
>> > > > 
>> > > > If napi_id is cached in the sk associated with skb, use the
>> > > > device associated with napi_id as the transmit device.
>> > > > 
>> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > > > 
>> > > > diff --git a/drivers/net/bonding/bond_main.c
>> > > > b/drivers/net/bonding/bond_main.c
>> > > > index 7a7d584f378a..77e3bf6c4502 100644
>> > > > --- a/drivers/net/bonding/bond_main.c
>> > > > +++ b/drivers/net/bonding/bond_main.c
>> > > > @@ -5146,6 +5146,30 @@ static struct slave
>> > > > *bond_xmit_3ad_xor_slave_get(struct bonding *bond,
>> > > >          unsigned int count;
>> > > >          u32 hash;
>> > > > 
>> > > > +       if (skb->sk) {
>> > > > +               int napi_id = skb->sk->sk_napi_id;
>> > > > +               struct net_device *dev;
>> > > > +               int idx;
>> > > > +
>> > > > +               rcu_read_lock();
>> > > > +               dev = dev_get_by_napi_id(napi_id);
>> > > > +               rcu_read_unlock();
>> > > > +
>> > > > +               if (!dev)
>> > > > +                       goto hash;
>> > > > +
>> > > > +               count = slaves ? READ_ONCE(slaves->count) : 0;
>> > > > +               if (unlikely(!count))
>> > > > +                       return NULL;
>> > > > +
>> > > > +               for (idx = 0; idx < count; idx++) {
>> > > > +                       slave = slaves->arr[idx];
>> > > > +                       if (slave->dev->ifindex == dev->ifindex)
>> > > > +                               return slave;
>> > > > +               }
>> > > > +       }
>> > > > +
>> > > > +hash:
>> > > >          hash = bond_xmit_hash(bond, skb);
>> > > >          count = slaves ? READ_ONCE(slaves->count) : 0;
>> > > >          if (unlikely(!count))
>> > > > =========================================================================
>> > > > 
>> > > > If we make this as a configurable bonding option, would this be an
>> > > > acceptable solution to accelerate numa-aware apps?
>> > > 
>> > > 	Assuming for the moment this is for "regular" network devices
>> > > installed one per NUMA node, why do this in bonding instead of at a
>> > > higher layer (multiple subnets or ECMP, for example)?
>> > > 
>> > > 	Is the intent here that the bond would aggregate its interfaces
>> > > via LACP with the peer being some kind of cross-chassis link aggregation
>> > > (MLAG, et al)?
>> 
>> No.
>> 
>> > 
>> > Yes. basic LACP bonding setup. There could be multiple peers connecting to
>> > the server via switch providing LACP based link aggregation. No cross-chassis
>> > MLAG.
>> 
>> LACP does not make any sense, when you have only a single physical port.
>> That applies to ECMP mentioned above too I believe.
>
>I meant for the 2 regular NICs on 2 numa node setup, not for multi-PF 1 port
>setup.

Okay, not sure how it is related to this thread then :)
Jakub Kicinski Feb. 28, 2024, 2:06 a.m. UTC | #15
On Fri, 23 Feb 2024 10:36:25 +0100 Jiri Pirko wrote:
> >> It's really a special type of bonding of two netdevs. Like you'd bond
> >> two ports to get twice the bandwidth. With the twist that the balancing
> >> is done on NUMA proximity, rather than traffic hash.
> >> 
> >> Well, plus, the major twist that it's all done magically "for you"
> >> in the vendor driver, and the two "lower" devices are not visible.
> >> You only see the resulting bond.
> >> 
> >> I personally think that the magic hides as many problems as it
> >> introduces and we'd be better off creating two separate netdevs.
> >> And then a new type of "device bond" on top. Small win that
> >> the "new device bond on top" can be shared code across vendors.  
> >
> >Yes. We have been exploring a small extension to bonding driver to enable a
> >single numa-aware multi-threaded application to efficiently utilize multiple
> >NICs across numa nodes.  
> 
> Bonding was my immediate response when we discussed this internally for
> the first time. But I had to eventually admit it is probably not that
> suitable in this case, here's why:
> 1) there are no 2 physical ports, only one.

Right, sorry, number of PFs matches number of ports for each bus.
But it's not necessarily a deal breaker - it's similar to a multi-host
device. We also have multiple netdevs and PCIe links, they just go to
different host rather than different NUMA nodes on one host.

> 2) it is basically a matter of device layout/provisioning that this
>    feature should be enabled, not user configuration.

We can still auto-instantiate it, not a deal breaker.

I'm not sure you're right in that assumption, tho. At Meta, we support
container sizes ranging from few CPUs to multiple NUMA nodes. Each NUMA
node may have it's own NIC, and the orchestration needs to stitch and
un-stitch NICs depending on whether the cores were allocated to small
containers or a huge one.

So it would be _easier_ to deal with multiple netdevs. Orchestration
layer already understands netdev <> NUMA mapping, it does not understand
multi-NUMA netdevs, and how to match up queues to nodes.

> 3) other subsystems like RDMA would benefit the same feature, so this
>>    is not netdev specific in general.

Yes, looks RDMA-centric. RDMA being infamously bonding-challenged.

Anyway, back to the initial question - from Greg's reply I'm guessing
there's no precedent for doing such things in the device model either.
So we're on our own.
Jiri Pirko Feb. 28, 2024, 8:13 a.m. UTC | #16
Wed, Feb 28, 2024 at 03:06:19AM CET, kuba@kernel.org wrote:
>On Fri, 23 Feb 2024 10:36:25 +0100 Jiri Pirko wrote:
>> >> It's really a special type of bonding of two netdevs. Like you'd bond
>> >> two ports to get twice the bandwidth. With the twist that the balancing
>> >> is done on NUMA proximity, rather than traffic hash.
>> >> 
>> >> Well, plus, the major twist that it's all done magically "for you"
>> >> in the vendor driver, and the two "lower" devices are not visible.
>> >> You only see the resulting bond.
>> >> 
>> >> I personally think that the magic hides as many problems as it
>> >> introduces and we'd be better off creating two separate netdevs.
>> >> And then a new type of "device bond" on top. Small win that
>> >> the "new device bond on top" can be shared code across vendors.  
>> >
>> >Yes. We have been exploring a small extension to bonding driver to enable a
>> >single numa-aware multi-threaded application to efficiently utilize multiple
>> >NICs across numa nodes.  
>> 
>> Bonding was my immediate response when we discussed this internally for
>> the first time. But I had to eventually admit it is probably not that
>> suitable in this case, here's why:
>> 1) there are no 2 physical ports, only one.
>
>Right, sorry, number of PFs matches number of ports for each bus.
>But it's not necessarily a deal breaker - it's similar to a multi-host
>device. We also have multiple netdevs and PCIe links, they just go to
>different host rather than different NUMA nodes on one host.

That is a different scenario. You have multiple hosts and a switch
between them and the physical port. Yeah, it might be invisible switch,
but there still is one. On DPU/smartnic, it is visible and configurable.


>
>> 2) it is basically a matter of device layout/provisioning that this
>>    feature should be enabled, not user configuration.
>
>We can still auto-instantiate it, not a deal breaker.

"Auto-instantiate" in meating of userspace orchestration deamon,
not kernel, that's what you mean?


>
>I'm not sure you're right in that assumption, tho. At Meta, we support
>container sizes ranging from few CPUs to multiple NUMA nodes. Each NUMA
>node may have it's own NIC, and the orchestration needs to stitch and
>un-stitch NICs depending on whether the cores were allocated to small
>containers or a huge one.

Yeah, but still, there is one physical port for NIC-numanode pair.
Correct? Does the orchestration setup a bond on top of them or some other
master device or let the container use them independently?


>
>So it would be _easier_ to deal with multiple netdevs. Orchestration
>layer already understands netdev <> NUMA mapping, it does not understand
>multi-NUMA netdevs, and how to match up queues to nodes.
>
>> 3) other subsystems like RDMA would benefit the same feature, so this
>> >>    is not netdev specific in general.
>
>Yes, looks RDMA-centric. RDMA being infamously bonding-challenged.

Not really. It's just needed to consider all usecases, not only netdev.


>
>Anyway, back to the initial question - from Greg's reply I'm guessing
>there's no precedent for doing such things in the device model either.
>So we're on our own.
Jakub Kicinski Feb. 28, 2024, 5:06 p.m. UTC | #17
On Wed, 28 Feb 2024 09:13:57 +0100 Jiri Pirko wrote:
> >> 2) it is basically a matter of device layout/provisioning that this
> >>    feature should be enabled, not user configuration.  
> >
> >We can still auto-instantiate it, not a deal breaker.  
> 
> "Auto-instantiate" in meating of userspace orchestration deamon,
> not kernel, that's what you mean?

Either kernel, or pass some hints to a user space agent, like networkd
and have it handle the creation. We have precedent for "kernel side
bonding" with the VF<>virtio bonding thing.

> >I'm not sure you're right in that assumption, tho. At Meta, we support
> >container sizes ranging from few CPUs to multiple NUMA nodes. Each NUMA
> >node may have it's own NIC, and the orchestration needs to stitch and
> >un-stitch NICs depending on whether the cores were allocated to small
> >containers or a huge one.  
> 
> Yeah, but still, there is one physical port for NIC-numanode pair.

Well, today there is.

> Correct? Does the orchestration setup a bond on top of them or some other
> master device or let the container use them independently?

Just multi-nexthop routing and binding sockets to the netdev (with
some BPF magic, I think).

> >So it would be _easier_ to deal with multiple netdevs. Orchestration
> >layer already understands netdev <> NUMA mapping, it does not understand
> >multi-NUMA netdevs, and how to match up queues to nodes.
> >  
> >> 3) other subsystems like RDMA would benefit the same feature, so this
> >>    is not netdev specific in general.
> >
> >Yes, looks RDMA-centric. RDMA being infamously bonding-challenged.  
> 
> Not really. It's just needed to consider all usecases, not only netdev.

All use cases or lowest common denominator, depends on priorities.
Jakub Kicinski Feb. 28, 2024, 5:43 p.m. UTC | #18
On Wed, 28 Feb 2024 09:06:04 -0800 Jakub Kicinski wrote:
> > >Yes, looks RDMA-centric. RDMA being infamously bonding-challenged.    
> > 
> > Not really. It's just needed to consider all usecases, not only netdev.  
> 
> All use cases or lowest common denominator, depends on priorities.

To be clear, I'm not trying to shut down this proposal, I think both
have disadvantages. This one is better for RDMA and iperf, the explicit
netdevs are better for more advanced TCP apps. All I want is clear docs
so users are not confused, and vendors don't diverge pointlessly.
Jiri Pirko Feb. 29, 2024, 8:21 a.m. UTC | #19
Wed, Feb 28, 2024 at 06:06:04PM CET, kuba@kernel.org wrote:
>On Wed, 28 Feb 2024 09:13:57 +0100 Jiri Pirko wrote:
>> >> 2) it is basically a matter of device layout/provisioning that this
>> >>    feature should be enabled, not user configuration.  
>> >
>> >We can still auto-instantiate it, not a deal breaker.  
>> 
>> "Auto-instantiate" in meating of userspace orchestration deamon,
>> not kernel, that's what you mean?
>
>Either kernel, or pass some hints to a user space agent, like networkd
>and have it handle the creation. We have precedent for "kernel side
>bonding" with the VF<>virtio bonding thing.
>
>> >I'm not sure you're right in that assumption, tho. At Meta, we support
>> >container sizes ranging from few CPUs to multiple NUMA nodes. Each NUMA
>> >node may have it's own NIC, and the orchestration needs to stitch and
>> >un-stitch NICs depending on whether the cores were allocated to small
>> >containers or a huge one.  
>> 
>> Yeah, but still, there is one physical port for NIC-numanode pair.
>
>Well, today there is.
>
>> Correct? Does the orchestration setup a bond on top of them or some other
>> master device or let the container use them independently?
>
>Just multi-nexthop routing and binding sockets to the netdev (with
>some BPF magic, I think).

Yeah, so basically 2 independent ports, 2 netdevices working
independently. Not sure I see the parallel to the subject we discuss
here :/


>
>> >So it would be _easier_ to deal with multiple netdevs. Orchestration
>> >layer already understands netdev <> NUMA mapping, it does not understand
>> >multi-NUMA netdevs, and how to match up queues to nodes.
>> >  
>> >> 3) other subsystems like RDMA would benefit the same feature, so this
>> >>    is not netdev specific in general.
>> >
>> >Yes, looks RDMA-centric. RDMA being infamously bonding-challenged.  
>> 
>> Not really. It's just needed to consider all usecases, not only netdev.
>
>All use cases or lowest common denominator, depends on priorities.
Jakub Kicinski Feb. 29, 2024, 2:34 p.m. UTC | #20
On Thu, 29 Feb 2024 09:21:26 +0100 Jiri Pirko wrote:
> >> Correct? Does the orchestration setup a bond on top of them or some other
> >> master device or let the container use them independently?  
> >
> >Just multi-nexthop routing and binding sockets to the netdev (with
> >some BPF magic, I think).  
> 
> Yeah, so basically 2 independent ports, 2 netdevices working
> independently. Not sure I see the parallel to the subject we discuss
> here :/

From the user's perspective it's almost exactly the same.
User wants NUMA nodes to have a way to reach the network without
crossing the interconnect. Whether you do that with 2 200G NICs
or 1 400G NIC connected to two nodes is an implementation detail.
Saeed Mahameed March 2, 2024, 7:31 a.m. UTC | #21
On 28 Feb 09:43, Jakub Kicinski wrote:
>On Wed, 28 Feb 2024 09:06:04 -0800 Jakub Kicinski wrote:
>> > >Yes, looks RDMA-centric. RDMA being infamously bonding-challenged.
>> >
>> > Not really. It's just needed to consider all usecases, not only netdev.
>>
>> All use cases or lowest common denominator, depends on priorities.
>
>To be clear, I'm not trying to shut down this proposal, I think both
>have disadvantages. This one is better for RDMA and iperf, the explicit
>netdevs are better for more advanced TCP apps. All I want is clear docs
>so users are not confused, and vendors don't diverge pointlessly.

Just posted v4 with updated documentation that should cover the basic
feature, which we believe is the minimum that all vendors should
implement. The mlx5 implementation won't change much if we decide later to
move to some sort of "generic netdev" interface. We don't agree it should
be a new kind of bond, as bond was meant for actual link aggregation of
multi-port devices. But again, the mlx5 implementation will remain the same
regardless of any future extension of the feature, and the defaults are
well documented and carefully selected to meet user expectations.
diff mbox series

Patch

diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 69f3d6dcd9fd..473d72c36d61 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -74,6 +74,7 @@  Contents:
    mpls-sysctl
    mptcp-sysctl
    multiqueue
+   multi-pf-netdev
    napi
    net_cachelines/index
    netconsole
diff --git a/Documentation/networking/multi-pf-netdev.rst b/Documentation/networking/multi-pf-netdev.rst
new file mode 100644
index 000000000000..6ef2ac448d1e
--- /dev/null
+++ b/Documentation/networking/multi-pf-netdev.rst
@@ -0,0 +1,157 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===============
+Multi-PF Netdev
+===============
+
+Contents
+========
+
+- `Background`_
+- `Overview`_
+- `mlx5 implementation`_
+- `Channels distribution`_
+- `Topology`_
+- `Steering`_
+- `Mutually exclusive features`_
+
+Background
+==========
+
+The advanced Multi-PF NIC technology enables several CPUs within a multi-socket server to
+connect directly to the network, each through its own dedicated PCIe interface. The connection can
+be made either through a harness that splits the PCIe lanes between two cards or by bifurcating a
+PCIe slot for a single card. This eliminates network traffic traversing the internal bus between
+the sockets, significantly reducing overhead and latency, in addition to reducing CPU utilization
+and increasing network throughput.
+
+Overview
+========
+
+This feature adds support for combining multiple devices (PFs) of the same port in a Multi-PF
+environment under one netdev instance. Passing traffic through different devices belonging to
+different NUMA sockets avoids cross-NUMA traffic, so applications running on the same netdev from
+different NUMA nodes still get proximity to the device and achieve improved performance.
+
+mlx5 implementation
+===================
+
+Multi-PF (or Socket Direct) in mlx5 is achieved by grouping together PFs that belong to the same
+NIC and have the Socket Direct property enabled. Once all PFs are probed, we create a single netdev
+to represent all of them. Symmetrically, we destroy the netdev whenever any of the PFs is removed.
+
+The netdev network channels are distributed between all devices; a proper configuration utilizes
+the NUMA-local device when working with a certain application/CPU.
+
+We pick one PF to be a primary (leader), and it fills a special role. The other devices
+(secondaries) are disconnected from the network at the chip level (set to silent mode). In silent
+mode, no south <-> north traffic flows directly through a secondary PF; it needs the assistance of
+the leader PF (east <-> west traffic) to function. All RX/TX traffic is steered through the primary
+to/from the secondaries.
+
+Currently, we limit the support to PFs only, and up to two PFs (sockets).
+
+Channels distribution
+=====================
+
+We distribute the channels between the different PFs to achieve local NUMA node performance
+on multiple NUMA nodes.
+
+Each combined channel works against one specific PF, creating all its datapath queues against it.
+We distribute channels to PFs using a round-robin policy.
+
+::
+
+        Example for 2 PFs and 6 channels:
+        +--------+--------+
+        | ch idx | PF idx |
+        +--------+--------+
+        |    0   |    0   |
+        |    1   |    1   |
+        |    2   |    0   |
+        |    3   |    1   |
+        |    4   |    0   |
+        |    5   |    1   |
+        +--------+--------+
+
+
+We prefer this round-robin distribution policy over the intuitive alternative of first assigning
+one half of the channels to PF0 and then the second half to PF1.
+
+The reason we prefer round-robin is that it is less influenced by changes in the number of channels:
+the mapping between a channel index and a PF is fixed, no matter how many channels the user configures.
+As channel stats persist across channel closure, changing the mapping on every reconfiguration
+would make the accumulated stats less representative of the channel's history.
+
+This is achieved by using the correct core device instance (mdev) in each channel, instead of all
+channels using the same instance under "priv->mdev".
+
+Topology
+========
+Currently, sysfs is kept untouched, so the netdev sysfs entries point to the primary PF.
+Enhancing sysfs to reflect the actual topology is to be discussed and contributed separately.
+For now, debugfs is used to reflect the topology:
+
+.. code-block:: bash
+
+        $ grep -H . /sys/kernel/debug/mlx5/0000\:08\:00.0/sd/*
+        /sys/kernel/debug/mlx5/0000:08:00.0/sd/group_id:0x00000101
+        /sys/kernel/debug/mlx5/0000:08:00.0/sd/primary:0000:08:00.0 vhca 0x0
+        /sys/kernel/debug/mlx5/0000:08:00.0/sd/secondary_0:0000:09:00.0 vhca 0x2
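+
+The NUMA node of each PF can be read through standard PCI sysfs. A usage sketch, assuming the
+example PCI addresses above and a layout where PF0 sits on node 0 and PF1 on node 1:
+
+.. code-block:: bash
+
+        # Read the NUMA node of each PF; -1 means the platform reported none.
+        $ cat /sys/bus/pci/devices/0000:08:00.0/numa_node
+        0
+        $ cat /sys/bus/pci/devices/0000:09:00.0/numa_node
+        1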
+
+Steering
+========
+Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.
+
+In RX, the steering tables belong to the primary PF only, and it is its role to distribute incoming
+traffic to the other PFs via cross-vhca steering capabilities. There is nothing special about the
+RSS table content, except that it needs a capable device to point to the receive queues of a
+different PF.
+
+In TX, the primary PF creates a new TX flow table, which is aliased by the secondaries, so they can
+go out to the network through it.
+
+In addition, we set a default XPS configuration that, based on the CPU, selects an SQ belonging to
+the PF on the same NUMA node as the CPU.
+
+XPS default config example:
+
+::
+
+        NUMA node(s):          2
+        NUMA node0 CPU(s):     0-11
+        NUMA node1 CPU(s):     12-23
+
+PF0 on node0, PF1 on node1.
+
+- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001
+- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000
+- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002
+- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000
+- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004
+- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000
+- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008
+- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000
+- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010
+- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000
+- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020
+- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000
+- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040
+- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000
+- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080
+- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000
+- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100
+- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000
+- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200
+- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000
+- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400
+- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000
+- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800
+- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000
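+
+The per-queue masks above can be inspected or overridden through the standard XPS sysfs interface.
+A minimal sketch, assuming the interface name eth2 used above:
+
+.. code-block:: bash
+
+        # Dump the XPS CPU mask of every TX queue of eth2.
+        for q in /sys/class/net/eth2/queues/tx-*; do
+                echo "$(basename "$q"): $(cat "$q/xps_cpus")"
+        done
+
+        # Override the mask of a single queue, e.g. restrict tx-0 to CPU 1 (bit 1 -> mask 2).
+        echo 2 > /sys/class/net/eth2/queues/tx-0/xps_cpus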
+
+Mutually exclusive features
+===========================
+
+The nature of Multi-PF, where different channels work with different PFs, conflicts with
+stateful features where the state is maintained in one of the PFs.
+For example, in the TLS device-offload feature, special context objects are created per connection
+and maintained in the PF.  Transitioning between different RQs/SQs would break the feature. Hence,
+we disable this combination for now.
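+
+Whether such an offload is currently exposed on the netdev can be checked with ethtool. A usage
+sketch, assuming the interface name eth2 used above (the exact features listed depend on the
+device and kernel):
+
+.. code-block:: bash
+
+        # List the TLS-related offload features and their current state;
+        # on a Multi-PF netdev they are expected to report as off.
+        ethtool -k eth2 | grep tls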