
[net-next,V4,15/15] Documentation: networking: Add description for multi-pf netdev

Message ID 20240302072246.67920-16-saeed@kernel.org (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series [net-next,V4,01/15] net/mlx5: Add MPIR bit in mcam_access_reg

Checks

Context Check Description
netdev/series_format success Pull request is its own cover letter
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 8 this patch: 8
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 2 maintainers not CCed: linux-doc@vger.kernel.org corbet@lwn.net
netdev/build_clang success Errors and warnings before: 957 this patch: 957
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 8 this patch: 8
netdev/checkpatch warning WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-03-03--03-00 (tests: 794)

Commit Message

Saeed Mahameed March 2, 2024, 7:22 a.m. UTC
From: Tariq Toukan <tariqt@nvidia.com>

Add documentation for the multi-pf netdev feature.
Describe the mlx5 implementation and design decisions.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 Documentation/networking/index.rst           |   1 +
 Documentation/networking/multi-pf-netdev.rst | 177 +++++++++++++++++++
 2 files changed, 178 insertions(+)
 create mode 100644 Documentation/networking/multi-pf-netdev.rst

Comments

Przemek Kitszel March 4, 2024, 12:03 p.m. UTC | #1
On 3/2/24 08:22, Saeed Mahameed wrote:
> From: Tariq Toukan <tariqt@nvidia.com>
> 
> Add documentation for the multi-pf netdev feature.
> Describe the mlx5 implementation and design decisions.
> 
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> ---
>   Documentation/networking/index.rst           |   1 +
>   Documentation/networking/multi-pf-netdev.rst | 177 +++++++++++++++++++
>   2 files changed, 178 insertions(+)
>   create mode 100644 Documentation/networking/multi-pf-netdev.rst
> 
> diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
> index 69f3d6dcd9fd..473d72c36d61 100644
> --- a/Documentation/networking/index.rst
> +++ b/Documentation/networking/index.rst
> @@ -74,6 +74,7 @@ Contents:
>      mpls-sysctl
>      mptcp-sysctl
>      multiqueue
> +   multi-pf-netdev
>      napi
>      net_cachelines/index
>      netconsole
> diff --git a/Documentation/networking/multi-pf-netdev.rst b/Documentation/networking/multi-pf-netdev.rst
> new file mode 100644
> index 000000000000..f6f782374b71
> --- /dev/null
> +++ b/Documentation/networking/multi-pf-netdev.rst
> @@ -0,0 +1,177 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +.. include:: <isonum.txt>
> +
> +===============
> +Multi-PF Netdev
> +===============
> +
> +Contents
> +========
> +
> +- `Background`_
> +- `Overview`_
> +- `mlx5 implementation`_
> +- `Channels distribution`_
> +- `Observability`_
> +- `Steering`_
> +- `Mutually exclusive features`_

this document describes mlx5 details mostly, and I would expect to find
them in a mlx5.rst file instead of vendor-agnostic doc

> +
> +Background
> +==========
> +
> +The advanced Multi-PF NIC technology enables several CPUs within a multi-socket server to

please remove the `advanced` word

> +connect directly to the network, each through its own dedicated PCIe interface. Through either a
> +connection harness that splits the PCIe lanes between two cards or by bifurcating a PCIe slot for a
> +single card. This results in eliminating the network traffic traversing over the internal bus
> +between the sockets, significantly reducing overhead and latency, in addition to reducing CPU
> +utilization and increasing network throughput.
> +
> +Overview
> +========
> +
> +The feature adds support for combining multiple PFs of the same port in a Multi-PF environment under
> +one netdev instance. It is implemented in the netdev layer. Lower-layer instances like pci func,
> +sysfs entry, devlink) are kept separate.
> +Passing traffic through different devices belonging to different NUMA sockets saves cross-numa

please consider spelling out NUMA as always capitalized

> +traffic and allows apps running on the same netdev from different numas to still feel a sense of
> +proximity to the device and achieve improved performance.
> +
> +mlx5 implementation
> +===================
> +
> +Multi-PF or Socket-direct in mlx5 is achieved by grouping PFs together which belong to the same
> +NIC and has the socket-direct property enabled, once all PFS are probed, we create a single netdev

s/PFS/PFs/

> +to represent all of them, symmetrically, we destroy the netdev whenever any of the PFs is removed.
> +
> +The netdev network channels are distributed between all devices, a proper configuration would utilize
> +the correct close numa node when working on a certain app/cpu.

CPU

> +
> +We pick one PF to be a primary (leader), and it fills a special role. The other devices
> +(secondaries) are disconnected from the network at the chip level (set to silent mode). In silent
> +mode, no south <-> north traffic flowing directly through a secondary PF. It needs the assistance of
> +the leader PF (east <-> west traffic) to function. All RX/TX traffic is steered through the primary

Rx, Tx (whole document)

> +to/from the secondaries.
> +
> +Currently, we limit the support to PFs only, and up to two PFs (sockets).
> +
> +Channels distribution
> +=====================
> +
> +We distribute the channels between the different PFs to achieve local NUMA node performance
> +on multiple NUMA nodes.
> +
> +Each combined channel works against one specific PF, creating all its datapath queues against it. We
> +distribute channels to PFs in a round-robin policy.
> +
> +::
> +
> +        Example for 2 PFs and 5 channels:
> +        +--------+--------+
> +        | ch idx | PF idx |
> +        +--------+--------+
> +        |    0   |    0   |
> +        |    1   |    1   |
> +        |    2   |    0   |
> +        |    3   |    1   |
> +        |    4   |    0   |
> +        +--------+--------+
> +
> +
> +We prefer this round-robin distribution policy over another suggested intuitive distribution, in
> +which we first distribute one half of the channels to PF0 and then the second half to PF1.

Please rephrase to describe current state (which makes sense over what
was suggested), instead of addressing feedback (that could be kept in
cover letter if you really want).

And again, the wording "we" clearly indicates that this section, as
future ones, is mlx specific.

> +
> +The reason we prefer round-robin is, it is less influenced by changes in the number of channels. The
> +mapping between a channel index and a PF is fixed, no matter how many channels the user configures.
> +As the channel stats are persistent across channel's closure, changing the mapping every single time
> +would turn the accumulative stats less representing of the channel's history.
> +
> +This is achieved by using the correct core device instance (mdev) in each channel, instead of them
> +all using the same instance under "priv->mdev".
> +
> +Observability
> +=============
> +The relation between PF, irq, napi, and queue can be observed via netlink spec:
> +
> +$ ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml --dump queue-get --json='{"ifindex": 13}'
> +[{'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'rx'},
> + {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'rx'},
> + {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'rx'},
> + {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'rx'},
> + {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'rx'},
> + {'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'tx'},
> + {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'tx'},
> + {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'tx'},
> + {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'tx'},
> + {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'tx'}]
> +
> +$ ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml --dump napi-get --json='{"ifindex": 13}'
> +[{'id': 543, 'ifindex': 13, 'irq': 42},
> + {'id': 542, 'ifindex': 13, 'irq': 41},
> + {'id': 541, 'ifindex': 13, 'irq': 40},
> + {'id': 540, 'ifindex': 13, 'irq': 39},
> + {'id': 539, 'ifindex': 13, 'irq': 36}]
> +
> +Here you can clearly observe our channels distribution policy:
> +
> +$ ls /proc/irq/{36,39,40,41,42}/mlx5* -d -1
> +/proc/irq/36/mlx5_comp1@pci:0000:08:00.0
> +/proc/irq/39/mlx5_comp1@pci:0000:09:00.0
> +/proc/irq/40/mlx5_comp2@pci:0000:08:00.0
> +/proc/irq/41/mlx5_comp2@pci:0000:09:00.0
> +/proc/irq/42/mlx5_comp3@pci:0000:08:00.0
> +
> +Steering
> +========
> +Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.
> +
> +In RX, the steering tables belong to the primary PF only, and it is its role to distribute incoming
> +traffic to other PFs, via cross-vhca steering capabilities. Nothing special about the RSS table
> +content, except that it needs a capable device to point to the receive queues of a different PF.

I guess you cannot enable the multi-pf for an incapable device, so is there
anything noteworthy in the last sentence?

> +
> +In TX, the primary PF creates a new TX flow table, which is aliased by the secondaries, so they can
> +go out to the network through it.
> +
> +In addition, we set default XPS configuration that, based on the cpu, selects an SQ belonging to the
> +PF on the same node as the cpu.
> +
> +XPS default config example:
> +
> +NUMA node(s):          2
> +NUMA node0 CPU(s):     0-11
> +NUMA node1 CPU(s):     12-23
> +
> +PF0 on node0, PF1 on node1.
> +
> +- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001
> +- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000
> +- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002
> +- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000
> +- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004
> +- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000
> +- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008
> +- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000
> +- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010
> +- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000
> +- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020
> +- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000
> +- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040
> +- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000
> +- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080
> +- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000
> +- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100
> +- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000
> +- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200
> +- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000
> +- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400
> +- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000
> +- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800
> +- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000
> +
> +Mutually exclusive features
> +===========================
> +
> +The nature of Multi-PF, where different channels work with different PFs, conflicts with
> +stateful features where the state is maintained in one of the PFs.
> +For example, in the TLS device-offload feature, special context objects are created per connection
> +and maintained in the PF.  Transitioning between different RQs/SQs would break the feature. Hence,
> +we disable this combination for now.

 From the reading I will know what the feature is at the user level.

After splitting most of the doc out into mlx5 file, and fixing the minor
typos, feel free to add my:

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tariq Toukan March 5, 2024, 8:12 p.m. UTC | #2
On 04/03/2024 14:03, Przemek Kitszel wrote:
> On 3/2/24 08:22, Saeed Mahameed wrote:
>> From: Tariq Toukan <tariqt@nvidia.com>
>>
>> Add documentation for the multi-pf netdev feature.
>> Describe the mlx5 implementation and design decisions.
>>
>> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
>> ---
>>   Documentation/networking/index.rst           |   1 +
>>   Documentation/networking/multi-pf-netdev.rst | 177 +++++++++++++++++++
>>   2 files changed, 178 insertions(+)
>>   create mode 100644 Documentation/networking/multi-pf-netdev.rst
>>
>> diff --git a/Documentation/networking/index.rst 
>> b/Documentation/networking/index.rst
>> index 69f3d6dcd9fd..473d72c36d61 100644
>> --- a/Documentation/networking/index.rst
>> +++ b/Documentation/networking/index.rst
>> @@ -74,6 +74,7 @@ Contents:
>>      mpls-sysctl
>>      mptcp-sysctl
>>      multiqueue
>> +   multi-pf-netdev
>>      napi
>>      net_cachelines/index
>>      netconsole
>> diff --git a/Documentation/networking/multi-pf-netdev.rst 
>> b/Documentation/networking/multi-pf-netdev.rst
>> new file mode 100644
>> index 000000000000..f6f782374b71
>> --- /dev/null
>> +++ b/Documentation/networking/multi-pf-netdev.rst
>> @@ -0,0 +1,177 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +.. include:: <isonum.txt>
>> +
>> +===============
>> +Multi-PF Netdev
>> +===============
>> +
>> +Contents
>> +========
>> +
>> +- `Background`_
>> +- `Overview`_
>> +- `mlx5 implementation`_
>> +- `Channels distribution`_
>> +- `Observability`_
>> +- `Steering`_
>> +- `Mutually exclusive features`_
> 
> this document describes mlx5 details mostly, and I would expect to find
> them in a mlx5.rst file instead of vendor-agnostic doc
> 

It was originally under 
Documentation/networking/device_drivers/ethernet/mellanox/mlx5/
We moved it here with the needed changes per request.

See:
https://lore.kernel.org/all/20240209222738.4cf9f25b@kernel.org/

>> +
>> +Background
>> +==========
>> +
>> +The advanced Multi-PF NIC technology enables several CPUs within a 
>> multi-socket server to
> 
> please remove the `advanced` word
> 
>> +connect directly to the network, each through its own dedicated PCIe 
>> interface. Through either a
>> +connection harness that splits the PCIe lanes between two cards or by 
>> bifurcating a PCIe slot for a
>> +single card. This results in eliminating the network traffic 
>> traversing over the internal bus
>> +between the sockets, significantly reducing overhead and latency, in 
>> addition to reducing CPU
>> +utilization and increasing network throughput.
>> +
>> +Overview
>> +========
>> +
>> +The feature adds support for combining multiple PFs of the same port 
>> in a Multi-PF environment under
>> +one netdev instance. It is implemented in the netdev layer. 
>> Lower-layer instances like pci func,
>> +sysfs entry, devlink) are kept separate.
>> +Passing traffic through different devices belonging to different NUMA 
>> sockets saves cross-numa
> 
> please consider spelling out NUMA as always capitalized
> 
>> +traffic and allows apps running on the same netdev from different 
>> numas to still feel a sense of
>> +proximity to the device and achieve improved performance.
>> +
>> +mlx5 implementation
>> +===================
>> +
>> +Multi-PF or Socket-direct in mlx5 is achieved by grouping PFs 
>> together which belong to the same
>> +NIC and has the socket-direct property enabled, once all PFS are 
>> probed, we create a single netdev
> 
> s/PFS/PFs/
> 
>> +to represent all of them, symmetrically, we destroy the netdev 
>> whenever any of the PFs is removed.
>> +
>> +The netdev network channels are distributed between all devices, a 
>> proper configuration would utilize
>> +the correct close numa node when working on a certain app/cpu.
> 
> CPU
> 
>> +
>> +We pick one PF to be a primary (leader), and it fills a special role. 
>> The other devices
>> +(secondaries) are disconnected from the network at the chip level 
>> (set to silent mode). In silent
>> +mode, no south <-> north traffic flowing directly through a secondary 
>> PF. It needs the assistance of
>> +the leader PF (east <-> west traffic) to function. All RX/TX traffic 
>> is steered through the primary
> 
> Rx, Tx (whole document)
> 
>> +to/from the secondaries.
>> +
>> +Currently, we limit the support to PFs only, and up to two PFs 
>> (sockets).
>> +
>> +Channels distribution
>> +=====================
>> +
>> +We distribute the channels between the different PFs to achieve local 
>> NUMA node performance
>> +on multiple NUMA nodes.
>> +
>> +Each combined channel works against one specific PF, creating all its 
>> datapath queues against it. We
>> +distribute channels to PFs in a round-robin policy.
>> +
>> +::
>> +
>> +        Example for 2 PFs and 5 channels:
>> +        +--------+--------+
>> +        | ch idx | PF idx |
>> +        +--------+--------+
>> +        |    0   |    0   |
>> +        |    1   |    1   |
>> +        |    2   |    0   |
>> +        |    3   |    1   |
>> +        |    4   |    0   |
>> +        +--------+--------+
>> +
>> +
>> +We prefer this round-robin distribution policy over another suggested 
>> intuitive distribution, in
>> +which we first distribute one half of the channels to PF0 and then 
>> the second half to PF1.
> 
> Please rephrase to describe current state (which makes sense over what
> was suggested), instead of addressing feedback (that could be kept in
> cover letter if you really want).
> 
> And again, the wording "we" clearly indicates that this section, as
> future ones, is mlx specific.
> 
>> +
>> +The reason we prefer round-robin is, it is less influenced by changes 
>> in the number of channels. The
>> +mapping between a channel index and a PF is fixed, no matter how many 
>> channels the user configures.
>> +As the channel stats are persistent across channel's closure, 
>> changing the mapping every single time
>> +would turn the accumulative stats less representing of the channel's 
>> history.
>> +
>> +This is achieved by using the correct core device instance (mdev) in 
>> each channel, instead of them
>> +all using the same instance under "priv->mdev".
>> +
>> +Observability
>> +=============
>> +The relation between PF, irq, napi, and queue can be observed via 
>> netlink spec:
>> +
>> +$ ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml 
>> --dump queue-get --json='{"ifindex": 13}'
>> +[{'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'rx'},
>> + {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'rx'},
>> + {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'rx'},
>> + {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'rx'},
>> + {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'rx'},
>> + {'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'tx'},
>> + {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'tx'},
>> + {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'tx'},
>> + {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'tx'},
>> + {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'tx'}]
>> +
>> +$ ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml 
>> --dump napi-get --json='{"ifindex": 13}'
>> +[{'id': 543, 'ifindex': 13, 'irq': 42},
>> + {'id': 542, 'ifindex': 13, 'irq': 41},
>> + {'id': 541, 'ifindex': 13, 'irq': 40},
>> + {'id': 540, 'ifindex': 13, 'irq': 39},
>> + {'id': 539, 'ifindex': 13, 'irq': 36}]
>> +
>> +Here you can clearly observe our channels distribution policy:
>> +
>> +$ ls /proc/irq/{36,39,40,41,42}/mlx5* -d -1
>> +/proc/irq/36/mlx5_comp1@pci:0000:08:00.0
>> +/proc/irq/39/mlx5_comp1@pci:0000:09:00.0
>> +/proc/irq/40/mlx5_comp2@pci:0000:08:00.0
>> +/proc/irq/41/mlx5_comp2@pci:0000:09:00.0
>> +/proc/irq/42/mlx5_comp3@pci:0000:08:00.0
>> +
>> +Steering
>> +========
>> +Secondary PFs are set to "silent" mode, meaning they are disconnected 
>> from the network.
>> +
>> +In RX, the steering tables belong to the primary PF only, and it is 
>> its role to distribute incoming
>> +traffic to other PFs, via cross-vhca steering capabilities. Nothing 
>> special about the RSS table
>> +content, except that it needs a capable device to point to the 
>> receive queues of a different PF.
> 
> I guess you cannot enable the multi-pf for incapable device, so there is
> anything noteworthy in last sentence?
> 

I was asked in earlier patchsets to elaborate on this.

It tells "how" an RSS table looks like on a capable device.
Maybe I should re-phrase to emphasize the point.

It is not straightforward that we still maintain a single RSS table like 
non-multi-PF netdevs. Preserving this (over other complex alternatives) 
is what is noteworthy here.
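
A rough illustration of that point (a hypothetical sketch, not mlx5 code or
data structures): the netdev keeps one flat RSS indirection table exactly as
in the single-PF case; only the receive queue each entry resolves to may be
owned by a different PF, following the round-robin channel mapping.

# Illustrative Python sketch only. NUM_PFS/NUM_CHANNELS/TABLE_SIZE are
# assumptions matching the examples in the document, not driver constants.
NUM_PFS = 2
NUM_CHANNELS = 5
TABLE_SIZE = 16

def channel_to_pf(ch_idx):
    # Round-robin channel -> PF mapping (ch 0 -> PF0, ch 1 -> PF1, ...)
    return ch_idx % NUM_PFS

# One flat RSS indirection table, as on a non-multi-PF netdev:
# bucket i points at channel (i % NUM_CHANNELS).
indir_table = [i % NUM_CHANNELS for i in range(TABLE_SIZE)]

for bucket, ch in enumerate(indir_table):
    print("bucket %2d -> channel %d -> RQ owned by PF%d"
          % (bucket, ch, channel_to_pf(ch)))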

>> +
>> +In TX, the primary PF creates a new TX flow table, which is aliased 
>> by the secondaries, so they can
>> +go out to the network through it.
>> +
>> +In addition, we set default XPS configuration that, based on the cpu, 
>> selects an SQ belonging to the
>> +PF on the same node as the cpu.
>> +
>> +XPS default config example:
>> +
>> +NUMA node(s):          2
>> +NUMA node0 CPU(s):     0-11
>> +NUMA node1 CPU(s):     12-23
>> +
>> +PF0 on node0, PF1 on node1.
>> +
>> +- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001
>> +- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000
>> +- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002
>> +- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000
>> +- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004
>> +- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000
>> +- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008
>> +- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000
>> +- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010
>> +- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000
>> +- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020
>> +- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000
>> +- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040
>> +- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000
>> +- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080
>> +- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000
>> +- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100
>> +- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000
>> +- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200
>> +- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000
>> +- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400
>> +- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000
>> +- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800
>> +- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000
>> +
>> +Mutually exclusive features
>> +===========================
>> +
>> +The nature of Multi-PF, where different channels work with different 
>> PFs, conflicts with
>> +stateful features where the state is maintained in one of the PFs.
>> +For example, in the TLS device-offload feature, special context 
>> objects are created per connection
>> +and maintained in the PF.  Transitioning between different RQs/SQs 
>> would break the feature. Hence,
>> +we disable this combination for now.
> 
>  From the reading I will know what the feature is at the user level.
> 
> After splitting most of the doc out into mlx5 file, and fixing the minor
> typos, feel free to add my:
> 
> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
> 

Thanks.
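
As a side note for readers correlating the Observability output quoted above
(a sketch using the example values from the document; NAPI/IRQ IDs and PCI
addresses are taken from that example and will differ on other systems):

# Python sketch: join the queue-get and napi-get dumps to see which IRQ,
# and hence which PF, serves each channel.
queues = [  # from the queue-get example (rx entries only)
    {"id": 0, "napi-id": 539},
    {"id": 1, "napi-id": 540},
    {"id": 2, "napi-id": 541},
    {"id": 3, "napi-id": 542},
    {"id": 4, "napi-id": 543},
]
napi_to_irq = {539: 36, 540: 39, 541: 40, 542: 41, 543: 42}  # from napi-get
irq_to_pf = {  # from /proc/irq/<n>/mlx5_comp*@pci:<BDF>
    36: "0000:08:00.0", 39: "0000:09:00.0", 40: "0000:08:00.0",
    41: "0000:09:00.0", 42: "0000:08:00.0",
}

for q in queues:
    irq = napi_to_irq[q["napi-id"]]
    print("channel %d -> napi %d -> irq %d -> PF %s"
          % (q["id"], q["napi-id"], irq, irq_to_pf[irq]))

The alternating PCI addresses in the output show the round-robin
channel-to-PF assignment described in the document.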

Patch

diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 69f3d6dcd9fd..473d72c36d61 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -74,6 +74,7 @@  Contents:
    mpls-sysctl
    mptcp-sysctl
    multiqueue
+   multi-pf-netdev
    napi
    net_cachelines/index
    netconsole
diff --git a/Documentation/networking/multi-pf-netdev.rst b/Documentation/networking/multi-pf-netdev.rst
new file mode 100644
index 000000000000..f6f782374b71
--- /dev/null
+++ b/Documentation/networking/multi-pf-netdev.rst
@@ -0,0 +1,177 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===============
+Multi-PF Netdev
+===============
+
+Contents
+========
+
+- `Background`_
+- `Overview`_
+- `mlx5 implementation`_
+- `Channels distribution`_
+- `Observability`_
+- `Steering`_
+- `Mutually exclusive features`_
+
+Background
+==========
+
+The Multi-PF NIC technology enables several CPUs within a multi-socket server to
+connect directly to the network, each through its own dedicated PCIe interface, either via a
+connection harness that splits the PCIe lanes between two cards or by bifurcating a PCIe slot for a
+single card. This eliminates network traffic traversing the internal bus
+between the sockets, significantly reducing overhead and latency, in addition to reducing CPU
+utilization and increasing network throughput.
+
+Overview
+========
+
+The feature adds support for combining multiple PFs of the same port in a Multi-PF environment under
+one netdev instance. It is implemented in the netdev layer. Lower-layer instances (PCI function,
+sysfs entry, devlink instance) are kept separate.
+Passing traffic through different devices belonging to different NUMA sockets saves cross-NUMA
+traffic and allows apps running on the same netdev from different NUMA nodes to still feel a sense of
+proximity to the device and achieve improved performance.
+
+mlx5 implementation
+===================
+
+Multi-PF or Socket-direct in mlx5 is achieved by grouping together PFs that belong to the same
+NIC and have the socket-direct property enabled. Once all PFs are probed, we create a single netdev
+to represent all of them; symmetrically, we destroy the netdev whenever any of the PFs is removed.
+
+The netdev network channels are distributed between all devices; a proper configuration would utilize
+the NUMA node close to the CPU a given app runs on.
+
+We pick one PF to be the primary (leader), and it fills a special role. The other devices
+(secondaries) are disconnected from the network at the chip level (set to silent mode). In silent
+mode, no south <-> north traffic flows directly through a secondary PF. It needs the assistance of
+the leader PF (east <-> west traffic) to function. All Rx/Tx traffic is steered through the primary
+to/from the secondaries.
+
+Currently, we limit the support to PFs only, and up to two PFs (sockets).
+
+Channels distribution
+=====================
+
+We distribute the channels between the different PFs to achieve local NUMA node performance
+on multiple NUMA nodes.
+
+Each combined channel works against one specific PF, creating all its datapath queues against it. We
+distribute channels to PFs in a round-robin policy.
+
+::
+
+        Example for 2 PFs and 5 channels:
+        +--------+--------+
+        | ch idx | PF idx |
+        +--------+--------+
+        |    0   |    0   |
+        |    1   |    1   |
+        |    2   |    0   |
+        |    3   |    1   |
+        |    4   |    0   |
+        +--------+--------+
+
+
+We prefer this round-robin distribution policy over the alternative of assigning the first half of
+the channels to PF0 and the second half to PF1.
+
+Round-robin is less influenced by changes in the number of channels: the
+mapping between a channel index and a PF stays fixed, no matter how many channels the user configures.
+As the channel stats are persistent across a channel's closure, changing the mapping every time
+would make the accumulated stats less representative of the channel's history.
+
+This is achieved by using the correct core device instance (mdev) in each channel, instead of them
+all using the same instance under "priv->mdev".
+
+Observability
+=============
+The relation between PF, IRQ, NAPI, and queue can be observed via the netdev netlink spec:
+
+$ ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml --dump queue-get --json='{"ifindex": 13}'
+[{'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'rx'},
+ {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'rx'},
+ {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'rx'},
+ {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'rx'},
+ {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'rx'},
+ {'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'tx'},
+ {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'tx'},
+ {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'tx'},
+ {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'tx'},
+ {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'tx'}]
+
+$ ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml --dump napi-get --json='{"ifindex": 13}'
+[{'id': 543, 'ifindex': 13, 'irq': 42},
+ {'id': 542, 'ifindex': 13, 'irq': 41},
+ {'id': 541, 'ifindex': 13, 'irq': 40},
+ {'id': 540, 'ifindex': 13, 'irq': 39},
+ {'id': 539, 'ifindex': 13, 'irq': 36}]
+
+Here you can clearly observe our channel distribution policy:
+
+$ ls /proc/irq/{36,39,40,41,42}/mlx5* -d -1
+/proc/irq/36/mlx5_comp1@pci:0000:08:00.0
+/proc/irq/39/mlx5_comp1@pci:0000:09:00.0
+/proc/irq/40/mlx5_comp2@pci:0000:08:00.0
+/proc/irq/41/mlx5_comp2@pci:0000:09:00.0
+/proc/irq/42/mlx5_comp3@pci:0000:08:00.0
+
+Steering
+========
+Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.
+
+In Rx, the steering tables belong to the primary PF only, and it is its role to distribute incoming
+traffic to other PFs, via cross-vhca steering capabilities. A single RSS table is still maintained,
+as on a non-multi-PF netdev, except that its entries may point to the receive queues of a different PF.
+
+In Tx, the primary PF creates a new Tx flow table, which is aliased by the secondaries, so they can
+reach the network through it.
+
+In addition, we set a default XPS configuration that, based on the CPU, selects an SQ belonging to
+the PF on the same NUMA node as the CPU.
+
+XPS default config example:
+
+NUMA node(s):          2
+NUMA node0 CPU(s):     0-11
+NUMA node1 CPU(s):     12-23
+
+PF0 on node0, PF1 on node1.
+
+- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001
+- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000
+- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002
+- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000
+- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004
+- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000
+- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008
+- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000
+- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010
+- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000
+- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020
+- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000
+- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040
+- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000
+- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080
+- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000
+- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100
+- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000
+- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200
+- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000
+- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400
+- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000
+- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800
+- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000
+
+Mutually exclusive features
+===========================
+
+The nature of Multi-PF, where different channels work with different PFs, conflicts with
+stateful features where the state is maintained in one of the PFs.
+For example, in the TLS device-offload feature, special context objects are created per connection
+and maintained in the PF.  Transitioning between different RQs/SQs would break the feature. Hence,
+we disable this combination for now.
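
As a worked cross-check of the XPS default config example above (a sketch
assuming the topology stated in the document: 2 NUMA nodes of 12 CPUs each,
PF0 on node0, PF1 on node1, 24 Tx queues; not driver code):

# Python sketch: reproduce the xps_cpus masks listed in the Steering section.
CPUS_PER_NODE = 12
NUM_QUEUES = 24

for q in range(NUM_QUEUES):
    pf = q % 2                            # round-robin channel -> PF
    cpu = (q // 2) + pf * CPUS_PER_NODE   # a CPU on the PF's NUMA node
    print("tx-%d/xps_cpus:%06x" % (q, 1 << cpu))

Each Tx queue is pinned to a single CPU on the NUMA node of the PF it belongs
to, which reproduces the masks shown in the example.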