
[net-next,V2,15/15] Documentation: net/mlx5: Add description for Socket-Direct netdev combining

Message ID 20240208035352.387423-16-saeed@kernel.org (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series [net-next,V2,01/15] net/mlx5: Add MPIR bit in mcam_access_reg

Checks

Context | Check | Description
netdev/series_format | success | Pull request is its own cover letter
netdev/tree_selection | success | Clearly marked for net-next
netdev/ynl | success | Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present | success | Fixes tag not required for -next series
netdev/header_inline | success | No static functions without inline keyword in header files
netdev/build_32bit | success | Errors and warnings before: 8 this patch: 8
netdev/build_tools | success | No tools touched, skip
netdev/cc_maintainers | warning | 2 maintainers not CCed: linux-doc@vger.kernel.org corbet@lwn.net
netdev/build_clang | success | Errors and warnings before: 8 this patch: 8
netdev/verify_signedoff | success | Signed-off-by tag matches author and committer
netdev/deprecated_api | success | None detected
netdev/check_selftest | success | No net selftest shell script
netdev/verify_fixes | success | No Fixes tag
netdev/build_allmodconfig_warn | success | Errors and warnings before: 8 this patch: 8
netdev/checkpatch | warning | WARNING: 'exmaple' may be misspelled - perhaps 'example'? WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
netdev/build_clang_rust | success | No Rust files in patch. Skipping build
netdev/kdoc | success | Errors and warnings before: 0 this patch: 0
netdev/source_inline | success | Was 0 now: 0

Commit Message

Saeed Mahameed Feb. 8, 2024, 3:53 a.m. UTC
From: Tariq Toukan <tariqt@nvidia.com>

Add documentation for the feature and some details on some design decisions.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../ethernet/mellanox/mlx5/sd.rst             | 134 ++++++++++++++++++
 1 file changed, 134 insertions(+)
 create mode 100644 Documentation/networking/device_drivers/ethernet/mellanox/mlx5/sd.rst

Comments

Jakub Kicinski Feb. 10, 2024, 6:27 a.m. UTC | #1
On Wed,  7 Feb 2024 19:53:52 -0800 Saeed Mahameed wrote:
> From: Tariq Toukan <tariqt@nvidia.com>
> 
> Add documentation for the feature and some details on some design decisions.

Thanks.

> diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/sd.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/sd.rst

SD, which is not the same SD that Jiri and William are talking about?
Please spell out the name.

Please make this a general networking/ documentation file.

If other vendors could take a look and make sure this behavior makes
sense for their plans / future devices that'd be great.

> new file mode 100644
> index 000000000000..c8b4d8025a81
> --- /dev/null
> +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/sd.rst
> @@ -0,0 +1,134 @@
> +.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> +.. include:: <isonum.txt>
> +
> +==============================
> +Socket-Direct Netdev Combining
> +==============================
> +
> +:Copyright: |copy| 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
> +
> +Contents
> +========
> +
> +- `Background`_
> +- `Overview`_
> +- `Channels distribution`_
> +- `Steering`_
> +- `Mutually exclusive features`_
> +
> +Background
> +==========
> +
> +NVIDIA Mellanox Socket Direct technology enables several CPUs within a multi-socket server to

Please make it sound a little less like a marketing leaflet.
Isn't multi-PF netdev a better name for the construct?
We don't call aRFS "queue direct"; also, "socket" has a BSD-socket meaning.

> +connect directly to the network, each through its own dedicated PCIe interface. Through either a
> +connection harness that splits the PCIe lanes between two cards or by bifurcating a PCIe slot for a
> +single card. This results in eliminating the network traffic traversing over the internal bus
> +between the sockets, significantly reducing overhead and latency, in addition to reducing CPU
> +utilization and increasing network throughput.
> +
> +Overview
> +========
> +
> +This feature adds support for combining multiple devices (PFs) of the same port in a Socket Direct
> +environment under one netdev instance. Passing traffic through different devices belonging to
> +different NUMA sockets saves cross-numa traffic and allows apps running on the same netdev from
> +different numas to still feel a sense of proximity to the device and acheive improved performance.
> +
> +We acheive this by grouping PFs together, and creating the netdev only once all group members are
> +probed. Symmetrically, we destroy the netdev once any of the PFs is removed.

s/once/whenever/

> +The channels are distributed between all devices, a proper configuration would utilize the correct
> +close numa when working on a certain app/cpu.
> +
> +We pick one device to be a primary (leader), and it fills a special role. The other devices

"device" is probably best avoided, users may think device == card,
IIUC there's only one NIC ASIC here?

> +(secondaries) are disconnected from the network in the chip level (set to silent mode). All RX/TX

s/in/at/

> +traffic is steered through the primary to/from the secondaries.

I don't understand the "silent" part. I mean - you do pass traffic thru
them, what's the silence referring to?

> +Currently, we limit the support to PFs only, and up to two devices (sockets).
> +
> +Channels distribution
> +=====================
> +
> +Distribute the channels between the different SD-devices to acheive local numa node performance on

Something's missing in this sentence, subject "we"? 

> +multiple numas.

NUMA nodes

> +Each channel works against one specific mdev, creating all datapath queues against it. We distribute

The mix of channel and queue does not compute in this sentence for me.

Also mdev -> PF?

> +channels to mdevs in a round-robin policy.
> +
> +Example for 2 PFs and 6 channels:
> ++-------+-------+
> +| ch ix | PF ix |

ix? id or idx or index.

> ++-------+-------+
> +|   0   |   0   |
> +|   1   |   1   |
> +|   2   |   0   |
> +|   3   |   1   |
> +|   4   |   0   |
> +|   5   |   1   |
> ++-------+-------+
> +
> +This round-robin distribution policy is preferred over another suggested intuitive distribution, in
> +which we first distribute one half of the channels to PF0 and then the second half to PF1.

Preferred.. by whom? Just say that's the most broadly useful and therefore default config.

> +The reason we prefer round-robin is, it is less influenced by changes in the number of channels. The
> +mapping between a channel index and a PF is fixed, no matter how many channels the user configures.
> +As the channel stats are persistent to channels closure, changing the mapping every single time

to -> across
channels -> channel or channel's or channel closures

> +would turn the accumulative stats less representing of the channel's history.
> +
> +This is acheived by using the correct core device instance (mdev) in each channel, instead of them
> +all using the same instance under "priv->mdev".
> +
> +Steering
> +========
> +Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.
> +
> +In RX, the steering tables belong to the primary PF only, and it is its role to distribute incoming
> +traffic to other PFs, via advanced HW cross-vhca steering capabilities.

s/advanced HW//

You should cover how RSS looks - single table which functions exactly as
it would for a 1-PF device? Two-tier setup?

> +In TX, the primary PF creates a new TX flow table, which is aliased by the secondaries, so they can
> +go out to the network through it.
> +
> +In addition, we set default XPS configuration that, based on the cpu, selects an SQ belonging to the
> +PF on the same node as the cpu.
> +
> +XPS default config example:
> +
> +NUMA node(s):          2
> +NUMA node0 CPU(s):     0-11
> +NUMA node1 CPU(s):     12-23
> +
> +PF0 on node0, PF1 on node1.

You didn't cover how users are supposed to discover the topology. 
netdev is linked to a single device in sysfs, which is how we get
netdev <> NUMA node mapping today. What's the expected way to get
the NUMA nodes here?

And obviously this can't get merged until mlx5 exposes queue <> NAPI <>
IRQ mapping via the netdev genl.

<snip>

> +Mutually exclusive features
> +===========================
> +
> +The nature of socket direct, where different channels work with different PFs, conflicts with
> +stateful features where the state is maintained in one of the PFs.
> +For exmaple, in the TLS device-offload feature, special context objects are created per connection
> +and maintained in the PF.  Transitioning between different RQs/SQs would break the feature. Hence,
> +we disable this combination for now.

Samudrala, Sridhar Feb. 13, 2024, 1:11 a.m. UTC | #2
On 2/10/2024 12:27 AM, Jakub Kicinski wrote:
> On Wed,  7 Feb 2024 19:53:52 -0800 Saeed Mahameed wrote:
>> From: Tariq Toukan <tariqt@nvidia.com>
>>
>> Add documentation for the feature and some details on some design decisions.
> 
> Thanks.
> 
>> diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/sd.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/sd.rst
> 
> SD, which is not the same SD that Jiri and William are talking about?
> Please spell out the name.
> 
> Please make this a general networking/ documentation file.
> 
> If other vendors could take a look and make sure this behavior makes
> sense for their plans / future devices that'd be great.
> 
>> new file mode 100644
>> index 000000000000..c8b4d8025a81
>> --- /dev/null
>> +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/sd.rst
>> @@ -0,0 +1,134 @@
>> +.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
>> +.. include:: <isonum.txt>
>> +
>> +==============================
>> +Socket-Direct Netdev Combining
>> +==============================
>> +
>> +:Copyright: |copy| 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
>> +
>> +Contents
>> +========
>> +
>> +- `Background`_
>> +- `Overview`_
>> +- `Channels distribution`_
>> +- `Steering`_
>> +- `Mutually exclusive features`_
>> +
>> +Background
>> +==========
>> +
>> +NVIDIA Mellanox Socket Direct technology enables several CPUs within a multi-socket server to
> 
> Please make it sound a little less like a marketing leaflet.
> Isn't multi-PF netdev a better name for the construct?
> We don't call aRFS "queue direct"; also, "socket" has a BSD-socket meaning.

Yes, Socket Direct is definitely misleading.
At Intel, we call this multi-homing, a technology where multiple PFs are
associated with a single uplink port. Multi-PF netdev sounds technically
correct.


> 
>> +connect directly to the network, each through its own dedicated PCIe interface. Through either a
>> +connection harness that splits the PCIe lanes between two cards or by bifurcating a PCIe slot for a
>> +single card. This results in eliminating the network traffic traversing over the internal bus
>> +between the sockets, significantly reducing overhead and latency, in addition to reducing CPU
>> +utilization and increasing network throughput.
>> +
>> +Overview
>> +========
>> +
>> +This feature adds support for combining multiple devices (PFs) of the same port in a Socket Direct
>> +environment under one netdev instance. Passing traffic through different devices belonging to
>> +different NUMA sockets saves cross-numa traffic and allows apps running on the same netdev from
>> +different numas to still feel a sense of proximity to the device and acheive improved performance.
>> +
>> +We acheive this by grouping PFs together, and creating the netdev only once all group members are
>> +probed. Symmetrically, we destroy the netdev once any of the PFs is removed.
> 
> s/once/whenever/
> 
>> +The channels are distributed between all devices, a proper configuration would utilize the correct
>> +close numa when working on a certain app/cpu.
>> +
>> +We pick one device to be a primary (leader), and it fills a special role. The other devices
> 
> "device" is probably best avoided, users may think device == card,
> IIUC there's only one NIC ASIC here?
> 
>> +(secondaries) are disconnected from the network in the chip level (set to silent mode). All RX/TX
> 
> s/in/at/
> 
>> +traffic is steered through the primary to/from the secondaries.
> 
> I don't understand the "silent" part. I mean - you do pass traffic thru
> them, what's the silence referring to?
> 
>> +Currently, we limit the support to PFs only, and up to two devices (sockets).
>> +
>> +Channels distribution
>> +=====================
>> +
>> +Distribute the channels between the different SD-devices to acheive local numa node performance on
> 
> Something's missing in this sentence, subject "we"?
> 
>> +multiple numas.
> 
> NUMA nodes
> 
>> +Each channel works against one specific mdev, creating all datapath queues against it. We distribute
> 
> The mix of channel and queue does not compute in this sentence for me.
> 
> Also mdev -> PF?
> 
>> +channels to mdevs in a round-robin policy.
>> +
>> +Example for 2 PFs and 6 channels:
>> ++-------+-------+
>> +| ch ix | PF ix |
> 
> ix? id or idx or index.
> 
>> ++-------+-------+
>> +|   0   |   0   |
>> +|   1   |   1   |
>> +|   2   |   0   |
>> +|   3   |   1   |
>> +|   4   |   0   |
>> +|   5   |   1   |
>> ++-------+-------+
>> +
>> +This round-robin distribution policy is preferred over another suggested intuitive distribution, in
>> +which we first distribute one half of the channels to PF0 and then the second half to PF1.
> 
> Preferred.. by whom? Just say that's the most broadly useful and therefore default config.
> 
>> +The reason we prefer round-robin is, it is less influenced by changes in the number of channels. The
>> +mapping between a channel index and a PF is fixed, no matter how many channels the user configures.
>> +As the channel stats are persistent to channels closure, changing the mapping every single time
> 
> to -> across
> channels -> channel or channel's or channel closures
> 
>> +would turn the accumulative stats less representing of the channel's history.
>> +
>> +This is acheived by using the correct core device instance (mdev) in each channel, instead of them
>> +all using the same instance under "priv->mdev".
>> +
>> +Steering
>> +========
>> +Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.
>> +
>> +In RX, the steering tables belong to the primary PF only, and it is its role to distribute incoming
>> +traffic to other PFs, via advanced HW cross-vhca steering capabilities.
> 
> s/advanced HW//
> 
> You should cover how RSS looks - single table which functions exactly as
> it would for a 1-PF device? Two-tier setup?
> 
>> +In TX, the primary PF creates a new TX flow table, which is aliased by the secondaries, so they can
>> +go out to the network through it.
>> +
>> +In addition, we set default XPS configuration that, based on the cpu, selects an SQ belonging to the
>> +PF on the same node as the cpu.
>> +
>> +XPS default config example:
>> +
>> +NUMA node(s):          2
>> +NUMA node0 CPU(s):     0-11
>> +NUMA node1 CPU(s):     12-23
>> +
>> +PF0 on node0, PF1 on node1.
> 
> You didn't cover how users are supposed to discover the topology.
> netdev is linked to a single device in sysfs, which is how we get
> netdev <> NUMA node mapping today. What's the expected way to get
> the NUMA nodes here?

In this configuration, there is a 1:N relation between the netdev and
NUMA nodes, and a 1:1 relation between each queue and its NUMA node.

It would help if the get-queue API exposed the NUMA node as a parameter.

> 
> And obviously this can't get merged until mlx5 exposes queue <> NAPI <>
> IRQ mapping via the netdev genl.
> 
> <snip>
> 
>> +Mutually exclusive features
>> +===========================
>> +
>> +The nature of socket direct, where different channels work with different PFs, conflicts with
>> +stateful features where the state is maintained in one of the PFs.
>> +For exmaple, in the TLS device-offload feature, special context objects are created per connection
>> +and maintained in the PF.  Transitioning between different RQs/SQs would break the feature. Hence,
>> +we disable this combination for now.
> 
>

Patch

diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/sd.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/sd.rst
new file mode 100644
index 000000000000..c8b4d8025a81
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/sd.rst
@@ -0,0 +1,134 @@ 
+.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+.. include:: <isonum.txt>
+
+==============================
+Socket-Direct Netdev Combining
+==============================
+
+:Copyright: |copy| 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+Contents
+========
+
+- `Background`_
+- `Overview`_
+- `Channels distribution`_
+- `Steering`_
+- `Mutually exclusive features`_
+
+Background
+==========
+
+NVIDIA Mellanox Socket Direct technology enables several CPUs within a multi-socket server to
+connect directly to the network, each through its own dedicated PCIe interface. Through either a
+connection harness that splits the PCIe lanes between two cards or by bifurcating a PCIe slot for a
+single card. This results in eliminating the network traffic traversing over the internal bus
+between the sockets, significantly reducing overhead and latency, in addition to reducing CPU
+utilization and increasing network throughput.
+
+Overview
+========
+
+This feature adds support for combining multiple devices (PFs) of the same port in a Socket Direct
+environment under one netdev instance. Passing traffic through different devices belonging to
+different NUMA sockets saves cross-numa traffic and allows apps running on the same netdev from
+different numas to still feel a sense of proximity to the device and acheive improved performance.
+
+We acheive this by grouping PFs together, and creating the netdev only once all group members are
+probed. Symmetrically, we destroy the netdev once any of the PFs is removed.
+
+The channels are distributed between all devices, a proper configuration would utilize the correct
+close numa when working on a certain app/cpu.
+
+We pick one device to be a primary (leader), and it fills a special role. The other devices
+(secondaries) are disconnected from the network in the chip level (set to silent mode). All RX/TX
+traffic is steered through the primary to/from the secondaries.
+
+Currently, we limit the support to PFs only, and up to two devices (sockets).
+
+Channels distribution
+=====================
+
+Distribute the channels between the different SD-devices to acheive local numa node performance on
+multiple numas.
+
+Each channel works against one specific mdev, creating all datapath queues against it. We distribute
+channels to mdevs in a round-robin policy.
+
+Example for 2 PFs and 6 channels:
++-------+-------+
+| ch ix | PF ix |
++-------+-------+
+|   0   |   0   |
+|   1   |   1   |
+|   2   |   0   |
+|   3   |   1   |
+|   4   |   0   |
+|   5   |   1   |
++-------+-------+
+
+This round-robin distribution policy is preferred over another suggested intuitive distribution, in
+which we first distribute one half of the channels to PF0 and then the second half to PF1.
+
+The reason we prefer round-robin is, it is less influenced by changes in the number of channels. The
+mapping between a channel index and a PF is fixed, no matter how many channels the user configures.
+As the channel stats are persistent to channels closure, changing the mapping every single time
+would turn the accumulative stats less representing of the channel's history.
+
+This is acheived by using the correct core device instance (mdev) in each channel, instead of them
+all using the same instance under "priv->mdev".
+
+Steering
+========
+Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.
+
+In RX, the steering tables belong to the primary PF only, and it is its role to distribute incoming
+traffic to other PFs, via advanced HW cross-vhca steering capabilities.
+
+In TX, the primary PF creates a new TX flow table, which is aliased by the secondaries, so they can
+go out to the network through it.
+
+In addition, we set default XPS configuration that, based on the cpu, selects an SQ belonging to the
+PF on the same node as the cpu.
+
+XPS default config example:
+
+NUMA node(s):          2
+NUMA node0 CPU(s):     0-11
+NUMA node1 CPU(s):     12-23
+
+PF0 on node0, PF1 on node1.
+
+/sys/class/net/eth2/queues/tx-0/xps_cpus:000001
+/sys/class/net/eth2/queues/tx-1/xps_cpus:001000
+/sys/class/net/eth2/queues/tx-2/xps_cpus:000002
+/sys/class/net/eth2/queues/tx-3/xps_cpus:002000
+/sys/class/net/eth2/queues/tx-4/xps_cpus:000004
+/sys/class/net/eth2/queues/tx-5/xps_cpus:004000
+/sys/class/net/eth2/queues/tx-6/xps_cpus:000008
+/sys/class/net/eth2/queues/tx-7/xps_cpus:008000
+/sys/class/net/eth2/queues/tx-8/xps_cpus:000010
+/sys/class/net/eth2/queues/tx-9/xps_cpus:010000
+/sys/class/net/eth2/queues/tx-10/xps_cpus:000020
+/sys/class/net/eth2/queues/tx-11/xps_cpus:020000
+/sys/class/net/eth2/queues/tx-12/xps_cpus:000040
+/sys/class/net/eth2/queues/tx-13/xps_cpus:040000
+/sys/class/net/eth2/queues/tx-14/xps_cpus:000080
+/sys/class/net/eth2/queues/tx-15/xps_cpus:080000
+/sys/class/net/eth2/queues/tx-16/xps_cpus:000100
+/sys/class/net/eth2/queues/tx-17/xps_cpus:100000
+/sys/class/net/eth2/queues/tx-18/xps_cpus:000200
+/sys/class/net/eth2/queues/tx-19/xps_cpus:200000
+/sys/class/net/eth2/queues/tx-20/xps_cpus:000400
+/sys/class/net/eth2/queues/tx-21/xps_cpus:400000
+/sys/class/net/eth2/queues/tx-22/xps_cpus:000800
+/sys/class/net/eth2/queues/tx-23/xps_cpus:800000
+
+Mutually exclusive features
+===========================
+
+The nature of socket direct, where different channels work with different PFs, conflicts with
+stateful features where the state is maintained in one of the PFs.
+For exmaple, in the TLS device-offload feature, special context objects are created per connection
+and maintained in the PF.  Transitioning between different RQs/SQs would break the feature. Hence,
+we disable this combination for now.
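
The channel-to-PF distribution documented in the patch is a plain round-robin, so the mapping
reduces to a modulo over the number of PFs in the group. Below is a minimal user-space sketch that
reproduces the example table (2 PFs, 6 channels); it is an illustration of the documented policy,
not driver code.

#include <stdio.h>

/*
 * Round-robin channel-to-PF mapping as described in the patch: channel
 * index i is served by PF (i % num_pfs), so a channel's PF never changes
 * when the user reconfigures the total number of channels.
 */
static int channel_to_pf(int ch_idx, int num_pfs)
{
	return ch_idx % num_pfs;
}

int main(void)
{
	const int num_pfs = 2, num_channels = 6;

	printf("ch idx | PF idx\n");
	for (int ch = 0; ch < num_channels; ch++)
		printf("   %d   |   %d\n", ch, channel_to_pf(ch, num_pfs));
	return 0;
}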
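
Similarly, the default XPS masks listed in the patch pair each TX queue with one CPU on the NUMA
node of the PF that owns the queue. The sketch below regenerates those masks for the example
topology (2 nodes, 12 CPUs per node, PF0 on node0, PF1 on node1); the per-node CPU ordering is
inferred from the values shown and is an assumption, not a general rule.

#include <stdio.h>

/*
 * Regenerate the example default XPS masks: queue q belongs to PF
 * (q % 2); CPUs on that PF's NUMA node are then assigned in order, so
 * queue q maps to CPU (q % 2) * 12 + q / 2 on this 2-node, 24-CPU box.
 */
int main(void)
{
	const int num_pfs = 2, cpus_per_node = 12, num_queues = 24;

	for (int q = 0; q < num_queues; q++) {
		int cpu = (q % num_pfs) * cpus_per_node + q / num_pfs;
		unsigned int mask = 1u << cpu;

		printf("tx-%d/xps_cpus:%06x\n", q, mask);
	}
	return 0;
}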
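
On the topology-discovery point raised in the review: a netdev is linked to a single backing device
in sysfs, so its numa_node attribute can only report one node, which is exactly the limitation for a
multi-PF group. A minimal sketch reading that attribute follows (the interface name eth2 is taken
from the example and is assumed to be PCI-backed).

#include <stdio.h>

/*
 * Read the NUMA node of the device backing a netdev. With a multi-PF
 * netdev this reports only the node of the one device the netdev is
 * linked to in sysfs; the per-queue node would have to come from a
 * per-queue interface instead, as discussed in the review.
 */
int main(void)
{
	const char *path = "/sys/class/net/eth2/device/numa_node";
	FILE *f = fopen(path, "r");
	int node;

	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%d", &node) != 1) {
		fprintf(stderr, "failed to parse %s\n", path);
		fclose(f);
		return 1;
	}
	fclose(f);
	printf("%s -> NUMA node %d\n", path, node);
	return 0;
}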