diff mbox series

[RFC,v3,net-next] Documentation: devlink: Add devlink-sd

Message ID 20240125211219.5279-1-witu@nvidia.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 8 this patch: 8
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 0 of 0 maintainers
netdev/build_clang success Errors and warnings before: 8 this patch: 8
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 8 this patch: 8
netdev/checkpatch fail ERROR: Remove Gerrit Change-Id's before submitting upstream WARNING: Possible repeated word: 'devlink' WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

William Tu Jan. 25, 2024, 9:12 p.m. UTC
Add devlink-sd, shared descriptor, documentation. The devlink-sd
mechanism is targeted for configuration of the shared rx descriptors
that serve as a descriptor pool for ethernet representors (reps)
to better utilize memory. The following operations are provided:
 * Add/delete a shared descriptor pool
 * Configure the pool's properties
 * Bind/unbind a representor's rx channel to a descriptor pool

Propose new devlink objects because existing solutions below do
not fit our use cases:
1) devlink params: Need to add many new params to support
   the shared descriptor pool. It doesn't seem to be a good idea.
2) devlink-sb (shared buffer): very similar to the API proposed in
   this patch, but devlink-sb is used in ASIC hardware switch buffer
   and switch's port. Here the use case is switchdev mode with
   representor ports and their rx queues.

Signed-off-by: William Tu <witu@nvidia.com>
Change-Id: I1de0d9544ff8371955c6976b2d301b1630023100
---
v3: read again myself and explain NAPI context and descriptor pool
v2: work on Jiri's feedback
- use more consistent device name, p0, pf0vf0, etc
- several grammar and spelling errors
- several changes to devlink sd api
  - remove hex, remove sd show, make output 1:1 mapping, use
  count instead of size, use "add" instead of "create"
  - remove the use of "we"
- remove the "default" and introduce "shared-descs" in switchdev mode
- make description more consistent with definitions in ethtool,
such as ring, channel, queue.
---
 .../networking/devlink/devlink-sd.rst         | 296 ++++++++++++++++++
 1 file changed, 296 insertions(+)
 create mode 100644 Documentation/networking/devlink/devlink-sd.rst

Patch

diff --git a/Documentation/networking/devlink/devlink-sd.rst b/Documentation/networking/devlink/devlink-sd.rst
new file mode 100644
index 000000000000..e73587de9c50
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-sd.rst
@@ -0,0 +1,296 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+Devlink Shared Descriptors
+==========================
+
+Glossary
+========
+* REP port: Representor port
+* RQ: Receive Queue or RX Queue
+* WQE: Work Queue Entry
+* IRQ: Interrupt Request
+* Channel: A channel is an IRQ and the set of queues that can trigger
+  that IRQ.  ``devlink-sd`` assumes one rx queue per channel.
+* Descriptor: The data structure that describes a network packet.
+  An RQ consists of multiple descriptors.
+* NAPI context: New API context, associated with a device channel.
+* Device Naming:
+
+  - Uplink representors: p<port_number>, ex: p0.
+  - PF representors: pf<port_number>hpf, ex: pf0hpf.
+  - VF representors: pf<port_number>vf<function_number>, ex: pf0vf1.
+
+Background
+==========
+The ``devlink-sd`` mechanism is targeted for the configuration of the
+shared rx descriptors that serve as a descriptor pool for ethernet
+representors (reps) to better utilize memory. The following operations
+are provided:
+
+* Add/delete a shared descriptor pool
+* Configure the pool's properties
+* Bind/unbind a representor's rx queue to a descriptor pool
+
+In switchdev mode, representors are slow-path ports that handle the
+miss traffic, i.e., traffic not being forwarded by the hardware.
+Representor ports are regular ethernet devices, with multiple channels
+consuming DMA memory. Memory consumption of the representor
+ports' rx buffers can grow to several GB when scaling to 1k VF reps.
+For example, in the mlx5 driver, each RQ, with a typical 1K descriptors,
+consumes 3MB of DMA memory for packet buffers, and with four channels
+per rep, 1024 reps consume 4 * 3MB * 1024 = 12GB of memory. Since rep
+ports are for slow path traffic, most of these rep ports' rx DMA memory
+is idle when flows are forwarded directly in hardware to VFs.
+
+A network device driver consists of several channels, and each channel
+represents a NAPI context, an IRQ, and the set of queues that trigger
+that IRQ. ``devlink-sd`` considers only the *regular* RX queue in each
+channel; e.g., mlx5's non-regular RQs such as XSK RQ and drop RQ are not
+applicable here. Each device driver receives packets by setting up RQs, and
+each RQ receives packets by pre-allocating a dedicated set of rx
+ring descriptors, with each descriptor pointing to a memory buffer.
+The ``shared descriptor pool`` is a descriptor and buffer sharing
+mechanism. It allows multiple RQs to use the rx ring descriptors
+from the shared descriptor pool. In other words, the RQ no longer has
+its own dedicated rx ring descriptors, which might be idle when there
+is no traffic, but it consumes the descriptors from the descriptor
+pool only when packets arrive.
+
+The shared descriptor pool contains rx descriptors and their memory
+buffers. When multiple representors' RQs share the same pool of rx
+descriptors, they share the same set of memory buffers. As a result,
+the heavy-traffic representors can use all descriptors from the pool,
+while the idle, no-traffic representor consumes no memory. All the
+descriptors in the descriptor pool can be used by all the RQs. This
+makes the descriptor memory usage more efficient.
+
+The diagram below shows two representors, each with its own regular
+RQ, rx descriptor ring, and buffers, without a shared descriptor
+pool::
+
+      +--------+            +--------+
+      │ pf0vf1 │            │ pf0vf2 │
+      │   RQ   │            │   RQ   │
+      +--------+            +--------+
+           │                     │
+     +-----┴----------+    +-----┴----------+
+     │ rx descriptors │    │ rx descriptors │
+     │ and buffers    │    │ and buffers    │
+     +----------------+    +----------------+
+
+With shared descriptors, the diagram below shows that two representors
+can share the same descriptor pool::
+
+     +--------+            +--------+
+     │ pf0vf1 │            │ pf0vf2 │
+     │   RQ   │            │   RQ   │
+     +----┬---+            +----┬---+
+          │                     │
+          +---------+  +--------+
+                    │  │
+              +-----┴--┴-------+
+              │     shared     │
+              │ rx descriptors │
+              │ and buffers    │
+              +----------------+
+
+Packets arriving for both pf0vf1 and pf0vf2 consume the descriptors
+and buffers in the pool. Once packets are passed to the upper Linux
+network stack, the driver refills the rx descriptor with a new buffer,
+e.g., using the page_pool API.  Typically, a NAPI context is associated
+with each channel of a device, and packet reception and refilling operations
+happen in a NAPI context. The Linux kernel guarantees that only one CPU at
+any time can call NAPI poll for each napi struct. In the shared rx
+descriptors case, a race condition can happen when two NAPI contexts,
+scheduled to run on two CPUs, fetch descriptors from or refill descriptors
+to the same shared descriptor pool. Thus, the shared descriptor pool should
+either be protected by a lock or, in a better design, have a 1:1 mapping
+between descriptor pool and NAPI context of a CPU (see examples below).
+
+API Overview
+============
+* Name:
+   - devlink-sd : devlink shared descriptor configuration
+* Synopsis:
+   - devlink sd pool show [DEV]
+   - devlink sd pool add DEV count DESCRIPTOR_NUM
+   - devlink sd pool delete pool POOL_ID
+   - devlink sd port pool show [DEV]
+   - devlink sd port pool bind DEV queue QUEUE_ID pool POOL_ID
+   - devlink sd port pool unbind DEV queue QUEUE_ID
+
+Description
+===========
+ * devlink sd pool show - show shared descriptor pools and their
+   attributes
+
+    - DEV - the devlink device that supports shared descriptor pool.  If
+      this argument is omitted all available shared descriptor devices are
+      listed.
+
+ * devlink sd pool add - add a shared descriptor pool; the driver
+   allocates and returns the descriptor pool id.
+
+    - DEV: the devlink device that supports shared descriptor pool.
+    - count DESCRIPTOR_NUM: the number of descriptors in the pool.
+
+ * devlink sd pool delete - delete a shared descriptor pool
+
+    - pool POOL_ID: the id of the shared descriptor pool to be deleted.
+      Make sure no RX queue of any port is using the pool before deleting it.
+
+ * devlink sd port pool show - display port-pool mappings
+
+    - DEV: the devlink device that supports shared descriptor pool.
+
+ * devlink sd port pool bind - set the port-pool mapping
+
+    - DEV: the devlink device that supports shared descriptor pool.
+    - queue QUEUE_ID: the index of the RX queue/channel. A representor
+      might have multiple RX queues/channels, so specify which queue id
+      to map to the pool.
+    - pool POOL_ID: the id of the shared descriptor pool to be mapped.
+
+ * devlink sd port pool unbind - unbind the port-pool mapping
+
+    - DEV: the devlink device that supports shared descriptor pool.
+    - queue QUEUE_ID: the index of the RX queue/channel.
+
+ * devlink dev eswitch set DEV mode switchdev shared-descs - enable or
+   disable the default port-pool mapping scheme
+
+    - DEV: the devlink device that supports shared descriptor pool.
+    - shared-descs { enable | disable }: enable/disable the default
+      port-pool mapping scheme. See details below.
+
+
+Example usage
+=============
+
+.. code:: shell
+
+    # Enable switchdev mode for the device
+    $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev
+
+    # Show devlink devices
+    $ devlink dev show
+        pci/0000:08:00.0
+        pci/0000:08:00.1
+
+    # show existing descriptor pools
+    $ devlink sd pool show pci/0000:08:00.0
+        pci/0000:08:00.0: pool 11 count 2048
+
+    # Create a shared descriptor pool with 1024 descriptors; the driver
+    # allocates and returns the pool id 12
+    $ devlink sd pool add pci/0000:08:00.0 count 1024
+        pci/0000:08:00.0: pool 12 count 1024
+
+    # Now check the pool again
+    $ devlink sd pool show pci/0000:08:00.0
+        pci/0000:08:00.0: pool 11 count 2048
+        pci/0000:08:00.0: pool 12 count 1024
+
+    # Bind a representor port, pf0vf1, queue 0 to the shared descriptor pool
+    $ devlink sd port pool bind pf0vf1 queue 0 pool 12
+
+    # Bind a representor port, pf0vf2, queue 0 to the shared descriptor pool
+    $ devlink sd port pool bind pf0vf2 queue 0 pool 12
+
+    # Show the rep port-pool mapping of pf0vf1
+    $ devlink sd port pool show pci/0000:08:00.0/11
+        pci/0000:08:00.0/11: queue 0 pool 12
+
+    # Show the rep port-pool mapping of pf0vf2
+    $ devlink sd port pool show pf0vf2
+    # or use the devlink port handle
+    $ devlink sd port pool show pci/0000:08:00.0/22
+        pci/0000:08:00.0/22: queue 0 pool 12
+
+    # To dump all ports mapping for a device
+    $ devlink sd port pool show pci/0000:08:00.0
+        pci/0000:08:00.0/11: queue 0 pool 12
+        pci/0000:08:00.0/22: queue 0 pool 12
+
+    # Unbind a representor port, pf0vf1, queue 0 from the shared descriptor pool
+    $ devlink sd port pool unbind pf0vf1 queue 0
+    $ devlink sd port pool show pci/0000:08:00.0
+        pci/0000:08:00.0/22: queue 0 pool 12
+
+Default Mapping Scheme
+======================
+``devlink-sd`` tries to be generic and fine-grained, allowing users
+to create shared descriptor pools and bind them to representor ports in
+any mapping scheme they want. However, users typically don't want to
+do this themselves. For convenience, ``devlink-sd`` adds a default mapping
+scheme as follows:
+
+.. code:: shell
+
+   # Create a shared descriptor pool for each rx queue of the uplink
+   # representor, assuming it has two queues:
+   $ devlink sd pool show p0
+       pci/0000:08:00.0: pool 8 count 1024 # reserved for queue 0
+       pci/0000:08:00.0: pool 9 count 1024 # reserved for queue 1
+
+   # Bind each representor queue to the pool of the matching uplink queue, ex:
+   $ devlink sd port pool show pf0vf1
+        pci/0000:08:00.0/11: queue 0 pool 8
+        pci/0000:08:00.0/11: queue 1 pool 9
+
+   $ devlink sd port pool show pf0vf2
+        pci/0000:08:00.0/22: queue 0 pool 8
+        pci/0000:08:00.0/22: queue 1 pool 9
+
+The diagram shows the default mapping with two representors, each with
+two RX queues::
+
+     +--------+            +--------+     +--------+
+     |   p0   |            | pf0vf1 |     | pf0vf2 |
+     |RQ0  RQ1|-------+    |RQ0  RQ1|     |RQ0  RQ1|
+     +-+------+       |    +-+----+-+     +-+----+-+
+       |              |      |    |         |    |
+       |  +------------------+    |     to  |    |
+       |  |           |           |      POOL-8  |
+   +---v--v-+         |     +-----v--+           |
+   | POOL-8 |         |---> | POOL-9 |<----------+
+   +--------+               +--------+
+    NAPI-0                    NAPI-1
+
+The benefit of this default mapping is that it allows p0, the uplink
+representor, to receive packets that are destined for pf0vf1 and pf0vf2,
+simply by polling the shared descriptor pools. In the above case, p0
+has two NAPI contexts: NAPI-0 polls for RQ0 and NAPI-1 polls for RQ1.
+Since NAPI-0 receives packets by checking all the descriptors in
+POOL-8, and POOL-8 also contains packets for pf0vf1 and pf0vf2,
+polling POOL-8 receives the RQ0 packets of all three devices; NAPI-1
+does the same for POOL-9. As a result, the uplink representor
+becomes the single device that receives packets for other representors.
+This makes managing pools and rx queues easier, and since only one NAPI
+context polls each pool, no lock is required to avoid contention.
+
+Example usage (Default)
+=======================
+
+.. code:: shell
+
+    # Enable switchdev mode with additional *shared-descs* option
+    $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev \
+      shared-descs enable
+
+    # Assume two rx queues and one uplink device p0, and two reps pf0vf1 and pf0vf2
+    $ devlink sd port pool show pci/0000:08:00.0
+        pci/0000:08:00.0: queue 0 pool 8
+        pci/0000:08:00.0: queue 1 pool 9
+        pci/0000:08:00.0/11: queue 0 pool 8
+        pci/0000:08:00.0/11: queue 1 pool 9
+        pci/0000:08:00.0/22: queue 0 pool 8
+        pci/0000:08:00.0/22: queue 1 pool 9
+
+    # Disabling the *shared-descs* option falls back to non-sharing
+    $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev \
+      shared-descs disable
+
+    # pool and port-pool mappings are cleared
+    $ devlink sd port pool show pci/0000:08:00.0
+
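
As a rough illustration of the memory arithmetic in the Background section and of the default mapping scheme, the Python sketch below reuses the figures quoted in the document (3MB per RQ, four channels, 1024 reps; pools 8 and 9 for a two-queue uplink). The function names and the cost model, in which a shared pool costs as much as one dedicated RQ with the same descriptor count, are assumptions made for illustration and are not part of the patch, the mlx5 driver, or the proposed devlink-sd API.

```python
# Illustrative model only: names and the cost model are assumptions,
# not taken from the mlx5 driver or the proposed devlink-sd API.

MB_PER_RQ = 3          # per-RQ buffer memory quoted in the document (1K descriptors)
CHANNELS_PER_REP = 4   # channels per representor, one regular RQ each
NUM_REPS = 1024        # "1k VF reps"


def dedicated_memory_mb(num_reps, channels, mb_per_rq):
    """Every RQ of every representor owns its descriptors and buffers."""
    return num_reps * channels * mb_per_rq


def shared_memory_mb(channels, mb_per_pool):
    """Default mapping: one pool per uplink queue index, shared by all reps.

    Assumes a pool costs as much as one dedicated RQ with the same
    descriptor count (hypothetical cost model).
    """
    return channels * mb_per_pool


def default_mapping(reps, channels, first_pool_id):
    """Bind queue i of every representor to the pool of uplink queue i."""
    return {
        (rep, queue): first_pool_id + queue
        for rep in reps
        for queue in range(channels)
    }


if __name__ == "__main__":
    dedicated = dedicated_memory_mb(NUM_REPS, CHANNELS_PER_REP, MB_PER_RQ)
    shared = shared_memory_mb(CHANNELS_PER_REP, MB_PER_RQ)
    print(f"dedicated: {dedicated} MB ({dedicated // 1024} GB)")
    print(f"shared:    {shared} MB")

    mapping = default_mapping(["p0", "pf0vf1", "pf0vf2"],
                              channels=2, first_pool_id=8)
    for (rep, queue), pool in sorted(mapping.items()):
        print(f"{rep}: queue {queue} pool {pool}")
```

Because each pool id depends only on the queue index, every pool in this model is polled by exactly one uplink NAPI context, which is the 1:1 pool-to-NAPI property the document relies on to avoid locking.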