new file mode 100644
@@ -0,0 +1,296 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+Devlink Shared Descriptors
+==========================
+
+Glossary
+========
+* REP port: Representor port
+* RQ: Receive Queue or RX Queue
+* WQE: Work Queue Entry
+* IRQ: Interrupt Request
+* Channel: A channel is an IRQ and the set of queues that can trigger
+ that IRQ. ``devlink-sd`` assumes one rx queue per channel.
+* Descriptor: The data structure that describes a network packet.
+ An RQ consists of multiple descriptors.
+* NAPI context: New API context, associated with a device channel.
+* Device Naming:
+
+  - Uplink representors: p<port_number>, e.g., p0.
+  - PF representors: pf<port_number>hpf, e.g., pf0hpf.
+  - VF representors: pf<port_number>vf<function_number>, e.g., pf0vf1.
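+
+For example, on a system with one uplink port, a PF representor, and
+two VF representors, the representor netdevs might show up as follows
+(an illustrative listing; ``eth0`` stands for an unrelated netdev):
+
+.. code:: shell
+
+  $ ls /sys/class/net
+  eth0  p0  pf0hpf  pf0vf0  pf0vf1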
+
+Background
+==========
+The ``devlink-sd`` mechanism is targeted for the configuration of the
+shared rx descriptors that serve as a descriptor pool for ethernet
+representors (reps) to better utilize memory. The following operations
+are provided:
+
+* Add/delete a shared descriptor pool
+* Configure the pool's properties
+* Bind/unbind a representor's rx queue to a descriptor pool
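+
+For example, a minimal workflow using the commands documented later in
+this file (the device name and the returned pool id are illustrative):
+
+.. code:: shell
+
+  # Add a pool, bind a representor rx queue to it, and unbind it again
+  $ devlink sd pool add pci/0000:08:00.0 count 1024
+  pci/0000:08:00.0: pool 12 count 1024
+  $ devlink sd port pool bind pf0vf1 queue 0 pool 12
+  $ devlink sd port pool unbind pf0vf1 queue 0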
+
+In switchdev mode, representors are slow-path ports that handle the
+miss traffic, i.e., traffic not being forwarded by the hardware.
+Representor ports are regular ethernet devices, with multiple channels
+consuming DMA memory. Memory consumption of the representor ports'
+rx buffers can grow to several GB when scaling to 1k VF reps. For
+example, in the mlx5 driver, each RQ, with a typical 1K descriptors,
+consumes 3MB of DMA memory for packet buffers, so with four channels
+per representor, 1024 representors consume 4 * 3MB * 1024 = 12GB of
+memory. Since rep ports carry only slow-path traffic, most of this rx
+DMA memory sits idle when flows are forwarded directly in hardware to
+VFs.
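+
+As a quick check of the arithmetic above (the numbers are the example
+values from this paragraph, not fixed driver constants):
+
+.. code:: shell
+
+  # 4 channels per rep * 3MB per RQ * 1024 reps
+  $ echo "$(( 4 * 3 * 1024 ))MB = $(( 4 * 3 * 1024 / 1024 ))GB"
+  12288MB = 12GB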
+
+A network device driver consists of several channels, where each
+channel comprises a NAPI context and the set of queues that trigger
+its IRQ. ``devlink-sd`` considers only the *regular* RX queue in each
+channel; for example, mlx5's non-regular RQs, such as the XSK RQ and
+drop RQ, are not applicable here. A device driver receives packets by
+setting up RQs, and each RQ receives packets by pre-allocating a
+dedicated set of rx ring descriptors, with each descriptor pointing
+to a memory buffer. The ``shared descriptor pool`` is a descriptor
+and buffer sharing mechanism. It allows multiple RQs to use the rx
+ring descriptors from the shared descriptor pool. In other words, an
+RQ no longer has its own dedicated rx ring descriptors, which might
+sit idle when there is no traffic; instead, it consumes descriptors
+from the pool only when packets arrive.
+
+The shared descriptor pool contains rx descriptors and their memory
+buffers. When multiple representors' RQs share the same pool of rx
+descriptors, they share the same set of memory buffers. As a result,
+heavy-traffic representors can use all the descriptors from the pool,
+while idle, no-traffic representors consume none. Since every
+descriptor in the pool can be used by every RQ, descriptor memory is
+used much more efficiently.
+
+The diagram below shows two representors, each with its own regular
+RQ and a dedicated rx descriptor ring and buffers, without a shared
+descriptor pool::
+
+    +--------+         +--------+
+    │ pf0vf1 │         │ pf0vf2 │
+    │   RQ   │         │   RQ   │
+    +--------+         +--------+
+         │                  │
+   +-----┴----------+ +-----┴----------+
+   │ rx descriptors │ │ rx descriptors │
+   │ and buffers    │ │ and buffers    │
+   +----------------+ +----------------+
+
+With shared descriptors, the diagram below shows that two representors
+can share the same descriptor pool::
+
+    +--------+          +--------+
+    │ pf0vf1 │          │ pf0vf2 │
+    │   RQ   │          │   RQ   │
+    +----┬---+          +----┬---+
+         │                   │
+         +--------+ +--------+
+                  │ │
+          +-------┴-┴------+
+          │ shared         │
+          │ rx descriptors │
+          │ and buffers    │
+          +----------------+
+
+Packets arriving for either pf0vf1 or pf0vf2 consume the descriptors
+and buffers in the pool. Once packets are passed up the Linux network
+stack, the driver refills the rx descriptors with new buffers, e.g.,
+using the page_pool API. Typically, a NAPI context is associated with
+each channel of a device, and packet reception and refill operations
+happen in that NAPI context. The Linux kernel guarantees that only one
+CPU at any time can call NAPI poll for each napi struct. In the shared
+rx descriptors case, a race condition arises when two NAPI contexts,
+scheduled to run on two CPUs, fetch or refill descriptors from/to the
+same shared descriptor pool. Thus, the shared descriptor pool must
+either be protected by a lock or, in a better design, have a 1:1
+mapping between descriptor pool and NAPI context (see the default
+mapping scheme below).
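+
+For instance, a 1:1 queue-to-pool mapping can be set up manually with
+the CLI described below (device names and pool ids are illustrative):
+
+.. code:: shell
+
+  # One pool per queue index, so each NAPI context polls its own pool
+  $ devlink sd pool add pci/0000:08:00.0 count 1024
+  pci/0000:08:00.0: pool 8 count 1024
+  $ devlink sd pool add pci/0000:08:00.0 count 1024
+  pci/0000:08:00.0: pool 9 count 1024
+  $ devlink sd port pool bind pf0vf1 queue 0 pool 8
+  $ devlink sd port pool bind pf0vf1 queue 1 pool 9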
+
+API Overview
+============
+* Name:
+ - devlink-sd : devlink shared descriptor configuration
+* Synopsis:
+ - devlink sd pool show [DEV]
+  - devlink sd pool add DEV count DESCRIPTOR_NUM
+ - devlink sd pool delete id POOL_ID
+ - devlink sd port pool show [DEV]
+ - devlink sd port pool bind DEV queue QUEUE_ID pool POOL_ID
+ - devlink sd port pool unbind DEV queue QUEUE_ID
+
+Description
+===========
+ * devlink sd pool show - show shared descriptor pools and their
+   attributes
+
+   - DEV - the devlink device that supports shared descriptor pools.
+     If this argument is omitted, the pools of all supporting devices
+     are listed.
+
+ * devlink sd pool add - add a shared descriptor pool; the driver
+   allocates the pool and returns its id
+
+   - DEV: the devlink device that supports shared descriptor pools.
+   - count DESCRIPTOR_NUM: the number of descriptors in the pool.
+
+ * devlink sd pool delete - delete a shared descriptor pool
+
+   - id POOL_ID: the id of the shared descriptor pool to be deleted.
+     Make sure no RX queue of any port is using the pool before
+     deleting it.
+
+ * devlink sd port pool show - display port-pool mappings
+
+   - DEV: the devlink device that supports shared descriptor pools.
+
+ * devlink sd port pool bind - set a port-pool mapping
+
+   - DEV: the devlink device that supports shared descriptor pools.
+   - queue QUEUE_ID: the index of the channel. A representor might
+     have multiple RX queues/channels; specify which queue to map to
+     the pool.
+   - pool POOL_ID: the id of the shared descriptor pool to be mapped.
+
+ * devlink sd port pool unbind - unbind a port-pool mapping
+
+   - DEV: the devlink device that supports shared descriptor pools.
+   - queue QUEUE_ID: the index of the RX queue/channel.
+
+ * devlink dev eswitch set DEV mode switchdev - enable or disable the
+   default port-pool mapping scheme
+
+   - DEV: the devlink device that supports shared descriptor pools.
+   - shared-descs { enable | disable }: enable/disable the default
+     port-pool mapping scheme. See details below.
+
+
+Example usage
+=============
+
+.. code:: shell
+
+ # Enable switchdev mode for the device
+  $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev
+
+  # Show devlink devices
+  $ devlink dev show
+ pci/0000:08:00.0
+ pci/0000:08:00.1
+
+  # Show existing descriptor pools
+ $ devlink sd pool show pci/0000:08:00.0
+ pci/0000:08:00.0: pool 11 count 2048
+
+  # Add a shared descriptor pool with 1024 descriptors; the driver
+  # allocates and returns pool id 12
+ $ devlink sd pool add pci/0000:08:00.0 count 1024
+ pci/0000:08:00.0: pool 12 count 1024
+
+ # Now check the pool again
+ $ devlink sd pool show pci/0000:08:00.0
+ pci/0000:08:00.0: pool 11 count 2048
+ pci/0000:08:00.0: pool 12 count 1024
+
+ # Bind a representor port, pf0vf1, queue 0 to the shared descriptor pool
+ $ devlink sd port pool bind pf0vf1 queue 0 pool 12
+
+ # Bind a representor port, pf0vf2, queue 0 to the shared descriptor pool
+ $ devlink sd port pool bind pf0vf2 queue 0 pool 12
+
+ # Show the rep port-pool mapping of pf0vf1
+ $ devlink sd port pool show pci/0000:08:00.0/11
+  pci/0000:08:00.0/11: queue 0 pool 12
+
+ # Show the rep port-pool mapping of pf0vf2
+ $ devlink sd port pool show pf0vf2
+ # or use the devlink port handle
+ $ devlink sd port pool show pci/0000:08:00.0/22
+  pci/0000:08:00.0/22: queue 0 pool 12
+
+  # Dump all port mappings of a device
+ $ devlink sd port pool show pci/0000:08:00.0
+ pci/0000:08:00.0/11: queue 0 pool 12
+ pci/0000:08:00.0/22: queue 0 pool 12
+
+ # Unbind a representor port, pf0vf1, queue 0 from the shared descriptor pool
+ $ devlink sd port pool unbind pf0vf1 queue 0
+ $ devlink sd port pool show pci/0000:08:00.0
+ pci/0000:08:00.0/22: queue 0 pool 12
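+
+  # Unbind the remaining queue and delete pool 12; a pool can only be
+  # deleted when no RX queue of any port is bound to it
+  $ devlink sd port pool unbind pf0vf2 queue 0
+  $ devlink sd pool delete id 12
+  $ devlink sd pool show pci/0000:08:00.0
+  pci/0000:08:00.0: pool 11 count 2048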
+
+Default Mapping Scheme
+======================
+``devlink-sd`` aims to be generic and fine-grained, allowing users to
+add shared descriptor pools and bind them to representor ports in any
+mapping scheme they want. Typically, however, users do not want to set
+this up by themselves. For convenience, ``devlink-sd`` provides a
+default mapping scheme as follows:
+
+.. code:: shell
+
+  # A shared descriptor pool is created for each rx queue of the
+  # uplink representor; assume it has two queues:
+ $ devlink sd pool show p0
+ pci/0000:08:00.0: pool 8 count 1024 # reserved for queue 0
+ pci/0000:08:00.0: pool 9 count 1024 # reserved for queue 1
+
+  # Each representor queue is bound to the uplink's per-queue pool, e.g.:
+  $ devlink sd port pool show pf0vf1
+ pci/0000:08:00.0/11: queue 0 pool 8
+ pci/0000:08:00.0/11: queue 1 pool 9
+
+  $ devlink sd port pool show pf0vf2
+ pci/0000:08:00.0/22: queue 0 pool 8
+ pci/0000:08:00.0/22: queue 1 pool 9
+
+The diagram below shows the default mapping for the uplink representor
+p0 and two representors, each with two RX queues::
+
+    +---------+   +---------+   +---------+
+    │   p0    │   │ pf0vf1  │   │ pf0vf2  │
+    │RQ0   RQ1│   │RQ0   RQ1│   │RQ0   RQ1│
+    +-+-----+-+   +-+-----+-+   +-+-----+-+
+      │     │       │     │       │     │
+      +-----│-------+-----│-------+     │
+      │     │             │             │
+      │     +-------------+-------------+
+      │                   │
+  +---┴----+          +---┴----+
+  │ POOL-8 │          │ POOL-9 │
+  +--------+          +--------+
+    NAPI-0              NAPI-1
+
+The benefit of this default mapping is that it allows p0, the uplink
+representor, to receive packets destined for pf0vf1 and pf0vf2 simply
+by polling the shared descriptor pools. In the above case, p0 has two
+NAPI contexts: NAPI-0 polls for RQ0 and NAPI-1 polls for RQ1. Since
+NAPI-0 receives packets by checking all the descriptors in POOL-8,
+which also contains packets for pf0vf1 and pf0vf2, polling POOL-8 and
+POOL-9 receives all the packets. As a result, the uplink representor
+becomes the single device that receives packets for the other
+representors. This makes managing pools and rx queues easier, and
+since only one NAPI context can poll a given pool, no lock is required
+to avoid contention.
+
+Example usage (Default)
+=======================
+
+.. code:: shell
+
+ # Enable switchdev mode with additional *shared-descs* option
+  $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev \
+ shared-descs enable
+
+  # Assume two rx queues, one uplink device p0, and two reps pf0vf1 and pf0vf2
+ $ devlink sd port pool show pci/0000:08:00.0
+ pci/0000:08:00.0: queue 0 pool 8
+ pci/0000:08:00.0: queue 1 pool 9
+ pci/0000:08:00.0/11: queue 0 pool 8
+ pci/0000:08:00.0/11: queue 1 pool 9
+ pci/0000:08:00.0/22: queue 0 pool 8
+ pci/0000:08:00.0/22: queue 1 pool 9
+
+  # Disabling the *shared-descs* option falls back to non-shared descriptors
+  $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev \
+ shared-descs disable
+
+  # Pools and port-pool mappings are cleared
+ $ devlink sd port pool show pci/0000:08:00.0
+
Add devlink-sd, shared descriptor, documentation. The devlink-sd
mechanism is targeted for configuration of the shared rx descriptors
that serve as a descriptor pool for ethernet representors (reps) to
better utilize memory. The following operations are provided:

* Add/delete a shared descriptor pool
* Configure the pool's properties
* Bind/unbind a representor's rx queue to a descriptor pool

Propose new devlink objects because the existing solutions below do
not fit our use cases:

1) devlink params: many new params would need to be added to support
   the shared descriptor pool, which does not seem to be a good idea.
2) devlink-sb (shared buffer): very similar to the API proposed in
   this patch, but devlink-sb is used for the ASIC's hardware switch
   buffer and the switch's ports. Here the use case is switchdev mode
   with representor ports and their rx queues.

Signed-off-by: William Tu <witu@nvidia.com>
Change-Id: I1de0d9544ff8371955c6976b2d301b1630023100
---
v3: read through again and explain NAPI context and descriptor pool
v2: address Jiri's feedback
  - use more consistent device names: p0, pf0vf0, etc.
  - fix several grammar and spelling errors
  - several changes to the devlink sd API: remove hex, remove sd show,
    make output 1:1 mapping, use count instead of size, and use "add"
    instead of "create"
  - remove the use of "we"
  - remove the "default" option and introduce "shared-descs" in
    switchdev mode
  - make descriptions more consistent with definitions in ethtool,
    such as ring, channel, and queue
---
 .../networking/devlink/devlink-sd.rst         | 296 ++++++++++++++++++
 1 file changed, 296 insertions(+)
 create mode 100644 Documentation/networking/devlink/devlink-sd.rst