new file mode 100644
@@ -0,0 +1,284 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+Devlink Shared Descriptors
+==========================
+
+Glossary
+========
+* REP port: Representor port
+* RQ: Receive Queue or RX Queue
+* WQE: Work Queue Entry
+* IRQ: Interrupt Request
+* Channel: A channel is an IRQ and the set of queues that can trigger
+ that IRQ. ``devlink-sd`` assumes one rx queue per channel.
+* Descriptor: The data structure that describes a network packet.
+ An RQ consists of multiple descriptors.
+* Device Naming:
+
+ - Uplink representors: p<port_number>, ex: p0.
+ - PF representors: pf<port_number>hpf, ex: pf0hpf.
+ - VF representors: pf<port_number>vf<function_number>, ex: pf0vf1.
+
+Background
+==========
+The ``devlink-sd`` mechanism is targeted for the configuration of the
+shared rx descriptors that serve as a descriptor pool for ethernet
+representors (reps) to better utilize memory. The following operations
+are provided:
+
+* Add/delete a shared descriptor pool
+* Configure the pool's properties
+* Bind/unbind a representor's rx channel to a descriptor pool
+
+In switchdev mode, representors are slow-path ports that handle the
+miss traffic, i.e., traffic not being forwarded by the hardware.
+Representor ports are regular ethernet devices, with multiple channels
+consuming DMA memory. Memory consumption of the representor
+port's rx buffers can grow to several GB when scaling to 1k VF reps.
+For example, in the mlx5 driver, each RQ, typically with 1K descriptors,
+consumes 3MB of DMA memory for packet buffers, and with four channels
+per rep, 1024 reps consume 4 * 3MB * 1024 = 12GB of memory. And since rep
+ports are for slow path traffic when flows are mostly offloaded,
+most of these rep ports' rx DMA memory is idle.
+
+A network device driver consists of several channels and each channel
+represents an IRQ and a set of queues that trigger that IRQ. devlink-sd
+considers only the *regular* RX queue in each channel, e.g., mlx5's
+non-regular RQs such as XSK RQ and drop RQ are not applicable here.
+Each device driver receives packets by setting up RQs, and each RQ
+receives packets by pre-allocating a dedicated set of rx
+ring descriptors, with each descriptor pointing to a memory buffer.
+The ``shared descriptor pool`` is a descriptor and buffer sharing
+mechanism. It allows multiple RQs to use the rx ring descriptors
+from the shared descriptor pool. In other words, the RQ no longer has
+its own dedicated rx ring descriptors, which might be idle when there
+is no traffic, but it gets the descriptors from the descriptor pool.
+
+The shared descriptor pool contains rx descriptors and their memory
+buffers. When multiple representors' RQs share the same pool of rx
+descriptors, they share the same set of memory buffers. As a result,
+the heavy-traffic representors can use all descriptors from the pool,
+while the idle, no-traffic representor consumes no memory. All the
+descriptors in the descriptor pool can be used by all the RQs. This
+makes the descriptor memory usage more efficient.
+
+The diagram below first shows two representors, each with its own
+regular RQ and its own rx descriptor ring and buffers::
+
+ +--------+ +--------+
+ │ pf0vf1 │ │ pf0vf2 │
+ │ RQ │ │ RQ │
+ +--------+ +--------+
+ │ │
+ +-----┴----------+ +-----┴----------+
+ │ rx descriptors │ │ rx descriptors │
+ │ and buffers │ │ and buffers │
+ +----------------+ +----------------+
+
+With shared descriptors, the diagram below shows that two representors
+can share the same descriptor pool::
+
+ +--------+ +--------+
+ │ pf0vf1 │ │ pf0vf2 │
+ │ RQ │ │ RQ │
+ +----┬---+ +----┬---+
+ │ │
+ +---------+ +--------+
+ │ │
+ +-----┴--┴-------+
+ │ shared |
+ | rx descriptors │
+ │ and buffers │
+ +----------------+
+
+API Overview
+============
+* Name:
+ - devlink-sd : devlink shared descriptor configuration
+* Synopsis:
+ - devlink sd pool show [DEV]
+ - devlink sd pool add DEV count DESCRIPTOR_NUM
+ - devlink sd pool delete id POOL_ID
+ - devlink sd port pool show [DEV]
+ - devlink sd port pool bind DEV queue QUEUE_ID pool POOL_ID
+ - devlink sd port pool unbind DEV queue QUEUE_ID
+
+Description
+===========
+ * devlink sd pool show - show shared descriptor pools and their
+ attributes
+
+ - DEV: the devlink device that supports shared descriptor pool. If
+ this argument is omitted, all available shared descriptor devices are
+ listed.
+
+ * devlink sd pool add - add a shared descriptor pool; the driver
+ allocates and returns the descriptor pool id.
+
+ - DEV: the devlink device that supports shared descriptor pool.
+ - count DESCRIPTOR_NUM: the number of descriptors in the pool.
+
+ * devlink sd pool delete - delete shared descriptor pool
+
+ - id POOL_ID: the id of the shared descriptor pool to be deleted.
+ Make sure no RX queue of any port is using the pool before deleting it.
+
+ * devlink sd port pool show - display port-pool mappings
+
+ - DEV: the devlink device that supports shared descriptor pool.
+
+ * devlink sd port pool bind - set the port-pool mapping
+
+ - DEV: the devlink device that supports shared descriptor pool.
+ - queue QUEUE_ID: the index of the RX queue/channel. A representor
+ might have multiple RX queues/channels, so specify which queue id to
+ map to the pool.
+ - pool POOL_ID: the id of the shared descriptor pool to be mapped.
+
+ * devlink sd port pool unbind - unbind the port-pool mapping
+
+ - DEV: the devlink device that supports shared descriptor pool.
+ - queue QUEUE_ID: the index of the RX queue/channel.
+
+ * devlink dev eswitch set DEV mode switchdev shared-descs { enable |
+ disable } - enable or disable the default port-pool mapping scheme
+
+ - DEV: the devlink device that supports shared descriptor pool.
+ - shared-descs { enable | disable }: enable/disable default port-pool
+ mapping scheme. See details below.
+
+
+Example usage
+=============
+
+.. code:: shell
+
+ # Enable switchdev mode for the device
+ $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev
+
+ # Show devlink device
+ $ devlink dev show
+ pci/0000:08:00.0
+ pci/0000:08:00.1
+
+ # show existing descriptor pools
+ $ devlink sd pool show pci/0000:08:00.0
+ pci/0000:08:00.0: pool 11 count 2048
+
+ # Create a shared descriptor pool with 1024 descriptors; the driver
+ # allocates and returns the pool id 12
+ $ devlink sd pool add pci/0000:08:00.0 count 1024
+ pci/0000:08:00.0: pool 12 count 1024
+
+ # Now check the pool again
+ $ devlink sd pool show pci/0000:08:00.0
+ pci/0000:08:00.0: pool 11 count 2048
+ pci/0000:08:00.0: pool 12 count 1024
+
+ # Bind a representor port, pf0vf1, queue 0 to the shared descriptor pool
+ $ devlink sd port pool bind pf0vf1 queue 0 pool 12
+
+ # Bind a representor port, pf0vf2, queue 0 to the shared descriptor pool
+ $ devlink sd port pool bind pf0vf2 queue 0 pool 12
+
+ # Show the rep port-pool mapping of pf0vf1
+ $ devlink sd port pool show pci/0000:08:00.0/11
+ pci/0000:08:00.0/11: queue 0 pool 12
+
+ # Show the rep port-pool mapping of pf0vf2
+ $ devlink sd port pool show pf0vf2
+ # or use the devlink port handle
+ $ devlink sd port pool show pci/0000:08:00.0/22
+ pci/0000:08:00.0/22: queue 0 pool 12
+
+ # To dump all ports mapping for a device
+ $ devlink sd port pool show pci/0000:08:00.0
+ pci/0000:08:00.0/11: queue 0 pool 12
+ pci/0000:08:00.0/22: queue 0 pool 12
+
+ # Unbind a representor port, pf0vf1, queue 0 from the shared descriptor pool
+ $ devlink sd port pool unbind pf0vf1 queue 0
+ $ devlink sd port pool show pci/0000:08:00.0
+ pci/0000:08:00.0/22: queue 0 pool 12
+
+Default Mapping Scheme
+======================
+``devlink-sd`` tries to be generic and fine-grained, allowing users
+to create shared descriptor pools and bind them to representor ports in
+any mapping scheme they want. However, users typically don't want to
+do this by themselves. For convenience, ``devlink-sd`` adds a default mapping
+scheme as follows:
+
+.. code:: shell
+
+ # Create a shared descriptor pool for each rx queue of the uplink
+ # representor, assuming it has two queues:
+ $ devlink sd pool show p0
+ pci/0000:08:00.0: pool 8 count 1024 # reserved for queue 0
+ pci/0000:08:00.0: pool 9 count 1024 # reserved for queue 1
+
+ # Bind each representor rx queue to the pool of the same queue index, ex:
+ $ devlink sd port pool show pf0vf1
+ pci/0000:08:00.0/11: queue 0 pool 8
+ pci/0000:08:00.0/11: queue 1 pool 9
+
+ $ devlink sd port pool show pf0vf2
+ pci/0000:08:00.0/22: queue 0 pool 8
+ pci/0000:08:00.0/22: queue 1 pool 9
+
+The diagram shows the default mapping with two representors, each with
+two RX queues::
+
+ +--------+ +--------+ +--------+
+ │ p0 │ │ pf0vf1 │ | pf0vf2 │
+ │RQ0 RQ1│-------+ │RQ0 RQ1│ |RQ0 RQ1│
+ +-+------+ | +-+----+-+ +-+----+-+
+ | | │ | | |
+ | +------------------+ | to | |
+ | | | | POOL-8 |
+ +---v--v-+ | +-----v--+ |
+ │ POOL-8 | |---> | POOL-9 |<----------+
+ +--------+ +--------+
+ NAPI-0 NAPI-1
+
+The benefit of this default mapping is that it allows p0, the uplink
+representor, to receive packets that are destined for pf0vf1 and pf0vf2,
+simply by polling the shared descriptor pools. In the above case, p0
+has two NAPI contexts: NAPI-0 polls for RQ0 and NAPI-1 polls for RQ1.
+Since NAPI-0 receives packets by checking all the descriptors in
+POOL-8, and POOL-8 also contains packets for pf0vf1 and pf0vf2,
+polling POOL-8 receives all of their RQ0 packets. As a result, the uplink
+representor becomes the single device that receives packets for other
+representors. This makes managing pools and rx queues easier, and since
+only one NAPI can poll on one pool, no lock is required to avoid contention.
+
+Example usage (Default)
+=======================
+
+.. code:: shell
+
+ # Enable switchdev mode with additional *shared-descs* option
+ $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev \
+ shared-descs enable
+
+ # Assume two rx queues and one uplink device p0, and two reps pf0vf1 and pf0vf2
+ $ devlink sd port pool show pci/0000:08:00.0
+ pci/0000:08:00.0: queue 0 pool 8
+ pci/0000:08:00.0: queue 1 pool 9
+ pci/0000:08:00.0/11: queue 0 pool 8
+ pci/0000:08:00.0/11: queue 1 pool 9
+ pci/0000:08:00.0/22: queue 0 pool 8
+ pci/0000:08:00.0/22: queue 1 pool 9
+
+ # Disabling the *shared-descs* option falls back to non-sharing
+ $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev \
+ shared-descs disable
+
+ # pool and port-pool mappings are cleared
+ $ devlink sd port pool show pci/0000:08:00.0
+
--
2.37.1 (Apple Git-137.1)
Add devlink-sd, shared descriptor, documentation. The devlink-sd
mechanism is targeted for configuration of the shared rx descriptors
that serve as a descriptor pool for ethernet representors (reps) to
better utilize memory. The following operations are provided:

* Add/delete a shared descriptor pool
* Configure the pool's properties
* Bind/unbind a representor's rx channel to a descriptor pool

Propose new devlink objects because existing solutions below do not
fit our use cases:

1) devlink params: Need to add many new params to support the shared
   descriptor pool. It doesn't seem to be a good idea.
2) devlink-sb (shared buffer): very similar to the API proposed in this
   patch, but devlink-sb is used in ASIC hardware switch buffer and
   switch's port. Here the use case is switchdev mode with representor
   ports and its rx queues.

AFAIK, Intel's ICE driver and Broadcom's driver also have the switchdev
mode and representor ports, thus the proposed new API should be useful
for other vendors. Any comments are welcome, thanks!

Signed-off-by: William Tu <witu@nvidia.com>
---
v2: work on Jiri's internal feedback
- use more consistent device name, p0, pf0vf0, etc
- several grammar and spelling errors
- several changes to devlink sd api
  - remove hex, remove sd show, make output 1:1 mapping, use count
    instead of size, use "add" instead of "create"
- remove the use of "we"
- remove the "default" and introduce "shared-descs" in switchdev mode
- make description more consistent with definitions in ethtool,
  such as ring, channel, queue.
---
 .../networking/devlink/devlink-sd.rst | 284 ++++++++++++++++++
 1 file changed, 284 insertions(+)
 create mode 100644 Documentation/networking/devlink/devlink-sd.rst