[v11,00/25] nvme-tcp receive offloads

Message ID 20230203132705.627232-1-aaptel@nvidia.com (mailing list archive)

Aurelien Aptel Feb. 3, 2023, 1:26 p.m. UTC
Hi,

Here is the next iteration of our nvme-tcp receive offload series.

The main changes are in patch 3 (netlink).

Rebased on top of today's net-next:
8065c0e13f98 ("Merge branch 'yt8531-support'")

The changes are also available through git:

Repo: https://github.com/aaptel/linux.git branch nvme-rx-offload-v11
Web: https://github.com/aaptel/linux/tree/nvme-rx-offload-v11

The NVMeTCP offload was presented at netdev 0x16 (video now available):
- https://netdevconf.info/0x16/session.html?NVMeTCP-Offload-%E2%80%93-Implementation-and-Performance-Gains
- https://youtu.be/W74TR-SNgi4

From: Aurelien Aptel <aaptel@nvidia.com>
From: Shai Malin <smalin@nvidia.com>
From: Ben Ben-Ishay <benishay@nvidia.com>
From: Boris Pismenny <borisp@nvidia.com>
From: Or Gerlitz <ogerlitz@nvidia.com>
From: Yoray Zack <yorayz@nvidia.com>

=========================================

This series adds support for NVMe-TCP receive offloads. The approach here
does not mandate offloading the network stack to the device.
Instead, the NIC works together with the kernel TCP stack to offload:
1. the copy from the SKB to the block layer buffers.
2. CRC calculation and verification for received PDUs.

The series implements these as a generic offload infrastructure for storage
protocols, which we call TCP Direct Data Placement and TCP Offload CRC
respectively. We use this infrastructure to implement NVMe-TCP offload for
copy and CRC.
Future implementations can reuse the same infrastructure for other protocols
such as iSCSI.

Note:
These offloads are similar in nature to the packet-based NIC TLS offloads,
which are already upstream (see net/tls/tls_device.c).
You can read more about TLS offload here:
https://www.kernel.org/doc/html/latest/networking/tls-offload.html

Queue Level
===========
The offload for IO queues is initialized after the NVMe-TCP handshake
completes, by calling `nvme_tcp_offload_socket` with the TCP socket of the
nvme_tcp_queue. This operation sets up all relevant contexts in hardware.
If it fails, the IO queue proceeds as usual with no offload.
If it succeeds, `nvme_tcp_setup_ddp` and `nvme_tcp_teardown_ddp` may be
called to perform copy offload, and CRC offload will be used.
This initialization does not change the normal operation of NVMe-TCP in any
way besides adding the option to call the above-mentioned NDO operations.

For the admin queue, NVMe-TCP does not initialize the offload.
Instead, NVMe-TCP calls the driver to configure limits for the controller,
such as max_hw_sectors and max_segments; these must be limited to accommodate
potential HW resource limits and to improve performance.

If an error occurs and the IO queue must be closed or reconnected, the
offload is torn down and initialized again. Additionally, we handle netdev
down events via the existing error recovery flow.

IO Level
========
The NVMe-TCP layer calls the NIC driver to map block layer buffers to the
command ID (CID) using `nvme_tcp_setup_ddp` before sending the read request.
When the response is received, the NIC HW writes the PDU payload directly
into the designated buffer and builds an SKB that points into the destination
buffer. This SKB represents the entire packet received on the wire, but it
points to the block layer buffers. When NVMe-TCP attempts to copy data from
this SKB to the block layer buffer, it can skip the copy by checking in the
copying function: if (src == dst) -> skip copy.
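
For illustration, the skipped copy boils down to something like the following
(a minimal user-space sketch of the idea behind patch 5; the actual change is
a small hunk in _copy_to_iter() in lib/iov_iter.c, guarded by an ifdef):

	#include <string.h>

	/* If the NIC has already placed the payload in the destination
	 * buffer, the SKB fragment points at that same buffer, src == dst,
	 * and the memcpy can be skipped entirely. */
	static void *copy_or_skip(void *dst, const void *src, size_t len)
	{
		if (dst != src)
			memcpy(dst, src, len);
		return dst;
	}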

Finally, when the PDU has been processed to completion, the NVMe-TCP layer
releases the NIC HW context by calling `nvme_tcp_teardown_ddp` which
asynchronously unmaps the buffers from NIC HW.

The NIC must release its mapping between command IDs and the target buffers.
This mapping is released when NVMe-TCP calls the NIC driver's teardown
operation (`nvme_tcp_teardown_ddp`).
As completing IOs is performance critical, we introduce asynchronous
completions for NVMe-TCP, i.e. NVMe-TCP calls the NIC, which will later
call NVMe-TCP to complete the IO (`nvme_tcp_ddp_teardown_done`).

At the IO level, in order to use the offload only when a clear performance
improvement is expected, the offload is used only for IOs which are bigger
than io_threshold.
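
For example (illustrative only; the helper name and where the threshold is
stored are assumptions, not the exact code in the series):

	#include <stdbool.h>
	#include <stddef.h>

	/* Register a read IO for DDP only when it is larger than the
	 * configured threshold; smaller IOs stay on the regular SW path. */
	static bool ddp_worth_offloading(bool is_read, size_t io_bytes,
					 size_t io_threshold)
	{
		return is_read && io_bytes > io_threshold;
	}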

SKB
===
The DDP (zero-copy) and CRC offloads require two additional bits in the SKB.
The ddp bit prevents condensing of SKBs which are targeted for zero-copy.
The crc bit prevents GRO from coalescing SKBs with different offload values.
Both bits are similar in concept to the "decrypted" bit.

After offload is initialized, we use the SKB's crc bit to indicate that
"there was no problem with the verification of all CRC fields in this packet's
payload". The bit is set to zero if there was an error, or if the HW skipped
the offload for some reason. If *any* SKB in a PDU has (crc != 1), then the
calling driver must compute the CRC and check it. We perform this check, and
the accompanying software fallback, at the end of processing a received PDU.
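
Roughly, the per-PDU decision looks like this (a hedged sketch; the flag and
helper names are illustrative, not the series' actual symbols):

	/* 'hw_verified' is set only if every SKB carrying part of this PDU
	 * had its crc bit set; otherwise the data digest is recomputed in SW
	 * at the end of the PDU and compared against the one on the wire. */
	static int pdu_ddgst_ok(int hw_verified, unsigned int sw_ddgst,
				unsigned int wire_ddgst)
	{
		if (hw_verified)
			return 1;
		return sw_ddgst == wire_ddgst;
	}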

Resynchronization flow
======================
The resynchronization flow is performed to reset the hardware tracking of
NVMe-TCP PDUs within the TCP stream. The flow consists of a request from
the hardware, proxied by the driver, regarding a possible location of a
PDU header, followed by a response from the NVMe-TCP driver.

This flow is rare, and it should happen only after packet loss or
reordering events that involve NVMe-TCP PDU headers.

CID Mapping
===========
ConnectX-7 assumes linear CIDs (0...N-1 for a queue of size N), whereas the
Linux NVMe driver uses part of the 16-bit CCID as a generation counter.
To address that, we use the existing quirk in the NVMe layer, by which the HW
driver advertises that it does not support the full 16-bit CCID range.

Enablement on ConnectX-7
========================
By default, NVMeTCP offload is disabled in the mlx5 driver and in the nvme-tcp host.
To enable it:

	# Disable CQE compression (specific for ConnectX)
	ethtool --set-priv-flags <device> rx_cqe_compress off

	# Enable the ULP-DDP
	ethtool --ulp-ddp <device> nvme-tcp-ddp on nvme-tcp-ddgst-rx-offload on

	# Enable ULP offload in nvme-tcp
	modprobe nvme-tcp ulp_offload=1

Following the device ULP-DDP enablement, all the IO queues/sockets which are
running on the device are offloaded.

This feature requires a patched ethtool that can be obtained from
Web: https://github.com/aaptel/ethtool/tree/ulp-ddp
Git: https://github.com/aaptel/ethtool.git
Branch: ulp-ddp

Performance
===========
With this implementation, using the ConnectX-7 NIC, we were able to
demonstrate the following improvements in BW/IOPS:

Without data digest:
For  64K queued read IOs – up to 32% improvement in the BW/IOPS (111 Gbps vs. 84 Gbps).
For 512K queued read IOs – up to 55% improvement in the BW/IOPS (148 Gbps vs. 98 Gbps).

With data digest:
For  64K queued read IOs – up to 107% improvement in the BW/IOPS (111 Gbps vs. 53 Gbps).
For 512K queued read IOs – up to 138% improvement in the BW/IOPS (146 Gbps vs. 61 Gbps).

With small IOs we do not expect the offload to show a performance gain.

The test configuration:
- fio command: qd=128, jobs=8.
- Server: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz, 160 cores.

Patches
=======
Patch 1:  Introduce the infrastructure for all ULP DDP and ULP DDP CRC offloads.
Patch 2:  Add ethtool string sets for ULP DDP capability names and stats names.
Patch 3:  Add ethtool operations to get/set ULP DDP capabilities and statistics.
Patch 4:  Documentation of ULP DDP ethtool netlink messages.
Patch 5:  The iov_iter change to skip copy if (src == dst).
Patch 6:  Export the get_netdev_for_sock function from TLS to generic location.
Patch 7:  NVMe-TCP changes to call NIC driver on queue init/teardown and resync.
Patch 8:  NVMe-TCP changes to call NIC driver on IO operation
          setup/teardown, and support async completions.
Patch 9:  NVMe-TCP changes to support CRC offload on receive.
          Also, this patch moves the CRC calculation to the end of the PDU
          in case the offload requires a software fallback.
Patch 10: NVMe-TCP handling of netdev events: stop the offload if netdev is
          going down.
Patch 11: Add a module parameter to nvme-tcp to control ULP offload enablement.
Patch 12: Documentation of ULP DDP offloads.

The rest of the series is the mlx5 implementation of the offload.

Testing
=======
This series was tested on ConnectX-7 HW using various configurations
of IO sizes, queue depths, MTUs, and with both the SPDK and kernel NVMe-TCP
targets.

Compatibility
=============
* The offload works with bare-metal or SRIOV.
* The HW can support up to 64K connections per device (assuming no
  other HW accelerations are used). In this series, we introduce
  support for up to 4K connections, and we plan to increase it.
* SW TLS cannot work together with the NVMeTCP offload, as the HW
  needs to track the NVMeTCP headers in the TCP stream.
* The ConnectX HW supports TLS offload, but in ConnectX-7 the two
  features cannot coexist (and TLS offload is not part of this series).
* The ConnectX-7 NVMeTCP offload HW can support tunneling, but we
  don't see the need for this feature yet.
* NVMe poll queues are not in the scope of this series.

Future Work
===========
* NVMeTCP transmit offload.
* NVMeTCP host offloads incremental features.
* NVMeTCP target offload.

Changes since v10:
=================
- Pass extack to drivers for better error reporting in the .set_caps
  callback (Jakub).
- netlink: use new callbacks, existing macros, padding, fix sizes,
  add notifications, update specs (Jakub).

Changes since v9:
=================
- Add missing crc checks in tcp_try_coalesce() (Paolo).
- Add missing ifdef guard for socket ops (Paolo).
- Remove verbose netlink format for statistics (Jakub).
- Use regular attributes for statistics (Jakub).
- Expose and document individual stats to uAPI (Jakub).
- Move ethtool ops for caps&stats to netdev_ops->ulp_ddp_ops (Jakub).

Changes since v8:
=================
- Make stats stringset global instead of per-device (Jakub).
- Remove per-queue stats (Jakub).
- Rename ETH_SS_ULP_DDP stringset to ETH_SS_ULP_DDP_CAPS.
- Update & fix kdoc comments.
- Use 80 columns limit for nvme code.

Changes since v7:
=================
- Remove ULP DDP netdev->feature bit (Jakub).
- Expose ULP DDP capabilities to userspace via ethtool netlink messages (Jakub).
- Move ULP DDP stats to dedicated stats group (Jakub).
- Add ethtool_ops operations for setting capabilities and getting stats (Jakub).
- Move ulp_ddp_netdev_ops into net_device_ops (Jakub).
- Use union for protocol-specific struct instances (Jakub).
- Fold netdev_sk_get_lowest_dev() into get_netdev_for_sock (Christoph).
- Rename memcpy skip patch to something more obvious (Christoph).
- Move capabilities from ulp_ddp.h to ulp_ddp_caps.h.
- Add "Compatibility" section on the cover letter (Sagi).

Changes since v6:
=================
- Moved IS_ULP_{DDP,CRC} macros to skb_is_ulp_{ddp,crc} inline functions (Jakub).
- Fix copyright notice (Leon).
- Added missing ifdef to allow build with MLX5_EN_TLS disabled.
- Fix space alignment, indent and long lines (max 99 columns).
- Add missing field documentation in ulp_ddp.h.

Changes since v5:
=================
- Limit the series to RX offloads.
- Added two separate skb indications to avoid wrong flushing of GRO
  when aggregating offloaded packets.
- Use accessor functions for skb->ddp and skb->crc (Eric D) bits.
- Add kernel-doc for get_netdev_for_sock (Christoph).
- Remove ddp_iter* routines and only modify _copy_to_iter (Al Viro, Christoph).
- Remove consume skb (Sagi).
- Add a knob in the ddp limits struct for the HW driver to advertise
  if they need the nvme-tcp driver to apply the generation counter
  quirk. Use this knob for the mlx5 CX7 offload.
- bugfix: use u8 flags instead of bool in mlx5e_nvmeotcp_queue->dgst.
- bugfix: use sg_dma_len(sgl) instead of sgl->length.
- bugfix: remove sgl leak in nvme_tcp_setup_ddp().
- bugfix: remove sgl leak when only using DDGST_RX offload.
- Add error check for dma_map_sg().
- Reduce #ifdef by using dummy macros/functions.
- Remove redundant netdev null check in nvme_tcp_pdu_last_send().
- Rename ULP_DDP_RESYNC_{REQ -> PENDING}.
- Add per-ulp limits struct (Sagi).
- Add ULP DDP capabilities querying (Sagi).
- Simplify RX DDGST logic (Sagi).
- Document resync flow better.
- Add ulp_offload param to nvme-tcp module to enable ULP offload (Sagi).
- Add a revert commit to reintroduce nvme_tcp_queue->queue_size.

Changes since v4:
=================
- Add transmit offload patches.
- Use one feature bit for both receive and transmit offload.

Changes since v3:
=================
- Use DDP_TCP ifdefs in iov_iter and skb iterators to minimize impact
  when compiled out (Christoph).
- Simplify netdev references and reduce the use of
  get_netdev_for_sock (Sagi).
- Avoid "static" on its own line, move it one line down (Christoph).
- Pass (queue, skb, *offset) and retrieve the pdu_seq in
  nvme_tcp_resync_response (Sagi).
- Add missing assignment of offloading_netdev to null in offload_limits
  error case (Sagi).
- Set req->offloaded = false once -- the lifetime rules are:
  set to false on cmd_setup / set to true when ddp setup succeeds (Sagi).
- Replace pr_info_ratelimited with dev_info_ratelimited (Sagi).
- Add nvme_tcp_complete_request and invoke it from two similar call
  sites (Sagi).
- Introduce nvme_tcp_req_map_sg earlier in the series (Sagi).
- Add nvme_tcp_consume_skb and put into it a hunk from
  nvme_tcp_recv_data to handle copy with and without offload.

Changes since v2:
=================
- Use skb->ddp_crc for copy offload to avoid skb_condense.
- Default mellanox driver support to no (experimental feature).
- In iov_iter use non-ddp functions for kvec and iovec.
- Remove typecasting in NVMe-TCP.

Changes since v1:
=================
- Rework iov_iter copy skip if src==dst to be less intrusive (David Ahern).
- Add tcp-ddp documentation (David Ahern).
- Refactor mellanox driver patches into more patches (Saeed Mahameed).
- Avoid pointer casting (David Ahern).
- Rename NVMe-TCP offload flags (Shai Malin).
- Update cover-letter according to the above.

Changes since RFC v1:
=====================
- Split mlx5 driver patches to several commits.
- Fix NVMe-TCP handling of recovery flows. In particular, move queue offload
  init/teardown to the start/stop functions.

Signed-off-by: Or Gerlitz <ogerlitz@nvidia.com>

Aurelien Aptel (6):
  net/ethtool: add new stringset ETH_SS_ULP_DDP_{CAPS,STATS}
  net/ethtool: add ULP_DDP_{GET,SET} operations for caps and stats
  Documentation: document netlink ULP_DDP_GET/SET messages
  net/tls,core: export get_netdev_for_sock
  nvme-tcp: Add modparam to control the ULP offload enablement
  net/mlx5e: NVMEoTCP, statistics

Ben Ben-Ishay (8):
  iov_iter: skip copy if src == dst for direct data placement
  net/mlx5: Add NVMEoTCP caps, HW bits, 128B CQE and enumerations
  net/mlx5e: NVMEoTCP, offload initialization
  net/mlx5e: NVMEoTCP, use KLM UMRs for buffer registration
  net/mlx5e: NVMEoTCP, queue init/teardown
  net/mlx5e: NVMEoTCP, ddp setup and resync
  net/mlx5e: NVMEoTCP, async ddp invalidation
  net/mlx5e: NVMEoTCP, data-path for DDP+DDGST offload

Boris Pismenny (4):
  net: Introduce direct data placement tcp offload
  nvme-tcp: Add DDP offload control path
  nvme-tcp: Add DDP data-path
  net/mlx5e: TCP flow steering for nvme-tcp acceleration

Or Gerlitz (5):
  nvme-tcp: Deal with netdevice DOWN events
  net/mlx5e: Rename from tls to transport static params
  net/mlx5e: Refactor ico sq polling to get budget
  net/mlx5e: Have mdev pointer directly on the icosq structure
  net/mlx5e: Refactor doorbell function to allow avoiding a completion

Yoray Zack (2):
  nvme-tcp: RX DDGST offload
  Documentation: add ULP DDP offload documentation

 Documentation/netlink/specs/ethtool.yaml      |  102 ++
 Documentation/networking/ethtool-netlink.rst  |   92 ++
 Documentation/networking/index.rst            |    1 +
 Documentation/networking/statistics.rst       |    1 +
 Documentation/networking/ulp-ddp-offload.rst  |  376 ++++++
 .../net/ethernet/mellanox/mlx5/core/Kconfig   |   11 +
 .../net/ethernet/mellanox/mlx5/core/Makefile  |    3 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |    9 +
 .../net/ethernet/mellanox/mlx5/core/en/fs.h   |    4 +-
 .../ethernet/mellanox/mlx5/core/en/params.c   |   12 +-
 .../ethernet/mellanox/mlx5/core/en/params.h   |    3 +
 .../mellanox/mlx5/core/en/reporter_rx.c       |    4 +-
 .../ethernet/mellanox/mlx5/core/en/rx_res.c   |   28 +
 .../ethernet/mellanox/mlx5/core/en/rx_res.h   |    4 +
 .../net/ethernet/mellanox/mlx5/core/en/tir.c  |   15 +
 .../net/ethernet/mellanox/mlx5/core/en/tir.h  |    2 +
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h |   28 +-
 .../mlx5/core/en_accel/common_utils.h         |   32 +
 .../mellanox/mlx5/core/en_accel/en_accel.h    |    3 +
 .../mellanox/mlx5/core/en_accel/fs_tcp.c      |   12 +-
 .../mellanox/mlx5/core/en_accel/fs_tcp.h      |    2 +-
 .../mellanox/mlx5/core/en_accel/ktls.c        |    2 +-
 .../mellanox/mlx5/core/en_accel/ktls_rx.c     |    8 +-
 .../mellanox/mlx5/core/en_accel/ktls_tx.c     |    8 +-
 .../mellanox/mlx5/core/en_accel/ktls_txrx.c   |   36 +-
 .../mellanox/mlx5/core/en_accel/ktls_utils.h  |   17 +-
 .../mellanox/mlx5/core/en_accel/nvmeotcp.c    | 1116 +++++++++++++++++
 .../mellanox/mlx5/core/en_accel/nvmeotcp.h    |  142 +++
 .../mlx5/core/en_accel/nvmeotcp_rxtx.c        |  325 +++++
 .../mlx5/core/en_accel/nvmeotcp_rxtx.h        |   37 +
 .../mlx5/core/en_accel/nvmeotcp_stats.c       |   66 +
 .../mlx5/core/en_accel/nvmeotcp_utils.h       |   66 +
 .../ethernet/mellanox/mlx5/core/en_ethtool.c  |    6 +
 .../net/ethernet/mellanox/mlx5/core/en_fs.c   |    4 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |   31 +-
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |   67 +-
 .../ethernet/mellanox/mlx5/core/en_stats.h    |    7 +
 .../net/ethernet/mellanox/mlx5/core/en_txrx.c |    4 +-
 drivers/net/ethernet/mellanox/mlx5/core/fw.c  |    6 +
 .../net/ethernet/mellanox/mlx5/core/main.c    |    1 +
 drivers/nvme/host/tcp.c                       |  567 ++++++++-
 include/linux/ethtool.h                       |   32 +
 include/linux/mlx5/device.h                   |   59 +-
 include/linux/mlx5/mlx5_ifc.h                 |   82 +-
 include/linux/mlx5/qp.h                       |    1 +
 include/linux/netdevice.h                     |   18 +-
 include/linux/skbuff.h                        |   24 +
 include/net/inet_connection_sock.h            |    6 +
 include/net/ulp_ddp.h                         |  189 +++
 include/net/ulp_ddp_caps.h                    |   35 +
 include/uapi/linux/ethtool.h                  |    4 +
 include/uapi/linux/ethtool_netlink.h          |   40 +
 lib/iov_iter.c                                |    8 +-
 net/Kconfig                                   |   20 +
 net/core/dev.c                                |   26 +-
 net/core/skbuff.c                             |    3 +-
 net/ethtool/Makefile                          |    2 +-
 net/ethtool/bitset.c                          |   20 +-
 net/ethtool/common.c                          |   23 +
 net/ethtool/common.h                          |    2 +
 net/ethtool/netlink.c                         |   20 +
 net/ethtool/netlink.h                         |    4 +
 net/ethtool/strset.c                          |   11 +
 net/ethtool/ulp_ddp.c                         |  316 +++++
 net/ipv4/tcp_input.c                          |   13 +-
 net/ipv4/tcp_ipv4.c                           |    3 +
 net/ipv4/tcp_offload.c                        |    3 +
 net/tls/tls_device.c                          |   16 -
 68 files changed, 4082 insertions(+), 158 deletions(-)
 create mode 100644 Documentation/networking/ulp-ddp-offload.rst
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/common_utils.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_stats.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_utils.h
 create mode 100644 include/net/ulp_ddp.h
 create mode 100644 include/net/ulp_ddp_caps.h
 create mode 100644 net/ethtool/ulp_ddp.c

Comments

Sagi Grimberg Feb. 23, 2023, 3:33 p.m. UTC | #1
> Hi,
> 
> Here is the next iteration of our nvme-tcp receive offload series.
> 
> The main changes are in patch 3 (netlink).
> 
> Rebased on top of today net-next
> 8065c0e13f98 ("Merge branch 'yt8531-support'")
> 
> The changes are also available through git:
> 
> Repo: https://github.com/aaptel/linux.git branch nvme-rx-offload-v11
> Web: https://github.com/aaptel/linux/tree/nvme-rx-offload-v11
> 
> The NVMeTCP offload was presented in netdev 0x16 (video now available):
> - https://netdevconf.info/0x16/session.html?NVMeTCP-Offload-%E2%80%93-Implementation-and-Performance-Gains
> - https://youtu.be/W74TR-SNgi4
> 
> From: Aurelien Aptel <aaptel@nvidia.com>
> From: Shai Malin <smalin@nvidia.com>
> From: Ben Ben-Ishay <benishay@nvidia.com>
> From: Boris Pismenny <borisp@nvidia.com>
> From: Or Gerlitz <ogerlitz@nvidia.com>
> From: Yoray Zack <yorayz@nvidia.com>

Hey Aurelien and Co,

I've spent some time today looking at the last iteration of this.
What I cannot understand is how this will ever be used outside
of the kernel nvme-tcp host driver.

It seems that the interface is designed to fit only a kernel
consumer, and a very specific one.

Have you considered using a more standard interface, such that
spdk or an io_uring based initiator can use it?

To me it appears that:
- ddp limits can be obtained via getsockopt
- sk_add/sk_del can be done via setsockopt
- offloaded DDGST crc can be obtained via something like
   msghdr.msg_control
- Perhaps for setting up the offload per IO, recvmsg would be the
   vehicle with a new msg flag MSG_RCV_DDP or something, that would hide
   all the details of what the HW needs (the command_id would be set
   somewhere in the msghdr).
- And all of the resync flow would be something that a separate
   ulp socket provider would take care of. Similar to how TLS presents
   itself to a tcp application. So the application does not need to be
   aware of it.

I'm not sure that such an interface could cover everything that is needed,
but what I'm trying to convey is that the current interface limits the
usability for almost anything else. Please correct me if I'm wrong.
Is this designed to also cater to anything else outside of the kernel
nvme-tcp host driver?

> Compatibility
> =============
> * The offload works with bare-metal or SRIOV.
> * The HW can support up to 64K connections per device (assuming no
>    other HW accelerations are used). In this series, we will introduce
>    the support for up to 4k connections, and we have plans to increase it.
> * SW TLS could not work together with the NVMeTCP offload as the HW
>    will need to track the NVMeTCP headers in the TCP stream.

Can't say I like that.

> * The ConnectX HW support HW TLS, but in ConnectX-7 those features
>    could not co-exists (and it is not part of this series).
> * The NVMeTCP offload ConnectX 7 HW can support tunneling, but we
>    don’t see the need for this feature yet.
> * NVMe poll queues are not in the scope of this series.

bonding/teaming?

> 
> Future Work
> ===========
> * NVMeTCP transmit offload.
> * NVMeTCP host offloads incremental features.
> * NVMeTCP target offload.

Which target? Which host?
Shai Malin March 1, 2023, 7:13 p.m. UTC | #2
Hi Sagi,

On Thu, 23 Feb 2023, Sagi Grimberg <sagi@grimberg.me> wrote:
> Hey Aurelien and Co,
> 
> I've spent some time today looking at the last iteration of this,
> What I cannot understand, is how will this ever be used outside
> of the kernel nvme-tcp host driver?
> 
> It seems that the interface is diesigned to fit only a kernel
> consumer, and a very specific one.

As part of this series, we are only covering the kernel nvme-tcp host driver.
The ULP layer we are introducing was designed also for other kernel drivers 
such as the kernel nvme-tcp target and iSCSI.

> 
> Have you considered using a more standard interfaces to use this
> such that spdk or an io_uring based initiator can use it?

The main problem, which I will explain in more detail (under the digest,
teardown and resync flows), is that using a more standard interface that
hides what the HW needs would require duplicating the NVMeTCP logic of
tracking the PDUs in the TCP stream – this seems wrong to us.

I can add that we are also working on an end-to-end user-space design for SPDK
and we don’t see how the two designs could use the same socket APIs without 
impacting the performance gain of both cases.

> 
> To me it appears that:
> - ddp limits can be obtained via getsockopt
> - sk_add/sk_del can be done via setsockopt
> - offloaded DDGST crc can be obtained via something like
>    msghdr.msg_control

If we wish to hide it from the NVMeTCP driver, it will require us to duplicate
the NVMeTCP logic of tracking the PDUs in the TCP stream.
In the positive case, when there are no DDGST errors, nothing needs to be
done in the NVMeTCP driver, but in the case of errors (or just an
out-of-order packet in the middle of a PDU), the DDGST will need to be
calculated in SW, and it seems wrong to us to duplicate that outside of the
NVMeTCP driver.

> - Perhaps for setting up the offload per IO, recvmsg would be the
>    vehicle with a new msg flag MSG_RCV_DDP or something, that would hide
>    all the details of what the HW needs (the command_id would be set
>    somewhere in the msghdr).

Our design includes the following steps per IO:
- ddp_setup (register the command id and buffer to the HW)
- The existing flow which sends the command
- The existing flow of read_sock()
- ddp_teardown (for the per IO HW teardown, before posting the NVMe completion)

Using recvmsg() would only replace read_sock(), but this part of the NVMeTCP
driver is not impacted by the offload design.
The ddp_setup is needed in order to set the command id and the buffer, and this
needs to be done before, or as part of, sending the command, and prior to the
receive flow.
In addition, without duplicating the tracking of the PDUs in the TCP stream, 
it is not possible to hide the teardown flow from the NVMeTCP driver.

> - And all of the resync flow would be something that a separate
>    ulp socket provider would take care of. Similar to how TLS presents
>    itself to a tcp application. So the application does not need to be
>    aware of it.

The resync flow requires awareness of TCP sequence numbers and NVMe
PDU boundaries. If we hide it from the NVMeTCP driver we would have
to again duplicate the NVMe PDU tracking code.
TLS is a stream protocol and maps cleanly to TCP socket operations.
NVMeTCP, on the other hand, is a request-response protocol.
On the data path, it's not comparable.
While it could be done by contorting recvmsg() with new flags, adding input
fields to the msghdr struct and changing the behaviour of the uAPI, it would
add a lot more friction than a separate ULP DDP API specifically made for
this purpose.

> 
> I'm not sure that such interface could cover everything that is needed,
> but what I'm trying to convey, is that the current interface limits the
> usability for almost anything else. Please correct me if I'm wrong.
> Is this designed to also cater anything else outside of the kernel
> nvme-tcp host driver?

As part of this series, we are targeting the kernel nvme-tcp host driver, 
and later we are planning to add support for the kernel nvme-tcp target driver.

The ULP layer was designed to be generic for other request-response based 
protocols.

> 
> > Compatibility
> > =============
> > * The offload works with bare-metal or SRIOV.
> > * The HW can support up to 64K connections per device (assuming no
> >    other HW accelerations are used). In this series, we will introduce
> >    the support for up to 4k connections, and we have plans to increase it.
> > * SW TLS could not work together with the NVMeTCP offload as the HW
> >    will need to track the NVMeTCP headers in the TCP stream.
> 
> Can't say I like that.

The answer should be to support both TLS offload and NVMeTCP offload
together; the inability to do so is a HW limitation of our current generation.

> 
> > * The ConnectX HW support HW TLS, but in ConnectX-7 those features
> >    could not co-exists (and it is not part of this series).
> > * The NVMeTCP offload ConnectX 7 HW can support tunneling, but we
> >    don’t see the need for this feature yet.
> > * NVMe poll queues are not in the scope of this series.
> 
> bonding/teaming?

We are planning to add it as part of the "incremental feature". 
It will follow the existing design of the mlx HW TLS.

> 
> >
> > Future Work
> > ===========
> > * NVMeTCP transmit offload.
> > * NVMeTCP host offloads incremental features.
> > * NVMeTCP target offload.
> 
> Which target? which host?

Kernel nvme-tcp host driver and kernel nvme-tcp target driver.
Sagi Grimberg March 7, 2023, 10:11 a.m. UTC | #3
> Hi Sagi,

Hey Shai,

> On Thu, 23 Feb 2023, Sagi Grimberg <sagi@grimberg.me> wrote:
>> Hey Aurelien and Co,
>>
>> I've spent some time today looking at the last iteration of this,
>> What I cannot understand, is how will this ever be used outside
>> of the kernel nvme-tcp host driver?
>>
>> It seems that the interface is diesigned to fit only a kernel
>> consumer, and a very specific one.
> 
> As part of this series, we are only covering the kernel nvme-tcp host driver.
> The ULP layer we are introducing was designed also for other kernel drivers
> such as the kernel nvme-tcp target and iSCSI.

Yes, I can see how this can fit iscsi, but my question was more for
userspace.

>> Have you considered using a more standard interfaces to use this
>> such that spdk or an io_uring based initiator can use it?
> 
> The main problem which I will explain in more detail (under the digest,
> teardown and resync flows) is that in order to use a more standard interface
> that will hide what the HW needs it will require duplicating the NVMeTCP
> logic of tracking the PDUs in the TCP stream – this seems wrong to us.

Hmm. It's not really the entire logic, it's only the pdu/data
boundaries. All data coming to the host is structured, with a standard
C2HData PDU.

If we assume for the sake of the discussion that such a parser exists,
what would the interface to the ulp look like?

> I can add that we are also working on an end-to-end user-space design for SPDK
> and we don’t see how the two designs could use the same socket APIs without
> impacting the performance gain of both cases.

Can you share a bit more how this will be done for spdk? spdk runs on
normal sockets today, hence I'd love to know how you plan to introduce
it there.

> 
>>
>> To me it appears that:
>> - ddp limits can be obtained via getsockopt
>> - sk_add/sk_del can be done via setsockopt
>> - offloaded DDGST crc can be obtained via something like
>>     msghdr.msg_control
> 
> If we wish to hide it from the NVMeTCP driver it will require us to duplicate
> the NVMeTCP logic of tracking the PDUs in the TCP stream.
> In the positive case, when there are no DDGST errors, nothing is needed to be
> done in the NVMeTCP driver, but in the case of errors (or just in the case of
> out of order packet in the middle of the PDU), the DDGST will need to be
> calculated in SW and it seems wrong to us to duplicate it outside of the
> NVMeTCP driver.

I didn't suggest that the recalc would be hidden; if the calculation
failed, the ULP can do it.

Something like:
	ddp_ddgst = CMSG_DATA(CMSG_FIRSTHDR(msg))
	if (ddp_ddgst->valid) {
		great...
	} else {
		recalc_ddgst
	}

> 
>> - Perhaps for setting up the offload per IO, recvmsg would be the
>>     vehicle with a new msg flag MSG_RCV_DDP or something, that would hide
>>     all the details of what the HW needs (the command_id would be set
>>     somewhere in the msghdr).
> 
> Our design includes the following steps per IO:
> - ddp_setup (register the command id and buffer to the HW)
> - The existing flow which sends the command
> - The existing flow of read_sock()
> - ddp_teardown (for the per IO HW teardown, before posting the NVMe completion)
> 
> Using the recvmsg will only replace the read_sock() but this part in the NVMeTCP
> driver is not impacted by the offload design.
> The ddp_setup is needed in order to set the command id and the buffer, and this
> needs to be done before or as part of the sending of command and prior to the
> receive flow.

I agree, I didn't suggest replacing the entire read_sock flow
(although that can change to recvmsg as well).

I was thinking more along the lines of MSG_PEEK that doesn't actually
read anything, where the driver, prior to sending the command, would do:

	// do a DDP recvmsg to setup direct data placement
	msg = { .msg_flags = MSG_DDP_SETUP };
	iov_iter_bvec(&msg.msg_iter, ITER_DEST, vec, nr_bvec, size);
	sock_recvmsg(queue->sock, &msg, msg.msg_flags);

	...

	// now send the cmd to the controller
	nvme_tcp_setup_cmd_pdu
	nvme_tcp_try_send_cmd_pdu
	...

That is what I was thinking about.
	
> In addition, without duplicating the tracking of the PDUs in the TCP stream,
> it is not possible to hide the teardown flow from the NVMeTCP driver.

The teardown flow is a bummer because it delays the nvme completion.
Is it ok to dma_unmap and then schedule a ddp teardown async without
delaying the nvme completion?

How does TLS offload handles the async teardown?

> 
>> - And all of the resync flow would be something that a separate
>>     ulp socket provider would take care of. Similar to how TLS presents
>>     itself to a tcp application. So the application does not need to be
>>     aware of it.
> 
> The resync flow requires awareness of TCP sequence numbers and NVMe
> PDUs boundaries. If we hide it from the NVMeTCP driver we would have
> to again duplicate NVMe PDU tracking code.

It's really a pretty trivial part of the pdu tracking. Just when
a C2HData PDU starts...

> TLS is a stream protocol and maps cleanly to TCP socket operations.
> NVMeTCP on the other hand is a request-response protocol.
> On the data path, it's not comparable.

Is it? It looks similar to me in the sense that the input stream needs to
know when a msg starts and ends. The only difference is that the ULP
needs to provide the context to the offload in a timely manner.

> While it could be done by contorting recvmsg() with new flags, adding input
> fields to the msghdr struct and changing the behaviour of uAPI, it will add
> a lot more friction than a separate ULP DDP API specifically made for this
> purpose.

This may be true. I was merely asking if this was a path that was
genuinely explored and ruled out. To me, it sounds like it can
work, and also allow userspace to automatically gain from it.

>> I'm not sure that such interface could cover everything that is needed,
>> but what I'm trying to convey, is that the current interface limits the
>> usability for almost anything else. Please correct me if I'm wrong.
>> Is this designed to also cater anything else outside of the kernel
>> nvme-tcp host driver?
> 
> As part of this series, we are targeting the kernel nvme-tcp host driver,
> and later we are planning to add support for the kernel nvme-tcp target driver.
> 
> The ULP layer was designed to be generic for other request-response based
> protocols.

I understand.

> 
>>
>>> Compatibility
>>> =============
>>> * The offload works with bare-metal or SRIOV.
>>> * The HW can support up to 64K connections per device (assuming no
>>>     other HW accelerations are used). In this series, we will introduce
>>>     the support for up to 4k connections, and we have plans to increase it.
>>> * SW TLS could not work together with the NVMeTCP offload as the HW
>>>     will need to track the NVMeTCP headers in the TCP stream.
>>
>> Can't say I like that.
> 
> The answer should be to support both TLS offload and NVMeTCP offload,
> which is a HW limit in our current generation.

I can't complain about this because we don't have TLS supported in 
nvme-tcp. But _can_ they work together on future HW?

> 
>>
>>> * The ConnectX HW support HW TLS, but in ConnectX-7 those features
>>>     could not co-exists (and it is not part of this series).
>>> * The NVMeTCP offload ConnectX 7 HW can support tunneling, but we
>>>     don’t see the need for this feature yet.
>>> * NVMe poll queues are not in the scope of this series.
>>
>> bonding/teaming?
> 
> We are planning to add it as part of the "incremental feature".
> It will follow the existing design of the mlx HW TLS.

OK.

>>> Future Work
>>> ===========
>>> * NVMeTCP transmit offload.
>>> * NVMeTCP host offloads incremental features.
>>> * NVMeTCP target offload.
>>
>> Which target? which host?
> 
> Kernel nvme-tcp host driver and kernel nvme-tcp target driver.

Look, I did spend time looking into this patchset several times
in the past, and also provided feedback. The reason why I'm questioning
the interface now is that I'm wondering if it can be less intrusive
and use the common socket interface instead of inventing a new one.
Shai Malin March 22, 2023, 10:19 a.m. UTC | #4
Hi Sagi
 
On Tue, 7 Mar 2023 at 12:11, Sagi Grimberg <sagi@grimberg.me> wrote:
> >> Have you considered using a more standard interfaces to use this
> >> such that spdk or an io_uring based initiator can use it?
> >
> > The main problem which I will explain in more detail (under the digest,
> > teardown and resync flows) is that in order to use a more standard interface
> > that will hide what the HW needs it will require duplicating the NVMeTCP
> > logic of tracking the PDUs in the TCP stream – this seems wrong to us.
> 
> Hmm. Its not really the entire logic, its only the pdu/data
> boundaries. Every data coming to the host is structured, with a standard
> C2HData PDU.
 
In order to track the PDUs we will need to duplicate the parsing and tracking
of all the common headers of all the PDU types. That is doable, but it will
also need to be trustworthy, which means duplicating the header digest
verification if it is enabled.
 
> 
> If we assume for the sake of the discussion that such a parser exist,
> how would the interface to the ulp look like?
 
Enablement could be done via setsockopt(). We would add a new 
SOL_NVME_TCP flag and store its state in new socket struct fields.
We would also set the resync callbacks from ipvX/tcp.c (explained further 
down).
 
Per IO setup:
-------------
The ddp_setup() call would be moved to ipv4/ipv6 implementations of TCP, 
in sock_recvmsg().
As you suggested, we would add a new MSG_NVME_DDP flag to register an 
expected NVMeTCP data response PDU.             
Adding a command_id field to the msghdr struct would break
userspace ABI. Instead we would use msghdr ancillary fields and add a new
cmsg_level (SOL_NVME_TCP) and a new cmsg_type (NVME_TCP_COMMAND_ID).
	
      struct msghdr msg = { .msg_flags = MSG_NVME_DDP };
      int cmsg_len = sizeof(u32); // command_id size
      struct cmsghdr *cmsg;
      char buf[CMSG_SPACE(cmsg_len)];
 
      msg.msg_control = buf;
      msg.msg_controllen = sizeof(buf);
      cmsg = CMSG_FIRSTHDR(&msg);
      cmsg->cmsg_level = SOL_NVME_TCP;
      cmsg->cmsg_type = NVME_TCP_COMMAND_ID;
      cmsg->cmsg_len = CMSG_LEN(cmsg_len);
      *(u32 *)CMSG_DATA(cmsg) = 0x1; // command_id
      msg.msg_controllen = cmsg->cmsg_len;
 
      iov_iter_bvec(&msg.msg_iter, ITER_DEST, vec, nr_bvec, size);
      sock_recvmsg(queue->sock, &msg, msg.msg_flags);
 
Per IO teardown:
----------------
The ddp_teardown will need to remain a separate API; as explained before,
it is needed in order to verify that the HW teardown flow is complete before
posting the NVMe completion.
 
It’s not possible to move the teardown into sock_recvmsg() as it adds a
significant delay (~1 usec) and will not allow us to process other PDUs in the
meantime.
 
The challenge here is that we haven’t found a good and inexpensive reference in
the socket API for such IO notifications (I assume we don’t want to use 
signals).
This might block the socket API approach.
 
Data Digest offload:
--------------------
Since recvmsg() would only read a single PDU, your suggested 
approach could work:

          ddp_ddgst = CMSG_DATA(CMSG_FIRSTHDR(msg))
          if (ddp_ddgst->valid) {
                 great...
          } else {
                 recalc_ddgst
          }
                  
Receive flow:
-------------
recvmsg() doesn't have the callback mechanism that sock->read_sock()
provides, so the logic in the NVMeTCP driver code would have to be changed.
Instead of parsing and copying the PDU in chunks as it arrives via the callback,
it would have to process the entire PDU in one go upon returning from recvmsg().
 
To handle re-ordered NVMe-TCP responses, the ULP would duplicate the
tracking of PDUs and maintain a list of completed responses yet to be 
received by the NVMeTCP driver.
The NVMeTCP driver will call the recvmsg(s, &ddp_buf, cccid) only for C2H PDUs, 
and based on the PDU header type and command_id.
 
If placement was successful, the data is already in ddp_buf. If not, the kernel
would perform the regular software copy. It would be transparent for the 
NVMeTCP driver.
 
As mentioned before, in order to safely track the PDUs, we will
need to duplicate the header digest verification if enabled.
 
In summary the flow for NVMeTCP driver could be something like:
 
// Enable nvme-tcp offload
                setsockopt(s, SOL_NVME_TCP, ...);
 
...
 
// Command sending:
 
                // register memory
                msg.flags = MSG_NVME_DDP;
                msg.iov_iter = resp_buf;
 
                // set 1 msg ancillary field with the command id
                cmsg->cmsg_level = SOL_NVME_TCP;
                cmsg->cmsg_type = NVME_TCP_COMMAND_ID;
                cmsg->cmsg_len = CMSG_LEN(sizeof(u32));
                *CMSG_DATA(cmsg) = 0x1; // command id
                recvmsg(s, msg, msg.flags);
 
                // send the request
                send(s, req_buf, ...);
 
...
 
// C2H receive:
 
We will need to modify nvme_tcp_recv_pdu() to read only the next 
PDU header and according to the pdu type and command_id perform 
the following:
      
      ddp_buf = get_cid_buffer(cid);
      msg.iov_iter = ddp_buf;
      // set 1 msg ancillary field with the command id
      cmsg->cmsg_level = SOL_NVME_TCP;
      cmsg->cmsg_type = NVME_TCP_COMMAND_ID;
      cmsg->cmsg_len = CMSG_LEN(sizeof(u32));
      *CMSG_DATA(cmsg) = cid;
      recvmsg(s, msg, msg.flags);
      
      // Teardown (for this we don't have a good solution)
 
      // Data digest check
      ddp_ddgst = CMSG_DATA(CMSG_FIRSTHDR(msg))
      if (ddp_ddgst->valid) {
            great...
      } else {
            recalc_ddgst
      }    

Resync:
-------
This API could be removed from the NVMeTCP driver, and we could move
the processing of resync requests to the ipv4/ipv6 tcp code.
The resync pending state would be moved to the socket struct.
But also in this case, in order to safely track the PDUs, we would need to
parse all PDUs and also calculate the header digest if it is enabled.
 
User space:
-----------
Regarding the user space design, since issuing 3 syscalls per IO (recvmsg +
sendmsg + recvmsg) would be too expensive, we could add support for
io_uring so that it wraps the above APIs.
For example, similarly to recvmsg(), we would submit an IORING_OP_RECVMSG
sqe before each IO (with the MSG_NVME_DDP flag) to register memory for DDP.
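
A rough liburing sketch of that submission path (MSG_NVME_DDP and the
command-id cmsg are the hypothetical additions discussed in this thread, not
existing uAPI; the helper name is ours):

      #include <errno.h>
      #include <liburing.h>
      #include <sys/socket.h>

      /* Queue one DDP-registration recvmsg for an IO; 'msg' already carries
       * the destination iovec and the command-id ancillary data, prepared as
       * in the recvmsg() example above. */
      static int queue_ddp_register(struct io_uring *ring, int sock,
                                    struct msghdr *msg)
      {
              struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

              if (!sqe)
                      return -EAGAIN;
              io_uring_prep_recvmsg(sqe, sock, msg, MSG_NVME_DDP);
              return io_uring_submit(ring);
      }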
 
> 
> > I can add that we are also working on an end-to-end user-space design for SPDK
> > and we don’t see how the two designs could use the same socket APIs without
> > impacting the performance gain of both cases.
> 
> Can you share a bit more how this will be done for spdk? spdk runs on
> normal sockets today, hence I'd love to know how you plan to introduce
> it there.
 
The SPDK developed at Nvidia can use XLIO
https://docs.nvidia.com/networking/display/XLIOv214 as the TCP stack.
It is out of the scope of this discussion as it does not follow the
normal socket operations.
 
> 
> >
> >>
> >> To me it appears that:
> >> - ddp limits can be obtained via getsockopt
> >> - sk_add/sk_del can be done via setsockopt
> >> - offloaded DDGST crc can be obtained via something like
> >>     msghdr.msg_control
> >
> > If we wish to hide it from the NVMeTCP driver it will require us to duplicate
> > the NVMeTCP logic of tracking the PDUs in the TCP stream.
> > In the positive case, when there are no DDGST errors, nothing is needed to be
> > done in the NVMeTCP driver, but in the case of errors (or just in the case of
> > out of order packet in the middle of the PDU), the DDGST will need to be
> > calculated in SW and it seems wrong to us to duplicate it outside of the
> > NVMeTCP driver.
> 
> I didn't suggest that the recalc would be hidden, if the calculation
> failed the ULP can do it.
> 
> Something like:
>         ddp_ddgst = CMSG_DATA(CMSG_FIRSTHDR(msg))
>         if (ddp_ddgst->valid) {
>                 great...
>         } else {
>                 recalc_ddgst
>         }
 
Yes, we agree, as explained before.
 
> 
> >
> >> - Perhaps for setting up the offload per IO, recvmsg would be the
> >>     vehicle with a new msg flag MSG_RCV_DDP or something, that would hide
> >>     all the details of what the HW needs (the command_id would be set
> >>     somewhere in the msghdr).
> >
> > Our design includes the following steps per IO:
> > - ddp_setup (register the command id and buffer to the HW)
> > - The existing flow which sends the command
> > - The existing flow of read_sock()
> > - ddp_teardown (for the per IO HW teardown, before posting the NVMe completion)
> >
> > Using the recvmsg will only replace the read_sock() but this part in the NVMeTCP
> > driver is not impacted by the offload design.
> > The ddp_setup is needed in order to set the command id and the buffer, and this
> > needs to be done before or as part of the sending of command and prior to the
> > receive flow.
> 
> I agree, I didn't suggest that replacing the entire read_sock flow
> (although that can change to recvmsg as well).
> 
> I was thinking more along the lines of MSG_PEEK that doesn't actually
> reads anything, where the driver prior to sending the command, would do:
> 
>         // do a DDP recvmsg to setup direct data placement
>         msg = { .msg_flags = MSG_DDP_SETUP };
>         iov_iter_bvec(&msg.msg_iter, ITER_DEST, vec, nr_bvec, size);
>         sock_recvmsg(queue->sock, &msg, msg.msg_flags);
> 
>         ...
> 
>         // now send the cmd to the controller
>         nvme_tcp_setup_cmd_pdu
>         nvme_tcp_try_send_cmd_pdu
>         ...
> 
> That is what I was thinking about.
 
Was answered as part of the “receive flow” section above.
 
> 
> > In addition, without duplicating the tracking of the PDUs in the TCP stream,
> > it is not possible to hide the teardown flow from the NVMeTCP driver.
> 
> The teardown flow is a bummer because it delays the nvme completion.
> Is it ok to dma_unmap and then schedule a ddp teardown async without
> delaying the nvme completion?
 
With the existing HW it will not be possible. The teardown is required in
order to guarantee that the HW will not try to access the HW resources
or the IO buffer after the nvme completion has been posted.
 
In a future HW, the teardown flow could improve.
 
> 
> How does TLS offload handles the async teardown?
 
In TLS we don't have such a per-IO operation. Only the resync is similar.
 
> 
> >
> >> - And all of the resync flow would be something that a separate
> >>     ulp socket provider would take care of. Similar to how TLS presents
> >>     itself to a tcp application. So the application does not need to be
> >>     aware of it.
> >
> > The resync flow requires awareness of TCP sequence numbers and NVMe
> > PDUs boundaries. If we hide it from the NVMeTCP driver we would have
> > to again duplicate NVMe PDU tracking code.
> 
> Its really a pretty trivial part of the pdu tracking. Just when
> a C2HData PDU starts...
 
The resync is not only in the context of the C2HData PDUs.
As explained before, it requires parsing the common header and validating
the header digest in order to verify the resync of the HW.
We really don't like this duplication.
 
> 
> > TLS is a stream protocol and maps cleanly to TCP socket operations.
> > NVMeTCP on the other hand is a request-response protocol.
> > On the data path, it's not comparable.
> 
> Is it? looks similar to me in the sense that the input stream needs to
> know when a msg starts and ends. The only difference is that the ULP
> needs to timely provide the context to the offload.
 
It's a big difference: the app that is using TLS doesn't need to handle the
IO path operations that the NVMeTCP offload requires (ddp_setup, ddp_teardown).
 
> 
> > While it could be done by contorting recvmsg() with new flags, adding input
> > fields to the msghdr struct and changing the behaviour of uAPI, it will add
> > a lot more friction than a separate ULP DDP API specifically made for this
> > purpose.
> 
> This may be true. I was merely asking if this was a path that was
> genuinely explored and ruled out. To me, it sounds like it can
> work, and also allow userspace to automatically gain from it.
 
We have explored different approaches but ultimately picked this one.
While it might be more intrusive, it offers the better compromise between
performance and cleanliness (avoiding code duplication and major
refactoring).
 
The socket API:
1. Won’t simplify the nvme-tcp code: it will arguably make it more complex.
That is, assuming we resolve all the issues mentioned earlier.
 
2. Its only real improvement would be to have the feature available from
userspace, which we doubt can be as performant.
(side note – based on our experience, managing the entire TCP stack from 
userspace will achieve much better performance but this is again out of the 
scope of this discussion)
 
Based on that, our request would be to review and consider merging this 
series with our original design. 
 
> 
> >> I'm not sure that such interface could cover everything that is needed,
> >> but what I'm trying to convey, is that the current interface limits the
> >> usability for almost anything else. Please correct me if I'm wrong.
> >> Is this designed to also cater anything else outside of the kernel
> >> nvme-tcp host driver?
> >
> > As part of this series, we are targeting the kernel nvme-tcp host driver,
> > and later we are planning to add support for the kernel nvme-tcp target driver.
> >
> > The ULP layer was designed to be generic for other request-response
> based
> > protocols.
> 
> I understand.
> 
> >
> >>
> >>> Compatibility
> >>> =============
> >>> * The offload works with bare-metal or SRIOV.
> >>> * The HW can support up to 64K connections per device (assuming no
> >>>     other HW accelerations are used). In this series, we will introduce
> >>>     the support for up to 4k connections, and we have plans to increase it.
> >>> * SW TLS could not work together with the NVMeTCP offload as the HW
> >>>     will need to track the NVMeTCP headers in the TCP stream.
> >>
> >> Can't say I like that.
> >
> > The answer should be to support both TLS offload and NVMeTCP offload,
> > which is a HW limit in our current generation.
> 
> I can't complain about this because we don't have TLS supported in
> nvme-tcp. But _can_ they work together on future HW?
 
Yes. With our original design the effort to support both offloads is incremental.
 
But with the new socket API design discussed here, it would be more challenging
to stack ULPs, and hence doing TLS + NVMe-TCP would be harder.
 
> 
> >
> >>
> >>> * The ConnectX HW support HW TLS, but in ConnectX-7 those features
> >>>     could not co-exists (and it is not part of this series).
> >>> * The NVMeTCP offload ConnectX 7 HW can support tunneling, but we
> >>>     don’t see the need for this feature yet.
> >>> * NVMe poll queues are not in the scope of this series.
> >>
> >> bonding/teaming?
> >
> > We are planning to add it as part of the "incremental feature".
> > It will follow the existing design of the mlx HW TLS.
> 
> OK.
> 
> >>> Future Work
> >>> ===========
> >>> * NVMeTCP transmit offload.
> >>> * NVMeTCP host offloads incremental features.
> >>> * NVMeTCP target offload.
> >>
> >> Which target? which host?
> >
> > Kernel nvme-tcp host driver and kernel nvme-tcp target driver.
> 
> Look, I did spend time looking into this patchset several times
> in the past, and also provided feedback. The reason why I'm questioning
> the interface now, is because I'm wandering if it can be less intrusive
> and use the common socket interface instead of inventing a new one.
 
Thanks again for your time looking into this and providing your feedback.
It’s a valid question but we think the current design is the better compromise.