
[v12,07/26] nvme-tcp: Add DDP offload control path

Message ID 20230712161513.134860-8-aaptel@nvidia.com (mailing list archive)
State RFC
Series nvme-tcp receive offloads

Checks

Context: netdev/tree_selection
Check: success
Description: Guessing tree name failed - patch did not apply, async

Commit Message

Aurelien Aptel July 12, 2023, 4:14 p.m. UTC
From: Boris Pismenny <borisp@nvidia.com>

This commit introduces direct data placement offload to NVME
TCP. There is a context per queue, which is established after the
handshake using the sk_add/del NDOs.

Additionally, a resynchronization routine is used to assist
hardware recovery from TCP OOO, and continue the offload.
Resynchronization operates as follows:

1. TCP OOO causes the NIC HW to stop the offload

2. NIC HW identifies a PDU header at some TCP sequence number,
and asks NVMe-TCP to confirm it.
This request is delivered from the NIC driver to NVMe-TCP by first
finding the socket for the packet that triggered the request, and
then finding the nvme_tcp_queue that is used by this routine.
Finally, the request is recorded in the nvme_tcp_queue.

3. When NVMe-TCP observes the requested TCP sequence, it will compare
it with the PDU header TCP sequence, and report the result to the
NIC driver (resync), which will update the HW, and resume offload
when all is successful.

Some HW implementations such as ConnectX-7 assume a linear CCID space
(0...N-1 for a queue of size N), whereas the Linux nvme driver uses part
of the 16-bit CCID as a generation counter. To address that, we use the
existing quirk in the nvme layer when the HW driver advertises that the
device does not support the full 16-bit CCID range.

Furthermore, we let the offloading driver advertise the maximum HW
sectors/segments via ulp_ddp_limits.

A follow-up patch introduces the data-path changes required for this
offload.

Socket operations need a netdev reference. This reference is
dropped on NETDEV_GOING_DOWN events to allow the device to go down in
a follow-up patch.

Signed-off-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com>
Signed-off-by: Or Gerlitz <ogerlitz@nvidia.com>
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Signed-off-by: Shai Malin <smalin@nvidia.com>
Signed-off-by: Aurelien Aptel <aaptel@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 drivers/nvme/host/tcp.c | 272 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 263 insertions(+), 9 deletions(-)

Comments

Chaitanya Kulkarni Aug. 1, 2023, 2:25 a.m. UTC | #1
On 7/12/23 09:14, Aurelien Aptel wrote:
> From: Boris Pismenny <borisp@nvidia.com>
>
> This commit introduces direct data placement offload to NVME
> TCP. There is a context per queue, which is established after the
> handshake using the sk_add/del NDOs.
>
> Additionally, a resynchronization routine is used to assist
> hardware recovery from TCP OOO, and continue the offload.
> Resynchronization operates as follows:
>
> 1. TCP OOO causes the NIC HW to stop the offload
>
> 2. NIC HW identifies a PDU header at some TCP sequence number,
> and asks NVMe-TCP to confirm it.
> This request is delivered from the NIC driver to NVMe-TCP by first
> finding the socket for the packet that triggered the request, and
> then finding the nvme_tcp_queue that is used by this routine.
> Finally, the request is recorded in the nvme_tcp_queue.
>
> 3. When NVMe-TCP observes the requested TCP sequence, it will compare
> it with the PDU header TCP sequence, and report the result to the
> NIC driver (resync), which will update the HW, and resume offload
> when all is successful.
>
> Some HW implementations such as ConnectX-7 assume a linear CCID space
> (0...N-1 for a queue of size N), whereas the Linux nvme driver uses part
> of the 16-bit CCID as a generation counter. To address that, we use the
> existing quirk in the nvme layer when the HW driver advertises that the
> device does not support the full 16-bit CCID range.
>
> Furthermore, we let the offloading driver advertise the maximum HW
> sectors/segments via ulp_ddp_limits.
>
> A follow-up patch introduces the data-path changes required for this
> offload.
>
> Socket operations need a netdev reference. This reference is
> dropped on NETDEV_GOING_DOWN events to allow the device to go down in
> a follow-up patch.
>
> Signed-off-by: Boris Pismenny <borisp@nvidia.com>
> Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com>
> Signed-off-by: Or Gerlitz <ogerlitz@nvidia.com>
> Signed-off-by: Yoray Zack <yorayz@nvidia.com>
> Signed-off-by: Shai Malin <smalin@nvidia.com>
> Signed-off-by: Aurelien Aptel <aaptel@nvidia.com>
> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
> ---

For NVMe related code :-

The offload feature is configurable and may not be turned on in the absence
of the H/W. In order to keep the nvme/host/tcp.c file small and focused on
core functionality, I wonder if we should move the tcp-offload code into
its own file, say nvme/host/tcp-offload.c?

-ck
Sagi Grimberg Aug. 9, 2023, 7:13 a.m. UTC | #2
Hey,

Sorry for the late review on this, I've been struggling with
prioritizing this due to lack of time.

I'll most likely review in batches.

On 7/12/23 19:14, Aurelien Aptel wrote:
> From: Boris Pismenny <borisp@nvidia.com>
> 
> This commit introduces direct data placement offload to NVME
> TCP. There is a context per queue, which is established after the
> handshake using the sk_add/del NDOs.
> 
> Additionally, a resynchronization routine is used to assist
> hardware recovery from TCP OOO, and continue the offload.
> Resynchronization operates as follows:
> 
> 1. TCP OOO causes the NIC HW to stop the offload
> 
> 2. NIC HW identifies a PDU header at some TCP sequence number,
> and asks NVMe-TCP to confirm it.
> This request is delivered from the NIC driver to NVMe-TCP by first
> finding the socket for the packet that triggered the request, and
> then finding the nvme_tcp_queue that is used by this routine.
> Finally, the request is recorded in the nvme_tcp_queue.
> 
> 3. When NVMe-TCP observes the requested TCP sequence, it will compare
> it with the PDU header TCP sequence, and report the result to the
> NIC driver (resync), which will update the HW, and resume offload
> when all is successful.
> 
> Some HW implementations such as ConnectX-7 assume a linear CCID space
> (0...N-1 for a queue of size N), whereas the Linux nvme driver uses part
> of the 16-bit CCID as a generation counter. To address that, we use the
> existing quirk in the nvme layer when the HW driver advertises that the
> device does not support the full 16-bit CCID range.
>
> Furthermore, we let the offloading driver advertise the maximum HW
> sectors/segments via ulp_ddp_limits.
> 
> A follow-up patch introduces the data-path changes required for this
> offload.
> 
> Socket operations need a netdev reference. This reference is
> dropped on NETDEV_GOING_DOWN events to allow the device to go down in
> a follow-up patch.
> 
> Signed-off-by: Boris Pismenny <borisp@nvidia.com>
> Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com>
> Signed-off-by: Or Gerlitz <ogerlitz@nvidia.com>
> Signed-off-by: Yoray Zack <yorayz@nvidia.com>
> Signed-off-by: Shai Malin <smalin@nvidia.com>
> Signed-off-by: Aurelien Aptel <aaptel@nvidia.com>
> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
> ---
>   drivers/nvme/host/tcp.c | 272 ++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 263 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 4a8b32bd5257..7d3b0ac65c03 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -20,6 +20,10 @@
>   #include <net/busy_poll.h>
>   #include <trace/events/sock.h>
>   
> +#ifdef CONFIG_ULP_DDP
> +#include <net/ulp_ddp.h>
> +#endif
> +
>   #include "nvme.h"
>   #include "fabrics.h"
>   
> @@ -118,6 +122,7 @@ enum nvme_tcp_queue_flags {
>   	NVME_TCP_Q_ALLOCATED	= 0,
>   	NVME_TCP_Q_LIVE		= 1,
>   	NVME_TCP_Q_POLLING	= 2,
> +	NVME_TCP_Q_OFF_DDP	= 3,
>   };
>   
>   enum nvme_tcp_recv_state {
> @@ -145,6 +150,19 @@ struct nvme_tcp_queue {
>   	size_t			ddgst_remaining;
>   	unsigned int		nr_cqe;
>   
> +#ifdef CONFIG_ULP_DDP
> +	/*
> +	 * resync_req is a speculative PDU header tcp seq number (with
> +	 * an additional flag at 32 lower bits) that the HW send to
> +	 * the SW, for the SW to verify.
> +	 * - The 32 high bits store the seq number
> +	 * - The 32 low bits are used as a flag to know if a request
> +	 *   is pending (ULP_DDP_RESYNC_PENDING).
> +	 */
> +	atomic64_t		resync_req;
> +	struct ulp_ddp_limits	ddp_limits;
> +#endif
> +
>   	/* send state */
>   	struct nvme_tcp_request *request;
>   
> @@ -187,6 +205,9 @@ struct nvme_tcp_ctrl {
>   	struct delayed_work	connect_work;
>   	struct nvme_tcp_request async_req;
>   	u32			io_queues[HCTX_MAX_TYPES];
> +
> +	struct net_device	*offloading_netdev;
> +	u32			offload_io_threshold;

ddp_netdev
ddp_io_threashold

>   };
>   
>   static LIST_HEAD(nvme_tcp_ctrl_list);
> @@ -290,6 +311,204 @@ static inline size_t nvme_tcp_pdu_last_send(struct nvme_tcp_request *req,
>   	return nvme_tcp_pdu_data_left(req) <= len;
>   }
>   
> +#ifdef CONFIG_ULP_DDP
> +
> +static bool nvme_tcp_ddp_query_limits(struct net_device *netdev,
> +				      struct nvme_tcp_queue *queue)
> +{
> +	int ret;
> +
> +	if (!netdev->netdev_ops->ulp_ddp_ops->limits)
> +		return false;

Can we expose all of these ops in wrappers ala:

netdev_ulp_ddp_limits(netdev, &limits)
netdev_ulp_ddp_sk_add(netdev, sk, &nvme_tcp_ddp_ulp_ops)
netdev_ulp_ddp_sk_del(netdev, sk)
netdev_ulp_ddp_resync(netdev, skb, seq)

etc...

> +
> +	queue->ddp_limits.type = ULP_DDP_NVME;
> +	ret = netdev->netdev_ops->ulp_ddp_ops->limits(netdev, &queue->ddp_limits);
> +	if (ret == -EOPNOTSUPP) {
> +		return false;
> +	} else if (ret) {
> +		WARN_ONCE(ret, "ddp limits failed (ret=%d)", ret);
> +		return false;
> +	}
> +
> +	dev_dbg_ratelimited(queue->ctrl->ctrl.device,
> +			    "netdev %s offload limits: max_ddp_sgl_len %d\n",
> +			    netdev->name, queue->ddp_limits.max_ddp_sgl_len);
> +
> +	return true;
> +}
> +
> +static inline bool is_netdev_ulp_offload_active(struct net_device *netdev,
> +						struct nvme_tcp_queue *queue)
> +{
> +	if (!netdev || !queue)
> +		return false;

Is it reasonable to be called here with !netdev or !queue ?

> +
> +	/* If we cannot query the netdev limitations, do not offload */
> +	if (!nvme_tcp_ddp_query_limits(netdev, queue))
> +		return false;
> +
> +	/* If netdev supports nvme-tcp ddp offload, we can offload */
> +	if (test_bit(ULP_DDP_C_NVME_TCP_BIT, netdev->ulp_ddp_caps.active))
> +		return true;

This should be coming from the API itself, have the limits query
api fail if this is off.

btw, what is the active thing? is this driven from ethtool enable?
what happens if the user disables it while there is a ulp using it?

> +
> +	return false;

This can be folded to the above function.

> +}
> +
> +static bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags);
> +static const struct ulp_ddp_ulp_ops nvme_tcp_ddp_ulp_ops = {
> +	.resync_request		= nvme_tcp_resync_request,
> +};
> +
> +static int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue)
> +{
> +	struct net_device *netdev = queue->ctrl->offloading_netdev;
> +	struct ulp_ddp_config config = {.type = ULP_DDP_NVME};
> +	int ret;
> +
> +	config.nvmeotcp.pfv = NVME_TCP_PFV_1_0;

Question, what happens if the pfv changes, is the ddp guaranteed to
work?

> +	config.nvmeotcp.cpda = 0;
> +	config.nvmeotcp.dgst =
> +		queue->hdr_digest ? NVME_TCP_HDR_DIGEST_ENABLE : 0;
> +	config.nvmeotcp.dgst |=
> +		queue->data_digest ? NVME_TCP_DATA_DIGEST_ENABLE : 0;
> +	config.nvmeotcp.queue_size = queue->ctrl->ctrl.sqsize + 1;
> +	config.nvmeotcp.queue_id = nvme_tcp_queue_id(queue);
> +	config.nvmeotcp.io_cpu = queue->io_cpu;
> +
> +	/* Socket ops keep a netdev reference. It is put in
> +	 * nvme_tcp_unoffload_socket().  This ref is dropped on
> +	 * NETDEV_GOING_DOWN events to allow the device to go down
> +	 */
> +	dev_hold(netdev);
> +	ret = netdev->netdev_ops->ulp_ddp_ops->sk_add(netdev,
> +						      queue->sock->sk,
> +						      &config);

It would be preferred if dev_hold would be taken in sk_add
and released in sk_del so that the ulp does not need to worry
acount it.

> +	if (ret) {
> +		dev_put(netdev);
> +		return ret;
> +	}
> +
> +	inet_csk(queue->sock->sk)->icsk_ulp_ddp_ops = &nvme_tcp_ddp_ulp_ops;

can also be folded inside an api.

> +	set_bit(NVME_TCP_Q_OFF_DDP, &queue->flags);
> +	return 0;
> +}
> +
> +static void nvme_tcp_unoffload_socket(struct nvme_tcp_queue *queue)
> +{
> +	struct net_device *netdev = queue->ctrl->offloading_netdev;
> +
> +	if (!netdev) {
> +		dev_info_ratelimited(queue->ctrl->ctrl.device, "netdev not found\n");
> +		return;
> +	}

Again, is it reasonable to be called here with !netdev?

> +
> +	clear_bit(NVME_TCP_Q_OFF_DDP, &queue->flags);
> +
> +	netdev->netdev_ops->ulp_ddp_ops->sk_del(netdev, queue->sock->sk);
> +
> +	inet_csk(queue->sock->sk)->icsk_ulp_ddp_ops = NULL;
> +	dev_put(netdev); /* held by offload_socket */

Both can be done by the api instead of the ulp itself.

> +}
> +
> +static void nvme_tcp_offload_apply_limits(struct nvme_tcp_queue *queue,
> +					  struct net_device *netdev)
> +{
> +	queue->ctrl->offloading_netdev = netdev;
> +	queue->ctrl->ctrl.max_segments = queue->ddp_limits.max_ddp_sgl_len;
> +	queue->ctrl->ctrl.max_hw_sectors =
> +		queue->ddp_limits.max_ddp_sgl_len << (ilog2(SZ_4K) - 9);

this is SECTOR_SHIFT?

> +	queue->ctrl->offload_io_threshold = queue->ddp_limits.io_threshold;
> +
> +	/* offloading HW doesn't support full ccid range, apply the quirk */
> +	queue->ctrl->ctrl.quirks |=
> +		queue->ddp_limits.nvmeotcp.full_ccid_range ? 0 : NVME_QUIRK_SKIP_CID_GEN;
> +}
> +
> +/* In presence of packet drops or network packet reordering, the device may lose
> + * synchronization between the TCP stream and the L5P framing, and require a
> + * resync with the kernel's TCP stack.
> + *
> + * - NIC HW identifies a PDU header at some TCP sequence number,
> + *   and asks NVMe-TCP to confirm it.
> + * - When NVMe-TCP observes the requested TCP sequence, it will compare
> + *   it with the PDU header TCP sequence, and report the result to the
> + *   NIC driver
> + */
> +static void nvme_tcp_resync_response(struct nvme_tcp_queue *queue,
> +				     struct sk_buff *skb, unsigned int offset)
> +{
> +	u64 pdu_seq = TCP_SKB_CB(skb)->seq + offset - queue->pdu_offset;
> +	struct net_device *netdev = queue->ctrl->offloading_netdev;
> +	u64 pdu_val = (pdu_seq << 32) | ULP_DDP_RESYNC_PENDING;
> +	u64 resync_val;
> +	u32 resync_seq;
> +
> +	resync_val = atomic64_read(&queue->resync_req);
> +	/* Lower 32 bit flags. Check validity of the request */
> +	if ((resync_val & ULP_DDP_RESYNC_PENDING) == 0)
> +		return;
> +
> +	/*
> +	 * Obtain and check requested sequence number: is this PDU header
> +	 * before the request?
> +	 */
> +	resync_seq = resync_val >> 32;
> +	if (before(pdu_seq, resync_seq))
> +		return;
> +
> +	/*
> +	 * The atomic operation guarantees that we don't miss any NIC driver
> +	 * resync requests submitted after the above checks.
> +	 */
> +	if (atomic64_cmpxchg(&queue->resync_req, pdu_val,
> +			     pdu_val & ~ULP_DDP_RESYNC_PENDING) !=
> +			     atomic64_read(&queue->resync_req))
> +		netdev->netdev_ops->ulp_ddp_ops->resync(netdev,
> +							queue->sock->sk,
> +							pdu_seq);

Who else is doing an atomic on this value? and what happens
if the cmpxchg fails?

> +}
> +
> +static bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags)
> +{
> +	struct nvme_tcp_queue *queue = sk->sk_user_data;
> +
> +	/*
> +	 * "seq" (TCP seq number) is what the HW assumes is the
> +	 * beginning of a PDU.  The nvme-tcp layer needs to store the
> +	 * number along with the "flags" (ULP_DDP_RESYNC_PENDING) to
> +	 * indicate that a request is pending.
> +	 */
> +	atomic64_set(&queue->resync_req, (((uint64_t)seq << 32) | flags));

Question, is this coming from multiple contexts? what contexts are
competing here that make it an atomic operation? It is unclear what is
going on here tbh.

> +
> +	return true;
> +}
> +
> +#else
> +
> +static inline bool is_netdev_ulp_offload_active(struct net_device *netdev,
> +						struct nvme_tcp_queue *queue)
> +{
> +	return false;
> +}
> +
> +static int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static void nvme_tcp_unoffload_socket(struct nvme_tcp_queue *queue)
> +{}
> +
> +static void nvme_tcp_offload_apply_limits(struct nvme_tcp_queue *queue,
> +					  struct net_device *netdev)
> +{}
> +
> +static void nvme_tcp_resync_response(struct nvme_tcp_queue *queue,
> +				     struct sk_buff *skb, unsigned int offset)
> +{}
> +
> +#endif
> +
>   static void nvme_tcp_init_iter(struct nvme_tcp_request *req,
>   		unsigned int dir)
>   {
> @@ -732,6 +951,9 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>   	size_t rcv_len = min_t(size_t, *len, queue->pdu_remaining);
>   	int ret;
>   
> +	if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags))
> +		nvme_tcp_resync_response(queue, skb, *offset);
> +
>   	ret = skb_copy_bits(skb, *offset,
>   		&pdu[queue->pdu_offset], rcv_len);
>   	if (unlikely(ret))
> @@ -1795,6 +2017,8 @@ static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
>   	kernel_sock_shutdown(queue->sock, SHUT_RDWR);
>   	nvme_tcp_restore_sock_ops(queue);
>   	cancel_work_sync(&queue->io_work);
> +	if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags))
> +		nvme_tcp_unoffload_socket(queue);
>   }
>   
>   static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
> @@ -1831,25 +2055,55 @@ static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
>   {
>   	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
>   	struct nvme_tcp_queue *queue = &ctrl->queues[idx];
> +	struct net_device *netdev;
>   	int ret;
>   
>   	queue->rd_enabled = true;
>   	nvme_tcp_init_recv_ctx(queue);
>   	nvme_tcp_setup_sock_ops(queue);
>   
> -	if (idx)
> +	if (idx) {
>   		ret = nvmf_connect_io_queue(nctrl, idx);
> -	else
> +		if (ret)
> +			goto err;
> +
> +		netdev = queue->ctrl->offloading_netdev;
> +		if (is_netdev_ulp_offload_active(netdev, queue)) {

Seems redundant to pass netdev as an argument here.

> +			ret = nvme_tcp_offload_socket(queue);
> +			if (ret) {
> +				dev_info(nctrl->device,
> +					 "failed to setup offload on queue %d ret=%d\n",
> +					 idx, ret);
> +			}
> +		}
> +	} else {
>   		ret = nvmf_connect_admin_queue(nctrl);
> +		if (ret)
> +			goto err;
>   
> -	if (!ret) {
> -		set_bit(NVME_TCP_Q_LIVE, &queue->flags);
> -	} else {
> -		if (test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
> -			__nvme_tcp_stop_queue(queue);
> -		dev_err(nctrl->device,
> -			"failed to connect queue: %d ret=%d\n", idx, ret);
> +		netdev = get_netdev_for_sock(queue->sock->sk);

Is there any chance that this is a different netdev than what is
already recorded? doesn't make sense to me.

> +		if (!netdev) {
> +			dev_info_ratelimited(ctrl->ctrl.device, "netdev not found\n");
> +			ctrl->offloading_netdev = NULL;
> +			goto done;
> +		}
> +		if (is_netdev_ulp_offload_active(netdev, queue))
> +			nvme_tcp_offload_apply_limits(queue, netdev);
> +		/*
> +		 * release the device as no offload context is
> +		 * established yet.
> +		 */
> +		dev_put(netdev);

the put is unclear, what does it pair with? the get_netdev_for_sock?

>   	}
> +
> +done:
> +	set_bit(NVME_TCP_Q_LIVE, &queue->flags);
> +	return 0;
> +err:
> +	if (test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
> +		__nvme_tcp_stop_queue(queue);
> +	dev_err(nctrl->device,
> +		"failed to connect queue: %d ret=%d\n", idx, ret);
>   	return ret;
>   }
>
Sagi Grimberg Aug. 9, 2023, 7:39 a.m. UTC | #3
On 8/1/23 05:25, Chaitanya Kulkarni wrote:
> On 7/12/23 09:14, Aurelien Aptel wrote:
>> From: Boris Pismenny <borisp@nvidia.com>
>>
>> This commit introduces direct data placement offload to NVME
>> TCP. There is a context per queue, which is established after the
>> handshake using the sk_add/del NDOs.
>>
>> Additionally, a resynchronization routine is used to assist
>> hardware recovery from TCP OOO, and continue the offload.
>> Resynchronization operates as follows:
>>
>> 1. TCP OOO causes the NIC HW to stop the offload
>>
>> 2. NIC HW identifies a PDU header at some TCP sequence number,
>> and asks NVMe-TCP to confirm it.
>> This request is delivered from the NIC driver to NVMe-TCP by first
>> finding the socket for the packet that triggered the request, and
>> then finding the nvme_tcp_queue that is used by this routine.
>> Finally, the request is recorded in the nvme_tcp_queue.
>>
>> 3. When NVMe-TCP observes the requested TCP sequence, it will compare
>> it with the PDU header TCP sequence, and report the result to the
>> NIC driver (resync), which will update the HW, and resume offload
>> when all is successful.
>>
>> Some HW implementations such as ConnectX-7 assume a linear CCID space
>> (0...N-1 for a queue of size N), whereas the Linux nvme driver uses part
>> of the 16-bit CCID as a generation counter. To address that, we use the
>> existing quirk in the nvme layer when the HW driver advertises that the
>> device does not support the full 16-bit CCID range.
>>
>> Furthermore, we let the offloading driver advertise the maximum HW
>> sectors/segments via ulp_ddp_limits.
>>
>> A follow-up patch introduces the data-path changes required for this
>> offload.
>>
>> Socket operations need a netdev reference. This reference is
>> dropped on NETDEV_GOING_DOWN events to allow the device to go down in
>> a follow-up patch.
>>
>> Signed-off-by: Boris Pismenny <borisp@nvidia.com>
>> Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com>
>> Signed-off-by: Or Gerlitz <ogerlitz@nvidia.com>
>> Signed-off-by: Yoray Zack <yorayz@nvidia.com>
>> Signed-off-by: Shai Malin <smalin@nvidia.com>
>> Signed-off-by: Aurelien Aptel <aaptel@nvidia.com>
>> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
>> ---
> 
> For NVMe related code :-
> 
> The offload feature is configurable and may not be turned on in the absence
> of the H/W. In order to keep the nvme/host/tcp.c file small and focused on
> core functionality, I wonder if we should move the tcp-offload code into
> its own file, say nvme/host/tcp-offload.c?

Maybe. It wouldn't be tcp_offload.c but rather tcp_ddp.c, because it's not
offloading the tcp stack but rather doing direct data placement.

If we are going to do that it will pollute nvme.h or add a common
header file, which is something I'd like to avoid if possible.
Chaitanya Kulkarni Aug. 11, 2023, 5:28 a.m. UTC | #4
On 8/9/2023 12:39 AM, Sagi Grimberg wrote:
> 
> 
> On 8/1/23 05:25, Chaitanya Kulkarni wrote:
>> On 7/12/23 09:14, Aurelien Aptel wrote:
>>> From: Boris Pismenny <borisp@nvidia.com>
>>>
>>> This commit introduces direct data placement offload to NVME
>>> TCP. There is a context per queue, which is established after the
>>> handshake using the sk_add/del NDOs.
>>>
>>> Additionally, a resynchronization routine is used to assist
>>> hardware recovery from TCP OOO, and continue the offload.
>>> Resynchronization operates as follows:
>>>
>>> 1. TCP OOO causes the NIC HW to stop the offload
>>>
>>> 2. NIC HW identifies a PDU header at some TCP sequence number,
>>> and asks NVMe-TCP to confirm it.
>>> This request is delivered from the NIC driver to NVMe-TCP by first
>>> finding the socket for the packet that triggered the request, and
>>> then finding the nvme_tcp_queue that is used by this routine.
>>> Finally, the request is recorded in the nvme_tcp_queue.
>>>
>>> 3. When NVMe-TCP observes the requested TCP sequence, it will compare
>>> it with the PDU header TCP sequence, and report the result to the
>>> NIC driver (resync), which will update the HW, and resume offload
>>> when all is successful.
>>>
>>> Some HW implementations such as ConnectX-7 assume a linear CCID space
>>> (0...N-1 for a queue of size N), whereas the Linux nvme driver uses part
>>> of the 16-bit CCID as a generation counter. To address that, we use the
>>> existing quirk in the nvme layer when the HW driver advertises that the
>>> device does not support the full 16-bit CCID range.
>>>
>>> Furthermore, we let the offloading driver advertise the maximum HW
>>> sectors/segments via ulp_ddp_limits.
>>>
>>> A follow-up patch introduces the data-path changes required for this
>>> offload.
>>>
>>> Socket operations need a netdev reference. This reference is
>>> dropped on NETDEV_GOING_DOWN events to allow the device to go down in
>>> a follow-up patch.
>>>
>>> Signed-off-by: Boris Pismenny <borisp@nvidia.com>
>>> Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com>
>>> Signed-off-by: Or Gerlitz <ogerlitz@nvidia.com>
>>> Signed-off-by: Yoray Zack <yorayz@nvidia.com>
>>> Signed-off-by: Shai Malin <smalin@nvidia.com>
>>> Signed-off-by: Aurelien Aptel <aaptel@nvidia.com>
>>> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
>>> ---
>>
>> For NVMe related code :-
>>
>> The offload feature is configurable and may not be turned on in the absence
>> of the H/W. In order to keep the nvme/host/tcp.c file small and focused on
>> core functionality, I wonder if we should move the tcp-offload code into
>> its own file, say nvme/host/tcp-offload.c?
> 
> Maybe. It wouldn't be tcp_offload.c but rather tcp_ddp.c, because it's not
> offloading the tcp stack but rather doing direct data placement.
>

fine ...

> If we are going to do that it will pollute nvme.h or add a common
> header file, which is something I'd like to avoid if possible.

My comment was mainly based on how we have separated the core code from
configurable features, and on wondering whether any decision we make for
the host will also apply to the target ddp code in the future. But if you
prefer to keep it as it is, let's not bloat the header ...

-ck
Aurelien Aptel Aug. 14, 2023, 4:11 p.m. UTC | #5
Sagi Grimberg <sagi@grimberg.me> writes:
>> @@ -187,6 +205,9 @@ struct nvme_tcp_ctrl {
>>       struct delayed_work     connect_work;
>>       struct nvme_tcp_request async_req;
>>       u32                     io_queues[HCTX_MAX_TYPES];
>> +
>> +     struct net_device       *offloading_netdev;
>> +     u32                     offload_io_threshold;
>
> ddp_netdev
> ddp_io_threashold

Sure, will change it.

>> +static bool nvme_tcp_ddp_query_limits(struct net_device *netdev,
>> +                                   struct nvme_tcp_queue *queue)
>> +{
>> +     int ret;
>> +
>> +     if (!netdev->netdev_ops->ulp_ddp_ops->limits)
>> +             return false;
>
> Can we expose all of these ops in wrappers ala:
>
> netdev_ulp_ddp_limits(netdev, &limits)
> netdev_ulp_ddp_sk_add(netdev, sk, &nvme_tcp_ddp_ulp_ops)
> netdev_ulp_ddp_sk_del(netdev, sk)
> netdev_ulp_ddp_resync(netdev, skb, seq)
>
> etc...

Sure, we will add simple wrappers in ulp_ddp.h to check for the function
pointers.
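
For illustration, a minimal sketch of such a wrapper could look like this
(the wrapper name follows the earlier suggestion; the exact signature is an
assumption, not the final ulp_ddp.h API):

static inline int netdev_ulp_ddp_limits(struct net_device *netdev,
                                        struct ulp_ddp_limits *limits)
{
        /* hypothetical helper: fail cleanly when the op is not provided */
        if (!netdev->netdev_ops->ulp_ddp_ops ||
            !netdev->netdev_ops->ulp_ddp_ops->limits)
                return -EOPNOTSUPP;
        return netdev->netdev_ops->ulp_ddp_ops->limits(netdev, limits);
}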

>> +static inline bool is_netdev_ulp_offload_active(struct net_device *netdev,
>> +                                             struct nvme_tcp_queue *queue)
>> +{
>> +     if (!netdev || !queue)
>> +             return false;
>
> Is it reasonable to be called here with !netdev or !queue ?

The check is needed only for the IO queue case but we can move it
earlier in nvme_tcp_start_queue().

>> +
>> +     /* If we cannot query the netdev limitations, do not offload */
>> +     if (!nvme_tcp_ddp_query_limits(netdev, queue))
>> +             return false;
>> +
>> +     /* If netdev supports nvme-tcp ddp offload, we can offload */
>> +     if (test_bit(ULP_DDP_C_NVME_TCP_BIT, netdev->ulp_ddp_caps.active))
>> +             return true;
>
> This should be coming from the API itself, have the limits query
> api fail if this is off.

We can move the function to the ULP DDP layer.

> btw, what is the active thing? is this driven from ethtool enable?
> what happens if the user disables it while there is a ulp using it?

The active bits are indeed driven by ethtool according to the design
Jakub suggested.
The nvme-tcp connection will have to be reconnected to see the effect of
changing the bit.

>> +
>> +     return false;
>
> This can be folded to the above function.

We won't be able to check for TLS in a common wrapper. We think this
should be kept.

>> +static int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue)
>> +{
>> +     struct net_device *netdev = queue->ctrl->offloading_netdev;
>> +     struct ulp_ddp_config config = {.type = ULP_DDP_NVME};
>> +     int ret;
>> +
>> +     config.nvmeotcp.pfv = NVME_TCP_PFV_1_0;
>
> Question, what happens if the pfv changes, is the ddp guaranteed to
> work?

The existing HW supports only NVME_TCP_PFV_1_0.
Once a new version is used, the device driver should fail the
sk_add().
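
For illustration only, such a driver-side check could be as simple as the
following (placement inside the driver's sk_add() implementation and the
error code are assumptions):

        /* hypothetical check in the device driver's sk_add() */
        if (config->nvmeotcp.pfv != NVME_TCP_PFV_1_0)
                return -EOPNOTSUPP;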

>> +     config.nvmeotcp.cpda = 0;
>> +     config.nvmeotcp.dgst =
>> +             queue->hdr_digest ? NVME_TCP_HDR_DIGEST_ENABLE : 0;
>> +     config.nvmeotcp.dgst |=
>> +             queue->data_digest ? NVME_TCP_DATA_DIGEST_ENABLE : 0;
>> +     config.nvmeotcp.queue_size = queue->ctrl->ctrl.sqsize + 1;
>> +     config.nvmeotcp.queue_id = nvme_tcp_queue_id(queue);
>> +     config.nvmeotcp.io_cpu = queue->io_cpu;
>> +
>> +     /* Socket ops keep a netdev reference. It is put in
>> +      * nvme_tcp_unoffload_socket().  This ref is dropped on
>> +      * NETDEV_GOING_DOWN events to allow the device to go down
>> +      */
>> +     dev_hold(netdev);
>> +     ret = netdev->netdev_ops->ulp_ddp_ops->sk_add(netdev,
>> +                                                   queue->sock->sk,
>> +                                                   &config);
>
> It would be preferred if dev_hold would be taken in sk_add
> and released in sk_del so that the ulp does not need to worry
> acount it.

Sure, we will move the refcount accounting to the ulp ddp wrapper.

>> +     inet_csk(queue->sock->sk)->icsk_ulp_ddp_ops = &nvme_tcp_ddp_ulp_ops;
> can also be folded inside an api.

Sure, we will move this to sk_add() and add a parameter for the ops.

>> +static void nvme_tcp_unoffload_socket(struct nvme_tcp_queue *queue)
>> +{
>> +     struct net_device *netdev = queue->ctrl->offloading_netdev;
>> +
>> +     if (!netdev) {
>> +             dev_info_ratelimited(queue->ctrl->ctrl.device, "netdev not found\n");
>> +             return;
>> +     }
>
> Again, is it reasonable to be called here with !netdev?

No it's redundant, we will remove it.

>> +
>> +     clear_bit(NVME_TCP_Q_OFF_DDP, &queue->flags);
>> +
>> +     netdev->netdev_ops->ulp_ddp_ops->sk_del(netdev, queue->sock->sk);
>> +
>> +     inet_csk(queue->sock->sk)->icsk_ulp_ddp_ops = NULL;
>> +     dev_put(netdev); /* held by offload_socket */
>
> Both can be done by the api instead of the ulp itself.

Sure, we will move it.
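
For illustration, the resulting wrappers might look roughly like the sketch
below (names follow the earlier netdev_ulp_ddp_sk_add/sk_del suggestion;
the signatures are assumptions, not the final API):

static inline int netdev_ulp_ddp_sk_add(struct net_device *netdev,
                                        struct sock *sk,
                                        struct ulp_ddp_config *config,
                                        const struct ulp_ddp_ulp_ops *ulp_ops)
{
        int ret;

        /* hold the netdev for as long as the socket is offloaded */
        dev_hold(netdev);
        ret = netdev->netdev_ops->ulp_ddp_ops->sk_add(netdev, sk, config);
        if (ret) {
                dev_put(netdev);
                return ret;
        }
        inet_csk(sk)->icsk_ulp_ddp_ops = ulp_ops;
        return 0;
}

static inline void netdev_ulp_ddp_sk_del(struct net_device *netdev,
                                         struct sock *sk)
{
        netdev->netdev_ops->ulp_ddp_ops->sk_del(netdev, sk);
        inet_csk(sk)->icsk_ulp_ddp_ops = NULL;
        dev_put(netdev); /* pairs with dev_hold() in sk_add */
}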

>> +}
>> +
>> +static void nvme_tcp_offload_apply_limits(struct nvme_tcp_queue *queue,
>> +                                       struct net_device *netdev)
>> +{
>> +     queue->ctrl->offloading_netdev = netdev;
>> +     queue->ctrl->ctrl.max_segments = queue->ddp_limits.max_ddp_sgl_len;
>> +     queue->ctrl->ctrl.max_hw_sectors =
>> +             queue->ddp_limits.max_ddp_sgl_len << (ilog2(SZ_4K) - 9);
>
> this is SECTOR_SHIFT?

Yes it is, we will use it.
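
I.e. the line would become something like (using the standard kernel
SECTOR_SHIFT definition):

        queue->ctrl->ctrl.max_hw_sectors =
                queue->ddp_limits.max_ddp_sgl_len << (ilog2(SZ_4K) - SECTOR_SHIFT);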

>> +/* In presence of packet drops or network packet reordering, the device may lose
>> + * synchronization between the TCP stream and the L5P framing, and require a
>> + * resync with the kernel's TCP stack.
>> + *
>> + * - NIC HW identifies a PDU header at some TCP sequence number,
>> + *   and asks NVMe-TCP to confirm it.
>> + * - When NVMe-TCP observes the requested TCP sequence, it will compare
>> + *   it with the PDU header TCP sequence, and report the result to the
>> + *   NIC driver
>> + */
>> +static void nvme_tcp_resync_response(struct nvme_tcp_queue *queue,
>> +                                  struct sk_buff *skb, unsigned int offset)
>> +{
>> +     u64 pdu_seq = TCP_SKB_CB(skb)->seq + offset - queue->pdu_offset;
>> +     struct net_device *netdev = queue->ctrl->offloading_netdev;
>> +     u64 pdu_val = (pdu_seq << 32) | ULP_DDP_RESYNC_PENDING;
>> +     u64 resync_val;
>> +     u32 resync_seq;
>> +
>> +     resync_val = atomic64_read(&queue->resync_req);
>> +     /* Lower 32 bit flags. Check validity of the request */
>> +     if ((resync_val & ULP_DDP_RESYNC_PENDING) == 0)
>> +             return;
>> +
>> +     /*
>> +      * Obtain and check requested sequence number: is this PDU header
>> +      * before the request?
>> +      */
>> +     resync_seq = resync_val >> 32;
>> +     if (before(pdu_seq, resync_seq))
>> +             return;
>> +
>> +     /*
>> +      * The atomic operation guarantees that we don't miss any NIC driver
>> +      * resync requests submitted after the above checks.
>> +      */
>> +     if (atomic64_cmpxchg(&queue->resync_req, pdu_val,
>> +                          pdu_val & ~ULP_DDP_RESYNC_PENDING) !=
>> +                          atomic64_read(&queue->resync_req))
>> +             netdev->netdev_ops->ulp_ddp_ops->resync(netdev,
>> +                                                     queue->sock->sk,
>> +                                                     pdu_seq);
>
> Who else is doing an atomic on this value? and what happens
> if the cmpxchg fails?

The driver thread can set queue->resync_req concurrently in patch
"net/mlx5e: NVMEoTCP, data-path for DDP+DDGST offload" in function
nvmeotcp_update_resync().

If the cmpxchg fails it means a new resync request was triggered by the
HW, the old request will be dropped and the new one will be processed by
a later PDU.
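
For reference, the 64-bit value is laid out as described in the struct
comment: the upper 32 bits hold the TCP sequence of the presumed PDU start,
the lower bits hold the ULP_DDP_RESYNC_PENDING flag. A minimal sketch of the
packing, using only names from the patch:

        u64 val      = ((u64)seq << 32) | ULP_DDP_RESYNC_PENDING; /* driver stores   */
        u32 rseq     = val >> 32;                                  /* nvme-tcp: seq   */
        bool pending = val & ULP_DDP_RESYNC_PENDING;               /* still pending?  */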

>> +}
>> +
>> +static bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags)
>> +{
>> +     struct nvme_tcp_queue *queue = sk->sk_user_data;
>> +
>> +     /*
>> +      * "seq" (TCP seq number) is what the HW assumes is the
>> +      * beginning of a PDU.  The nvme-tcp layer needs to store the
>> +      * number along with the "flags" (ULP_DDP_RESYNC_PENDING) to
>> +      * indicate that a request is pending.
>> +      */
>> +     atomic64_set(&queue->resync_req, (((uint64_t)seq << 32) | flags));
>
> Question, is this coming from multiple contexts? what contexts are
> competing here that make it an atomic operation? It is unclear what is
> going on here tbh.

The driver could get a resync request and set queue->resync_req
concurrently while processing HW CQEs as you can see in patch
"net/mlx5e: NVMEoTCP, data-path for DDP+DDGST offload" in function
nvmeotcp_update_resync().

The resync flow is:

     nvme-tcp                           mlx5                     hw
        |                                |                        |
        |                                |                      sends CQE with
        |                                |                      resync request
        |                                | <----------------------'                                  
        |                         nvmeotcp_update_resync()
  nvme_tcp_resync_request() <-----------'|
  we store the request

Later, while receiving PDUs we check for pending requests.
If there is one, we call nvme_tcp_resync_response(), which calls
into mlx5 to send the response to the HW.

>> @@ -1831,25 +2055,55 @@ static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
>>   {
>>       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
>>       struct nvme_tcp_queue *queue = &ctrl->queues[idx];
>> +     struct net_device *netdev;
>>       int ret;
>>
>>       queue->rd_enabled = true;
>>       nvme_tcp_init_recv_ctx(queue);
>>       nvme_tcp_setup_sock_ops(queue);
>>
>> -     if (idx)
>> +     if (idx) {
>>               ret = nvmf_connect_io_queue(nctrl, idx);
>> -     else
>> +             if (ret)
>> +                     goto err;
>> +
>> +             netdev = queue->ctrl->offloading_netdev;
>> +             if (is_netdev_ulp_offload_active(netdev, queue)) {
>
> Seems redundant to pass netdev as an argument here.

Thanks, we will remove it.

>> +                     ret = nvme_tcp_offload_socket(queue);
>> +                     if (ret) {
>> +                             dev_info(nctrl->device,
>> +                                      "failed to setup offload on queue %d ret=%d\n",
>> +                                      idx, ret);
>> +                     }
>> +             }
>> +     } else {
>>               ret = nvmf_connect_admin_queue(nctrl);
>> +             if (ret)
>> +                     goto err;
>>
>> -     if (!ret) {
>> -             set_bit(NVME_TCP_Q_LIVE, &queue->flags);
>> -     } else {
>> -             if (test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
>> -                     __nvme_tcp_stop_queue(queue);
>> -             dev_err(nctrl->device,
>> -                     "failed to connect queue: %d ret=%d\n", idx, ret);
>> +             netdev = get_netdev_for_sock(queue->sock->sk);
>
> Is there any chance that this is a different netdev than what is
> already recorded? doesn't make sense to me.

The idea is that we are first starting the admin queue, which looks up
the netdev associated with the socket and stored in the queue. Later,
when the IO queues are started, we use the recorded netdev.

In cases of bonding or vlan, a netdev can have lower device links, which
get_netdev_for_sock() will look up.

Do you see any gap?

>> +             if (!netdev) {
>> +                     dev_info_ratelimited(ctrl->ctrl.device, "netdev not found\n");
>> +                     ctrl->offloading_netdev = NULL;
>> +                     goto done;
>> +             }
>> +             if (is_netdev_ulp_offload_active(netdev, queue))
>> +                     nvme_tcp_offload_apply_limits(queue, netdev);
>> +             /*
>> +              * release the device as no offload context is
>> +              * established yet.
>> +              */
>> +             dev_put(netdev);
>
> the put is unclear, what does it pair with? the get_netdev_for_sock?

Yes, get_netdev_for_sock() takes a reference, which we don't need at
that point so we put it.
Sagi Grimberg Aug. 14, 2023, 6:54 p.m. UTC | #6
>>> +static inline bool is_netdev_ulp_offload_active(struct net_device *netdev,
>>> +                                             struct nvme_tcp_queue *queue)
>>> +{
>>> +     if (!netdev || !queue)
>>> +             return false;
>>
>> Is it reasonable to be called here with !netdev or !queue ?
> 
> The check is needed only for the IO queue case but we can move it
> earlier in nvme_tcp_start_queue().

I still don't understand even on io queues how this can happen.

>>> +
>>> +     /* If we cannot query the netdev limitations, do not offload */
>>> +     if (!nvme_tcp_ddp_query_limits(netdev, queue))
>>> +             return false;
>>> +
>>> +     /* If netdev supports nvme-tcp ddp offload, we can offload */
>>> +     if (test_bit(ULP_DDP_C_NVME_TCP_BIT, netdev->ulp_ddp_caps.active))
>>> +             return true;
>>
>> This should be coming from the API itself, have the limits query
>> api fail if this is off.
> 
> We can move the function to the ULP DDP layer.
> 
>> btw, what is the active thing? is this driven from ethtool enable?
>> what happens if the user disables it while there is a ulp using it?
> 
> The active bits are indeed driven by ethtool according to the design
> Jakub suggested.
> The nvme-tcp connection will have to be reconnected to see the effect of
> changing the bit.

It should move inside the api as well. Don't want to care about it in
nvme.
>>> +
>>> +     return false;
>>
>> This can be folded to the above function.
> 
> We won't be able to check for TLS in a common wrapper. We think this
> should be kept.

Why? Any tcp ddp needs to be able to support tls. Nothing specific to
nvme here.

>>> +static int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue)
>>> +{
>>> +     struct net_device *netdev = queue->ctrl->offloading_netdev;
>>> +     struct ulp_ddp_config config = {.type = ULP_DDP_NVME};
>>> +     int ret;
>>> +
>>> +     config.nvmeotcp.pfv = NVME_TCP_PFV_1_0;
>>
>> Question, what happens if the pfv changes, is the ddp guaranteed to
>> work?
> 
> The existing HW supports only NVME_TCP_PFV_1_0.
> Once a new version will be used, the device driver should fail the
> sk_add().

OK.

>>> +/* In presence of packet drops or network packet reordering, the device may lose
>>> + * synchronization between the TCP stream and the L5P framing, and require a
>>> + * resync with the kernel's TCP stack.
>>> + *
>>> + * - NIC HW identifies a PDU header at some TCP sequence number,
>>> + *   and asks NVMe-TCP to confirm it.
>>> + * - When NVMe-TCP observes the requested TCP sequence, it will compare
>>> + *   it with the PDU header TCP sequence, and report the result to the
>>> + *   NIC driver
>>> + */
>>> +static void nvme_tcp_resync_response(struct nvme_tcp_queue *queue,
>>> +                                  struct sk_buff *skb, unsigned int offset)
>>> +{
>>> +     u64 pdu_seq = TCP_SKB_CB(skb)->seq + offset - queue->pdu_offset;
>>> +     struct net_device *netdev = queue->ctrl->offloading_netdev;
>>> +     u64 pdu_val = (pdu_seq << 32) | ULP_DDP_RESYNC_PENDING;
>>> +     u64 resync_val;
>>> +     u32 resync_seq;
>>> +
>>> +     resync_val = atomic64_read(&queue->resync_req);
>>> +     /* Lower 32 bit flags. Check validity of the request */
>>> +     if ((resync_val & ULP_DDP_RESYNC_PENDING) == 0)
>>> +             return;
>>> +
>>> +     /*
>>> +      * Obtain and check requested sequence number: is this PDU header
>>> +      * before the request?
>>> +      */
>>> +     resync_seq = resync_val >> 32;
>>> +     if (before(pdu_seq, resync_seq))
>>> +             return;
>>> +
>>> +     /*
>>> +      * The atomic operation guarantees that we don't miss any NIC driver
>>> +      * resync requests submitted after the above checks.
>>> +      */
>>> +     if (atomic64_cmpxchg(&queue->resync_req, pdu_val,
>>> +                          pdu_val & ~ULP_DDP_RESYNC_PENDING) !=
>>> +                          atomic64_read(&queue->resync_req))
>>> +             netdev->netdev_ops->ulp_ddp_ops->resync(netdev,
>>> +                                                     queue->sock->sk,
>>> +                                                     pdu_seq);
>>
>> Who else is doing an atomic on this value? and what happens
>> if the cmpxchg fails?
> 
> The driver thread can set queue->resync_req concurrently in patch
> "net/mlx5e: NVMEoTCP, data-path for DDP+DDGST offload" in function
> nvmeotcp_update_resync().
> 
> If the cmpxchg fails it means a new resync request was triggered by the
> HW, the old request will be dropped and the new one will be processed by
> a later PDU.

So resync_req is actually the current tcp sequence number or something?
The name resync_req is very confusing.

> 
>>> +}
>>> +
>>> +static bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags)
>>> +{
>>> +     struct nvme_tcp_queue *queue = sk->sk_user_data;
>>> +
>>> +     /*
>>> +      * "seq" (TCP seq number) is what the HW assumes is the
>>> +      * beginning of a PDU.  The nvme-tcp layer needs to store the
>>> +      * number along with the "flags" (ULP_DDP_RESYNC_PENDING) to
>>> +      * indicate that a request is pending.
>>> +      */
>>> +     atomic64_set(&queue->resync_req, (((uint64_t)seq << 32) | flags));
>>
>> Question, is this coming from multiple contexts? what contexts are
>> competing here that make it an atomic operation? It is unclear what is
>> going on here tbh.
> 
> The driver could get a resync request and set queue->resync_req
> concurrently while processing HW CQEs as you can see in patch
> "net/mlx5e: NVMEoTCP, data-path for DDP+DDGST offload" in function
> nvmeotcp_update_resync().
> 
> The resync flow is:
> 
>       nvme-tcp                           mlx5                     hw
>          |                                |                        |
>          |                                |                      sends CQE with
>          |                                |                      resync request
>          |                                | <----------------------'
>          |                         nvmeotcp_update_resync()
>    nvme_tcp_resync_request() <-----------'|
>    we store the request
> 
> Later, while receiving PDUs we check for pending requests.
> If there is one, we call nvme_tcp_resync_response(), which calls
> into mlx5 to send the response to the HW.
> 

...

>>> +                     ret = nvme_tcp_offload_socket(queue);
>>> +                     if (ret) {
>>> +                             dev_info(nctrl->device,
>>> +                                      "failed to setup offload on queue %d ret=%d\n",
>>> +                                      idx, ret);
>>> +                     }
>>> +             }
>>> +     } else {
>>>                ret = nvmf_connect_admin_queue(nctrl);
>>> +             if (ret)
>>> +                     goto err;
>>>
>>> -     if (!ret) {
>>> -             set_bit(NVME_TCP_Q_LIVE, &queue->flags);
>>> -     } else {
>>> -             if (test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
>>> -                     __nvme_tcp_stop_queue(queue);
>>> -             dev_err(nctrl->device,
>>> -                     "failed to connect queue: %d ret=%d\n", idx, ret);
>>> +             netdev = get_netdev_for_sock(queue->sock->sk);
>>
>> Is there any chance that this is a different netdev than what is
>> already recorded? doesn't make sense to me.
> 
> The idea is that we are first starting the admin queue, which looks up
> the netdev associated with the socket and stored in the queue. Later,
> when the IO queues are started, we use the recorded netdev.
> 
> In cases of bonding or vlan, a netdev can have lower device links, which
> get_netdev_for_sock() will look up.

I think the code should, at a high level, do:
	if (idx) {
		ret = nvmf_connect_io_queue(nctrl, idx);
		if (ret)
			goto err;
		if (nvme_tcp_ddp_query_limits(queue))
			nvme_tcp_offload_socket(queue);

	} else {
		ret = nvmf_connect_admin_queue(nctrl);
		if (ret)
			goto err;
		ctrl->ddp_netdev = get_netdev_for_sock(queue->sock->sk);
		if (nvme_tcp_ddp_query_limits(queue))
			nvme_tcp_offload_apply_limits(queue);
	}

ctrl->ddp_netdev should be cleared and put when the admin queue
is stopped/freed, similar to how async_req is handled.

>>> +                     goto done;
>>> +             }
>>> +             if (is_netdev_ulp_offload_active(netdev, queue))
>>> +                     nvme_tcp_offload_apply_limits(queue, netdev);
>>> +             /*
>>> +              * release the device as no offload context is
>>> +              * established yet.
>>> +              */
>>> +             dev_put(netdev);
>>
>> the put is unclear, what does it pair with? the get_netdev_for_sock?
> 
> Yes, get_netdev_for_sock() takes a reference, which we don't need at
> that point so we put it.

Well, you store a pointer to it, what happens if it goes away while
the controller is being set up?
Max Gurtovoy Aug. 16, 2023, 12:50 a.m. UTC | #7
Hi Sagi,

On 11/08/2023 8:28, Chaitanya Kulkarni wrote:
> On 8/9/2023 12:39 AM, Sagi Grimberg wrote:
>>
>>
>> On 8/1/23 05:25, Chaitanya Kulkarni wrote:
>>> On 7/12/23 09:14, Aurelien Aptel wrote:
>>>> From: Boris Pismenny <borisp@nvidia.com>
>>>>
>>>> This commit introduces direct data placement offload to NVME
>>>> TCP. There is a context per queue, which is established after the
>>>> handshake using the sk_add/del NDOs.
>>>>
>>>> Additionally, a resynchronization routine is used to assist
>>>> hardware recovery from TCP OOO, and continue the offload.
>>>> Resynchronization operates as follows:
>>>>
>>>> 1. TCP OOO causes the NIC HW to stop the offload
>>>>
>>>> 2. NIC HW identifies a PDU header at some TCP sequence number,
>>>> and asks NVMe-TCP to confirm it.
>>>> This request is delivered from the NIC driver to NVMe-TCP by first
>>>> finding the socket for the packet that triggered the request, and
>>>> then finding the nvme_tcp_queue that is used by this routine.
>>>> Finally, the request is recorded in the nvme_tcp_queue.
>>>>
>>>> 3. When NVMe-TCP observes the requested TCP sequence, it will compare
>>>> it with the PDU header TCP sequence, and report the result to the
>>>> NIC driver (resync), which will update the HW, and resume offload
>>>> when all is successful.
>>>>
>>>> Some HW implementations such as ConnectX-7 assume a linear CCID space
>>>> (0...N-1 for a queue of size N), whereas the Linux nvme driver uses part
>>>> of the 16-bit CCID as a generation counter. To address that, we use the
>>>> existing quirk in the nvme layer when the HW driver advertises that the
>>>> device does not support the full 16-bit CCID range.
>>>>
>>>> Furthermore, we let the offloading driver advertise the maximum HW
>>>> sectors/segments via ulp_ddp_limits.
>>>>
>>>> A follow-up patch introduces the data-path changes required for this
>>>> offload.
>>>>
>>>> Socket operations need a netdev reference. This reference is
>>>> dropped on NETDEV_GOING_DOWN events to allow the device to go down in
>>>> a follow-up patch.
>>>>
>>>> Signed-off-by: Boris Pismenny <borisp@nvidia.com>
>>>> Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com>
>>>> Signed-off-by: Or Gerlitz <ogerlitz@nvidia.com>
>>>> Signed-off-by: Yoray Zack <yorayz@nvidia.com>
>>>> Signed-off-by: Shai Malin <smalin@nvidia.com>
>>>> Signed-off-by: Aurelien Aptel <aaptel@nvidia.com>
>>>> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
>>>> ---
>>>
>>> For NVMe related code :-
>>>
>>> The offload feature is configurable and may not be turned on in the absence
>>> of the H/W. In order to keep the nvme/host/tcp.c file small and focused on
>>> core functionality, I wonder if we should move the tcp-offload code into
>>> its own file, say nvme/host/tcp-offload.c?
>>
>> Maybe. It wouldn't be tcp_offload.c but rather tcp_ddp.c, because it's not
>> offloading the tcp stack but rather doing direct data placement.
>>
> 
> fine ...
> 
>> If we are going to do that it will pollute nvme.h or add a common
>> header file, which is something I'd like to avoid if possible.

Would you like us to try doing so and see how it will look?
I actually think that it will be easier to maintain if we split it into
tcp_ddp.c (+ optionally adding tcp.h under drivers/nvme/host...)

> 
> My comment was mainly based on how we have separated the core code from
> configurable features, and on wondering whether any decision we make for
> the host will also apply to the target ddp code in the future. But if you
> prefer to keep it as it is, let's not bloat the header ...
> 
> -ck
> 
> 
>
Aurelien Aptel Aug. 16, 2023, 12:30 p.m. UTC | #8
Sagi Grimberg <sagi@grimberg.me> writes:
>>>> +     if (!netdev || !queue)
>>>> +             return false;
>>>
>>> Is it reasonable to be called here with !netdev or !queue ?
>>
>> The check is needed only for the IO queue case but we can move it
>> earlier in nvme_tcp_start_queue().
>
> I still don't understand even on io queues how this can happen.

In the case where the netdev does not support DDP, when the admin queue is
started, ctrl->offloading_netdev will not be set and will therefore be
NULL.

Later, when the IO queue is started:

    netdev = queue->ctrl->offloading_netdev; <== NULL
    if (is_netdev_ulp_offload_active(netdev, queue)) { <== we pass NULL

We can move the NULL check higher-up if you prefer, like so:

    if (queue->ctrl->offloading_netdev &&
        is_netdev_ulp_offload_active(queue->ctrl->offloading_netdev, queue)) {

>>>> +
>>>> +     /* If we cannot query the netdev limitations, do not offload */
>>>> +     if (!nvme_tcp_ddp_query_limits(netdev, queue))
>>>> +             return false;
>>>> +
>>>> +     /* If netdev supports nvme-tcp ddp offload, we can offload */
>>>> +     if (test_bit(ULP_DDP_C_NVME_TCP_BIT, netdev->ulp_ddp_caps.active))
>>>> +             return true;
>>>
>>> This should be coming from the API itself, have the limits query
>>> api fail if this is off.
>>
>> We can move the function to the ULP DDP layer.
>>
>>> btw, what is the active thing? is this driven from ethtool enable?
>>> what happens if the user disables it while there is a ulp using it?
>>
>> The active bits are indeed driven by ethtool according to the design
>> Jakub suggested.
>> The nvme-tcp connection will have to be reconnected to see the effect of
>> changing the bit.
>
> It should move inside the api as well. Don't want to care about it in
> nvme.

Ok, we will move it there.

>>>> +
>>>> +     return false;
>>>
>>> This can be folded to the above function.
>>
>> We won't be able to check for TLS in a common wrapper. We think this
>> should be kept.
>
> Why? Any tcp ddp needs to be able to support tls. Nothing specific to
> nvme here.

True, we will move it to the ULP wrapper.

>>>> +     /*
>>>> +      * The atomic operation guarantees that we don't miss any NIC driver
>>>> +      * resync requests submitted after the above checks.
>>>> +      */
>>>> +     if (atomic64_cmpxchg(&queue->resync_req, pdu_val,
>>>> +                          pdu_val & ~ULP_DDP_RESYNC_PENDING) !=
>>>> +                          atomic64_read(&queue->resync_req))
>>>> +             netdev->netdev_ops->ulp_ddp_ops->resync(netdev,
>>>> +                                                     queue->sock->sk,
>>>> +                                                     pdu_seq);
>>>
>>> Who else is doing an atomic on this value? and what happens
>>> if the cmpxchg fails?
>>
>> The driver thread can set queue->resync_req concurrently in patch
>> "net/mlx5e: NVMEoTCP, data-path for DDP+DDGST offload" in function
>> nvmeotcp_update_resync().
>>
>> If the cmpxchg fails it means a new resync request was triggered by the
>> HW, the old request will be dropped and the new one will be processed by
>> a later PDU.
>
> So resync_req is actually the current tcp sequence number or something?
> The name resync_req is very confusing.

queue->resync_req is the TCP sequence for which the HW requested a
resync operation. We can rename it to queue->resync_tcp_seq.

>>>> +                     ret = nvme_tcp_offload_socket(queue);
>>>> +                     if (ret) {
>>>> +                             dev_info(nctrl->device,
>>>> +                                      "failed to setup offload on queue %d ret=%d\n",
>>>> +                                      idx, ret);
>>>> +                     }
>>>> +             }
>>>> +     } else {
>>>>                ret = nvmf_connect_admin_queue(nctrl);
>>>> +             if (ret)
>>>> +                     goto err;
>>>>
>>>> -     if (!ret) {
>>>> -             set_bit(NVME_TCP_Q_LIVE, &queue->flags);
>>>> -     } else {
>>>> -             if (test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
>>>> -                     __nvme_tcp_stop_queue(queue);
>>>> -             dev_err(nctrl->device,
>>>> -                     "failed to connect queue: %d ret=%d\n", idx, ret);
>>>> +             netdev = get_netdev_for_sock(queue->sock->sk);
>>>
>>> Is there any chance that this is a different netdev than what is
>>> already recorded? doesn't make sense to me.
>>
>> The idea is that we are first starting the admin queue, which looks up
>> the netdev associated with the socket and stored in the queue. Later,
>> when the IO queues are started, we use the recorded netdev.
>>
>> In cases of bonding or vlan, a netdev can have lower device links, which
>> get_netdev_for_sock() will look up.
>
> I think the code should, at a high level, do:
>         if (idx) {
>                 ret = nvmf_connect_io_queue(nctrl, idx);
>                 if (ret)
>                         goto err;
>                 if (nvme_tcp_ddp_query_limits(queue))
>                         nvme_tcp_offload_socket(queue);
>
>         } else {
>                 ret = nvmf_connect_admin_queue(nctrl);
>                 if (ret)
>                         goto err;
>                 ctrl->ddp_netdev = get_netdev_for_sock(queue->sock->sk);
>                 if (nvme_tcp_ddp_query_limits(queue))
>                         nvme_tcp_offload_apply_limits(queue);
>         }

Ok, we will follow this design.

> ctrl->ddp_netdev should be cleared and put when the admin queue
> is stopped/freed, similar to how async_req is handled.

Thanks, we will clear ddp_netdev on queue stop/free.
This will also prevent reusing a potentially wrong netdev after a reconnection.
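
For example, a sketch of the teardown side (the surrounding function is an
assumption; only the put and clear matter):

        /* in the admin queue stop/free path */
        if (ctrl->ddp_netdev) {
                dev_put(ctrl->ddp_netdev);
                ctrl->ddp_netdev = NULL;
        }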

>>>> +             /*
>>>> +              * release the device as no offload context is
>>>> +              * established yet.
>>>> +              */
>>>> +             dev_put(netdev);
>>>
>>> the put is unclear, what does it pair with? the get_netdev_for_sock?
>>
>> Yes, get_netdev_for_sock() takes a reference, which we don't need at
>> that point so we put it.
>
> Well, you store a pointer to it, what happens if it goes away while
> the controller is being set up?

It's a problem. We will remove the dev_put() to keep the first
reference, and only release it when it is cleared from the admin queue.

Patch

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 4a8b32bd5257..7d3b0ac65c03 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -20,6 +20,10 @@ 
 #include <net/busy_poll.h>
 #include <trace/events/sock.h>
 
+#ifdef CONFIG_ULP_DDP
+#include <net/ulp_ddp.h>
+#endif
+
 #include "nvme.h"
 #include "fabrics.h"
 
@@ -118,6 +122,7 @@  enum nvme_tcp_queue_flags {
 	NVME_TCP_Q_ALLOCATED	= 0,
 	NVME_TCP_Q_LIVE		= 1,
 	NVME_TCP_Q_POLLING	= 2,
+	NVME_TCP_Q_OFF_DDP	= 3,
 };
 
 enum nvme_tcp_recv_state {
@@ -145,6 +150,19 @@  struct nvme_tcp_queue {
 	size_t			ddgst_remaining;
 	unsigned int		nr_cqe;
 
+#ifdef CONFIG_ULP_DDP
+	/*
+	 * resync_req is a speculative PDU header TCP seq number (with
+	 * an additional flag in the 32 lower bits) that the HW sends
+	 * to the SW, for the SW to verify.
+	 * - The 32 high bits store the seq number
+	 * - The 32 low bits are used as a flag to know if a request
+	 *   is pending (ULP_DDP_RESYNC_PENDING).
+	 */
+	atomic64_t		resync_req;
+	struct ulp_ddp_limits	ddp_limits;
+#endif
+
 	/* send state */
 	struct nvme_tcp_request *request;
 
@@ -187,6 +205,9 @@  struct nvme_tcp_ctrl {
 	struct delayed_work	connect_work;
 	struct nvme_tcp_request async_req;
 	u32			io_queues[HCTX_MAX_TYPES];
+
+	struct net_device	*offloading_netdev;
+	u32			offload_io_threshold;
 };
 
 static LIST_HEAD(nvme_tcp_ctrl_list);
@@ -290,6 +311,204 @@  static inline size_t nvme_tcp_pdu_last_send(struct nvme_tcp_request *req,
 	return nvme_tcp_pdu_data_left(req) <= len;
 }
 
+#ifdef CONFIG_ULP_DDP
+
+static bool nvme_tcp_ddp_query_limits(struct net_device *netdev,
+				      struct nvme_tcp_queue *queue)
+{
+	int ret;
+
+	if (!netdev->netdev_ops->ulp_ddp_ops->limits)
+		return false;
+
+	queue->ddp_limits.type = ULP_DDP_NVME;
+	ret = netdev->netdev_ops->ulp_ddp_ops->limits(netdev, &queue->ddp_limits);
+	if (ret == -EOPNOTSUPP) {
+		return false;
+	} else if (ret) {
+		WARN_ONCE(ret, "ddp limits failed (ret=%d)", ret);
+		return false;
+	}
+
+	dev_dbg_ratelimited(queue->ctrl->ctrl.device,
+			    "netdev %s offload limits: max_ddp_sgl_len %d\n",
+			    netdev->name, queue->ddp_limits.max_ddp_sgl_len);
+
+	return true;
+}
+
+static inline bool is_netdev_ulp_offload_active(struct net_device *netdev,
+						struct nvme_tcp_queue *queue)
+{
+	if (!netdev || !queue)
+		return false;
+
+	/* If we cannot query the netdev limitations, do not offload */
+	if (!nvme_tcp_ddp_query_limits(netdev, queue))
+		return false;
+
+	/* If netdev supports nvme-tcp ddp offload, we can offload */
+	if (test_bit(ULP_DDP_C_NVME_TCP_BIT, netdev->ulp_ddp_caps.active))
+		return true;
+
+	return false;
+}
+
+static bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags);
+static const struct ulp_ddp_ulp_ops nvme_tcp_ddp_ulp_ops = {
+	.resync_request		= nvme_tcp_resync_request,
+};
+
+static int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue)
+{
+	struct net_device *netdev = queue->ctrl->offloading_netdev;
+	struct ulp_ddp_config config = {.type = ULP_DDP_NVME};
+	int ret;
+
+	config.nvmeotcp.pfv = NVME_TCP_PFV_1_0;
+	config.nvmeotcp.cpda = 0;
+	config.nvmeotcp.dgst =
+		queue->hdr_digest ? NVME_TCP_HDR_DIGEST_ENABLE : 0;
+	config.nvmeotcp.dgst |=
+		queue->data_digest ? NVME_TCP_DATA_DIGEST_ENABLE : 0;
+	config.nvmeotcp.queue_size = queue->ctrl->ctrl.sqsize + 1;
+	config.nvmeotcp.queue_id = nvme_tcp_queue_id(queue);
+	config.nvmeotcp.io_cpu = queue->io_cpu;
+
+	/* Socket ops keep a netdev reference. It is put in
+	 * nvme_tcp_unoffload_socket().  This ref is dropped on
+	 * NETDEV_GOING_DOWN events to allow the device to go down.
+	 */
+	dev_hold(netdev);
+	ret = netdev->netdev_ops->ulp_ddp_ops->sk_add(netdev,
+						      queue->sock->sk,
+						      &config);
+	if (ret) {
+		dev_put(netdev);
+		return ret;
+	}
+
+	inet_csk(queue->sock->sk)->icsk_ulp_ddp_ops = &nvme_tcp_ddp_ulp_ops;
+	set_bit(NVME_TCP_Q_OFF_DDP, &queue->flags);
+	return 0;
+}
+
+static void nvme_tcp_unoffload_socket(struct nvme_tcp_queue *queue)
+{
+	struct net_device *netdev = queue->ctrl->offloading_netdev;
+
+	if (!netdev) {
+		dev_info_ratelimited(queue->ctrl->ctrl.device, "netdev not found\n");
+		return;
+	}
+
+	clear_bit(NVME_TCP_Q_OFF_DDP, &queue->flags);
+
+	netdev->netdev_ops->ulp_ddp_ops->sk_del(netdev, queue->sock->sk);
+
+	inet_csk(queue->sock->sk)->icsk_ulp_ddp_ops = NULL;
+	dev_put(netdev); /* held by offload_socket */
+}
+
+static void nvme_tcp_offload_apply_limits(struct nvme_tcp_queue *queue,
+					  struct net_device *netdev)
+{
+	queue->ctrl->offloading_netdev = netdev;
+	queue->ctrl->ctrl.max_segments = queue->ddp_limits.max_ddp_sgl_len;
+	queue->ctrl->ctrl.max_hw_sectors =
+		queue->ddp_limits.max_ddp_sgl_len << (ilog2(SZ_4K) - 9);
+	queue->ctrl->offload_io_threshold = queue->ddp_limits.io_threshold;
+
+	/* apply the quirk if the HW doesn't support the full CCID range */
+	queue->ctrl->ctrl.quirks |=
+		queue->ddp_limits.nvmeotcp.full_ccid_range ? 0 : NVME_QUIRK_SKIP_CID_GEN;
+}
+
+/* In presence of packet drops or network packet reordering, the device may lose
+ * synchronization between the TCP stream and the L5P framing, and require a
+ * resync with the kernel's TCP stack.
+ *
+ * - NIC HW identifies a PDU header at some TCP sequence number,
+ *   and asks NVMe-TCP to confirm it.
+ * - When NVMe-TCP observes the requested TCP sequence, it will compare
+ *   it with the PDU header TCP sequence, and report the result to the
+ *   NIC driver
+ */
+static void nvme_tcp_resync_response(struct nvme_tcp_queue *queue,
+				     struct sk_buff *skb, unsigned int offset)
+{
+	u64 pdu_seq = TCP_SKB_CB(skb)->seq + offset - queue->pdu_offset;
+	struct net_device *netdev = queue->ctrl->offloading_netdev;
+	u64 pdu_val = (pdu_seq << 32) | ULP_DDP_RESYNC_PENDING;
+	u64 resync_val;
+	u32 resync_seq;
+
+	resync_val = atomic64_read(&queue->resync_req);
+	/* lower 32 bits hold flags: is a resync request pending? */
+	if ((resync_val & ULP_DDP_RESYNC_PENDING) == 0)
+		return;
+
+	/*
+	 * Obtain and check requested sequence number: is this PDU header
+	 * before the request?
+	 */
+	resync_seq = resync_val >> 32;
+	if (before(pdu_seq, resync_seq))
+		return;
+
+	/*
+	 * The atomic operation guarantees that we don't miss any NIC driver
+	 * resync requests submitted after the above checks.
+	 */
+	if (atomic64_cmpxchg(&queue->resync_req, pdu_val,
+			     pdu_val & ~ULP_DDP_RESYNC_PENDING) !=
+			     atomic64_read(&queue->resync_req))
+		netdev->netdev_ops->ulp_ddp_ops->resync(netdev,
+							queue->sock->sk,
+							pdu_seq);
+}
+
+static bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags)
+{
+	struct nvme_tcp_queue *queue = sk->sk_user_data;
+
+	/*
+	 * "seq" (TCP seq number) is what the HW assumes is the
+	 * beginning of a PDU.  The nvme-tcp layer needs to store the
+	 * number along with the "flags" (ULP_DDP_RESYNC_PENDING) to
+	 * indicate that a request is pending.
+	 */
+	atomic64_set(&queue->resync_req, (((uint64_t)seq << 32) | flags));
+
+	return true;
+}
+
+#else
+
+static inline bool is_netdev_ulp_offload_active(struct net_device *netdev,
+						struct nvme_tcp_queue *queue)
+{
+	return false;
+}
+
+static int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue)
+{
+	return -EOPNOTSUPP;
+}
+
+static void nvme_tcp_unoffload_socket(struct nvme_tcp_queue *queue)
+{}
+
+static void nvme_tcp_offload_apply_limits(struct nvme_tcp_queue *queue,
+					  struct net_device *netdev)
+{}
+
+static void nvme_tcp_resync_response(struct nvme_tcp_queue *queue,
+				     struct sk_buff *skb, unsigned int offset)
+{}
+
+#endif
+
 static void nvme_tcp_init_iter(struct nvme_tcp_request *req,
 		unsigned int dir)
 {
@@ -732,6 +951,9 @@  static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 	size_t rcv_len = min_t(size_t, *len, queue->pdu_remaining);
 	int ret;
 
+	if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags))
+		nvme_tcp_resync_response(queue, skb, *offset);
+
 	ret = skb_copy_bits(skb, *offset,
 		&pdu[queue->pdu_offset], rcv_len);
 	if (unlikely(ret))
@@ -1795,6 +2017,8 @@  static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
 	kernel_sock_shutdown(queue->sock, SHUT_RDWR);
 	nvme_tcp_restore_sock_ops(queue);
 	cancel_work_sync(&queue->io_work);
+	if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags))
+		nvme_tcp_unoffload_socket(queue);
 }
 
 static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
@@ -1831,25 +2055,55 @@  static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
 {
 	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
 	struct nvme_tcp_queue *queue = &ctrl->queues[idx];
+	struct net_device *netdev;
 	int ret;
 
 	queue->rd_enabled = true;
 	nvme_tcp_init_recv_ctx(queue);
 	nvme_tcp_setup_sock_ops(queue);
 
-	if (idx)
+	if (idx) {
 		ret = nvmf_connect_io_queue(nctrl, idx);
-	else
+		if (ret)
+			goto err;
+
+		netdev = queue->ctrl->offloading_netdev;
+		if (is_netdev_ulp_offload_active(netdev, queue)) {
+			ret = nvme_tcp_offload_socket(queue);
+			if (ret) {
+				dev_info(nctrl->device,
+					 "failed to setup offload on queue %d ret=%d\n",
+					 idx, ret);
+			}
+		}
+	} else {
 		ret = nvmf_connect_admin_queue(nctrl);
+		if (ret)
+			goto err;
 
-	if (!ret) {
-		set_bit(NVME_TCP_Q_LIVE, &queue->flags);
-	} else {
-		if (test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
-			__nvme_tcp_stop_queue(queue);
-		dev_err(nctrl->device,
-			"failed to connect queue: %d ret=%d\n", idx, ret);
+		netdev = get_netdev_for_sock(queue->sock->sk);
+		if (!netdev) {
+			dev_info_ratelimited(ctrl->ctrl.device, "netdev not found\n");
+			ctrl->offloading_netdev = NULL;
+			goto done;
+		}
+		if (is_netdev_ulp_offload_active(netdev, queue))
+			nvme_tcp_offload_apply_limits(queue, netdev);
+		/*
+		 * release the device as no offload context is
+		 * established yet.
+		 */
+		dev_put(netdev);
 	}
+
+done:
+	set_bit(NVME_TCP_Q_LIVE, &queue->flags);
+	return 0;
+err:
+	if (test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
+		__nvme_tcp_stop_queue(queue);
+	dev_err(nctrl->device,
+		"failed to connect queue: %d ret=%d\n", idx, ret);
 	return ret;
 }