
[v2,net-next,07/21] nvme-tcp: Add DDP data-path

Message ID: 20210114151033.13020-8-borisp@mellanox.com
State: Changes Requested
Delegated to: Netdev Maintainers
Series: nvme-tcp receive offloads

Checks

Context                Check    Description
netdev/apply           fail     Patch does not apply to net-next
netdev/tree_selection  success  Clearly marked for net-next

Commit Message

Boris Pismenny Jan. 14, 2021, 3:10 p.m. UTC
Introduce the NVMe-TCP DDP data-path offload.
Using this interface, the NIC hardware will scatter TCP payload directly
to the BIO pages according to the command_id in the PDU.
To maintain the correctness of the network stack, the driver is expected
to construct SKBs that point to the BIO pages.
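
To illustrate this expectation, a DDP-capable NIC driver attaches the
destination BIO page, rather than its own receive buffer, to the skb it
builds on Rx. The helper below is hypothetical (the real driver code is
in the mlx5e patches later in this series):

/* Hypothetical sketch, not part of this patch: point the skb frag at
 * the destination (BIO) page into which the HW already placed the
 * payload, so the later copy in the stack degenerates to a no-op.
 */
static void ddp_attach_dest_page(struct sk_buff *skb, struct page *bio_page,
				 unsigned int offset, unsigned int len)
{
	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, bio_page,
			offset, len, len);
}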

The data-path interface contains two routines: tcp_ddp_setup/teardown.
The setup provides the mapping from command_id to the request buffers,
while the teardown removes this mapping.
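
For orientation, the parts of the interface this patch exercises look
roughly as follows. This is a sketch reconstructed from the call sites
below; the authoritative definitions are introduced earlier in the
series:

/* Sketch only, reconstructed from the usage in this patch; the ops
 * struct name is assumed, as only its use is visible here.
 */
struct tcp_ddp_io {
	u32			command_id;
	int			nents;
	struct sg_table		sg_table;
	struct scatterlist	first_sgl[SG_CHUNK_SIZE];
};

struct tcp_ddp_dev_ops {
	int (*tcp_ddp_setup)(struct net_device *netdev, struct sock *sk,
			     struct tcp_ddp_io *io);
	int (*tcp_ddp_teardown)(struct net_device *netdev, struct sock *sk,
				struct tcp_ddp_io *io, void *ddp_ctx);
};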

For efficiency, we introduce an asynchronous nvme completion, which is
split between NVMe-TCP and the NIC driver as follows:
NVMe-TCP performs the NVMe-specific completion, while the NIC driver
performs the generic blk_mq completion.
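
Concretely, the split for an offloaded request looks like this (the
function names are the ones added by this patch; the flow itself is
only a sketch):

nvme_tcp_process_nvme_cqe()                 /* NVMe-TCP: specific part */
	req->status = cqe->status;
	req->result = cqe->result;
	nvme_tcp_teardown_ddp()             /* hand off to the NIC driver */
		netdev->tcp_ddp_ops->tcp_ddp_teardown(netdev, sk, &req->ddp, rq);

/* ...later, from the NIC driver, once the HW context is released... */
nvme_tcp_ddp_teardown_done(rq)              /* generic blk_mq part */
	nvme_try_complete_req(rq, req->status, req->result);
	nvme_complete_rq(rq);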

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yoray Zack <yorayz@mellanox.com>
---
 drivers/nvme/host/tcp.c | 141 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 131 insertions(+), 10 deletions(-)

Comments

David Ahern Jan. 19, 2021, 4:18 a.m. UTC | #1
On 1/14/21 8:10 AM, Boris Pismenny wrote:
> @@ -664,8 +753,15 @@ static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
>  		return -EINVAL;
>  	}
>  
> -	if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
> -		nvme_complete_rq(rq);
> +	req = blk_mq_rq_to_pdu(rq);
> +	if (req->offloaded) {
> +		req->status = cqe->status;
> +		req->result = cqe->result;
> +		nvme_tcp_teardown_ddp(queue, cqe->command_id, rq);
> +	} else {
> +		if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
> +			nvme_complete_rq(rq);
> +	}
>  	queue->nr_cqe++;
>  
>  	return 0;
> @@ -859,9 +955,18 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>  static inline void nvme_tcp_end_request(struct request *rq, u16 status)
>  {
>  	union nvme_result res = {};
> +	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
> +	struct nvme_tcp_queue *queue = req->queue;
> +	struct nvme_tcp_data_pdu *pdu = (void *)queue->pdu;
>  
> -	if (!nvme_try_complete_req(rq, cpu_to_le16(status << 1), res))
> -		nvme_complete_rq(rq);
> +	if (req->offloaded) {
> +		req->status = cpu_to_le16(status << 1);
> +		req->result = res;
> +		nvme_tcp_teardown_ddp(queue, pdu->command_id, rq);
> +	} else {
> +		if (!nvme_try_complete_req(rq, cpu_to_le16(status << 1), res))
> +			nvme_complete_rq(rq);
> +	}
>  }
>  
>  static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,


The req->offloaded checks assume the offload is to the expected
offload_netdev, but you do not verify the data arrived as expected. You
might get lucky if both netdevs belong to the same PCI device (assuming
the h/w handles it a certain way), but it will not if the netdevs
belong to different devices.

Consider a system with 2 network cards -- even if it is 2 mlx5 based
devices. One setup can have the system using a bond with 1 port from
each PCI device. The tx path picks a leg based on the hash of the ntuple
and that (with Tariq's bond patches) becomes the expected offload
device. A similar example holds for a pure routing setup with ECMP. For
both there is full redundancy in the network - separate NIC cards
connected to separate TORs to have independent network paths.

A packet arrives on the *other* netdevice - you have *no* control over
the Rx path. Your current checks will think the packet arrived with DDP
but it did not.
Boris Pismenny Jan. 31, 2021, 8:44 a.m. UTC | #2
On 19/01/2021 6:18, David Ahern wrote:
> On 1/14/21 8:10 AM, Boris Pismenny wrote:
>> @@ -664,8 +753,15 @@ static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
>>  		return -EINVAL;
>>  	}
>>  
>> -	if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
>> -		nvme_complete_rq(rq);
>> +	req = blk_mq_rq_to_pdu(rq);
>> +	if (req->offloaded) {
>> +		req->status = cqe->status;
>> +		req->result = cqe->result;
>> +		nvme_tcp_teardown_ddp(queue, cqe->command_id, rq);
>> +	} else {
>> +		if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
>> +			nvme_complete_rq(rq);
>> +	}
>>  	queue->nr_cqe++;
>>  
>>  	return 0;
>> @@ -859,9 +955,18 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>>  static inline void nvme_tcp_end_request(struct request *rq, u16 status)
>>  {
>>  	union nvme_result res = {};
>> +	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
>> +	struct nvme_tcp_queue *queue = req->queue;
>> +	struct nvme_tcp_data_pdu *pdu = (void *)queue->pdu;
>>  
>> -	if (!nvme_try_complete_req(rq, cpu_to_le16(status << 1), res))
>> -		nvme_complete_rq(rq);
>> +	if (req->offloaded) {
>> +		req->status = cpu_to_le16(status << 1);
>> +		req->result = res;
>> +		nvme_tcp_teardown_ddp(queue, pdu->command_id, rq);
>> +	} else {
>> +		if (!nvme_try_complete_req(rq, cpu_to_le16(status << 1), res))
>> +			nvme_complete_rq(rq);
>> +	}
>>  }
>>  
>>  static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
> 
> 
> The req->offloaded checks assume the offload is to the expected
> offload_netdev, but you do not verify the data arrived as expected. You
> might get lucky if both netdevs belong to the same PCI device (assuming
> the h/w handles it a certain way), but it will not if the netdevs
> belong to different devices.
> 
> Consider a system with 2 network cards -- even if it is 2 mlx5 based
> devices. One setup can have the system using a bond with 1 port from
> each PCI device. The tx path picks a leg based on the hash of the ntuple
> and that (with Tariq's bond patches) becomes the expected offload
> device. A similar example holds for a pure routing setup with ECMP. For
> both there is full redundancy in the network - separate NIC cards
> connected to separate TORs to have independent network paths.
> 
> A packet arrives on the *other* netdevice - you have *no* control over
> the Rx path. Your current checks will think the packet arrived with DDP
> but it did not.
> 

There's no problem if another (non-offload) netdevice receives traffic
that arrives here, because that other device will never set the SKB
frag pages to point at the final destination buffers, and so the copy
offload will not take place.

The req->offloaded indication is mainly for matching ddp_setup with
ddp_teardown calls; it does not control the copy/crc offloads, as these
are controlled per-skb: frag pages for the copy and skb bits for the crc.
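
To make that concrete, the decision is made page-by-page inside the
skb_ddp_* copy helpers. The helper below is illustrative only (it is
not the series' actual skb_ddp_copy_datagram_iter() implementation):
when the skb frag already points at the destination page, the data was
DDP-placed by the NIC and the copy is elided; otherwise a normal copy
takes place.

/* Illustrative only; the real logic lives in the skb_ddp_* helpers
 * added earlier in this series.
 */
static void ddp_copy_or_skip(struct page *src, size_t src_off,
			     struct page *dst, size_t dst_off, size_t len)
{
	void *s, *d;

	if (src == dst && src_off == dst_off)
		return;		/* the NIC already placed the data here */

	s = kmap_local_page(src);
	d = kmap_local_page(dst);
	memcpy(d + dst_off, s + src_off, len);
	kunmap_local(d);
	kunmap_local(s);
}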

Patch

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 31bf9e3ea236..9a43bd31ec15 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -57,6 +57,11 @@  struct nvme_tcp_request {
 	size_t			offset;
 	size_t			data_sent;
 	enum nvme_tcp_send_state state;
+
+	bool			offloaded;
+	struct tcp_ddp_io	ddp;
+	__le16			status;
+	union nvme_result	result;
 };
 
 enum nvme_tcp_queue_flags {
@@ -231,10 +236,74 @@  static inline size_t nvme_tcp_pdu_last_send(struct nvme_tcp_request *req,
 #ifdef CONFIG_TCP_DDP
 
 static bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags);
+static void nvme_tcp_ddp_teardown_done(void *ddp_ctx);
 static const struct tcp_ddp_ulp_ops nvme_tcp_ddp_ulp_ops = {
 	.resync_request		= nvme_tcp_resync_request,
+	.ddp_teardown_done	= nvme_tcp_ddp_teardown_done,
 };
 
+static
+int nvme_tcp_teardown_ddp(struct nvme_tcp_queue *queue,
+			  u16 command_id,
+			  struct request *rq)
+{
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	struct net_device *netdev = queue->ctrl->offloading_netdev;
+	int ret;
+
+	if (unlikely(!netdev)) {
+		pr_info_ratelimited("%s: netdev not found\n", __func__);
+		return -EINVAL;
+	}
+
+	ret = netdev->tcp_ddp_ops->tcp_ddp_teardown(netdev, queue->sock->sk,
+						    &req->ddp, rq);
+	sg_free_table_chained(&req->ddp.sg_table, SG_CHUNK_SIZE);
+	req->offloaded = false;
+	return ret;
+}
+
+static void nvme_tcp_ddp_teardown_done(void *ddp_ctx)
+{
+	struct request *rq = ddp_ctx;
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+
+	if (!nvme_try_complete_req(rq, cpu_to_le16(req->status << 1), req->result))
+		nvme_complete_rq(rq);
+}
+
+static
+int nvme_tcp_setup_ddp(struct nvme_tcp_queue *queue,
+		       u16 command_id,
+		       struct request *rq)
+{
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	struct net_device *netdev = queue->ctrl->offloading_netdev;
+	int ret;
+
+	req->offloaded = false;
+
+	if (unlikely(!netdev)) {
+		pr_info_ratelimited("%s: netdev not found\n", __func__);
+		return -EINVAL;
+	}
+
+	req->ddp.command_id = command_id;
+	req->ddp.sg_table.sgl = req->ddp.first_sgl;
+	ret = sg_alloc_table_chained(&req->ddp.sg_table, blk_rq_nr_phys_segments(rq),
+				     req->ddp.sg_table.sgl, SG_CHUNK_SIZE);
+	if (ret)
+		return -ENOMEM;
+	req->ddp.nents = blk_rq_map_sg(rq->q, rq, req->ddp.sg_table.sgl);
+
+	ret = netdev->tcp_ddp_ops->tcp_ddp_setup(netdev,
+						 queue->sock->sk,
+						 &req->ddp);
+	if (!ret)
+		req->offloaded = true;
+	return ret;
+}
+
 static
 int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue)
 {
@@ -375,6 +444,25 @@  bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags)
 
 #else
 
+static
+int nvme_tcp_setup_ddp(struct nvme_tcp_queue *queue,
+		       u16 command_id,
+		       struct request *rq)
+{
+	return -EINVAL;
+}
+
+static
+int nvme_tcp_teardown_ddp(struct nvme_tcp_queue *queue,
+			  u16 command_id,
+			  struct request *rq)
+{
+	return -EINVAL;
+}
+
+static void nvme_tcp_ddp_teardown_done(void *ddp_ctx)
+{}
+
 static
 int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue)
 {
@@ -653,6 +741,7 @@  static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
 static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
 		struct nvme_completion *cqe)
 {
+	struct nvme_tcp_request *req;
 	struct request *rq;
 
 	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
@@ -664,8 +753,15 @@  static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
 		return -EINVAL;
 	}
 
-	if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
-		nvme_complete_rq(rq);
+	req = blk_mq_rq_to_pdu(rq);
+	if (req->offloaded) {
+		req->status = cqe->status;
+		req->result = cqe->result;
+		nvme_tcp_teardown_ddp(queue, cqe->command_id, rq);
+	} else {
+		if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
+			nvme_complete_rq(rq);
+	}
 	queue->nr_cqe++;
 
 	return 0;
@@ -859,9 +955,18 @@  static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 static inline void nvme_tcp_end_request(struct request *rq, u16 status)
 {
 	union nvme_result res = {};
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	struct nvme_tcp_queue *queue = req->queue;
+	struct nvme_tcp_data_pdu *pdu = (void *)queue->pdu;
 
-	if (!nvme_try_complete_req(rq, cpu_to_le16(status << 1), res))
-		nvme_complete_rq(rq);
+	if (req->offloaded) {
+		req->status = cpu_to_le16(status << 1);
+		req->result = res;
+		nvme_tcp_teardown_ddp(queue, pdu->command_id, rq);
+	} else {
+		if (!nvme_try_complete_req(rq, cpu_to_le16(status << 1), res))
+			nvme_complete_rq(rq);
+	}
 }
 
 static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
@@ -908,12 +1013,22 @@  static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 		recv_len = min_t(size_t, recv_len,
 				iov_iter_count(&req->iter));
 
-		if (queue->data_digest)
-			ret = skb_copy_and_hash_datagram_iter(skb, *offset,
-				&req->iter, recv_len, queue->rcv_hash);
-		else
-			ret = skb_copy_datagram_iter(skb, *offset,
-					&req->iter, recv_len);
+		if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags)) {
+			if (queue->data_digest)
+				ret = skb_ddp_copy_and_hash_datagram_iter(skb, *offset,
+						&req->iter, recv_len, queue->rcv_hash);
+			else
+				ret = skb_ddp_copy_datagram_iter(skb, *offset,
+						&req->iter, recv_len);
+		} else {
+			if (queue->data_digest)
+				ret = skb_copy_and_hash_datagram_iter(skb, *offset,
+						&req->iter, recv_len, queue->rcv_hash);
+			else
+				ret = skb_copy_datagram_iter(skb, *offset,
+						&req->iter, recv_len);
+		}
+
 		if (ret) {
 			dev_err(queue->ctrl->ctrl.device,
 				"queue %d failed to copy request %#x data",
@@ -1137,6 +1252,7 @@  static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
 	bool inline_data = nvme_tcp_has_inline_data(req);
 	u8 hdgst = nvme_tcp_hdgst_len(queue);
 	int len = sizeof(*pdu) + hdgst - req->offset;
+	struct request *rq = blk_mq_rq_from_pdu(req);
 	int flags = MSG_DONTWAIT;
 	int ret;
 
@@ -1145,6 +1261,10 @@  static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
 	else
 		flags |= MSG_EOR;
 
+	if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags) &&
+	    blk_rq_nr_phys_segments(rq) && rq_data_dir(rq) == READ)
+		nvme_tcp_setup_ddp(queue, pdu->cmd.common.command_id, rq);
+
 	if (queue->hdr_digest && !req->offset)
 		nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
 
@@ -2447,6 +2567,7 @@  static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
 	req->data_len = blk_rq_nr_phys_segments(rq) ?
 				blk_rq_payload_bytes(rq) : 0;
 	req->curr_bio = rq->bio;
+	req->offloaded = false;
 
 	if (rq_data_dir(rq) == WRITE &&
 	    req->data_len <= nvme_tcp_inline_data_size(queue))