[v10,22/25] net/mlx5e: NVMEoTCP, ddp setup and resync

Message ID	20230126162136.13003-23-aaptel@nvidia.com (mailing list archive)
State	Superseded
Delegated to:	Netdev Maintainers
Series	nvme-tcp receive offloads
	[v10,00/25] nvme-tcp receive offloads
	[v10,01/25] net: Introduce direct data placement tcp offload
	[v10,02/25] net/ethtool: add new stringset ETH_SS_ULP_DDP_{CAPS,STATS}
	[v10,03/25] net/ethtool: add ULP_DDP_{GET,SET} operations for caps and stats
	[v10,04/25] Documentation: document netlink ULP_DDP_GET/SET messages
	[v10,05/25] iov_iter: skip copy if src == dst for direct data placement
	[v10,06/25] net/tls,core: export get_netdev_for_sock
	[v10,07/25] nvme-tcp: Add DDP offload control path
	[v10,08/25] nvme-tcp: Add DDP data-path
	[v10,09/25] nvme-tcp: RX DDGST offload
	[v10,10/25] nvme-tcp: Deal with netdevice DOWN events
	[v10,11/25] nvme-tcp: Add modparam to control the ULP offload enablement
	[v10,12/25] Documentation: add ULP DDP offload documentation
	[v10,13/25] net/mlx5e: Rename from tls to transport static params
	[v10,14/25] net/mlx5e: Refactor ico sq polling to get budget
	[v10,15/25] net/mlx5e: Have mdev pointer directly on the icosq structure
	[v10,16/25] net/mlx5e: Refactor doorbell function to allow avoiding a completion
	[v10,17/25] net/mlx5: Add NVMEoTCP caps, HW bits, 128B CQE and enumerations
	[v10,18/25] net/mlx5e: NVMEoTCP, offload initialization
	[v10,19/25] net/mlx5e: TCP flow steering for nvme-tcp acceleration
	[v10,20/25] net/mlx5e: NVMEoTCP, use KLM UMRs for buffer registration
	[v10,21/25] net/mlx5e: NVMEoTCP, queue init/teardown
	[v10,22/25] net/mlx5e: NVMEoTCP, ddp setup and resync
	[v10,23/25] net/mlx5e: NVMEoTCP, async ddp invalidation
	[v10,24/25] net/mlx5e: NVMEoTCP, data-path for DDP+DDGST offload
	[v10,25/25] net/mlx5e: NVMEoTCP, statistics

Headers

From: Aurelien Aptel <aaptel@nvidia.com>
To: linux-nvme@lists.infradead.org, netdev@vger.kernel.org,
        sagi@grimberg.me, hch@lst.de, kbusch@kernel.org, axboe@fb.com,
        chaitanyak@nvidia.com, davem@davemloft.net, kuba@kernel.org
Cc: Aurelien Aptel <aaptel@nvidia.com>, aurelien.aptel@gmail.com,
        smalin@nvidia.com, malin1024@gmail.com, ogerlitz@nvidia.com,
        yorayz@nvidia.com, borisp@nvidia.com
Subject: [PATCH v10 22/25] net/mlx5e: NVMEoTCP, ddp setup and resync
Date: Thu, 26 Jan 2023 18:21:33 +0200
Message-Id: <20230126162136.13003-23-aaptel@nvidia.com>
In-Reply-To: <20230126162136.13003-1-aaptel@nvidia.com>
References: <20230126162136.13003-1-aaptel@nvidia.com>
Content-Transfer-Encoding: 8bit
Content-Type: text/plain
MIME-Version: 1.0
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org
X-Patchwork-Delegate: kuba@kernel.org

Context	Check	Description
netdev/tree_selection	success	Guessed tree name to be net-next, async
netdev/fixes_present	success	Fixes tag not required for -next series
netdev/subject_prefix	success	Link
netdev/cover_letter	success	Series has a cover letter
netdev/patch_count	fail	Series longer than 15 patches (and no cover letter)
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers	warning	7 maintainers not CCed: saeedm@nvidia.com benishay@nvidia.com leon@kernel.org pabeni@redhat.com linux-rdma@vger.kernel.org tariqt@nvidia.com edumazet@google.com
netdev/build_clang	success	Errors and warnings before: 0 this patch: 0
netdev/module_param	success	Was 0 now: 0
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/check_selftest	success	No net selftest shell script
netdev/verify_fixes	success	No Fixes tag
netdev/build_allmodconfig_warn	success	Errors and warnings before: 0 this patch: 0
netdev/checkpatch	warning	WARNING: line length of 81 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns WARNING: line length of 85 exceeds 80 columns WARNING: line length of 92 exceeds 80 columns
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	success	Was 0 now: 0

Commit Message

Aurelien Aptel Jan. 26, 2023, 4:21 p.m. UTC

From: Ben Ben-Ishay <benishay@nvidia.com>

NVMEoTCP offload uses buffer registration for every NVMe request to perform
direct data placement. This is achieved by creating a NIC HW mapping
between the CCID (command capsule ID) and the set of buffers that compose
the request. The registration is implemented via an MKEY, for which we do
fast/async mapping using a KLM UMR WQE.

The buffer registration takes place when the ULP calls the ddp_setup op,
which it does before sending the corresponding request to the other
side (e.g. the nvmf target). We don't wait for the registration to
complete before returning to the ULP, because the HW mapping should be
in place well within the RTT it takes for the request to be answered.
If this doesn't happen, some IO may not be ddp-offloaded, but that
doesn't stop the overall offloading session.

When the offloading HW gets out of sync with the protocol session, a
hardware/software handshake takes place to resync. The ddp_resync op is the
part of the handshake where the SW confirms to the HW that it indeed
correctly identified a PDU header at a certain TCP sequence number. This
allows the HW to resume the offload.

The first part of the handshake is when the HW identifies such a sequence
number in an arriving packet. A special mark is made on the completion
(cqe) and then the mlx5 driver invokes the ddp resync_request callback
advertised by the ULP in the ddp context; this is done in a later patch
in the series.

Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com>
Signed-off-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Or Gerlitz <ogerlitz@nvidia.com>
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Signed-off-by: Aurelien Aptel <aaptel@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
---
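Reader's note (not part of the commit): the resync handshake described in
the commit message can be pictured as a small state machine. The sketch
below is a toy model in plain C, not driver code. Every name in it
(toy_hw, toy_hw_lose_sync, toy_hw_report_candidate, toy_sw_confirm and the
three states) is hypothetical, and the behaviour on a mismatched sequence
number is an assumption made only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical states; they do not correspond to real mlx5 definitions. */
enum toy_hw_state {
	TOY_HW_OFFLOADING,      /* HW tracks PDU boundaries, DDP is active */
	TOY_HW_SEARCHING,       /* HW lost sync and looks for a PDU header */
	TOY_HW_PENDING_CONFIRM, /* HW reported a candidate, waits for SW */
};

struct toy_hw {
	enum toy_hw_state state;
	uint32_t candidate_seq; /* TCP seq where HW believes a PDU header starts */
};

/* The HW gets out of sync with the protocol session. */
static void toy_hw_lose_sync(struct toy_hw *hw)
{
	hw->state = TOY_HW_SEARCHING;
}

/* First part of the handshake: the HW spots a candidate PDU header in an
 * arriving packet and reports its sequence number (in the real flow this is
 * the marked CQE followed by the resync_request callback).
 */
static uint32_t toy_hw_report_candidate(struct toy_hw *hw, uint32_t seq)
{
	assert(hw->state == TOY_HW_SEARCHING);
	hw->candidate_seq = seq;
	hw->state = TOY_HW_PENDING_CONFIRM;
	return seq;
}

/* Second part: the SW confirms the candidate (the role ddp_resync plays).
 * Assumption: a non-matching sequence number leaves the HW waiting.
 */
static bool toy_sw_confirm(struct toy_hw *hw, uint32_t sw_pdu_hdr_seq)
{
	if (hw->state != TOY_HW_PENDING_CONFIRM ||
	    hw->candidate_seq != sw_pdu_hdr_seq)
		return false;

	hw->state = TOY_HW_OFFLOADING; /* HW may resume the offload */
	return true;
}
```

In this model the offload only resumes once the sequence number reported by
the HW and the one confirmed by the SW agree, which is the point of the
two-step handshake.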
 .../mellanox/mlx5/core/en_accel/nvmeotcp.c    | 146 +++++++++++++++++-
 1 file changed, 144 insertions(+), 2 deletions(-)
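
Reader's note (not part of the commit): the arithmetic behind the SGL prefix
check in the diff below may be easier to follow on plain integers. The
sketch mirrors the logic of mlx5e_nvmeotcp_validate_sgl_prefix() on an array
of element lengths instead of a scatterlist. The toy_ names are
hypothetical, and the values 17 for MAX_SKB_FRAGS and 4096 for PAGE_SIZE are
assumptions (both depend on the kernel configuration).

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_SKB_FRAGS 17   /* assumption: depends on kernel config */
#define TOY_PAGE_SIZE     4096 /* assumption: 4 KiB pages */
#define TOY_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Returns true if a packet that starts at the beginning of the PDU fits in
 * TOY_MAX_SKB_FRAGS frags: the leading SG elements plus, if they do not
 * fill one MTU, the page-sized frags needed for the rest of the packet.
 */
static bool toy_validate_prefix(const int *lens, int n, int mtu)
{
	int tmp_len = n < TOY_MAX_SKB_FRAGS ? n : TOY_MAX_SKB_FRAGS;
	int chunk_size = 0, hole_size, hole_len, i;

	assert(n > 0 && mtu > 0);

	/* Bytes covered by the first tmp_len SG elements. */
	for (i = 0; i < tmp_len; i++)
		chunk_size += lens[i];

	/* The elements alone fill a whole packet: no extra frags needed. */
	if (chunk_size >= mtu)
		return true;

	/* The remainder of the packet needs hole_len additional frags. */
	hole_size = mtu - chunk_size;
	hole_len = TOY_DIV_ROUND_UP(hole_size, TOY_PAGE_SIZE);

	return tmp_len + hole_len <= TOY_MAX_SKB_FRAGS;
}
```

With an MTU of 1500, seventeen 64-byte elements cover only 1088 bytes, so
one more frag would be needed and the check fails (17 + 1 > 17). Sixteen
such elements pass (16 + 1 <= 17), and seventeen 128-byte elements pass
because they already cover a full MTU.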

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c
index a90be18963d6..e7c2cf83fd20 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c
@@ -682,19 +682,156 @@  mlx5e_nvmeotcp_queue_teardown(struct net_device *netdev,
 	mlx5e_nvmeotcp_put_queue(queue);
 }
 
+static bool
+mlx5e_nvmeotcp_validate_small_sgl_suffix(struct scatterlist *sg, int sg_len, int mtu)
+{
+	int i, hole_size, hole_len, chunk_size = 0;
+
+	for (i = 1; i < sg_len; i++)
+		chunk_size += sg_dma_len(&sg[i]);
+
+	if (chunk_size >= mtu)
+		return true;
+
+	hole_size = mtu - chunk_size - 1;
+	hole_len = DIV_ROUND_UP(hole_size, PAGE_SIZE);
+
+	if (sg_len + hole_len > MAX_SKB_FRAGS)
+		return false;
+
+	return true;
+}
+
+static bool
+mlx5e_nvmeotcp_validate_big_sgl_suffix(struct scatterlist *sg, int sg_len, int mtu)
+{
+	int i, j, last_elem, window_idx, window_size = MAX_SKB_FRAGS - 1;
+	int chunk_size = 0;
+
+	last_elem = sg_len - window_size;
+	window_idx = window_size;
+
+	for (j = 1; j < window_size; j++)
+		chunk_size += sg_dma_len(&sg[j]);
+
+	for (i = 1; i <= last_elem; i++, window_idx++) {
+		chunk_size += sg_dma_len(&sg[window_idx]);
+		if (chunk_size < mtu - 1)
+			return false;
+
+		chunk_size -= sg_dma_len(&sg[i]);
+	}
+
+	return true;
+}
+
+/* This function makes sure that the middle/suffix of a PDU SGL meets the
+ * restriction of MAX_SKB_FRAGS. There are two cases here:
+ * 1. sg_len < MAX_SKB_FRAGS - the extreme case here is a packet that consists
+ * of one byte from the first SG element + the rest of the SGL; the remaining
+ * space of the packet will be scattered to the WQE and pointed to by SKB
+ * frags.
+ * 2. sg_len >= MAX_SKB_FRAGS - the extreme case here is a packet that consists
+ * of one byte from a middle SG element + 15 contiguous SG elements + one byte
+ * from a subsequent SG element or the rest of the packet.
+ */
+static bool
+mlx5e_nvmeotcp_validate_sgl_suffix(struct scatterlist *sg, int sg_len, int mtu)
+{
+	int ret;
+
+	if (sg_len < MAX_SKB_FRAGS)
+		ret = mlx5e_nvmeotcp_validate_small_sgl_suffix(sg, sg_len, mtu);
+	else
+		ret = mlx5e_nvmeotcp_validate_big_sgl_suffix(sg, sg_len, mtu);
+
+	return ret;
+}
+
+static bool
+mlx5e_nvmeotcp_validate_sgl_prefix(struct scatterlist *sg, int sg_len, int mtu)
+{
+	int i, hole_size, hole_len, tmp_len, chunk_size = 0;
+
+	tmp_len = min_t(int, sg_len, MAX_SKB_FRAGS);
+
+	for (i = 0; i < tmp_len; i++)
+		chunk_size += sg_dma_len(&sg[i]);
+
+	if (chunk_size >= mtu)
+		return true;
+
+	hole_size = mtu - chunk_size;
+	hole_len = DIV_ROUND_UP(hole_size, PAGE_SIZE);
+
+	if (tmp_len + hole_len > MAX_SKB_FRAGS)
+		return false;
+
+	return true;
+}
+
+/* This function ensures that a PDU can be offloaded.
+ * A PDU is offloaded by building a non-linear SKB such that each SGL element
+ * is placed in a frag, so no packet that represents part of the PDU may
+ * exceed MAX_SKB_FRAGS SGL elements. In addition, NVMEoTCP offload is
+ * restricted to one PDU offload per packet. A packet can start with a new
+ * PDU, in which case the prefix of the PDU must meet the requirement, or it
+ * can start in the middle of an SG element, in which case the suffix of the
+ * PDU must meet the requirement.
+ */
+static bool
+mlx5e_nvmeotcp_validate_sgl(struct scatterlist *sg, int sg_len, int mtu)
+{
+	int max_hole_frags;
+
+	max_hole_frags = DIV_ROUND_UP(mtu, PAGE_SIZE);
+	if (sg_len + max_hole_frags <= MAX_SKB_FRAGS)
+		return true;
+
+	if (!mlx5e_nvmeotcp_validate_sgl_prefix(sg, sg_len, mtu) ||
+	    !mlx5e_nvmeotcp_validate_sgl_suffix(sg, sg_len, mtu))
+		return false;
+
+	return true;
+}
+
 static int
 mlx5e_nvmeotcp_ddp_setup(struct net_device *netdev,
 			 struct sock *sk,
 			 struct ulp_ddp_io *ddp)
 {
+	struct scatterlist *sg = ddp->sg_table.sgl;
+	struct mlx5e_nvmeotcp_queue_entry *nvqt;
 	struct mlx5e_nvmeotcp_queue *queue;
+	struct mlx5_core_dev *mdev;
+	int i, size = 0, count = 0;
 
 	queue = container_of(ulp_ddp_get_ctx(sk),
 			     struct mlx5e_nvmeotcp_queue, ulp_ddp_ctx);
+	mdev = queue->priv->mdev;
+	count = dma_map_sg(mdev->device, ddp->sg_table.sgl, ddp->nents,
+			   DMA_FROM_DEVICE);
+
+	if (count <= 0)
+		return -EINVAL;
 
-	/* Placeholder - map_sg and initializing the count */
+	if (WARN_ON(count > mlx5e_get_max_sgl(mdev)))
+		return -ENOSPC;
+
+	if (!mlx5e_nvmeotcp_validate_sgl(sg, count, READ_ONCE(netdev->mtu)))
+		return -EOPNOTSUPP;
+
+	for (i = 0; i < count; i++)
+		size += sg_dma_len(&sg[i]);
+
+	nvqt = &queue->ccid_table[ddp->command_id];
+	nvqt->size = size;
+	nvqt->ddp = ddp;
+	nvqt->sgl = sg;
+	nvqt->ccid_gen++;
+	nvqt->sgl_length = count;
+	mlx5e_nvmeotcp_post_klm_wqe(queue, KLM_UMR, ddp->command_id, count);
 
-	mlx5e_nvmeotcp_post_klm_wqe(queue, KLM_UMR, ddp->command_id, 0);
 	return 0;
 }
 
@@ -717,6 +854,11 @@  static void
 mlx5e_nvmeotcp_ddp_resync(struct net_device *netdev,
 			  struct sock *sk, u32 seq)
 {
+	struct mlx5e_nvmeotcp_queue *queue =
+		container_of(ulp_ddp_get_ctx(sk), struct mlx5e_nvmeotcp_queue, ulp_ddp_ctx);
+
+	queue->after_resync_cqe = 1;
+	mlx5e_nvmeotcp_rx_post_static_params_wqe(queue, seq);
 }
 
 struct mlx5e_nvmeotcp_queue *
