diff mbox series

[v5,net-next,13/13] mlx5: support BIG TCP packets

Message ID 20220509222149.1763877-14-eric.dumazet@gmail.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series tcp: BIG TCP implementation | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit fail Errors and warnings before: 0 this patch: 3
netdev/cc_maintainers warning 1 maintainers not CCed: linux-rdma@vger.kernel.org
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 0 this patch: 3
netdev/checkpatch warning WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 86 exceeds 80 columns WARNING: line length of 87 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns WARNING: line length of 90 exceeds 80 columns WARNING: line length of 91 exceeds 80 columns WARNING: line length of 92 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Eric Dumazet May 9, 2022, 10:21 p.m. UTC
From: Coco Li <lixiaoyan@google.com>

mlx5 supports LSOv2.

IPv6 gro/tcp stacks insert a temporary Hop-by-Hop header
with JUMBO TLV for big packets.

We need to ignore/skip this HBH header when populating TX descriptor.

Note that ipv6_has_hopopt_jumbo() only recognizes very specific packet
layout, thus mlx5e_sq_xmit_wqe() is taking care of this layout only.

v2: clear hopbyhop in mlx5e_tx_get_gso_ihs()
v4: fix compile error for CONFIG_MLX5_CORE_IPOIB=y

Signed-off-by: Coco Li <lixiaoyan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Leon Romanovsky <leon@kernel.org>
---
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  1 +
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 84 +++++++++++++++----
 2 files changed, 69 insertions(+), 16 deletions(-)

Comments

Eric Dumazet May 9, 2022, 10:30 p.m. UTC | #1
On Mon, May 9, 2022 at 3:22 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> From: Coco Li <lixiaoyan@google.com>
>
> mlx5 supports LSOv2.
>
> IPv6 gro/tcp stacks insert a temporary Hop-by-Hop header
> with JUMBO TLV for big packets.
>
> We need to ignore/skip this HBH header when populating TX descriptor.
>
> Note that ipv6_has_hopopt_jumbo() only recognizes very specific packet
> layout, thus mlx5e_sq_xmit_wqe() is taking care of this layout only.
>
> v2: clear hopbyhop in mlx5e_tx_get_gso_ihs()
> v4: fix compile error for CONFIG_MLX5_CORE_IPOIB=y
>
> Signed-off-by: Coco Li <lixiaoyan@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
> Cc: Saeed Mahameed <saeedm@nvidia.com>
> Cc: Leon Romanovsky <leon@kernel.org>
> ---
>  .../net/ethernet/mellanox/mlx5/core/en_main.c |  1 +
>  .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 84 +++++++++++++++----
>  2 files changed, 69 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index d27986869b8ba070d1a4f8bcdc7e14ab54ae984e..226825410a1aa55b5b7941a7389a78abdb800521 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -4920,6 +4920,7 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
>
>         netdev->priv_flags       |= IFF_UNICAST_FLT;
>
> +       netif_set_tso_max_size(netdev, 512 * 1024);

Apparently I forgot to amend this part on the final patch of the series.

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 226825410a1aa55b5b7941a7389a78abdb800521..bf3bca79e160124abd128ac1e9910cb2f39a39ff
100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4920,7 +4920,7 @@ static void mlx5e_build_nic_netdev(struct
net_device *netdev)

        netdev->priv_flags       |= IFF_UNICAST_FLT;

-       netif_set_tso_max_size(netdev, 512 * 1024);
+       netif_set_tso_max_size(netdev, GSO_MAX_SIZE);
        mlx5e_set_netdev_dev_addr(netdev);
        mlx5e_ipsec_build_netdev(priv);
        mlx5e_ktls_build_netdev(priv);
Jakub Kicinski May 10, 2022, 1:38 a.m. UTC | #2
On Mon,  9 May 2022 15:21:49 -0700 Eric Dumazet wrote:
> From: Coco Li <lixiaoyan@google.com>
> 
> mlx5 supports LSOv2.
> 
> IPv6 gro/tcp stacks insert a temporary Hop-by-Hop header
> with JUMBO TLV for big packets.
> 
> We need to ignore/skip this HBH header when populating TX descriptor.
> 
> Note that ipv6_has_hopopt_jumbo() only recognizes very specific packet
> layout, thus mlx5e_sq_xmit_wqe() is taking care of this layout only.
> 
> v2: clear hopbyhop in mlx5e_tx_get_gso_ihs()
> v4: fix compile error for CONFIG_MLX5_CORE_IPOIB=y
> 
> Signed-off-by: Coco Li <lixiaoyan@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
> Cc: Saeed Mahameed <saeedm@nvidia.com>
> Cc: Leon Romanovsky <leon@kernel.org>

So we're leaving the warning for Kees to deal with?

Kees is there some form of "I know what I'm doing" cast 
that you could sneak us under the table?
Eric Dumazet May 10, 2022, 2 a.m. UTC | #3
On Mon, May 9, 2022 at 6:38 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon,  9 May 2022 15:21:49 -0700 Eric Dumazet wrote:
> > From: Coco Li <lixiaoyan@google.com>
> >
> > mlx5 supports LSOv2.
> >
> > IPv6 gro/tcp stacks insert a temporary Hop-by-Hop header
> > with JUMBO TLV for big packets.
> >
> > We need to ignore/skip this HBH header when populating TX descriptor.
> >
> > Note that ipv6_has_hopopt_jumbo() only recognizes very specific packet
> > layout, thus mlx5e_sq_xmit_wqe() is taking care of this layout only.
> >
> > v2: clear hopbyhop in mlx5e_tx_get_gso_ihs()
> > v4: fix compile error for CONFIG_MLX5_CORE_IPOIB=y
> >
> > Signed-off-by: Coco Li <lixiaoyan@google.com>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
> > Cc: Saeed Mahameed <saeedm@nvidia.com>
> > Cc: Leon Romanovsky <leon@kernel.org>
>
> So we're leaving the warning for Kees to deal with?

I think so. I do not see an easy way to escape this, unless perhaps add some
extra obfuscation, so that gcc can not determine the memcpy() third
argument at compile time.

Alternative is to remove mlx5 patch from the upstream series.

>
> Kees is there some form of "I know what I'm doing" cast
> that you could sneak us under the table?
Kees Cook May 10, 2022, 3:49 p.m. UTC | #4
On Mon, May 09, 2022 at 06:38:53PM -0700, Jakub Kicinski wrote:
> On Mon,  9 May 2022 15:21:49 -0700 Eric Dumazet wrote:
> > From: Coco Li <lixiaoyan@google.com>
> > 
> > mlx5 supports LSOv2.
> > 
> > IPv6 gro/tcp stacks insert a temporary Hop-by-Hop header
> > with JUMBO TLV for big packets.
> > 
> > We need to ignore/skip this HBH header when populating TX descriptor.
> > 
> > Note that ipv6_has_hopopt_jumbo() only recognizes very specific packet
> > layout, thus mlx5e_sq_xmit_wqe() is taking care of this layout only.
> > 
> > v2: clear hopbyhop in mlx5e_tx_get_gso_ihs()
> > v4: fix compile error for CONFIG_MLX5_CORE_IPOIB=y
> > 
> > Signed-off-by: Coco Li <lixiaoyan@google.com>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
> > Cc: Saeed Mahameed <saeedm@nvidia.com>
> > Cc: Leon Romanovsky <leon@kernel.org>
> 
> So we're leaving the warning for Kees to deal with?
> 
> Kees is there some form of "I know what I'm doing" cast 
> that you could sneak us under the table?

Right now, it's switching that memcpy to __builtin_memcpy(), but I'll
send a patch that'll create an unsafe_memcpy() macro that does the right
things vs kasan, fortify, etc.
Kees Cook May 11, 2022, 2:55 a.m. UTC | #5
On Mon, May 09, 2022 at 06:38:53PM -0700, Jakub Kicinski wrote:
> So we're leaving the warning for Kees to deal with?
> 
> Kees is there some form of "I know what I'm doing" cast 
> that you could sneak us under the table?

Okay, I've sent this[1] now. If that looks okay to you, I figure you'll
land it via netdev for the coming merge window?

-Kees

[1] https://lore.kernel.org/netdev/20220511025301.3636666-1-keescook@chromium.org/
Jakub Kicinski May 11, 2022, 4:26 p.m. UTC | #6
On Tue, 10 May 2022 19:55:16 -0700 Kees Cook wrote:
> On Mon, May 09, 2022 at 06:38:53PM -0700, Jakub Kicinski wrote:
> > So we're leaving the warning for Kees to deal with?
> > 
> > Kees is there some form of "I know what I'm doing" cast 
> > that you could sneak us under the table?  
> 
> Okay, I've sent this[1] now. If that looks okay to you, I figure you'll
> land it via netdev for the coming merge window?

I was about to say "great!" but perhaps given we're adding an unsafe_
flavor of something a "it is what it is" would be a more appropriate
reaction.

Thank you!
Kees Cook May 11, 2022, 5:27 p.m. UTC | #7
On Wed, May 11, 2022 at 09:26:48AM -0700, Jakub Kicinski wrote:
> On Tue, 10 May 2022 19:55:16 -0700 Kees Cook wrote:
> > On Mon, May 09, 2022 at 06:38:53PM -0700, Jakub Kicinski wrote:
> > > So we're leaving the warning for Kees to deal with?
> > > 
> > > Kees is there some form of "I know what I'm doing" cast 
> > > that you could sneak us under the table?  
> > 
> > Okay, I've sent this[1] now. If that looks okay to you, I figure you'll
> > land it via netdev for the coming merge window?
> 
> I was about to say "great!" but perhaps given we're adding an unsafe_
> flavor of something a "it is what it is" would be a more appropriate
> reaction.

Heh, well, I think it's just calling a spade a spade: plain memcpy is
already unsafe. The goal is for the kernel's (fortified) memcpy to be
"provably" safe. :) But yeah, I get what you mean. I'm sad that I don't
yet have a workable way to deal with this code pattern, but I'm getting
close, I think. My random notes currently are:

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 2dc48406cd08..595d0db4e97a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -386,6 +386,14 @@ mlx5e_sq_xmit_wqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 			stats->added_vlan_packets++;
 		} else {
 			eseg->inline_hdr.sz |= cpu_to_be16(attr->ihs);
+/* interface could take:
+	fas: wqe
+	dst: eth.inline_hdr.start
+	src: skb->data
+	bytes: attr->ihs
+	elements member: data
+	elements_count value: wqe_attr->ds_cnt_inl
+*/
 			memcpy(eseg->inline_hdr.start, skb->data, attr->ihs);
 		}
 		dseg += wqe_attr->ds_cnt_inl;

There's a similar case with how netlink constructs things (i.e.
performing a memcpy across some of the trailing header members and then
into the flex array) that may share this code pattern, and at least one
patch to mlx5 I'd sent before could be refactored back into this to
unsplit the memcpy there.

Anyway, I'll continue to chip away at it.
diff mbox series

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index d27986869b8ba070d1a4f8bcdc7e14ab54ae984e..226825410a1aa55b5b7941a7389a78abdb800521 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4920,6 +4920,7 @@  static void mlx5e_build_nic_netdev(struct net_device *netdev)
 
 	netdev->priv_flags       |= IFF_UNICAST_FLT;
 
+	netif_set_tso_max_size(netdev, 512 * 1024);
 	mlx5e_set_netdev_dev_addr(netdev);
 	mlx5e_ipsec_build_netdev(priv);
 	mlx5e_ktls_build_netdev(priv);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 2dc48406cd08d21ff94f665cd61ab9227f351215..b4fc45ba1b347fb9ad0f46b9c091cc45e4d3d84f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -40,6 +40,7 @@ 
 #include "en_accel/en_accel.h"
 #include "en_accel/ipsec_rxtx.h"
 #include "en/ptp.h"
+#include <net/ipv6.h>
 
 static void mlx5e_dma_unmap_wqe_err(struct mlx5e_txqsq *sq, u8 num_dma)
 {
@@ -130,23 +131,32 @@  mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 		sq->stats->csum_none++;
 }
 
+/* Returns the number of header bytes that we plan
+ * to inline later in the transmit descriptor
+ */
 static inline u16
-mlx5e_tx_get_gso_ihs(struct mlx5e_txqsq *sq, struct sk_buff *skb)
+mlx5e_tx_get_gso_ihs(struct mlx5e_txqsq *sq, struct sk_buff *skb, int *hopbyhop)
 {
 	struct mlx5e_sq_stats *stats = sq->stats;
 	u16 ihs;
 
+	*hopbyhop = 0;
 	if (skb->encapsulation) {
 		ihs = skb_inner_transport_offset(skb) + inner_tcp_hdrlen(skb);
 		stats->tso_inner_packets++;
 		stats->tso_inner_bytes += skb->len - ihs;
 	} else {
-		if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4)
+		if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) {
 			ihs = skb_transport_offset(skb) + sizeof(struct udphdr);
-		else
+		} else {
 			ihs = skb_transport_offset(skb) + tcp_hdrlen(skb);
+			if (ipv6_has_hopopt_jumbo(skb)) {
+				*hopbyhop = sizeof(struct hop_jumbo_hdr);
+				ihs -= sizeof(struct hop_jumbo_hdr);
+			}
+		}
 		stats->tso_packets++;
-		stats->tso_bytes += skb->len - ihs;
+		stats->tso_bytes += skb->len - ihs - *hopbyhop;
 	}
 
 	return ihs;
@@ -208,6 +218,7 @@  struct mlx5e_tx_attr {
 	__be16 mss;
 	u16 insz;
 	u8 opcode;
+	u8 hopbyhop;
 };
 
 struct mlx5e_tx_wqe_attr {
@@ -244,14 +255,16 @@  static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	struct mlx5e_sq_stats *stats = sq->stats;
 
 	if (skb_is_gso(skb)) {
-		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb);
+		int hopbyhop;
+		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
 
 		*attr = (struct mlx5e_tx_attr) {
 			.opcode    = MLX5_OPCODE_LSO,
 			.mss       = cpu_to_be16(skb_shinfo(skb)->gso_size),
 			.ihs       = ihs,
 			.num_bytes = skb->len + (skb_shinfo(skb)->gso_segs - 1) * ihs,
-			.headlen   = skb_headlen(skb) - ihs,
+			.headlen   = skb_headlen(skb) - ihs - hopbyhop,
+			.hopbyhop  = hopbyhop,
 		};
 
 		stats->packets += skb_shinfo(skb)->gso_segs;
@@ -365,7 +378,8 @@  mlx5e_sq_xmit_wqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	struct mlx5_wqe_eth_seg  *eseg;
 	struct mlx5_wqe_data_seg *dseg;
 	struct mlx5e_tx_wqe_info *wi;
-
+	u16 ihs = attr->ihs;
+	struct ipv6hdr *h6;
 	struct mlx5e_sq_stats *stats = sq->stats;
 	int num_dma;
 
@@ -379,15 +393,36 @@  mlx5e_sq_xmit_wqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 
 	eseg->mss = attr->mss;
 
-	if (attr->ihs) {
-		if (skb_vlan_tag_present(skb)) {
-			eseg->inline_hdr.sz |= cpu_to_be16(attr->ihs + VLAN_HLEN);
-			mlx5e_insert_vlan(eseg->inline_hdr.start, skb, attr->ihs);
+	if (ihs) {
+		u8 *start = eseg->inline_hdr.start;
+
+		if (unlikely(attr->hopbyhop)) {
+			/* remove the HBH header.
+			 * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
+			 */
+			if (skb_vlan_tag_present(skb)) {
+				mlx5e_insert_vlan(start, skb, ETH_HLEN + sizeof(*h6));
+				ihs += VLAN_HLEN;
+				h6 = (struct ipv6hdr *)(start + sizeof(struct vlan_ethhdr));
+			} else {
+				memcpy(start, skb->data, ETH_HLEN + sizeof(*h6));
+				h6 = (struct ipv6hdr *)(start + ETH_HLEN);
+			}
+			h6->nexthdr = IPPROTO_TCP;
+			/* Copy the TCP header after the IPv6 one */
+			memcpy(h6 + 1,
+			       skb->data + ETH_HLEN + sizeof(*h6) +
+					sizeof(struct hop_jumbo_hdr),
+			       tcp_hdrlen(skb));
+			/* Leave ipv6 payload_len set to 0, as LSO v2 specs request. */
+		} else if (skb_vlan_tag_present(skb)) {
+			mlx5e_insert_vlan(start, skb, ihs);
+			ihs += VLAN_HLEN;
 			stats->added_vlan_packets++;
 		} else {
-			eseg->inline_hdr.sz |= cpu_to_be16(attr->ihs);
-			memcpy(eseg->inline_hdr.start, skb->data, attr->ihs);
+			memcpy(start, skb->data, ihs);
 		}
+		eseg->inline_hdr.sz |= cpu_to_be16(ihs);
 		dseg += wqe_attr->ds_cnt_inl;
 	} else if (skb_vlan_tag_present(skb)) {
 		eseg->insert.type = cpu_to_be16(MLX5_ETH_WQE_INSERT_VLAN);
@@ -398,7 +433,7 @@  mlx5e_sq_xmit_wqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	}
 
 	dseg += wqe_attr->ds_cnt_ids;
-	num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr->ihs,
+	num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr->ihs + attr->hopbyhop,
 					  attr->headlen, dseg);
 	if (unlikely(num_dma < 0))
 		goto err_drop;
@@ -918,12 +953,29 @@  void mlx5i_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	eseg->mss = attr.mss;
 
 	if (attr.ihs) {
-		memcpy(eseg->inline_hdr.start, skb->data, attr.ihs);
+		if (unlikely(attr.hopbyhop)) {
+			struct ipv6hdr *h6;
+
+			/* remove the HBH header.
+			 * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
+			 */
+			memcpy(eseg->inline_hdr.start, skb->data, ETH_HLEN + sizeof(*h6));
+			h6 = (struct ipv6hdr *)((char *)eseg->inline_hdr.start + ETH_HLEN);
+			h6->nexthdr = IPPROTO_TCP;
+			/* Copy the TCP header after the IPv6 one */
+			memcpy(h6 + 1,
+			       skb->data + ETH_HLEN + sizeof(*h6) +
+					sizeof(struct hop_jumbo_hdr),
+			       tcp_hdrlen(skb));
+			/* Leave ipv6 payload_len set to 0, as LSO v2 specs request. */
+		} else {
+			memcpy(eseg->inline_hdr.start, skb->data, attr.ihs);
+		}
 		eseg->inline_hdr.sz = cpu_to_be16(attr.ihs);
 		dseg += wqe_attr.ds_cnt_inl;
 	}
 
-	num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr.ihs,
+	num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr.ihs + attr.hopbyhop,
 					  attr.headlen, dseg);
 	if (unlikely(num_dma < 0))
 		goto err_drop;