tcp: use linear buffer for small frames

Message ID 20220830123345.1909199-1-chenzhen126@huawei.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series tcp: use linear buffer for small frames

Checks

Context Check Description
netdev/tree_selection success Guessed tree name to be net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix warning Target tree name not specified in the subject
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 2 this patch: 2
netdev/cc_maintainers warning 3 maintainers not CCed: dsahern@kernel.org kuba@kernel.org pabeni@redhat.com
netdev/build_clang success Errors and warnings before: 5 this patch: 5
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 2 this patch: 2
netdev/checkpatch warning WARNING: 'wont' may be misspelled - perhaps 'won't'? WARNING: line length of 81 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Zhen Chen Aug. 30, 2022, 12:33 p.m. UTC
472c2e07eef0 ("tcp: add one skb cache for tx") and related patches added a
mechanism to take pressure off the slab layer in the TCP stack by caching one
skb per socket. The feature was disabled by default, and the patch also dropped
the linear payload for small frames, which caused about a 5% performance
regression for small packets because NIC drivers now have to deal with frag
data where they previously handled linear data.

As d8b81175e412 ("tcp: remove sk_{tr}x_skb_cache") reverted the whole
mechanism but skipped the linear-payload part, make the revert complete.

Signed-off-by: Zhen Chen <chenzhen126@huawei.com>
---
 net/ipv4/tcp.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)
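
For context, here is a rough userspace sketch of what the restored heuristic's
SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER) budget works out to. The constants
below are approximations for a typical 64-bit configuration, not the kernel's
exact values, so treat the result as an order-of-magnitude illustration only.

#include <stdio.h>

#define SMP_CACHE_BYTES      64   /* assumed cache line size */
#define SKB_SHARED_INFO_SZ   320  /* approx sizeof(struct skb_shared_info), assumption */
#define MAX_TCP_HEADER       320  /* approx, config dependent */
#define SKB_DATA_ALIGN(x)    (((x) + SMP_CACHE_BYTES - 1) & ~(SMP_CACHE_BYTES - 1))
#define SKB_WITH_OVERHEAD(x) ((x) - SKB_DATA_ALIGN(SKB_SHARED_INFO_SZ))

int main(void)
{
	int linear = SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);

	/* With the assumed constants this prints roughly 1400 bytes: a small
	 * write fits entirely in skb->head of a single 2KB allocation instead
	 * of being copied into a page frag.
	 */
	printf("first-skb linear payload budget: %d bytes\n", linear);
	return 0;
}

With these assumptions a write of up to roughly 1.4KB lands in the linear area
of the first skb, which is the behavior the patch restores when the write
queue is empty.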

Comments

Eric Dumazet Aug. 30, 2022, 2:03 p.m. UTC | #1
On Tue, Aug 30, 2022 at 5:37 AM Zhen Chen <chenzhen126@huawei.com> wrote:
>
> 472c2e07eef0 ("tcp: add one skb cache for tx") and related patches added a
> mechanism to take pressure off the slab layer in the TCP stack by caching one
> skb per socket. The feature was disabled by default, and the patch also dropped
> the linear payload for small frames, which caused about a 5% performance
> regression for small packets because NIC drivers now have to deal with frag
> data where they previously handled linear data.

I do not think it is true. Which driver exhibits a 5% penalty exactly?

I decided not to bring back this feature, and instead to make the TCP stack
less complex.

We instead want to have all TCP payload in page frags; there is still a part
to rewrite (MTU probing), and maybe retransmit aggregation.

>
> As d8b81175e412 ("tcp: remove sk_{tr}x_skb_cache") reverted the whole
> mechanism but skipped the linear-payload part, make the revert complete.
>
> Signed-off-by: Zhen Chen <chenzhen126@huawei.com>
> ---
>  net/ipv4/tcp.c | 28 +++++++++++++++++++++++++++-
>  1 file changed, 27 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index e5011c136fdb..0b6010051598 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1154,6 +1154,30 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
>  }
>  EXPORT_SYMBOL(tcp_sendpage);
>
> +/* Do not bother using a page frag for very small frames.
> + * But use this heuristic only for the first skb in write queue.
> + *
> + * Having no payload in skb->head allows better SACK shifting
> + * in tcp_shift_skb_data(), reducing sack/rack overhead, because
> + * write queue has less skbs.
> + * Each skb can hold up to MAX_SKB_FRAGS * 32Kbytes, or ~0.5 MB.
> + * This also speeds up tso_fragment(), since it wont fallback
> + * to tcp_fragment().
> + */
> +static int linear_payload_sz(bool first_skb)
> +{
> +       if (first_skb)
> +               return SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);
> +       return 0;
> +}
> +
> +static int select_size(bool first_skb, bool zc)
> +{
> +       if (zc)
> +               return 0;
> +       return linear_payload_sz(first_skb);
> +}
> +
>  void tcp_free_fastopen_req(struct tcp_sock *tp)
>  {
>         if (tp->fastopen_req) {
> @@ -1311,6 +1335,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>
>                 if (copy <= 0 || !tcp_skb_can_collapse_to(skb)) {
>                         bool first_skb;
> +                       int linear;
>
>  new_segment:
>                         if (!sk_stream_memory_free(sk))
> @@ -1322,7 +1347,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>                                         goto restart;
>                         }
>                         first_skb = tcp_rtx_and_write_queues_empty(sk);
> -                       skb = tcp_stream_alloc_skb(sk, 0, sk->sk_allocation,
> +                       linear = select_size(first_skb, zc);
> +                       skb = tcp_stream_alloc_skb(sk, linear, sk->sk_allocation,
>                                                    first_skb);
>                         if (!skb)
>                                 goto wait_for_space;
> --
> 2.23.0
>

Patch

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e5011c136fdb..0b6010051598 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1154,6 +1154,30 @@  int tcp_sendpage(struct sock *sk, struct page *page, int offset,
 }
 EXPORT_SYMBOL(tcp_sendpage);
 
+/* Do not bother using a page frag for very small frames.
+ * But use this heuristic only for the first skb in write queue.
+ *
+ * Having no payload in skb->head allows better SACK shifting
+ * in tcp_shift_skb_data(), reducing sack/rack overhead, because
+ * write queue has less skbs.
+ * Each skb can hold up to MAX_SKB_FRAGS * 32Kbytes, or ~0.5 MB.
+ * This also speeds up tso_fragment(), since it wont fallback
+ * to tcp_fragment().
+ */
+static int linear_payload_sz(bool first_skb)
+{
+	if (first_skb)
+		return SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);
+	return 0;
+}
+
+static int select_size(bool first_skb, bool zc)
+{
+	if (zc)
+		return 0;
+	return linear_payload_sz(first_skb);
+}
+
 void tcp_free_fastopen_req(struct tcp_sock *tp)
 {
 	if (tp->fastopen_req) {
@@ -1311,6 +1335,7 @@  int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 
 		if (copy <= 0 || !tcp_skb_can_collapse_to(skb)) {
 			bool first_skb;
+			int linear;
 
 new_segment:
 			if (!sk_stream_memory_free(sk))
@@ -1322,7 +1347,8 @@  int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 					goto restart;
 			}
 			first_skb = tcp_rtx_and_write_queues_empty(sk);
-			skb = tcp_stream_alloc_skb(sk, 0, sk->sk_allocation,
+			linear = select_size(first_skb, zc);
+			skb = tcp_stream_alloc_skb(sk, linear, sk->sk_allocation,
 						   first_skb);
 			if (!skb)
 				goto wait_for_space;