diff mbox series

[net-next,17/20] tcp: defer skb freeing after socket lock is released

Message ID 20211115190249.3936899-18-eric.dumazet@gmail.com (mailing list archive)
State Accepted
Delegated to: Netdev Maintainers
Series tcp: optimizations for linux-5.17

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count fail Series longer than 15 patches (and no cover letter)
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 5997 this patch: 5997
netdev/cc_maintainers warning 3 maintainers not CCed: yoshfuji@linux-ipv6.org jonathan.lemon@gmail.com dsahern@kernel.org
netdev/build_clang success Errors and warnings before: 1044 this patch: 1044
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 6152 this patch: 6152
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 123 lines checked
netdev/kdoc fail Errors and warnings before: 0 this patch: 2
netdev/source_inline success Was 0 now: 0

Commit Message

Eric Dumazet Nov. 15, 2021, 7:02 p.m. UTC
From: Eric Dumazet <edumazet@google.com>

tcp recvmsg() (or rx zerocopy) spends a fair amount of time
freeing skbs after their payload has been consumed.

A typical ~64KB GRO packet has to release ~45 page
references, eventually going to the page allocator
for each of them.

Currently, this freeing is performed while the socket lock
is held, meaning that there is a high chance that the
BH handler has to queue incoming packets to the tcp socket backlog.

This can cause additional latencies, because the user
thread has to process the backlog at release_sock() time,
and while doing so, additional frames can be added
by the BH handler.

This patch adds logic to defer these frees until after the socket
lock is released, or to perform them directly from the BH handler if possible.

Being able to free these skbs from the BH handler helps a lot,
because it avoids the usual alloc/free asymmetry seen
when the BH handler and the user thread do not run on the
same cpu or NUMA node.

One cpu can now be fully utilized for the kernel->user copy,
and another cpu is handling BH processing and skb/page
allocs/frees (assuming RFS is not forcing use of a single CPU)

Tested:
 100Gbit NIC
 Max throughput for one TCP_STREAM flow, over 10 runs

MTU : 1500
Before: 55 Gbit
After:  66 Gbit

MTU : 4096+(headers)
Before: 82 Gbit
After:  95 Gbit

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/skbuff.h |  2 ++
 include/net/sock.h     |  3 +++
 include/net/tcp.h      | 10 ++++++++++
 net/ipv4/tcp.c         | 27 +++++++++++++++++++++++++--
 net/ipv4/tcp_ipv4.c    |  1 +
 net/ipv6/tcp_ipv6.c    |  1 +
 6 files changed, 42 insertions(+), 2 deletions(-)
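
For orientation, a condensed sketch of where the deferral hooks in, taken from
the hunks in the full diff at the end of this page (simplified, not a drop-in
implementation):

	/* recvmsg() side: consumed skbs are uncharged under the socket lock,
	 * but queued on the lock-free sk->defer_list instead of being freed.
	 */
	lock_sock(sk);
	ret = tcp_recvmsg_locked(sk, msg, len, nonblock, flags, &tss,
				 &cmsg_flags);
	release_sock(sk);
	sk_defer_free_flush(sk);	/* free deferred skbs, lock released */

	/* BH side (tcp_v4_rcv()/tcp_v6_rcv()): flush the list before taking
	 * the socket spinlock, so frees can also run on the softirq cpu.
	 */
	sk_defer_free_flush(sk);
	bh_lock_sock_nested(sk);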

Comments

Jakub Kicinski Nov. 16, 2021, 2:27 p.m. UTC | #1
On Mon, 15 Nov 2021 11:02:46 -0800 Eric Dumazet wrote:
> One cpu can now be fully utilized for the kernel->user copy,
> and another cpu is handling BH processing and skb/page
> allocs/frees (assuming RFS is not forcing use of a single CPU)

Are you saying the kernel->user copy is not under the socket lock
today? I'm working on getting the crypto & copy from under the socket
lock for ktls, and it looked like tcp does the copy under the lock.
Eric Dumazet Nov. 16, 2021, 3:05 p.m. UTC | #2
On Tue, Nov 16, 2021 at 6:27 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 15 Nov 2021 11:02:46 -0800 Eric Dumazet wrote:
> > One cpu can now be fully utilized for the kernel->user copy,
> > and another cpu is handling BH processing and skb/page
> > allocs/frees (assuming RFS is not forcing use of a single CPU)
>
> Are you saying the kernel->user copy is not under the socket lock
> today? I'm working on getting the crypto & copy from under the socket
> lock for ktls, and it looked like tcp does the copy under the lock.

Copy is done currently with socket lock owned.

But each skb is freed one at a time, after its payload has been consumed.

Note that I am also working on performing the copy while still allowing BH
to process incoming packets.

This is a bit more complex, but I think it is doable.
Jakub Kicinski Nov. 16, 2021, 3:20 p.m. UTC | #3
On Tue, 16 Nov 2021 07:05:54 -0800 Eric Dumazet wrote:
> On Tue, Nov 16, 2021 at 6:27 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > On Mon, 15 Nov 2021 11:02:46 -0800 Eric Dumazet wrote:  
> > > One cpu can now be fully utilized for the kernel->user copy,
> > > and another cpu is handling BH processing and skb/page
> > > allocs/frees (assuming RFS is not forcing use of a single CPU)  
> >
> > Are you saying the kernel->user copy is not under the socket lock
> > today? I'm working on getting the crypto & copy from under the socket
> > lock for ktls, and it looked like tcp does the copy under the lock.  
> 
> Copy is done currently with socket lock owned.
> 
> But each skb is freed one at a time, after its payload has been consumed.
> 
> Note that I am also working on performing the copy while still allowing BH
> to process incoming packets.
> 
> This is a bit more complex, but I think it is doable.

Can't wait ! :)
Eric Dumazet Nov. 16, 2021, 3:22 p.m. UTC | #4
On Tue, Nov 16, 2021 at 7:05 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Nov 16, 2021 at 6:27 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Mon, 15 Nov 2021 11:02:46 -0800 Eric Dumazet wrote:
> > > One cpu can now be fully utilized for the kernel->user copy,
> > > and another cpu is handling BH processing and skb/page
> > > allocs/frees (assuming RFS is not forcing use of a single CPU)
> >
> > Are you saying the kernel->user copy is not under the socket lock
> > today? I'm working on getting the crypto & copy from under the socket
> > lock for ktls, and it looked like tcp does the copy under the lock.
>
> Copy is done currently with socket lock owned.
>
> But each skb is freed one at a time, after its payload has been consumed.
>
> Note that I am also working on performing the copy while still allowing BH
> to process incoming packets.
>
> This is a bit more complex, but I think it is doable.

Here is the perf top profile on cpu used by user thread doing the
recvmsg(), at 96 Gbit/s

We no longer see skb freeing related costs, but we still see costs of
having to process the backlog.

   81.06%  [kernel]       [k] copy_user_enhanced_fast_string
     2.50%  [kernel]       [k] __skb_datagram_iter
     2.25%  [kernel]       [k] _copy_to_iter
     1.45%  [kernel]       [k] tcp_recvmsg_locked
     1.39%  [kernel]       [k] tcp_rcv_established
     0.93%  [kernel]       [k] skb_try_coalesce
     0.79%  [kernel]       [k] sock_rfree
     0.72%  [kernel]       [k] tcp_v6_do_rcv
     0.57%  [kernel]       [k] skb_release_data
     0.50%  [kernel]       [k] tcp_queue_rcv
     0.43%  [kernel]       [k] __direct_call_clocksource_read1
     0.43%  [kernel]       [k] __release_sock
     0.39%  [kernel]       [k] _raw_spin_lock
     0.25%  [kernel]       [k] __direct_call_hrtimer_clock_base_get_time1
     0.20%  [kernel]       [k] __tcp_transmit_skb
     0.19%  [kernel]       [k] __dev_queue_xmit
     0.18%  [kernel]       [k] __tcp_select_window
     0.18%  [kernel]       [k] _raw_spin_lock_bh
Jakub Kicinski Nov. 16, 2021, 3:27 p.m. UTC | #5
On Tue, 16 Nov 2021 07:22:02 -0800 Eric Dumazet wrote:
> Here is the perf top profile on cpu used by user thread doing the
> recvmsg(), at 96 Gbit/s
> 
> We no longer see skb freeing related costs, but we still see costs of
> having to process the backlog.
> 
>    81.06%  [kernel]       [k] copy_user_enhanced_fast_string
>      2.50%  [kernel]       [k] __skb_datagram_iter
>      2.25%  [kernel]       [k] _copy_to_iter
>      1.45%  [kernel]       [k] tcp_recvmsg_locked
>      1.39%  [kernel]       [k] tcp_rcv_established

Huh, somehow I assumed your 4k MTU numbers were with zero-copy :o

Out of curiosity - what's the softirq load with 4k? Do you have an 
idea what the load is on the CPU consuming the data vs the softirq
processing with 1500B ?
Eric Dumazet Nov. 16, 2021, 4:46 p.m. UTC | #6
On Tue, Nov 16, 2021 at 7:27 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 16 Nov 2021 07:22:02 -0800 Eric Dumazet wrote:
> > Here is the perf top profile on cpu used by user thread doing the
> > recvmsg(), at 96 Gbit/s
> >
> > We no longer see skb freeing related costs, but we still see costs of
> > having to process the backlog.
> >
> >    81.06%  [kernel]       [k] copy_user_enhanced_fast_string
> >      2.50%  [kernel]       [k] __skb_datagram_iter
> >      2.25%  [kernel]       [k] _copy_to_iter
> >      1.45%  [kernel]       [k] tcp_recvmsg_locked
> >      1.39%  [kernel]       [k] tcp_rcv_established
>
> Huh, somehow I assumed your 4k MTU numbers were with zero-copy :o
>
> Out of curiosity - what's the softirq load with 4k? Do you have an
> idea what the load is on the CPU consuming the data vs the softirq
> processing with 1500B ?

On my testing host,

4K MTU : processing ~2,600,000 packets per second in GRO and other parts
use about 60% of the core in BH.
(Some of this cost comes from a clang issue, and the csum_partial() one
I was working on last week)
NIC RX interrupts are firing about 25,000 times per second in this setup.

1500 MTU : processing ~ 5,800,000 packets per second uses one core in
BH (and also one core in recvmsg()),
We stay in NAPI mode (no IRQ rearming)
(That was with a TCP_STREAM run sustaining 70Gbit)

BH numbers also depend on IRQ coalescing parameters.
Jakub Kicinski Nov. 16, 2021, 6:18 p.m. UTC | #7
On Tue, 16 Nov 2021 08:46:37 -0800 Eric Dumazet wrote:
> On my testing host,
> 
> 4K MTU : processing ~2,600,000 packets per second in GRO and other parts
> use about 60% of the core in BH.
> (Some of this cost comes from a clang issue, and the csum_partial() one
> I was working on last week)
> NIC RX interrupts are firing about 25,000 times per second in this setup.
> 
> 1500 MTU : processing ~ 5,800,000 packets per second uses one core in
> BH (and also one core in recvmsg()),
> We stay in NAPI mode (no IRQ rearming)
> (That was with a TCP_STREAM run sustaining 70Gbit)
> 
> BH numbers also depend on IRQ coalescing parameters.

Very interesting, curious to see what not doing the copy under socket
lock will do to the 1.5k case. 

Thanks a lot for sharing the detailed info!
David Ahern Nov. 16, 2021, 8:45 p.m. UTC | #8
On 11/16/21 9:46 AM, Eric Dumazet wrote:
> On Tue, Nov 16, 2021 at 7:27 AM Jakub Kicinski <kuba@kernel.org> wrote:
>>
>> On Tue, 16 Nov 2021 07:22:02 -0800 Eric Dumazet wrote:
>>> Here is the perf top profile on cpu used by user thread doing the
>>> recvmsg(), at 96 Gbit/s
>>>
>>> We no longer see skb freeing related costs, but we still see costs of
>>> having to process the backlog.
>>>
>>>    81.06%  [kernel]       [k] copy_user_enhanced_fast_string
>>>      2.50%  [kernel]       [k] __skb_datagram_iter
>>>      2.25%  [kernel]       [k] _copy_to_iter
>>>      1.45%  [kernel]       [k] tcp_recvmsg_locked
>>>      1.39%  [kernel]       [k] tcp_rcv_established
>>
>> Huh, somehow I assumed your 4k MTU numbers were with zero-copy :o

I thought the same. :-)

>>
>> Out of curiosity - what's the softirq load with 4k? Do you have an
>> idea what the load is on the CPU consuming the data vs the softirq
>> processing with 1500B ?
> 
> On my testing host,
> 
> > 4K MTU : processing ~2,600,000 packets per second in GRO and other parts
> use about 60% of the core in BH.

4kB or 4kB+hdr MTU? I ask because there is a subtle difference in the
size of the GRO packet which affects overall efficiency.

e.g., at 1500 MTU, 1448 MSS, a GRO packet has at most 45 segments for a
GRO size of 65212. At 4000 MTU, 3948 MSS, a GRO packet has at most 16
segments for a GRO packet size of 63220. I have noticed that 3300 MTU is
a bit of a sweet spot with MLX5/ConnectX-5 at least - 20 segments and a
65012-byte GRO packet without triggering nonlinear mode.
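
As a cross-check, these segment counts follow from a 65535-byte limit on the
aggregated packet and 52 bytes of IPv4 + TCP (with timestamps) headers per GRO
packet, which is what the sizes above imply (illustrative standalone program,
not kernel code):

#include <stdio.h>

/* Reproduce the GRO sizes quoted above, assuming MSS = MTU - 52 and a
 * 65535-byte cap on the aggregated packet length.
 */
static void gro_limit(int mtu)
{
	int hdr = 52;			/* IPv4 + TCP w/ timestamps */
	int mss = mtu - hdr;
	int segs = (65535 - hdr) / mss;	/* max full-MSS segments */

	printf("MTU %4d, MSS %4d: %2d segments, %5d byte GRO packet\n",
	       mtu, mss, segs, segs * mss + hdr);
}

int main(void)
{
	gro_limit(1500);	/* 45 segments, 65212 bytes */
	gro_limit(4000);	/* 16 segments, 63220 bytes */
	gro_limit(3300);	/* 20 segments, 65012 bytes */
	return 0;
}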


> (Some of this cost comes from a clang issue, and the csum_partial() one
> I was working on last week)
> NIC RX interrupts are firing about 25,000 times per second in this setup.
> 
> 1500 MTU : processing ~ 5,800,000 packets per second uses one core in
> BH (and also one core in recvmsg()),
> We stay in NAPI mode (no IRQ rearming)
> (That was with a TCP_STREAM run sustaining 70Gbit)
> 
> BH numbers also depend on IRQ coalescing parameters.
> 

What NIC do you use for testing?
Eric Dumazet Nov. 16, 2021, 9:35 p.m. UTC | #9
On Tue, Nov 16, 2021 at 12:45 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 11/16/21 9:46 AM, Eric Dumazet wrote:
> > On Tue, Nov 16, 2021 at 7:27 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >>
> >> On Tue, 16 Nov 2021 07:22:02 -0800 Eric Dumazet wrote:
> >>> Here is the perf top profile on cpu used by user thread doing the
> >>> recvmsg(), at 96 Gbit/s
> >>>
> >>> We no longer see skb freeing related costs, but we still see costs of
> >>> having to process the backlog.
> >>>
> >>>    81.06%  [kernel]       [k] copy_user_enhanced_fast_string
> >>>      2.50%  [kernel]       [k] __skb_datagram_iter
> >>>      2.25%  [kernel]       [k] _copy_to_iter
> >>>      1.45%  [kernel]       [k] tcp_recvmsg_locked
> >>>      1.39%  [kernel]       [k] tcp_rcv_established
> >>
> >> Huh, somehow I assumed your 4k MTU numbers were with zero-copy :o
>
> I thought the same. :-)
>
> >>
> >> Out of curiosity - what's the softirq load with 4k? Do you have an
> >> idea what the load is on the CPU consuming the data vs the softirq
> >> processing with 1500B ?
> >
> > On my testing host,
> >
> > > 4K MTU : processing ~2,600,000 packets per second in GRO and other parts
> > use about 60% of the core in BH.
>
> 4kB or 4kB+hdr MTU? I ask because there is a subtle difference in the
> size of the GRO packet which affects overall efficiency.
>
> e.g., at 1500 MTU, 1448 MSS, a GRO packet has at most 45 segments for a
> GRO size of 65212. At 4000 MTU, 3948 MSS, a GRO packet has at most 16
> segments for a GRO packet size of 63220. I have noticed that 3300 MTU is
> a bit of a sweet spot with MLX5/ConnectX-5 at least - 20 segments and a
> 65012-byte GRO packet without triggering nonlinear mode.

We are using 4096 bytes of payload, to enable TCP RX zero copy if
receiver wants it.
(even if in this case I was using TCP_STREAM which does a standard recvmsg())

Yes, the TSO/GRO standard limit in this case is 15*4K = 61440, but also remember
we are working on BIG TCP packets, so we do not have to find a 'sweet spot' :)

With BIG TCP enabled, I am sending/receiving TSO/GRO packets with 45
4K segments (184320 bytes of payload).
(But the results I gave in this thread were with standard TSO/GRO limits)

>
>
> > (Some of this cost comes from a clang issue, and the csum_partial() one
> > I was working on last week)
> > NIC RX interrupts are firing about 25,000 times per second in this setup.
> >
> > 1500 MTU : processing ~ 5,800,000 packets per second uses one core in
> > BH (and also one core in recvmsg()),
> > We stay in NAPI mode (no IRQ rearming)
> > (That was with a TCP_STREAM run sustaining 70Gbit)
> >
> > BH numbers also depend on IRQ coalescing parameters.
> >
>
> What NIC do you use for testing?

Google proprietary NIC.

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 686a666d073d5106526f3c5c20d64f26131be72d..b8b806512e1615fad2bc9935baba3fff14996012 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -36,6 +36,7 @@ 
 #include <linux/splice.h>
 #include <linux/in6.h>
 #include <linux/if_packet.h>
+#include <linux/llist.h>
 #include <net/flow.h>
 #include <net/page_pool.h>
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
@@ -743,6 +744,7 @@  struct sk_buff {
 		};
 		struct rb_node		rbnode; /* used in netem, ip4 defrag, and tcp stack */
 		struct list_head	list;
+		struct llist_node	ll_node;
 	};
 
 	union {
diff --git a/include/net/sock.h b/include/net/sock.h
index 2d40fe4c7718ee702bf7e5a847ceff6f8f2f5b7d..2578d1f455a7af0d7f4ce5d3b4ac25ee41fdaeb4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -63,6 +63,7 @@ 
 #include <linux/indirect_call_wrapper.h>
 #include <linux/atomic.h>
 #include <linux/refcount.h>
+#include <linux/llist.h>
 #include <net/dst.h>
 #include <net/checksum.h>
 #include <net/tcp_states.h>
@@ -408,6 +409,8 @@  struct sock {
 		struct sk_buff	*head;
 		struct sk_buff	*tail;
 	} sk_backlog;
+	struct llist_head defer_list;
+
 #define sk_rmem_alloc sk_backlog.rmem_alloc
 
 	int			sk_forward_alloc;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 05c81677aaf782f23b8c63d6ed133df802b43064..44e442bf23f9ccc0a1a914345c3faf1fc9f99d5f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1368,6 +1368,16 @@  static inline bool tcp_checksum_complete(struct sk_buff *skb)
 }
 
 bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb);
+
+void __sk_defer_free_flush(struct sock *sk);
+
+static inline void sk_defer_free_flush(struct sock *sk)
+{
+	if (llist_empty(&sk->defer_list))
+		return;
+	__sk_defer_free_flush(sk);
+}
+
 int tcp_filter(struct sock *sk, struct sk_buff *skb);
 void tcp_set_state(struct sock *sk, int state);
 void tcp_done(struct sock *sk);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4e7011672aa9a04370b7a03b972fe19cd48ea232..33cd9a1c199cef9822ec0ddb3aec91c1111754c7 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1580,14 +1580,34 @@  void tcp_cleanup_rbuf(struct sock *sk, int copied)
 		tcp_send_ack(sk);
 }
 
+void __sk_defer_free_flush(struct sock *sk)
+{
+	struct llist_node *head;
+	struct sk_buff *skb, *n;
+
+	head = llist_del_all(&sk->defer_list);
+	llist_for_each_entry_safe(skb, n, head, ll_node) {
+		prefetch(n);
+		skb_mark_not_on_list(skb);
+		__kfree_skb(skb);
+	}
+}
+EXPORT_SYMBOL(__sk_defer_free_flush);
+
 static void tcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb)
 {
+	__skb_unlink(skb, &sk->sk_receive_queue);
 	if (likely(skb->destructor == sock_rfree)) {
 		sock_rfree(skb);
 		skb->destructor = NULL;
 		skb->sk = NULL;
+		if (!skb_queue_empty(&sk->sk_receive_queue) ||
+		    !llist_empty(&sk->defer_list)) {
+			llist_add(&skb->ll_node, &sk->defer_list);
+			return;
+		}
 	}
-	sk_eat_skb(sk, skb);
+	__kfree_skb(skb);
 }
 
 static struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
@@ -2422,6 +2442,7 @@  static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 			/* Do not sleep, just process backlog. */
 			__sk_flush_backlog(sk);
 		} else {
+			sk_defer_free_flush(sk);
 			sk_wait_data(sk, &timeo, last);
 		}
 
@@ -2540,6 +2561,7 @@  int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 	ret = tcp_recvmsg_locked(sk, msg, len, nonblock, flags, &tss,
 				 &cmsg_flags);
 	release_sock(sk);
+	sk_defer_free_flush(sk);
 
 	if (cmsg_flags && ret >= 0) {
 		if (cmsg_flags & TCP_CMSG_TS)
@@ -3065,7 +3087,7 @@  int tcp_disconnect(struct sock *sk, int flags)
 		sk->sk_frag.page = NULL;
 		sk->sk_frag.offset = 0;
 	}
-
+	sk_defer_free_flush(sk);
 	sk_error_report(sk);
 	return 0;
 }
@@ -4194,6 +4216,7 @@  static int do_tcp_getsockopt(struct sock *sk, int level,
 		err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
 							  &zc, &len, err);
 		release_sock(sk);
+		sk_defer_free_flush(sk);
 		if (len >= offsetofend(struct tcp_zerocopy_receive, msg_flags))
 			goto zerocopy_rcv_cmsg;
 		switch (len) {
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 5ad81bfb27b2f8d9a3cfe11141160b48092cfa3a..3dd19a2bf06c483b43d7e60080c624f10bb2f63d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2102,6 +2102,7 @@  int tcp_v4_rcv(struct sk_buff *skb)
 
 	sk_incoming_cpu_update(sk);
 
+	sk_defer_free_flush(sk);
 	bh_lock_sock_nested(sk);
 	tcp_segs_in(tcp_sk(sk), skb);
 	ret = 0;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index f41f14b701233dd2d0f5ad464a623a5ba9774763..3b7d6ede13649d2589f5a456c5a132409486880f 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1758,6 +1758,7 @@  INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 
 	sk_incoming_cpu_update(sk);
 
+	sk_defer_free_flush(sk);
 	bh_lock_sock_nested(sk);
 	tcp_segs_in(tcp_sk(sk), skb);
 	ret = 0;