| Message ID | 20211115190249.3936899-18-eric.dumazet@gmail.com (mailing list archive) |
|---|---|
| State | Accepted |
| Delegated to: | Netdev Maintainers |
| Series | tcp: optimizations for linux-5.17 |
On Mon, 15 Nov 2021 11:02:46 -0800 Eric Dumazet wrote:
> One cpu can now be fully utilized for the kernel->user copy,
> and another cpu is handling BH processing and skb/page
> allocs/frees (assuming RFS is not forcing use of a single CPU)

Are you saying the kernel->user copy is not under the socket lock
today? I'm working on getting the crypto & copy from under the socket
lock for ktls, and it looked like tcp does the copy under the lock.
On Tue, Nov 16, 2021 at 6:27 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 15 Nov 2021 11:02:46 -0800 Eric Dumazet wrote:
> > One cpu can now be fully utilized for the kernel->user copy,
> > and another cpu is handling BH processing and skb/page
> > allocs/frees (assuming RFS is not forcing use of a single CPU)
>
> Are you saying the kernel->user copy is not under the socket lock
> today? I'm working on getting the crypto & copy from under the socket
> lock for ktls, and it looked like tcp does the copy under the lock.

Copy is done currently with socket lock owned.

But each skb is freed one at a time, after its payload has been consumed.

Note that I am also working on performing the copy while still allowing BH
to process incoming packets.

This is a bit more complex, but I think it is doable.
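[For readers skimming the thread: the patch at the bottom of this page addresses exactly this per-skb freeing cost by deferring the frees until the socket lock has been released. A userspace analogy of that idea, purely illustrative (the actual kernel implementation, shown in the diff below, hangs a lock-free llist off struct sock):]

```c
/* Userspace sketch of "defer the frees until the lock is dropped".
 * Illustration only; not the kernel code from the patch below.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct buf {
	struct buf *next;
	char data[4096];
};

static pthread_mutex_t sock_lock = PTHREAD_MUTEX_INITIALIZER;

int main(void)
{
	struct buf *defer_list = NULL, *b, *next;

	pthread_mutex_lock(&sock_lock);      /* "socket lock" held for the copy */
	for (int i = 0; i < 16; i++) {
		b = malloc(sizeof(*b));
		/* ... payload would be copied to userspace here ... */
		b->next = defer_list;        /* defer instead of free(b) now */
		defer_list = b;
	}
	pthread_mutex_unlock(&sock_lock);    /* in the kernel case, BH can grab
					      * the lock again at this point */

	for (b = defer_list; b; b = next) {  /* free the whole batch, lockless */
		next = b->next;
		free(b);
	}
	printf("freed batch after dropping the lock\n");
	return 0;
}
```

[The point is simply that the freeing cost moves out of the locked section, so whatever contends on the lock, BH processing of incoming packets in the kernel case, waits less.]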
On Tue, 16 Nov 2021 07:05:54 -0800 Eric Dumazet wrote:
> On Tue, Nov 16, 2021 at 6:27 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > On Mon, 15 Nov 2021 11:02:46 -0800 Eric Dumazet wrote:
> > > One cpu can now be fully utilized for the kernel->user copy,
> > > and another cpu is handling BH processing and skb/page
> > > allocs/frees (assuming RFS is not forcing use of a single CPU)
> >
> > Are you saying the kernel->user copy is not under the socket lock
> > today? I'm working on getting the crypto & copy from under the socket
> > lock for ktls, and it looked like tcp does the copy under the lock.
>
> Copy is done currently with socket lock owned.
>
> But each skb is freed one at a time, after its payload has been consumed.
>
> Note that I am also working on performing the copy while still allowing BH
> to process incoming packets.
>
> This is a bit more complex, but I think it is doable.

Can't wait ! :)
On Tue, Nov 16, 2021 at 7:05 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Nov 16, 2021 at 6:27 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Mon, 15 Nov 2021 11:02:46 -0800 Eric Dumazet wrote:
> > > One cpu can now be fully utilized for the kernel->user copy,
> > > and another cpu is handling BH processing and skb/page
> > > allocs/frees (assuming RFS is not forcing use of a single CPU)
> >
> > Are you saying the kernel->user copy is not under the socket lock
> > today? I'm working on getting the crypto & copy from under the socket
> > lock for ktls, and it looked like tcp does the copy under the lock.
>
> Copy is done currently with socket lock owned.
>
> But each skb is freed one at a time, after its payload has been consumed.
>
> Note that I am also working on performing the copy while still allowing BH
> to process incoming packets.
>
> This is a bit more complex, but I think it is doable.

Here is the perf top profile on cpu used by user thread doing the
recvmsg(), at 96 Gbit/s

We no longer see skb freeing related costs, but we still see costs of
having to process the backlog.

 81.06%  [kernel]  [k] copy_user_enhanced_fast_string
  2.50%  [kernel]  [k] __skb_datagram_iter
  2.25%  [kernel]  [k] _copy_to_iter
  1.45%  [kernel]  [k] tcp_recvmsg_locked
  1.39%  [kernel]  [k] tcp_rcv_established
  0.93%  [kernel]  [k] skb_try_coalesce
  0.79%  [kernel]  [k] sock_rfree
  0.72%  [kernel]  [k] tcp_v6_do_rcv
  0.57%  [kernel]  [k] skb_release_data
  0.50%  [kernel]  [k] tcp_queue_rcv
  0.43%  [kernel]  [k] __direct_call_clocksource_read1
  0.43%  [kernel]  [k] __release_sock
  0.39%  [kernel]  [k] _raw_spin_lock
  0.25%  [kernel]  [k] __direct_call_hrtimer_clock_base_get_time1
  0.20%  [kernel]  [k] __tcp_transmit_skb
  0.19%  [kernel]  [k] __dev_queue_xmit
  0.18%  [kernel]  [k] __tcp_select_window
  0.18%  [kernel]  [k] _raw_spin_lock_bh
On Tue, 16 Nov 2021 07:22:02 -0800 Eric Dumazet wrote:
> Here is the perf top profile on cpu used by user thread doing the
> recvmsg(), at 96 Gbit/s
>
> We no longer see skb freeing related costs, but we still see costs of
> having to process the backlog.
>
>  81.06%  [kernel]  [k] copy_user_enhanced_fast_string
>   2.50%  [kernel]  [k] __skb_datagram_iter
>   2.25%  [kernel]  [k] _copy_to_iter
>   1.45%  [kernel]  [k] tcp_recvmsg_locked
>   1.39%  [kernel]  [k] tcp_rcv_established

Huh, somehow I assumed your 4k MTU numbers were with zero-copy :o

Out of curiosity - what's the softirq load with 4k? Do you have an
idea what the load is on the CPU consuming the data vs the softirq
processing with 1500B ?
On Tue, Nov 16, 2021 at 7:27 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 16 Nov 2021 07:22:02 -0800 Eric Dumazet wrote:
> > Here is the perf top profile on cpu used by user thread doing the
> > recvmsg(), at 96 Gbit/s
> >
> > We no longer see skb freeing related costs, but we still see costs of
> > having to process the backlog.
> >
> >  81.06%  [kernel]  [k] copy_user_enhanced_fast_string
> >   2.50%  [kernel]  [k] __skb_datagram_iter
> >   2.25%  [kernel]  [k] _copy_to_iter
> >   1.45%  [kernel]  [k] tcp_recvmsg_locked
> >   1.39%  [kernel]  [k] tcp_rcv_established
>
> Huh, somehow I assumed your 4k MTU numbers were with zero-copy :o
>
> Out of curiosity - what's the softirq load with 4k? Do you have an
> idea what the load is on the CPU consuming the data vs the softirq
> processing with 1500B ?

On my testing host,

4K MTU : processing ~2,600,000 packets per second in GRO and other parts
uses about 60% of the core in BH.
(Some of this cost comes from a clang issue, and the csum_partial() one
I was working on last week)
NIC RX interrupts are firing about 25,000 times per second in this setup.

1500 MTU : processing ~5,800,000 packets per second uses one core in
BH (and also one core in recvmsg()),
We stay in NAPI mode (no IRQ rearming)
(That was with a TCP_STREAM run sustaining 70Gbit)

BH numbers also depend on IRQ coalescing parameters.
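[A rough cross-check of these packet rates; back-of-envelope arithmetic only, assuming ~1448 bytes of TCP payload per 1500-MTU packet and 4096 bytes per 4K-MTU packet, per the MSS discussion later in the thread:]

    5,800,000 pkt/s x 1448 B x 8 ~= 67 Gbit/s of payload (in line with the 70 Gbit TCP_STREAM figure)
    2,600,000 pkt/s x 4096 B x 8 ~= 85 Gbit/s of payload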
On Tue, 16 Nov 2021 08:46:37 -0800 Eric Dumazet wrote:
> On my testing host,
>
> 4K MTU : processing ~2,600,000 packets per second in GRO and other parts
> uses about 60% of the core in BH.
> (Some of this cost comes from a clang issue, and the csum_partial() one
> I was working on last week)
> NIC RX interrupts are firing about 25,000 times per second in this setup.
>
> 1500 MTU : processing ~5,800,000 packets per second uses one core in
> BH (and also one core in recvmsg()),
> We stay in NAPI mode (no IRQ rearming)
> (That was with a TCP_STREAM run sustaining 70Gbit)
>
> BH numbers also depend on IRQ coalescing parameters.

Very interesting, curious to see what not doing the copy under socket
lock will do to the 1.5k case.

Thanks a lot for sharing the detailed info!
On 11/16/21 9:46 AM, Eric Dumazet wrote:
> On Tue, Nov 16, 2021 at 7:27 AM Jakub Kicinski <kuba@kernel.org> wrote:
>>
>> On Tue, 16 Nov 2021 07:22:02 -0800 Eric Dumazet wrote:
>>> Here is the perf top profile on cpu used by user thread doing the
>>> recvmsg(), at 96 Gbit/s
>>>
>>> We no longer see skb freeing related costs, but we still see costs of
>>> having to process the backlog.
>>>
>>>  81.06%  [kernel]  [k] copy_user_enhanced_fast_string
>>>   2.50%  [kernel]  [k] __skb_datagram_iter
>>>   2.25%  [kernel]  [k] _copy_to_iter
>>>   1.45%  [kernel]  [k] tcp_recvmsg_locked
>>>   1.39%  [kernel]  [k] tcp_rcv_established
>>
>> Huh, somehow I assumed your 4k MTU numbers were with zero-copy :o

I thought the same. :-)

>>
>> Out of curiosity - what's the softirq load with 4k? Do you have an
>> idea what the load is on the CPU consuming the data vs the softirq
>> processing with 1500B ?
>
> On my testing host,
>
> 4K MTU : processing ~2,600,000 packets per second in GRO and other parts
> uses about 60% of the core in BH.

4kB or 4kB+hdr MTU? I ask because there is a subtle difference in the
size of the GRO packet which affects overall efficiency.

e.g., at 1500 MTU, 1448 MSS, a GRO packet has at most 45 segments for a
GRO size of 65212. At 4000 MTU, 3948 MSS, a GRO packet has at most 16
segments for a GRO packet size of 63220. I have noticed that 3300 MTU is
a bit of a sweet spot with MLX5/ConnectX-5 at least - 20 segments and a
65012-byte GRO packet without triggering nonlinear mode.

> (Some of this cost comes from a clang issue, and the csum_partial() one
> I was working on last week)
> NIC RX interrupts are firing about 25,000 times per second in this setup.
>
> 1500 MTU : processing ~5,800,000 packets per second uses one core in
> BH (and also one core in recvmsg()),
> We stay in NAPI mode (no IRQ rearming)
> (That was with a TCP_STREAM run sustaining 70Gbit)
>
> BH numbers also depend on IRQ coalescing parameters.
>

What NIC do you use for testing?
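[The segment counts quoted above follow from simple arithmetic. A small standalone sketch, my own illustration, assuming IPv4 plus TCP with timestamps (52 bytes of headers per segment) and the classic 64KB GRO/TSO size limit, reproduces them:]

```c
/* Back-of-envelope GRO sizing for a few MTUs (illustration only,
 * not kernel code). Assumes 52 bytes of headers per segment
 * (20 IPv4 + 32 TCP with timestamps) and a 65535-byte GRO/TSO cap.
 */
#include <stdio.h>

int main(void)
{
	int mtus[] = { 1500, 3300, 4000 };
	int hdrs = 52;          /* per-segment header bytes */
	int gro_max = 65535;    /* pre-BIG-TCP packet size limit */

	for (unsigned i = 0; i < sizeof(mtus) / sizeof(mtus[0]); i++) {
		int mss = mtus[i] - hdrs;
		int segs = (gro_max - hdrs) / mss;   /* payload that still fits */
		int gro_size = segs * mss + hdrs;

		printf("MTU %4d: mss %4d, %2d segments, GRO packet %d bytes\n",
		       mtus[i], mss, segs, gro_size);
	}
	return 0;
}
```

[Running it prints 45 segments / 65212 bytes for 1500 MTU, 20 / 65012 for 3300, and 16 / 63220 for 4000, matching the figures above.]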
On Tue, Nov 16, 2021 at 12:45 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 11/16/21 9:46 AM, Eric Dumazet wrote:
> > On Tue, Nov 16, 2021 at 7:27 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >>
> >> On Tue, 16 Nov 2021 07:22:02 -0800 Eric Dumazet wrote:
> >>> Here is the perf top profile on cpu used by user thread doing the
> >>> recvmsg(), at 96 Gbit/s
> >>>
> >>> We no longer see skb freeing related costs, but we still see costs of
> >>> having to process the backlog.
> >>>
> >>>  81.06%  [kernel]  [k] copy_user_enhanced_fast_string
> >>>   2.50%  [kernel]  [k] __skb_datagram_iter
> >>>   2.25%  [kernel]  [k] _copy_to_iter
> >>>   1.45%  [kernel]  [k] tcp_recvmsg_locked
> >>>   1.39%  [kernel]  [k] tcp_rcv_established
> >>
> >> Huh, somehow I assumed your 4k MTU numbers were with zero-copy :o
>
> I thought the same. :-)
>
> >>
> >> Out of curiosity - what's the softirq load with 4k? Do you have an
> >> idea what the load is on the CPU consuming the data vs the softirq
> >> processing with 1500B ?
> >
> > On my testing host,
> >
> > 4K MTU : processing ~2,600,000 packets per second in GRO and other parts
> > uses about 60% of the core in BH.
>
> 4kB or 4kB+hdr MTU? I ask because there is a subtle difference in the
> size of the GRO packet which affects overall efficiency.
>
> e.g., at 1500 MTU, 1448 MSS, a GRO packet has at most 45 segments for a
> GRO size of 65212. At 4000 MTU, 3948 MSS, a GRO packet has at most 16
> segments for a GRO packet size of 63220. I have noticed that 3300 MTU is
> a bit of a sweet spot with MLX5/ConnectX-5 at least - 20 segments and a
> 65012-byte GRO packet without triggering nonlinear mode.

We are using 4096 bytes of payload, to enable TCP RX zero copy if the
receiver wants it (even if in this case I was using TCP_STREAM, which
does a standard recvmsg()).

Yes, the TSO/GRO standard limit in this case is 15*4K = 61440, but also
remember we are working on BIG TCP packets, so we do not have to find a
'sweet spot' :)

With BIG TCP enabled, I am sending/receiving TSO/GRO packets with 45 4K
segments (184320 bytes of payload).

(But the results I gave in this thread were with standard TSO/GRO limits.)

> >
> > (Some of this cost comes from a clang issue, and the csum_partial() one
> > I was working on last week)
> > NIC RX interrupts are firing about 25,000 times per second in this setup.
> >
> > 1500 MTU : processing ~5,800,000 packets per second uses one core in
> > BH (and also one core in recvmsg()),
> > We stay in NAPI mode (no IRQ rearming)
> > (That was with a TCP_STREAM run sustaining 70Gbit)
> >
> > BH numbers also depend on IRQ coalescing parameters.
> >
>
> What NIC do you use for testing?

Google proprietary NIC.
```diff
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 686a666d073d5106526f3c5c20d64f26131be72d..b8b806512e1615fad2bc9935baba3fff14996012 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -36,6 +36,7 @@
 #include <linux/splice.h>
 #include <linux/in6.h>
 #include <linux/if_packet.h>
+#include <linux/llist.h>
 #include <net/flow.h>
 #include <net/page_pool.h>
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
@@ -743,6 +744,7 @@ struct sk_buff {
 		};
 		struct rb_node	rbnode; /* used in netem, ip4 defrag, and tcp stack */
 		struct list_head	list;
+		struct llist_node	ll_node;
 	};
 
 	union {
diff --git a/include/net/sock.h b/include/net/sock.h
index 2d40fe4c7718ee702bf7e5a847ceff6f8f2f5b7d..2578d1f455a7af0d7f4ce5d3b4ac25ee41fdaeb4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -63,6 +63,7 @@
 #include <linux/indirect_call_wrapper.h>
 #include <linux/atomic.h>
 #include <linux/refcount.h>
+#include <linux/llist.h>
 #include <net/dst.h>
 #include <net/checksum.h>
 #include <net/tcp_states.h>
@@ -408,6 +409,8 @@ struct sock {
 		struct sk_buff	*head;
 		struct sk_buff	*tail;
 	} sk_backlog;
+	struct llist_head defer_list;
+
 #define sk_rmem_alloc sk_backlog.rmem_alloc
 
 	int			sk_forward_alloc;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 05c81677aaf782f23b8c63d6ed133df802b43064..44e442bf23f9ccc0a1a914345c3faf1fc9f99d5f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1368,6 +1368,16 @@ static inline bool tcp_checksum_complete(struct sk_buff *skb)
 }
 
 bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb);
+
+void __sk_defer_free_flush(struct sock *sk);
+
+static inline void sk_defer_free_flush(struct sock *sk)
+{
+	if (llist_empty(&sk->defer_list))
+		return;
+	__sk_defer_free_flush(sk);
+}
+
 int tcp_filter(struct sock *sk, struct sk_buff *skb);
 void tcp_set_state(struct sock *sk, int state);
 void tcp_done(struct sock *sk);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4e7011672aa9a04370b7a03b972fe19cd48ea232..33cd9a1c199cef9822ec0ddb3aec91c1111754c7 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1580,14 +1580,34 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
 		tcp_send_ack(sk);
 }
 
+void __sk_defer_free_flush(struct sock *sk)
+{
+	struct llist_node *head;
+	struct sk_buff *skb, *n;
+
+	head = llist_del_all(&sk->defer_list);
+	llist_for_each_entry_safe(skb, n, head, ll_node) {
+		prefetch(n);
+		skb_mark_not_on_list(skb);
+		__kfree_skb(skb);
+	}
+}
+EXPORT_SYMBOL(__sk_defer_free_flush);
+
 static void tcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb)
 {
+	__skb_unlink(skb, &sk->sk_receive_queue);
 	if (likely(skb->destructor == sock_rfree)) {
 		sock_rfree(skb);
 		skb->destructor = NULL;
 		skb->sk = NULL;
+		if (!skb_queue_empty(&sk->sk_receive_queue) ||
+		    !llist_empty(&sk->defer_list)) {
+			llist_add(&skb->ll_node, &sk->defer_list);
+			return;
+		}
 	}
-	sk_eat_skb(sk, skb);
+	__kfree_skb(skb);
 }
 
 static struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
@@ -2422,6 +2442,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 			/* Do not sleep, just process backlog. */
 			__sk_flush_backlog(sk);
 		} else {
+			sk_defer_free_flush(sk);
 			sk_wait_data(sk, &timeo, last);
 		}
 
@@ -2540,6 +2561,7 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 	ret = tcp_recvmsg_locked(sk, msg, len, nonblock, flags, &tss,
 				 &cmsg_flags);
 	release_sock(sk);
+	sk_defer_free_flush(sk);
 
 	if (cmsg_flags && ret >= 0) {
 		if (cmsg_flags & TCP_CMSG_TS)
@@ -3065,7 +3087,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 		sk->sk_frag.page = NULL;
 		sk->sk_frag.offset = 0;
 	}
-
+	sk_defer_free_flush(sk);
 	sk_error_report(sk);
 	return 0;
 }
@@ -4194,6 +4216,7 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 		err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
 							  &zc, &len, err);
 		release_sock(sk);
+		sk_defer_free_flush(sk);
 		if (len >= offsetofend(struct tcp_zerocopy_receive, msg_flags))
 			goto zerocopy_rcv_cmsg;
 		switch (len) {
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 5ad81bfb27b2f8d9a3cfe11141160b48092cfa3a..3dd19a2bf06c483b43d7e60080c624f10bb2f63d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2102,6 +2102,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
 
 	sk_incoming_cpu_update(sk);
 
+	sk_defer_free_flush(sk);
 	bh_lock_sock_nested(sk);
 	tcp_segs_in(tcp_sk(sk), skb);
 	ret = 0;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index f41f14b701233dd2d0f5ad464a623a5ba9774763..3b7d6ede13649d2589f5a456c5a132409486880f 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1758,6 +1758,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 
 	sk_incoming_cpu_update(sk);
 
+	sk_defer_free_flush(sk);
 	bh_lock_sock_nested(sk);
 	tcp_segs_in(tcp_sk(sk), skb);
 	ret = 0;
```