| Message ID | 20240723135742.35102-1-kerneljasonxing@gmail.com (mailing list archive) |
|---|---|
| State | RFC |
| Delegated to | Netdev Maintainers |
| Series | [RFC,net-next] net: add an entry for CONFIG_NET_RX_BUSY_POLL |
On Tue, Jul 23, 2024 at 3:57 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> From: Jason Xing <kernelxing@tencent.com>
>
> When I was doing performance test on unix_poll(), I found out that
> accessing sk->sk_ll_usec when calling sock_poll()->sk_can_busy_loop()
> occupies too much time, which causes around 16% degradation. So I
> decided to turn off this config, which cannot be done apparently
> before this patch.

Too many CONFIG_ options, distros will enable it anyway.

In my builds, offset of sk_ll_usec is 0xe8.

Are you using some debug options or an old tree ?

I can not understand how a 16% degradation can occur, reading a field
in a cache line which contains read mostly fields for af_unix socket.

I think you need to provide more details / analysis, and perhaps come
to a different conclusion.

>
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
> More data not much related if you're interested:
>      5.82 │      mov    0x18(%r13),%rdx
>      0.03 │      mov    %rsi,%r12
>      1.76 │      mov    %rdi,%rbx
>           │    sk_can_busy_loop():
>      0.50 │      mov    0x104(%rdx),%r14d
>     41.30 │      test   %r14d,%r14d
> Note: I run 'perf record -e L1-dcache-load-misses' to diagnose
> ---
>  net/Kconfig | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/net/Kconfig b/net/Kconfig
> index d27d0deac0bf..1f1b793984fe 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -335,8 +335,10 @@ config CGROUP_NET_CLASSID
>            being used in cls_cgroup and for netfilter matching.
>
>  config NET_RX_BUSY_POLL
> -       bool
> +       bool "Low latency busy poll timeout"
>         default y if !PREEMPT_RT || (PREEMPT_RT && !NETCONSOLE)
> +       help
> +         Approximate time in us to spin waiting for packets on the device queue.

Wrong comment. It is a y/n choice, no 'usec' at this stage.

>
>  config BQL
>         bool
> --
> 2.37.3
>
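For readers unfamiliar with the field under discussion: sk_ll_usec holds the socket's busy-poll budget in microseconds, and it is the value sk_can_busy_loop() tests in the poll path. It stays zero unless the net.core.busy_read/busy_poll sysctls are set or the application opts in per socket. A minimal, illustrative sketch of the per-socket opt-in follows; the UDP socket and the 50 µs budget are assumptions, not anything from this thread, and SO_BUSY_POLL is Linux-specific and may require CAP_NET_ADMIN:

```c
/* Illustrative only: shows what normally populates sk_ll_usec. */
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	unsigned int busy_poll_us = 50;	/* assumed budget in microseconds */

	if (fd < 0) {
		perror("socket");
		return 1;
	}

	/* SO_BUSY_POLL sets the per-socket busy-poll time (sk_ll_usec). */
	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
		       &busy_poll_us, sizeof(busy_poll_us)) < 0)
		perror("setsockopt(SO_BUSY_POLL)");

	/* From here on, poll()/recv() on fd may busy-poll the device queue. */
	return 0;
}
```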
On Tue, Jul 23, 2024 at 10:57 PM Eric Dumazet <edumazet@google.com> wrote: > > On Tue, Jul 23, 2024 at 3:57 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > From: Jason Xing <kernelxing@tencent.com> > > > > When I was doing performance test on unix_poll(), I found out that > > accessing sk->sk_ll_usec when calling sock_poll()->sk_can_busy_loop() > > occupies too much time, which causes around 16% degradation. So I > > decided to turn off this config, which cannot be done apparently > > before this patch. > > Too many CONFIG_ options, distros will enable it anyway. > > In my builds, offset of sk_ll_usec is 0xe8. > > Are you using some debug options or an old tree ? > > I can not understand how a 16% degradation can occur, reading a field > in a cache line which contains read mostly fields for af_unix socket. > > I think you need to provide more details / analysis, and perhaps come > to a different conclusion. Thanks for your comments. I'm also confused about the result. The design of the cache line is correct from my perspective because they are all read mostly fields as you said. I was doing some tests by using libmicro[1] and found this line '41.30 │ test %r14d,%r14d' by using perf. So I realised that there is something strange here. Then I disable that config, the result turns out to be better than before. One of my colleagues can prove it. In this patch, I described a story about why I would like to let people disable/enable it, but investigating this part may be another different thing, I think. I will keep trying. [1]: https://github.com/redhat-performance/libMicro.git running 'https://github.com/redhat-performance/libMicro.git' to see the results > > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com> > > --- > > More data not much related if you're interested: > > 5.82 │ mov 0x18(%r13),%rdx > > 0.03 │ mov %rsi,%r12 > > 1.76 │ mov %rdi,%rbx > > │ sk_can_busy_loop(): > > 0.50 │ mov 0x104(%rdx),%r14d > > 41.30 │ test %r14d,%r14d > > Note: I run 'perf record -e L1-dcache-load-misses' to diagnose > > --- > > net/Kconfig | 4 +++- > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > diff --git a/net/Kconfig b/net/Kconfig > > index d27d0deac0bf..1f1b793984fe 100644 > > --- a/net/Kconfig > > +++ b/net/Kconfig > > @@ -335,8 +335,10 @@ config CGROUP_NET_CLASSID > > being used in cls_cgroup and for netfilter matching. > > > > config NET_RX_BUSY_POLL > > - bool > > + bool "Low latency busy poll timeout" > > default y if !PREEMPT_RT || (PREEMPT_RT && !NETCONSOLE) > > + help > > + Approximate time in us to spin waiting for packets on the device queue. > > Wrong comment. It is a y/n choice, no 'usec' at this stage. Oh, I see. Thanks, Jason > > > > > config BQL > > bool > > -- > > 2.37.3 > >
On Tue, Jul 23, 2024 at 11:09 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > On Tue, Jul 23, 2024 at 10:57 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Tue, Jul 23, 2024 at 3:57 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > From: Jason Xing <kernelxing@tencent.com> > > > > > > When I was doing performance test on unix_poll(), I found out that > > > accessing sk->sk_ll_usec when calling sock_poll()->sk_can_busy_loop() > > > occupies too much time, which causes around 16% degradation. So I > > > decided to turn off this config, which cannot be done apparently > > > before this patch. > > > > Too many CONFIG_ options, distros will enable it anyway. > > > > In my builds, offset of sk_ll_usec is 0xe8. > > > > Are you using some debug options or an old tree ? I forgot to say: I'm running the latest kernel which I pulled around two hours ago. Whatever kind of configs with/without debug options I use, I can still reproduce it. > > > > I can not understand how a 16% degradation can occur, reading a field > > in a cache line which contains read mostly fields for af_unix socket. > > > > I think you need to provide more details / analysis, and perhaps come > > to a different conclusion. > > Thanks for your comments. > > I'm also confused about the result. The design of the cache line is > correct from my perspective because they are all read mostly fields as > you said. > > I was doing some tests by using libmicro[1] and found this line '41.30 > │ test %r14d,%r14d' by using perf. So I realised that there is > something strange here. Then I disable that config, the result turns > out to be better than before. One of my colleagues can prove it. > > In this patch, I described a story about why I would like to let > people disable/enable it, but investigating this part may be another > different thing, I think. I will keep trying. > > [1]: https://github.com/redhat-performance/libMicro.git > running 'https://github.com/redhat-performance/libMicro.git' to see the results > > > > > > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com> > > > --- > > > More data not much related if you're interested: > > > 5.82 │ mov 0x18(%r13),%rdx > > > 0.03 │ mov %rsi,%r12 > > > 1.76 │ mov %rdi,%rbx > > > │ sk_can_busy_loop(): > > > 0.50 │ mov 0x104(%rdx),%r14d > > > 41.30 │ test %r14d,%r14d > > > Note: I run 'perf record -e L1-dcache-load-misses' to diagnose > > > --- > > > net/Kconfig | 4 +++- > > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > > > diff --git a/net/Kconfig b/net/Kconfig > > > index d27d0deac0bf..1f1b793984fe 100644 > > > --- a/net/Kconfig > > > +++ b/net/Kconfig > > > @@ -335,8 +335,10 @@ config CGROUP_NET_CLASSID > > > being used in cls_cgroup and for netfilter matching. > > > > > > config NET_RX_BUSY_POLL > > > - bool > > > + bool "Low latency busy poll timeout" > > > default y if !PREEMPT_RT || (PREEMPT_RT && !NETCONSOLE) > > > + help > > > + Approximate time in us to spin waiting for packets on the device queue. > > > > Wrong comment. It is a y/n choice, no 'usec' at this stage. > > Oh, I see. > > Thanks, > Jason > > > > > > > > > config BQL > > > bool > > > -- > > > 2.37.3 > > >
On Tue, Jul 23, 2024 at 5:13 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > On Tue, Jul 23, 2024 at 11:09 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > On Tue, Jul 23, 2024 at 10:57 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Tue, Jul 23, 2024 at 3:57 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > From: Jason Xing <kernelxing@tencent.com> > > > > > > > > When I was doing performance test on unix_poll(), I found out that > > > > accessing sk->sk_ll_usec when calling sock_poll()->sk_can_busy_loop() > > > > occupies too much time, which causes around 16% degradation. So I > > > > decided to turn off this config, which cannot be done apparently > > > > before this patch. > > > > > > Too many CONFIG_ options, distros will enable it anyway. > > > > > > In my builds, offset of sk_ll_usec is 0xe8. > > > > > > Are you using some debug options or an old tree ? > > I forgot to say: I'm running the latest kernel which I pulled around > two hours ago. Whatever kind of configs with/without debug options I > use, I can still reproduce it. Ok, please post : pahole --hex -C sock vmlinux
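For reference, assuming a vmlinux built with CONFIG_DEBUG_INFO, the requested layout can be dumped as-is, and the single field offset can be pulled out with an ordinary grep (the filter is just a convenience, not part of the request):

```sh
pahole --hex -C sock vmlinux                        # full struct sock layout
pahole --hex -C sock vmlinux | grep sk_ll_usec      # just the field offset
```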
On Tue, Jul 23, 2024 at 11:26 PM Eric Dumazet <edumazet@google.com> wrote: > > On Tue, Jul 23, 2024 at 5:13 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > On Tue, Jul 23, 2024 at 11:09 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > On Tue, Jul 23, 2024 at 10:57 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > On Tue, Jul 23, 2024 at 3:57 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > > > From: Jason Xing <kernelxing@tencent.com> > > > > > > > > > > When I was doing performance test on unix_poll(), I found out that > > > > > accessing sk->sk_ll_usec when calling sock_poll()->sk_can_busy_loop() > > > > > occupies too much time, which causes around 16% degradation. So I > > > > > decided to turn off this config, which cannot be done apparently > > > > > before this patch. > > > > > > > > Too many CONFIG_ options, distros will enable it anyway. > > > > > > > > In my builds, offset of sk_ll_usec is 0xe8. > > > > > > > > Are you using some debug options or an old tree ? > > > > I forgot to say: I'm running the latest kernel which I pulled around > > two hours ago. Whatever kind of configs with/without debug options I > > use, I can still reproduce it. > > Ok, please post : > > pahole --hex -C sock vmlinux 1) Enable the config: $ pahole --hex -C sock vmlinux struct sock { struct sock_common __sk_common; /* 0 0x88 */ /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */ __u8 __cacheline_group_begin__sock_write_rx[0]; /* 0x88 0 */ atomic_t sk_drops; /* 0x88 0x4 */ __s32 sk_peek_off; /* 0x8c 0x4 */ struct sk_buff_head sk_error_queue; /* 0x90 0x18 */ struct sk_buff_head sk_receive_queue; /* 0xa8 0x18 */ /* --- cacheline 3 boundary (192 bytes) --- */ struct { atomic_t rmem_alloc; /* 0xc0 0x4 */ int len; /* 0xc4 0x4 */ struct sk_buff * head; /* 0xc8 0x8 */ struct sk_buff * tail; /* 0xd0 0x8 */ } sk_backlog; /* 0xc0 0x18 */ __u8 __cacheline_group_end__sock_write_rx[0]; /* 0xd8 0 */ __u8 __cacheline_group_begin__sock_read_rx[0]; /* 0xd8 0 */ struct dst_entry * sk_rx_dst; /* 0xd8 0x8 */ int sk_rx_dst_ifindex; /* 0xe0 0x4 */ u32 sk_rx_dst_cookie; /* 0xe4 0x4 */ unsigned int sk_ll_usec; /* 0xe8 0x4 */ unsigned int sk_napi_id; /* 0xec 0x4 */ u16 sk_busy_poll_budget; /* 0xf0 0x2 */ u8 sk_prefer_busy_poll; /* 0xf2 0x1 */ u8 sk_userlocks; /* 0xf3 0x1 */ int sk_rcvbuf; /* 0xf4 0x4 */ struct sk_filter * sk_filter; /* 0xf8 0x8 */ /* --- cacheline 4 boundary (256 bytes) --- */ union { struct socket_wq * sk_wq; /* 0x100 0x8 */ struct socket_wq * sk_wq_raw; /* 0x100 0x8 */ }; /* 0x100 0x8 */ void (*sk_data_ready)(struct sock *); /* 0x108 0x8 */ long int sk_rcvtimeo; /* 0x110 0x8 */ int sk_rcvlowat; /* 0x118 0x4 */ __u8 __cacheline_group_end__sock_read_rx[0]; /* 0x11c 0 */ __u8 __cacheline_group_begin__sock_read_rxtx[0]; /* 0x11c 0 */ int sk_err; /* 0x11c 0x4 */ struct socket * sk_socket; /* 0x120 0x8 */ struct mem_cgroup * sk_memcg; /* 0x128 0x8 */ struct xfrm_policy * sk_policy[2]; /* 0x130 0x10 */ /* --- cacheline 5 boundary (320 bytes) --- */ __u8 __cacheline_group_end__sock_read_rxtx[0]; /* 0x140 0 */ __u8 __cacheline_group_begin__sock_write_rxtx[0]; /* 0x140 0 */ socket_lock_t sk_lock; /* 0x140 0x20 */ u32 sk_reserved_mem; /* 0x160 0x4 */ int sk_forward_alloc; /* 0x164 0x4 */ u32 sk_tsflags; /* 0x168 0x4 */ __u8 __cacheline_group_end__sock_write_rxtx[0]; /* 0x16c 0 */ __u8 __cacheline_group_begin__sock_write_tx[0]; /* 0x16c 0 */ int sk_write_pending; /* 0x16c 0x4 */ atomic_t sk_omem_alloc; /* 0x170 0x4 */ int sk_sndbuf; /* 0x174 0x4 */ int 
sk_wmem_queued; /* 0x178 0x4 */ refcount_t sk_wmem_alloc; /* 0x17c 0x4 */ /* --- cacheline 6 boundary (384 bytes) --- */ long unsigned int sk_tsq_flags; /* 0x180 0x8 */ union { struct sk_buff * sk_send_head; /* 0x188 0x8 */ struct rb_root tcp_rtx_queue; /* 0x188 0x8 */ }; /* 0x188 0x8 */ struct sk_buff_head sk_write_queue; /* 0x190 0x18 */ u32 sk_dst_pending_confirm; /* 0x1a8 0x4 */ u32 sk_pacing_status; /* 0x1ac 0x4 */ struct page_frag sk_frag; /* 0x1b0 0x10 */ /* --- cacheline 7 boundary (448 bytes) --- */ struct timer_list sk_timer; /* 0x1c0 0x28 */ /* XXX last struct has 4 bytes of padding */ long unsigned int sk_pacing_rate; /* 0x1e8 0x8 */ atomic_t sk_zckey; /* 0x1f0 0x4 */ atomic_t sk_tskey; /* 0x1f4 0x4 */ __u8 __cacheline_group_end__sock_write_tx[0]; /* 0x1f8 0 */ __u8 __cacheline_group_begin__sock_read_tx[0]; /* 0x1f8 0 */ long unsigned int sk_max_pacing_rate; /* 0x1f8 0x8 */ /* --- cacheline 8 boundary (512 bytes) --- */ long int sk_sndtimeo; /* 0x200 0x8 */ u32 sk_priority; /* 0x208 0x4 */ u32 sk_mark; /* 0x20c 0x4 */ struct dst_entry * sk_dst_cache; /* 0x210 0x8 */ netdev_features_t sk_route_caps; /* 0x218 0x8 */ u16 sk_gso_type; /* 0x220 0x2 */ u16 sk_gso_max_segs; /* 0x222 0x2 */ unsigned int sk_gso_max_size; /* 0x224 0x4 */ gfp_t sk_allocation; /* 0x228 0x4 */ u32 sk_txhash; /* 0x22c 0x4 */ u8 sk_pacing_shift; /* 0x230 0x1 */ bool sk_use_task_frag; /* 0x231 0x1 */ __u8 __cacheline_group_end__sock_read_tx[0]; /* 0x232 0 */ u8 sk_gso_disabled:1; /* 0x232: 0 0x1 */ u8 sk_kern_sock:1; /* 0x232:0x1 0x1 */ u8 sk_no_check_tx:1; /* 0x232:0x2 0x1 */ u8 sk_no_check_rx:1; /* 0x232:0x3 0x1 */ /* XXX 4 bits hole, try to pack */ u8 sk_shutdown; /* 0x233 0x1 */ u16 sk_type; /* 0x234 0x2 */ u16 sk_protocol; /* 0x236 0x2 */ long unsigned int sk_lingertime; /* 0x238 0x8 */ /* --- cacheline 9 boundary (576 bytes) --- */ struct proto * sk_prot_creator; /* 0x240 0x8 */ rwlock_t sk_callback_lock; /* 0x248 0x8 */ int sk_err_soft; /* 0x250 0x4 */ u32 sk_ack_backlog; /* 0x254 0x4 */ u32 sk_max_ack_backlog; /* 0x258 0x4 */ kuid_t sk_uid; /* 0x25c 0x4 */ spinlock_t sk_peer_lock; /* 0x260 0x4 */ int sk_bind_phc; /* 0x264 0x4 */ struct pid * sk_peer_pid; /* 0x268 0x8 */ const struct cred * sk_peer_cred; /* 0x270 0x8 */ ktime_t sk_stamp; /* 0x278 0x8 */ /* --- cacheline 10 boundary (640 bytes) --- */ int sk_disconnects; /* 0x280 0x4 */ u8 sk_txrehash; /* 0x284 0x1 */ u8 sk_clockid; /* 0x285 0x1 */ u8 sk_txtime_deadline_mode:1; /* 0x286: 0 0x1 */ u8 sk_txtime_report_errors:1; /* 0x286:0x1 0x1 */ u8 sk_txtime_unused:6; /* 0x286:0x2 0x1 */ /* XXX 1 byte hole, try to pack */ void * sk_user_data; /* 0x288 0x8 */ void * sk_security; /* 0x290 0x8 */ struct sock_cgroup_data sk_cgrp_data; /* 0x298 0x10 */ /* XXX last struct has 2 bytes of padding */ void (*sk_state_change)(struct sock *); /* 0x2a8 0x8 */ void (*sk_write_space)(struct sock *); /* 0x2b0 0x8 */ void (*sk_error_report)(struct sock *); /* 0x2b8 0x8 */ /* --- cacheline 11 boundary (704 bytes) --- */ int (*sk_backlog_rcv)(struct sock *, struct sk_buff *); /* 0x2c0 0x8 */ void (*sk_destruct)(struct sock *); /* 0x2c8 0x8 */ struct sock_reuseport * sk_reuseport_cb; /* 0x2d0 0x8 */ struct bpf_local_storage * sk_bpf_storage; /* 0x2d8 0x8 */ struct callback_head sk_rcu __attribute__((__aligned__(8))); /* 0x2e0 0x10 */ netns_tracker ns_tracker; /* 0x2f0 0 */ /* size: 752, cachelines: 12, members: 105 */ /* sum members: 749, holes: 1, sum holes: 1 */ /* sum bitfield members: 12 bits, bit holes: 1, sum bit holes: 4 bits */ /* paddings: 2, sum paddings: 6 */ /* 
forced alignments: 1 */ /* last cacheline: 48 bytes */ } __attribute__((__aligned__(8))); 2) Disable the config: $ pahole --hex -C sock vmlinux struct sock { struct sock_common __sk_common; /* 0 0x88 */ /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */ __u8 __cacheline_group_begin__sock_write_rx[0]; /* 0x88 0 */ atomic_t sk_drops; /* 0x88 0x4 */ __s32 sk_peek_off; /* 0x8c 0x4 */ struct sk_buff_head sk_error_queue; /* 0x90 0x18 */ struct sk_buff_head sk_receive_queue; /* 0xa8 0x18 */ /* --- cacheline 3 boundary (192 bytes) --- */ struct { atomic_t rmem_alloc; /* 0xc0 0x4 */ int len; /* 0xc4 0x4 */ struct sk_buff * head; /* 0xc8 0x8 */ struct sk_buff * tail; /* 0xd0 0x8 */ } sk_backlog; /* 0xc0 0x18 */ __u8 __cacheline_group_end__sock_write_rx[0]; /* 0xd8 0 */ __u8 __cacheline_group_begin__sock_read_rx[0]; /* 0xd8 0 */ struct dst_entry * sk_rx_dst; /* 0xd8 0x8 */ int sk_rx_dst_ifindex; /* 0xe0 0x4 */ u32 sk_rx_dst_cookie; /* 0xe4 0x4 */ u8 sk_userlocks; /* 0xe8 0x1 */ /* XXX 3 bytes hole, try to pack */ int sk_rcvbuf; /* 0xec 0x4 */ struct sk_filter * sk_filter; /* 0xf0 0x8 */ union { struct socket_wq * sk_wq; /* 0xf8 0x8 */ struct socket_wq * sk_wq_raw; /* 0xf8 0x8 */ }; /* 0xf8 0x8 */ /* --- cacheline 4 boundary (256 bytes) --- */ void (*sk_data_ready)(struct sock *); /* 0x100 0x8 */ long int sk_rcvtimeo; /* 0x108 0x8 */ int sk_rcvlowat; /* 0x110 0x4 */ __u8 __cacheline_group_end__sock_read_rx[0]; /* 0x114 0 */ __u8 __cacheline_group_begin__sock_read_rxtx[0]; /* 0x114 0 */ int sk_err; /* 0x114 0x4 */ struct socket * sk_socket; /* 0x118 0x8 */ struct mem_cgroup * sk_memcg; /* 0x120 0x8 */ struct xfrm_policy * sk_policy[2]; /* 0x128 0x10 */ __u8 __cacheline_group_end__sock_read_rxtx[0]; /* 0x138 0 */ __u8 __cacheline_group_begin__sock_write_rxtx[0]; /* 0x138 0 */ socket_lock_t sk_lock; /* 0x138 0x20 */ /* --- cacheline 5 boundary (320 bytes) was 24 bytes ago --- */ u32 sk_reserved_mem; /* 0x158 0x4 */ int sk_forward_alloc; /* 0x15c 0x4 */ u32 sk_tsflags; /* 0x160 0x4 */ __u8 __cacheline_group_end__sock_write_rxtx[0]; /* 0x164 0 */ __u8 __cacheline_group_begin__sock_write_tx[0]; /* 0x164 0 */ int sk_write_pending; /* 0x164 0x4 */ atomic_t sk_omem_alloc; /* 0x168 0x4 */ int sk_sndbuf; /* 0x16c 0x4 */ int sk_wmem_queued; /* 0x170 0x4 */ refcount_t sk_wmem_alloc; /* 0x174 0x4 */ long unsigned int sk_tsq_flags; /* 0x178 0x8 */ /* --- cacheline 6 boundary (384 bytes) --- */ union { struct sk_buff * sk_send_head; /* 0x180 0x8 */ struct rb_root tcp_rtx_queue; /* 0x180 0x8 */ }; /* 0x180 0x8 */ struct sk_buff_head sk_write_queue; /* 0x188 0x18 */ u32 sk_dst_pending_confirm; /* 0x1a0 0x4 */ u32 sk_pacing_status; /* 0x1a4 0x4 */ struct page_frag sk_frag; /* 0x1a8 0x10 */ struct timer_list sk_timer; /* 0x1b8 0x28 */ /* XXX last struct has 4 bytes of padding */ /* --- cacheline 7 boundary (448 bytes) was 32 bytes ago --- */ long unsigned int sk_pacing_rate; /* 0x1e0 0x8 */ atomic_t sk_zckey; /* 0x1e8 0x4 */ atomic_t sk_tskey; /* 0x1ec 0x4 */ __u8 __cacheline_group_end__sock_write_tx[0]; /* 0x1f0 0 */ __u8 __cacheline_group_begin__sock_read_tx[0]; /* 0x1f0 0 */ long unsigned int sk_max_pacing_rate; /* 0x1f0 0x8 */ long int sk_sndtimeo; /* 0x1f8 0x8 */ /* --- cacheline 8 boundary (512 bytes) --- */ u32 sk_priority; /* 0x200 0x4 */ u32 sk_mark; /* 0x204 0x4 */ struct dst_entry * sk_dst_cache; /* 0x208 0x8 */ netdev_features_t sk_route_caps; /* 0x210 0x8 */ u16 sk_gso_type; /* 0x218 0x2 */ u16 sk_gso_max_segs; /* 0x21a 0x2 */ unsigned int sk_gso_max_size; /* 0x21c 0x4 */ gfp_t sk_allocation; /* 
0x220 0x4 */ u32 sk_txhash; /* 0x224 0x4 */ u8 sk_pacing_shift; /* 0x228 0x1 */ bool sk_use_task_frag; /* 0x229 0x1 */ __u8 __cacheline_group_end__sock_read_tx[0]; /* 0x22a 0 */ u8 sk_gso_disabled:1; /* 0x22a: 0 0x1 */ u8 sk_kern_sock:1; /* 0x22a:0x1 0x1 */ u8 sk_no_check_tx:1; /* 0x22a:0x2 0x1 */ u8 sk_no_check_rx:1; /* 0x22a:0x3 0x1 */ /* XXX 4 bits hole, try to pack */ u8 sk_shutdown; /* 0x22b 0x1 */ u16 sk_type; /* 0x22c 0x2 */ u16 sk_protocol; /* 0x22e 0x2 */ long unsigned int sk_lingertime; /* 0x230 0x8 */ struct proto * sk_prot_creator; /* 0x238 0x8 */ /* --- cacheline 9 boundary (576 bytes) --- */ rwlock_t sk_callback_lock; /* 0x240 0x8 */ int sk_err_soft; /* 0x248 0x4 */ u32 sk_ack_backlog; /* 0x24c 0x4 */ u32 sk_max_ack_backlog; /* 0x250 0x4 */ kuid_t sk_uid; /* 0x254 0x4 */ spinlock_t sk_peer_lock; /* 0x258 0x4 */ int sk_bind_phc; /* 0x25c 0x4 */ struct pid * sk_peer_pid; /* 0x260 0x8 */ const struct cred * sk_peer_cred; /* 0x268 0x8 */ ktime_t sk_stamp; /* 0x270 0x8 */ int sk_disconnects; /* 0x278 0x4 */ u8 sk_txrehash; /* 0x27c 0x1 */ u8 sk_clockid; /* 0x27d 0x1 */ u8 sk_txtime_deadline_mode:1; /* 0x27e: 0 0x1 */ u8 sk_txtime_report_errors:1; /* 0x27e:0x1 0x1 */ u8 sk_txtime_unused:6; /* 0x27e:0x2 0x1 */ /* XXX 1 byte hole, try to pack */ /* --- cacheline 10 boundary (640 bytes) --- */ void * sk_user_data; /* 0x280 0x8 */ void * sk_security; /* 0x288 0x8 */ struct sock_cgroup_data sk_cgrp_data; /* 0x290 0x10 */ /* XXX last struct has 2 bytes of padding */ void (*sk_state_change)(struct sock *); /* 0x2a0 0x8 */ void (*sk_write_space)(struct sock *); /* 0x2a8 0x8 */ void (*sk_error_report)(struct sock *); /* 0x2b0 0x8 */ int (*sk_backlog_rcv)(struct sock *, struct sk_buff *); /* 0x2b8 0x8 */ /* --- cacheline 11 boundary (704 bytes) --- */ void (*sk_destruct)(struct sock *); /* 0x2c0 0x8 */ struct sock_reuseport * sk_reuseport_cb; /* 0x2c8 0x8 */ struct bpf_local_storage * sk_bpf_storage; /* 0x2d0 0x8 */ struct callback_head sk_rcu __attribute__((__aligned__(8))); /* 0x2d8 0x10 */ netns_tracker ns_tracker; /* 0x2e8 0 */ /* size: 744, cachelines: 12, members: 101 */ /* sum members: 738, holes: 2, sum holes: 4 */ /* sum bitfield members: 12 bits, bit holes: 1, sum bit holes: 4 bits */ /* paddings: 2, sum paddings: 6 */ /* forced alignments: 1 */ /* last cacheline: 40 bytes */ } __attribute__((__aligned__(8)));
On Tue, Jul 23, 2024 at 6:01 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > On Tue, Jul 23, 2024 at 11:26 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Tue, Jul 23, 2024 at 5:13 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > On Tue, Jul 23, 2024 at 11:09 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > On Tue, Jul 23, 2024 at 10:57 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > > On Tue, Jul 23, 2024 at 3:57 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > > > > > From: Jason Xing <kernelxing@tencent.com> > > > > > > > > > > > > When I was doing performance test on unix_poll(), I found out that > > > > > > accessing sk->sk_ll_usec when calling sock_poll()->sk_can_busy_loop() > > > > > > occupies too much time, which causes around 16% degradation. So I > > > > > > decided to turn off this config, which cannot be done apparently > > > > > > before this patch. > > > > > > > > > > Too many CONFIG_ options, distros will enable it anyway. > > > > > > > > > > In my builds, offset of sk_ll_usec is 0xe8. > > > > > > > > > > Are you using some debug options or an old tree ? > > > > > > I forgot to say: I'm running the latest kernel which I pulled around > > > two hours ago. Whatever kind of configs with/without debug options I > > > use, I can still reproduce it. > > > > Ok, please post : > > > > pahole --hex -C sock vmlinux > > 1) Enable the config: > $ pahole --hex -C sock vmlinux > struct sock { > struct sock_common __sk_common; /* 0 0x88 */ > /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */ > __u8 > __cacheline_group_begin__sock_write_rx[0]; /* 0x88 0 */ > atomic_t sk_drops; /* 0x88 0x4 */ > __s32 sk_peek_off; /* 0x8c 0x4 */ > struct sk_buff_head sk_error_queue; /* 0x90 0x18 */ > struct sk_buff_head sk_receive_queue; /* 0xa8 0x18 */ > /* --- cacheline 3 boundary (192 bytes) --- */ > struct { > atomic_t rmem_alloc; /* 0xc0 0x4 */ > int len; /* 0xc4 0x4 */ > struct sk_buff * head; /* 0xc8 0x8 */ > struct sk_buff * tail; /* 0xd0 0x8 */ > } sk_backlog; /* 0xc0 0x18 */ > __u8 > __cacheline_group_end__sock_write_rx[0]; /* 0xd8 0 */ > __u8 > __cacheline_group_begin__sock_read_rx[0]; /* 0xd8 0 */ > struct dst_entry * sk_rx_dst; /* 0xd8 0x8 */ > int sk_rx_dst_ifindex; /* 0xe0 0x4 */ > u32 sk_rx_dst_cookie; /* 0xe4 0x4 */ > unsigned int sk_ll_usec; /* 0xe8 0x4 */ See here ? offset of sk_ll_usec is 0xe8, not 0x104 as you posted. Do not blindly trust perf here. Please run a benchmark with 1,000,000 af_unix messages being sent and received. I am guessing your patch makes no difference at all (certainly not 16 % as claimed in your changelog)
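What is being asked for here is a whole-workload measurement rather than a per-instruction perf profile. A minimal sketch of that kind of test, assuming a socketpair ping-pong with poll() in the loop so that the sock_poll()/sk_can_busy_loop() path is actually exercised — message count, message size and timing method are illustrative assumptions, not taken from the thread:

```c
#include <poll.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define N_MSGS 1000000
#define MSG_SZ 64

int main(void)
{
	int sv[2];
	char buf[MSG_SZ] = { 0 };
	struct timespec t0, t1;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
		perror("socketpair");
		return 1;
	}

	pid_t pid = fork();
	if (pid == 0) {			/* child: echo every message back */
		for (int i = 0; i < N_MSGS; i++) {
			if (read(sv[1], buf, MSG_SZ) <= 0)
				break;
			write(sv[1], buf, MSG_SZ);
		}
		_exit(0);
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < N_MSGS; i++) {
		struct pollfd pfd = { .fd = sv[0], .events = POLLIN };

		write(sv[0], buf, MSG_SZ);
		poll(&pfd, 1, -1);	/* goes through sock_poll() */
		read(sv[0], buf, MSG_SZ);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%d round trips in %.3f s (%.0f msgs/s)\n",
	       N_MSGS, secs, N_MSGS / secs);

	waitpid(pid, NULL, 0);
	return 0;
}
```

Comparing the total runtime of something like this with CONFIG_NET_RX_BUSY_POLL on and off is the comparison being requested, rather than the hottest-instruction percentages from a perf annotation.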
On Wed, Jul 24, 2024 at 12:28 AM Eric Dumazet <edumazet@google.com> wrote: > > On Tue, Jul 23, 2024 at 6:01 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > On Tue, Jul 23, 2024 at 11:26 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Tue, Jul 23, 2024 at 5:13 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > On Tue, Jul 23, 2024 at 11:09 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > > > On Tue, Jul 23, 2024 at 10:57 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > > > > On Tue, Jul 23, 2024 at 3:57 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > > > > > > > From: Jason Xing <kernelxing@tencent.com> > > > > > > > > > > > > > > When I was doing performance test on unix_poll(), I found out that > > > > > > > accessing sk->sk_ll_usec when calling sock_poll()->sk_can_busy_loop() > > > > > > > occupies too much time, which causes around 16% degradation. So I > > > > > > > decided to turn off this config, which cannot be done apparently > > > > > > > before this patch. > > > > > > > > > > > > Too many CONFIG_ options, distros will enable it anyway. > > > > > > > > > > > > In my builds, offset of sk_ll_usec is 0xe8. > > > > > > > > > > > > Are you using some debug options or an old tree ? > > > > > > > > I forgot to say: I'm running the latest kernel which I pulled around > > > > two hours ago. Whatever kind of configs with/without debug options I > > > > use, I can still reproduce it. > > > > > > Ok, please post : > > > > > > pahole --hex -C sock vmlinux > > > > 1) Enable the config: > > $ pahole --hex -C sock vmlinux > > struct sock { > > struct sock_common __sk_common; /* 0 0x88 */ > > /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */ > > __u8 > > __cacheline_group_begin__sock_write_rx[0]; /* 0x88 0 */ > > atomic_t sk_drops; /* 0x88 0x4 */ > > __s32 sk_peek_off; /* 0x8c 0x4 */ > > struct sk_buff_head sk_error_queue; /* 0x90 0x18 */ > > struct sk_buff_head sk_receive_queue; /* 0xa8 0x18 */ > > /* --- cacheline 3 boundary (192 bytes) --- */ > > struct { > > atomic_t rmem_alloc; /* 0xc0 0x4 */ > > int len; /* 0xc4 0x4 */ > > struct sk_buff * head; /* 0xc8 0x8 */ > > struct sk_buff * tail; /* 0xd0 0x8 */ > > } sk_backlog; /* 0xc0 0x18 */ > > __u8 > > __cacheline_group_end__sock_write_rx[0]; /* 0xd8 0 */ > > __u8 > > __cacheline_group_begin__sock_read_rx[0]; /* 0xd8 0 */ > > struct dst_entry * sk_rx_dst; /* 0xd8 0x8 */ > > int sk_rx_dst_ifindex; /* 0xe0 0x4 */ > > u32 sk_rx_dst_cookie; /* 0xe4 0x4 */ > > unsigned int sk_ll_usec; /* 0xe8 0x4 */ > > See here ? offset of sk_ll_usec is 0xe8, not 0x104 as you posted. Oh, so sorry. My fault. I remembered only that perf record was executed in an old tree before you optimise the layout of struct sock. Then I found out if I disable the config applying to the latest tree running in my virtual machine, the result is better. So let me find a physical server to run the latest kernel and will get back more accurate information of 'perf record' here. > > Do not blindly trust perf here. > > Please run a benchmark with 1,000,000 af_unix messages being sent and received. > > I am guessing your patch makes no difference at all (certainly not 16 > % as claimed in your changelog) The fact is the performance would improve when I disable the config if I only test unix_poll related paths. The time spent can decrease from 2.45 to 2.05 which is 16%. As I said, it can be easily reproduced. Thanks, Jason
On Wed, Jul 24, 2024 at 8:38 AM Jason Xing <kerneljasonxing@gmail.com> wrote: > > On Wed, Jul 24, 2024 at 12:28 AM Eric Dumazet <edumazet@google.com> wrote: > > > > On Tue, Jul 23, 2024 at 6:01 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > On Tue, Jul 23, 2024 at 11:26 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > On Tue, Jul 23, 2024 at 5:13 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > > > On Tue, Jul 23, 2024 at 11:09 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > > > > > On Tue, Jul 23, 2024 at 10:57 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > > > > > > On Tue, Jul 23, 2024 at 3:57 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > > > > > > > > > From: Jason Xing <kernelxing@tencent.com> > > > > > > > > > > > > > > > > When I was doing performance test on unix_poll(), I found out that > > > > > > > > accessing sk->sk_ll_usec when calling sock_poll()->sk_can_busy_loop() > > > > > > > > occupies too much time, which causes around 16% degradation. So I > > > > > > > > decided to turn off this config, which cannot be done apparently > > > > > > > > before this patch. > > > > > > > > > > > > > > Too many CONFIG_ options, distros will enable it anyway. > > > > > > > > > > > > > > In my builds, offset of sk_ll_usec is 0xe8. > > > > > > > > > > > > > > Are you using some debug options or an old tree ? > > > > > > > > > > I forgot to say: I'm running the latest kernel which I pulled around > > > > > two hours ago. Whatever kind of configs with/without debug options I > > > > > use, I can still reproduce it. > > > > > > > > Ok, please post : > > > > > > > > pahole --hex -C sock vmlinux > > > > > > 1) Enable the config: > > > $ pahole --hex -C sock vmlinux > > > struct sock { > > > struct sock_common __sk_common; /* 0 0x88 */ > > > /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */ > > > __u8 > > > __cacheline_group_begin__sock_write_rx[0]; /* 0x88 0 */ > > > atomic_t sk_drops; /* 0x88 0x4 */ > > > __s32 sk_peek_off; /* 0x8c 0x4 */ > > > struct sk_buff_head sk_error_queue; /* 0x90 0x18 */ > > > struct sk_buff_head sk_receive_queue; /* 0xa8 0x18 */ > > > /* --- cacheline 3 boundary (192 bytes) --- */ > > > struct { > > > atomic_t rmem_alloc; /* 0xc0 0x4 */ > > > int len; /* 0xc4 0x4 */ > > > struct sk_buff * head; /* 0xc8 0x8 */ > > > struct sk_buff * tail; /* 0xd0 0x8 */ > > > } sk_backlog; /* 0xc0 0x18 */ > > > __u8 > > > __cacheline_group_end__sock_write_rx[0]; /* 0xd8 0 */ > > > __u8 > > > __cacheline_group_begin__sock_read_rx[0]; /* 0xd8 0 */ > > > struct dst_entry * sk_rx_dst; /* 0xd8 0x8 */ > > > int sk_rx_dst_ifindex; /* 0xe0 0x4 */ > > > u32 sk_rx_dst_cookie; /* 0xe4 0x4 */ > > > unsigned int sk_ll_usec; /* 0xe8 0x4 */ > > > > See here ? offset of sk_ll_usec is 0xe8, not 0x104 as you posted. > > Oh, so sorry. My fault. I remembered only that perf record was > executed in an old tree before you optimise the layout of struct sock. > Then I found out if I disable the config applying to the latest tree > running in my virtual machine, the result is better. So let me find a > physical server to run the latest kernel and will get back more > accurate information of 'perf record' here. > > > > > Do not blindly trust perf here. > > > > Please run a benchmark with 1,000,000 af_unix messages being sent and received. 
> >
> > I am guessing your patch makes no difference at all (certainly not 16
> > % as claimed in your changelog)
>
> The fact is the performance would improve when I disable the config if
> I only test unix_poll related paths. The time spent can decrease from
> 2.45 to 2.05 which is 16%. As I said, it can be easily reproduced.

To prove that accessing the sk_ll_usec field can cause a performance
issue, I removed only the lines below, with CONFIG_NET_RX_BUSY_POLL
still enabled:

diff --git a/net/socket.c b/net/socket.c
index fcbdd5bc47ac..74a730330a01 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1392,20 +1392,11 @@ static __poll_t sock_poll(struct file *file, poll_table *wait)
 {
        struct socket *sock = file->private_data;
        const struct proto_ops *ops = READ_ONCE(sock->ops);
-       __poll_t events = poll_requested_events(wait), flag = 0;
+       __poll_t flag = 0;

        if (!ops->poll)
                return 0;

-       if (sk_can_busy_loop(sock->sk)) {
-               /* poll once if requested by the syscall */
-               if (events & POLL_BUSY_LOOP)
-                       sk_busy_loop(sock->sk, 1);
-
-               /* if this socket can poll_ll, tell the system call */
-               flag = POLL_BUSY_LOOP;
-       }
-
        return ops->poll(file, sock, wait) | flag;
 }

With those lines removed, the measured time decreases to ~2.1.

>
> Thanks,
> Jason
On Wed, Jul 24, 2024 at 8:38 AM Jason Xing <kerneljasonxing@gmail.com> wrote: > > On Wed, Jul 24, 2024 at 12:28 AM Eric Dumazet <edumazet@google.com> wrote: > > > > On Tue, Jul 23, 2024 at 6:01 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > On Tue, Jul 23, 2024 at 11:26 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > On Tue, Jul 23, 2024 at 5:13 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > > > On Tue, Jul 23, 2024 at 11:09 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > > > > > On Tue, Jul 23, 2024 at 10:57 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > > > > > > On Tue, Jul 23, 2024 at 3:57 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > > > > > > > > > From: Jason Xing <kernelxing@tencent.com> > > > > > > > > > > > > > > > > When I was doing performance test on unix_poll(), I found out that > > > > > > > > accessing sk->sk_ll_usec when calling sock_poll()->sk_can_busy_loop() > > > > > > > > occupies too much time, which causes around 16% degradation. So I > > > > > > > > decided to turn off this config, which cannot be done apparently > > > > > > > > before this patch. > > > > > > > > > > > > > > Too many CONFIG_ options, distros will enable it anyway. > > > > > > > > > > > > > > In my builds, offset of sk_ll_usec is 0xe8. > > > > > > > > > > > > > > Are you using some debug options or an old tree ? > > > > > > > > > > I forgot to say: I'm running the latest kernel which I pulled around > > > > > two hours ago. Whatever kind of configs with/without debug options I > > > > > use, I can still reproduce it. > > > > > > > > Ok, please post : > > > > > > > > pahole --hex -C sock vmlinux > > > > > > 1) Enable the config: > > > $ pahole --hex -C sock vmlinux > > > struct sock { > > > struct sock_common __sk_common; /* 0 0x88 */ > > > /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */ > > > __u8 > > > __cacheline_group_begin__sock_write_rx[0]; /* 0x88 0 */ > > > atomic_t sk_drops; /* 0x88 0x4 */ > > > __s32 sk_peek_off; /* 0x8c 0x4 */ > > > struct sk_buff_head sk_error_queue; /* 0x90 0x18 */ > > > struct sk_buff_head sk_receive_queue; /* 0xa8 0x18 */ > > > /* --- cacheline 3 boundary (192 bytes) --- */ > > > struct { > > > atomic_t rmem_alloc; /* 0xc0 0x4 */ > > > int len; /* 0xc4 0x4 */ > > > struct sk_buff * head; /* 0xc8 0x8 */ > > > struct sk_buff * tail; /* 0xd0 0x8 */ > > > } sk_backlog; /* 0xc0 0x18 */ > > > __u8 > > > __cacheline_group_end__sock_write_rx[0]; /* 0xd8 0 */ > > > __u8 > > > __cacheline_group_begin__sock_read_rx[0]; /* 0xd8 0 */ > > > struct dst_entry * sk_rx_dst; /* 0xd8 0x8 */ > > > int sk_rx_dst_ifindex; /* 0xe0 0x4 */ > > > u32 sk_rx_dst_cookie; /* 0xe4 0x4 */ > > > unsigned int sk_ll_usec; /* 0xe8 0x4 */ > > > > See here ? offset of sk_ll_usec is 0xe8, not 0x104 as you posted. > > Oh, so sorry. My fault. I remembered only that perf record was > executed in an old tree before you optimise the layout of struct sock. > Then I found out if I disable the config applying to the latest tree > running in my virtual machine, the result is better. So let me find a > physical server to run the latest kernel and will get back more > accurate information of 'perf record' here. Now I'm back. 
The perf output when running the latest kernel on the virtual server
looks like this:
       │
       │     static inline bool sk_can_busy_loop(const struct sock *sk)
       │     {
       │             return READ_ONCE(sk->sk_ll_usec) && !signal_pending(current);
       │       mov    0xe8(%rdx),%ebp
 55.71 │       test   %ebp,%ebp
       │     ↓ jne    62
       │     sock_poll():

The command I used:
perf record -g -e cycles:k -F 999 -o tk5_select10.data -- ./bin-x86_64/select -E -C 200 -L -S -W -M -N "select_10" -n 100 -B 500

On the physical server, the perf output looks like this:
       │     ↓ je     e1
       │       mov    0x18(%r13),%rdx
  0.03 │       mov    %rsi,%rbx
  0.00 │       mov    %rdi,%r12
       │       mov    0xe8(%rdx),%r14d
 26.48 │       test   %r14d,%r14d

An interesting thing I found is that the delta on the physical server
is smaller than on the virtual server:
              original kernel   access of sk_ll_usec removed
  physical:   2.26              2.08  (delta is 8.4%)
  virtual:    2.45              2.05  (delta is ~16%)

I'm still confused about how reading sk_ll_usec can cause such a
performance degradation.

Eric, may I ask if you have more ideas/suggestions about this one?

Thanks,
Jason

>
> >
> > Do not blindly trust perf here.
> >
> > Please run a benchmark with 1,000,000 af_unix messages being sent and received.
> >
> > I am guessing your patch makes no difference at all (certainly not 16
> > % as claimed in your changelog)
>
> The fact is the performance would improve when I disable the config if
> I only test unix_poll related paths. The time spent can decrease from
> 2.45 to 2.05 which is 16%. As I said, it can be easily reproduced.
>
> Thanks,
> Jason
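For reference, the annotations above can be reproduced from the recorded data with the standard perf viewers; the file name is the one used in the record command above:

```sh
perf report   -i tk5_select10.data --stdio            # call-graph / symbol overview
perf annotate -i tk5_select10.data --stdio sock_poll  # per-instruction view of sock_poll()
```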
On Wed, Jul 24, 2024 at 9:33 AM Jason Xing <kerneljasonxing@gmail.com> wrote: > > On Wed, Jul 24, 2024 at 8:38 AM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > On Wed, Jul 24, 2024 at 12:28 AM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Tue, Jul 23, 2024 at 6:01 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > On Tue, Jul 23, 2024 at 11:26 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > > On Tue, Jul 23, 2024 at 5:13 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > > > > > On Tue, Jul 23, 2024 at 11:09 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > > > > > > > On Tue, Jul 23, 2024 at 10:57 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > > > > > > > > On Tue, Jul 23, 2024 at 3:57 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > > > > > > > > > > > From: Jason Xing <kernelxing@tencent.com> > > > > > > > > > > > > > > > > > > When I was doing performance test on unix_poll(), I found out that > > > > > > > > > accessing sk->sk_ll_usec when calling sock_poll()->sk_can_busy_loop() > > > > > > > > > occupies too much time, which causes around 16% degradation. So I > > > > > > > > > decided to turn off this config, which cannot be done apparently > > > > > > > > > before this patch. > > > > > > > > > > > > > > > > Too many CONFIG_ options, distros will enable it anyway. > > > > > > > > > > > > > > > > In my builds, offset of sk_ll_usec is 0xe8. > > > > > > > > > > > > > > > > Are you using some debug options or an old tree ? > > > > > > > > > > > > I forgot to say: I'm running the latest kernel which I pulled around > > > > > > two hours ago. Whatever kind of configs with/without debug options I > > > > > > use, I can still reproduce it. > > > > > > > > > > Ok, please post : > > > > > > > > > > pahole --hex -C sock vmlinux > > > > > > > > 1) Enable the config: > > > > $ pahole --hex -C sock vmlinux > > > > struct sock { > > > > struct sock_common __sk_common; /* 0 0x88 */ > > > > /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */ > > > > __u8 > > > > __cacheline_group_begin__sock_write_rx[0]; /* 0x88 0 */ > > > > atomic_t sk_drops; /* 0x88 0x4 */ > > > > __s32 sk_peek_off; /* 0x8c 0x4 */ > > > > struct sk_buff_head sk_error_queue; /* 0x90 0x18 */ > > > > struct sk_buff_head sk_receive_queue; /* 0xa8 0x18 */ > > > > /* --- cacheline 3 boundary (192 bytes) --- */ > > > > struct { > > > > atomic_t rmem_alloc; /* 0xc0 0x4 */ > > > > int len; /* 0xc4 0x4 */ > > > > struct sk_buff * head; /* 0xc8 0x8 */ > > > > struct sk_buff * tail; /* 0xd0 0x8 */ > > > > } sk_backlog; /* 0xc0 0x18 */ > > > > __u8 > > > > __cacheline_group_end__sock_write_rx[0]; /* 0xd8 0 */ > > > > __u8 > > > > __cacheline_group_begin__sock_read_rx[0]; /* 0xd8 0 */ > > > > struct dst_entry * sk_rx_dst; /* 0xd8 0x8 */ > > > > int sk_rx_dst_ifindex; /* 0xe0 0x4 */ > > > > u32 sk_rx_dst_cookie; /* 0xe4 0x4 */ > > > > unsigned int sk_ll_usec; /* 0xe8 0x4 */ > > > > > > See here ? offset of sk_ll_usec is 0xe8, not 0x104 as you posted. > > > > Oh, so sorry. My fault. I remembered only that perf record was > > executed in an old tree before you optimise the layout of struct sock. > > Then I found out if I disable the config applying to the latest tree > > running in my virtual machine, the result is better. So let me find a > > physical server to run the latest kernel and will get back more > > accurate information of 'perf record' here. > > Now I'm back. 
The same output of perf when running the latest kernel > on the virtual server goes like this: > │ > │ static inline bool sk_can_busy_loop(const struct sock *sk) > │ { > │ return READ_ONCE(sk->sk_ll_usec) && !signal_pending(current); > │ mov 0xe8(%rdx),%ebp > 55.71 │ test %ebp,%ebp > │ ↓ jne 62 > │ sock_poll(): > command I used: perf record -g -e cycles:k -F 999 -o tk5_select10.data > -- ./bin-x86_64/select -E -C 200 -L -S -W -M -N "select_10" -n 100 -B > 500 > > If it's running on the physical server, the perf output is like this: > │ ↓ je e1 > │ mov 0x18(%r13),%rdx > 0.03 │ mov %rsi,%rbx > 0.00 │ mov %rdi,%r12 > │ mov 0xe8(%rdx),%r14d > 26.48 │ test %r14d,%r14d > > What a interesting thing I found is that running on the physical > server the delta output is better than on the virtual server: > original kernel, remove access of sk_ll_usec > physical: 2.26, 2.08 (delta is 8.4%) > virtual: 2.45, 2.05 (delta is ~16%) > > I'm still confused about reading this sk_ll_usec can cause such a > performance degradation situation. > > Eric, may I ask if you have more ideas/suggestions about this one? > We do not micro-optimize based on 'perf' reports, because of artifacts. Please run a full workload, sending/receiving 1,000,000 messages and report the time difference, not on a precise function but the whole workload. Again, I am guessing there will be no difference, because the cache line is needed anyway. Please make sure to run the latest kernels, this will avoid you discovering issues that have been already fixed.
On Wed, Jul 24, 2024 at 4:54 PM Eric Dumazet <edumazet@google.com> wrote: > > On Wed, Jul 24, 2024 at 9:33 AM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > On Wed, Jul 24, 2024 at 8:38 AM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > On Wed, Jul 24, 2024 at 12:28 AM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > On Tue, Jul 23, 2024 at 6:01 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > > > On Tue, Jul 23, 2024 at 11:26 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > > > > On Tue, Jul 23, 2024 at 5:13 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > > > > > > > On Tue, Jul 23, 2024 at 11:09 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > > > > > > > > > On Tue, Jul 23, 2024 at 10:57 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > > > > > > > > > > On Tue, Jul 23, 2024 at 3:57 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > From: Jason Xing <kernelxing@tencent.com> > > > > > > > > > > > > > > > > > > > > When I was doing performance test on unix_poll(), I found out that > > > > > > > > > > accessing sk->sk_ll_usec when calling sock_poll()->sk_can_busy_loop() > > > > > > > > > > occupies too much time, which causes around 16% degradation. So I > > > > > > > > > > decided to turn off this config, which cannot be done apparently > > > > > > > > > > before this patch. > > > > > > > > > > > > > > > > > > Too many CONFIG_ options, distros will enable it anyway. > > > > > > > > > > > > > > > > > > In my builds, offset of sk_ll_usec is 0xe8. > > > > > > > > > > > > > > > > > > Are you using some debug options or an old tree ? > > > > > > > > > > > > > > I forgot to say: I'm running the latest kernel which I pulled around > > > > > > > two hours ago. Whatever kind of configs with/without debug options I > > > > > > > use, I can still reproduce it. > > > > > > > > > > > > Ok, please post : > > > > > > > > > > > > pahole --hex -C sock vmlinux > > > > > > > > > > 1) Enable the config: > > > > > $ pahole --hex -C sock vmlinux > > > > > struct sock { > > > > > struct sock_common __sk_common; /* 0 0x88 */ > > > > > /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */ > > > > > __u8 > > > > > __cacheline_group_begin__sock_write_rx[0]; /* 0x88 0 */ > > > > > atomic_t sk_drops; /* 0x88 0x4 */ > > > > > __s32 sk_peek_off; /* 0x8c 0x4 */ > > > > > struct sk_buff_head sk_error_queue; /* 0x90 0x18 */ > > > > > struct sk_buff_head sk_receive_queue; /* 0xa8 0x18 */ > > > > > /* --- cacheline 3 boundary (192 bytes) --- */ > > > > > struct { > > > > > atomic_t rmem_alloc; /* 0xc0 0x4 */ > > > > > int len; /* 0xc4 0x4 */ > > > > > struct sk_buff * head; /* 0xc8 0x8 */ > > > > > struct sk_buff * tail; /* 0xd0 0x8 */ > > > > > } sk_backlog; /* 0xc0 0x18 */ > > > > > __u8 > > > > > __cacheline_group_end__sock_write_rx[0]; /* 0xd8 0 */ > > > > > __u8 > > > > > __cacheline_group_begin__sock_read_rx[0]; /* 0xd8 0 */ > > > > > struct dst_entry * sk_rx_dst; /* 0xd8 0x8 */ > > > > > int sk_rx_dst_ifindex; /* 0xe0 0x4 */ > > > > > u32 sk_rx_dst_cookie; /* 0xe4 0x4 */ > > > > > unsigned int sk_ll_usec; /* 0xe8 0x4 */ > > > > > > > > See here ? offset of sk_ll_usec is 0xe8, not 0x104 as you posted. > > > > > > Oh, so sorry. My fault. I remembered only that perf record was > > > executed in an old tree before you optimise the layout of struct sock. 
> > > Then I found out if I disable the config applying to the latest tree > > > running in my virtual machine, the result is better. So let me find a > > > physical server to run the latest kernel and will get back more > > > accurate information of 'perf record' here. > > > > Now I'm back. The same output of perf when running the latest kernel > > on the virtual server goes like this: > > │ > > │ static inline bool sk_can_busy_loop(const struct sock *sk) > > │ { > > │ return READ_ONCE(sk->sk_ll_usec) && !signal_pending(current); > > │ mov 0xe8(%rdx),%ebp > > 55.71 │ test %ebp,%ebp > > │ ↓ jne 62 > > │ sock_poll(): > > command I used: perf record -g -e cycles:k -F 999 -o tk5_select10.data > > -- ./bin-x86_64/select -E -C 200 -L -S -W -M -N "select_10" -n 100 -B > > 500 > > > > If it's running on the physical server, the perf output is like this: > > │ ↓ je e1 > > │ mov 0x18(%r13),%rdx > > 0.03 │ mov %rsi,%rbx > > 0.00 │ mov %rdi,%r12 > > │ mov 0xe8(%rdx),%r14d > > 26.48 │ test %r14d,%r14d > > > > What a interesting thing I found is that running on the physical > > server the delta output is better than on the virtual server: > > original kernel, remove access of sk_ll_usec > > physical: 2.26, 2.08 (delta is 8.4%) > > virtual: 2.45, 2.05 (delta is ~16%) > > > > I'm still confused about reading this sk_ll_usec can cause such a > > performance degradation situation. > > > > Eric, may I ask if you have more ideas/suggestions about this one? > > > > We do not micro-optimize based on 'perf' reports, because of artifacts. Sure, I know this. The reason why I use perf to observe is that I found performance degradation between 5.x and the latest kernel. Then I started to look into the sock_poll() and unix_poll(). It turns out that some accesses of members can consume more time than expected. > > Please run a full workload, sending/receiving 1,000,000 messages and report > the time difference, not on a precise function but the whole workload. Okay. > > Again, I am guessing there will be no difference, because the cache > line is needed anyway. To conclude from the theory of the layout, I agree that I cannot see any better method to improve. > > Please make sure to run the latest kernels, this will avoid you > discovering issues that have been already fixed. Sure, I did it based on the latest kernel as my previous emails said. Without accessing sk_ll_usec, the performance is better. Anyway, thanks so much for your help! Thanks, Jason
diff --git a/net/Kconfig b/net/Kconfig
index d27d0deac0bf..1f1b793984fe 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -335,8 +335,10 @@ config CGROUP_NET_CLASSID
          being used in cls_cgroup and for netfilter matching.

 config NET_RX_BUSY_POLL
-       bool
+       bool "Low latency busy poll timeout"
        default y if !PREEMPT_RT || (PREEMPT_RT && !NETCONSOLE)
+       help
+         Approximate time in us to spin waiting for packets on the device queue.

 config BQL
        bool