Message ID | 20250305034550.879255-2-edumazet@google.com (mailing list archive) |
---|---|
State | Accepted |
Commit | 9544d60a2605d1500cf5c3e331a76b9eaf4538c9 |
Delegated to | Netdev Maintainers |
Series | tcp: even faster connect() under stress |
From: Eric Dumazet <edumazet@google.com>
Date: Wed, 5 Mar 2025 03:45:49 +0000

In order to speed up __inet_hash_connect(), we want to ensure hash values
for <source address, port X, destination address, destination port>
are not randomly spread, but monotonically increasing.

The goal is to allow __inet_hash_connect() to derive the hash value
of a candidate 4-tuple with a single addition, in the following
patch in the series.

Given:

  hash_0     = inet_ehashfn(saddr, 0, daddr, dport)
  hash_sport = inet_ehashfn(saddr, sport, daddr, dport)

then (hash_sport == hash_0 + sport) for all sport values.

As far as I know, there is no security implication with this change.

After this patch, when __inet_hash_connect() has to try many candidates,
the hash table buckets are contiguous and packed, allowing
better use of CPU caches and hardware prefetchers.

Tested:

Server: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog
Client: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server

Before this patch:

utime_start=0.271607
utime_end=3.847111
stime_start=18.407684
stime_end=1997.485557
num_transactions=1350742
latency_min=0.014131929
latency_max=17.895073144
latency_mean=0.505675853
latency_stddev=2.125164772
num_samples=307884
throughput=139866.80

perf top on client:

 56.86% [kernel] [k] __inet6_check_established
 17.96% [kernel] [k] __inet_hash_connect
 13.88% [kernel] [k] inet6_ehashfn
  2.52% [kernel] [k] rcu_all_qs
  2.01% [kernel] [k] __cond_resched
  0.41% [kernel] [k] _raw_spin_lock

After this patch:

utime_start=0.286131
utime_end=4.378886
stime_start=11.952556
stime_end=1991.655533
num_transactions=1446830
latency_min=0.001061085
latency_max=12.075275028
latency_mean=0.376375302
latency_stddev=1.361969596
num_samples=306383
throughput=151866.56

perf top:

 50.01% [kernel] [k] __inet6_check_established
 20.65% [kernel] [k] __inet_hash_connect
 15.81% [kernel] [k] inet6_ehashfn
  2.92% [kernel] [k] rcu_all_qs
  2.34% [kernel] [k] __cond_resched
  0.50% [kernel] [k] _raw_spin_lock
  0.34% [kernel] [k] sched_balance_trigger
  0.24% [kernel] [k] queued_spin_lock_slowpath

There is indeed an increase in throughput and a reduction in latency.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
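To make the claimed identity concrete, here is a small user-space sketch of the patched computation. This is not the kernel code: mix32() below is a hypothetical stand-in for the secret-keyed, jhash-based __inet_ehashfn(), and the addresses, ports and secret are arbitrary. The only point is that once the local port is added outside the keyed mix, hash_sport == hash_0 + sport holds by construction:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's secret-keyed __inet_ehashfn(). */
static uint32_t mix32(uint32_t laddr, uint32_t faddr, uint16_t fport,
                      uint32_t secret)
{
        uint32_t h = laddr ^ faddr ^ fport ^ secret;

        h *= 0x9e3779b1;        /* any fixed mixing works for this demo */
        h ^= h >> 16;
        return h;
}

/* Model of inet_ehashfn() after this patch: lport is added outside the mix. */
static uint32_t model_ehashfn(uint32_t laddr, uint16_t lport,
                              uint32_t faddr, uint16_t fport, uint32_t secret)
{
        return lport + mix32(laddr, faddr, fport, secret);
}

int main(void)
{
        uint32_t secret = 0x12345678;   /* stands in for inet_ehash_secret */
        uint32_t saddr = 0x0a000001, daddr = 0x0a000002;
        uint16_t dport = 443;
        uint32_t hash_0 = model_ehashfn(saddr, 0, daddr, dport, secret);

        for (uint16_t sport = 32768; sport < 32773; sport++) {
                uint32_t h = model_ehashfn(saddr, sport, daddr, dport, secret);

                printf("sport=%u hash=%#x hash_0+sport=%#x\n",
                       (unsigned)sport, (unsigned)h,
                       (unsigned)(hash_0 + sport));
        }
        return 0;
}

For every source port tried, the two printed values match, which is exactly the property the following patch in the series is meant to exploit.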
On Wed, Mar 5, 2025 at 11:46 AM Eric Dumazet <edumazet@google.com> wrote:
>
> In order to speed up __inet_hash_connect(), we want to ensure hash values
> for <source address, port X, destination address, destination port>
> are not randomly spread, but monotonically increasing.
>
> The goal is to allow __inet_hash_connect() to derive the hash value
> of a candidate 4-tuple with a single addition, in the following
> patch in the series.
>
> Given:
>
>   hash_0     = inet_ehashfn(saddr, 0, daddr, dport)
>   hash_sport = inet_ehashfn(saddr, sport, daddr, dport)
>
> then (hash_sport == hash_0 + sport) for all sport values.
>
> As far as I know, there is no security implication with this change.

Good to know this. The moment I read the first paragraph, I was
wondering whether it might bring a potential risk.

Sorry that I hesitate to bring up one question: could this new
algorithm result in sockets concentrating in a few buckets instead of
being sufficiently dispersed, as before? The good news is that I
tested other cases like TCP_CRR and saw no performance degradation,
but they did not cover the case of one client establishing connections
to many different servers.

> After this patch, when __inet_hash_connect() has to try many candidates,
> the hash table buckets are contiguous and packed, allowing
> better use of CPU caches and hardware prefetchers.
>
> [quoted neper test setup and results snipped]
>
> There is indeed an increase in throughput and a reduction in latency.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

Throughput goes from 12829 to 26072 - a 103% increase, which is
striking!

Thanks,
Jason
On Thu, Mar 6, 2025 at 8:54 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Wed, Mar 5, 2025 at 11:46 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > [quoted patch description snipped]
> >
> > As far as I know, there is no security implication with this change.
>
> Good to know this. The moment I read the first paragraph, I was
> wondering whether it might bring a potential risk.
>
> Sorry that I hesitate to bring up one question: could this new
> algorithm result in sockets concentrating in a few buckets instead of
> being sufficiently dispersed, as before?

As I said, I see no difference for servers, since their sport is a fixed value.

What matters for them is the hash contribution of the remote address and port,
because the server port is usually well known.

This change does not change the hash distribution; an attacker will not be able
to target a particular bucket.

> The good news is that I tested other cases like TCP_CRR and saw no
> performance degradation, but they did not cover the case of one client
> establishing connections to many different servers.
>
> [remainder of quoted patch description, test results and tags snipped]
On Thu, Mar 6, 2025 at 4:14 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Mar 6, 2025 at 8:54 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > [quoted patch description snipped]
> >
> > Sorry that I hesitate to bring up one question: could this new
> > algorithm result in sockets concentrating in a few buckets instead of
> > being sufficiently dispersed, as before?
>
> As I said, I see no difference for servers, since their sport is a fixed value.
>
> What matters for them is the hash contribution of the remote address and port,
> because the server port is usually well known.
>
> This change does not change the hash distribution; an attacker will not be able
> to target a particular bucket.

Point taken. Thank you very much for the explanation.

Thanks,
Jason

> > [remainder of quoted patch description, test results and tags snipped]
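A toy user-space model of the distribution question discussed above (it is not a measurement of the kernel hash): for a fixed destination tuple, the candidate source ports now occupy a contiguous run of buckets, but because the secret-keyed part of the hash still varies per destination, overall bucket occupancy stays roughly even. Here rand() merely stands in for that secret-keyed contribution, and the bucket count is arbitrary:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BUCKETS 1024

int main(void)
{
        static unsigned int count[BUCKETS];
        unsigned int min = ~0u, max = 0;

        srand(1);
        /* 10000 destination tuples, each using 64 consecutive source ports. */
        for (int tuple = 0; tuple < 10000; tuple++) {
                uint32_t hash_0 = (uint32_t)rand();     /* secret-keyed part */

                for (uint32_t sport = 0; sport < 64; sport++)
                        count[(hash_0 + sport) % BUCKETS]++;
        }
        for (int i = 0; i < BUCKETS; i++) {
                if (count[i] < min)
                        min = count[i];
                if (count[i] > max)
                        max = count[i];
        }
        printf("expected per bucket: %u, min: %u, max: %u\n",
               10000u * 64 / BUCKETS, min, max);
        return 0;
}

The per-bucket counts should stay close to the expected 625, i.e. no bucket is systematically favored, even though each individual destination now fills a contiguous run of buckets.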
Hello,

kernel test robot noticed a 26.0% improvement of stress-ng.sockmany.ops_per_sec on:

commit: 265acc444f8a96246e9d42b54b6931d078034218 ("[PATCH net-next 1/2] inet: change lport contribution to inet_ehashfn() and inet6_ehashfn()")
url: https://github.com/intel-lab-lkp/linux/commits/Eric-Dumazet/inet-change-lport-contribution-to-inet_ehashfn-and-inet6_ehashfn/20250305-114734
base: https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git f252f23ab657cd224cb8334ba69966396f3f629b
patch link: https://lore.kernel.org/all/20250305034550.879255-2-edumazet@google.com/
patch subject: [PATCH net-next 1/2] inet: change lport contribution to inet_ehashfn() and inet6_ehashfn()

testcase: stress-ng
config: x86_64-rhel-9.4
compiler: gcc-12
test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
parameters:

	nr_threads: 100%
	testtime: 60s
	test: sockmany
	cpufreq_governor: performance

In addition to that, the commit also has significant impact on the following tests:

+------------------+---------------------------------------------------------------------------------------------+
| testcase: change | stress-ng: stress-ng.sockmany.ops_per_sec 4.4% improvement                                  |
| test machine     | 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 256G memory |
| test parameters  | cpufreq_governor=performance                                                                 |
|                  | nr_threads=100%                                                                              |
|                  | test=sockmany                                                                                |
|                  | testtime=60s                                                                                 |
+------------------+---------------------------------------------------------------------------------------------+

Details are as below:
-------------------------------------------------------------------------------------------------->

The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250317/202503171623.f2e16b60-lkp@intel.com

=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
  gcc-12/performance/x86_64-rhel-9.4/100%/debian-12-x86_64-20240206.cgz/lkp-icl-2sp8/sockmany/stress-ng/60s

commit:
  f252f23ab6 ("net: Prevent use after free in netif_napi_set_irq_locked()")
  265acc444f ("inet: change lport contribution to inet_ehashfn() and inet6_ehashfn()")

f252f23ab657cd22 265acc444f8a96246e9d42b54b6
---------------- ---------------------------
         %stddev     %change         %stddev
             \          |                \
      0.60 ±  6%      +0.2        0.75 ±  6%  mpstat.cpu.all.soft%
    376850 ±  9%     +15.7%     436068 ±  9%  numa-numastat.node0.local_node
    376612 ±  9%     +15.8%     435968 ±  9%  numa-vmstat.node0.numa_local
     54708           +22.0%      66753 ±  2%  vmstat.system.cs
      2308         +1167.7%      29267 ± 26%  perf-c2c.HITM.local
      2499         +1078.3%      29447 ± 26%  perf-c2c.HITM.total
      1413 ±  8%     -13.8%       1218 ±  4%  sched_debug.cfs_rq:/.runnable_avg.max
     28302           +21.2%      34303 ±  2%  sched_debug.cpu.nr_switches.avg
     39625 ±  6%     +63.4%      64761 ±  6%  sched_debug.cpu.nr_switches.max
      4170 ±  9%    +126.1%       9429 ±  8%  sched_debug.cpu.nr_switches.stddev
   1606932           +25.9%    2023746 ±  3%  stress-ng.sockmany.ops
     26687           +26.0%      33624 ±  3%  stress-ng.sockmany.ops_per_sec
   1561801           +28.1%    2000939 ±  3%  stress-ng.time.involuntary_context_switches
   1731525           +22.3%    2118259 ±  2%  stress-ng.time.voluntary_context_switches
     84783            +2.6%      86953        proc-vmstat.nr_shmem
      5339 ±  6%     -26.4%       3931 ± 16%  proc-vmstat.numa_hint_faults_local
    878479            +6.8%     937819        proc-vmstat.numa_hit
    812262            +7.3%     871615        proc-vmstat.numa_local
   2550690           +12.5%    2870404        proc-vmstat.pgalloc_normal
   2407108           +13.2%    2724922        proc-vmstat.pgfree
     21.96           -17.2%      18.18 ±  2%  perf-stat.i.MPKI
 7.517e+09           +18.8%  8.933e+09        perf-stat.i.branch-instructions
      2.70            -0.7        1.96        perf-stat.i.branch-miss-rate%
  2.03e+08           -13.1%  1.765e+08        perf-stat.i.branch-misses
     60.22            -2.3       57.89 ±  2%  perf-stat.i.cache-miss-rate%
 1.472e+09            +4.7%  1.542e+09        perf-stat.i.cache-references
     56669           +22.3%      69301 ±  2%  perf-stat.i.context-switches
      5.56           -18.4%       4.53 ±  2%  perf-stat.i.cpi
  4.24e+10           +19.2%  5.054e+10        perf-stat.i.instructions
      0.20           +20.1%       0.24 ±  4%  perf-stat.i.ipc
      0.49           +21.0%       0.60 ±  8%  perf-stat.i.metric.K/sec
     21.03           -15.1%      17.85        perf-stat.overall.MPKI
      2.70            -0.7        1.98        perf-stat.overall.branch-miss-rate%
     60.56            -2.1       58.49        perf-stat.overall.cache-miss-rate%
      5.34           -16.6%       4.45        perf-stat.overall.cpi
    253.77            -1.7%     249.50        perf-stat.overall.cycles-between-cache-misses
      0.19           +19.9%       0.22        perf-stat.overall.ipc
 7.395e+09           +18.9%  8.789e+09        perf-stat.ps.branch-instructions
 1.997e+08           -13.0%  1.737e+08        perf-stat.ps.branch-misses
 1.448e+09            +4.7%  1.517e+09        perf-stat.ps.cache-references
     55820           +22.2%      68204 ±  2%  perf-stat.ps.context-switches
 4.172e+10           +19.2%  4.972e+10        perf-stat.ps.instructions
 2.556e+12           +20.2%  3.072e+12 ±  2%  perf-stat.total.instructions
      0.35 ±  9%     -14.9%       0.29 ±  6%  perf-sched.sch_delay.avg.ms.__cond_resched.__inet_hash_connect.tcp_v4_connect.__inet_stream_connect.inet_stream_connect
      0.06 ±  7%     -20.5%       0.04 ±  4%  perf-sched.sch_delay.avg.ms.__cond_resched.__release_sock.release_sock.__inet_stream_connect.inet_stream_connect
      0.16 ±218%    +798.3%       1.44 ± 40%  perf-sched.sch_delay.avg.ms.__cond_resched.kmem_cache_alloc_noprof.alloc_empty_file.alloc_file_pseudo.sock_alloc_file
      0.25 ±152%    +291.3%       0.99 ± 45%  perf-sched.sch_delay.avg.ms.__cond_resched.kmem_cache_alloc_noprof.security_inode_alloc.inode_init_always_gfp.alloc_inode
      0.11 ±166%    +568.2%       0.75 ± 45%  perf-sched.sch_delay.avg.ms.__cond_resched.lock_sock_nested.inet_stream_connect.__sys_connect.__x64_sys_connect
      0.84 ± 14%     +39.2%       1.17 ±  9%  perf-sched.sch_delay.avg.ms.__cond_resched.stop_one_cpu.sched_exec.bprm_execve.part
      0.11 ± 22%    +108.5%       0.23 ± 12%  perf-sched.sch_delay.avg.ms.schedule_timeout.wait_woken.sk_wait_data.tcp_recvmsg_locked
      0.08 ± 59%     -60.0%       0.03 ±  4%  perf-sched.sch_delay.max.ms.__cond_resched.__release_sock.release_sock.tcp_sendmsg.__sys_sendto
      0.16 ±218%   +1286.4%       2.22 ± 25%  perf-sched.sch_delay.max.ms.__cond_resched.kmem_cache_alloc_noprof.alloc_empty_file.alloc_file_pseudo.sock_alloc_file
      0.13 ±153%    +910.1%       1.27 ± 34%  perf-sched.sch_delay.max.ms.__cond_resched.lock_sock_nested.inet_stream_connect.__sys_connect.__x64_sys_connect
      9.23           -12.5%       8.08        perf-sched.total_wait_and_delay.average.ms
    139892           +15.3%     161338        perf-sched.total_wait_and_delay.count.ms
      9.18           -12.5%       8.03        perf-sched.total_wait_time.average.ms
      0.70 ±  8%     -14.5%       0.60 ±  6%  perf-sched.wait_and_delay.avg.ms.__cond_resched.__inet_hash_connect.tcp_v4_connect.__inet_stream_connect.inet_stream_connect
      0.11 ±  8%     -20.1%       0.09 ±  4%  perf-sched.wait_and_delay.avg.ms.__cond_resched.__release_sock.release_sock.__inet_stream_connect.inet_stream_connect
    429.48 ± 44%     +63.6%     702.60 ± 11%  perf-sched.wait_and_delay.avg.ms.devkmsg_read.vfs_read.ksys_read.do_syscall_64
      4.97           -14.0%       4.28        perf-sched.wait_and_delay.avg.ms.schedule_timeout.inet_csk_accept.inet_accept.do_accept
      0.23 ± 21%    +104.2%       0.46 ± 12%  perf-sched.wait_and_delay.avg.ms.schedule_timeout.wait_woken.sk_wait_data.tcp_recvmsg_locked
     48576 ±  5%     +36.3%      66215 ±  2%  perf-sched.wait_and_delay.count.__cond_resched.__release_sock.release_sock.__inet_stream_connect.inet_stream_connect
     81.83            +9.8%      89.83 ±  2%  perf-sched.wait_and_delay.count.schedule_hrtimeout_range.do_poll.constprop.0.do_sys_poll
     64098           +16.3%      74560        perf-sched.wait_and_delay.count.schedule_timeout.inet_csk_accept.inet_accept.do_accept
     15531 ± 17%     -46.2%       8355 ±  6%  perf-sched.wait_and_delay.count.schedule_timeout.wait_woken.sk_wait_data.tcp_recvmsg_locked
      0.36 ±  8%     -14.2%       0.31 ±  6%  perf-sched.wait_time.avg.ms.__cond_resched.__inet_hash_connect.tcp_v4_connect.__inet_stream_connect.inet_stream_connect
      0.06 ±  7%     -20.2%       0.04 ±  4%  perf-sched.wait_time.avg.ms.__cond_resched.__release_sock.release_sock.__inet_stream_connect.inet_stream_connect
      0.04 ±178%     -94.4%       0.00 ±130%  perf-sched.wait_time.avg.ms.__cond_resched.down_write_killable.exec_mmap.begin_new_exec.load_elf_binary
      0.16 ±218%    +798.5%       1.44 ± 40%  perf-sched.wait_time.avg.ms.__cond_resched.kmem_cache_alloc_noprof.alloc_empty_file.alloc_file_pseudo.sock_alloc_file
      0.11 ±166%    +568.6%       0.75 ± 45%  perf-sched.wait_time.avg.ms.__cond_resched.lock_sock_nested.inet_stream_connect.__sys_connect.__x64_sys_connect
    427.69 ± 45%     +63.1%     697.48 ± 10%  perf-sched.wait_time.avg.ms.devkmsg_read.vfs_read.ksys_read.do_syscall_64
      4.95           -14.0%       4.26        perf-sched.wait_time.avg.ms.schedule_timeout.inet_csk_accept.inet_accept.do_accept
      0.12 ± 20%     +99.9%       0.23 ± 12%  perf-sched.wait_time.avg.ms.schedule_timeout.wait_woken.sk_wait_data.tcp_recvmsg_locked
      0.16 ±218%   +1286.4%       2.22 ± 25%  perf-sched.wait_time.max.ms.__cond_resched.kmem_cache_alloc_noprof.alloc_empty_file.alloc_file_pseudo.sock_alloc_file
      0.13 ±153%    +911.4%       1.27 ± 34%  perf-sched.wait_time.max.ms.__cond_resched.lock_sock_nested.inet_stream_connect.__sys_connect.__x64_sys_connect

***************************************************************************************************
lkp-spr-r02: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 256G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
  gcc-12/performance/x86_64-rhel-9.4/100%/debian-12-x86_64-20240206.cgz/lkp-spr-r02/sockmany/stress-ng/60s

commit:
  f252f23ab6 ("net: Prevent use after free in netif_napi_set_irq_locked()")
  265acc444f ("inet: change lport contribution to inet_ehashfn() and inet6_ehashfn()")

f252f23ab657cd22 265acc444f8a96246e9d42b54b6
---------------- ---------------------------
         %stddev     %change         %stddev
             \          |                \
    205766            +3.2%     212279        vmstat.system.cs
    309724 ±  5%     +63.6%     506684 ±  9%  sched_debug.cfs_rq:/.avg_vruntime.stddev
    309724 ±  5%     +63.6%     506684 ±  9%  sched_debug.cfs_rq:/.min_vruntime.stddev
   1307371 ±  8%     -14.5%    1117523 ±  7%  sched_debug.cpu.avg_idle.max
   4333131            +4.4%    4525951        stress-ng.sockmany.ops
     71816            +4.4%      74988        stress-ng.sockmany.ops_per_sec
   7639150            +3.6%    7910527        stress-ng.time.voluntary_context_switches
    693603           -18.6%     564616 ±  3%  perf-c2c.DRAM.local
    611374           -16.8%     508688 ±  2%  perf-c2c.DRAM.remote
     19509          +994.2%     213470 ±  7%  perf-c2c.HITM.local
     20252          +957.6%     214187 ±  7%  perf-c2c.HITM.total
    204521            +3.1%     210765        proc-vmstat.nr_shmem
    938137            +2.9%     965493        proc-vmstat.nr_slab_reclaimable
   3102658            +3.0%    3196837        proc-vmstat.nr_slab_unreclaimable
   2113801            +1.8%    2151131        proc-vmstat.numa_hit
   1881174            +2.0%    1919223        proc-vmstat.numa_local
   6186586            +3.6%    6406837        proc-vmstat.pgalloc_normal
      0.76 ± 46%     -83.0%       0.13 ±144%  perf-sched.sch_delay.avg.ms.__cond_resched.mutex_lock.perf_poll.do_poll.constprop
      0.02 ±  2%      -6.3%       0.02 ±  2%  perf-sched.sch_delay.avg.ms.schedule_timeout.inet_csk_accept.inet_accept.do_accept
     15.43           -12.6%      13.48        perf-sched.total_wait_and_delay.average.ms
    234971           +15.6%     271684        perf-sched.total_wait_and_delay.count.ms
     15.37           -12.6%      13.43        perf-sched.total_wait_time.average.ms
    140.18 ±  5%     -37.2%      88.02 ± 11%  perf-sched.wait_and_delay.avg.ms.schedule_hrtimeout_range.do_poll.constprop.0.do_sys_poll
     10.17           -14.1%       8.74        perf-sched.wait_and_delay.avg.ms.schedule_timeout.inet_csk_accept.inet_accept.do_accept
      4.02          -100.0%       0.00        perf-sched.wait_and_delay.avg.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
    104089           +16.4%     121193        perf-sched.wait_and_delay.count.__cond_resched.__release_sock.release_sock.__inet_stream_connect.inet_stream_connect
     88.17 ±  6%     +68.1%     148.17 ± 13%  perf-sched.wait_and_delay.count.schedule_hrtimeout_range.do_poll.constprop.0.do_sys_poll
    108724           +16.8%     127034        perf-sched.wait_and_delay.count.schedule_timeout.inet_csk_accept.inet_accept.do_accept
      1232          -100.0%       0.00        perf-sched.wait_and_delay.count.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
      4592 ± 12%     +26.1%       5792 ± 14%  perf-sched.wait_and_delay.count.schedule_timeout.wait_woken.sk_wait_data.tcp_recvmsg_locked
     11.29 ± 68%    -100.0%       0.00        perf-sched.wait_and_delay.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
      9.99           -13.3%       8.66        perf-sched.wait_time.avg.ms.__cond_resched.__release_sock.release_sock.tcp_sendmsg.__sys_sendto
    139.53 ±  6%     -37.2%      87.60 ± 11%  perf-sched.wait_time.avg.ms.schedule_hrtimeout_range.do_poll.constprop.0.do_sys_poll
     10.15           -14.1%       8.72        perf-sched.wait_time.avg.ms.schedule_timeout.inet_csk_accept.inet_accept.do_accept
     41.10           -17.2%      34.03        perf-stat.i.MPKI
 1.424e+10           +14.6%  1.631e+10        perf-stat.i.branch-instructions
      2.28            -0.1        2.17        perf-stat.i.branch-miss-rate%
 3.193e+08            +9.4%  3.492e+08        perf-stat.i.branch-misses
     77.01            -9.5       67.48        perf-stat.i.cache-miss-rate%
 2.981e+09            -5.1%   2.83e+09        perf-stat.i.cache-misses
 3.806e+09            +8.4%  4.127e+09        perf-stat.i.cache-references
    217129            +3.2%     224056        perf-stat.i.context-switches
      8.68           -12.7%       7.58        perf-stat.i.cpi
    242.24            +4.0%     251.97        perf-stat.i.cycles-between-cache-misses
 7.608e+10           +14.1%  8.679e+10        perf-stat.i.instructions
      0.13           +13.3%       0.15        perf-stat.i.ipc
     39.15           -16.8%      32.58        perf-stat.overall.MPKI
      2.24            -0.1        2.14        perf-stat.overall.branch-miss-rate%
     78.30            -9.7       68.56        perf-stat.overall.cache-miss-rate%
      8.35           -12.4%       7.31        perf-stat.overall.cpi
    213.17            +5.3%     224.53        perf-stat.overall.cycles-between-cache-misses
      0.12           +14.1%       0.14        perf-stat.overall.ipc
 1.401e+10           +14.6%  1.604e+10        perf-stat.ps.branch-instructions
 3.139e+08            +9.4%  3.434e+08        perf-stat.ps.branch-misses
 2.931e+09            -5.1%  2.782e+09        perf-stat.ps.cache-misses
 3.743e+09            +8.4%  4.058e+09        perf-stat.ps.cache-references
    213541            +3.3%     220574        perf-stat.ps.context-switches
 7.485e+10           +14.1%  8.539e+10        perf-stat.ps.instructions
 4.597e+12           +13.9%  5.235e+12        perf-stat.total.instructions

Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index d1b5f45ee718410fdf3e78c113c7ebd4a1ddba40..3025d2b708852acd9744709a897fca17564523d5 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -35,8 +35,8 @@ u32 inet_ehashfn(const struct net *net, const __be32 laddr,
 {
 	net_get_random_once(&inet_ehash_secret, sizeof(inet_ehash_secret));
 
-	return __inet_ehashfn(laddr, lport, faddr, fport,
-			      inet_ehash_secret + net_hash_mix(net));
+	return lport + __inet_ehashfn(laddr, 0, faddr, fport,
+				      inet_ehash_secret + net_hash_mix(net));
 }
 EXPORT_SYMBOL_GPL(inet_ehashfn);
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 9be315496459fcb391123a07ac887e2f59d27360..3d95f1e75a118ff8027d4ec0f33910d23b6af832 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -35,8 +35,8 @@ u32 inet6_ehashfn(const struct net *net,
 	lhash = (__force u32)laddr->s6_addr32[3];
 	fhash = __ipv6_addr_jhash(faddr, tcp_ipv6_hash_secret);
 
-	return __inet6_ehashfn(lhash, lport, fhash, fport,
-			       inet6_ehash_secret + net_hash_mix(net));
+	return lport + __inet6_ehashfn(lhash, 0, fhash, fport,
+				       inet6_ehash_secret + net_hash_mix(net));
 }
 EXPORT_SYMBOL_GPL(inet6_ehashfn);
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/inet_hashtables.c  | 4 ++--
 net/ipv6/inet6_hashtables.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)
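The follow-up patch of the series ("tcp: even faster connect() under stress") is not shown on this page. The sketch below only illustrates, in user space, the kind of port scan this change enables: compute the secret-keyed part of the hash once per destination, then derive each candidate port's bucket with a single addition and a mask, so successive ports land in adjacent buckets (the cache and prefetcher benefit mentioned in the commit message). EHASH_MASK and mix32() are illustrative stand-ins, not kernel symbols:

#include <stdint.h>
#include <stdio.h>

#define EHASH_MASK 0xffffu      /* illustrative: 65536 buckets */

/* Same hypothetical stand-in for the secret-keyed hash as above. */
static uint32_t mix32(uint32_t laddr, uint32_t faddr, uint16_t fport,
                      uint32_t secret)
{
        uint32_t h = laddr ^ faddr ^ fport ^ secret;

        h *= 0x9e3779b1;
        h ^= h >> 16;
        return h;
}

int main(void)
{
        uint32_t secret = 0x12345678;
        uint32_t saddr = 0x0a000001, daddr = 0x0a000002;
        uint16_t dport = 443;
        /* One full hash computation for the whole port scan... */
        uint32_t hash_0 = mix32(saddr, daddr, dport, secret);

        /* ...then each candidate bucket is a single addition plus a mask. */
        for (uint32_t port = 32768; port < 32776; port++) {
                uint32_t bucket = (hash_0 + port) & EHASH_MASK;

                printf("port=%u -> bucket=%u\n", (unsigned)port,
                       (unsigned)bucket);
        }
        return 0;
}

Consecutive candidate ports map to consecutive bucket indices, which is what makes the candidate scan in __inet_hash_connect() friendly to CPU caches and hardware prefetchers.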