Message ID | 20240507124229.446802-1-leone4fernando@gmail.com (mailing list archive)
---|---
Series | net: route: improve route hinting
On Tue, May 7, 2024 at 2:43 PM Leone Fernando <leone4fernando@gmail.com> wrote:
>
> In 2017, Paolo Abeni introduced the hinting mechanism [1] to the routing
> sub-system. The hinting optimization improves performance by reusing
> previously found dsts instead of looking them up for each skb.
>
> This patch series introduces a generalized version of the hinting
> mechanism that can "remember" a larger number of dsts. This reduces the
> number of dst lookups for frequently encountered daddrs.
>
> Before diving into the code and the benchmarking results, it's important
> to address the deletion of the old route cache [2] and why this solution
> is different. The original cache was complicated, vulnerable to DoS
> attacks and had unstable performance.
>
> The new input dst_cache is much simpler thanks to its lazy approach,
> improving performance without the overhead of the removed cache
> implementation. Instead of using timers and GC, the deletion of invalid
> entries is performed lazily during their lookups.
> The dsts are stored in a simple, lightweight, static hash table. This
> keeps the lookup times fast yet stable, preventing DoS upon cache misses.
> The new input dst_cache implementation is built over the existing
> dst_cache code, which supplies fast lockless percpu behavior.
>
> The measurement setup comprises 2 machines with mlx5 100Gbit NICs.
> I sent small UDP packets with 5000 daddrs (10x the cache size) from one
> machine to the other while also varying the saddr and the tos. I set
> an iptables rule to drop the packets after routing. The receiving
> machine's CPU (i9) was saturated.
>
> Thanks a lot to David Ahern for all the help and guidance!
>
> I measured the rx PPS using ifpps and the per-queue PPS using ethtool -S.
> These are the results:

How are device dismantles taken into account?

I am currently tracking a bug in dst_cache, triggering sometimes when
running the pmtu.sh selftest.

Apparently, dst_cache_per_cpu_dst_set() can cache dsts that have no
dst->rt_uncached linkage.

There is no cleanup (at least in vxlan) to make sure cached dsts are
either freed or have their dst->dev changed.


TEST: ipv6: cleanup of cached exceptions - nexthop objects        [ OK ]
[ 1001.344490] vxlan: __vxlan_fdb_free calling dst_cache_destroy(ffff8f12422cbb90)
[ 1001.345253] dst_cache_destroy dst_cache=ffff8f12422cbb90 ->cache=0000417580008d30
[ 1001.378615] vxlan: __vxlan_fdb_free calling dst_cache_destroy(ffff8f12471e31d0)
[ 1001.379260] dst_cache_destroy dst_cache=ffff8f12471e31d0 ->cache=0000417580008608
[ 1011.349730] unregister_netdevice: waiting for veth_A-R1 to become free. Usage count = 7
[ 1011.350562] ref_tracker: veth_A-R1@000000009392ed3b has 1/6 users at
[ 1011.350562]     dst_alloc+0x76/0x160
[ 1011.350562]     ip6_dst_alloc+0x25/0x80
[ 1011.350562]     ip6_pol_route+0x2a8/0x450
[ 1011.350562]     ip6_pol_route_output+0x1f/0x30
[ 1011.350562]     fib6_rule_lookup+0x163/0x270
[ 1011.350562]     ip6_route_output_flags+0xda/0x190
[ 1011.350562]     ip6_dst_lookup_tail.constprop.0+0x1d0/0x260
[ 1011.350562]     ip6_dst_lookup_flow+0x47/0xa0
[ 1011.350562]     udp_tunnel6_dst_lookup+0x158/0x210
[ 1011.350562]     vxlan_xmit_one+0x4c6/0x1550 [vxlan]
[ 1011.350562]     vxlan_xmit+0x535/0x1500 [vxlan]
[ 1011.350562]     dev_hard_start_xmit+0x7b/0x1e0
[ 1011.350562]     __dev_queue_xmit+0x20c/0xe40
[ 1011.350562]     arp_xmit+0x1d/0x50
[ 1011.350562]     arp_send_dst+0x7f/0xa0
[ 1011.350562]     arp_solicit+0xf6/0x2f0
[ 1011.350562]
[ 1011.350562] ref_tracker: veth_A-R1@000000009392ed3b has 3/6 users at
[ 1011.350562]     dst_alloc+0x76/0x160
[ 1011.350562]     ip6_dst_alloc+0x25/0x80
[ 1011.350562]     ip6_pol_route+0x2a8/0x450
[ 1011.350562]     ip6_pol_route_output+0x1f/0x30
[ 1011.350562]     fib6_rule_lookup+0x163/0x270
[ 1011.350562]     ip6_route_output_flags+0xda/0x190
[ 1011.350562]     ip6_dst_lookup_tail.constprop.0+0x1d0/0x260
[ 1011.350562]     ip6_dst_lookup_flow+0x47/0xa0
[ 1011.350562]     udp_tunnel6_dst_lookup+0x158/0x210
[ 1011.350562]     vxlan_xmit_one+0x4c6/0x1550 [vxlan]
[ 1011.350562]     vxlan_xmit+0x535/0x1500 [vxlan]
[ 1011.350562]     dev_hard_start_xmit+0x7b/0x1e0
[ 1011.350562]     __dev_queue_xmit+0x20c/0xe40
[ 1011.350562]     ip6_finish_output2+0x2ea/0x6e0
[ 1011.350562]     ip6_finish_output+0x143/0x320
[ 1011.350562]     ip6_output+0x74/0x140
[ 1011.350562]
[ 1011.350562] ref_tracker: veth_A-R1@000000009392ed3b has 1/6 users at
[ 1011.350562]     netdev_get_by_index+0xc0/0xe0
[ 1011.350562]     fib6_nh_init+0x1a9/0xa90
[ 1011.350562]     rtm_new_nexthop+0x6fa/0x1580
[ 1011.350562]     rtnetlink_rcv_msg+0x155/0x3e0
[ 1011.350562]     netlink_rcv_skb+0x61/0x110
[ 1011.350562]     rtnetlink_rcv+0x19/0x20
[ 1011.350562]     netlink_unicast+0x23f/0x380
[ 1011.350562]     netlink_sendmsg+0x1fc/0x430
[ 1011.350562]     ____sys_sendmsg+0x2ef/0x320
[ 1011.350562]     ___sys_sendmsg+0x86/0xd0
[ 1011.350562]     __sys_sendmsg+0x67/0xc0
[ 1011.350562]     __x64_sys_sendmsg+0x21/0x30
[ 1011.350562]     x64_sys_call+0x252/0x2030
[ 1011.350562]     do_syscall_64+0x6c/0x190
[ 1011.350562]     entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1011.350562]
[ 1011.350562] ref_tracker: veth_A-R1@000000009392ed3b has 1/6 users at
[ 1011.350562]     ipv6_add_dev+0x136/0x530
[ 1011.350562]     addrconf_notify+0x19d/0x770
[ 1011.350562]     notifier_call_chain+0x65/0xd0
[ 1011.350562]     raw_notifier_call_chain+0x1a/0x20
[ 1011.350562]     call_netdevice_notifiers_info+0x54/0x90
[ 1011.350562]     register_netdevice+0x61e/0x790
[ 1011.350562]     veth_newlink+0x230/0x440
[ 1011.350562]     __rtnl_newlink+0x7d2/0xaa0
[ 1011.350562]     rtnl_newlink+0x4c/0x70
[ 1011.350562]     rtnetlink_rcv_msg+0x155/0x3e0
[ 1011.350562]     netlink_rcv_skb+0x61/0x110
[ 1011.350562]     rtnetlink_rcv+0x19/0x20
[ 1011.350562]     netlink_unicast+0x23f/0x380
[ 1011.350562]     netlink_sendmsg+0x1fc/0x430
[ 1011.350562]     ____sys_sendmsg+0x2ef/0x320
[ 1011.350562]     ___sys_sendmsg+0x86/0xd0
[ 1011.350562]
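For readers following along: the helper Eric names lives in net/core/dst_cache.c. The sketch below paraphrases it (not verbatim; the exact code varies by kernel version). The point is that the cache takes its own long-lived reference on the dst, which transitively pins dst->dev, and nothing here links the dst onto a per-cpu rt_uncached list where device unregister could find it.

```c
/* Paraphrase of dst_cache_per_cpu_dst_set() from net/core/dst_cache.c
 * (check your tree for the exact code). The cache holds its own
 * reference on the dst, which in turn pins dst->dev until
 * dst_cache_reset()/dst_cache_destroy() releases it. Nothing here adds
 * the dst to a per-cpu rt_uncached list, so unregister_netdevice() has
 * no way to find this entry and re-point its dst->dev.
 */
static void dst_cache_per_cpu_dst_set(struct dst_cache_pcpu *dst_cache,
				      struct dst_entry *dst, u32 cookie)
{
	dst_release(dst_cache->dst);	/* drop the previously cached dst */
	if (dst)
		dst_hold(dst);		/* take the cache's own reference */

	dst_cache->cookie = cookie;	/* saved for later validity checks */
	dst_cache->dst = dst;
}
```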
> On Tue, May 7, 2024 at 2:43 PM Leone Fernando <leone4fernando@gmail.com> wrote:
>>
>> In 2017, Paolo Abeni introduced the hinting mechanism [1] to the routing
>> sub-system. The hinting optimization improves performance by reusing
>> previously found dsts instead of looking them up for each skb.
>>
>> This patch series introduces a generalized version of the hinting
>> mechanism that can "remember" a larger number of dsts. This reduces the
>> number of dst lookups for frequently encountered daddrs.
>>
>> Before diving into the code and the benchmarking results, it's important
>> to address the deletion of the old route cache [2] and why this solution
>> is different. The original cache was complicated, vulnerable to DoS
>> attacks and had unstable performance.
>>
>> The new input dst_cache is much simpler thanks to its lazy approach,
>> improving performance without the overhead of the removed cache
>> implementation. Instead of using timers and GC, the deletion of invalid
>> entries is performed lazily during their lookups.
>> The dsts are stored in a simple, lightweight, static hash table. This
>> keeps the lookup times fast yet stable, preventing DoS upon cache misses.
>> The new input dst_cache implementation is built over the existing
>> dst_cache code, which supplies fast lockless percpu behavior.
>>
>> The measurement setup comprises 2 machines with mlx5 100Gbit NICs.
>> I sent small UDP packets with 5000 daddrs (10x the cache size) from one
>> machine to the other while also varying the saddr and the tos. I set
>> an iptables rule to drop the packets after routing. The receiving
>> machine's CPU (i9) was saturated.
>>
>> Thanks a lot to David Ahern for all the help and guidance!
>>
>> I measured the rx PPS using ifpps and the per-queue PPS using ethtool -S.
>> These are the results:
>
> How are device dismantles taken into account?
>
> I am currently tracking a bug in dst_cache, triggering sometimes when
> running the pmtu.sh selftest.
>
> Apparently, dst_cache_per_cpu_dst_set() can cache dsts that have no
> dst->rt_uncached linkage.

The dst_cache_input introduced in this series caches input routes that
are owned by the fib tree. These routes have the rt_uncached linkage,
so I think this bug will not replicate to dst_cache_input.

> There is no cleanup (at least in vxlan) to make sure cached dsts are
> either freed or have their dst->dev changed.
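For context on Leone's answer, the rt_uncached linkage is what lets device teardown reparent outstanding dsts. The sketch below is modeled on rt_flush_dev() in net/ipv4/route.c (simplified, not verbatim; helper and field names vary across kernel versions, and newer kernels also quarantine the entries): on NETDEV_UNREGISTER, every rtable on a per-cpu rt_uncached list that still points at the dying device gets its dst.dev swapped to the global blackhole_netdev, so it stops pinning the real device. A cached dst that never joined such a list, as in the vxlan trace above, is invisible to this walk, and the device reference leaks.

```c
/* Simplified sketch modeled on rt_flush_dev() in net/ipv4/route.c.
 * Walks every CPU's rt_uncached list and re-points entries that still
 * reference the dying device at blackhole_netdev, releasing the device
 * reference that would otherwise stall unregister_netdevice().
 */
static void rt_flush_dev(struct net_device *dev)
{
	struct rtable *rt;
	int cpu;

	for_each_possible_cpu(cpu) {
		struct uncached_list *ul = &per_cpu(rt_uncached_list, cpu);

		spin_lock_bh(&ul->lock);
		list_for_each_entry(rt, &ul->head, dst.rt_uncached) {
			if (rt->dst.dev != dev)
				continue;
			rt->dst.dev = blackhole_netdev;
			/* move the tracked device ref over to blackhole_netdev */
			netdev_ref_replace(dev, blackhole_netdev,
					   &rt->dst.dev_tracker, GFP_ATOMIC);
		}
		spin_unlock_bh(&ul->lock);
	}
}
```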