[net-next,0/4] net: route: improve route hinting

Message ID: 20240307171202.232684-1-leone4fernando@gmail.com

Message

Leone Fernando March 7, 2024, 5:11 p.m. UTC
In 2019, Paolo Abeni introduced the hinting mechanism [1] to the routing
sub-system. The hinting optimization improves performance by reusing
previously found dsts instead of looking them up for each skb.

This patch series introduces a generalized version of the hinting mechanism that
can "remember" a larger number of dsts. This reduces the number of dst
lookups for frequently encountered daddrs.
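
Roughly, the hint idea can be sketched in plain C as follows (a simplified
userspace model for illustration only; the struct and function names are
made-up stand-ins, not the kernel's sk_buff/dst_entry handling):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-ins: the kernel works with struct sk_buff and struct dst_entry. */
struct dst { uint32_t daddr; };
struct pkt { uint32_t daddr; struct dst *dst; };

/* Placeholder for the expensive path: a full routing lookup. */
static struct dst *route_lookup_slow(uint32_t daddr)
{
    struct dst *d = malloc(sizeof(*d));

    d->daddr = daddr;
    return d;
}

/* Hint-style processing of a batch of packets: if the current packet has the
 * same destination as the previous one, reuse the dst that was just found
 * instead of repeating the lookup. */
static void process_batch(struct pkt *pkts, int n)
{
    struct dst *hint = NULL;

    for (int i = 0; i < n; i++) {
        if (hint && hint->daddr == pkts[i].daddr)
            pkts[i].dst = hint;                      /* reuse the hint */
        else
            pkts[i].dst = route_lookup_slow(pkts[i].daddr);
        hint = pkts[i].dst;
    }
}

int main(void)
{
    struct pkt batch[] = { { 1, NULL }, { 1, NULL }, { 2, NULL } };

    process_batch(batch, 3);
    /* The first two packets share one dst; the third needed its own lookup.
     * (Frees omitted for brevity.) */
    printf("%d %d\n", batch[0].dst == batch[1].dst, batch[1].dst == batch[2].dst);
    return 0;
}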

Before diving into the code and the benchmarking results, it's important
to address the deletion of the old route cache [2] and why
this solution is different. The original cache was complicated,
vulnerable to DoS attacks, and had unstable performance.

The new input dst_cache is much simpler thanks to its lazy approach,
improving performance without the overhead of the removed cache
implementation. Instead of using timers and GC, invalid entries are
deleted lazily when a lookup encounters them.
The dsts are stored in a simple, lightweight, static hash table. This
keeps the lookup times fast yet stable, preventing DoS via cache misses.
The new input dst_cache implementation is built on top of the existing
dst_cache code, which provides fast, lockless, per-CPU behavior.
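
A rough standalone C model of this approach (for illustration only: the
table geometry, hash function, and validity check below are placeholder
assumptions, not the actual dst_cache internals):

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define NUM_BUCKETS 128  /* fixed table size: lookup cost never grows with traffic */
#define BUCKET_SIZE 4    /* at most this many entries are scanned per lookup */

struct dst {
    uint32_t daddr;
    time_t expires;      /* stand-in for the kernel's dst validity state */
};

struct bucket {
    struct dst *slot[BUCKET_SIZE];
};

/* The real design keeps one table per CPU; a single table keeps the model short. */
static struct bucket table[NUM_BUCKETS];

static bool dst_is_valid(const struct dst *d)
{
    return d && d->expires > time(NULL);
}

static struct dst *route_lookup_slow(uint32_t daddr)
{
    struct dst *d = malloc(sizeof(*d));

    d->daddr = daddr;
    d->expires = time(NULL) + 60;
    return d;
}

/* Lookup with lazy invalidation: stale entries are dropped only when a
 * lookup happens to walk over them - no timers, no garbage collection. */
static struct dst *input_cache_lookup(uint32_t daddr)
{
    struct bucket *b = &table[daddr % NUM_BUCKETS];  /* toy hash function */
    int victim = 0;

    for (int i = 0; i < BUCKET_SIZE; i++) {
        struct dst *d = b->slot[i];

        if (!dst_is_valid(d)) {          /* lazily evict empty/expired slots */
            free(d);
            b->slot[i] = NULL;
            victim = i;
            continue;
        }
        if (d->daddr == daddr)
            return d;                    /* hit: the full lookup is skipped */
    }

    /* Miss: fall back to the full lookup and remember the result.
     * (Real code would hold references on live dsts; the model just replaces.) */
    free(b->slot[victim]);
    b->slot[victim] = route_lookup_slow(daddr);
    return b->slot[victim];
}

int main(void)
{
    /* The second lookup for the same daddr is served from the cache. */
    return input_cache_lookup(0x0a000001) == input_cache_lookup(0x0a000001) ? 0 : 1;
}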

I tested this patch using UDP floods with different numbers of daddrs.
The benchmarking setup comprises 3 machines: a sender,
a forwarder and a receiver. I measured the PPS received by the receiver
while the forwarder was running either the mainline kernel or the patched
kernel, and compared the results. The dst_cache tested in this benchmark
used a total of 512 hash table entries, split into buckets of 4
entries each.
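
For reference, the tested geometry works out as follows (a tiny
illustrative snippet; only the 512-entry total and 4-entry buckets come
from the description above, the macro names are made up):

#include <stdio.h>

#define CACHE_ENTRIES 512    /* total entries used in this benchmark */
#define BUCKET_SIZE   4      /* entries per bucket */
#define NUM_BUCKETS   (CACHE_ENTRIES / BUCKET_SIZE)  /* 512 / 4 = 128 buckets */

int main(void)
{
    unsigned int hash = 0xdeadbeefu;  /* any hash of the destination address */

    /* With a power-of-two bucket count, picking a bucket is a simple mask. */
    printf("%d buckets of %d entries, example index %u\n",
           NUM_BUCKETS, BUCKET_SIZE, hash & (NUM_BUCKETS - 1));
    return 0;
}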

These are the results:
  UDP             mainline              patched                   delta
conns pcpu         Kpps                  Kpps                       %
   1              274.0255              269.2205                  -1.75
   2              257.3748              268.0947                   4.17
  15              241.3513              258.8016                   7.23
 100              238.3419              258.4939                   8.46
 500              238.5390              252.6425                   5.91
1000              238.7570              242.1820                   1.43
2000              238.7780              236.2640                  -1.05
4000              239.0440              233.5320                  -2.31
8000              239.3248              232.5680                  -2.82

As you can see, this patch improves performance for up to roughly 1500
concurrent connections; beyond that point, the growing number of cache
misses gradually turns the gain into a small degradation.
It's important to note that in the worst case, every packet causes a
cache miss, which results in only a constant performance degradation
because the cache and bucket sizes are fixed. This makes the cache
resistant to DoS attacks.

Based on the above measurements, the performance degradation appears to
flatten out at around 3%. Note that the number of concurrent connections
at which performance starts to degrade depends on the cache size and the
number of CPUs.

[1] https://lore.kernel.org/netdev/cover.1574252982.git.pabeni@redhat.com/
[2] https://lore.kernel.org/netdev/20120720.142502.1144557295933737451.davem@davemloft.net/

v1:
- fix typo while allocating per-cpu cache
- set IPSKB_DOREDIRECT correctly when using a dst from the dst_cache
- always compile dst_cache

RFC-v2:
- remove unnecessary macro
- move inline to .h file

RFC-v1: https://lore.kernel.org/netdev/d951b371-4138-4bda-a1c5-7606a28c81f0@gmail.com/
RFC-v2: https://lore.kernel.org/netdev/3a17c86d-08a5-46d2-8622-abc13d4a411e@gmail.com/

Leone Fernando (4):
  net: route: expire rt if the dst it holds is expired
  net: dst_cache: add input_dst_cache API
  net: route: always compile dst_cache
  net: route: replace route hints with input_dst_cache

 drivers/net/Kconfig        |   1 -
 include/net/dst_cache.h    |  68 +++++++++++++++++++
 include/net/dst_metadata.h |   2 -
 include/net/ip_tunnels.h   |   2 -
 include/net/route.h        |   6 +-
 net/Kconfig                |   4 --
 net/core/Makefile          |   3 +-
 net/core/dst.c             |   4 --
 net/core/dst_cache.c       | 132 +++++++++++++++++++++++++++++++++++++
 net/ipv4/Kconfig           |   1 -
 net/ipv4/ip_input.c        |  58 ++++++++--------
 net/ipv4/ip_tunnel_core.c  |   4 --
 net/ipv4/route.c           |  75 +++++++++++++++------
 net/ipv4/udp_tunnel_core.c |   4 --
 net/ipv6/Kconfig           |   4 --
 net/ipv6/ip6_udp_tunnel.c  |   4 --
 net/netfilter/nft_tunnel.c |   2 -
 net/openvswitch/Kconfig    |   1 -
 net/sched/act_tunnel_key.c |   2 -
 19 files changed, 291 insertions(+), 86 deletions(-)

Comments

David Ahern March 9, 2024, 4:53 a.m. UTC | #1
On 3/7/24 10:11 AM, Leone Fernando wrote:
> [...]
>
> These are the results:
>   UDP             mainline              patched                   delta
> conns pcpu         Kpps                  Kpps                       %
>    1              274.0255              269.2205                  -1.75
>    2              257.3748              268.0947                   4.17
>   15              241.3513              258.8016                   7.23
>  100              238.3419              258.4939                   8.46
>  500              238.5390              252.6425                   5.91
> 1000              238.7570              242.1820                   1.43
> 2000              238.7780              236.2640                  -1.05
> 4000              239.0440              233.5320                  -2.31
> 8000              239.3248              232.5680                  -2.82
> 

I have looked at all of the sets sent. I can not convince myself this is
a good idea, but at the same time I do not have constructive feedback on
why it is not acceptable. The gains are modest at best.
Leone Fernando March 12, 2024, 3:38 p.m. UTC | #2
David Ahern wrote:
> I have looked at all of the sets sent. I can not convince myself this is
> a good idea, but at the same time I do not have constructive feedback on
> why it is not acceptable. The gains are modest at best.
> 

Thanks for the comment.

I believe an improvement of 5-8% in PPS is significant.
Note that the cache is per-CPU (e.g., for a machine with 10 CPUs, the
improvement applies at 10x the connection counts listed above).

Could you please provide more information about what you don't like
in the patch?

Some possible issues I can think of:
- Do you think the improvement does not affect the common case? If so,
it can be addressed by tweaking the cache parameters.

- If the performance degradation at some connection counts is problematic,
we can find ways to reduce it. For example, the degradation for a single
connection can probably be solved by keeping the route hints.

Leone
Leone Fernando April 2, 2024, 10:08 a.m. UTC | #3
Hi David,

I plan to continue working on this patch, and it
would be helpful if you could share some more thoughts.
As I said before, I think this patch is significant, and my measurements
showed a consistent improvement in most cases.

Thanks,
Leone
David Ahern April 2, 2024, 3:02 p.m. UTC | #4
On 4/2/24 4:08 AM, Leone Fernando wrote:
> Hi David,
> 
> I plan to continue working on this patch, and it
> would be helpful if you could share some more thoughts.
> As I said before, I think this patch is significant, and my measurements
> showed a consistent improvement in most cases.
> 

It seems to me patch 1 and a version of it for IPv6 should go in
independently of this set.

For the rest of it, Jakub's response was a good summary: it is hard to
know if there is a benefit to real workloads. Caches consume resources
(memory and CPU) and will be wrong some percentage of the time,
increasing overhead.

Also, it is targeted at a very narrow use case -- IPv4 only, no custom
FIB rules and no multipath.
Leone Fernando April 3, 2024, 11:50 a.m. UTC | #5
Hi David,

> 
> For the rest of it, Jakub's response was a good summary: it is hard to
> know if there is a benefit to real workloads. Caches consume resources
> (memory and CPU) and will be wrong some percentage of the time,
> increasing overhead.
> 

I understand what you are saying here.
Do you have suggestions for additional tests or measurements that would
show whether the patch also helps real workloads?

> Also, it is targeted at a very narrow use case -- IPv4 only, no custom
> FIB rules and no multipath.

The IPv6 implementation is almost identical. I decided to start with IPv4
and plan to submit IPv6 support in a future patch.
This patch essentially improves the hinting mechanism that Paolo introduced
and handles the same cases.
Adding support for FIB rules and multipath is a bit more complicated but
also possible. I am willing to keep working on this patch to cover the
remaining cases, including IPv6.

Thanks,
Leone