Message ID | 20221218234801.579114-1-jmaxwell37@gmail.com
---|---
State | Changes Requested
Delegated to | Netdev Maintainers
Series | [net-next] ipv6: fix routing cache overflow for raw sockets
On Mon, 2022-12-19 at 10:48 +1100, Jon Maxwell wrote:
> Sending Ipv6 packets in a loop via a raw socket triggers an issue where a
> route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly
> consumes the Ipv6 max_size threshold which defaults to 4096 resulting in
> these warnings:
>
> [1] 99.187805] dst_alloc: 7728 callbacks suppressed
> [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> .
> .
> [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.

If I read correctly, the maximum number of dst entries that the raw socket
can use this way is limited by the number of packets it allows via the
sndbuf limit, right?

Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
ipvs, seg6?

@DavidA: why do we need to create RTF_CACHE clones for KNOWN_NH flows?

Thanks,

Paolo
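For context, the raw-socket path under discussion combines two pieces of
kernel logic. The sketch below is a rough paraphrase, not a verbatim copy of
the kernel source; the function names are the ones cited in the thread:

        /* 1) net/ipv6/raw.c, rawv6_sendmsg(): in header-included mode
         *    (IPV6_HDRINCL or IPPROTO_RAW) the caller supplies the IPv6
         *    header, so the nexthop must come from the route rather than
         *    from the flow's daddr: */
        if (hdrincl)
                fl6.flowi6_flags |= FLOWI_FLAG_KNOWN_NH;

        /* 2) net/ipv6/route.c, ip6_pol_route(): a KNOWN_NH flow whose
         *    resolved nexthop has no gateway gets an RTF_CACHE clone per
         *    lookup -- with a raw socket, that is one clone per packet: */
        if ((fl6->flowi6_flags & FLOWI_FLAG_KNOWN_NH) &&
            !res.nh->fib_nh_gw_family)
                rt = ip6_rt_cache_alloc(&res, &fl6->daddr, NULL);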
On 12/20/22 5:35 AM, Paolo Abeni wrote:
> On Mon, 2022-12-19 at 10:48 +1100, Jon Maxwell wrote:
>> Sending Ipv6 packets in a loop via a raw socket triggers an issue where a
>> route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly
>> consumes the Ipv6 max_size threshold which defaults to 4096 resulting in
>> these warnings:
>>
>> [1] 99.187805] dst_alloc: 7728 callbacks suppressed
>> [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
>> .
>> .
>> [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
>
> If I read correctly, the maximum number of dst entries that the raw socket
> can use this way is limited by the number of packets it allows via the
> sndbuf limit, right?
>
> Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> ipvs, seg6?
>
> @DavidA: why do we need to create RTF_CACHE clones for KNOWN_NH flows?
>
> Thanks,
>
> Paolo

If I recall the details correctly: that sysctl limit was added back when
ipv6 routes were managed as dst_entries and there was a desire to allow
an admin to limit the memory consumed. At this point in time, IPv6 is
more in line with IPv4 - a separate struct for fib entries from dst
entries. That "Route cache is full" message is now out of date, since the
limit applies to dst_entries, which have a gc mechanism.

IPv4 does not limit the number of dst_entries that can be allocated
(ip_rt_max_size is the sysctl variable behind the ipv4 version of
max_size and it is a no-op). IPv6 can probably do the same here?

I do not believe the suggested flag is the right change.
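The warning in question comes from the threshold check in dst_alloc(); for
reference, this is the net/core/dst.c code that the prototype patch further
down the thread removes:

        /* net/core/dst.c, dst_alloc(), before the change proposed below:
         * ip6_dst_gc() returns nonzero while the entry count still exceeds
         * net.ipv6.route.max_size, which fails the allocation and logs the
         * rate-limited warning quoted above. */
        if (ops->gc &&
            !(flags & DST_NOCOUNT) &&
            dst_entries_get_fast(ops) > ops->gc_thresh) {
                if (ops->gc(ops)) {
                        pr_notice_ratelimited("Route cache is full: consider increasing sysctl net.ipv6.route.max_size.\n");
                        return NULL;
                }
        }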
Hello,

On Tue, 20 Dec 2022, Paolo Abeni wrote:

> If I read correctly, the maximum number of dst entries that the raw socket
> can use this way is limited by the number of packets it allows via the
> sndbuf limit, right?
>
> Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> ipvs, seg6?

For IPVS there is no sndbuf limit. IPVS uses this flag when receiving
packets from the world (destined to some local Virtual IP) and then
diverts/redirects the packets (without changing daddr) to one of its
backend servers on the LAN (no RTF_GATEWAY on such routes). So, for each
packet IPVS requests an output route with the FLOWI_FLAG_KNOWN_NH flag and
then sends the packet to the backend server (nexthop) using this route
attached to the skb. Packet rate is usually high.

The goal is for the nexthop to be used from the route, not from the IP
header. KNOWN_NH means "nexthop is provided in route, not in daddr".

As for the implementation details in ipv6, I can not comment. But all
users that set the flag want this: to send packets where daddr can be !=
nexthop.

> @DavidA: why do we need to create RTF_CACHE clones for KNOWN_NH flows?

Regards

--
Julian Anastasov <ja@ssi.bg>
Hello,

On Tue, 20 Dec 2022, Paolo Abeni wrote:

> Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> ipvs, seg6?

I forgot to mention one thing: IPVS can cache such routes in its own
storage, one per backend server, and it still calls dst->ops->check for
them. So, such a route can live for a long time, which is why they were
created as uncached. IPVS requests one route, remembers it and then can
attach it to multiple packets for this backend server with
skb_dst_set_noref. So, IPVS would have to use 4096 backend servers to hit
this limit.

It does not look correct in this patch to invalidate the
FLOWI_FLAG_KNOWN_NH flag with a FLOWI_FLAG_SKIP_RAW flag. It would be the
same as not setting FLOWI_FLAG_KNOWN_NH, which is wrong for the hdrincl
case.

Regards

--
Julian Anastasov <ja@ssi.bg>
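A rough sketch of the caching pattern Julian describes, paraphrasing the
shape of net/netfilter/ipvs/ip_vs_xmit.c rather than copying it verbatim;
the helper name here is illustrative:

        /* one route is resolved per backend server, stored, and then
         * revalidated via dst->ops->check() instead of a per-packet
         * route lookup */
        static struct ip_vs_dest_dst *dst_check_sketch(struct ip_vs_dest *dest)
        {
                struct ip_vs_dest_dst *dest_dst = rcu_dereference(dest->dest_dst);
                struct dst_entry *dst;

                if (!dest_dst)
                        return NULL;            /* no route cached yet */
                dst = dest_dst->dst_cache;
                if (dst->obsolete &&
                    dst->ops->check(dst, dest_dst->dst_cookie) == NULL)
                        return NULL;            /* stale: caller re-resolves */
                return dest_dst;
        }

On the transmit path the cached dst is then attached to each skb without
taking a reference (skb_dst_set_noref(skb, dst)), so one route serves many
packets, which is why IPVS only approaches the limit with ~4096 backends.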
On Tue, Dec 20, 2022 at 11:35 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Mon, 2022-12-19 at 10:48 +1100, Jon Maxwell wrote:
> > Sending Ipv6 packets in a loop via a raw socket triggers an issue where a
> > route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly
> > consumes the Ipv6 max_size threshold which defaults to 4096 resulting in
> > these warnings:
> >
> > [1] 99.187805] dst_alloc: 7728 callbacks suppressed
> > [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > .
> > .
> > [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
>
> If I read correctly, the maximum number of dst entries that the raw socket
> can use this way is limited by the number of packets it allows via the
> sndbuf limit, right?
>

Yes, but in my test the sndbuf limit is never hit, so it clones a route for
every packet.

e.g.:

output from a C program sending 5000000 packets via a raw socket:

ip raw: total num pkts 5000000

# bpftrace -e 'kprobe:dst_alloc {@count[comm] = count()}'
Attaching 1 probe...

@count[a.out]: 5000009

> Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> ipvs, seg6?
>

Any call to ip6_pol_route(s) where res.nh->fib_nh_gw_family is 0 can do it.
But we have only seen this for raw sockets so far.

Regards

Jon

> @DavidA: why do we need to create RTF_CACHE clones for KNOWN_NH flows?
>
> Thanks,
>
> Paolo
>
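For reference, the kind of reproducer described above can be sketched as
follows. This is a hypothetical stand-in, not Jon's actual test program:
the 2001:db8:: addresses are documentation placeholders, the destination
must resolve to a gatewayless (on-link) route for the clone path to
trigger, and it needs CAP_NET_RAW to run.

        #include <arpa/inet.h>
        #include <netinet/in.h>
        #include <netinet/ip6.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/socket.h>
        #include <unistd.h>

        int main(void)
        {
                /* IPPROTO_RAW raw sockets use the header-included send
                 * path, which is where FLOWI_FLAG_KNOWN_NH gets set */
                int fd = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
                struct sockaddr_in6 dst = { .sin6_family = AF_INET6 };
                struct ip6_hdr hdr;
                long i, pkts = 5000000;

                if (fd < 0) {
                        perror("socket");
                        return 1;
                }
                inet_pton(AF_INET6, "2001:db8::2", &dst.sin6_addr); /* placeholder */

                memset(&hdr, 0, sizeof(hdr));
                hdr.ip6_vfc = 0x60;             /* IP version 6 */
                hdr.ip6_nxt = IPPROTO_NONE;     /* no upper-layer payload */
                hdr.ip6_hlim = 64;
                inet_pton(AF_INET6, "2001:db8::1", &hdr.ip6_src); /* placeholder */
                hdr.ip6_dst = dst.sin6_addr;

                /* each sendto() does a route lookup that clones a dst entry */
                for (i = 0; i < pkts; i++)
                        sendto(fd, &hdr, sizeof(hdr), 0,
                               (struct sockaddr *)&dst, sizeof(dst));

                printf("ip raw: total num pkts %ld\n", pkts);
                close(fd);
                return 0;
        }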
On Wed, Dec 21, 2022 at 2:10 AM David Ahern <dsahern@kernel.org> wrote:
>
> [...]
>
> IPv4 does not limit the number of dst_entries that can be allocated
> (ip_rt_max_size is the sysctl variable behind the ipv4 version of
> max_size and it is a no-op). IPv6 can probably do the same here?
>

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index dbc224023977..701aba7feaf5 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -6470,7 +6470,7 @@ static int __net_init ip6_route_net_init(struct net *net)
 #endif
 
 	net->ipv6.sysctl.flush_delay = 0;
-	net->ipv6.sysctl.ip6_rt_max_size = 4096;
+	net->ipv6.sysctl.ip6_rt_max_size = INT_MAX;
 	net->ipv6.sysctl.ip6_rt_gc_min_interval = HZ / 2;
 	net->ipv6.sysctl.ip6_rt_gc_timeout = 60*HZ;
 	net->ipv6.sysctl.ip6_rt_gc_interval = 30*HZ;

The above patch resolved it for the Ipv6 reproducer.

Would that be sufficient?

> I do not believe the suggested flag is the right change.

Regards

Jon
On Wed, Dec 21, 2022 at 8:55 AM Jonathan Maxwell <jmaxwell37@gmail.com> wrote:
>
> On Wed, Dec 21, 2022 at 2:10 AM David Ahern <dsahern@kernel.org> wrote:
> >
> > [...]
> >
> > IPv4 does not limit the number of dst_entries that can be allocated
> > (ip_rt_max_size is the sysctl variable behind the ipv4 version of
> > max_size and it is a no-op). IPv6 can probably do the same here?
> >
> [...]
>
> The above patch resolved it for the Ipv6 reproducer.
>
> Would that be sufficient?
>

Otherwise, if you prefer to make Ipv6 behaviour similar to IPv4 rather than
upping max_size, here is a prototype patch that removes the max_size check
for Ipv6:

diff --git a/include/net/dst_ops.h b/include/net/dst_ops.h
index 88ff7bb2bb9b..632086b2f644 100644
--- a/include/net/dst_ops.h
+++ b/include/net/dst_ops.h
@@ -16,7 +16,7 @@ struct dst_ops {
 	unsigned short family;
 	unsigned int gc_thresh;
 
-	int (*gc)(struct dst_ops *ops);
+	void (*gc)(struct dst_ops *ops);
 	struct dst_entry * (*check)(struct dst_entry *, __u32 cookie);
 	unsigned int (*default_advmss)(const struct dst_entry *);
 	unsigned int (*mtu)(const struct dst_entry *);
diff --git a/net/core/dst.c b/net/core/dst.c
index 497ef9b3fc6a..dcb85267bc4c 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -82,12 +82,8 @@ void *dst_alloc(struct dst_ops *ops, struct net_device *dev,
 
 	if (ops->gc &&
 	    !(flags & DST_NOCOUNT) &&
-	    dst_entries_get_fast(ops) > ops->gc_thresh) {
-		if (ops->gc(ops)) {
-			pr_notice_ratelimited("Route cache is full: consider increasing sysctl net.ipv6.route.max_size.\n");
-			return NULL;
-		}
-	}
+	    dst_entries_get_fast(ops) > ops->gc_thresh)
+		ops->gc(ops);
 
 	dst = kmem_cache_alloc(ops->kmem_cachep, GFP_ATOMIC);
 	if (!dst)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index dbc224023977..8db7c5436da4 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -91,7 +91,7 @@ static struct dst_entry *ip6_negative_advice(struct dst_entry *);
 static void		ip6_dst_destroy(struct dst_entry *);
 static void		ip6_dst_ifdown(struct dst_entry *,
 				       struct net_device *dev, int how);
-static int		ip6_dst_gc(struct dst_ops *ops);
+static void		ip6_dst_gc(struct dst_ops *ops);
 
 static int		ip6_pkt_discard(struct sk_buff *skb);
 static int		ip6_pkt_discard_out(struct net *net, struct sock *sk, struct sk_buff *skb);
@@ -3295,32 +3295,21 @@ struct dst_entry *icmp6_dst_alloc(struct net_device *dev,
 	return dst;
 }
 
-static int ip6_dst_gc(struct dst_ops *ops)
+static void ip6_dst_gc(struct dst_ops *ops)
 {
 	struct net *net = container_of(ops, struct net, ipv6.ip6_dst_ops);
-	int rt_min_interval = net->ipv6.sysctl.ip6_rt_gc_min_interval;
-	int rt_max_size = net->ipv6.sysctl.ip6_rt_max_size;
 	int rt_elasticity = net->ipv6.sysctl.ip6_rt_gc_elasticity;
 	int rt_gc_timeout = net->ipv6.sysctl.ip6_rt_gc_timeout;
-	unsigned long rt_last_gc = net->ipv6.ip6_rt_last_gc;
 	int entries;
 
 	entries = dst_entries_get_fast(ops);
-	if (entries > rt_max_size)
-		entries = dst_entries_get_slow(ops);
-
-	if (time_after(rt_last_gc + rt_min_interval, jiffies) &&
-	    entries <= rt_max_size)
-		goto out;
 
 	net->ipv6.ip6_rt_gc_expire++;
 	fib6_run_gc(net->ipv6.ip6_rt_gc_expire, net, true);
 	entries = dst_entries_get_slow(ops);
 	if (entries < ops->gc_thresh)
 		net->ipv6.ip6_rt_gc_expire = rt_gc_timeout>>1;
-out:
 	net->ipv6.ip6_rt_gc_expire -= net->ipv6.ip6_rt_gc_expire>>rt_elasticity;
-	return entries > rt_max_size;
 }
 
 static int ip6_nh_lookup_table(struct net *net, struct fib6_config *cfg,

> > I do not believe the suggested flag is the right change.
>
> Regards
>
> Jon
On Wed, Dec 21, 2022 at 3:31 PM Jonathan Maxwell <jmaxwell37@gmail.com> wrote:
>
> [...]
>
> Otherwise, if you prefer to make Ipv6 behaviour similar to IPv4 rather than
> upping max_size, here is a prototype patch that removes the max_size check
> for Ipv6:
>

There are some mistakes in this prototype patch. Let me come up with a
better one. I'll submit a new patch in the new year for review. Thanks for
the suggestion DavidA.

Regards

Jon

> [...]
On 12/21/22 10:39 PM, Jonathan Maxwell wrote:
> There are some mistakes in this prototype patch. Let me come up with a
> better one. I'll submit a new patch in the new year for review. Thanks for
> the suggestion DavidA.

You are welcome. When you submit the next one, it would be helpful to show
the change in memory consumption and a comparison to IPv4 for a similar raw
socket program.
On Fri, Dec 23, 2022 at 3:17 AM David Ahern <dsahern@kernel.org> wrote:
>
> On 12/21/22 10:39 PM, Jonathan Maxwell wrote:
> > There are some mistakes in this prototype patch. Let me come up with a
> > better one. I'll submit a new patch in the new year for review. Thanks for
> > the suggestion DavidA.
>
> You are welcome. When you submit the next one, it would be helpful to show
> the change in memory consumption and a comparison to IPv4 for a similar raw
> socket program.

Sure, will do. I have an ipv4 version of the test program and a better
patch. Initial results are quite similar for resource usage. The ipv6 GC
does a good job of managing memory resources with the max_size threshold
removed. I'll include some comparison stats when I submit the new patch,
after I test this extensively.
Hi Jon,
please see below, thanks.

On Wed, 21 Dec 2022 08:48:11 +1100
Jonathan Maxwell <jmaxwell37@gmail.com> wrote:

> On Tue, Dec 20, 2022 at 11:35 PM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > [...]
> >
> > Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> > ipvs, seg6?
> >
>
> Any call to ip6_pol_route(s) where res.nh->fib_nh_gw_family is 0 can do it.
> But we have only seen this for raw sockets so far.
>

In the SRv6 subsystem, seg6_lookup_nexthop() is used by some
cross-connecting behaviors such as End.X and End.DX6 to forward traffic to
a specified nexthop. SRv6 End.X/DX6 can specify an IPv6 DA (i.e., a
nexthop) different from the one carried by the IPv6 header. For this
purpose, seg6_lookup_nexthop() sets FLOWI_FLAG_KNOWN_NH.

> > > [1] 99.187805] dst_alloc: 7728 callbacks suppressed
> > > [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > .
> > > .
> > > [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.

I can reproduce the same warning messages reported by you, by instantiating
an End.X behavior whose nexthop is handled by a route for which there is no
"via". In this configuration, ip6_pol_route() (called by
seg6_lookup_nexthop()) triggers ip6_rt_cache_alloc() because i) the
FLOWI_FLAG_KNOWN_NH is present and ii) the res.nh->fib_nh_gw_family is 0
(as already pointed out).

> Regards
>
> Jon

Ciao,
Andrea
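A rough paraphrase of the seg6_lookup_nexthop() logic Andrea describes
(net/ipv6/seg6_local.c); this is a sketch of the shape of the code, not a
verbatim copy:

        /* End.X/DX6 pass a nexthop (nhaddr) that overrides the
         * destination address taken from the packet's IPv6 header, and
         * flag the flow so the lookup keeps that nexthop: */
        fl6.daddr = nhaddr ? *nhaddr : hdr->daddr;
        if (nhaddr)
                fl6.flowi6_flags = FLOWI_FLAG_KNOWN_NH;

        /* the lookup then goes through ip6_pol_route(); on a route with
         * no gateway ("via", fib_nh_gw_family == 0) the flag yields the
         * per-packet ip6_rt_cache_alloc() clone described above */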
On Sat, Dec 24, 2022 at 7:28 AM Andrea Mayer <andrea.mayer@uniroma2.it> wrote:
>
> Hi Jon,
> please see below, thanks.
>
> [...]
>
> In the SRv6 subsystem, seg6_lookup_nexthop() is used by some
> cross-connecting behaviors such as End.X and End.DX6 to forward traffic to
> a specified nexthop. SRv6 End.X/DX6 can specify an IPv6 DA (i.e., a
> nexthop) different from the one carried by the IPv6 header. For this
> purpose, seg6_lookup_nexthop() sets FLOWI_FLAG_KNOWN_NH.
>

Hi Andrea,

Thanks for pointing that datapath out. The more generic approach we are
taking, bringing Ipv6 closer to Ipv4 in this regard, should fix all
instances of this.

> I can reproduce the same warning messages reported by you, by instantiating
> an End.X behavior whose nexthop is handled by a route for which there is no
> "via". In this configuration, ip6_pol_route() (called by
> seg6_lookup_nexthop()) triggers ip6_rt_cache_alloc() because i) the
> FLOWI_FLAG_KNOWN_NH is present and ii) the res.nh->fib_nh_gw_family is 0
> (as already pointed out).
>

Nice. When I get back after the holiday break I'll submit the next patch.
It would be great if you could test the new patch and let me know how it
works in your tests at that juncture. I'll keep you posted.

Regards

Jon

> Ciao,
> Andrea
Hi Andrea,

Happy New Year.

Any chance you could test this patch based on the latest net-next kernel
and let me know the result?

diff --git a/include/net/dst_ops.h b/include/net/dst_ops.h
index 88ff7bb2bb9b..632086b2f644 100644
--- a/include/net/dst_ops.h
+++ b/include/net/dst_ops.h
@@ -16,7 +16,7 @@ struct dst_ops {
 	unsigned short family;
 	unsigned int gc_thresh;
 
-	int (*gc)(struct dst_ops *ops);
+	void (*gc)(struct dst_ops *ops);
 	struct dst_entry * (*check)(struct dst_entry *, __u32 cookie);
 	unsigned int (*default_advmss)(const struct dst_entry *);
 	unsigned int (*mtu)(const struct dst_entry *);
diff --git a/net/core/dst.c b/net/core/dst.c
index 6d2dd03dafa8..31c08a3386d3 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -82,12 +82,8 @@ void *dst_alloc(struct dst_ops *ops, struct net_device *dev,
 
 	if (ops->gc &&
 	    !(flags & DST_NOCOUNT) &&
-	    dst_entries_get_fast(ops) > ops->gc_thresh) {
-		if (ops->gc(ops)) {
-			pr_notice_ratelimited("Route cache is full: consider increasing sysctl net.ipv6.route.max_size.\n");
-			return NULL;
-		}
-	}
+	    dst_entries_get_fast(ops) > ops->gc_thresh)
+		ops->gc(ops);
 
 	dst = kmem_cache_alloc(ops->kmem_cachep, GFP_ATOMIC);
 	if (!dst)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index e74e0361fd92..b643dda68d31 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -91,7 +91,7 @@ static struct dst_entry *ip6_negative_advice(struct dst_entry *);
 static void		ip6_dst_destroy(struct dst_entry *);
 static void		ip6_dst_ifdown(struct dst_entry *,
 				       struct net_device *dev, int how);
-static int		ip6_dst_gc(struct dst_ops *ops);
+static void		ip6_dst_gc(struct dst_ops *ops);
 
 static int		ip6_pkt_discard(struct sk_buff *skb);
 static int		ip6_pkt_discard_out(struct net *net, struct sock *sk, struct sk_buff *skb);
@@ -3284,11 +3284,10 @@ struct dst_entry *icmp6_dst_alloc(struct net_device *dev,
 	return dst;
 }
 
-static int ip6_dst_gc(struct dst_ops *ops)
+static void ip6_dst_gc(struct dst_ops *ops)
 {
 	struct net *net = container_of(ops, struct net, ipv6.ip6_dst_ops);
 	int rt_min_interval = net->ipv6.sysctl.ip6_rt_gc_min_interval;
-	int rt_max_size = net->ipv6.sysctl.ip6_rt_max_size;
 	int rt_elasticity = net->ipv6.sysctl.ip6_rt_gc_elasticity;
 	int rt_gc_timeout = net->ipv6.sysctl.ip6_rt_gc_timeout;
 	unsigned long rt_last_gc = net->ipv6.ip6_rt_last_gc;
@@ -3296,11 +3295,10 @@ static int ip6_dst_gc(struct dst_ops *ops)
 	int entries;
 
 	entries = dst_entries_get_fast(ops);
-	if (entries > rt_max_size)
+	if (entries > ops->gc_thresh)
 		entries = dst_entries_get_slow(ops);
 
-	if (time_after(rt_last_gc + rt_min_interval, jiffies) &&
-	    entries <= rt_max_size)
+	if (time_after(rt_last_gc + rt_min_interval, jiffies))
 		goto out;
 
 	fib6_run_gc(atomic_inc_return(&net->ipv6.ip6_rt_gc_expire), net, true);
@@ -3310,7 +3308,6 @@ static int ip6_dst_gc(struct dst_ops *ops)
 out:
 	val = atomic_read(&net->ipv6.ip6_rt_gc_expire);
 	atomic_set(&net->ipv6.ip6_rt_gc_expire, val - (val >> rt_elasticity));
-	return entries > rt_max_size;
 }
 
 static int ip6_nh_lookup_table(struct net *net, struct fib6_config *cfg,
@@ -6512,7 +6509,7 @@ static int __net_init ip6_route_net_init(struct net *net)
 #endif
 
 	net->ipv6.sysctl.flush_delay = 0;
-	net->ipv6.sysctl.ip6_rt_max_size = 4096;
+	net->ipv6.sysctl.ip6_rt_max_size = INT_MAX;
 	net->ipv6.sysctl.ip6_rt_gc_min_interval = HZ / 2;
 	net->ipv6.sysctl.ip6_rt_gc_timeout = 60*HZ;
 	net->ipv6.sysctl.ip6_rt_gc_interval = 30*HZ;

On Sat, Dec 24, 2022 at 6:38 PM Jonathan Maxwell <jmaxwell37@gmail.com> wrote:
>
> On Sat, Dec 24, 2022 at 7:28 AM Andrea Mayer <andrea.mayer@uniroma2.it> wrote:
> >
> > Hi Jon,
> > please see below, thanks.
> >
> > [...]
Hi Jon,
please see below, thanks.

On Tue, 3 Jan 2023 10:59:50 +1100
Jonathan Maxwell <jmaxwell37@gmail.com> wrote:

> Hi Andrea,
>
> Happy New Year.
>

Thank you, Happy New Year to you too and everybody on the mailing list as
well.

> Any chance you could test this patch based on the latest net-next
> kernel and let me know the result?
>
> [...]

Yes, I will apply this patch in the next few days and check how it deals
with the seg6 subsystem. I will keep you posted.

Ciao,
Andrea
Hi Jon,
please see after, thanks.

> > Any chance you could test this patch based on the latest net-next
> > kernel and let me know the result?
> >
> > [...]
>
> Yes, I will apply this patch in the next few days and check how it deals
> with the seg6 subsystem. I will keep you posted.
>

I applied the patch* to net-next (HEAD 6bd4755c7c49) and did some tests on
the seg6 subsystem, specifically running the End.X/DX6 behaviors. They seem
to work fine.

(*) I had to slightly edit the patch because of the code formatting, e.g.
some incorrect line breaks, spaces, etc.

Ciao,
Andrea

--
Andrea Mayer <andrea.mayer@uniroma2.it>
On Sat, Jan 7, 2023 at 10:27 AM Andrea Mayer <andrea.mayer@uniroma2.it> wrote:
>
> Hi Jon,
> please see after, thanks.
>
> [...]
>
> I applied the patch* to net-next (HEAD 6bd4755c7c49) and did some tests on
> the seg6 subsystem, specifically running the End.X/DX6 behaviors. They seem
> to work fine.

Thanks Andrea, much appreciated. It worked fine in my raw socket tests as
well. I'll look at submitting it soon.

> (*) I had to slightly edit the patch because of the code formatting, e.g.
> some incorrect line breaks, spaces, etc.
>

Sorry about that, I should have sent it from git to avoid that.

Regards

Jon

> Ciao,
> Andrea
On Sun, 8 Jan 2023 10:46:09 +1100
Jonathan Maxwell <jmaxwell37@gmail.com> wrote:

> On Sat, Jan 7, 2023 at 10:27 AM Andrea Mayer <andrea.mayer@uniroma2.it> wrote:
> >
> > Hi Jon,
> > please see after, thanks.
> >
> > > > Any chance you could test this patch based on the latest net-next
> > > > kernel and let me know the result?
> > > >
> > > > diff --git a/include/net/dst_ops.h b/include/net/dst_ops.h
> > > > index 88ff7bb2bb9b..632086b2f644 100644
> > > > --- a/include/net/dst_ops.h
> > > > +++ b/include/net/dst_ops.h
> > > > @@ -16,7 +16,7 @@ struct dst_ops {
> > > >  	unsigned short family;
> > > >  	unsigned int gc_thresh;
> > > >
> > > > -	int (*gc)(struct dst_ops *ops);
> > > > +	void (*gc)(struct dst_ops *ops);
> > > >  	struct dst_entry * (*check)(struct dst_entry *, __u32 cookie);
> > > >  	unsigned int (*default_advmss)(const struct dst_entry *);
> > > >  	unsigned int (*mtu)(const struct dst_entry *);
> > > > diff --git a/net/core/dst.c b/net/core/dst.c
> > > > index 6d2dd03dafa8..31c08a3386d3 100644
> > > > --- a/net/core/dst.c
> > > > +++ b/net/core/dst.c
> > > > @@ -82,12 +82,8 @@ void *dst_alloc(struct dst_ops *ops, struct net_device *dev,
> > > >
> > > >  	if (ops->gc &&
> > > >  	    !(flags & DST_NOCOUNT) &&
> > > > -	    dst_entries_get_fast(ops) > ops->gc_thresh) {
> > > > -		if (ops->gc(ops)) {
> > > > -			pr_notice_ratelimited("Route cache is full: consider increasing sysctl net.ipv6.route.max_size.\n");
> > > > -			return NULL;
> > > > -		}
> > > > -	}
> > > > +	    dst_entries_get_fast(ops) > ops->gc_thresh)
> > > > +		ops->gc(ops);
> > > >
> > > >  	dst = kmem_cache_alloc(ops->kmem_cachep, GFP_ATOMIC);
> > > >  	if (!dst)
> > > > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > > > index e74e0361fd92..b643dda68d31 100644
> > > > --- a/net/ipv6/route.c
> > > > +++ b/net/ipv6/route.c
> > > > @@ -91,7 +91,7 @@ static struct dst_entry *ip6_negative_advice(struct dst_entry *);
> > > >  static void ip6_dst_destroy(struct dst_entry *);
> > > >  static void ip6_dst_ifdown(struct dst_entry *,
> > > >  			   struct net_device *dev, int how);
> > > > -static int ip6_dst_gc(struct dst_ops *ops);
> > > > +static void ip6_dst_gc(struct dst_ops *ops);
> > > >
> > > >  static int ip6_pkt_discard(struct sk_buff *skb);
> > > >  static int ip6_pkt_discard_out(struct net *net, struct sock *sk, struct sk_buff *skb);
> > > > @@ -3284,11 +3284,10 @@ struct dst_entry *icmp6_dst_alloc(struct net_device *dev,
> > > >  	return dst;
> > > >  }
> > > >
> > > > -static int ip6_dst_gc(struct dst_ops *ops)
> > > > +static void ip6_dst_gc(struct dst_ops *ops)
> > > >  {
> > > >  	struct net *net = container_of(ops, struct net, ipv6.ip6_dst_ops);
> > > >  	int rt_min_interval = net->ipv6.sysctl.ip6_rt_gc_min_interval;
> > > > -	int rt_max_size = net->ipv6.sysctl.ip6_rt_max_size;
> > > >  	int rt_elasticity = net->ipv6.sysctl.ip6_rt_gc_elasticity;
> > > >  	int rt_gc_timeout = net->ipv6.sysctl.ip6_rt_gc_timeout;
> > > >  	unsigned long rt_last_gc = net->ipv6.ip6_rt_last_gc;
> > > > @@ -3296,11 +3295,10 @@ static int ip6_dst_gc(struct dst_ops *ops)
> > > >  	int entries;
> > > >
> > > >  	entries = dst_entries_get_fast(ops);
> > > > -	if (entries > rt_max_size)
> > > > +	if (entries > ops->gc_thresh)
> > > >  		entries = dst_entries_get_slow(ops);
> > > >
> > > > -	if (time_after(rt_last_gc + rt_min_interval, jiffies) &&
> > > > -	    entries <= rt_max_size)
> > > > +	if (time_after(rt_last_gc + rt_min_interval, jiffies))
> > > >  		goto out;
> > > >
> > > >  	fib6_run_gc(atomic_inc_return(&net->ipv6.ip6_rt_gc_expire), net, true);
> > > > @@ -3310,7 +3308,6 @@ static int ip6_dst_gc(struct dst_ops *ops)
> > > >  out:
> > > >  	val = atomic_read(&net->ipv6.ip6_rt_gc_expire);
> > > >  	atomic_set(&net->ipv6.ip6_rt_gc_expire, val - (val >> rt_elasticity));
> > > > -	return entries > rt_max_size;
> > > >  }
> > > >
> > > >  static int ip6_nh_lookup_table(struct net *net, struct fib6_config *cfg,
> > > > @@ -6512,7 +6509,7 @@ static int __net_init ip6_route_net_init(struct net *net)
> > > >  #endif
> > > >
> > > >  	net->ipv6.sysctl.flush_delay = 0;
> > > > -	net->ipv6.sysctl.ip6_rt_max_size = 4096;
> > > > +	net->ipv6.sysctl.ip6_rt_max_size = INT_MAX;
> > > >  	net->ipv6.sysctl.ip6_rt_gc_min_interval = HZ / 2;
> > > >  	net->ipv6.sysctl.ip6_rt_gc_timeout = 60*HZ;
> > > >  	net->ipv6.sysctl.ip6_rt_gc_interval = 30*HZ;
> > > >
> > >
> > > Yes, I will apply this patch in the next days and check how it deals with
> > > the seg6 subsystem. I will keep you posted.
> > >
> >
> > I applied the patch* to net-next (HEAD 6bd4755c7c49) and did some tests on
> > the seg6 subsystem, specifically running the End.X/DX6 behaviors. They seem
> > to work fine.
>
> Thanks, Andrea, much appreciated. It worked fine in my raw socket tests as
> well.

You're welcome. Good!

> I'll look at submitting it soon.

Please let me know (keep me in cc).

> >
> > (*) I had to slightly edit the patch because of the code formatting, e.g.
> > some incorrect line breaks, spaces, etc.
> >
>
> Sorry about that, I should have sent it from git to avoid that.

Don't worry.

> Regards
>
> Jon

Ciao,
Andrea

--
Andrea Mayer <andrea.mayer@uniroma2.it>
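To summarize the behavioral change being tested above, here is a toy userspace
model of the before/after allocation policies. It is a sketch under stated
assumptions: the function names, the threshold constant, and the bare counter
are illustrative only; the real code in net/core/dst.c and net/ipv6/route.c
runs actual garbage collection rather than comparing a counter.

#include <stdbool.h>
#include <stdio.h>

#define OLD_MAX_SIZE 4096   /* old net.ipv6.route.max_size default */
#define GC_THRESH    1024   /* illustrative stand-in for ops->gc_thresh */

/* Old policy: past gc_thresh, gc() reports whether entries still exceed
 * max_size; if so, dst_alloc() fails, the "Route cache is full" message
 * is printed, and the packet is dropped.
 */
static bool old_dst_alloc(int entries)
{
	if (entries > GC_THRESH && entries > OLD_MAX_SIZE)
		return false;
	return true;
}

/* New policy: past gc_thresh, gc runs as best effort and the allocation
 * proceeds regardless; there is no max_size failure path at all.
 */
static bool new_dst_alloc(int entries)
{
	(void)entries;	/* gc is advisory now, never a reason to fail */
	return true;
}

int main(void)
{
	int entries = 7728;	/* e.g. the suppressed-callback burst above */

	printf("old: %s\n", old_dst_alloc(entries) ? "allocates" : "drops");
	printf("new: %s\n", new_dst_alloc(entries) ? "allocates" : "drops");
	return 0;
}

The point of the patch is visible in the second function: allocation can no
longer fail for lack of max_size headroom, so regular gc (and, for raw
sockets, the sndbuf limit mentioned earlier in the thread) provides the
back-pressure.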
diff --git a/include/net/flow.h b/include/net/flow.h
index 2f0da4f0318b..30b8973ffb4b 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -37,6 +37,7 @@ struct flowi_common {
 	__u8	flowic_flags;
 #define FLOWI_FLAG_ANYSRC		0x01
 #define FLOWI_FLAG_KNOWN_NH		0x02
+#define FLOWI_FLAG_SKIP_RAW		0x04
 	__u32	flowic_secid;
 	kuid_t	flowic_uid;
 	struct flowi_tunnel flowic_tun_key;
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index a06a9f847db5..0b89a7e66d09 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -884,7 +884,7 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	security_sk_classify_flow(sk, flowi6_to_flowi_common(&fl6));
 
 	if (hdrincl)
-		fl6.flowi6_flags |= FLOWI_FLAG_KNOWN_NH;
+		fl6.flowi6_flags |= FLOWI_FLAG_KNOWN_NH | FLOWI_FLAG_SKIP_RAW;
 
 	if (ipc6.tclass < 0)
 		ipc6.tclass = np->tclass;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index e74e0361fd92..beae0bd61738 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2226,6 +2226,7 @@ struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
 	if (rt) {
 		goto out;
 	} else if (unlikely((fl6->flowi6_flags & FLOWI_FLAG_KNOWN_NH) &&
+		   !(fl6->flowi6_flags & FLOWI_FLAG_SKIP_RAW) &&
 		   !res.nh->fib_nh_gw_family)) {
 		/* Create a RTF_CACHE clone which will not be
 		 * owned by the fib6 tree.  It is for the special case where
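Distilled from the hunk above, the effect of this earlier flag-based approach
(the thread above converges on the gc-based change instead) can be modeled as
a small compilable predicate. The flag values are copied from the patch; the
function name and the driver in main() are illustrative only:

#include <stdbool.h>
#include <stdio.h>

#define FLOWI_FLAG_ANYSRC   0x01
#define FLOWI_FLAG_KNOWN_NH 0x02
#define FLOWI_FLAG_SKIP_RAW 0x04

/* Mirrors the ip6_pol_route() condition in the hunk above: clone only for
 * KNOWN_NH flows that did not opt out and whose route has no gateway.
 */
static bool would_clone(unsigned int flags, bool route_has_gw)
{
	return (flags & FLOWI_FLAG_KNOWN_NH) &&
	       !(flags & FLOWI_FLAG_SKIP_RAW) &&
	       !route_has_gw;
}

int main(void)
{
	/* raw socket with hdrincl: both flags set, no per-packet clone */
	printf("raw hdrincl: %d\n",
	       would_clone(FLOWI_FLAG_KNOWN_NH | FLOWI_FLAG_SKIP_RAW, false));
	/* seg6 End.X via a route with no "via": still clones per packet */
	printf("seg6 End.X:  %d\n", would_clone(FLOWI_FLAG_KNOWN_NH, false));
	return 0;
}

This is also why the flag only covers the raw-socket case: other
FLOWI_FLAG_KNOWN_NH users such as seg6 End.X/DX6 would keep cloning per packet.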
Sending IPv6 packets in a loop via a raw socket triggers an issue where a
route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly
consumes the IPv6 max_size threshold, which defaults to 4096, resulting in
these warnings:

[1] 99.187805] dst_alloc: 7728 callbacks suppressed
[2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
.
.
[300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.

When this happens the packet is dropped and sendto() fails with a "Network
is unreachable" error (errno 101):

# ./a.out -s

remaining pkt 200557 errno 101
remaining pkt 196462 errno 101
.
.
remaining pkt 126821 errno 101

Fix this by adding a flag that prevents the cloning of routes for raw
sockets. This makes the IPv6 routing code use per-cpu routes instead, which
prevents packet drops due to max_size overflow. IPv4 is not affected because
it has a very large default max_size.

Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
---
 include/net/flow.h | 1 +
 net/ipv6/raw.c     | 2 +-
 net/ipv6/route.c   | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)
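The test program ("a.out" above) was not posted to the list. A minimal
reproducer along the lines described might look like the sketch below; the
addresses are hypothetical, and it assumes CAP_NET_RAW plus a route to the
destination with no gateway (no "via"), per the discussion above:

/* Sketch of a raw-socket sender that exercises the hdrincl path; the
 * actual test program was not posted, so names and addresses here are
 * illustrative only.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/ip6.h>
#include <sys/socket.h>

int main(void)
{
	struct sockaddr_in6 dst = { .sin6_family = AF_INET6 };
	struct ip6_hdr hdr;
	long pkts = 5000000;
	int fd;

	/* IPPROTO_RAW on AF_INET6 implies hdrincl, the case in which
	 * rawv6_sendmsg() sets FLOWI_FLAG_KNOWN_NH.
	 */
	fd = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	memset(&hdr, 0, sizeof(hdr));
	hdr.ip6_flow = htonl(6UL << 28);	/* version 6 */
	hdr.ip6_nxt  = IPPROTO_NONE;		/* header only, no payload */
	hdr.ip6_hlim = 64;
	/* hypothetical addresses; the destination must be reachable via a
	 * route with no gateway for ip6_rt_cache_alloc() to be hit
	 */
	inet_pton(AF_INET6, "2001:db8::1", &hdr.ip6_src);
	inet_pton(AF_INET6, "2001:db8::2", &hdr.ip6_dst);
	dst.sin6_addr = hdr.ip6_dst;

	while (pkts--) {
		if (sendto(fd, &hdr, sizeof(hdr), 0,
			   (struct sockaddr *)&dst, sizeof(dst)) < 0 &&
		    errno == ENETUNREACH)	/* errno 101, as in the log */
			fprintf(stderr, "remaining pkt %ld errno %d\n",
				pkts, errno);
	}

	close(fd);
	return 0;
}

Before the fix, once more than max_size clones accumulate, the sendto() calls
start failing with ENETUNREACH and the kernel logs the rate-limited "Route
cache is full" warnings; with the fix applied, the sends should all succeed.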