Message ID | 20230718080044.2738833-1-liuhangbin@gmail.com (mailing list archive) |
---|---
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Series | [net-next] ipv4/fib: send RTM_DELROUTE notify when flush fib |
On Tue, Jul 18, 2023 at 04:00:44PM +0800, Hangbin Liu wrote:
> After deleting an interface address in fib_del_ifaddr(), the function
> scans the fib_info list for stray entries and calls fib_flush.
> Then the stray entries will be deleted silently and no RTM_DELROUTE
> notification will be sent.

[...]

> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 74d403dbd2b4..1a88013f6a5b 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -2026,6 +2026,7 @@ void fib_table_flush_external(struct fib_table *tb)
>  int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
>  {
>  	struct trie *t = (struct trie *)tb->tb_data;
> +	struct nl_info info = { .nl_net = net };
>  	struct key_vector *pn = t->kv;
>  	unsigned long cindex = 1;
>  	struct hlist_node *tmp;
> @@ -2088,6 +2089,8 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
>
>  			fib_notify_alias_delete(net, n->key, &n->leaf, fa,
>  						NULL);
> +			rtmsg_fib(RTM_DELROUTE, htonl(n->key), fa,
> +				  KEYLENGTH - fa->fa_slen, tb->tb_id, &info, 0);

fib_table_flush() isn't only called when an address is deleted, but also
when an interface is deleted or put down. The lack of notification in
these cases is deliberate. Commit 7c6bb7d2faaf ("net/ipv6: Add knob to
skip DELROUTE message on device down") introduced a sysctl to make IPv6
behave like IPv4 in this regard, but this patch breaks it.

IMO, the number of routes being flushed because a preferred source
address is deleted is significantly lower compared to interface down /
deletion, so generating notifications in this case is probably OK. It
also seems to be consistent with IPv6 given that rt6_remove_prefsrc()
calls fib6_clean_all() and not fib6_clean_all_skip_notify().

>  			hlist_del_rcu(&fa->fa_list);
>  			fib_release_info(fa->fa_info);
>  			alias_free_mem_rcu(fa);
> --
> 2.38.1
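[Editor's note] The objection above boils down to a small decision table: of the several reasons fib_table_flush() runs, only the address-deletion case should generate RTM_DELROUTE. A toy Python sketch of that policy (names invented for illustration; this is not kernel code):

```python
# Hypothetical model of the discussion above: fib_table_flush() runs for
# several reasons, but per the existing IPv4 semantics (and the IPv6
# sysctl from commit 7c6bb7d2faaf) only the address-deletion case
# should emit RTM_DELROUTE.

ADDR_DELETED = "address deleted"      # fib_del_ifaddr() -> fib_flush()
DEV_DOWN = "device down"              # interface administratively down
DEV_UNREGISTER = "device deleted"     # interface unregistered

def should_notify_delroute(flush_reason: str) -> bool:
    """True if flushing for this reason should send RTM_DELROUTE.

    Device down/unregister stay silent on purpose: userspace can deduce
    the route removals from the link notification, and per-route
    messages would flood netlink on large tables.
    """
    return flush_reason == ADDR_DELETED

# The patch as posted notifies unconditionally; the intended behavior
# depends on why the flush happens.
assert should_notify_delroute(ADDR_DELETED)
assert not should_notify_delroute(DEV_DOWN)
assert not should_notify_delroute(DEV_UNREGISTER)
```

The difficulty discussed in the rest of the thread is that fib_table_flush() does not currently know which of these reasons triggered it.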
On Tue, Jul 18, 2023 at 01:19:13PM +0300, Ido Schimmel wrote: > On Tue, Jul 18, 2023 at 04:00:44PM +0800, Hangbin Liu wrote: > > After deleting an interface address in fib_del_ifaddr(), the function > > scans the fib_info list for stray entries and calls fib_flush. > > Then the stray entries will be deleted silently and no RTM_DELROUTE > > notification will be sent. > > [...] > > > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > > index 74d403dbd2b4..1a88013f6a5b 100644 > > --- a/net/ipv4/fib_trie.c > > +++ b/net/ipv4/fib_trie.c > > @@ -2026,6 +2026,7 @@ void fib_table_flush_external(struct fib_table *tb) > > int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all) > > { > > struct trie *t = (struct trie *)tb->tb_data; > > + struct nl_info info = { .nl_net = net }; > > struct key_vector *pn = t->kv; > > unsigned long cindex = 1; > > struct hlist_node *tmp; > > @@ -2088,6 +2089,8 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all) > > > > fib_notify_alias_delete(net, n->key, &n->leaf, fa, > > NULL); > > + rtmsg_fib(RTM_DELROUTE, htonl(n->key), fa, > > + KEYLENGTH - fa->fa_slen, tb->tb_id, &info, 0); > > fib_table_flush() isn't only called when an address is deleted, but also > when an interface is deleted or put down. The lack of notification in > these cases is deliberate. Commit 7c6bb7d2faaf ("net/ipv6: Add knob to > skip DELROUTE message on device down") introduced a sysctl to make IPv6 > behave like IPv4 in this regard, but this patch breaks it. > > IMO, the number of routes being flushed because a preferred source > address is deleted is significantly lower compared to interface down / > deletion, so generating notifications in this case is probably OK. It > also seems to be consistent with IPv6 given that rt6_remove_prefsrc() > calls fib6_clean_all() and not fib6_clean_all_skip_notify(). 
Actually, looking closer at IPv6, it seems that routes are not deleted,
but instead the preferred source address is simply removed from them
(w/o a notification).

Anyway, my point is that a RTM_DELROUTE notification should not be
emitted for routes that are deleted because of interface down /
deletion.

> >  			hlist_del_rcu(&fa->fa_list);
> >  			fib_release_info(fa->fa_info);
> >  			alias_free_mem_rcu(fa);
> > --
> > 2.38.1
On 7/18/23 4:32 AM, Ido Schimmel wrote: >>> @@ -2088,6 +2089,8 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all) >>> >>> fib_notify_alias_delete(net, n->key, &n->leaf, fa, >>> NULL); >>> + rtmsg_fib(RTM_DELROUTE, htonl(n->key), fa, >>> + KEYLENGTH - fa->fa_slen, tb->tb_id, &info, 0); >> >> fib_table_flush() isn't only called when an address is deleted, but also >> when an interface is deleted or put down. The lack of notification in >> these cases is deliberate. Commit 7c6bb7d2faaf ("net/ipv6: Add knob to >> skip DELROUTE message on device down") introduced a sysctl to make IPv6 >> behave like IPv4 in this regard, but this patch breaks it. >> >> IMO, the number of routes being flushed because a preferred source >> address is deleted is significantly lower compared to interface down / >> deletion, so generating notifications in this case is probably OK. It >> also seems to be consistent with IPv6 given that rt6_remove_prefsrc() >> calls fib6_clean_all() and not fib6_clean_all_skip_notify(). > > Actually, looking closer at IPv6, it seems that routes are not deleted, > but instead the preferred source address is simply removed from them > (w/o a notification). > > Anyway, my point is that a RTM_DELROUTE notification should not be > emitted for routes that are deleted because of interface down / > deletion. > exactly.
On Tue, 18 Jul 2023 13:19:06 +0300 Ido Schimmel <idosch@idosch.org> wrote: > fib_table_flush() isn't only called when an address is deleted, but also > when an interface is deleted or put down. The lack of notification in > these cases is deliberate. Commit 7c6bb7d2faaf ("net/ipv6: Add knob to > skip DELROUTE message on device down") introduced a sysctl to make IPv6 > behave like IPv4 in this regard, but this patch breaks it. > > IMO, the number of routes being flushed because a preferred source > address is deleted is significantly lower compared to interface down / > deletion, so generating notifications in this case is probably OK. It > also seems to be consistent with IPv6 given that rt6_remove_prefsrc() > calls fib6_clean_all() and not fib6_clean_all_skip_notify(). Agree. Imagine the case of 3 million routes and device goes down. There is a reason IPv4 behaves the way it does.
On Tue, Jul 18, 2023 at 08:58:14AM -0700, Stephen Hemminger wrote:
> On Tue, 18 Jul 2023 13:19:06 +0300
> Ido Schimmel <idosch@idosch.org> wrote:
>
> > fib_table_flush() isn't only called when an address is deleted, but also
> > when an interface is deleted or put down. The lack of notification in
> > these cases is deliberate. Commit 7c6bb7d2faaf ("net/ipv6: Add knob to
> > skip DELROUTE message on device down") introduced a sysctl to make IPv6
> > behave like IPv4 in this regard, but this patch breaks it.
> >
> > IMO, the number of routes being flushed because a preferred source
> > address is deleted is significantly lower compared to interface down /
> > deletion, so generating notifications in this case is probably OK. It

How about ignore route deletion for link down? e.g.

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 74d403dbd2b4..11c0f325e887 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2026,6 +2026,7 @@ void fib_table_flush_external(struct fib_table *tb)
 int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
 {
 	struct trie *t = (struct trie *)tb->tb_data;
+	struct nl_info info = { .nl_net = net };
 	struct key_vector *pn = t->kv;
 	unsigned long cindex = 1;
 	struct hlist_node *tmp;
@@ -2088,6 +2089,11 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)

 			fib_notify_alias_delete(net, n->key, &n->leaf, fa,
 						NULL);
+			if (!(fi->fib_flags & RTNH_F_LINKDOWN)) {
+				rtmsg_fib(RTM_DELROUTE, htonl(n->key), fa,
+					  KEYLENGTH - fa->fa_slen, tb->tb_id, &info, 0);
+			}
 			hlist_del_rcu(&fa->fa_list);
 			fib_release_info(fa->fa_info);
 			alias_free_mem_rcu(fa);

> > also seems to be consistent with IPv6 given that rt6_remove_prefsrc()
> > calls fib6_clean_all() and not fib6_clean_all_skip_notify().
>
> Agree. Imagine the case of 3 million routes and device goes down.
> There is a reason IPv4 behaves the way it does.
On Thu, Jul 20, 2023 at 03:51:13PM +0800, Hangbin Liu wrote: > On Tue, Jul 18, 2023 at 08:58:14AM -0700, Stephen Hemminger wrote: > > On Tue, 18 Jul 2023 13:19:06 +0300 > > Ido Schimmel <idosch@idosch.org> wrote: > > > > > fib_table_flush() isn't only called when an address is deleted, but also > > > when an interface is deleted or put down. The lack of notification in > > > these cases is deliberate. Commit 7c6bb7d2faaf ("net/ipv6: Add knob to > > > skip DELROUTE message on device down") introduced a sysctl to make IPv6 > > > behave like IPv4 in this regard, but this patch breaks it. > > > > > > IMO, the number of routes being flushed because a preferred source > > > address is deleted is significantly lower compared to interface down / > > > deletion, so generating notifications in this case is probably OK. It > > How about ignore route deletion for link down? e.g. > > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > index 74d403dbd2b4..11c0f325e887 100644 > --- a/net/ipv4/fib_trie.c > +++ b/net/ipv4/fib_trie.c > @@ -2026,6 +2026,7 @@ void fib_table_flush_external(struct fib_table *tb) > int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all) > { > struct trie *t = (struct trie *)tb->tb_data; > + struct nl_info info = { .nl_net = net }; > struct key_vector *pn = t->kv; > unsigned long cindex = 1; > struct hlist_node *tmp; > @@ -2088,6 +2089,11 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all) > > fib_notify_alias_delete(net, n->key, &n->leaf, fa, > NULL); > + if (!(fi->fib_flags & RTNH_F_LINKDOWN)) { > + rtmsg_fib(RTM_DELROUTE, htonl(n->key), fa, > + KEYLENGTH - fa->fa_slen, tb->tb_id, &info, 0); > + } Will you get a notification in this case for 198.51.100.0/24? 
# ip link add name dummy1 up type dummy
# ip link add name dummy2 up type dummy
# ip address add 192.0.2.1/24 dev dummy1
# ip route add 198.51.100.0/24 dev dummy2 src 192.0.2.1
# ip link set dev dummy2 carrier off
# ip -4 r s
192.0.2.0/24 dev dummy1 proto kernel scope link src 192.0.2.1
198.51.100.0/24 dev dummy2 scope link src 192.0.2.1 linkdown
# ip address del 192.0.2.1/24 dev dummy1
# ip -4 r s

>  			hlist_del_rcu(&fa->fa_list);
>  			fib_release_info(fa->fa_info);
>  			alias_free_mem_rcu(fa);

> > > also seems to be consistent with IPv6 given that rt6_remove_prefsrc()
> > > calls fib6_clean_all() and not fib6_clean_all_skip_notify().
> >
> > Agree. Imagine the case of 3 million routes and device goes down.
> > There is a reason IPv4 behaves the way it does.
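[Editor's note] Why this scenario defeats the proposed check can be sketched as follows (an illustrative Python model with an invented helper name, not kernel code): the 198.51.100.0/24 route via dummy2 already carries RTNH_F_LINKDOWN because carrier is off, so filtering on that flag suppresses the notification even though the flush is triggered by deleting the preferred source address on dummy1.

```python
# Illustrative model (hypothetical names): keying the notification on
# RTNH_F_LINKDOWN conflates "why the route is being flushed" with
# "what state the route happens to be in".

RTNH_F_LINKDOWN = 0x10  # flag value as in include/uapi/linux/rtnetlink.h

def notifies_with_linkdown_check(fib_flags: int) -> bool:
    """The proposed patch: send RTM_DELROUTE only if not linkdown."""
    return not (fib_flags & RTNH_F_LINKDOWN)

# 198.51.100.0/24 via dummy2: carrier is off, so the route is marked
# linkdown -- yet it is flushed because its prefsrc 192.0.2.1 was
# deleted, a case where a notification *is* wanted.
prefsrc_flush_route_flags = RTNH_F_LINKDOWN
assert not notifies_with_linkdown_check(prefsrc_flush_route_flags)

# An ordinary route with carrier up would be notified as intended.
assert notifies_with_linkdown_check(0)
```

The check would therefore need to know the flush reason (or the triggering device), not just the route's flags.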
On Thu, Jul 20, 2023 at 05:29:58PM +0300, Ido Schimmel wrote: > > > > IMO, the number of routes being flushed because a preferred source > > > > address is deleted is significantly lower compared to interface down / > > > > deletion, so generating notifications in this case is probably OK. It > > > > How about ignore route deletion for link down? e.g. > > > > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > > index 74d403dbd2b4..11c0f325e887 100644 > > --- a/net/ipv4/fib_trie.c > > +++ b/net/ipv4/fib_trie.c > > @@ -2026,6 +2026,7 @@ void fib_table_flush_external(struct fib_table *tb) > > int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all) > > { > > struct trie *t = (struct trie *)tb->tb_data; > > + struct nl_info info = { .nl_net = net }; > > struct key_vector *pn = t->kv; > > unsigned long cindex = 1; > > struct hlist_node *tmp; > > @@ -2088,6 +2089,11 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all) > > > > fib_notify_alias_delete(net, n->key, &n->leaf, fa, > > NULL); > > + if (!(fi->fib_flags & RTNH_F_LINKDOWN)) { > > + rtmsg_fib(RTM_DELROUTE, htonl(n->key), fa, > > + KEYLENGTH - fa->fa_slen, tb->tb_id, &info, 0); > > + } > > Will you get a notification in this case for 198.51.100.0/24? No. Do you think it is expected with this patch or not? 
>
> # ip link add name dummy1 up type dummy
> # ip link add name dummy2 up type dummy
> # ip address add 192.0.2.1/24 dev dummy1
> # ip route add 198.51.100.0/24 dev dummy2 src 192.0.2.1
> # ip link set dev dummy2 carrier off
> # ip -4 r s
> 192.0.2.0/24 dev dummy1 proto kernel scope link src 192.0.2.1
> 198.51.100.0/24 dev dummy2 scope link src 192.0.2.1 linkdown
> # ip address del 192.0.2.1/24 dev dummy1
> # ip -4 r s

+ ip link set dev dummy2 carrier off
+ sleep 1
+ ip -4 r s
default via 10.73.131.254 dev eth0 proto dhcp src 10.73.131.181 metric 100
10.73.130.0/23 dev eth0 proto kernel scope link src 10.73.131.181 metric 100
192.0.2.0/24 dev dummy1 proto kernel scope link src 192.0.2.1
198.51.100.0/24 dev dummy2 scope link src 192.0.2.1 linkdown
+ sleep 1
+ ip address del 192.0.2.1/24 dev dummy1
Deleted 192.0.2.0/24 dev dummy1 proto kernel scope link src 192.0.2.1
Deleted broadcast 192.0.2.255 dev dummy1 table local proto kernel scope link src 192.0.2.1
Deleted local 192.0.2.1 dev dummy1 table local proto kernel scope host src 192.0.2.1
+ sleep 1
+ ip -4 r s
default via 10.73.131.254 dev eth0 proto dhcp src 10.73.131.181 metric 100
10.73.130.0/23 dev eth0 proto kernel scope link src 10.73.131.181 metric 100

Thanks
Hangbin
On 7/20/23 7:34 PM, Hangbin Liu wrote: > On Thu, Jul 20, 2023 at 05:29:58PM +0300, Ido Schimmel wrote: >>>>> IMO, the number of routes being flushed because a preferred source >>>>> address is deleted is significantly lower compared to interface down / >>>>> deletion, so generating notifications in this case is probably OK. It >>> >>> How about ignore route deletion for link down? e.g. >>> >>> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c >>> index 74d403dbd2b4..11c0f325e887 100644 >>> --- a/net/ipv4/fib_trie.c >>> +++ b/net/ipv4/fib_trie.c >>> @@ -2026,6 +2026,7 @@ void fib_table_flush_external(struct fib_table *tb) >>> int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all) >>> { >>> struct trie *t = (struct trie *)tb->tb_data; >>> + struct nl_info info = { .nl_net = net }; >>> struct key_vector *pn = t->kv; >>> unsigned long cindex = 1; >>> struct hlist_node *tmp; >>> @@ -2088,6 +2089,11 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all) >>> >>> fib_notify_alias_delete(net, n->key, &n->leaf, fa, >>> NULL); >>> + if (!(fi->fib_flags & RTNH_F_LINKDOWN)) { >>> + rtmsg_fib(RTM_DELROUTE, htonl(n->key), fa, >>> + KEYLENGTH - fa->fa_slen, tb->tb_id, &info, 0); >>> + } >> >> Will you get a notification in this case for 198.51.100.0/24? > > No. Do you think it is expected with this patch or not? The intent is that notifications are sent for link events but not route events which are easily deduced from the link events.
On Thu, Jul 20, 2023 at 10:01:06PM -0600, David Ahern wrote: > >>> How about ignore route deletion for link down? e.g. > >>> > >>> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > >>> index 74d403dbd2b4..11c0f325e887 100644 > >>> --- a/net/ipv4/fib_trie.c > >>> +++ b/net/ipv4/fib_trie.c > >>> @@ -2026,6 +2026,7 @@ void fib_table_flush_external(struct fib_table *tb) > >>> int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all) > >>> { > >>> struct trie *t = (struct trie *)tb->tb_data; > >>> + struct nl_info info = { .nl_net = net }; > >>> struct key_vector *pn = t->kv; > >>> unsigned long cindex = 1; > >>> struct hlist_node *tmp; > >>> @@ -2088,6 +2089,11 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all) > >>> > >>> fib_notify_alias_delete(net, n->key, &n->leaf, fa, > >>> NULL); > >>> + if (!(fi->fib_flags & RTNH_F_LINKDOWN)) { > >>> + rtmsg_fib(RTM_DELROUTE, htonl(n->key), fa, > >>> + KEYLENGTH - fa->fa_slen, tb->tb_id, &info, 0); > >>> + } > >> > >> Will you get a notification in this case for 198.51.100.0/24? > > > > No. Do you think it is expected with this patch or not? > > The intent is that notifications are sent for link events but not route > events which are easily deduced from the link events. Sorry, I didn't get what you mean. The link events should be notified to user via rtmsg_ifinfo_event()? So I think here we ignore the route events caused by link down looks reasonable. Thanks Hangbin
On Fri, Jul 21, 2023 at 01:46:13PM +0800, Hangbin Liu wrote: > On Thu, Jul 20, 2023 at 10:01:06PM -0600, David Ahern wrote: > > >>> How about ignore route deletion for link down? e.g. > > >>> > > >>> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > > >>> index 74d403dbd2b4..11c0f325e887 100644 > > >>> --- a/net/ipv4/fib_trie.c > > >>> +++ b/net/ipv4/fib_trie.c > > >>> @@ -2026,6 +2026,7 @@ void fib_table_flush_external(struct fib_table *tb) > > >>> int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all) > > >>> { > > >>> struct trie *t = (struct trie *)tb->tb_data; > > >>> + struct nl_info info = { .nl_net = net }; > > >>> struct key_vector *pn = t->kv; > > >>> unsigned long cindex = 1; > > >>> struct hlist_node *tmp; > > >>> @@ -2088,6 +2089,11 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all) > > >>> > > >>> fib_notify_alias_delete(net, n->key, &n->leaf, fa, > > >>> NULL); > > >>> + if (!(fi->fib_flags & RTNH_F_LINKDOWN)) { > > >>> + rtmsg_fib(RTM_DELROUTE, htonl(n->key), fa, > > >>> + KEYLENGTH - fa->fa_slen, tb->tb_id, &info, 0); > > >>> + } > > >> > > >> Will you get a notification in this case for 198.51.100.0/24? > > > > > > No. Do you think it is expected with this patch or not? > > > > The intent is that notifications are sent for link events but not route > > events which are easily deduced from the link events. > > Sorry, I didn't get what you mean. The link events should be notified to user > via rtmsg_ifinfo_event()? So I think here we ignore the route events caused by > link down looks reasonable. The route in the scenario I mentioned wasn't deleted because of a link event, but because the source address was deleted yet no notification was emitted. IMO, this is wrong given the description of the patch. I assume NetworkManager already knows how to delete routes given RTM_DELLINK events. Can't it be taught to react to RTM_DELADDR events as well? 
Then this functionality will always work, regardless of the kernel version being used.
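[Editor's note] As a rough illustration of the suggestion above, a userspace daemon can subscribe to the rtnetlink IPv4 address multicast group and react to RTM_DELADDR itself. A minimal Python sketch (this is not NetworkManager code; `open_addr_monitor` and `parse_nlmsg_types` are invented helpers, and the socket part assumes Linux):

```python
import socket
import struct

RTM_DELADDR = 21           # from <linux/rtnetlink.h>
RTMGRP_IPV4_IFADDR = 0x10  # multicast group for IPv4 address events
NLMSG_HDRLEN = 16          # struct nlmsghdr: len, type, flags, seq, pid

def parse_nlmsg_types(buf: bytes):
    """Walk a buffer of netlink messages and yield each message type."""
    off = 0
    while off + NLMSG_HDRLEN <= len(buf):
        length, mtype, _flags, _seq, _pid = struct.unpack_from("=IHHII", buf, off)
        if length < NLMSG_HDRLEN:
            break  # malformed message, stop
        yield mtype
        off += (length + 3) & ~3  # NLMSG_ALIGN to a 4-byte boundary

def open_addr_monitor():
    """Open an rtnetlink socket subscribed to IPv4 address events (Linux only)."""
    s = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE)
    s.bind((0, RTMGRP_IPV4_IFADDR))
    return s

# Offline check of the parser with a hand-crafted RTM_DELADDR header:
msg = struct.pack("=IHHII", NLMSG_HDRLEN, RTM_DELADDR, 0, 1, 0)
assert list(parse_nlmsg_types(msg)) == [RTM_DELADDR]
```

A real daemon would recv() in a loop and, on RTM_DELADDR, prune or rescan routes whose preferred source matches the deleted address — which works regardless of the kernel version, as Ido notes.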
On Sun, Jul 23, 2023 at 10:38:11AM +0300, Ido Schimmel wrote:
> > > >>> @@ -2088,6 +2089,11 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
> > > >>>
> > > >>>  			fib_notify_alias_delete(net, n->key, &n->leaf, fa,
> > > >>>  						NULL);
> > > >>> +			if (!(fi->fib_flags & RTNH_F_LINKDOWN)) {
> > > >>> +				rtmsg_fib(RTM_DELROUTE, htonl(n->key), fa,
> > > >>> +					  KEYLENGTH - fa->fa_slen, tb->tb_id, &info, 0);
> > > >>> +			}
> > > >>
> > > >> Will you get a notification in this case for 198.51.100.0/24?
> > > >
> > > > No. Do you think it is expected with this patch or not?
> > >
> > > The intent is that notifications are sent for link events but not route
> > > events which are easily deduced from the link events.
> >
> > Sorry, I didn't get what you mean. The link events should be notified to user
> > via rtmsg_ifinfo_event()? So I think here we ignore the route events caused by
> > link down looks reasonable.
>
> The route in the scenario I mentioned wasn't deleted because of a link
> event, but because the source address was deleted yet no notification
> was emitted. IMO, this is wrong given the description of the patch.

OK, I got what you mean now. Is there a way to get the device the route is
bound to in fib_table_flush() so we can also check the dev flag?

> I assume NetworkManager already knows how to delete routes given
> RTM_DELLINK events. Can't it be taught to react to RTM_DELADDR events as
> well? Then this functionality will always work, regardless of the kernel
> version being used.

The NetworkManager keeps a cache of the routes. Missing/Wrong events mean that
the cache becomes inconsistent. The IPv4 will not send src route delete info
if it's bond to other device. While IPv6 only modify the src route instead of
delete it, and also no notify. So NetworkManager developers complained and
hope to have a consistent and clear notification about route modify/delete.

Thanks
Hangbin
On Mon, 24 Jul 2023 16:56:37 +0800 Hangbin Liu <liuhangbin@gmail.com> wrote: > The NetworkManager keeps a cache of the routes. Missing/Wrong events mean that > the cache becomes inconsistent. The IPv4 will not send src route delete info > if it's bond to other device. While IPv6 only modify the src route instead of > delete it, and also no notify. So NetworkManager developers complained and > hope to have a consistent and clear notification about route modify/delete. Read FRR they get it right. The routing daemons have to track kernel, and the semantics have been worked out for years.
On Mon, Jul 24, 2023 at 08:48:20AM -0700, Stephen Hemminger wrote: > On Mon, 24 Jul 2023 16:56:37 +0800 > Hangbin Liu <liuhangbin@gmail.com> wrote: > > > The NetworkManager keeps a cache of the routes. Missing/Wrong events mean that > > the cache becomes inconsistent. The IPv4 will not send src route delete info > > if it's bond to other device. While IPv6 only modify the src route instead of > > delete it, and also no notify. So NetworkManager developers complained and > > hope to have a consistent and clear notification about route modify/delete. > > Read FRR they get it right. The routing daemons have to track kernel, > and the semantics have been worked out for years. Yes, normally the routing daemon need to track kernel. On the other hand, the kernel also need to make a clear feedback. The userspace developers may not know the kernel code very well. The unclear/inconsistent notification would make them confused. Thanks Hangbin
Hello,

kernel test robot noticed "kernel-selftests.net.fib_nexthops.sh.fail" on:

commit: 4c189e1954949695724236c7c1f182a39e67b262 ("[PATCH net-next] ipv4/fib: send RTM_DELROUTE notify when flush fib")
url: https://github.com/intel-lab-lkp/linux/commits/Hangbin-Liu/ipv4-fib-send-RTM_DELROUTE-notify-when-flush-fib/20230718-195536
base: https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git d3750076d4641b697da990b6aee5b096a10c4d12
patch link: https://lore.kernel.org/all/20230718080044.2738833-1-liuhangbin@gmail.com/
patch subject: [PATCH net-next] ipv4/fib: send RTM_DELROUTE notify when flush fib

in testcase: kernel-selftests
version: kernel-selftests-x86_64-60acb023-1_20230329
with following parameters:

	group: net
	test: fib_nexthops.sh

compiler: gcc-12
test machine: 36 threads 1 sockets Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz (Cascade Lake) with 32G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags

| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202307252113.2b2036b2-oliver.sang@intel.com

TAP version 13
1..1
# timeout set to 3600
# selftests: net: fib_nexthops.sh
#
# Basic functional tests
# ----------------------
# TEST: List with nothing defined                          [ OK ]
...
# # IPv4 nexthop api compat mode
# # ----------------------------
# TEST: IPv4 default nexthop compat mode check             [ OK ]
# TEST: IPv4 compat mode on - route add notification       [ OK ]
# TEST: IPv4 compat mode on - route dump                   [ OK ]
# TEST: IPv4 compat mode on - nexthop change               [ OK ]
# TEST: IPv4 set compat mode - 0                           [ OK ]
# TEST: IPv4 compat mode off - route add notification      [ OK ]
# TEST: IPv4 compat mode off - route dump                  [ OK ]
# TEST: IPv4 compat mode off - nexthop change              [ OK ]
# TEST: IPv4 compat mode off - nexthop delete              [FAIL] <---
# TEST: IPv4 set compat mode - 1                           [ OK ]
#
# # IPv4 fdb groups functional
# # --------------------------
# TEST: Fdb Nexthop group with multiple nexthops           [ OK ]
...
# # IPv6 runtime resilient nexthop group torture
# # --------------------------------------------
# TEST: IPv6 resilient nexthop group torture test          [ OK ]
#
# # Tests passed: 227
# # Tests failed:   1
not ok 1 selftests: net: fib_nexthops.sh # exit=1

To reproduce:

	git clone https://github.com/intel/lkp-tests.git
	cd lkp-tests
	sudo bin/lkp install job.yaml           # job file is attached in this email
	bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
	sudo bin/lkp run generated-yaml-file

	# if come across any failure that blocks the test,
	# please remove ~/.lkp and /lkp dir to run from a clean state.
On Tue, 25 Jul 2023 16:20:59 +0800 Hangbin Liu <liuhangbin@gmail.com> wrote: > On Mon, Jul 24, 2023 at 08:48:20AM -0700, Stephen Hemminger wrote: > > On Mon, 24 Jul 2023 16:56:37 +0800 > > Hangbin Liu <liuhangbin@gmail.com> wrote: > > > > > The NetworkManager keeps a cache of the routes. Missing/Wrong events mean that > > > the cache becomes inconsistent. The IPv4 will not send src route delete info > > > if it's bond to other device. While IPv6 only modify the src route instead of > > > delete it, and also no notify. So NetworkManager developers complained and > > > hope to have a consistent and clear notification about route modify/delete. > > > > Read FRR they get it right. The routing daemons have to track kernel, > > and the semantics have been worked out for years. > > Yes, normally the routing daemon need to track kernel. On the other hand, > the kernel also need to make a clear feedback. The userspace developers may > not know the kernel code very well. The unclear/inconsistent notification > would make them confused. Right, that should be addressed by clearer documentation of the semantics and the rational.
Hi Stephen, Ido, David,

On Mon, Jul 24, 2023 at 08:48:20AM -0700, Stephen Hemminger wrote:
> On Mon, 24 Jul 2023 16:56:37 +0800
> Hangbin Liu <liuhangbin@gmail.com> wrote:
>
> > The NetworkManager keeps a cache of the routes. Missing/Wrong events mean that
> > the cache becomes inconsistent. The IPv4 will not send src route delete info
> > if it's bond to other device. While IPv6 only modify the src route instead of
> > delete it, and also no notify. So NetworkManager developers complained and
> > hope to have a consistent and clear notification about route modify/delete.
>
> Read FRR they get it right. The routing daemons have to track kernel,
> and the semantics have been worked out for years.

Since we are talking about whether we should fix the issues or doc them. I
have some other route issues reported by NetworkManager developers. And want
discuss with you.

For IPv4, we add new route instead append the nexthop to same dest(or do I
miss something?). Since the route are not merged, the nexthop weight is not
shown, which make them look like the same for users. For IPv4, the scope is
also not shown, which look like the same for users.

While IPv6 will append another nexthop to the route if dest is same. But there
are 2 issues here:
1. the *type* and *protocol* field are actally ignored
2. when do `ip monitor route`, the info dumpped in fib6_add_rt2node()
   use the config info from user space. When means `ip monitor` show the
   incorrect type and protocol

So my questions are, should we show weight/scope for IPv4? How to deal the
type/proto info missing for IPv6? How to deal with the difference of merging
policy for IPv4/IPv6?
Here is the reproducer:

+ ip link add dummy0 up type dummy
+ ip link add dummy1 up type dummy
+ ip link add dummy2 up type dummy
+ ip addr add 172.16.104.1/24 dev dummy1
+ ip addr add 172.16.104.2/24 dev dummy2
+ ip route add 172.16.105.0/24 table 100 via 172.16.104.100 dev dummy1
+ ip route append 172.16.105.0/24 table 100 via 172.16.104.100 dev dummy2
+ ip route add 172.16.106.0/24 table 100 nexthop via 172.16.104.100 dev dummy1 weight 1
+ ip route append 172.16.106.0/24 table 100 nexthop via 172.16.104.100 dev dummy1 weight 2
+ ip route show table 100
172.16.105.0/24 via 172.16.104.100 dev dummy1
172.16.105.0/24 via 172.16.104.100 dev dummy2
172.16.106.0/24 via 172.16.104.100 dev dummy1
172.16.106.0/24 via 172.16.104.100 dev dummy1
+ ip route add local default dev dummy1 table 200
+ ip route add 172.16.107.0/24 table 200 nexthop via 172.16.104.100 dev dummy1
+ ip route prepend default dev dummy1 table 200
+ ip route append 172.16.107.0/24 table 200 nexthop via 172.16.104.100 dev dummy1
+ ip route show table 200
default dev dummy1 scope link
local default dev dummy1 scope host
172.16.107.0/24 via 172.16.104.100 dev dummy1
172.16.107.0/24 via 172.16.104.100 dev dummy1
+ ip addr add 2001:db8:101::1/64 dev dummy1
+ ip addr add 2001:db8:101::2/64 dev dummy2
+ ip route add 2001:db8:102::/64 via 2001:db8:101::10 dev dummy1 table 100
+ ip route prepend 2001:db8:102::/64 via 2001:db8:101::10 dev dummy2 table 100
+ ip route add local 2001:db8:103::/64 via 2001:db8:101::10 dev dummy1 table 100
+ ip route prepend unicast 2001:db8:103::/64 via 2001:db8:101::10 dev dummy2 table 100
+ ip monitor route &
+ sleep 1
+ ip route add 2001:db8:104::/64 via 2001:db8:101::10 dev dummy1 proto kernel table 100
2001:db8:104::/64 via 2001:db8:101::10 dev dummy1 table 100 proto kernel metric 1024 pref medium
+ ip route prepend 2001:db8:104::/64 via 2001:db8:101::10 dev dummy2 proto bgp table 100
2001:db8:104::/64 table 100 proto bgp metric 1024 pref medium
	nexthop via 2001:db8:101::10 dev dummy2 weight 1
	nexthop via 2001:db8:101::10 dev dummy1 weight 1
+ ip -6 route show table 100
2001:db8:102::/64 metric 1024 pref medium
	nexthop via 2001:db8:101::10 dev dummy1 weight 1
	nexthop via 2001:db8:101::10 dev dummy2 weight 1
local 2001:db8:103::/64 metric 1024 pref medium
	nexthop via 2001:db8:101::10 dev dummy1 weight 1
	nexthop via 2001:db8:101::10 dev dummy2 weight 1
2001:db8:104::/64 proto kernel metric 1024 pref medium
	nexthop via 2001:db8:101::10 dev dummy1 weight 1
	nexthop via 2001:db8:101::10 dev dummy2 weight 1
+ kill $!

Thanks
Hangbin
On 7/26/23 4:17 AM, Hangbin Liu wrote: > Hi Stephen, Ido, David, > On Mon, Jul 24, 2023 at 08:48:20AM -0700, Stephen Hemminger wrote: >> On Mon, 24 Jul 2023 16:56:37 +0800 >> Hangbin Liu <liuhangbin@gmail.com> wrote: >> >>> The NetworkManager keeps a cache of the routes. Missing/Wrong events mean that >>> the cache becomes inconsistent. The IPv4 will not send src route delete info >>> if it's bond to other device. While IPv6 only modify the src route instead of >>> delete it, and also no notify. So NetworkManager developers complained and >>> hope to have a consistent and clear notification about route modify/delete. >> >> Read FRR they get it right. The routing daemons have to track kernel, >> and the semantics have been worked out for years. > > Since we are talking about whether we should fix the issues or doc them. I > have some other route issues reported by NetworkManager developers. And want > discuss with you. > > For IPv4, we add new route instead append the nexthop to same dest(or do I > miss something?). Since the route are not merged, the nexthop weight is not > shown, which make them look like the same for users. For IPv4, the scope is > also not shown, which look like the same for users. > > While IPv6 will append another nexthop to the route if dest is same. But there > are 2 issues here: > 1. the *type* and *protocol* field are actally ignored > 2. when do `ip monitor route`, the info dumpped in fib6_add_rt2node() > use the config info from user space. When means `ip monitor` show the > incorrect type and protocol > > So my questions are, should we show weight/scope for IPv4? How to deal the > type/proto info missing for IPv6? How to deal with the difference of merging > policy for IPv4/IPv6? 
>
> Here is the reproducer:
>
> + ip link add dummy0 up type dummy
> + ip link add dummy1 up type dummy
> + ip link add dummy2 up type dummy
> + ip addr add 172.16.104.1/24 dev dummy1
> + ip addr add 172.16.104.2/24 dev dummy2
> + ip route add 172.16.105.0/24 table 100 via 172.16.104.100 dev dummy1
> + ip route append 172.16.105.0/24 table 100 via 172.16.104.100 dev dummy2
> + ip route add 172.16.106.0/24 table 100 nexthop via 172.16.104.100 dev dummy1 weight 1
> + ip route append 172.16.106.0/24 table 100 nexthop via 172.16.104.100 dev dummy1 weight 2

Weight only has meaning with a multipath route. In both of these cases
these are 2 separate entries in the FIB with the second one only hit
under certain conditions.

> + ip route show table 100
> 172.16.105.0/24 via 172.16.104.100 dev dummy1
> 172.16.105.0/24 via 172.16.104.100 dev dummy2
> 172.16.106.0/24 via 172.16.104.100 dev dummy1
> 172.16.106.0/24 via 172.16.104.100 dev dummy1
>
> + ip route add local default dev dummy1 table 200
> + ip route add 172.16.107.0/24 table 200 nexthop via 172.16.104.100 dev dummy1
> + ip route prepend default dev dummy1 table 200
> + ip route append 172.16.107.0/24 table 200 nexthop via 172.16.104.100 dev dummy1

similarly here with prepend and append.

For all of these, look at fib_tests.sh, ipv4_rt_add(). It runs through
combination of flags and in some cases only documents existing behavior.
> + ip route show table 200
> default dev dummy1 scope link
> local default dev dummy1 scope host
> 172.16.107.0/24 via 172.16.104.100 dev dummy1
> 172.16.107.0/24 via 172.16.104.100 dev dummy1
>
> + ip addr add 2001:db8:101::1/64 dev dummy1
> + ip addr add 2001:db8:101::2/64 dev dummy2
> + ip route add 2001:db8:102::/64 via 2001:db8:101::10 dev dummy1 table 100
> + ip route prepend 2001:db8:102::/64 via 2001:db8:101::10 dev dummy2 table 100
> + ip route add local 2001:db8:103::/64 via 2001:db8:101::10 dev dummy1 table 100
> + ip route prepend unicast 2001:db8:103::/64 via 2001:db8:101::10 dev dummy2 table 1

Unfortunately the original IPv6 multipath implementation did not follow
the same semantics as IPv4. Each leg in a MP route is a separate entry
and the append and prepend work differently for v6. :-(

This difference is one of the many goals of the separate nexthop objects
-- aligning ipv4 and ipv6 behavior which can only be done with a new
API. There were many attempts to make the legacy route infrastructure
more closely aligned between v4 and v6 and inevitably each was reverted
because it broke some existing user.
> + ip monitor route &
> + sleep 1
> + ip route add 2001:db8:104::/64 via 2001:db8:101::10 dev dummy1 proto kernel table 100
> 2001:db8:104::/64 via 2001:db8:101::10 dev dummy1 table 100 proto kernel metric 1024 pref medium
> + ip route prepend 2001:db8:104::/64 via 2001:db8:101::10 dev dummy2 proto bgp table 100
> 2001:db8:104::/64 table 100 proto bgp metric 1024 pref medium
> 	nexthop via 2001:db8:101::10 dev dummy2 weight 1
> 	nexthop via 2001:db8:101::10 dev dummy1 weight 1
> + ip -6 route show table 100
> 2001:db8:102::/64 metric 1024 pref medium
> 	nexthop via 2001:db8:101::10 dev dummy1 weight 1
> 	nexthop via 2001:db8:101::10 dev dummy2 weight 1
> local 2001:db8:103::/64 metric 1024 pref medium
> 	nexthop via 2001:db8:101::10 dev dummy1 weight 1
> 	nexthop via 2001:db8:101::10 dev dummy2 weight 1
> 2001:db8:104::/64 proto kernel metric 1024 pref medium
> 	nexthop via 2001:db8:101::10 dev dummy1 weight 1
> 	nexthop via 2001:db8:101::10 dev dummy2 weight 1
> + kill $!
>
> Thanks
> Hangbin
On Wed, Jul 26, 2023 at 09:57:59AM -0600, David Ahern wrote:
> > So my questions are: should we show weight/scope for IPv4? How do we deal
> > with the type/proto info being dropped for IPv6? How do we deal with the
> > difference in merging policy between IPv4 and IPv6?
> > + ip route add 172.16.105.0/24 table 100 via 172.16.104.100 dev dummy1
> > + ip route append 172.16.105.0/24 table 100 via 172.16.104.100 dev dummy2
>
> > + ip route add 172.16.106.0/24 table 100 nexthop via 172.16.104.100 dev dummy1 weight 1
> > + ip route append 172.16.106.0/24 table 100 nexthop via 172.16.104.100 dev dummy1 weight 2
>
> Weight only has meaning with a multipath route. In both of these cases
> these are 2 separate entries in the FIB

Yes, we know these are 2 separate entries. The NM developers know these
are 2 separate entries. But the users don't know, and the routing daemon
doesn't know. If a user adds these 2 entries, and the kernel shows them
as the same, the routing daemon will store them as a single entry. But
if the user then deletes the entry, we actually delete one and leave one
in the kernel. This confuses both the routing daemon and the user.

So my question is: should we export the weight/scope? Or stop the user
from adding the second entry? Or just leave it as is and ask routing
daemons/users to try the new nexthop API?

> with the second one only hit under certain conditions.

Just curious, under what kind of conditions will we hit the second one?
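[Editor's illustration of the cache confusion described above — a hypothetical
userspace model, not NetworkManager's actual code. If a daemon keys its route
cache on the attributes the kernel reports for these routes (weight is dropped,
since it is not shown for non-multipath IPv4 routes), the two FIB entries
collapse into one cache entry, and a single delete then empties the cache while
the kernel still holds a route:]

```python
# Hypothetical route cache keyed only on the attributes visible in an
# IPv4 route dump for non-multipath routes. All names are illustrative.
routes_added = [
    {"table": 100, "dst": "172.16.106.0/24", "via": "172.16.104.100",
     "dev": "dummy1", "weight": 1},
    {"table": 100, "dst": "172.16.106.0/24", "via": "172.16.104.100",
     "dev": "dummy1", "weight": 2},
]

def visible_key(route):
    # weight is intentionally excluded: the kernel does not report it
    # for non-multipath IPv4 routes, so the daemon cannot key on it
    return (route["table"], route["dst"], route["via"], route["dev"])

# Two kernel FIB entries produce a single cache entry...
cache = {visible_key(r) for r in routes_added}
print(len(cache))   # one entry for two FIB routes

# ...so after one delete the daemon believes no route remains, while the
# kernel still holds the second FIB entry.
cache.discard(visible_key(routes_added[0]))
print(len(cache))
```

This is only a sketch of why the missing weight makes the two entries
indistinguishable to a cache; it does not model netlink itself.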
>
> > + ip route show table 200
> > default dev dummy1 scope link
> > local default dev dummy1 scope host
> > 172.16.107.0/24 via 172.16.104.100 dev dummy1
> > 172.16.107.0/24 via 172.16.104.100 dev dummy1
> >
> > + ip addr add 2001:db8:101::1/64 dev dummy1
> > + ip addr add 2001:db8:101::2/64 dev dummy2
> > + ip route add 2001:db8:102::/64 via 2001:db8:101::10 dev dummy1 table 100
> > + ip route prepend 2001:db8:102::/64 via 2001:db8:101::10 dev dummy2 table 100
> > + ip route add local 2001:db8:103::/64 via 2001:db8:101::10 dev dummy1 table 100
> > + ip route prepend unicast 2001:db8:103::/64 via 2001:db8:101::10 dev dummy2 table 1
> Unfortunately the original IPv6 multipath implementation did not follow
> the same semantics as IPv4. Each leg in a MP route is a separate entry
> and the append and prepend work differently for v6. :-(
>
> This difference is one of the many goals of the separate nexthop objects
> -- aligning ipv4 and ipv6 behavior which can only be done with a new
> API. There were many attempts to make the legacy route infrastructure
> more closely aligned between v4 and v6 and inevitably each was reverted
> because it broke some existing user.

Yes, I understand the difficulty and risk of aligning the v4/v6 behavior.
On the other hand, changing to the new nexthop API is also a lot of work
for the routing daemons. Here is a quote from the NM developers' reply to
me:

"If the issues (this and others) of the netlink API for route objects can be
fixed, then there seems less reason to change NetworkManager to nexthop
objects. If it cannot (won't) be fixed, then that would be another argument
for using nexthop objects..."

I will check if all the issues can be fixed with the new nexthop API.

Thanks
Hangbin
On Wed, Jul 26, 2023 at 06:17:05PM +0800, Hangbin Liu wrote:
> Hi Stephen, Ido, David,
>
> On Mon, Jul 24, 2023 at 08:48:20AM -0700, Stephen Hemminger wrote:
> > On Mon, 24 Jul 2023 16:56:37 +0800
> > Hangbin Liu <liuhangbin@gmail.com> wrote:
> >
> > > NetworkManager keeps a cache of the routes. Missing/wrong events mean that
> > > the cache becomes inconsistent. IPv4 will not send a src route delete
> > > notification if the address is bound to another device, while IPv6 only
> > > modifies the src route instead of deleting it, also without a notification.
> > > So the NetworkManager developers complained and hope to have consistent
> > > and clear notifications about route modification/deletion.
> >
> > Read FRR, they get it right. The routing daemons have to track the kernel,
> > and the semantics have been worked out for years.
>
> Since we are talking about whether we should fix these issues or document
> them, I have some other route issues reported by the NetworkManager
> developers that I want to discuss with you.
>
> For IPv4, we add a new route instead of appending the nexthop to the same
> destination (or do I miss something?).

The append / prepend trick to create a multipath route is an IPv6 hack.
The correct way to install a multipath route is to add it in one go like
in the IPv4 implementation (which predates the IPv6 implementation) or
to use the nexthop API.

> Since the routes are not merged, the nexthop weight is not shown, which
> makes them look the same to users. For IPv4, the scope is also not
> shown, which makes them look the same to users.

The routes are the same, but separate. They do not form a multipath
route. Weight is meaningless for a non-multipath route.

> IPv6, on the other hand, will append another nexthop to the route if the
> destination is the same.

Yes, that's a hack.

> But there are two issues here:
> 1. the *type* and *protocol* fields are actually ignored
> 2. when running `ip monitor route`, the info dumped in fib6_add_rt2node()
>    uses the config info from user space,
>    which means `ip monitor` shows the incorrect type and protocol
>
> So my questions are: should we show weight/scope for IPv4? How do we deal
> with the type/proto info being dropped for IPv6? How do we deal with the
> difference in merging policy between IPv4 and IPv6?

In my opinion, if you want consistent behavior between IPv4 and IPv6 for
multipath routes, then I suggest using the nexthop API. It was merged in
5.3 (IIRC) and FRR started using it by default a few years ago. Other
than a few bugs that were fixed, I don't remember many complaints.

Also, any nexthop-related features will only be implemented in the
nexthop API, not in the legacy API. Resilient nexthop groups are one
example.
On 7/26/23 10:19 PM, Hangbin Liu wrote:
> On Wed, Jul 26, 2023 at 09:57:59AM -0600, David Ahern wrote:
>>> So my questions are: should we show weight/scope for IPv4? How do we deal
>>> with the type/proto info being dropped for IPv6? How do we deal with the
>>> difference in merging policy between IPv4 and IPv6?
>>> + ip route add 172.16.105.0/24 table 100 via 172.16.104.100 dev dummy1
>>> + ip route append 172.16.105.0/24 table 100 via 172.16.104.100 dev dummy2
>>
>>> + ip route add 172.16.106.0/24 table 100 nexthop via 172.16.104.100 dev dummy1 weight 1
>>> + ip route append 172.16.106.0/24 table 100 nexthop via 172.16.104.100 dev dummy1 weight 2
>>
>> Weight only has meaning with a multipath route. In both of these cases
>> these are 2 separate entries in the FIB
>
> Yes, we know these are 2 separate entries. The NM developers know these
> are 2 separate entries. But the users don't know, and the routing daemon
> doesn't know. If a user adds these 2 entries, and the kernel shows them
> as the same, the routing daemon will store them as a single entry. But
> if the user then deletes the entry, we actually delete one and leave one
> in the kernel. This confuses both the routing daemon and the user.
>
> So my question is: should we export the weight/scope? Or stop the user
> from adding the second entry? Or just leave it as is and ask routing
> daemons/users to try the new nexthop API?
>
>> with the second one only hit under certain conditions.
>
> Just curious, under what kind of conditions will we hit the second one?
Look at the checks in net/ipv4/fib_trie.c starting at line 1573 (the
comment before them is "/* Step 3: Process the leaf, if that fails fall
back to backtracing */").

>
>>
>>> + ip route show table 200
>>> default dev dummy1 scope link
>>> local default dev dummy1 scope host
>>> 172.16.107.0/24 via 172.16.104.100 dev dummy1
>>> 172.16.107.0/24 via 172.16.104.100 dev dummy1
>>>
>>> + ip addr add 2001:db8:101::1/64 dev dummy1
>>> + ip addr add 2001:db8:101::2/64 dev dummy2
>>> + ip route add 2001:db8:102::/64 via 2001:db8:101::10 dev dummy1 table 100
>>> + ip route prepend 2001:db8:102::/64 via 2001:db8:101::10 dev dummy2 table 100
>>> + ip route add local 2001:db8:103::/64 via 2001:db8:101::10 dev dummy1 table 100
>>> + ip route prepend unicast 2001:db8:103::/64 via 2001:db8:101::10 dev dummy2 table 1
>> Unfortunately the original IPv6 multipath implementation did not follow
>> the same semantics as IPv4. Each leg in a MP route is a separate entry
>> and the append and prepend work differently for v6. :-(
>>
>> This difference is one of the many goals of the separate nexthop objects
>> -- aligning ipv4 and ipv6 behavior which can only be done with a new
>> API. There were many attempts to make the legacy route infrastructure
>> more closely aligned between v4 and v6 and inevitably each was reverted
>> because it broke some existing user.
>
> Yes, I understand the difficulty and risk of aligning the v4/v6 behavior.
> On the other hand, changing to the new nexthop API is also a lot of work
> for the routing daemons. Here is a quote from the NM developers' reply to me.

It is some level of work, yes, but the netlink message format between
old and new was left as aligned and similar as possible - to make it
easier to move between the old and new API.

>
> "If the issues (this and others) of the netlink API for route objects can be
> fixed, then there seems less reason to change NetworkManager to nexthop
> objects.
> If it cannot (won't) be fixed, then that would be another argument for
> using nexthop objects..."
>
> I will check if all the issues can be fixed with the new nexthop API.
>
> Thanks
> Hangbin
Le 25/07/2023 à 18:36, Stephen Hemminger a écrit :
> On Tue, 25 Jul 2023 16:20:59 +0800
> Hangbin Liu <liuhangbin@gmail.com> wrote:
>
>> On Mon, Jul 24, 2023 at 08:48:20AM -0700, Stephen Hemminger wrote:
>>> On Mon, 24 Jul 2023 16:56:37 +0800
>>> Hangbin Liu <liuhangbin@gmail.com> wrote:
>>>
>>>> NetworkManager keeps a cache of the routes. Missing/wrong events mean that
>>>> the cache becomes inconsistent. IPv4 will not send a src route delete
>>>> notification if the address is bound to another device, while IPv6 only
>>>> modifies the src route instead of deleting it, also without a notification.
>>>> So the NetworkManager developers complained and hope to have consistent
>>>> and clear notifications about route modification/deletion.
>>>
>>> Read FRR, they get it right. The routing daemons have to track the kernel,
>>> and the semantics have been worked out for years.
>>
>> Yes, normally the routing daemon needs to track the kernel. On the other
>> hand, the kernel also needs to give clear feedback. The userspace
>> developers may not know the kernel code very well; unclear or inconsistent
>> notifications confuse them.
>
> Right, that should be addressed by clearer documentation of the semantics
> and the rationale.

Frankly, it's quite complex; there are corner cases.

When an interface is set down, the routes associated with this interface
should be removed. This is the simple part.
But for ECMP routes, there are several cases:
 - if all nexthops use this interface: the route is deleted by the kernel;
 - if only some nexthops use this interface:
   + if all other nexthops already point to a down interface: the route
     is deleted by the kernel;
   + if at least one nexthop points to an up interface: the affected
     nexthops are temporarily disabled.

Managing a cache with this is not so obvious ;-)

My two cents,
Nicolas
On 7/28/23 7:01 AM, Nicolas Dichtel wrote:
> Frankly, it's quite complex, there are corner cases.

yes there are.

>
> When an interface is set down, the routes associated to this interface
> should be removed. This is the simple part.
> But for ecmp routes, there are several cases:
> - if all nh use this interface: the routes are deleted by the kernel;
> - if only some nh uses this interface :
>   + if all other nh already point to a down interface: the route are
>     deleted by the kernel;
>   + if at least one nh points to an up interface: the nh are temporarily
>     disabled.
>
> Managing a cache with this is not so obvious
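[Editor's illustration: the ECMP corner cases Nicolas lists can be written
down as a small decision function. This is only a model of the behavior
described in the thread, not kernel code; all names are illustrative.]

```python
def ecmp_action_on_link_down(nexthop_devs, down_dev, dev_is_up):
    """Model of the described kernel behavior when `down_dev` goes down
    for an ECMP route whose nexthops use the devices in `nexthop_devs`.
    `dev_is_up` maps every other device to its current state."""
    remaining = [d for d in nexthop_devs if d != down_dev]
    if not remaining:
        # All nexthops used the downed interface: route is deleted.
        return "route deleted"
    if not any(dev_is_up[d] for d in remaining):
        # Every other nexthop already points to a down interface.
        return "route deleted"
    # At least one usable leg remains: only the affected legs are disabled.
    return "nexthops on down links disabled"

# All legs on the downed interface -> kernel deletes the route.
print(ecmp_action_on_link_down(["eth0", "eth0"], "eth0", {}))
# One leg survives on an up interface -> legs are only disabled.
print(ecmp_action_on_link_down(["eth0", "eth1"], "eth0", {"eth1": True}))
# The surviving leg is on an interface that is already down -> deleted.
print(ecmp_action_on_link_down(["eth0", "eth1"], "eth0", {"eth1": False}))
```

A userspace cache has to reproduce exactly this decision tree to stay in
sync without notifications, which is the "not so obvious" part.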
On Sun, 2023-07-23 at 10:38 +0300, Ido Schimmel wrote:
> On Fri, Jul 21, 2023 at 01:46:13PM +0800, Hangbin Liu wrote:
>
> I assume NetworkManager already knows how to delete routes given
> RTM_DELLINK events. Can't it be taught to react to RTM_DELADDR events
> as well? Then this functionality will always work, regardless of the
> kernel version being used.

Hi,

NetworkManager will already request a full dump of all routes when an
address gets deleted.

https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/blob/1dff0156278594ab171b59885396cf95aff2af7f/src/libnm-platform/nm-linux-platform.c#L7437

Thomas
On Fri, 2023-07-28 at 09:42 -0600, David Ahern wrote: > On 7/28/23 7:01 AM, Nicolas Dichtel wrote: > > > Managing a cache with this is not so obvious
On Sun, Jul 23, 2023 at 10:38:11AM +0300, Ido Schimmel wrote:
> > > >> Will you get a notification in this case for 198.51.100.0/24?
> > > >
> > > > No. Do you think it is expected with this patch or not?
> > >
> > > The intent is that notifications are sent for link events but not route
> > > events which are easily deduced from the link events.
> >
> > Sorry, I didn't get what you mean. The link events should be notified to
> > the user via rtmsg_ifinfo_event()? So I think ignoring the route events
> > caused by link down looks reasonable here.
>
> The route in the scenario I mentioned wasn't deleted because of a link
> event, but because the source address was deleted, yet no notification
> was emitted. IMO, this is wrong given the description of the patch.

Hi Ido,

I reconsidered this issue this week. As we can't get the device status in
fib_table_flush(), how about adding another flag to track the deleted src
entries, e.g. RTNH_F_UNRESOLVED, which is currently only used in ipmr?
When the route's source address is deleted, the route entry could also be
considered unresolved.
With this idea, the patch could be like:

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 51c13cf9c5ae..5c41d34ab447 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -420,7 +420,7 @@ struct rtnexthop {
 #define RTNH_F_ONLINK		4	/* Gateway is forced on link */
 #define RTNH_F_OFFLOAD		8	/* Nexthop is offloaded */
 #define RTNH_F_LINKDOWN		16	/* carrier-down on nexthop */
-#define RTNH_F_UNRESOLVED	32	/* The entry is unresolved (ipmr) */
+#define RTNH_F_UNRESOLVED	32	/* The entry is unresolved (ipmr/dead src) */
 #define RTNH_F_TRAP		64	/* Nexthop is trapping packets */

 #define RTNH_COMPARE_MASK	(RTNH_F_DEAD | RTNH_F_LINKDOWN | \
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 65ba18a91865..a7ef21a6d271 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1883,7 +1883,7 @@ int fib_sync_down_addr(struct net_device *dev, __be32 local)
 		    fi->fib_tb_id != tb_id)
 			continue;
 		if (fi->fib_prefsrc == local) {
-			fi->fib_flags |= RTNH_F_DEAD;
+			fi->fib_flags |= (RTNH_F_DEAD | RTNH_F_UNRESOLVED);
 			ret++;
 		}
 	}
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 74d403dbd2b4..88c593967063 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2026,6 +2026,7 @@ void fib_table_flush_external(struct fib_table *tb)
 int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
 {
 	struct trie *t = (struct trie *)tb->tb_data;
+	struct nl_info info = { .nl_net = net };
 	struct key_vector *pn = t->kv;
 	unsigned long cindex = 1;
 	struct hlist_node *tmp;
@@ -2088,6 +2089,11 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)

 			fib_notify_alias_delete(net, n->key, &n->leaf, fa,
 						NULL);
+			if (fi->fib_flags & RTNH_F_UNRESOLVED) {
+				fi->fib_flags &= ~RTNH_F_UNRESOLVED;
+				rtmsg_fib(RTM_DELROUTE, htonl(n->key), fa,
+					  KEYLENGTH - fa->fa_slen, tb->tb_id, &info, 0);
+			}
 			hlist_del_rcu(&fa->fa_list);
 			fib_release_info(fa->fa_info);
 			alias_free_mem_rcu(fa);
Thanks Hangbin
On 8/2/23 3:10 AM, Thomas Haller wrote: > On Fri, 2023-07-28 at 09:42 -0600, David Ahern wrote: >> On 7/28/23 7:01 AM, Nicolas Dichtel wrote: >> >>> Managing a cache with this is not so obvious
On 2023-08-07 19:44 -0600, David Ahern wrote: > On 8/2/23 3:10 AM, Thomas Haller wrote: > > On Fri, 2023-07-28 at 09:42 -0600, David Ahern wrote: > >> On 7/28/23 7:01 AM, Nicolas Dichtel wrote: > >> > >>> Managing a cache with this is not so obvious
On Fri, Aug 04, 2023 at 04:09:08PM +0800, Hangbin Liu wrote:
> I reconsidered this issue this week. As we can't get the device status in
> fib_table_flush(), how about adding another flag to track the deleted src
> entries, e.g. RTNH_F_UNRESOLVED, which is currently only used in ipmr?
> When the route's source address is deleted, the route entry could also be
> considered unresolved. With this idea, the patch could be like:
>
> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
> index 51c13cf9c5ae..5c41d34ab447 100644
> --- a/include/uapi/linux/rtnetlink.h
> +++ b/include/uapi/linux/rtnetlink.h
> @@ -420,7 +420,7 @@ struct rtnexthop {
>  #define RTNH_F_ONLINK		4	/* Gateway is forced on link */
>  #define RTNH_F_OFFLOAD		8	/* Nexthop is offloaded */
>  #define RTNH_F_LINKDOWN		16	/* carrier-down on nexthop */
> -#define RTNH_F_UNRESOLVED	32	/* The entry is unresolved (ipmr) */
> +#define RTNH_F_UNRESOLVED	32	/* The entry is unresolved (ipmr/dead src) */
>  #define RTNH_F_TRAP		64	/* Nexthop is trapping packets */
>
>  #define RTNH_COMPARE_MASK	(RTNH_F_DEAD | RTNH_F_LINKDOWN | \

I'm not sure we need to reinterpret the meaning of this uAPI flag.
The FIB info structure currently looks like this:

struct fib_info {
	struct hlist_node          fib_hash;             /*     0    16 */
	struct hlist_node          fib_lhash;            /*    16    16 */
	struct list_head           nh_list;              /*    32    16 */
	struct net *               fib_net;              /*    48     8 */
	refcount_t                 fib_treeref;          /*    56     4 */
	refcount_t                 fib_clntref;          /*    60     4 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	unsigned int               fib_flags;            /*    64     4 */
	unsigned char              fib_dead;             /*    68     1 */
	unsigned char              fib_protocol;         /*    69     1 */
	unsigned char              fib_scope;            /*    70     1 */
	unsigned char              fib_type;             /*    71     1 */
	__be32                     fib_prefsrc;          /*    72     4 */
	u32                        fib_tb_id;            /*    76     4 */
	u32                        fib_priority;         /*    80     4 */

	/* XXX 4 bytes hole, try to pack */

	struct dst_metrics *       fib_metrics;          /*    88     8 */
	int                        fib_nhs;              /*    96     4 */
	bool                       fib_nh_is_v6;         /*   100     1 */
	bool                       nh_updated;           /*   101     1 */

	/* XXX 2 bytes hole, try to pack */

	struct nexthop *           nh;                   /*   104     8 */
	struct callback_head       rcu __attribute__((__aligned__(8))); /*   112    16 */
	/* --- cacheline 2 boundary (128 bytes) --- */
	struct fib_nh              fib_nh[];             /*   128     0 */

	/* size: 128, cachelines: 2, members: 21 */
	/* sum members: 122, holes: 2, sum holes: 6 */
	/* forced alignments: 1 */
} __attribute__((__aligned__(8)));

We can instead represent fib_nh_is_v6 and nh_updated using a single bit:

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index a378eff827c7..a91f8a28689a 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -152,8 +152,8 @@ struct fib_info {
 #define fib_rtt fib_metrics->metrics[RTAX_RTT-1]
 #define fib_advmss fib_metrics->metrics[RTAX_ADVMSS-1]
 	int			fib_nhs;
-	bool			fib_nh_is_v6;
-	bool			nh_updated;
+	u8			fib_nh_is_v6:1,
+				nh_updated:1;
 	struct nexthop		*nh;
 	struct rcu_head		rcu;
 	struct fib_nh		fib_nh[];

And then add another bit there to mark a FIB info that is deleted
because of preferred source address deletion.

I suggest testing with the FIB tests in tools/testing/selftests/net/.
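[Editor's illustration: the size argument behind this suggestion can be
checked with ctypes. The structures below are a mock of just the relevant
tail of struct fib_info, not the real layout, and the extra flag name is
illustrative. Converting the two bools to one-bit fields leaves room for
the additional flag without growing the structure:]

```python
import ctypes

# Mock of the current tail of struct fib_info: two full-byte bools.
class TailBools(ctypes.Structure):
    _fields_ = [("fib_nhs", ctypes.c_int),
                ("fib_nh_is_v6", ctypes.c_bool),
                ("nh_updated", ctypes.c_bool),
                ("nh", ctypes.c_void_p)]

# Mock of the proposed tail: bitfields, plus a hypothetical extra flag
# marking a fib_info dead because its preferred source was deleted.
class TailBits(ctypes.Structure):
    _fields_ = [("fib_nhs", ctypes.c_int),
                ("fib_nh_is_v6", ctypes.c_uint8, 1),
                ("nh_updated", ctypes.c_uint8, 1),
                ("pfsrc_removed", ctypes.c_uint8, 1),  # illustrative name
                ("nh", ctypes.c_void_p)]

print("bools:   ", ctypes.sizeof(TailBools), "bytes")
print("bitfield:", ctypes.sizeof(TailBits), "bytes")
print("same size:", ctypes.sizeof(TailBools) == ctypes.sizeof(TailBits))
```

On common ABIs the two layouts occupy the same number of bytes, which is
the point: the new state costs no extra memory, unlike reusing the uAPI
RTNH_F_UNRESOLVED flag, which changes visible semantics.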
> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index 65ba18a91865..a7ef21a6d271 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -1883,7 +1883,7 @@ int fib_sync_down_addr(struct net_device *dev, __be32 local)
>  		    fi->fib_tb_id != tb_id)
>  			continue;
>  		if (fi->fib_prefsrc == local) {
> -			fi->fib_flags |= RTNH_F_DEAD;
> +			fi->fib_flags |= (RTNH_F_DEAD | RTNH_F_UNRESOLVED);
>  			ret++;
>  		}
>  	}
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 74d403dbd2b4..88c593967063 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -2026,6 +2026,7 @@ void fib_table_flush_external(struct fib_table *tb)
>  int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
>  {
>  	struct trie *t = (struct trie *)tb->tb_data;
> +	struct nl_info info = { .nl_net = net };
>  	struct key_vector *pn = t->kv;
>  	unsigned long cindex = 1;
>  	struct hlist_node *tmp;
> @@ -2088,6 +2089,11 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
>
>  			fib_notify_alias_delete(net, n->key, &n->leaf, fa,
>  						NULL);
> +			if (fi->fib_flags & RTNH_F_UNRESOLVED) {
> +				fi->fib_flags &= ~RTNH_F_UNRESOLVED;
> +				rtmsg_fib(RTM_DELROUTE, htonl(n->key), fa,
> +					  KEYLENGTH - fa->fa_slen, tb->tb_id, &info, 0);
> +			}
>  			hlist_del_rcu(&fa->fa_list);
>  			fib_release_info(fa->fa_info);
>  			alias_free_mem_rcu(fa);
>
> Thanks
> Hangbin
On Wed, Aug 09, 2023 at 10:06:07AM +0300, Ido Schimmel wrote:
> We can instead represent fib_nh_is_v6 and nh_updated using a single bit:
>
> diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
> index a378eff827c7..a91f8a28689a 100644
> --- a/include/net/ip_fib.h
> +++ b/include/net/ip_fib.h
> @@ -152,8 +152,8 @@ struct fib_info {
>  #define fib_rtt fib_metrics->metrics[RTAX_RTT-1]
>  #define fib_advmss fib_metrics->metrics[RTAX_ADVMSS-1]
>  	int			fib_nhs;
> -	bool			fib_nh_is_v6;
> -	bool			nh_updated;
> +	u8			fib_nh_is_v6:1,
> +				nh_updated:1;
>  	struct nexthop		*nh;
>  	struct rcu_head		rcu;
>  	struct fib_nh		fib_nh[];
>
> And then add another bit there to mark a FIB info that is deleted
> because of preferred source address deletion.
>
> I suggest testing with the FIB tests in tools/testing/selftests/net/.

Thanks for the suggestion. The discussion on this patch is getting too
long. Let me post the new patch and we can discuss there.

Hangbin
Hi Ido,

I'm back on this question again. I hope you are not too tired of this
topic. In your last reply you only answered that we'd better use the new
nexthop API (which I totally agree with) to resolve the inconsistency
between IPv4 and IPv6, and that new features should also go to the new
API. But it always takes users many years to move to a new API (e.g.
some users are still using ifcfg). What I'm asking is how we should deal
with the bugs in the old API. See my reply below.

On Thu, Jul 27, 2023 at 05:45:02PM +0300, Ido Schimmel wrote:
> > Since the routes are not merged, the nexthop weight is not shown, which
> > makes them look the same to users. For IPv4, the scope is also not
> > shown, which makes them look the same to users.
>
> The routes are the same, but separate. They do not form a multipath
> route. Weight is meaningless for a non-multipath route.
>
> > But there are 2 issues here:
> > 1. the *type* and *protocol* fields are actually ignored
> > 2. when running `ip monitor route`, the info dumped in fib6_add_rt2node()
> >    uses the config info from user space, which means `ip monitor` shows
> >    an incorrect type and protocol
> >
> > So my questions are, should we show weight/scope for IPv4?

Here is the first one. As the weight/scope are not shown, the two
separate routes look exactly the same to the end user, which is
confusing. So why not just show the weight/scope, or forbid the user
from adding a non-multipath route with weight/scope?

> > How to deal with the type/proto info missing for IPv6?

What should we do about this bug? The type/proto info is ignored when
merging IPv6 nexthop entries.

Thanks
Hangbin

> > How to deal with the difference of merging policy for IPv4/IPv6?
>
> In my opinion, if you want consistent behavior between IPv4 and IPv6 for
> multipath routes, then I suggest using the nexthop API. It was merged in
> 5.3 (IIRC) and FRR started using it by default a few years ago. Other
> than a few bugs that were fixed, I don't remember many complaints.
> Also, any nexthop-related features will only be implemented in the
> nexthop API, not in the legacy API. Resilient nexthop groups are one
> example.
On 8/28/23 1:53 AM, Hangbin Liu wrote:
> On Thu, Jul 27, 2023 at 05:45:02PM +0300, Ido Schimmel wrote:
>>> Since the routes are not merged, the nexthop weight is not shown, which
>>> makes them look the same to users. For IPv4, the scope is also not
>>> shown, which makes them look the same to users.
>>
>> The routes are the same, but separate. They do not form a multipath
>> route. Weight is meaningless for a non-multipath route.
>>
>>> But there are 2 issues here:
>>> 1. the *type* and *protocol* fields are actually ignored
>>> 2. when running `ip monitor route`, the info dumped in fib6_add_rt2node()
>>>    uses the config info from user space, which means `ip monitor` shows
>>>    an incorrect type and protocol
>>>
>>> So my questions are, should we show weight/scope for IPv4?
>
> Here is the first one. As the weight/scope are not shown, the two separate
> routes look exactly the same to the end user, which is confusing.

Asked and answered many times above: weight has no meaning on single
path routes; it is not even tracked, if I recall correctly.

> So why not just show the weight/scope, or forbid the user from adding a
> non-multipath route with weight/scope?

That is a change to a uAPI we can not do at this point.

>
>>> How to deal with the type/proto info missing for IPv6?
>
> What should we do about this bug? The type/proto info is ignored when
> merging IPv6 nexthop entries.

I need more information; this thread has gone on for a long time now.
On Mon, Aug 28, 2023 at 09:06:25AM -0600, David Ahern wrote:
> >>> But there are 2 issues here:
> >>> 1. the *type* and *protocol* fields are actually ignored
> >>> 2. when running `ip monitor route`, the info dumped in fib6_add_rt2node()
> >>>    uses the config info from user space, which means `ip monitor` shows
> >>>    an incorrect type and protocol
> >>>
> >>> So my questions are, should we show weight/scope for IPv4?
> >
> > Here is the first one. As the weight/scope are not shown, the two separate
> > routes look exactly the same to the end user, which is confusing.
>
> Asked and answered many times above: Weight has no meaning on single
> path routes; it is not even tracked if I recall correctly.

Yes, I'm sorry that I asked this question over and over again. I always
got the answer that these are two different routes and that weight is
meaningless for a non-multipath route. But IIRC, I never got a straight
answer about what we should do with this problem.

>
> > So why not just show the weight/scope, or forbid the user from adding a
> > non-multipath route with weight/scope?
>
> That is a change to a uAPI we can not do at this point.

Yes, that's the answer I wanted to receive: either show it, forbid it,
or leave it unchanged because changing it would change the uAPI.

>
> >
> >>> How to deal with the type/proto info missing for IPv6?
> >
> > What should we do about this bug? The type/proto info is ignored when
> > merging IPv6 nexthop entries.
>
> I need more information; this thread has gone on for a long time now.
Sure, here is the reproducer:

+ ip link add dummy1 up type dummy
+ ip link add dummy2 up type dummy
+ ip addr add 2001:db8:101::1/64 dev dummy1
+ ip addr add 2001:db8:101::2/64 dev dummy2
+ ip monitor route
+ sleep 1
+ ip route add local 2001:db8:103::/64 via 2001:db8:101::10 dev dummy1 table 100
local 2001:db8:103::/64 via 2001:db8:101::10 dev dummy1 table 100 metric 1024 pref medium
+ ip route prepend unicast 2001:db8:103::/64 via 2001:db8:101::10 dev dummy2 table 100
2001:db8:103::/64 table 100 metric 1024 pref medium
	nexthop via 2001:db8:101::10 dev dummy2 weight 1
	nexthop via 2001:db8:101::10 dev dummy1 weight 1

^^ Here you can see that `ip monitor` prints the route as unicast, even
though the "dev dummy1" route should be local.

+ ip -6 route show table 100
local 2001:db8:103::/64 metric 1024 pref medium
	nexthop via 2001:db8:101::10 dev dummy1 weight 1
	nexthop via 2001:db8:101::10 dev dummy2 weight 1

^^ But the final route still keeps using local, which is different from
what `ip monitor` printed.

+ ip route add 2001:db8:104::/64 via 2001:db8:101::10 dev dummy1 proto kernel table 200
2001:db8:104::/64 via 2001:db8:101::10 dev dummy1 table 200 proto kernel metric 1024 pref medium
+ ip route append 2001:db8:104::/64 via 2001:db8:101::10 dev dummy2 proto bgp table 200
2001:db8:104::/64 table 200 proto bgp metric 1024 pref medium
	nexthop via 2001:db8:101::10 dev dummy2 weight 1
	nexthop via 2001:db8:101::10 dev dummy1 weight 1
+ ip -6 route show table 200
2001:db8:104::/64 proto kernel metric 1024 pref medium
	nexthop via 2001:db8:101::10 dev dummy1 weight 1
	nexthop via 2001:db8:101::10 dev dummy2 weight 1

^^ Same here: `ip monitor` prints protocol bgp, but the actual protocol
is still kernel. We just merged them together and ignored the protocol
field.

+ kill $!

As I asked, the type/proto info is ignored and dropped when merging the
IPv6 nexthop entries. How should we deal with this bug? Fix it or ignore
it?

Thanks
Hangbin
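[Editor's illustration of the bug the reproducer above demonstrates — a
hypothetical model of the reported fib6_add_rt2node() behavior, not kernel
code: the stored route keeps its original type/protocol, but the netlink
notification is built from the userspace request, so `ip monitor` and
`ip -6 route show` disagree.]

```python
def append_nexthop(stored_route, request):
    """Append the requested nexthop to an existing route, modeling the
    reported behavior: type/protocol from the request are silently
    dropped from the stored route, yet echoed in the notification."""
    stored_route["nexthops"].append(request["nexthop"])
    notification = {
        "dst": stored_route["dst"],
        "type": request["type"],    # what `ip monitor` prints
        "proto": request["proto"],  # ...even though the route keeps its own
        "nexthops": list(stored_route["nexthops"]),
    }
    return notification

# Route originally installed with proto kernel.
route = {"dst": "2001:db8:104::/64", "type": "unicast",
         "proto": "kernel", "nexthops": ["dummy1"]}

# `ip route append ... proto bgp` merges a second nexthop.
event = append_nexthop(route, {"nexthop": "dummy2",
                               "type": "unicast", "proto": "bgp"})

print(event["proto"])   # the monitor event reports bgp
print(route["proto"])   # the kernel still holds proto kernel
```

A daemon that updates its cache from the event ends up disagreeing with a
subsequent dump, which is exactly the inconsistency shown in the transcript.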
On 8/28/23 7:07 PM, Hangbin Liu wrote:
> As I asked, the type/proto info is ignored and dropped when merging the
> IPv6 nexthop entries. How should we deal with this bug? Fix it or ignore it?

ok, yes, that is a bug to me since information passed in is silently
dropped. Please send a patch to fix it.
On Tue, 2023-08-08 at 14:59 -0400, Benjamin Poirier wrote: > On 2023-08-07 19:44 -0600, David Ahern wrote: > > On 8/2/23 3:10 AM, Thomas Haller wrote: > > > On Fri, 2023-07-28 at 09:42 -0600, David Ahern wrote: > > > > On 7/28/23 7:01 AM, Nicolas Dichtel wrote: > > > > > > > > > Managing a cache with this is not so obvious
Le 11/09/2023 à 11:50, Thomas Haller a écrit :
[snip]
> - the fact that it isn't fixed in more than a decade shows IMO that
>   getting caching right for routes is very hard. Patches that improve the
>   behavior should not be rejected with "look at libnl3 or FRR".
+1

I just hit another corner case:

ip link set ntfp2 up
ip address add 10.125.0.1/24 dev ntfp2
ip nexthop add id 1234 via 10.125.0.2 dev ntfp2
ip route add 10.200.0.0/24 nhid 1234

Check the config:
$ ip route
<snip>
10.200.0.0/24 nhid 1234 via 10.125.0.2 dev ntfp2
$ ip nexthop
id 1234 via 10.125.0.2 dev ntfp2 scope link

Set the carrier off on ntfp2:
ip monitor label link route nexthop&
ip link set ntfp2 carrier off

$ ip link set ntfp2 carrier off
$ [LINK]4: ntfp2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default
    link/ether de:ed:02:67:61:1f brd ff:ff:ff:ff:ff:ff

=> No nexthop event nor route event (net.ipv4.nexthop_compat_mode = 1)

'ip nexthop' and 'ip route' show that the nexthop and the route have
been deleted.

If the nexthop infra is not used (ip route add 10.200.0.0/24 via
10.125.0.2 dev ntfp2), the route entry is not deleted.

I'm wondering if it is expected not to have a nexthop event when one is
removed due to a lost carrier. At least, a route event should be
generated when compat_mode is enabled.

Regards,
Nicolas
On Wed, Sep 13, 2023 at 09:58:08AM +0200, Nicolas Dichtel wrote:
> Le 11/09/2023 à 11:50, Thomas Haller a écrit :
> [snip]
> > - the fact that it isn't fixed in more than a decade, shows IMO that
> > getting caching right for routes is very hard. Patches that improve the
> > behavior should not be rejected with "look at libnl3 or FRR".
> +1
>
> I just hit another corner case:
>
> ip link set ntfp2 up
> ip address add 10.125.0.1/24 dev ntfp2
> ip nexthop add id 1234 via 10.125.0.2 dev ntfp2
> ip route add 10.200.0.0/24 nhid 1234
>
> Check the config:
> $ ip route
> <snip>
> 10.200.0.0/24 nhid 1234 via 10.125.0.2 dev ntfp2
> $ ip nexthop
> id 1234 via 10.125.0.2 dev ntfp2 scope link
>
>
> Set the carrier off on ntfp2:
> ip monitor label link route nexthop&
> ip link set ntfp2 carrier off
>
> $ ip link set ntfp2 carrier off
> $ [LINK]4: ntfp2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state
> DOWN group default
>     link/ether de:ed:02:67:61:1f brd ff:ff:ff:ff:ff:ff
>
> => No nexthop event nor route event (net.ipv4.nexthop_compat_mode = 1)
>
> 'ip nexthop' and 'ip route' show that the nexthop and the route have been deleted.
>
> If the nexthop infra is not used (ip route add 10.200.0.0/24 via 10.125.0.2 dev
> ntfp2), the route entry is not deleted.
>
> I wondering if it is expected to not have a nexthop event when one is removed
> due to a carrier lost.
> At least, a route event should be generated when the compat_mode is enabled.

This thread led to a long discussion.

Ido brought up this issue[1]. In my patch v2[2] we skipped sending the
notification, as the omission is deliberate.

BTW, I'm still looking into the questions you asked about my IPv6 patch[3].
Sorry for the late response. I was busy with some other stuff recently.

[1] https://lore.kernel.org/netdev/ZLlE5of1Sw1pMPlM@shredder/
[2] https://lore.kernel.org/netdev/20230809140234.3879929-3-liuhangbin@gmail.com/
[3] https://lore.kernel.org/netdev/bf3bb290-25b7-e327-851a-d6a036daab03@6wind.com/

Thanks
Hangbin
Le 13/09/2023 à 11:54, Hangbin Liu a écrit :
> On Wed, Sep 13, 2023 at 09:58:08AM +0200, Nicolas Dichtel wrote:
>> Le 11/09/2023 à 11:50, Thomas Haller a écrit :
>> [snip]
>>> - the fact that it isn't fixed in more than a decade, shows IMO that
>>> getting caching right for routes is very hard. Patches that improve the
>>> behavior should not be rejected with "look at libnl3 or FRR".
>> +1
>>
>> I just hit another corner case:
>>
>> ip link set ntfp2 up
>> ip address add 10.125.0.1/24 dev ntfp2
>> ip nexthop add id 1234 via 10.125.0.2 dev ntfp2
>> ip route add 10.200.0.0/24 nhid 1234
>>
>> Check the config:
>> $ ip route
>> <snip>
>> 10.200.0.0/24 nhid 1234 via 10.125.0.2 dev ntfp2
>> $ ip nexthop
>> id 1234 via 10.125.0.2 dev ntfp2 scope link
>>
>>
>> Set the carrier off on ntfp2:
>> ip monitor label link route nexthop&
>> ip link set ntfp2 carrier off
>>
>> $ ip link set ntfp2 carrier off
>> $ [LINK]4: ntfp2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state
>> DOWN group default
>>     link/ether de:ed:02:67:61:1f brd ff:ff:ff:ff:ff:ff
>>
>> => No nexthop event nor route event (net.ipv4.nexthop_compat_mode = 1)
>>
>> 'ip nexthop' and 'ip route' show that the nexthop and the route have been deleted.
>>
>> If the nexthop infra is not used (ip route add 10.200.0.0/24 via 10.125.0.2 dev
>> ntfp2), the route entry is not deleted.
>>
>> I wondering if it is expected to not have a nexthop event when one is removed
>> due to a carrier lost.
>> At least, a route event should be generated when the compat_mode is enabled.
>
> This thread goes to a long discussion.
>
> Ido has bringing up this issue[1]. In my patchv2[2] we skipped to send the
> notification as it is deliberate.
The compat_mode was introduced for daemons that don't support the nexthop
framework. There must be a notification (RTM_DELROUTE) when a route is deleted
due to a carrier down event. Right now, the backward compat is broken.
>
> BTW, I'm still looking into the questions you asked in my IPv6 patch[3]. Sorry
> for the late response. I was busy with some other stuff recently.
>
> [1] https://lore.kernel.org/netdev/ZLlE5of1Sw1pMPlM@shredder/
> [2] https://lore.kernel.org/netdev/20230809140234.3879929-3-liuhangbin@gmail.com/
> [3] https://lore.kernel.org/netdev/bf3bb290-25b7-e327-851a-d6a036daab03@6wind.com/

Thank you for the pointers.

Regards,
Nicolas
On 9/13/23 1:58 AM, Nicolas Dichtel wrote:
> Le 11/09/2023 à 11:50, Thomas Haller a écrit :
> [snip]
>> - the fact that it isn't fixed in more than a decade, shows IMO that
>> getting caching right for routes is very hard. Patches that improve the
>> behavior should not be rejected with "look at libnl3 or FRR".
> +1
>
> I just hit another corner case:
>
> ip link set ntfp2 up
> ip address add 10.125.0.1/24 dev ntfp2
> ip nexthop add id 1234 via 10.125.0.2 dev ntfp2
> ip route add 10.200.0.0/24 nhid 1234
>
> Check the config:
> $ ip route
> <snip>
> 10.200.0.0/24 nhid 1234 via 10.125.0.2 dev ntfp2
> $ ip nexthop
> id 1234 via 10.125.0.2 dev ntfp2 scope link
>
>
> Set the carrier off on ntfp2:
> ip monitor label link route nexthop&
> ip link set ntfp2 carrier off
>
> $ ip link set ntfp2 carrier off
> $ [LINK]4: ntfp2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state
> DOWN group default
>     link/ether de:ed:02:67:61:1f brd ff:ff:ff:ff:ff:ff
>
> => No nexthop event nor route event (net.ipv4.nexthop_compat_mode = 1)

carrier down is a link event and as you show here, link events are sent.

>
> 'ip nexthop' and 'ip route' show that the nexthop and the route have been deleted.

nexthop objects are removed on the link event; any routes referencing
those nexthops are removed.

>
> If the nexthop infra is not used (ip route add 10.200.0.0/24 via 10.125.0.2 dev
> ntfp2), the route entry is not deleted.
>
> I wondering if it is expected to not have a nexthop event when one is removed
> due to a carrier lost.
> At least, a route event should be generated when the compat_mode is enabled.

compat_mode is about expanding nhid into the full, legacy route
attributes. See 4f80116d3df3b
On 9/13/23 8:11 AM, Nicolas Dichtel wrote:
> The compat_mode was introduced for daemons that doesn't support the nexthop
> framework. There must be a notification (RTM_DELROUTE) when a route is deleted
> due to a carrier down event. Right now, the backward compat is broken.

The compat_mode is for daemons that do not understand the nexthop id
attribute, and need the legacy set of attributes for the route - i.e,
expand the nexthop information when sending messages to userspace.
Le 13/09/2023 à 16:43, David Ahern a écrit :
> On 9/13/23 8:11 AM, Nicolas Dichtel wrote:
>> The compat_mode was introduced for daemons that doesn't support the nexthop
>> framework. There must be a notification (RTM_DELROUTE) when a route is deleted
>> due to a carrier down event. Right now, the backward compat is broken.
>
> The compat_mode is for daemons that do not understand the nexthop id
> attribute, and need the legacy set of attributes for the route - i.e,
Yes, it's my point.
On my system, one daemon understands and configures nexthop ids and another one
doesn't. This second daemon removes routes when an interface is put down but
not when the carrier is lost.
The kernel doc [1] says:
	Further, updates or deletes of a nexthop configuration generate route
	notifications for each fib entry using the nexthop.
So, my understanding is that an RTM_DELROUTE msg should be sent when a nexthop
is removed due to a carrier loss event.

[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/networking/ip-sysctl.rst#n2116

Regards,
Nicolas

> expand the nexthop information when sending messages to userspace.
Le 13/09/2023 à 16:53, Nicolas Dichtel a écrit :
> Le 13/09/2023 à 16:43, David Ahern a écrit :
>> On 9/13/23 8:11 AM, Nicolas Dichtel wrote:
>>> The compat_mode was introduced for daemons that doesn't support the nexthop
>>> framework. There must be a notification (RTM_DELROUTE) when a route is deleted
>>> due to a carrier down event. Right now, the backward compat is broken.
>>
>> The compat_mode is for daemons that do not understand the nexthop id
>> attribute, and need the legacy set of attributes for the route - i.e,
> Yes, it's my point.
> On my system, one daemon understands and configures nexthop id and another one
> doesn't understand nexthop id. This last daemon removes routes when an interface
> is put down but not when the carrier is lost.
> The kernel doc [1] says:
> 	Further, updates or deletes of a nexthop configuration generate route
> 	notifications for each fib entry using the nexthop.
> So, my understanding is that a RTM_DELROUTE msg should be sent when a nexthop is
> removed due to a carrier lost event.
>
> [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/networking/ip-sysctl.rst#n2116

I dug a bit more into these (missing) notifications. I will try to describe
what should be done for the cases where there is no notification:

When an interface is set down:
- the single (!multipath) routes associated with this interface should be
  removed;
- for multipath routes:
  + if all nhs use this interface: the routes are deleted;
  + if only some nhs use this interface:
    ~ if all other nhs already point to a down interface: the routes are
      deleted;
    ~ if at least one nh points to an up interface:
      o the nh is *temporarily* disabled if it's a plain nexthop;
      o the nh is *definitely* removed if it's a nexthop object.
When the interface is set up later, disabled nhs are restored (ie only plain
nexthops of multipath routes).

When an interface loses its carrier:
- for routes using plain nexthops: nothing happens;
- for routes using nexthop objects:
  + single routes are deleted;
  + for multipath routes, the nh is definitely removed if it's a nexthop
    object (ie the route is deleted if there is no other nexthop in the group).
When an interface recovers its carrier, there is nothing to do.

When the last IPv4 address of an interface is removed:
- for routes using nexthop objects: nothing happens;
- for routes using plain nexthops: the same rules as 'interface down' apply.
When an IPv4 address is added again on the interface, disabled nhs are
restored (ie only plain nexthops of multipath routes).

I bet I'm missing some cases.

Conclusions:
- legacy applications (that are not aware of nexthop objects) cannot maintain
  a routing cache (even with compat_mode enabled);
- fixing only the legacy applications (aka compat_mode) seems too complex;
- even if an application is aware of nexthop objects, the rules to maintain a
  cache are far from obvious.

I don't understand why there is so much reluctance to send a notification
when a route is deleted. This would fix all cases.
I understand that the goal was to save netlink traffic, but without these
notifications, the daemons that are interested in maintaining a routing cache
have to fully parse their cache to mark/remove routes. For big routing tables,
this will cost a lot of cpu, so I wonder if it's really a gain for the system.
On such systems, there is probably more than one daemon in this case, so even
more cpu to spend on these operations.

As Thomas said, this discussion has come up for more than a decade. And with
nexthop object support, it's even more complex. There is obviously something
to do.

At least, I would have expected an RTM_DELNEXTHOP msg for each deleted nexthop.
But this wouldn't solve the routing cache sync for legacy applications.

Regards,
Nicolas
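For illustration, the case analysis above can be condensed into a single decision function. This is only a model of the behavior described in this thread — the event names, argument names, and result strings are invented — but it shows how much state a userspace cache would have to replicate:

```python
# Simplified model of the route-deletion rules described in this thread.
# This is an illustration of the observed behavior, not kernel code; the
# event names and return values are hypothetical.

def route_fate(event, uses_nexthop_object, other_usable_paths):
    """What happens to a route when `event` hits one of its paths.

    event: "link_down", "carrier_lost", or "last_addr_removed"
    uses_nexthop_object: True for `nhid` routes, False for plain nexthops
    other_usable_paths: True if the route still has at least one other
                        path via an up interface
    """
    if event == "link_down":
        if not other_usable_paths:
            return "deleted"
        # Multipath with a surviving path: plain nexthops are only
        # disabled (and restored on link up); nexthop objects are removed.
        return "path removed" if uses_nexthop_object else "path disabled"
    if event == "carrier_lost":
        if not uses_nexthop_object:
            return "unchanged"  # plain nexthops ignore carrier changes
        return "path removed" if other_usable_paths else "deleted"
    if event == "last_addr_removed":
        if uses_nexthop_object:
            return "unchanged"
        # Plain nexthops follow the interface-down rules here.
        return "path disabled" if other_usable_paths else "deleted"
    raise ValueError(event)
```

For example, under this model a single-path `nhid` route disappears on carrier loss (`route_fate("carrier_lost", True, False)` is `"deleted"`) while the same route with a plain nexthop is untouched — exactly the asymmetry Nicolas observed.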
On 9/14/23 9:43 AM, Nicolas Dichtel wrote:
> Le 13/09/2023 à 16:53, Nicolas Dichtel a écrit :
>> Le 13/09/2023 à 16:43, David Ahern a écrit :
>>> On 9/13/23 8:11 AM, Nicolas Dichtel wrote:
>>>> The compat_mode was introduced for daemons that doesn't support the nexthop
>>>> framework. There must be a notification (RTM_DELROUTE) when a route is deleted
>>>> due to a carrier down event. Right now, the backward compat is broken.
>>>
>>> The compat_mode is for daemons that do not understand the nexthop id
>>> attribute, and need the legacy set of attributes for the route - i.e,
>> Yes, it's my point.
>> On my system, one daemon understands and configures nexthop id and another one
>> doesn't understand nexthop id. This last daemon removes routes when an interface
>> is put down but not when the carrier is lost.
>> The kernel doc [1] says:
>> 	Further, updates or deletes of a nexthop configuration generate route
>> 	notifications for each fib entry using the nexthop.
>> So, my understanding is that a RTM_DELROUTE msg should be sent when a nexthop is
>> removed due to a carrier lost event.
>>
>> [1]
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/networking/ip-sysctl.rst#n2116
>
> I dug a bit more about these (missing) notifications. I will try to describe
> what should be done for cases where there is no notification:
>
> When an interface is set down:
> - the single (!multipath) routes associated with this interface should be
>   removed;
> - for multipath routes:
>   + if all nh use this interface: the routes are deleted;
>   + if only some nh uses this interface:
>     ~ if all other nh already point to a down interface: the routes are deleted;
>     ~ if at least one nh points to an up interface:
>       o the nh are *temporarily* disabled if it's a plain nexthop;
>       o the nh is *definitely* removed if it's a nexthop object;
> When the interface is set up later, disabled nh are restored (ie only plain
> nexthop of multipath routes).
>
> When an interface loses its carrier:
> - for routes using plain nexthop: nothing happens;
> - for routes using nexthop objects:
>   + for single routes: they are deleted;
>   + for multipath routes, the nh is definitely removed if it's a nexthop
>     object (ie the route is deleted if there is no other nexthop in the group);
> When an interface recovers its carrier, there is nothing to do.
>
> When the last ipv4 address of an interface is removed:
> - for routes using nexthop objects: nothing happens;
> - for routes using plain nexthop: the same rules as 'interface down' applies.
> When an ipv4 address is added again on the interface, disabled nh are restored
> (ie only plain nexthop of multipath routes).
>
> I bet I miss some cases.
>
> Conclusions:
> - legacy applications (that are not aware of nexthop objects) cannot maintain a
>   routing cache (even with compat_mode enabled);
> - fixing only the legacy applications (aka compat_mode) seems too complex;
> - even if an application is aware of nexthop objects, the rules to maintain a
>   cache are far from obvious.
>
> I don't understand why there is so much reluctance to not send a notification
> when a route is deleted. This would fix all cases.
> I understand that the goal was to save netlink traffic, but in this case, the
> daemons that are interested in maintaining a routing cache have to fully parse
> their cache to mark/remove routes. For big routing tables, this will cost a lot
> of cpu, so I wonder if it's really a gain for the system. On such systems, there
> is probably more than one daemon in this case, so even more cpu to spend for
> these operations.
>
> As Thomas said, this discussion has come up for more than a decade. And with the
> nexthop objects support, it's even more complex. There is obviously something to do.
>
> At least, I would have expected an RTM_DELNEXTHOP msg for each deleted nexthop.
> But this wouldn't solve the routing cache sync for legacy applications.
>

The split nexthop API is about efficiency. Do not send route
notifications when they are easily derived from other events -- e.g.,
link down or carrier down. The first commit for the nexthop code:

commit ab84be7e54fc3d9b248285f1a14067558d858819
Author: David Ahern <dsahern@gmail.com>
Date:   Fri May 24 14:43:04 2019 -0700

    net: Initial nexthop code

    Barebones start point for nexthops. Implementation for RTM commands,
    notifications, management of rbtree for holding nexthops by id, and
    kernel side data structures for nexthops and nexthop config.

    Nexthops are maintained in an rbtree sorted by id. Similar to routes,
    nexthops are configured per namespace using netns_nexthop struct added
    to struct net.

    Nexthop notifications are sent when a nexthop is added or deleted,
    but NOT if the delete is due to a device event or network namespace
    teardown (which also involves device events). Applications are
    expected to use the device down event to flush nexthops and any
    routes used by the nexthops.

Intent is stated.

The only compatibility with legacy API is around expanding the nhid to a
full set of nexthop attributes to aid a transition to the new API.

This new API also gave an opportunity to bring equivalency between IPv4
and IPv6. Let's not weaken that model at all.
Le 15/09/2023 à 05:07, David Ahern a écrit :
> On 9/14/23 9:43 AM, Nicolas Dichtel wrote:
>> Le 13/09/2023 à 16:53, Nicolas Dichtel a écrit :
>>> Le 13/09/2023 à 16:43, David Ahern a écrit :
>>>> On 9/13/23 8:11 AM, Nicolas Dichtel wrote:
>>>>> The compat_mode was introduced for daemons that doesn't support the nexthop
>>>>> framework. There must be a notification (RTM_DELROUTE) when a route is deleted
>>>>> due to a carrier down event. Right now, the backward compat is broken.
>>>>
>>>> The compat_mode is for daemons that do not understand the nexthop id
>>>> attribute, and need the legacy set of attributes for the route - i.e,
>>> Yes, it's my point.
>>> On my system, one daemon understands and configures nexthop id and another one
>>> doesn't understand nexthop id. This last daemon removes routes when an interface
>>> is put down but not when the carrier is lost.
>>> The kernel doc [1] says:
>>> 	Further, updates or deletes of a nexthop configuration generate route
>>> 	notifications for each fib entry using the nexthop.
>>> So, my understanding is that a RTM_DELROUTE msg should be sent when a nexthop is
>>> removed due to a carrier lost event.
>>>
>>> [1]
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/networking/ip-sysctl.rst#n2116
>>
>> I dug a bit more about these (missing) notifications. I will try to describe
>> what should be done for cases where there is no notification:
>>
>> When an interface is set down:
>> - the single (!multipath) routes associated with this interface should be
>>   removed;
>> - for multipath routes:
>>   + if all nh use this interface: the routes are deleted;
>>   + if only some nh uses this interface:
>>     ~ if all other nh already point to a down interface: the routes are deleted;
>>     ~ if at least one nh points to an up interface:
>>       o the nh are *temporarily* disabled if it's a plain nexthop;
>>       o the nh is *definitely* removed if it's a nexthop object;
>> When the interface is set up later, disabled nh are restored (ie only plain
>> nexthop of multipath routes).
>>
>> When an interface loses its carrier:
>> - for routes using plain nexthop: nothing happens;
>> - for routes using nexthop objects:
>>   + for single routes: they are deleted;
>>   + for multipath routes, the nh is definitely removed if it's a nexthop
>>     object (ie the route is deleted if there is no other nexthop in the group);
>> When an interface recovers its carrier, there is nothing to do.
>>
>> When the last ipv4 address of an interface is removed:
>> - for routes using nexthop objects: nothing happens;
>> - for routes using plain nexthop: the same rules as 'interface down' applies.
>> When an ipv4 address is added again on the interface, disabled nh are restored
>> (ie only plain nexthop of multipath routes).
>>
>> I bet I miss some cases.
>>
>> Conclusions:
>> - legacy applications (that are not aware of nexthop objects) cannot maintain a
>>   routing cache (even with compat_mode enabled);
>> - fixing only the legacy applications (aka compat_mode) seems too complex;
>> - even if an application is aware of nexthop objects, the rules to maintain a
>>   cache are far from obvious.
>>
>> I don't understand why there is so much reluctance to not send a notification
>> when a route is deleted. This would fix all cases.
>> I understand that the goal was to save netlink traffic, but in this case, the
>> daemons that are interested in maintaining a routing cache have to fully parse
>> their cache to mark/remove routes. For big routing tables, this will cost a lot
>> of cpu, so I wonder if it's really a gain for the system. On such systems, there
>> is probably more than one daemon in this case, so even more cpu to spend for
>> these operations.
>>
>> As Thomas said, this discussion has come up for more than a decade. And with the
>> nexthop objects support, it's even more complex. There is obviously something to do.
>>
>> At least, I would have expected an RTM_DELNEXTHOP msg for each deleted nexthop.
>> But this wouldn't solve the routing cache sync for legacy applications.
>>
>
> The split nexthop API is about efficiency. Do not send route
Efficiency for the kernel, but complexity for userspace daemons. I'm not sure
the system wins anything when several daemons implement the same complex logic.

> notifications when they are easily derived from other events -- e.g.,
> link down or carrier down. The first commit for the nexthop code:
>
> commit ab84be7e54fc3d9b248285f1a14067558d858819
> Author: David Ahern <dsahern@gmail.com>
> Date:   Fri May 24 14:43:04 2019 -0700
>
>     net: Initial nexthop code
>
>     Barebones start point for nexthops. Implementation for RTM commands,
>     notifications, management of rbtree for holding nexthops by id, and
>     kernel side data structures for nexthops and nexthop config.
>
>     Nexthops are maintained in an rbtree sorted by id. Similar to routes,
>     nexthops are configured per namespace using netns_nexthop struct added
>     to struct net.
>
>     Nexthop notifications are sent when a nexthop is added or deleted,
>     but NOT if the delete is due to a device event or network namespace
>     teardown (which also involves device events). Applications are
>     expected to use the device down event to flush nexthops and any
>     routes used by the nexthops.
>
> Intent is stated.
I understand, but are there numbers for the gain? In terms of CPU? Memory?
Some systems could have a lot of routes (~900k for a full IPv4 table), but
they usually have a small number of nexthops.
If you look at the big picture, is it really worth saving some netlink
messages and putting the load/complexity in the userspace daemons?
It was the initial intent, no problem. But if people complain, maybe it could
be improved to fit everybody's use case.

>
> The only compatibility with legacy API is around expanding the nhid to a
> full set of nexthop attributes to aid a transition to the new API.
As I said in another reply, the kernel doc [1] also seems to state this:
	Further, updates or deletes of a nexthop configuration generate route
	notifications for each fib entry using the nexthop.
Without this, legacy daemons are broken. The compat_mode only hides some
behavior changes if these updates are not notified.

[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/networking/ip-sysctl.rst#n2116

Regards,
Nicolas
On Wed, 13 Sep 2023 08:41:05 -0600
David Ahern <dsahern@kernel.org> wrote:

> On 9/13/23 1:58 AM, Nicolas Dichtel wrote:
> > Le 11/09/2023 à 11:50, Thomas Haller a écrit :
> > [snip]
> >> - the fact that it isn't fixed in more than a decade, shows IMO that
> >> getting caching right for routes is very hard. Patches that improve the
> >> behavior should not be rejected with "look at libnl3 or FRR".
> > +1
> >
> > I just hit another corner case:
> >
> > ip link set ntfp2 up
> > ip address add 10.125.0.1/24 dev ntfp2
> > ip nexthop add id 1234 via 10.125.0.2 dev ntfp2
> > ip route add 10.200.0.0/24 nhid 1234
> >
> > Check the config:
> > $ ip route
> > <snip>
> > 10.200.0.0/24 nhid 1234 via 10.125.0.2 dev ntfp2
> > $ ip nexthop
> > id 1234 via 10.125.0.2 dev ntfp2 scope link
> >
> >
> > Set the carrier off on ntfp2:
> > ip monitor label link route nexthop&
> > ip link set ntfp2 carrier off
> >
> > $ ip link set ntfp2 carrier off
> > $ [LINK]4: ntfp2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state
> > DOWN group default
> >     link/ether de:ed:02:67:61:1f brd ff:ff:ff:ff:ff:ff
> >
> > => No nexthop event nor route event (net.ipv4.nexthop_compat_mode = 1)
>
> carrier down is a link event and as you show here, link events are sent.
>
> >
> > 'ip nexthop' and 'ip route' show that the nexthop and the route have been deleted.
>
> nexthop objects are removed on the link event; any routes referencing
> those nexthops are removed.

The netlink notification path does not have any flow control. The
problem with what is proposed is that if there are 1M route entries and
a link event happens, some of the RTM_DELROUTE events will be dropped
when the netlink queue overruns. Therefore daemons cannot depend on
RTM_DELROUTE for tracking, and instead should look for the link event
and do bulk cleanup then.
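A minimal sketch of the daemon-side strategy described here: key cached routes by the devices they depend on, and on a link-down event purge the affected entries in bulk instead of trusting that every RTM_DELROUTE arrived. The class, its layout, and method names are hypothetical, not part of any real library.

```python
# Sketch of bulk cleanup on a link event, as an alternative to relying on
# per-route RTM_DELROUTE messages (which can be dropped when the netlink
# socket queue overruns). The cache layout here is a made-up illustration.

class RouteCache:
    def __init__(self):
        # prefix -> set of ifindexes the route's remaining paths use
        self.routes = {}

    def add_route(self, prefix, ifindexes):
        self.routes[prefix] = set(ifindexes)

    def on_link_down(self, ifindex):
        """Bulk cleanup: drop the dead paths, and drop routes that have
        no usable path left, without waiting for route notifications."""
        for prefix in list(self.routes):
            deps = self.routes[prefix]
            deps.discard(ifindex)
            if not deps:
                del self.routes[prefix]

cache = RouteCache()
cache.add_route("10.200.0.0/24", [4])     # single path via ifindex 4
cache.add_route("10.201.0.0/24", [4, 5])  # multipath via ifindexes 4 and 5
cache.on_link_down(4)
```

After the link event, the single-path route is gone from the cache and the multipath route keeps only its surviving path — mirroring what the kernel does without emitting per-route messages.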
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 74d403dbd2b4..1a88013f6a5b 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2026,6 +2026,7 @@ void fib_table_flush_external(struct fib_table *tb)
 int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
 {
 	struct trie *t = (struct trie *)tb->tb_data;
+	struct nl_info info = { .nl_net = net };
 	struct key_vector *pn = t->kv;
 	unsigned long cindex = 1;
 	struct hlist_node *tmp;
@@ -2088,6 +2089,8 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
 
 			fib_notify_alias_delete(net, n->key, &n->leaf, fa,
 						NULL);
+			rtmsg_fib(RTM_DELROUTE, htonl(n->key), fa,
+				  KEYLENGTH - fa->fa_slen, tb->tb_id, &info, 0);
 			hlist_del_rcu(&fa->fa_list);
 			fib_release_info(fa->fa_info);
 			alias_free_mem_rcu(fa);
After deleting an interface address in fib_del_ifaddr(), the function
scans the fib_info list for stray entries and calls fib_flush. The stray
entries are then deleted silently, and no RTM_DELROUTE notification is
sent.

This lack of notification can make routing daemons like NetworkManager
miss the routing changes. e.g.

+ ip link add dummy1 type dummy
+ ip link add dummy2 type dummy
+ ip link set dummy1 up
+ ip link set dummy2 up
+ ip addr add 192.168.5.5/24 dev dummy1
+ ip route add 7.7.7.0/24 dev dummy2 src 192.168.5.5
+ ip -4 route
7.7.7.0/24 dev dummy2 scope link src 192.168.5.5
192.168.5.0/24 dev dummy1 proto kernel scope link src 192.168.5.5
+ ip monitor route
+ ip addr del 192.168.5.5/24 dev dummy1
Deleted 192.168.5.0/24 dev dummy1 proto kernel scope link src 192.168.5.5
Deleted broadcast 192.168.5.255 dev dummy1 table local proto kernel scope link src 192.168.5.5
Deleted local 192.168.5.5 dev dummy1 table local proto kernel scope host src 192.168.5.5

After update:

+ ip monitor route
+ ip addr del 192.168.5.5/24 dev dummy1
Deleted 192.168.5.0/24 dev dummy1 proto kernel scope link src 192.168.5.5
Deleted broadcast 192.168.5.255 dev dummy1 table local proto kernel scope link src 192.168.5.5
Deleted local 192.168.5.5 dev dummy1 table local proto kernel scope host src 192.168.5.5
Deleted 7.7.7.0/24 dev dummy2 scope link src 192.168.5.5

Suggested-by: Thomas Haller <thaller@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 net/ipv4/fib_trie.c | 3 +++
 1 file changed, 3 insertions(+)