diff mbox series

[net-next,05/10] net: ipv4: Emit notification when fib hardware flags are changed

Message ID 20210126132311.3061388-6-idosch@idosch.org (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series Add notifications when route hardware flags change | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net-next
netdev/subject_prefix success Link
netdev/cc_maintainers warning 7 maintainers not CCed: linux-doc@vger.kernel.org yoshfuji@linux-ipv6.org edumazet@google.com weiwan@google.com roopa@cumulusnetworks.com kuniyu@amazon.co.jp corbet@lwn.net
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 7490 this patch: 7490
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch warning WARNING: line length of 82 exceeds 80 columns
netdev/build_allmodconfig_warn success Errors and warnings before: 7892 this patch: 7892
netdev/header_inline success Link
netdev/stable success Stable not CCed

Commit Message

Ido Schimmel Jan. 26, 2021, 1:23 p.m. UTC
From: Amit Cohen <amcohen@nvidia.com>

After installing a route to the kernel, user space receives an
acknowledgment, which means the route was installed in the kernel,
but not necessarily in hardware.

The asynchronous nature of route installation in hardware can lead to a
routing daemon advertising a route before it was actually installed in
hardware. This can result in packet loss or mis-routed packets until the
route is installed in hardware.

It is also possible for a route already installed in hardware to change
its action and therefore its flags. For example, a host route that is
trapping packets can be "promoted" to perform decapsulation following
the installation of an IPinIP/VXLAN tunnel.

Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
are changed. The aim is to provide an indication to user-space
(e.g., routing daemons) about the state of the route in hardware.

Introduce a sysctl that controls this behavior.

Keep the default value at 0 (i.e., do not emit notifications) for several
reasons:
- Multiple RTM_NEWROUTE notification per-route might confuse existing
  routing daemons.
- Convergence reasons in routing daemons.
- The extra notifications will negatively impact the insertion rate.
- Not all users are interested in these notifications.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Acked-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 Documentation/networking/ip-sysctl.rst | 20 +++++++++++++++++++
 include/net/netns/ipv4.h               |  2 ++
 net/ipv4/af_inet.c                     |  2 ++
 net/ipv4/fib_trie.c                    | 27 ++++++++++++++++++++++++++
 net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
 5 files changed, 60 insertions(+)

Comments

David Ahern Jan. 27, 2021, 5:02 a.m. UTC | #1
On 1/26/21 6:23 AM, Ido Schimmel wrote:
> From: Amit Cohen <amcohen@nvidia.com>
> 
> After installing a route to the kernel, user space receives an
> acknowledgment, which means the route was installed in the kernel,
> but not necessarily in hardware.
> 
> The asynchronous nature of route installation in hardware can lead to a
> routing daemon advertising a route before it was actually installed in
> hardware. This can result in packet loss or mis-routed packets until the
> route is installed in hardware.
> 
> It is also possible for a route already installed in hardware to change
> its action and therefore its flags. For example, a host route that is
> trapping packets can be "promoted" to perform decapsulation following
> the installation of an IPinIP/VXLAN tunnel.
> 
> Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
> are changed. The aim is to provide an indication to user-space
> (e.g., routing daemons) about the state of the route in hardware.
> 
> Introduce a sysctl that controls this behavior.
> 
> Keep the default value at 0 (i.e., do not emit notifications) for several
> reasons:
> - Multiple RTM_NEWROUTE notification per-route might confuse existing
>   routing daemons.

are you aware of any routing daemons that are affected? Seems like they
should be able to handle redundant notifications

> - Convergence reasons in routing daemons.
> - The extra notifications will negatively impact the insertion rate.

any numbers on the overhead?

> - Not all users are interested in these notifications.
> 
> Signed-off-by: Amit Cohen <amcohen@nvidia.com>
> Acked-by: Roopa Prabhu <roopa@nvidia.com>
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> ---
>  Documentation/networking/ip-sysctl.rst | 20 +++++++++++++++++++
>  include/net/netns/ipv4.h               |  2 ++
>  net/ipv4/af_inet.c                     |  2 ++
>  net/ipv4/fib_trie.c                    | 27 ++++++++++++++++++++++++++
>  net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
>  5 files changed, 60 insertions(+)
>
Amit Cohen Jan. 27, 2021, 11:46 a.m. UTC | #2
>-----Original Message-----
>From: David Ahern <dsahern@gmail.com>
>Sent: Wednesday, January 27, 2021 7:03
>To: Ido Schimmel <idosch@idosch.org>; netdev@vger.kernel.org
>Cc: davem@davemloft.net; kuba@kernel.org; Amit Cohen <amcohen@nvidia.com>; Roopa Prabhu <roopa@nvidia.com>; Donald
>Sharp <sharpd@nvidia.com>; Benjamin Poirier <bpoirier@nvidia.com>; mlxsw <mlxsw@nvidia.com>; Ido Schimmel
><idosch@nvidia.com>
>Subject: Re: [PATCH net-next 05/10] net: ipv4: Emit notification when fib hardware flags are changed
>
>On 1/26/21 6:23 AM, Ido Schimmel wrote:
>> From: Amit Cohen <amcohen@nvidia.com>
>>
>> After installing a route to the kernel, user space receives an
>> acknowledgment, which means the route was installed in the kernel, but
>> not necessarily in hardware.
>>
>> The asynchronous nature of route installation in hardware can lead to
>> a routing daemon advertising a route before it was actually installed
>> in hardware. This can result in packet loss or mis-routed packets
>> until the route is installed in hardware.
>>
>> It is also possible for a route already installed in hardware to
>> change its action and therefore its flags. For example, a host route
>> that is trapping packets can be "promoted" to perform decapsulation
>> following the installation of an IPinIP/VXLAN tunnel.
>>
>> Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP
>> flags are changed. The aim is to provide an indication to user-space
>> (e.g., routing daemons) about the state of the route in hardware.
>>
>> Introduce a sysctl that controls this behavior.
>>
>> Keep the default value at 0 (i.e., do not emit notifications) for
>> several
>> reasons:
>> - Multiple RTM_NEWROUTE notification per-route might confuse existing
>>   routing daemons.
>
>are you aware of any routing daemons that are affected? Seems like they should be able to handle redundant notifications

Actually no, we didn't check all the existing daemons, just assume that not everyone will want to activate the notifications at all.
So there is no point in sending notifications for users which aren't interested in them.

>
>> - Convergence reasons in routing daemons.
>> - The extra notifications will negatively impact the insertion rate.
>
>any numbers on the overhead?

For addition of 256k routes in mlxsw, the overhead is 3.6% of the total time.

>
>> - Not all users are interested in these notifications.
>>
>> Signed-off-by: Amit Cohen <amcohen@nvidia.com>
>> Acked-by: Roopa Prabhu <roopa@nvidia.com>
>> Signed-off-by: Ido Schimmel <idosch@nvidia.com>
>> ---
>>  Documentation/networking/ip-sysctl.rst | 20 +++++++++++++++++++
>>  include/net/netns/ipv4.h               |  2 ++
>>  net/ipv4/af_inet.c                     |  2 ++
>>  net/ipv4/fib_trie.c                    | 27 ++++++++++++++++++++++++++
>>  net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
>>  5 files changed, 60 insertions(+)
>>
David Ahern Jan. 28, 2021, 3:34 a.m. UTC | #3
On 1/26/21 6:23 AM, Ido Schimmel wrote:
> From: Amit Cohen <amcohen@nvidia.com>
> 
> After installing a route to the kernel, user space receives an
> acknowledgment, which means the route was installed in the kernel,
> but not necessarily in hardware.
> 
> The asynchronous nature of route installation in hardware can lead to a
> routing daemon advertising a route before it was actually installed in
> hardware. This can result in packet loss or mis-routed packets until the
> route is installed in hardware.
> 
> It is also possible for a route already installed in hardware to change
> its action and therefore its flags. For example, a host route that is
> trapping packets can be "promoted" to perform decapsulation following
> the installation of an IPinIP/VXLAN tunnel.
> 
> Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
> are changed. The aim is to provide an indication to user-space
> (e.g., routing daemons) about the state of the route in hardware.
> 
> Introduce a sysctl that controls this behavior.
> 
> Keep the default value at 0 (i.e., do not emit notifications) for several
> reasons:
> - Multiple RTM_NEWROUTE notification per-route might confuse existing
>   routing daemons.
> - Convergence reasons in routing daemons.
> - The extra notifications will negatively impact the insertion rate.
> - Not all users are interested in these notifications.
> 
> Signed-off-by: Amit Cohen <amcohen@nvidia.com>
> Acked-by: Roopa Prabhu <roopa@nvidia.com>
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> ---
>  Documentation/networking/ip-sysctl.rst | 20 +++++++++++++++++++
>  include/net/netns/ipv4.h               |  2 ++
>  net/ipv4/af_inet.c                     |  2 ++
>  net/ipv4/fib_trie.c                    | 27 ++++++++++++++++++++++++++
>  net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
>  5 files changed, 60 insertions(+)
> 

Reviewed-by: David Ahern <dsahern@kernel.org>
Jakub Kicinski Jan. 29, 2021, 3:04 a.m. UTC | #4
On Tue, 26 Jan 2021 15:23:06 +0200 Ido Schimmel wrote:
> Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
> are changed. The aim is to provide an indication to user-space
> (e.g., routing daemons) about the state of the route in hardware.

What does the daemon in the user space do with it?

The notification will only be generated for the _first_ ASIC which
offloaded the object. Which may be fine for you today but as an uAPI 
it feels slightly lacking.

If the user space just wants to make sure the devices are synced to
notifications from certain stage, wouldn't it be more idiomatic to
provide some "fence" operation?

WDYT? David?
David Ahern Jan. 29, 2021, 3:33 a.m. UTC | #5
On 1/28/21 8:04 PM, Jakub Kicinski wrote:
> On Tue, 26 Jan 2021 15:23:06 +0200 Ido Schimmel wrote:
>> Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
>> are changed. The aim is to provide an indication to user-space
>> (e.g., routing daemons) about the state of the route in hardware.
> 
> What does the daemon in the user space do with it?

You don't want FRR for example to advertise a route to a peer until it
is really programmed in h/w. This notification gives routing daemons
that information.

> 
> The notification will only be generated for the _first_ ASIC which
> offloaded the object. Which may be fine for you today but as an uAPI 
> it feels slightly lacking.
> 
> If the user space just wants to make sure the devices are synced to
> notifications from certain stage, wouldn't it be more idiomatic to
> provide some "fence" operation?
> 
> WDYT? David?
> 

This feature was first discussed I think about 2 years ago - when I was
still with Cumulus, so I already knew the intent and end goal.

I think support for multiple ASICs / NICs doing this kind of offload
will have a whole lot of challenges. I don't think this particular user
notification is going to be a big problem - e.g., you could always delay
the emit until all have indicated the offload.
Jakub Kicinski Jan. 29, 2021, 4:15 a.m. UTC | #6
On Thu, 28 Jan 2021 20:33:22 -0700 David Ahern wrote:
> On 1/28/21 8:04 PM, Jakub Kicinski wrote:
> > On Tue, 26 Jan 2021 15:23:06 +0200 Ido Schimmel wrote:  
> >> Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
> >> are changed. The aim is to provide an indication to user-space
> >> (e.g., routing daemons) about the state of the route in hardware.  
> > 
> > What does the daemon in the user space do with it?  
> 
> You don't want FRR for example to advertise a route to a peer until it
> is really programmed in h/w. This notification gives routing daemons
> that information.

I see. Hm.

> > The notification will only be generated for the _first_ ASIC which
> > offloaded the object. Which may be fine for you today but as an uAPI 
> > it feels slightly lacking.
> > 
> > If the user space just wants to make sure the devices are synced to
> > notifications from certain stage, wouldn't it be more idiomatic to
> > provide some "fence" operation?
> > 
> > WDYT? David?
> 
> This feature was first discussed I think about 2 years ago - when I was
> still with Cumulus, so I already knew the intent and end goal.
> 
> I think support for multiple ASICs / NICs doing this kind of offload
> will have a whole lot of challenges. I don't think this particular user
> notification is going to be a big problem - e.g., you could always delay
> the emit until all have indicated the offload.

My impression from working on this problem in TC is that the definition
of "all" becomes problematic especially if one takes into account
drivers getting reloaded. But I think routing offload has stronger
semantics than TC, so no objections.

We need a respin for the somewhat embarrassing loop in patch 1, tho.
Ido Schimmel Jan. 30, 2021, 3:11 p.m. UTC | #7
On Thu, Jan 28, 2021 at 08:33:22PM -0700, David Ahern wrote:
> On 1/28/21 8:04 PM, Jakub Kicinski wrote:
> > On Tue, 26 Jan 2021 15:23:06 +0200 Ido Schimmel wrote:
> >> Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
> >> are changed. The aim is to provide an indication to user-space
> >> (e.g., routing daemons) about the state of the route in hardware.
> > 
> > What does the daemon in the user space do with it?
> 
> You don't want FRR for example to advertise a route to a peer until it
> is really programmed in h/w. This notification gives routing daemons
> that information.

Correct. It is in the cover letter:

"These flags are of interest to routing daemons since they would like to
delay advertisement of routes until they are installed in hardware."

Amit is working on follow-up to emit notifications when route offload
fails. This request also comes from the FRR team. Currently we have a
policy inside mlxsw to abort route offload and install a default route
that sends all the traffic to the CPU. It obviously kills the box and
anyway the policy is something user space should decide, not the kernel.

> 
> > 
> > The notification will only be generated for the _first_ ASIC which
> > offloaded the object. Which may be fine for you today but as an uAPI 
> > it feels slightly lacking.
> > 
> > If the user space just wants to make sure the devices are synced to
> > notifications from certain stage, wouldn't it be more idiomatic to
> > provide some "fence" operation?
> > 
> > WDYT? David?
> > 
> 
> This feature was first discussed I think about 2 years ago - when I was
> still with Cumulus, so I already knew the intent and end goal.
> 
> I think support for multiple ASICs / NICs doing this kind of offload
> will have a whole lot of challenges. I don't think this particular user
> notification is going to be a big problem - e.g., you could always delay
> the emit until all have indicated the offload.

I do not have experience with multi-ASIC systems, but my understanding
is that each ASIC has its own copy of the networking stack and the ASICs
are connected via front panel or backplane ports, like distinct
leaf/spine switches. In Linux, such a system can be supported by
registering a devlink instance for each ASIC and reloading each instance
to a separate namespace.

Thanks for reviewing, David
Ido Schimmel Jan. 30, 2021, 3:36 p.m. UTC | #8
On Thu, Jan 28, 2021 at 08:15:45PM -0800, Jakub Kicinski wrote:
> My impression from working on this problem in TC is that the definition
> of "all" becomes problematic especially if one takes into account
> drivers getting reloaded. But I think routing offload has stronger
> semantics than TC, so no objections.

During the teardown phase of the reload, all the routes using the
driver's netdevs (or their uppers) will be deleted by the kernel because
the netdevs will disappear. During the init phase of the reload, the
driver will re-register its FIB notifier and ask for a dump of all the
existing routes (usually only host routes). With this patchset, user
space will receive a notification that these routes are now in hardware.

# ip monitor route
broadcast 127.255.255.255 dev lo table local proto kernel scope link src 127.0.0.1 
local 127.0.0.1 dev lo table local proto kernel scope host src 127.0.0.1 
local 127.0.0.0/8 dev lo table local proto kernel scope host src 127.0.0.1 
broadcast 127.0.0.0 dev lo table local proto kernel scope link src 127.0.0.1 
broadcast 10.209.1.255 dev eth0 table local proto kernel scope link src 10.209.0.191 
local 10.209.0.191 dev eth0 table local proto kernel scope host src 10.209.0.191 
broadcast 10.209.0.0 dev eth0 table local proto kernel scope link src 10.209.0.191 
10.209.0.1 dev eth0 proto dhcp scope link src 10.209.0.191 metric 1024 
10.209.0.0/23 dev eth0 proto kernel scope link src 10.209.0.191 
default via 10.209.0.1 dev eth0 proto dhcp src 10.209.0.191 metric 1024 
<< init phase starts here >>
default via 10.209.0.1 dev eth0 proto dhcp src 10.209.0.191 metric 1024 rt_trap 
10.209.0.0/23 dev eth0 proto kernel scope link src 10.209.0.191 rt_trap 
10.209.0.1 dev eth0 proto dhcp scope link src 10.209.0.191 metric 1024 rt_trap 
broadcast 10.209.0.0 dev eth0 table local proto kernel scope link src 10.209.0.191 rt_trap 
local 10.209.0.191 dev eth0 table local proto kernel scope host src 10.209.0.191 rt_trap 
broadcast 10.209.1.255 dev eth0 table local proto kernel scope link src 10.209.0.191 rt_trap 
broadcast 127.0.0.0 dev lo table local proto kernel scope link src 127.0.0.1 rt_trap 
local 127.0.0.0/8 dev lo table local proto kernel scope host src 127.0.0.1 rt_trap 
local 127.0.0.1 dev lo table local proto kernel scope host src 127.0.0.1 rt_trap 
broadcast 127.255.255.255 dev lo table local proto kernel scope link src 127.0.0.1 rt_trap
diff mbox series

Patch

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index dd2b12a32b73..01927b36bbee 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -178,6 +178,26 @@  min_adv_mss - INTEGER
 	The advertised MSS depends on the first hop route MTU, but will
 	never be lower than this setting.
 
+fib_notify_on_flag_change - INTEGER
+        Whether to emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/
+        RTM_F_TRAP flags are changed.
+
+        After installing a route to the kernel, user space receives an
+        acknowledgment, which means the route was installed in the kernel,
+        but not necessarily in hardware.
+        It is also possible for a route already installed in hardware to change
+        its action and therefore its flags. For example, a host route that is
+        trapping packets can be "promoted" to perform decapsulation following
+        the installation of an IPinIP/VXLAN tunnel.
+        The notifications will indicate to user-space the state of the route.
+
+        Default: 0 (Do not emit notifications.)
+
+        Possible values:
+
+        - 0 - Do not emit notifications.
+        - 1 - Emit notifications.
+
 IP Fragmentation:
 
 ipfrag_high_thresh - LONG INTEGER
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 8e4fcac4df72..70a2a085dd1a 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -188,6 +188,8 @@  struct netns_ipv4 {
 	int sysctl_udp_wmem_min;
 	int sysctl_udp_rmem_min;
 
+	int sysctl_fib_notify_on_flag_change;
+
 #ifdef CONFIG_NET_L3_MASTER_DEV
 	int sysctl_udp_l3mdev_accept;
 #endif
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index b94fa8eb831b..ab42f6404fc6 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1871,6 +1871,8 @@  static __net_init int inet_init_net(struct net *net)
 	net->ipv4.sysctl_igmp_llm_reports = 1;
 	net->ipv4.sysctl_igmp_qrv = 2;
 
+	net->ipv4.sysctl_fib_notify_on_flag_change = 0;
+
 	return 0;
 }
 
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 28117c05dc35..60559b708158 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1038,6 +1038,8 @@  fib_find_matching_alias(struct net *net, const struct fib_rt_info *fri)
 void fib_alias_hw_flags_set(struct net *net, const struct fib_rt_info *fri)
 {
 	struct fib_alias *fa_match;
+	struct sk_buff *skb;
+	int err;
 
 	rcu_read_lock();
 
@@ -1045,9 +1047,34 @@  void fib_alias_hw_flags_set(struct net *net, const struct fib_rt_info *fri)
 	if (!fa_match)
 		goto out;
 
+	if (fa_match->offload == fri->offload && fa_match->trap == fri->trap)
+		goto out;
+
 	fa_match->offload = fri->offload;
 	fa_match->trap = fri->trap;
 
+	if (!net->ipv4.sysctl_fib_notify_on_flag_change)
+		goto out;
+
+	skb = nlmsg_new(fib_nlmsg_size(fa_match->fa_info), GFP_ATOMIC);
+	if (!skb) {
+		err = -ENOBUFS;
+		goto errout;
+	}
+
+	err = fib_dump_info(skb, 0, 0, RTM_NEWROUTE, fri, 0);
+	if (err < 0) {
+		/* -EMSGSIZE implies BUG in fib_nlmsg_size() */
+		WARN_ON(err == -EMSGSIZE);
+		kfree_skb(skb);
+		goto errout;
+	}
+
+	rtnl_notify(skb, net, 0, RTNLGRP_IPV4_ROUTE, NULL, GFP_ATOMIC);
+	goto out;
+
+errout:
+	rtnl_set_sk_err(net, RTNLGRP_IPV4_ROUTE, err);
 out:
 	rcu_read_unlock();
 }
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 3e5f4f2e705e..e5798b3b59d2 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -1354,6 +1354,15 @@  static struct ctl_table ipv4_net_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ONE
 	},
+	{
+		.procname	= "fib_notify_on_flag_change",
+		.data		= &init_net.ipv4.sysctl_fib_notify_on_flag_change,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 	{ }
 };