diff mbox series

[net-next,v2] veth: Support bonding events

Message ID 20220329114052.237572-1-wintera@linux.ibm.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net-next,v2] veth: Support bonding events | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers success CCed 4 of 4 maintainers
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 71 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Alexandra Winter March 29, 2022, 11:40 a.m. UTC
Bonding drivers generate specific events during failover that trigger
switch updates.  When a veth device is attached to a bridge with a
bond interface, we want external switches to learn about the veth
devices as well.

Example:

	| veth_a2   |  veth_b2  |  veth_c2 |
	------o-----------o----------o------
	       \	  |	    /
		o	  o	   o
	      veth_a1  veth_b1  veth_c1
	      -------------------------
	      |        bridge         |
	      -------------------------
			bond0
			/  \
		     eth0  eth1

In case of failover from eth0 to eth1, the netdev_notifier needs to be
propagated, so e.g. veth_a2 can re-announce its MAC address to the
external hardware attached to eth1.

Without this patch we have seen cases where recovery after bond failover
took an unacceptable amount of time (depending on timeout settings in the
network).

Due to the symmetric nature of veth special care is required to avoid
endless notification loops. Therefore we only notify from a veth
bridgeport to a peer that is not a bridgeport.

References:
Same handling as for macvlan:
commit 4c9912556867 ("macvlan: Support bonding events")
and vlan:
commit 4aa5dee4d999 ("net: convert resend IGMP to notifier event")

Alternatives:
Propagate notifier events to all ports of a bridge. IIUC, this was
rejected in https://www.spinics.net/lists/netdev/msg717292.html
It also seems difficult to avoid re-bouncing the notifier.

Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
---
 drivers/net/veth.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

Comments

Jakub Kicinski March 30, 2022, 12:54 a.m. UTC | #1
Dropping the BPF people from CC and adding Hangbin, bridge and
bond/team. Please exercise some judgment when sending patches.

On Tue, 29 Mar 2022 13:40:52 +0200 Alexandra Winter wrote:
> Bonding drivers generate specific events during failover that trigger
> switch updates.  When a veth device is attached to a bridge with a
> bond interface, we want external switches to learn about the veth
> devices as well.
> 
> Example:
> 
> 	| veth_a2   |  veth_b2  |  veth_c2 |
> 	------o-----------o----------o------
> 	       \	  |	    /
> 		o	  o	   o
> 	      veth_a1  veth_b1  veth_c1
> 	      -------------------------
> 	      |        bridge         |
> 	      -------------------------
> 			bond0
> 			/  \
> 		     eth0  eth1
> 
> In case of failover from eth0 to eth1, the netdev_notifier needs to be
> propagated, so e.g. veth_a2 can re-announce its MAC address to the
> external hardware attached to eth1.
> 
> Without this patch we have seen cases where recovery after bond failover
> took an unacceptable amount of time (depending on timeout settings in the
> network).
> 
> Due to the symmetric nature of veth special care is required to avoid
> endless notification loops. Therefore we only notify from a veth
> bridgeport to a peer that is not a bridgeport.
> 
> References:
> Same handling as for macvlan:
> commit 4c9912556867 ("macvlan: Support bonding events")
> and vlan:
> commit 4aa5dee4d999 ("net: convert resend IGMP to notifier event")
> 
> Alternatives:
> Propagate notifier events to all ports of a bridge. IIUC, this was
> rejected in https://www.spinics.net/lists/netdev/msg717292.html

My (likely flawed) reading of Nik's argument was that (1) he was
concerned about GARP storms; (2) he didn't want the GARP to be
broadcast to all ports, just the bond that originated the request.

I'm not sure I follow (1), as Hangbin said the event is rare, plus 
GARP only comes from interfaces that have an IP addr, which IIUC
most bridge ports will not have.

This patch in no way addresses (2). But then, again, if we put 
a macvlan on top of a bridge master it will shotgun its GARPS all 
the same. So it's not like veth would be special in that regard.

Nik, what am I missing?

> It also seems difficult to avoid re-bouncing the notifier.

syzbot will make short work of this patch, I think the potential
for infinite loops has to be addressed somehow. IIUC this is the 
first instance of forwarding those notifiers to a peer rather
than within a upper <> lower device hierarchy which is a DAG.

> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
> ---
>  drivers/net/veth.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 53 insertions(+)
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index d29fb9759cc9..74b074453197 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -1579,6 +1579,57 @@ static void veth_setup(struct net_device *dev)
>  	dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
>  }
>  
> +static bool netif_is_veth(const struct net_device *dev)
> +{
> +	return (dev->netdev_ops == &veth_netdev_ops);

brackets unnecessary 

> +}
> +
> +static void veth_notify_peer(unsigned long event, const struct net_device *dev)
> +{
> +	struct net_device *peer;
> +	struct veth_priv *priv;
> +
> +	priv = netdev_priv(dev);
> +	peer = rtnl_dereference(priv->peer);
> +	/* avoid re-bounce between 2 bridges */
> +	if (!netif_is_bridge_port(peer))
> +		call_netdevice_notifiers(event, peer);
> +}
> +
> +/* Called under rtnl_lock */
> +static int veth_device_event(struct notifier_block *unused,
> +			     unsigned long event, void *ptr)
> +{
> +	struct net_device *dev, *lower;
> +	struct list_head *iter;
> +
> +	dev = netdev_notifier_info_to_dev(ptr);
> +
> +	switch (event) {
> +	case NETDEV_NOTIFY_PEERS:
> +	case NETDEV_BONDING_FAILOVER:
> +	case NETDEV_RESEND_IGMP:
> +		/* propagate to peer of a bridge attached veth */
> +		if (netif_is_bridge_master(dev)) {

Having veth sift thru bridge ports seems strange.
In fact it could be beneficial to filter the event based on
port state (whether it's forwarding, vlan etc). But looking
at details of port state outside the bridge would be even stranger.

> +			iter = &dev->adj_list.lower;
> +			lower = netdev_next_lower_dev_rcu(dev, &iter);
> +			while (lower) {
> +				if (netif_is_veth(lower))
> +					veth_notify_peer(event, lower);
> +				lower = netdev_next_lower_dev_rcu(dev, &iter);

let's add netdev_for_each_lower_dev_rcu() rather than open-coding

> +			}
> +		}
> +		break;
> +	default:
> +		break;
> +	}
> +	return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block veth_notifier_block __read_mostly = {
> +		.notifier_call  = veth_device_event,

extra tab here

> +};
> +
>  /*
>   * netlink interface
>   */
> @@ -1824,12 +1875,14 @@ static struct rtnl_link_ops veth_link_ops = {
>  
>  static __init int veth_init(void)
>  {
> +	register_netdevice_notifier(&veth_notifier_block);

this can fail

>  	return rtnl_link_register(&veth_link_ops);
>  }
>  
>  static __exit void veth_exit(void)
>  {
>  	rtnl_link_unregister(&veth_link_ops);
> +	unregister_netdevice_notifier(&veth_notifier_block);
>  }
>  
>  module_init(veth_init);
Nikolay Aleksandrov March 30, 2022, 10:23 a.m. UTC | #2
On 30/03/2022 03:54, Jakub Kicinski wrote:
> Dropping the BPF people from CC and adding Hangbin, bridge and
> bond/team. Please exercise some judgment when sending patches.
> 
> On Tue, 29 Mar 2022 13:40:52 +0200 Alexandra Winter wrote:
>> Bonding drivers generate specific events during failover that trigger
>> switch updates.  When a veth device is attached to a bridge with a
>> bond interface, we want external switches to learn about the veth
>> devices as well.
>>
>> Example:
>>
>> 	| veth_a2   |  veth_b2  |  veth_c2 |
>> 	------o-----------o----------o------
>> 	       \	  |	    /
>> 		o	  o	   o
>> 	      veth_a1  veth_b1  veth_c1
>> 	      -------------------------
>> 	      |        bridge         |
>> 	      -------------------------
>> 			bond0
>> 			/  \
>> 		     eth0  eth1
>>
>> In case of failover from eth0 to eth1, the netdev_notifier needs to be
>> propagated, so e.g. veth_a2 can re-announce its MAC address to the
>> external hardware attached to eth1.
>>
>> Without this patch we have seen cases where recovery after bond failover
>> took an unacceptable amount of time (depending on timeout settings in the
>> network).
>>
>> Due to the symmetric nature of veth special care is required to avoid
>> endless notification loops. Therefore we only notify from a veth
>> bridgeport to a peer that is not a bridgeport.
>>
>> References:
>> Same handling as for macvlan:
>> commit 4c9912556867 ("macvlan: Support bonding events")
>> and vlan:
>> commit 4aa5dee4d999 ("net: convert resend IGMP to notifier event")
>>
>> Alternatives:
>> Propagate notifier events to all ports of a bridge. IIUC, this was
>> rejected in https://www.spinics.net/lists/netdev/msg717292.html
> 
> My (likely flawed) reading of Nik's argument was that (1) he was
> concerned about GARP storms; (2) he didn't want the GARP to be
> broadcast to all ports, just the bond that originated the request.
> 

Yes, that would be ideal. Trying to avoid unnecessary bcasts, that is
especially important for large setups with lots of devices.

> I'm not sure I follow (1), as Hangbin said the event is rare, plus 
> GARP only comes from interfaces that have an IP addr, which IIUC
> most bridge ports will not have.
> 

Indeed, such setups are not the most common ones.

> This patch in no way addresses (2). But then, again, if we put 
> a macvlan on top of a bridge master it will shotgun its GARPS all 
> the same. So it's not like veth would be special in that regard.
> 
> Nik, what am I missing?
> 

If we're talking about macvlan -> bridge -> bond then the bond flap's
notify peers shouldn't reach the macvlan. Generally broadcast traffic
is quite expensive for the bridge, I have patches that improve on the
technical side (consider ports only for the same bcast domain), but you also
wouldn't want unnecessary bcast packets being sent around. :)
There are setups with tens of bond devices and propagating that to all would be
very expensive, but most of all unnecessary. It would also hurt setups with
a lot of vlan devices on the bridge. There are setups with hundreds of vlans
and hundreds of macvlans on top, propagating it up would send it to all of
them and that wouldn't scale at all, these mostly have IP addresses too.

Perhaps we can enable propagation on a per-port or per-bridge basis, then we
can avoid these walks. That is, make it opt-in.

>> It also seems difficult to avoid re-bouncing the notifier.
> 
> syzbot will make short work of this patch, I think the potential
> for infinite loops has to be addressed somehow. IIUC this is the 
> first instance of forwarding those notifiers to a peer rather
> than within a upper <> lower device hierarchy which is a DAG.
> 
>> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
>> ---
>>  drivers/net/veth.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 53 insertions(+)
>>
>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index d29fb9759cc9..74b074453197 100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -1579,6 +1579,57 @@ static void veth_setup(struct net_device *dev)
>>  	dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
>>  }
>>  
>> +static bool netif_is_veth(const struct net_device *dev)
>> +{
>> +	return (dev->netdev_ops == &veth_netdev_ops);
> 
> brackets unnecessary 
> 
>> +}
>> +
>> +static void veth_notify_peer(unsigned long event, const struct net_device *dev)
>> +{
>> +	struct net_device *peer;
>> +	struct veth_priv *priv;
>> +
>> +	priv = netdev_priv(dev);
>> +	peer = rtnl_dereference(priv->peer);
>> +	/* avoid re-bounce between 2 bridges */
>> +	if (!netif_is_bridge_port(peer))
>> +		call_netdevice_notifiers(event, peer);
>> +}
>> +
>> +/* Called under rtnl_lock */
>> +static int veth_device_event(struct notifier_block *unused,
>> +			     unsigned long event, void *ptr)
>> +{
>> +	struct net_device *dev, *lower;
>> +	struct list_head *iter;
>> +
>> +	dev = netdev_notifier_info_to_dev(ptr);
>> +
>> +	switch (event) {
>> +	case NETDEV_NOTIFY_PEERS:
>> +	case NETDEV_BONDING_FAILOVER:
>> +	case NETDEV_RESEND_IGMP:
>> +		/* propagate to peer of a bridge attached veth */
>> +		if (netif_is_bridge_master(dev)) {
> 
> Having veth sift thru bridge ports seems strange.
> In fact it could be beneficial to filter the event based on
> port state (whether it's forwarding, vlan etc). But looking
> at details of port state outside the bridge would be even stranger.
> 
>> +			iter = &dev->adj_list.lower;
>> +			lower = netdev_next_lower_dev_rcu(dev, &iter);
>> +			while (lower) {
>> +				if (netif_is_veth(lower))
>> +					veth_notify_peer(event, lower);
>> +				lower = netdev_next_lower_dev_rcu(dev, &iter);
> 
> let's add netdev_for_each_lower_dev_rcu() rather than open-coding
> 
>> +			}
>> +		}
>> +		break;
>> +	default:
>> +		break;
>> +	}
>> +	return NOTIFY_DONE;
>> +}
>> +
>> +static struct notifier_block veth_notifier_block __read_mostly = {
>> +		.notifier_call  = veth_device_event,
> 
> extra tab here
> 
>> +};
>> +
>>  /*
>>   * netlink interface
>>   */
>> @@ -1824,12 +1875,14 @@ static struct rtnl_link_ops veth_link_ops = {
>>  
>>  static __init int veth_init(void)
>>  {
>> +	register_netdevice_notifier(&veth_notifier_block);
> 
> this can fail
> 
>>  	return rtnl_link_register(&veth_link_ops);
>>  }
>>  
>>  static __exit void veth_exit(void)
>>  {
>>  	rtnl_link_unregister(&veth_link_ops);
>> +	unregister_netdevice_notifier(&veth_notifier_block);
>>  }
>>  
>>  module_init(veth_init);
>
Alexandra Winter March 30, 2022, 11:14 a.m. UTC | #3
On 30.03.22 12:23, Nikolay Aleksandrov wrote:
> On 30/03/2022 03:54, Jakub Kicinski wrote:
>> Dropping the BPF people from CC and adding Hangbin, bridge and
>> bond/team. Please exercise some judgment when sending patches.
Thank you. 
I did  'scripts/get_maintainer.pl drivers/net/veth.c'
but was a bit surprised about the outcome.

>>
>> On Tue, 29 Mar 2022 13:40:52 +0200 Alexandra Winter wrote:
>>> Bonding drivers generate specific events during failover that trigger
>>> switch updates.  When a veth device is attached to a bridge with a
>>> bond interface, we want external switches to learn about the veth
>>> devices as well.
>>>
>>> Example:
>>>
>>> 	| veth_a2   |  veth_b2  |  veth_c2 |
>>> 	------o-----------o----------o------
>>> 	       \	  |	    /
>>> 		o	  o	   o
>>> 	      veth_a1  veth_b1  veth_c1
>>> 	      -------------------------
>>> 	      |        bridge         |
>>> 	      -------------------------
>>> 			bond0
>>> 			/  \
>>> 		     eth0  eth1
>>>
>>> In case of failover from eth0 to eth1, the netdev_notifier needs to be
>>> propagated, so e.g. veth_a2 can re-announce its MAC address to the
>>> external hardware attached to eth1.
>>>
>>> Without this patch we have seen cases where recovery after bond failover
>>> took an unacceptable amount of time (depending on timeout settings in the
>>> network).
>>>
>>> Due to the symmetric nature of veth special care is required to avoid
>>> endless notification loops. Therefore we only notify from a veth
>>> bridgeport to a peer that is not a bridgeport.
>>>
>>> References:
>>> Same handling as for macvlan:
>>> commit 4c9912556867 ("macvlan: Support bonding events")
>>> and vlan:
>>> commit 4aa5dee4d999 ("net: convert resend IGMP to notifier event")
>>>
>>> Alternatives:
>>> Propagate notifier events to all ports of a bridge. IIUC, this was
>>> rejected in https://www.spinics.net/lists/netdev/msg717292.html
>>
>> My (likely flawed) reading of Nik's argument was that (1) he was
>> concerned about GARP storms; (2) he didn't want the GARP to be
>> broadcast to all ports, just the bond that originated the request.
>>
> 
> Yes, that would be ideal. Trying to avoid unnecessary bcasts, that is
> especially important for large setups with lots of devices.

One way to target the bond that originated the request, would be if the
bridge itself would do GARPs/RARPS/..,  on this bond port for all MACs
that are in its FDB. What do you think about that?

> 
>> I'm not sure I follow (1), as Hangbin said the event is rare, plus 
>> GARP only comes from interfaces that have an IP addr, which IIUC
>> most bridge ports will not have.
>>
> 
> Indeed, such setups are not the most common ones.
> 
>> This patch in no way addresses (2). But then, again, if we put 
>> a macvlan on top of a bridge master it will shotgun its GARPS all 
>> the same. So it's not like veth would be special in that regard.
>>
>> Nik, what am I missing?
>>
> 
> If we're talking about macvlan -> bridge -> bond then the bond flap's
> notify peers shouldn't reach the macvlan. Generally broadcast traffic
> is quite expensive for the bridge, I have patches that improve on the
> technical side (consider ports only for the same bcast domain), but you also
> wouldn't want unnecessary bcast packets being sent around. :)
> There are setups with tens of bond devices and propagating that to all would be
> very expensive, but most of all unnecessary. It would also hurt setups with
> a lot of vlan devices on the bridge. There are setups with hundreds of vlans
> and hundreds of macvlans on top, propagating it up would send it to all of
> them and that wouldn't scale at all, these mostly have IP addresses too.
> 
> Perhaps we can enable propagation on a per-port or per-bridge basis, then we
> can avoid these walks. That is, make it opt-in.
> 
>>> It also seems difficult to avoid re-bouncing the notifier.
>>
>> syzbot will make short work of this patch, I think the potential
>> for infinite loops has to be addressed somehow. IIUC this is the 
>> first instance of forwarding those notifiers to a peer rather
>> than within a upper <> lower device hierarchy which is a DAG.

My concern was about the Hangbin's alternative proposal to notify all
bridge ports. I hope in my porposal I was able to avoid infinite loops.

>>
>>> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
>>> --- 
>>>  drivers/net/veth.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 53 insertions(+)
>>>
>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>> index d29fb9759cc9..74b074453197 100644
>>> --- a/drivers/net/veth.c
>>> +++ b/drivers/net/veth.c
>>> @@ -1579,6 +1579,57 @@ static void veth_setup(struct net_device *dev)
>>>  	dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
>>>  }
>>>  
>>> +static bool netif_is_veth(const struct net_device *dev)
>>> +{
>>> +	return (dev->netdev_ops == &veth_netdev_ops);
>>
>> brackets unnecessary 
>>
>>> +}
>>> +
>>> +static void veth_notify_peer(unsigned long event, const struct net_device *dev)
>>> +{
>>> +	struct net_device *peer;
>>> +	struct veth_priv *priv;
>>> +
>>> +	priv = netdev_priv(dev);
>>> +	peer = rtnl_dereference(priv->peer);
>>> +	/* avoid re-bounce between 2 bridges */
>>> +	if (!netif_is_bridge_port(peer))
>>> +		call_netdevice_notifiers(event, peer);
>>> +}
>>> +
>>> +/* Called under rtnl_lock */
>>> +static int veth_device_event(struct notifier_block *unused,
>>> +			     unsigned long event, void *ptr)
>>> +{
>>> +	struct net_device *dev, *lower;
>>> +	struct list_head *iter;
>>> +
>>> +	dev = netdev_notifier_info_to_dev(ptr);
>>> +
>>> +	switch (event) {
>>> +	case NETDEV_NOTIFY_PEERS:
>>> +	case NETDEV_BONDING_FAILOVER:
>>> +	case NETDEV_RESEND_IGMP:
>>> +		/* propagate to peer of a bridge attached veth */
>>> +		if (netif_is_bridge_master(dev)) {
>>
>> Having veth sift thru bridge ports seems strange.
>> In fact it could be beneficial to filter the event based on
>> port state (whether it's forwarding, vlan etc). But looking
>> at details of port state outside the bridge would be even stranger.
>>
>>> +			iter = &dev->adj_list.lower;
>>> +			lower = netdev_next_lower_dev_rcu(dev, &iter);
>>> +			while (lower) {
>>> +				if (netif_is_veth(lower))
>>> +					veth_notify_peer(event, lower);
>>> +				lower = netdev_next_lower_dev_rcu(dev, &iter);
>>
>> let's add netdev_for_each_lower_dev_rcu() rather than open-coding
>>
>>> +			}
>>> +		}
>>> +		break;
>>> +	default:
>>> +		break;
>>> +	}
>>> +	return NOTIFY_DONE;
>>> +}
>>> +
>>> +static struct notifier_block veth_notifier_block __read_mostly = {
>>> +		.notifier_call  = veth_device_event,
>>
>> extra tab here
>>
>>> +};
>>> +
>>>  /*
>>>   * netlink interface
>>>   */
>>> @@ -1824,12 +1875,14 @@ static struct rtnl_link_ops veth_link_ops = {
>>>  
>>>  static __init int veth_init(void)
>>>  {
>>> +	register_netdevice_notifier(&veth_notifier_block);
>>
>> this can fail
>>
>>>  	return rtnl_link_register(&veth_link_ops);
>>>  }
>>>  
>>>  static __exit void veth_exit(void)
>>>  {
>>>  	rtnl_link_unregister(&veth_link_ops);
>>> +	unregister_netdevice_notifier(&veth_notifier_block);
>>>  }
>>>  
>>>  module_init(veth_init);
>>
>
Nikolay Aleksandrov March 30, 2022, 11:25 a.m. UTC | #4
On 30/03/2022 14:14, Alexandra Winter wrote:
> 
> 
> On 30.03.22 12:23, Nikolay Aleksandrov wrote:
>> On 30/03/2022 03:54, Jakub Kicinski wrote:
>>> Dropping the BPF people from CC and adding Hangbin, bridge and
>>> bond/team. Please exercise some judgment when sending patches.
> Thank you. 
> I did  'scripts/get_maintainer.pl drivers/net/veth.c'
> but was a bit surprised about the outcome.
> 
>>>
>>> On Tue, 29 Mar 2022 13:40:52 +0200 Alexandra Winter wrote:
>>>> Bonding drivers generate specific events during failover that trigger
>>>> switch updates.  When a veth device is attached to a bridge with a
>>>> bond interface, we want external switches to learn about the veth
>>>> devices as well.
>>>>
>>>> Example:
>>>>
>>>> 	| veth_a2   |  veth_b2  |  veth_c2 |
>>>> 	------o-----------o----------o------
>>>> 	       \	  |	    /
>>>> 		o	  o	   o
>>>> 	      veth_a1  veth_b1  veth_c1
>>>> 	      -------------------------
>>>> 	      |        bridge         |
>>>> 	      -------------------------
>>>> 			bond0
>>>> 			/  \
>>>> 		     eth0  eth1
>>>>
>>>> In case of failover from eth0 to eth1, the netdev_notifier needs to be
>>>> propagated, so e.g. veth_a2 can re-announce its MAC address to the
>>>> external hardware attached to eth1.
>>>>
>>>> Without this patch we have seen cases where recovery after bond failover
>>>> took an unacceptable amount of time (depending on timeout settings in the
>>>> network).
>>>>
>>>> Due to the symmetric nature of veth special care is required to avoid
>>>> endless notification loops. Therefore we only notify from a veth
>>>> bridgeport to a peer that is not a bridgeport.
>>>>
>>>> References:
>>>> Same handling as for macvlan:
>>>> commit 4c9912556867 ("macvlan: Support bonding events")
>>>> and vlan:
>>>> commit 4aa5dee4d999 ("net: convert resend IGMP to notifier event")
>>>>
>>>> Alternatives:
>>>> Propagate notifier events to all ports of a bridge. IIUC, this was
>>>> rejected in https://www.spinics.net/lists/netdev/msg717292.html
>>>
>>> My (likely flawed) reading of Nik's argument was that (1) he was
>>> concerned about GARP storms; (2) he didn't want the GARP to be
>>> broadcast to all ports, just the bond that originated the request.
>>>
>>
>> Yes, that would be ideal. Trying to avoid unnecessary bcasts, that is
>> especially important for large setups with lots of devices.
> 
> One way to target the bond that originated the request, would be if the
> bridge itself would do GARPs/RARPS/..,  on this bond port for all MACs
> that are in its FDB. What do you think about that?
> 

That's a hack and you can already do it easily in user-space, you don't need
anything special in the bridge. It is also very specific, and it should only
happen in certain situations (e.g. a/b flap) which the bridge doesn't really
know about, but user-space does because it can see the notifications and
can see the bond mode.

>>
>>> I'm not sure I follow (1), as Hangbin said the event is rare, plus 
>>> GARP only comes from interfaces that have an IP addr, which IIUC
>>> most bridge ports will not have.
>>>
>>
>> Indeed, such setups are not the most common ones.
>>
>>> This patch in no way addresses (2). But then, again, if we put 
>>> a macvlan on top of a bridge master it will shotgun its GARPS all 
>>> the same. So it's not like veth would be special in that regard.
>>>
>>> Nik, what am I missing?
>>>
>>
>> If we're talking about macvlan -> bridge -> bond then the bond flap's
>> notify peers shouldn't reach the macvlan. Generally broadcast traffic
>> is quite expensive for the bridge, I have patches that improve on the
>> technical side (consider ports only for the same bcast domain), but you also
>> wouldn't want unnecessary bcast packets being sent around. :)
>> There are setups with tens of bond devices and propagating that to all would be
>> very expensive, but most of all unnecessary. It would also hurt setups with
>> a lot of vlan devices on the bridge. There are setups with hundreds of vlans
>> and hundreds of macvlans on top, propagating it up would send it to all of
>> them and that wouldn't scale at all, these mostly have IP addresses too.
>>
>> Perhaps we can enable propagation on a per-port or per-bridge basis, then we
>> can avoid these walks. That is, make it opt-in.
>>
>>>> It also seems difficult to avoid re-bouncing the notifier.
>>>
>>> syzbot will make short work of this patch, I think the potential
>>> for infinite loops has to be addressed somehow. IIUC this is the 
>>> first instance of forwarding those notifiers to a peer rather
>>> than within a upper <> lower device hierarchy which is a DAG.
> 
> My concern was about the Hangbin's alternative proposal to notify all
> bridge ports. I hope in my porposal I was able to avoid infinite loops.
> >>>
>>>> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
>>>> --- 
>>>>  drivers/net/veth.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 53 insertions(+)
>>>>
>>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>>> index d29fb9759cc9..74b074453197 100644
>>>> --- a/drivers/net/veth.c
>>>> +++ b/drivers/net/veth.c
>>>> @@ -1579,6 +1579,57 @@ static void veth_setup(struct net_device *dev)
>>>>  	dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
>>>>  }
>>>>  
>>>> +static bool netif_is_veth(const struct net_device *dev)
>>>> +{
>>>> +	return (dev->netdev_ops == &veth_netdev_ops);
>>>
>>> brackets unnecessary 
>>>
>>>> +}
>>>> +
>>>> +static void veth_notify_peer(unsigned long event, const struct net_device *dev)
>>>> +{
>>>> +	struct net_device *peer;
>>>> +	struct veth_priv *priv;
>>>> +
>>>> +	priv = netdev_priv(dev);
>>>> +	peer = rtnl_dereference(priv->peer);
>>>> +	/* avoid re-bounce between 2 bridges */
>>>> +	if (!netif_is_bridge_port(peer))
>>>> +		call_netdevice_notifiers(event, peer);
>>>> +}
>>>> +
>>>> +/* Called under rtnl_lock */
>>>> +static int veth_device_event(struct notifier_block *unused,
>>>> +			     unsigned long event, void *ptr)
>>>> +{
>>>> +	struct net_device *dev, *lower;
>>>> +	struct list_head *iter;
>>>> +
>>>> +	dev = netdev_notifier_info_to_dev(ptr);
>>>> +
>>>> +	switch (event) {
>>>> +	case NETDEV_NOTIFY_PEERS:
>>>> +	case NETDEV_BONDING_FAILOVER:
>>>> +	case NETDEV_RESEND_IGMP:
>>>> +		/* propagate to peer of a bridge attached veth */
>>>> +		if (netif_is_bridge_master(dev)) {
>>>
>>> Having veth sift thru bridge ports seems strange.
>>> In fact it could be beneficial to filter the event based on
>>> port state (whether it's forwarding, vlan etc). But looking
>>> at details of port state outside the bridge would be even stranger.
>>>
>>>> +			iter = &dev->adj_list.lower;
>>>> +			lower = netdev_next_lower_dev_rcu(dev, &iter);
>>>> +			while (lower) {
>>>> +				if (netif_is_veth(lower))
>>>> +					veth_notify_peer(event, lower);
>>>> +				lower = netdev_next_lower_dev_rcu(dev, &iter);
>>>
>>> let's add netdev_for_each_lower_dev_rcu() rather than open-coding
>>>
>>>> +			}
>>>> +		}
>>>> +		break;
>>>> +	default:
>>>> +		break;
>>>> +	}
>>>> +	return NOTIFY_DONE;
>>>> +}
>>>> +
>>>> +static struct notifier_block veth_notifier_block __read_mostly = {
>>>> +		.notifier_call  = veth_device_event,
>>>
>>> extra tab here
>>>
>>>> +};
>>>> +
>>>>  /*
>>>>   * netlink interface
>>>>   */
>>>> @@ -1824,12 +1875,14 @@ static struct rtnl_link_ops veth_link_ops = {
>>>>  
>>>>  static __init int veth_init(void)
>>>>  {
>>>> +	register_netdevice_notifier(&veth_notifier_block);
>>>
>>> this can fail
>>>
>>>>  	return rtnl_link_register(&veth_link_ops);
>>>>  }
>>>>  
>>>>  static __exit void veth_exit(void)
>>>>  {
>>>>  	rtnl_link_unregister(&veth_link_ops);
>>>> +	unregister_netdevice_notifier(&veth_notifier_block);
>>>>  }
>>>>  
>>>>  module_init(veth_init);
>>>
>>
Jakub Kicinski March 30, 2022, 3:51 p.m. UTC | #5
On Wed, 30 Mar 2022 13:14:12 +0200 Alexandra Winter wrote:
> >> This patch in no way addresses (2). But then, again, if we put 
> >> a macvlan on top of a bridge master it will shotgun its GARPS all 
> >> the same. So it's not like veth would be special in that regard.
> >>
> >> Nik, what am I missing?
> > 
> > If we're talking about macvlan -> bridge -> bond then the bond flap's
> > notify peers shouldn't reach the macvlan.

Hm, right. I'm missing a step in my understanding. As you say bridge
does not seem to be re-broadcasting the event to its master. So how
does Alexandra catch this kind of an event? :S

	case NETDEV_NOTIFY_PEERS:
		/* propagate to peer of a bridge attached veth */
		if (netif_is_bridge_master(dev)) {  

IIUC bond will notify with dev == bond netdev. Where is the event with
dev == br generated?

> > Generally broadcast traffic
> > is quite expensive for the bridge, I have patches that improve on the
> > technical side (consider ports only for the same bcast domain), but you also
> > wouldn't want unnecessary bcast packets being sent around. :)
> > There are setups with tens of bond devices and propagating that to all would be
> > very expensive, but most of all unnecessary. It would also hurt setups with
> > a lot of vlan devices on the bridge. There are setups with hundreds of vlans
> > and hundreds of macvlans on top, propagating it up would send it to all of
> > them and that wouldn't scale at all, these mostly have IP addresses too.

Ack.

> > Perhaps we can enable propagation on a per-port or per-bridge basis, then we
> > can avoid these walks. That is, make it opt-in.

Maybe opt-out? But assuming the event is only generated on
active/backup switch over - when would it be okay to ignore
the notification?

> >>> It also seems difficult to avoid re-bouncing the notifier.  
> >>
> >> syzbot will make short work of this patch, I think the potential
> >> for infinite loops has to be addressed somehow. IIUC this is the 
> >> first instance of forwarding those notifiers to a peer rather
> >> than within a upper <> lower device hierarchy which is a DAG.  
> 
> My concern was about the Hangbin's alternative proposal to notify all
> bridge ports. I hope in my porposal I was able to avoid infinite loops.

Possibly I'm confused as to where the notification for bridge master
gets sent..
Nikolay Aleksandrov March 30, 2022, 4:16 p.m. UTC | #6
On 30/03/2022 18:51, Jakub Kicinski wrote:
> On Wed, 30 Mar 2022 13:14:12 +0200 Alexandra Winter wrote:
>>>> This patch in no way addresses (2). But then, again, if we put 
>>>> a macvlan on top of a bridge master it will shotgun its GARPS all 
>>>> the same. So it's not like veth would be special in that regard.
>>>>
>>>> Nik, what am I missing?
>>>
>>> If we're talking about macvlan -> bridge -> bond then the bond flap's
>>> notify peers shouldn't reach the macvlan.
> 
> Hm, right. I'm missing a step in my understanding. As you say bridge
> does not seem to be re-broadcasting the event to its master. So how
> does Alexandra catch this kind of an event? :S
> 
> 	case NETDEV_NOTIFY_PEERS:
> 		/* propagate to peer of a bridge attached veth */
> 		if (netif_is_bridge_master(dev)) {  
> 
> IIUC bond will notify with dev == bond netdev. Where is the event with
> dev == br generated?
> 

Good question. :)

>>> Generally broadcast traffic
>>> is quite expensive for the bridge, I have patches that improve on the
>>> technical side (consider ports only for the same bcast domain), but you also
>>> wouldn't want unnecessary bcast packets being sent around. :)
>>> There are setups with tens of bond devices and propagating that to all would be
>>> very expensive, but most of all unnecessary. It would also hurt setups with
>>> a lot of vlan devices on the bridge. There are setups with hundreds of vlans
>>> and hundreds of macvlans on top, propagating it up would send it to all of
>>> them and that wouldn't scale at all, these mostly have IP addresses too.
> 
> Ack.
> 
>>> Perhaps we can enable propagation on a per-port or per-bridge basis, then we
>>> can avoid these walks. That is, make it opt-in.
> 
> Maybe opt-out? But assuming the event is only generated on
> active/backup switch over - when would it be okay to ignore
> the notification?
> 

Let me just clarify, so I'm sure I've not misunderstood you. Do you mean opt-out as in
make it default on? IMO that would be a problem, large scale setups would suddenly
start propagating it to upper devices which would cause a lot of unnecessary bcast.
I meant enable it only if needed, and only on specific ports (second part is not
necessary, could be global, I think it's ok either way). I don't think any setup
which has many upper vlans/macvlans would ever enable this.

>>>>> It also seems difficult to avoid re-bouncing the notifier.  
>>>>
>>>> syzbot will make short work of this patch, I think the potential
>>>> for infinite loops has to be addressed somehow. IIUC this is the 
>>>> first instance of forwarding those notifiers to a peer rather
>>>> than within a upper <> lower device hierarchy which is a DAG.  
>>
>> My concern was about the Hangbin's alternative proposal to notify all
>> bridge ports. I hope in my porposal I was able to avoid infinite loops.
> 
> Possibly I'm confused as to where the notification for bridge master
> gets sent..

IIUC it bypasses the bridge and sends a notify peers for the veth peer so it would
generate a grat arp (inetdev_event -> NETDEV_NOTIFY_PEERS).
Jakub Kicinski March 30, 2022, 5:12 p.m. UTC | #7
On Wed, 30 Mar 2022 19:16:42 +0300 Nikolay Aleksandrov wrote:
> > Maybe opt-out? But assuming the event is only generated on
> > active/backup switch over - when would it be okay to ignore
> > the notification?
> 
> Let me just clarify, so I'm sure I've not misunderstood you. Do you mean opt-out as in
> make it default on? IMO that would be a problem, large scale setups would suddenly
> start propagating it to upper devices which would cause a lot of unnecessary bcast.
> I meant enable it only if needed, and only on specific ports (second part is not
> necessary, could be global, I think it's ok either way). I don't think any setup
> which has many upper vlans/macvlans would ever enable this.

That may be. I don't have a good understanding of scenarios in which
GARP is required and where it's not :) Goes without saying but the
default should follow the more common scenario.

> >> My concern was about the Hangbin's alternative proposal to notify all
> >> bridge ports. I hope in my porposal I was able to avoid infinite loops.  
> > 
> > Possibly I'm confused as to where the notification for bridge master
> > gets sent..  
> 
> IIUC it bypasses the bridge and sends a notify peers for the veth peer so it would
> generate a grat arp (inetdev_event -> NETDEV_NOTIFY_PEERS).

Ack, I was basically repeating the question of where does 
the notification with dev == br get generated.

There is a protection in this patch to make sure the other 
end of the veth is not plugged into a bridge (i.e. is not
a bridge port) but there can be a macvlan on top of that
veth that is part of a bridge, so IIUC that check is either
insufficient or unnecessary.
Jay Vosburgh March 30, 2022, 7:15 p.m. UTC | #8
Jakub Kicinski <kuba@kernel.org> wrote:

>On Wed, 30 Mar 2022 19:16:42 +0300 Nikolay Aleksandrov wrote:
>> > Maybe opt-out? But assuming the event is only generated on
>> > active/backup switch over - when would it be okay to ignore
>> > the notification?
>> 
>> Let me just clarify, so I'm sure I've not misunderstood you. Do you mean opt-out as in
>> make it default on? IMO that would be a problem, large scale setups would suddenly
>> start propagating it to upper devices which would cause a lot of unnecessary bcast.
>> I meant enable it only if needed, and only on specific ports (second part is not
>> necessary, could be global, I think it's ok either way). I don't think any setup
>> which has many upper vlans/macvlans would ever enable this.
>
>That may be. I don't have a good understanding of scenarios in which
>GARP is required and where it's not :) Goes without saying but the
>default should follow the more common scenario.

	At least from the bonding failover persective, the GARP is
needed when there's a visible topology change (so peers learn the new
path), a change in MAC address, or both.  I don't think it's possible to
determine from bonding which topology changes are visible, so any
failover gets a GARP.  The original intent as best I recall was to cover
IP addresses configured on the bond itself or on VLANs above the bond.

	If I understand the original problem description correctly, the
bonding failover causes the connectivity issue because the network
segments beyond the bond interfaces don't share forwarding information
(i.e., they are completely independent).  The peer (end station or
switch) at the far end of those network segments (where they converge)
is unable to directly see that the "to bond eth0" port went down, and
has no way to know that anything is awry, and thus won't find the new
path until an ARP or forwarding entry for "veth_a2" (from the original
diagram) times out at the peer out in the network.

>> >> My concern was about the Hangbin's alternative proposal to notify all
>> >> bridge ports. I hope in my porposal I was able to avoid infinite loops.  
>> > 
>> > Possibly I'm confused as to where the notification for bridge master
>> > gets sent..  
>> 
>> IIUC it bypasses the bridge and sends a notify peers for the veth peer so it would
>> generate a grat arp (inetdev_event -> NETDEV_NOTIFY_PEERS).
>
>Ack, I was basically repeating the question of where does 
>the notification with dev == br get generated.
>
>There is a protection in this patch to make sure the other 
>end of the veth is not plugged into a bridge (i.e. is not
>a bridge port) but there can be a macvlan on top of that
>veth that is part of a bridge, so IIUC that check is either
>insufficient or unnecessary.

	I'm a bit concerned this is becoming a interface plumbing
topology change whack-a-mole.

	In the above, what if the veth is plugged into a bridge, and
there's a end station on that bridge?  If it's bridges all the way down,
where does the need for some kind of TCN mechanism stop?

	Or instead of a veth it's an physical network hop (perhaps a
tunnel; something through which notifiers do not propagate) to another
host with another bridge, then what?

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
Alexandra Winter March 31, 2022, 9:59 a.m. UTC | #9
On Tue, 29 Mar 2022 13:40:52 +0200 Alexandra Winter wrote:
> Bonding drivers generate specific events during failover that trigger
> switch updates.  When a veth device is attached to a bridge with a
> bond interface, we want external switches to learn about the veth
> devices as well.
>
> Example:
>
> 	| veth_a2   |  veth_b2  |  veth_c2 |
> 	------o-----------o----------o------
> 	       \	  |	    /
> 		o	  o	   o
> 	      veth_a1  veth_b1  veth_c1
> 	      -------------------------
> 	      |        bridge         |
> 	      -------------------------
> 			bond0
> 			/  \
> 		     eth0  eth1
>
> In case of failover from eth0 to eth1, the netdev_notifier needs to be
> propagated, so e.g. veth_a2 can re-announce its MAC address to the
> external hardware attached to eth1.

On 30.03.22 21:15, Jay Vosburgh wrote:
> Jakub Kicinski <kuba@kernel.org> wrote:
> 
>> On Wed, 30 Mar 2022 19:16:42 +0300 Nikolay Aleksandrov wrote:
>>>> Maybe opt-out? But assuming the event is only generated on
>>>> active/backup switch over - when would it be okay to ignore
>>>> the notification?
>>>
>>> Let me just clarify, so I'm sure I've not misunderstood you. Do you mean opt-out as in
>>> make it default on? IMO that would be a problem, large scale setups would suddenly
>>> start propagating it to upper devices which would cause a lot of unnecessary bcast.
>>> I meant enable it only if needed, and only on specific ports (second part is not
>>> necessary, could be global, I think it's ok either way). I don't think any setup
>>> which has many upper vlans/macvlans would ever enable this.
>>
>> That may be. I don't have a good understanding of scenarios in which
>> GARP is required and where it's not :) Goes without saying but the
>> default should follow the more common scenario.
> 
> 	At least from the bonding failover persective, the GARP is
> needed when there's a visible topology change (so peers learn the new
> path), a change in MAC address, or both.  I don't think it's possible to
> determine from bonding which topology changes are visible, so any
> failover gets a GARP.  The original intent as best I recall was to cover
> IP addresses configured on the bond itself or on VLANs above the bond.
> 
> 	If I understand the original problem description correctly, the
> bonding failover causes the connectivity issue because the network
> segments beyond the bond interfaces don't share forwarding information
> (i.e., they are completely independent).  The peer (end station or
> switch) at the far end of those network segments (where they converge)
> is unable to directly see that the "to bond eth0" port went down, and
> has no way to know that anything is awry, and thus won't find the new
> path until an ARP or forwarding entry for "veth_a2" (from the original
> diagram) times out at the peer out in the network.
> 
>>>>> My concern was about the Hangbin's alternative proposal to notify all
>>>>> bridge ports. I hope in my porposal I was able to avoid infinite loops.  
>>>>
>>>> Possibly I'm confused as to where the notification for bridge master
>>>> gets sent..  
>>>
>>> IIUC it bypasses the bridge and sends a notify peers for the veth peer so it would
>>> generate a grat arp (inetdev_event -> NETDEV_NOTIFY_PEERS).
>>
>> Ack, I was basically repeating the question of where does 
>> the notification with dev == br get generated.
>>
>> There is a protection in this patch to make sure the other 
>> end of the veth is not plugged into a bridge (i.e. is not
>> a bridge port) but there can be a macvlan on top of that
>> veth that is part of a bridge, so IIUC that check is either
>> insufficient or unnecessary.
> 
> 	I'm a bit concerned this is becoming a interface plumbing
> topology change whack-a-mole.
> 
> 	In the above, what if the veth is plugged into a bridge, and
> there's a end station on that bridge?  If it's bridges all the way down,
> where does the need for some kind of TCN mechanism stop?
> 
> 	Or instead of a veth it's an physical network hop (perhaps a
> tunnel; something through which notifiers do not propagate) to another
> host with another bridge, then what?
> 
> 	-J
> 
> ---
> 	-Jay Vosburgh, jay.vosburgh@canonical.com

I see 3 technologies that are used for network virtualization in combination with bond for redundancy
(and I may miss some):
(1) MACVTAP/MACVLAN over bond:
MACVLAN propagates notifiers from bond to endpoints (same as VLAN)
(drivers/net/macvlan.c:
	case NETDEV_NOTIFY_PEERS:
	case NETDEV_BONDING_FAILOVER:
	case NETDEV_RESEND_IGMP:
		/* Propagate to all vlans */
		list_for_each_entry(vlan, &port->vlans, list)
			call_netdevice_notifiers(event, vlan->dev);
	})
(2) OpenVSwitch:
OVS seems to have its own bond implementation, but sends out reverse Arp on active-backup failover
(3) User defined bridge over bond:
propagates notifiers to the bridge device itself, but not to the devices attached to bridge ports.
(net/bridge/br.c:
	case NETDEV_RESEND_IGMP:
		/* Propagate to master device */
		call_netdevice_notifiers(event, br->dev);)

Active-backup may not be the best bonding mode, but it is a simple way to achieve redundancy and I've seen it being used.
I don't see a usecase for MACVLAN over bridge over bond (?)
The external HW network does not need to be updated about the instances that are conencted via tunnel,
so I don't see an issue there.

I had this idea how to solve the failover issue it for veth pairs attached to the user defined bridge.
Does this need to be configurable? How? Per veth pair?

Of course a more general solution how bridge over bond could handle notifications, would be great,
but I'm running out of ideas. So I thought I'd address veth first.
Your help and ideas are highly appreciated, thank you.
Alexandra
Nikolay Aleksandrov March 31, 2022, 10:33 a.m. UTC | #10
On 31/03/2022 12:59, Alexandra Winter wrote:
> On Tue, 29 Mar 2022 13:40:52 +0200 Alexandra Winter wrote:
>> Bonding drivers generate specific events during failover that trigger
>> switch updates.  When a veth device is attached to a bridge with a
>> bond interface, we want external switches to learn about the veth
>> devices as well.
>>
>> Example:
>>
>> 	| veth_a2   |  veth_b2  |  veth_c2 |
>> 	------o-----------o----------o------
>> 	       \	  |	    /
>> 		o	  o	   o
>> 	      veth_a1  veth_b1  veth_c1
>> 	      -------------------------
>> 	      |        bridge         |
>> 	      -------------------------
>> 			bond0
>> 			/  \
>> 		     eth0  eth1
>>
>> In case of failover from eth0 to eth1, the netdev_notifier needs to be
>> propagated, so e.g. veth_a2 can re-announce its MAC address to the
>> external hardware attached to eth1.
> 
> On 30.03.22 21:15, Jay Vosburgh wrote:
>> Jakub Kicinski <kuba@kernel.org> wrote:
>>
>>> On Wed, 30 Mar 2022 19:16:42 +0300 Nikolay Aleksandrov wrote:
>>>>> Maybe opt-out? But assuming the event is only generated on
>>>>> active/backup switch over - when would it be okay to ignore
>>>>> the notification?
>>>>
>>>> Let me just clarify, so I'm sure I've not misunderstood you. Do you mean opt-out as in
>>>> make it default on? IMO that would be a problem, large scale setups would suddenly
>>>> start propagating it to upper devices which would cause a lot of unnecessary bcast.
>>>> I meant enable it only if needed, and only on specific ports (second part is not
>>>> necessary, could be global, I think it's ok either way). I don't think any setup
>>>> which has many upper vlans/macvlans would ever enable this.
>>>
>>> That may be. I don't have a good understanding of scenarios in which
>>> GARP is required and where it's not :) Goes without saying but the
>>> default should follow the more common scenario.
>>
>> 	At least from the bonding failover persective, the GARP is
>> needed when there's a visible topology change (so peers learn the new
>> path), a change in MAC address, or both.  I don't think it's possible to
>> determine from bonding which topology changes are visible, so any
>> failover gets a GARP.  The original intent as best I recall was to cover
>> IP addresses configured on the bond itself or on VLANs above the bond.
>>
>> 	If I understand the original problem description correctly, the
>> bonding failover causes the connectivity issue because the network
>> segments beyond the bond interfaces don't share forwarding information
>> (i.e., they are completely independent).  The peer (end station or
>> switch) at the far end of those network segments (where they converge)
>> is unable to directly see that the "to bond eth0" port went down, and
>> has no way to know that anything is awry, and thus won't find the new
>> path until an ARP or forwarding entry for "veth_a2" (from the original
>> diagram) times out at the peer out in the network.
>>
>>>>>> My concern was about the Hangbin's alternative proposal to notify all
>>>>>> bridge ports. I hope in my porposal I was able to avoid infinite loops.  
>>>>>
>>>>> Possibly I'm confused as to where the notification for bridge master
>>>>> gets sent..  
>>>>
>>>> IIUC it bypasses the bridge and sends a notify peers for the veth peer so it would
>>>> generate a grat arp (inetdev_event -> NETDEV_NOTIFY_PEERS).
>>>
>>> Ack, I was basically repeating the question of where does 
>>> the notification with dev == br get generated.
>>>
>>> There is a protection in this patch to make sure the other 
>>> end of the veth is not plugged into a bridge (i.e. is not
>>> a bridge port) but there can be a macvlan on top of that
>>> veth that is part of a bridge, so IIUC that check is either
>>> insufficient or unnecessary.
>>
>> 	I'm a bit concerned this is becoming a interface plumbing
>> topology change whack-a-mole.
>>
>> 	In the above, what if the veth is plugged into a bridge, and
>> there's a end station on that bridge?  If it's bridges all the way down,
>> where does the need for some kind of TCN mechanism stop?
>>
>> 	Or instead of a veth it's an physical network hop (perhaps a
>> tunnel; something through which notifiers do not propagate) to another
>> host with another bridge, then what?
>>
>> 	-J
>>
>> ---
>> 	-Jay Vosburgh, jay.vosburgh@canonical.com
> 
> I see 3 technologies that are used for network virtualization in combination with bond for redundancy
> (and I may miss some):
> (1) MACVTAP/MACVLAN over bond:
> MACVLAN propagates notifiers from bond to endpoints (same as VLAN)
> (drivers/net/macvlan.c:
> 	case NETDEV_NOTIFY_PEERS:
> 	case NETDEV_BONDING_FAILOVER:
> 	case NETDEV_RESEND_IGMP:
> 		/* Propagate to all vlans */
> 		list_for_each_entry(vlan, &port->vlans, list)
> 			call_netdevice_notifiers(event, vlan->dev);
> 	})
> (2) OpenVSwitch:
> OVS seems to have its own bond implementation, but sends out reverse Arp on active-backup failover
> (3) User defined bridge over bond:
> propagates notifiers to the bridge device itself, but not to the devices attached to bridge ports.
> (net/bridge/br.c:
> 	case NETDEV_RESEND_IGMP:
> 		/* Propagate to master device */
> 		call_netdevice_notifiers(event, br->dev);)
> 
> Active-backup may not be the best bonding mode, but it is a simple way to achieve redundancy and I've seen it being used.
> I don't see a usecase for MACVLAN over bridge over bond (?)

If you're talking about this particular case (network virtualization) - sure. But macvlans over bridges
are heavily used in Cumulus Linux and large scale setups. For example VRRP is implemented using macvlan
devices. Any notification that propagates to the bridge and reaches these would cause a storm of broadcasts
being sent down which would not scale and is extremely undesirable in general.

> The external HW network does not need to be updated about the instances that are conencted via tunnel,
> so I don't see an issue there.
> 
> I had this idea how to solve the failover issue it for veth pairs attached to the user defined bridge.
> Does this need to be configurable? How? Per veth pair?

That is not what I meant (if you were referring to my comment), I meant if it gets implemented in the
bridge and it starts propagating the notify peers notifier - that _must_ be configurable.

> 
> Of course a more general solution how bridge over bond could handle notifications, would be great,
> but I'm running out of ideas. So I thought I'd address veth first.
> Your help and ideas are highly appreciated, thank you.

I'm curious why it must be done in the kernel altogether? This can obviously be solved in user-space
by sending grat arps towards flapped por for fdbs on other ports (e.g. veths) based on a netlink notification.
In fact based on your description propagating NETDEV_NOTIFY_PEERS to bridge ports wouldn't help
because in that case the remote peer veth will not generate a grat arp. The notification will
get propagated only to local veth (bridge port), or the bridge itself depending on implementation.

So from bridge perspective, if you decide to pursue a kernel solution, I think you'll need
a new bridge port option which acts on NOTIFY_PEERS and generates a grat arp for all fdbs
on the port where it is enabled to the port which generated the NOTIFY_PEERS. Note that is
also fragile as I'm sure some stacked device config would not work, so I want to re-iterate
how much easier it is to solve it in user-space which has better visibility and you can
change much faster to accommodate new use cases.

To illustrate: bond
                   \ 
                    bridge
                   /
               veth0
                 |
               veth1

When bond generates NOTIFY_PEERS, and you have this new option enabled on veth0 then
the bridge should generate grat arps for all fdbs on veth0 towards bond so the new
path would learn them. Note that is very dangerous as veth1 can generate thousands
of fdbs and you can potentially DDoS the whole network, so again I'd advise to do
this in user-space where you can better control it.

W.r.t to this patch, I think it will also work and will cause a single grat arp which
is ok. Just need to make sure loops are not possible, for example I think you can loop
your implementation by the following config (untested theory):
bond
    \
     bridge 
   \          \
veth2.10       veth0 - veth1
      \                \
       \                veth1.10 (vlan)
        \                \
         \                bridge2
          \              /
           veth2 - veth3


1. bond generates NOTIFY_PEERS
2. bridge propagates to veth1 (through veth0 port)
3. veth1 propagates to its vlan (veth1.10)
4. bridge2 sees veth1.10 NOTIFY_PEERS and propagates to veth2 (through veth3 port)
5. veth2 propagates to its vlan (veth2.10)
6. veth2.10 propagates it back to bridge
<loop>

I'm sure similar setup, and maybe even simpler, can be constructed with other devices
which can propagate or generate NOTIFY_PEERS.

Cheers,
 Nik
Alexandra Winter March 31, 2022, 12:07 p.m. UTC | #11
On 31.03.22 12:33, Nikolay Aleksandrov wrote:
> On 31/03/2022 12:59, Alexandra Winter wrote:
>> On Tue, 29 Mar 2022 13:40:52 +0200 Alexandra Winter wrote:
>>> Bonding drivers generate specific events during failover that trigger
>>> switch updates.  When a veth device is attached to a bridge with a
>>> bond interface, we want external switches to learn about the veth
>>> devices as well.
>>>
>>> Example:
>>>
>>> 	| veth_a2   |  veth_b2  |  veth_c2 |
>>> 	------o-----------o----------o------
>>> 	       \	  |	    /
>>> 		o	  o	   o
>>> 	      veth_a1  veth_b1  veth_c1
>>> 	      -------------------------
>>> 	      |        bridge         |
>>> 	      -------------------------
>>> 			bond0
>>> 			/  \
>>> 		     eth0  eth1
>>>
>>> In case of failover from eth0 to eth1, the netdev_notifier needs to be
>>> propagated, so e.g. veth_a2 can re-announce its MAC address to the
>>> external hardware attached to eth1.
>>
>> On 30.03.22 21:15, Jay Vosburgh wrote:
>>> Jakub Kicinski <kuba@kernel.org> wrote:
>>>
>>>> On Wed, 30 Mar 2022 19:16:42 +0300 Nikolay Aleksandrov wrote:
>>>>>> Maybe opt-out? But assuming the event is only generated on
>>>>>> active/backup switch over - when would it be okay to ignore
>>>>>> the notification?
>>>>>
>>>>> Let me just clarify, so I'm sure I've not misunderstood you. Do you mean opt-out as in
>>>>> make it default on? IMO that would be a problem, large scale setups would suddenly
>>>>> start propagating it to upper devices which would cause a lot of unnecessary bcast.
>>>>> I meant enable it only if needed, and only on specific ports (second part is not
>>>>> necessary, could be global, I think it's ok either way). I don't think any setup
>>>>> which has many upper vlans/macvlans would ever enable this.
>>>>
>>>> That may be. I don't have a good understanding of scenarios in which
>>>> GARP is required and where it's not :) Goes without saying but the
>>>> default should follow the more common scenario.
>>>
>>> 	At least from the bonding failover persective, the GARP is
>>> needed when there's a visible topology change (so peers learn the new
>>> path), a change in MAC address, or both.  I don't think it's possible to
>>> determine from bonding which topology changes are visible, so any
>>> failover gets a GARP.  The original intent as best I recall was to cover
>>> IP addresses configured on the bond itself or on VLANs above the bond.
>>>
>>> 	If I understand the original problem description correctly, the
>>> bonding failover causes the connectivity issue because the network
>>> segments beyond the bond interfaces don't share forwarding information
>>> (i.e., they are completely independent).  The peer (end station or
>>> switch) at the far end of those network segments (where they converge)
>>> is unable to directly see that the "to bond eth0" port went down, and
>>> has no way to know that anything is awry, and thus won't find the new
>>> path until an ARP or forwarding entry for "veth_a2" (from the original
>>> diagram) times out at the peer out in the network.
>>>
>>>>>>> My concern was about the Hangbin's alternative proposal to notify all
>>>>>>> bridge ports. I hope in my porposal I was able to avoid infinite loops.  
>>>>>>
>>>>>> Possibly I'm confused as to where the notification for bridge master
>>>>>> gets sent..  
>>>>>
>>>>> IIUC it bypasses the bridge and sends a notify peers for the veth peer so it would
>>>>> generate a grat arp (inetdev_event -> NETDEV_NOTIFY_PEERS).
>>>>
>>>> Ack, I was basically repeating the question of where does 
>>>> the notification with dev == br get generated.
>>>>
>>>> There is a protection in this patch to make sure the other 
>>>> end of the veth is not plugged into a bridge (i.e. is not
>>>> a bridge port) but there can be a macvlan on top of that
>>>> veth that is part of a bridge, so IIUC that check is either
>>>> insufficient or unnecessary.
>>>
>>> 	I'm a bit concerned this is becoming a interface plumbing
>>> topology change whack-a-mole.
>>>
>>> 	In the above, what if the veth is plugged into a bridge, and
>>> there's a end station on that bridge?  If it's bridges all the way down,
>>> where does the need for some kind of TCN mechanism stop?
>>>
>>> 	Or instead of a veth it's an physical network hop (perhaps a
>>> tunnel; something through which notifiers do not propagate) to another
>>> host with another bridge, then what?
>>>
>>> 	-J
>>>
>>> ---
>>> 	-Jay Vosburgh, jay.vosburgh@canonical.com
>>
>> I see 3 technologies that are used for network virtualization in combination with bond for redundancy
>> (and I may miss some):
>> (1) MACVTAP/MACVLAN over bond:
>> MACVLAN propagates notifiers from bond to endpoints (same as VLAN)
>> (drivers/net/macvlan.c:
>> 	case NETDEV_NOTIFY_PEERS:
>> 	case NETDEV_BONDING_FAILOVER:
>> 	case NETDEV_RESEND_IGMP:
>> 		/* Propagate to all vlans */
>> 		list_for_each_entry(vlan, &port->vlans, list)
>> 			call_netdevice_notifiers(event, vlan->dev);
>> 	})
>> (2) OpenVSwitch:
>> OVS seems to have its own bond implementation, but sends out reverse Arp on active-backup failover
>> (3) User defined bridge over bond:
>> propagates notifiers to the bridge device itself, but not to the devices attached to bridge ports.
>> (net/bridge/br.c:
>> 	case NETDEV_RESEND_IGMP:
>> 		/* Propagate to master device */
>> 		call_netdevice_notifiers(event, br->dev);)
>>
>> Active-backup may not be the best bonding mode, but it is a simple way to achieve redundancy and I've seen it being used.
>> I don't see a usecase for MACVLAN over bridge over bond (?)
> 
> If you're talking about this particular case (network virtualization) - sure. But macvlans over bridges
> are heavily used in Cumulus Linux and large scale setups. For example VRRP is implemented using macvlan
> devices. Any notification that propagates to the bridge and reaches these would cause a storm of broadcasts
> being sent down which would not scale and is extremely undesirable in general.
> 
>> The external HW network does not need to be updated about the instances that are conencted via tunnel,
>> so I don't see an issue there.
>>
>> I had this idea how to solve the failover issue it for veth pairs attached to the user defined bridge.
>> Does this need to be configurable? How? Per veth pair?
> 
> That is not what I meant (if you were referring to my comment), I meant if it gets implemented in the
> bridge and it starts propagating the notify peers notifier - that _must_ be configurable.
> 
>>
>> Of course a more general solution how bridge over bond could handle notifications, would be great,
>> but I'm running out of ideas. So I thought I'd address veth first.
>> Your help and ideas are highly appreciated, thank you.
> 
> I'm curious why it must be done in the kernel altogether? This can obviously be solved in user-space
> by sending grat arps towards flapped por for fdbs on other ports (e.g. veths) based on a netlink notification.
> In fact based on your description propagating NETDEV_NOTIFY_PEERS to bridge ports wouldn't help
> because in that case the remote peer veth will not generate a grat arp. The notification will
> get propagated only to local veth (bridge port), or the bridge itself depending on implementation.
> 
> So from bridge perspective, if you decide to pursue a kernel solution, I think you'll need
> a new bridge port option which acts on NOTIFY_PEERS and generates a grat arp for all fdbs
> on the port where it is enabled to the port which generated the NOTIFY_PEERS. Note that is
> also fragile as I'm sure some stacked device config would not work, so I want to re-iterate
> how much easier it is to solve it in user-space which has better visibility and you can
> change much faster to accommodate new use cases.
> 
> To illustrate: bond
>                    \ 
>                     bridge
>                    /
>                veth0
>                  |
>                veth1
> 
> When bond generates NOTIFY_PEERS, and you have this new option enabled on veth0 then
> the bridge should generate grat arps for all fdbs on veth0 towards bond so the new
> path would learn them. Note that is very dangerous as veth1 can generate thousands
> of fdbs and you can potentially DDoS the whole network, so again I'd advise to do
> this in user-space where you can better control it.
> 
> W.r.t to this patch, I think it will also work and will cause a single grat arp which
> is ok. Just need to make sure loops are not possible, for example I think you can loop
> your implementation by the following config (untested theory):
> bond
>     \
>      bridge 
>    \          \
> veth2.10       veth0 - veth1
>       \                \
>        \                veth1.10 (vlan)
>         \                \
>          \                bridge2
>           \              /
>            veth2 - veth3
> 
> 
> 1. bond generates NOTIFY_PEERS
> 2. bridge propagates to veth1 (through veth0 port)
> 3. veth1 propagates to its vlan (veth1.10)
> 4. bridge2 sees veth1.10 NOTIFY_PEERS and propagates to veth2 (through veth3 port)
> 5. veth2 propagates to its vlan (veth2.10)
> 6. veth2.10 propagates it back to bridge
> <loop>
> 
> I'm sure similar setup, and maybe even simpler, can be constructed with other devices
> which can propagate or generate NOTIFY_PEERS.
> 
> Cheers,
>  Nik
> 
Thank you very much Nik for your advice and your thourough explanations.

I think I could prevent the loop you describe above. But I agree that propagating
notifications through veth is more risky, because there is no upper-lower relationship.
Is there interest in a v3 of my patch, where I would also incorporate Jakub's comments?

Otherwise I would next explore a user-space solution like Nik proposed.

Thank you all very much
Alexandra
diff mbox series

Patch

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index d29fb9759cc9..74b074453197 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1579,6 +1579,57 @@  static void veth_setup(struct net_device *dev)
 	dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
 }
 
+static bool netif_is_veth(const struct net_device *dev)
+{
+	return (dev->netdev_ops == &veth_netdev_ops);
+}
+
+static void veth_notify_peer(unsigned long event, const struct net_device *dev)
+{
+	struct net_device *peer;
+	struct veth_priv *priv;
+
+	priv = netdev_priv(dev);
+	peer = rtnl_dereference(priv->peer);
+	/* avoid re-bounce between 2 bridges */
+	if (!netif_is_bridge_port(peer))
+		call_netdevice_notifiers(event, peer);
+}
+
+/* Called under rtnl_lock */
+static int veth_device_event(struct notifier_block *unused,
+			     unsigned long event, void *ptr)
+{
+	struct net_device *dev, *lower;
+	struct list_head *iter;
+
+	dev = netdev_notifier_info_to_dev(ptr);
+
+	switch (event) {
+	case NETDEV_NOTIFY_PEERS:
+	case NETDEV_BONDING_FAILOVER:
+	case NETDEV_RESEND_IGMP:
+		/* propagate to peer of a bridge attached veth */
+		if (netif_is_bridge_master(dev)) {
+			iter = &dev->adj_list.lower;
+			lower = netdev_next_lower_dev_rcu(dev, &iter);
+			while (lower) {
+				if (netif_is_veth(lower))
+					veth_notify_peer(event, lower);
+				lower = netdev_next_lower_dev_rcu(dev, &iter);
+			}
+		}
+		break;
+	default:
+		break;
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block veth_notifier_block __read_mostly = {
+		.notifier_call  = veth_device_event,
+};
+
 /*
  * netlink interface
  */
@@ -1824,12 +1875,14 @@  static struct rtnl_link_ops veth_link_ops = {
 
 static __init int veth_init(void)
 {
+	register_netdevice_notifier(&veth_notifier_block);
 	return rtnl_link_register(&veth_link_ops);
 }
 
 static __exit void veth_exit(void)
 {
 	rtnl_link_unregister(&veth_link_ops);
+	unregister_netdevice_notifier(&veth_notifier_block);
 }
 
 module_init(veth_init);