[net-next,2/9] net: bridge: offload initial and final port flags through switchdev

Message ID 20210207232141.2142678-3-olteanv@gmail.com (mailing list archive)
State New, archived
Series Cleanup in brport flags switchdev offload for DSA

Commit Message

Vladimir Oltean Feb. 7, 2021, 11:21 p.m. UTC
From: Vladimir Oltean <vladimir.oltean@nxp.com>

It must first be admitted that switchdev device drivers have a life
beyond the bridge, and when they aren't offloading the bridge driver
they are operating with forwarding disabled between ports, emulating as
closely as possible N standalone network interfaces.

Now it must be said that for a switchdev port operating in standalone
mode, address learning doesn't make much sense since that is a bridge
function. In fact, address learning even breaks setups such as this one:

   +---------------------------------------------+
   |                                             |
   | +-------------------+                       |
   | |        br0        |    send      receive  |
   | +--------+-+--------+ +--------+ +--------+ |
   | |        | |        | |        | |        | |
   | |  swp0  | |  swp1  | |  swp2  | |  swp3  | |
   | |        | |        | |        | |        | |
   +-+--------+-+--------+-+--------+-+--------+-+
          |         ^           |          ^
          |         |           |          |
          |         +-----------+          |
          |                                |
          +--------------------------------+

because if the ASIC has a single FDB (can offload a single bridge)
then source address learning on swp3 can "steal" the source MAC address
of swp2 from br0's FDB, because frames coming from swp2 will be
learned twice: first on the swp1 ingress port, second on the swp3 ingress
port. So the hardware FDB will become out of sync with the software
bridge, and when swp2 tries to send one more packet towards swp1, the
ASIC will attempt to short-circuit the forwarding path and send it
directly to swp3 (since that's the last port it learned that address on),
which it obviously can't, because swp3 operates in standalone mode.
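
To make the "stealing" concrete, here is a minimal userspace sketch of a
switch with one FDB shared by all ports. Everything in it (struct toy_switch,
toy_learn(), toy_lookup(), the table size) is hypothetical and only models the
behaviour described above; it is not kernel or driver code.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_PORTS 4
#define FDB_SIZE  8

/* Hypothetical model of an ASIC with a single FDB shared by every port. */
struct fdb_entry {
	uint8_t mac[6];
	int port;
	bool used;
};

struct toy_switch {
	struct fdb_entry fdb[FDB_SIZE];
	bool learning[NUM_PORTS];	/* per-port address learning enable */
};

/* Learn (or refresh) a source MAC on the ingress port, if learning is on. */
static void toy_learn(struct toy_switch *sw, const uint8_t *src, int port)
{
	struct fdb_entry *free_slot = NULL;
	int i;

	if (!sw->learning[port])
		return;

	for (i = 0; i < FDB_SIZE; i++) {
		struct fdb_entry *e = &sw->fdb[i];

		if (e->used && !memcmp(e->mac, src, 6)) {
			e->port = port;	/* address "moved": overwrite the port */
			return;
		}
		if (!e->used && !free_slot)
			free_slot = e;
	}

	if (free_slot) {
		memcpy(free_slot->mac, src, 6);
		free_slot->port = port;
		free_slot->used = true;
	}
}

/* Return the port the FDB points to for a MAC, or -1 if it is unknown. */
static int toy_lookup(const struct toy_switch *sw, const uint8_t *mac)
{
	int i;

	for (i = 0; i < FDB_SIZE; i++)
		if (sw->fdb[i].used && !memcmp(sw->fdb[i].mac, mac, 6))
			return sw->fdb[i].port;

	return -1;
}
```

In this model, learning swp2's MAC on port 1 and then on port 3 leaves the
entry pointing at port 3 when learning[3] is true, and at port 1 when
standalone port 3 has learning disabled.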

So switchdev drivers operating in standalone mode should disable address
learning. As a matter of practicality, we can reduce code duplication in
drivers by having the bridge notify through switchdev of the initial and
final brport flags. Then, drivers can simply start up hardcoded for no
address learning (similar to how they already start up hardcoded for no
forwarding), and they only need to listen for
SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS for their job to be basically done, with
no need for special cases when the port joins or leaves the bridge.
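
As a rough illustration of what this could look like from the driver side,
the sketch below models a port that starts with learning off and then only
reacts to flag notifications. struct toy_port, the TOY_BR_* bit values and
both functions are made up for this example; a real driver would hook into
its switchdev attribute handler instead, and the "flooding on" default at
init is an assumption, not something this patch mandates.

```c
#include <stdbool.h>

/* Placeholder bit values, not the kernel's BR_* definitions. */
#define TOY_BR_LEARNING	(1UL << 0)
#define TOY_BR_FLOOD	(1UL << 1)

/* Hypothetical per-port hardware state kept by a driver. */
struct toy_port {
	bool learning;
	bool ucast_flood;
};

/* Probe time: standalone defaults, so join/leave need no special casing. */
static void toy_port_init(struct toy_port *port)
{
	port->learning = false;
	port->ucast_flood = true;	/* assumed default for this sketch */
}

/*
 * Stand-in for a SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS handler: only the
 * bits present in @mask are touched, taking their new value from @flags.
 */
static void toy_port_bridge_flags(struct toy_port *port,
				  unsigned long flags, unsigned long mask)
{
	if (mask & TOY_BR_LEARNING)
		port->learning = !!(flags & TOY_BR_LEARNING);
	if (mask & TOY_BR_FLOOD)
		port->ucast_flood = !!(flags & TOY_BR_FLOOD);
}
```

With this shape, the notification issued from new_nbp() turns learning on and
the one issued from del_nbp() turns it back off, without the driver tracking
bridge membership for that purpose.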

When a port leaves the bridge (and therefore becomes standalone), we
issue a switchdev attribute that apart from disabling address learning,
enables flooding of all kinds. This is also done for pragmatic reasons,
because even though standalone switchdev ports might not need to have
flooding enabled in order to inject traffic with any MAC DA from the
control interface, it certainly doesn't hurt either, and it even makes
more sense than disabling flooding of unknown traffic towards that port.

Note that the implementation is a bit wacky because the switchdev API
for port attributes is very counterproductive. Instead of issuing a
single switchdev notification with a bitwise OR of all flags that we're
modifying, we need to issue 4 individual notifications, one for each bit.
This is because the SWITCHDEV_ATTR_ID_PORT_PRE_BRIDGE_FLAGS notifier
forces you to refuse the entire operation if there's at least one bit
which you can't offload, and that is currently BR_BCAST_FLOOD which
nobody does. So this change would do nothing for no one if we offloaded
all flags at once, but the idea is to offload as much as possible
instead of all or nothing.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/bridge/br_if.c      | 24 +++++++++++++++++++++++-
 net/bridge/br_netlink.c | 16 ++++------------
 net/bridge/br_private.h |  2 ++
 3 files changed, 29 insertions(+), 13 deletions(-)

Comments

Nikolay Aleksandrov Feb. 8, 2021, 11:37 a.m. UTC | #1
On 08/02/2021 01:21, Vladimir Oltean wrote:
> From: Vladimir Oltean <vladimir.oltean@nxp.com>
> 
> It must first be admitted that switchdev device drivers have a life
> beyond the bridge, and when they aren't offloading the bridge driver
> they are operating with forwarding disabled between ports, emulating as
> closely as possible N standalone network interfaces.
> 
> Now it must be said that for a switchdev port operating in standalone
> mode, address learning doesn't make much sense since that is a bridge
> function. In fact, address learning even breaks setups such as this one:
> 
>    +---------------------------------------------+
>    |                                             |
>    | +-------------------+                       |
>    | |        br0        |    send      receive  |
>    | +--------+-+--------+ +--------+ +--------+ |
>    | |        | |        | |        | |        | |
>    | |  swp0  | |  swp1  | |  swp2  | |  swp3  | |
>    | |        | |        | |        | |        | |
>    +-+--------+-+--------+-+--------+-+--------+-+
>           |         ^           |          ^
>           |         |           |          |
>           |         +-----------+          |
>           |                                |
>           +--------------------------------+
> 
> because if the ASIC has a single FDB (can offload a single bridge)
> then source address learning on swp3 can "steal" the source MAC address
> of swp2 from br0's FDB, because learning frames coming from swp2 will be
> done twice: first on the swp1 ingress port, second on the swp3 ingress
> port. So the hardware FDB will become out of sync with the software
> bridge, and when swp2 tries to send one more packet towards swp1, the
> ASIC will attempt to short-circuit the forwarding path and send it
> directly to swp3 (since that's the last port it learned that address on),
> which it obviously can't, because swp3 operates in standalone mode.
> 
> So switchdev drivers operating in standalone mode should disable address
> learning. As a matter of practicality, we can reduce code duplication in
> drivers by having the bridge notify through switchdev of the initial and
> final brport flags. Then, drivers can simply start up hardcoded for no
> address learning (similar to how they already start up hardcoded for no
> forwarding), then they only need to listen for
> SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS and their job is basically done, no
> need for special cases when the port joins or leaves the bridge etc.
> 
> When a port leaves the bridge (and therefore becomes standalone), we
> issue a switchdev attribute that apart from disabling address learning,
> enables flooding of all kinds. This is also done for pragmatic reasons,
> because even though standalone switchdev ports might not need to have
> flooding enabled in order to inject traffic with any MAC DA from the
> control interface, it certainly doesn't hurt either, and it even makes
> more sense than disabling flooding of unknown traffic towards that port.
> 
> Note that the implementation is a bit wacky because the switchdev API
> for port attributes is very counterproductive. Instead of issuing a
> single switchdev notification with a bitwise OR of all flags that we're
> modifying, we need to issue 4 individual notifications, one for each bit.
> This is because the SWITCHDEV_ATTR_ID_PORT_PRE_BRIDGE_FLAGS notifier
> forces you to refuse the entire operation if there's at least one bit
> which you can't offload, and that is currently BR_BCAST_FLOOD which
> nobody does. So this change would do nothing for no one if we offloaded
> all flags at once, but the idea is to offload as much as possible
> instead of all or nothing.
> 
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> ---
>  net/bridge/br_if.c      | 24 +++++++++++++++++++++++-
>  net/bridge/br_netlink.c | 16 ++++------------
>  net/bridge/br_private.h |  2 ++
>  3 files changed, 29 insertions(+), 13 deletions(-)
> 

Hi Vladimir,
I think this patch potentially breaks some use cases. There are a few problems; I'll
start with the more serious one. Before, ports always had a fixed set of flags set
when joining. Now, because of how nbp_flags_change() handles flag setting, some might
not be set, which would immediately change behaviour w.r.t. software forwarding. I'll use
your example of BR_BCAST_FLOOD: a lot of drivers will return an error for it, and any
broadcast towards these ports will be dropped. We have mixed environments with software
ports that sometimes have traffic (e.g. decapped ARP requests) software forwarded, which
will stop working.
The other, lesser issue is with the style below: these separate calls for each flag are
just ugly and look weird, as you've also noted. Since these APIs are internal, can we do better?

Cheers,
 Nik

> diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
> index f7d2f472ae24..8903333654f0 100644
> --- a/net/bridge/br_if.c
> +++ b/net/bridge/br_if.c
> @@ -89,6 +89,21 @@ void br_port_carrier_check(struct net_bridge_port *p, bool *notified)
>  	spin_unlock_bh(&br->lock);
>  }
>  
> +int nbp_flags_change(struct net_bridge_port *p, unsigned long flags,
> +		     unsigned long mask, struct netlink_ext_ack *extack)
> +{
> +	int err;
> +
> +	err = br_switchdev_set_port_flag(p, flags, mask, extack);
> +	if (err)
> +		return err;
> +
> +	p->flags &= ~mask;
> +	p->flags |= flags;
> +
> +	return 0;
> +}
> +
>  static void br_port_set_promisc(struct net_bridge_port *p)
>  {
>  	int err = 0;
> @@ -343,6 +358,10 @@ static void del_nbp(struct net_bridge_port *p)
>  		update_headroom(br, get_max_headroom(br));
>  	netdev_reset_rx_headroom(dev);
>  
> +	nbp_flags_change(p, 0, BR_LEARNING, NULL);
> +	nbp_flags_change(p, BR_FLOOD, BR_FLOOD, NULL);
> +	nbp_flags_change(p, BR_MCAST_FLOOD, BR_MCAST_FLOOD, NULL);
> +	nbp_flags_change(p, BR_BCAST_FLOOD, BR_BCAST_FLOOD, NULL);
>  	nbp_vlan_flush(p);
>  	br_fdb_delete_by_port(br, p, 0, 1);
>  	switchdev_deferred_process();
> @@ -428,7 +447,10 @@ static struct net_bridge_port *new_nbp(struct net_bridge *br,
>  	p->path_cost = port_cost(dev);
>  	p->priority = 0x8000 >> BR_PORT_BITS;
>  	p->port_no = index;
> -	p->flags = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD;
> +	nbp_flags_change(p, BR_LEARNING, BR_LEARNING, NULL);
> +	nbp_flags_change(p, BR_FLOOD, BR_FLOOD, NULL);
> +	nbp_flags_change(p, BR_MCAST_FLOOD, BR_MCAST_FLOOD, NULL);
> +	nbp_flags_change(p, BR_BCAST_FLOOD, BR_BCAST_FLOOD, NULL);
>  	br_init_port(p);
>  	br_set_state(p, BR_STATE_DISABLED);
>  	br_stp_port_timer_init(p);
> diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
> index 02aa95c08b77..ab54d1daa9b4 100644
> --- a/net/bridge/br_netlink.c
> +++ b/net/bridge/br_netlink.c
> @@ -852,28 +852,20 @@ static int br_set_port_state(struct net_bridge_port *p, u8 state)
>  	return 0;
>  }
>  
> -/* Set/clear or port flags based on attribute */
> +/* Set/clear or port flags based on netlink attribute */
>  static int br_set_port_flag(struct net_bridge_port *p, struct nlattr *tb[],
>  			    int attrtype, unsigned long mask,
>  			    struct netlink_ext_ack *extack)
>  {
> -	unsigned long flags;
> -	int err;
> +	unsigned long flags = 0;
>  
>  	if (!tb[attrtype])
>  		return 0;
>  
>  	if (nla_get_u8(tb[attrtype]))
> -		flags = p->flags | mask;
> -	else
> -		flags = p->flags & ~mask;
> -
> -	err = br_switchdev_set_port_flag(p, flags, mask, extack);
> -	if (err)
> -		return err;
> +		flags = mask;
>  
> -	p->flags = flags;
> -	return 0;
> +	return nbp_flags_change(p, flags, mask, extack);
>  }
>  
>  /* Process bridge protocol info on port */
> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
> index a1639d41188b..f064abd86bdf 100644
> --- a/net/bridge/br_private.h
> +++ b/net/bridge/br_private.h
> @@ -749,6 +749,8 @@ netdev_features_t br_features_recompute(struct net_bridge *br,
>  void br_port_flags_change(struct net_bridge_port *port, unsigned long mask);
>  void br_manage_promisc(struct net_bridge *br);
>  int nbp_backup_change(struct net_bridge_port *p, struct net_device *backup_dev);
> +int nbp_flags_change(struct net_bridge_port *p, unsigned long flags,
> +		     unsigned long mask, struct netlink_ext_ack *extack);
>  
>  /* br_input.c */
>  int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb);
>

Vladimir Oltean Feb. 8, 2021, 11:45 a.m. UTC | #2
On Mon, Feb 08, 2021 at 01:37:03PM +0200, Nikolay Aleksandrov wrote:
> Hi Vladimir,
> I think this patch potentially breaks some use cases. There are a few problems, I'll
> start with the more serious one: before the ports would have a set of flags that were
> always set when joining, now due to how nbp_flags_change() handles flag setting some might
> not be set which would immediately change behaviour w.r.t software fwding. I'll use your
> example of BR_BCAST_FLOOD: a lot of drivers will return an error for it and any broadcast
> towards these ports will be dropped, we have mixed environments with software ports that
> sometimes have traffic (e.g. decapped ARP requests) software forwarded which will stop working.

Yes, you're right. The only solution I can think of is to add a "bool ignore_errors"
to nbp_flags_change, set to true from new_nbp and del_nbp, and to false from the
netlink code.
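
A minimal userspace sketch of that idea, assuming a toy driver that refuses
some bits. struct toy_brport, toy_switchdev_set_port_flag() and the TOY_BR_*
values are stand-ins invented for illustration, and the error code is a
placeholder; only the control flow mirrors nbp_flags_change() from the patch.

```c
#include <errno.h>
#include <stdbool.h>

/* Placeholder bit values, not the kernel's BR_* definitions. */
#define TOY_BR_LEARNING		(1UL << 0)
#define TOY_BR_BCAST_FLOOD	(1UL << 1)

struct toy_brport {
	unsigned long flags;		/* software bridge port flags */
	unsigned long hw_unsupported;	/* bits the toy driver refuses */
};

/* Stand-in for br_switchdev_set_port_flag(): fails on unsupported bits. */
static int toy_switchdev_set_port_flag(const struct toy_brport *p,
				       unsigned long mask)
{
	return (mask & p->hw_unsupported) ? -EOPNOTSUPP : 0;
}

/*
 * Same flow as nbp_flags_change(), plus the proposed ignore_errors knob:
 * when set, a driver refusal no longer prevents the software flags from
 * being updated.
 */
static int toy_nbp_flags_change(struct toy_brport *p, unsigned long flags,
				unsigned long mask, bool ignore_errors)
{
	int err;

	err = toy_switchdev_set_port_flag(p, mask);
	if (err && !ignore_errors)
		return err;

	p->flags &= ~mask;
	p->flags |= flags;

	return 0;
}
```

With ignore_errors set (the new_nbp/del_nbp case), BR_BCAST_FLOOD would still
end up in the software flags even if the driver rejects it, so software
forwarding keeps its old defaults; with it clear (the netlink case), the
error is still reported to the caller.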

> The other lesser issue is with the style below, I mean these three calls for each flag are
> just ugly and look weird as you've also noted, since these APIs are internal can we do better?

Doing better would mean allowing nbp_flags_change() to have a bit mask with
potentially more brport flags set, and to call br_switchdev_set_port_flag in
a for_each_set_bit() loop?
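
Roughly, something like the sketch below, written as plain userspace C so it
can be compiled on its own. In the bridge this would presumably use
for_each_set_bit() and br_switchdev_set_port_flag(); here the loop is open
coded, and toy_notify_fn, toy_driver_notify() and the TOY_BR_* values are
hypothetical. The error policy (skip a refused bit, remember the first error)
is an assumption, since it has not been settled in this thread.

```c
#include <errno.h>
#include <limits.h>

/* Placeholder bit values, not the kernel's BR_* definitions. */
#define TOY_BR_LEARNING		(1UL << 0)
#define TOY_BR_FLOOD		(1UL << 1)
#define TOY_BR_MCAST_FLOOD	(1UL << 2)
#define TOY_BR_BCAST_FLOOD	(1UL << 3)

typedef int (*toy_notify_fn)(unsigned long flags, unsigned long mask,
			     void *ctx);

/* Example callback: @ctx points to the set of bits the toy driver refuses. */
static int toy_driver_notify(unsigned long flags, unsigned long mask,
			     void *ctx)
{
	const unsigned long *unsupported = ctx;

	(void)flags;
	return (mask & *unsupported) ? -EOPNOTSUPP : 0;
}

/* One notification per bit set in @mask, so one refusal does not veto all. */
static int toy_flags_change_each_bit(unsigned long *port_flags,
				     unsigned long flags, unsigned long mask,
				     toy_notify_fn notify, void *ctx)
{
	unsigned int bit;
	int err, ret = 0;

	for (bit = 0; bit < sizeof(mask) * CHAR_BIT; bit++) {
		unsigned long single = 1UL << bit;

		if (!(mask & single))
			continue;

		err = notify(flags & single, single, ctx);
		if (err) {
			if (!ret)
				ret = err;	/* remember the first failure */
			continue;
		}

		/* Commit only the bit that was accepted. */
		*port_flags = (*port_flags & ~single) | (flags & single);
	}

	return ret;
}
```

The callers in new_nbp() and del_nbp() would then shrink to a single call each
with a four-bit mask, while a driver that refuses BR_BCAST_FLOOD still gets
the other three bits.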

Nikolay Aleksandrov Feb. 8, 2021, 12:17 p.m. UTC | #3
On 08/02/2021 13:45, Vladimir Oltean wrote:
> On Mon, Feb 08, 2021 at 01:37:03PM +0200, Nikolay Aleksandrov wrote:
>> Hi Vladimir,
>> I think this patch potentially breaks some use cases. There are a few problems, I'll
>> start with the more serious one: before the ports would have a set of flags that were
>> always set when joining, now due to how nbp_flags_change() handles flag setting some might
>> not be set which would immediately change behaviour w.r.t software fwding. I'll use your
>> example of BR_BCAST_FLOOD: a lot of drivers will return an error for it and any broadcast
>> towards these ports will be dropped, we have mixed environments with software ports that
>> sometimes have traffic (e.g. decapped ARP requests) software forwarded which will stop working.
> 
> Yes, you're right. The only solution I can think of is to add a "bool ignore_errors"
> to nbp_flags_change, set to true from new_nbp and del_nbp, and to false from the
> netlink code.
> 

Indeed, I can't think of any better solution right now, but that would make it more or less
equal to the current situation where the flags are just set. You can read/restore them on add/del
of bridge port, but I guess that's what you'd like to avoid. :)
I don't mind adding the add/del_nbp() notifications, but both of them seem redundant with
the port add/del notifications which you can handle in the driver.

>> The other lesser issue is with the style below, I mean these three calls for each flag are
>> just ugly and look weird as you've also noted, since these APIs are internal can we do better?
> 
> Doing better would mean allowing nbp_flags_change() to have a bit mask with
> potentially more brport flags set, and to call br_switchdev_set_port_flag in
> a for_each_set_bit() loop?
> 

Sure, that sounds better for now. I think you've described the ideal case in your
commit message.

Patch

diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index f7d2f472ae24..8903333654f0 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -89,6 +89,21 @@  void br_port_carrier_check(struct net_bridge_port *p, bool *notified)
 	spin_unlock_bh(&br->lock);
 }
 
+int nbp_flags_change(struct net_bridge_port *p, unsigned long flags,
+		     unsigned long mask, struct netlink_ext_ack *extack)
+{
+	int err;
+
+	err = br_switchdev_set_port_flag(p, flags, mask, extack);
+	if (err)
+		return err;
+
+	p->flags &= ~mask;
+	p->flags |= flags;
+
+	return 0;
+}
+
 static void br_port_set_promisc(struct net_bridge_port *p)
 {
 	int err = 0;
@@ -343,6 +358,10 @@  static void del_nbp(struct net_bridge_port *p)
 		update_headroom(br, get_max_headroom(br));
 	netdev_reset_rx_headroom(dev);
 
+	nbp_flags_change(p, 0, BR_LEARNING, NULL);
+	nbp_flags_change(p, BR_FLOOD, BR_FLOOD, NULL);
+	nbp_flags_change(p, BR_MCAST_FLOOD, BR_MCAST_FLOOD, NULL);
+	nbp_flags_change(p, BR_BCAST_FLOOD, BR_BCAST_FLOOD, NULL);
 	nbp_vlan_flush(p);
 	br_fdb_delete_by_port(br, p, 0, 1);
 	switchdev_deferred_process();
@@ -428,7 +447,10 @@  static struct net_bridge_port *new_nbp(struct net_bridge *br,
 	p->path_cost = port_cost(dev);
 	p->priority = 0x8000 >> BR_PORT_BITS;
 	p->port_no = index;
-	p->flags = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD;
+	nbp_flags_change(p, BR_LEARNING, BR_LEARNING, NULL);
+	nbp_flags_change(p, BR_FLOOD, BR_FLOOD, NULL);
+	nbp_flags_change(p, BR_MCAST_FLOOD, BR_MCAST_FLOOD, NULL);
+	nbp_flags_change(p, BR_BCAST_FLOOD, BR_BCAST_FLOOD, NULL);
 	br_init_port(p);
 	br_set_state(p, BR_STATE_DISABLED);
 	br_stp_port_timer_init(p);
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 02aa95c08b77..ab54d1daa9b4 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -852,28 +852,20 @@  static int br_set_port_state(struct net_bridge_port *p, u8 state)
 	return 0;
 }
 
-/* Set/clear or port flags based on attribute */
+/* Set/clear or port flags based on netlink attribute */
 static int br_set_port_flag(struct net_bridge_port *p, struct nlattr *tb[],
 			    int attrtype, unsigned long mask,
 			    struct netlink_ext_ack *extack)
 {
-	unsigned long flags;
-	int err;
+	unsigned long flags = 0;
 
 	if (!tb[attrtype])
 		return 0;
 
 	if (nla_get_u8(tb[attrtype]))
-		flags = p->flags | mask;
-	else
-		flags = p->flags & ~mask;
-
-	err = br_switchdev_set_port_flag(p, flags, mask, extack);
-	if (err)
-		return err;
+		flags = mask;
 
-	p->flags = flags;
-	return 0;
+	return nbp_flags_change(p, flags, mask, extack);
 }
 
 /* Process bridge protocol info on port */
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index a1639d41188b..f064abd86bdf 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -749,6 +749,8 @@  netdev_features_t br_features_recompute(struct net_bridge *br,
 void br_port_flags_change(struct net_bridge_port *port, unsigned long mask);
 void br_manage_promisc(struct net_bridge *br);
 int nbp_backup_change(struct net_bridge_port *p, struct net_device *backup_dev);
+int nbp_flags_change(struct net_bridge_port *p, unsigned long flags,
+		     unsigned long mask, struct netlink_ext_ack *extack);
 
 /* br_input.c */
 int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb);