[RFC,v2,net-next,00/10] Allow forwarding for the software bridge data path to be offloaded to capable devices

Message ID 20210703115705.1034112-1-vladimir.oltean@nxp.com (mailing list archive)

Message

Vladimir Oltean July 3, 2021, 11:56 a.m. UTC
For this series I have taken Tobias' work from here:
https://patchwork.kernel.org/project/netdevbpf/cover/20210426170411.1789186-1-tobias@waldekranz.com/
and made the following changes:
- I collected and integrated (hopefully all of) Nikolay's, Ido's and my
  feedback on the bridge driver changes. Otherwise, the structure of the
  bridge changes is pretty much the same as Tobias left it.
- I basically rewrote the DSA infrastructure for the data plane
  forwarding offload, based on the commonalities with another switch
  driver for which I implemented this feature (not submitted here)
- I adapted mv88e6xxx to use the new infrastructure, hopefully it still
  works but I didn't test that

The data plane of the software bridge can be partially offloaded to
switchdev, in the sense that we can trust the accelerator to:
(a) look up its FDB (which is more or less in sync with the software
    bridge FDB) for selecting the destination ports for a packet
(b) replicate the frame in hardware in case it's a multicast/broadcast,
    instead of the software bridge having to clone it and send the
    clones to each net device one at a time. This reduces the bandwidth
    needed between the CPU and the accelerator, as well as the CPU time
    spent.

The data path forwarding offload is managed per "hardware domain" - a
generalization of the "offload_fwd_mark" concept which is being
introduced in this series. Every packet is delivered only once to each
hardware domain.
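
As a rough illustration of the mechanism (the structure and helper below
are only a sketch, the exact layout is defined by the patches
themselves), the bridge keeps a per-skb record of the hardware domains
that have already been handed the frame:

#include <linux/bitops.h>
#include <linux/types.h>

/* Sketch only - the real state lives in the bridge's skb control block */
struct br_fwd_state {
	unsigned long fwd_hwdoms;	/* hwdoms already handled in hardware */
};

static bool br_hwdom_should_offload(struct br_fwd_state *state, int hwdom)
{
	/* hwdom 0 means "no hardware domain", e.g. a foreign port */
	if (!hwdom)
		return false;

	/* Deliver at most once per hardware domain */
	return !test_and_set_bit(hwdom, &state->fwd_hwdoms);
}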

In addition, Tobias said in the original cover letter:

====================
## Overview

   vlan1   vlan2
       \   /
   .-----------.
   |    br0    |
   '-----------'
   /   /   \   \
swp0 swp1 swp2 eth0
  :   :   :
  (hwdom 1)

Up to this point, switchdevs have been trusted with offloading
forwarding between bridge ports, e.g. forwarding a unicast from swp0
to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This
series extends forward offloading to include some new classes of
traffic:

- Locally originating flows, i.e. packets that ingress on br0 that are
  to be forwarded to one or several of the ports swp{0,1,2}. Notably
  this also includes routed flows, e.g. a packet ingressing swp0 on
  VLAN 1 which is then routed over to VLAN 2 by the CPU and then
  forwarded to swp1 is "locally originating" from br0's point of view.

- Flows originating from "foreign" interfaces, i.e. an interface that
  is not offloaded by a particular switchdev instance. This includes
  ports belonging to other switchdev instances. A typical example
  would be flows from eth0 towards swp{0,1,2}.

The bridge still looks up its FDB/MDB as usual and then notifies the
switchdev driver that a particular skb should be offloaded if it
matches one of the classes above. It does so by using the _accel
version of dev_queue_xmit, supplying its own netdev as the
"subordinate" device. The driver can react to the presence of the
subordinate in its .ndo_select_queue in what ever way it needs to make
sure to forward the skb in much the same way that it would for packets
ingressing on regular ports.
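
Sketched in C, the two halves look roughly like this
(dev_queue_xmit_accel() and .ndo_select_queue are the existing kernel
interfaces; everything prefixed foo_ is made up for illustration):

/* Bridge side (sketch): hand the skb to the port's driver, passing
 * br->dev as the "subordinate" device so the driver can tell an
 * offloaded forward apart from an ordinary transmission on the port.
 */
static void br_fwd_offload_xmit(struct net_bridge *br, struct sk_buff *skb)
{
	dev_queue_xmit_accel(skb, br->dev);
}

/* Driver side (sketch): recognise the bridge subordinate and steer the
 * skb to a TX queue reserved for the forwarding offload.
 */
static u16 foo_select_queue(struct net_device *dev, struct sk_buff *skb,
			    struct net_device *sb_dev)
{
	struct foo_priv *priv = netdev_priv(dev);

	if (sb_dev && netif_is_bridge_master(sb_dev))
		return priv->fwd_offload_queue;	/* illustrative */

	return netdev_pick_tx(dev, skb, NULL);
}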

Hardware domains to which a particular skb has been forwarded are
recorded so that duplicates are avoided.

The main performance benefit is thus seen on multicast flows. Imagine
for example that:

- An IP camera is connected to swp0 (VLAN 1)

- The CPU is acting as a multicast router, routing the group from VLAN
  1 to VLAN 2.

- There are subscribers for the group in question behind both swp1 and
  swp2 (VLAN 2).

With this offloading in place, the bridge need only send a single skb
to the driver, which will send it to the hardware marked in such a way
that the switch will perform the multicast replication according to
the MDB configuration. Naturally, the number of saved skb_clones
increases linearly with the number of subscribed ports.

As an extra benefit, on mv88e6xxx, this also allows the switch to
perform source address learning on these flows, which removes the need
to sync dynamic FDB entries over slow configuration interfaces like
MDIO just to keep flows directed towards the CPU from being flooded as
unknown unicast by the switch.


## RFC

- In general, what do you think about this idea?

- hwdom. What do you think about this terminology? Personally I feel
  that we had too many things called offload_fwd_mark, and that as the
  use of the bridge internal ID (nbp->offload_fwd_mark) expands, it
  might be useful to have a separate term for it.

- .dfwd_{add,del}_station. Am I stretching this abstraction too far,
  and if so do you have any suggestion/preference on how to signal the
  offloading from the bridge down to the switchdev driver?

- The way that flooding is implemented in br_forward.c (lazily cloning
  skbs) means that you have to mark the forwarding as completed very
  early (right after should_deliver in maybe_deliver, see the sketch
  after this list) in order to avoid duplicates. Is there some way to
  move this decision point to a later stage that I am missing?

- BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not
  compatible with multicast-to-unicast being used on a port. Then
  again, I think that this would also be broken for regular switchdev
  bridge offloading as this flag is not offloaded to the switchdev
  port, so there is no way for the driver to refuse it. Any ideas on
  how to handle this?
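
For reference, this is the br_forward.c flow that the question above
refers to, heavily simplified (IGMP accounting and most error handling
dropped):

static struct net_bridge_port *maybe_deliver(struct net_bridge_port *prev,
					     struct net_bridge_port *p,
					     struct sk_buff *skb,
					     bool local_orig)
{
	int err;

	if (!should_deliver(p, skb))
		return prev;

	/* This is the only point where we know that p will receive the
	 * frame but no clone has been created yet, so the decision to
	 * offload the forwarding towards p's hardware domain has to be
	 * recorded here.
	 */

	if (!prev)
		return p;	/* first eligible port: clone lazily, later */

	err = deliver_clone(prev, skb, local_orig);
	if (err)
		return ERR_PTR(err);

	return p;
}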


## mv88e6xxx Specifics

Since we are now only receiving a single skb for both unicast and
multicast flows, we can tag the packets with the FORWARD command
instead of FROM_CPU. The switch(es) will then forward the packet in
accordance with its ATU, VTU, STU, and PVT configuration - just like
for packets ingressing on user ports.

Crucially, FROM_CPU is still used for:

- Ports in standalone mode.

- Flows that are trapped to the CPU and software-forwarded by a
  bridge. Note that these flows match neither of the classes discussed
  in the overview.

- Packets that are sent directly to a port netdev without going
  through the bridge, e.g. lldpd sending out PDUs via an AF_PACKET
  socket.

We thus have a pretty clean separation where the data plane uses
FORWARDs and the control plane uses TO_/FROM_CPU.
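
In tag_dsa.c terms, the decision boils down to something like the
fragment below (DSA_CMD_FORWARD and DSA_CMD_FROM_CPU are the tagger's
existing command codes; the offload check and the virtual source pair
are only sketched here):

	enum dsa_cmd cmd;
	u8 src_dev, src_port;

	if (dsa_skb_is_bridge_fwd_offloaded(skb)) {	/* sketch only */
		/* Data plane: let the switch(es) do the FDB/MDB lookup,
		 * sourcing the frame from the virtual device/port pair
		 * allocated for this bridge (see the PVT notes below).
		 */
		cmd = DSA_CMD_FORWARD;
		src_dev = bridge_virt_dev;		/* sketch only */
		src_port = bridge_virt_port;
	} else {
		/* Control plane: frame is pinned to one specific port */
		cmd = DSA_CMD_FROM_CPU;
		src_dev = dp->ds->index;
		src_port = dp->index;
	}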

The barrier between different bridges is enforced by port based VLANs
on mv88e6xxx, which in essence is a mapping from a source device/port
pair to an allowed set of egress ports. In order to have a FORWARD
frame (which carries a _source_ device/port) correctly mapped by the
PVT, we must use a unique pair for each bridge.

Fortunately, there is typically lots of unused address space in most
switch trees. When was the last time you saw an mv88e6xxx product
using more than 4 chips? Even if you found one with 16 (!) devices,
you would still have room to allocate 16*16 virtual ports to software
bridges.

Therefore, the mv88e6xxx driver will allocate a virtual device/port
pair to each bridge that it offloads. All members of the same bridge
are then configured to allow packets from this virtual port in their
PVTs.
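
In pseudo-C, the driver-side bookkeeping is roughly the following
(mv88e6xxx_pvt_map() is the existing cross-chip PVT helper; the other
names are made up for the sake of the example):

static int foo_bridge_fwd_offload_add(struct mv88e6xxx_chip *chip,
				      int bridge_num)
{
	/* Address the bridge as a device number past the physical
	 * switches in the tree, using port 0 of that virtual device.
	 */
	int virt_dev = foo_num_physical_switches(chip) + bridge_num;
	int virt_port = 0;

	/* Open up this chip's member ports of the bridge to FORWARDs
	 * sourced from (virt_dev, virt_port) by programming the PVT.
	 */
	return mv88e6xxx_pvt_map(chip, virt_dev, virt_port);
}
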
====================

Tobias Waldekranz (5):
  net: dfwd: constrain existing users to macvlan subordinates
  net: bridge: disambiguate offload_fwd_mark
  net: bridge: switchdev: recycle unused hwdoms
  net: bridge: switchdev: allow the data plane forwarding to be
    offloaded
  net: dsa: tag_dsa: offload the bridge forwarding process

Vladimir Oltean (5):
  net: extract helpers for binding a subordinate device to TX queues
  net: allow ndo_select_queue to go beyond dev->num_real_tx_queues
  net: dsa: track the number of switches in a tree
  net: dsa: add support for bridge forwarding offload
  net: dsa: mv88e6xxx: map virtual bridges with forwarding offload in
    the PVT

 drivers/net/dsa/mv88e6xxx/chip.c              | 106 +++++++++++-
 .../net/ethernet/intel/fm10k/fm10k_netdev.c   |   3 +
 drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   3 +
 include/linux/if_bridge.h                     |   1 +
 include/linux/netdevice.h                     |  13 +-
 include/net/dsa.h                             |  37 ++++
 net/bridge/br_forward.c                       |  18 +-
 net/bridge/br_if.c                            |   4 +-
 net/bridge/br_private.h                       |  49 +++++-
 net/bridge/br_switchdev.c                     | 163 +++++++++++++++---
 net/bridge/br_vlan.c                          |  10 +-
 net/core/dev.c                                |  31 +++-
 net/dsa/dsa2.c                                |   3 +
 net/dsa/dsa_priv.h                            |  28 +++
 net/dsa/port.c                                |  35 ++++
 net/dsa/slave.c                               | 134 +++++++++++++-
 net/dsa/switch.c                              |  58 +++++++
 net/dsa/tag_dsa.c                             |  60 ++++++-
 19 files changed, 700 insertions(+), 59 deletions(-)

Comments

Tobias Waldekranz July 3, 2021, 10:04 p.m. UTC | #1
On Sat, Jul 03, 2021 at 14:56, Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> For this series I have taken Tobias' work from here:
> https://patchwork.kernel.org/project/netdevbpf/cover/20210426170411.1789186-1-tobias@waldekranz.com/
> and made the following changes:
> - I collected and integrated (hopefully all of) Nikolay's, Ido's and my
>   feedback on the bridge driver changes. Otherwise, the structure of the
>   bridge changes is pretty much the same as Tobias left it.
> - I basically rewrote the DSA infrastructure for the data plane
>   forwarding offload, based on the commonalities with another switch
>   driver for which I implemented this feature (not submitted here)
> - I adapted mv88e6xxx to use the new infrastructure, hopefully it still
>   works but I didn't test that

Hi Vladimir,

Sorry that I have dropped the ball on this series. I have actually had a
v1 of this queued up for a while. Unfortunately I ran into mv88e6xxx
specific problems. (See below)

> The data plane of the software bridge can be partially offloaded to
> switchdev, in the sense that we can trust the accelerator to:
> (a) look up its FDB (which is more or less in sync with the software
>     bridge FDB) for selecting the destination ports for a packet
> (b) replicate the frame in hardware in case it's a multicast/broadcast,
>     instead of the software bridge having to clone it and send the
>     clones to each net device one at a time. This reduces the bandwidth
>     needed between the CPU and the accelerator, as well as the CPU time
>     spent.
>
> The data path forwarding offload is managed per "hardware domain" - a
> generalization of the "offload_fwd_mark" concept which is being
> introduced in this series. Every packet is delivered only once to each
> hardware domain.
>
> In addition, Tobias said in the original cover letter:
>
> ====================
> ## Overview
>
>    vlan1   vlan2
>        \   /
>    .-----------.
>    |    br0    |
>    '-----------'
>    /   /   \   \
> swp0 swp1 swp2 eth0
>   :   :   :
>   (hwdom 1)
>
> Up to this point, switchdevs have been trusted with offloading
> forwarding between bridge ports, e.g. forwarding a unicast from swp0
> to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This
> series extends forward offloading to include some new classes of
> traffic:
>
> - Locally originating flows, i.e. packets that ingress on br0 that are
>   to be forwarded to one or several of the ports swp{0,1,2}. Notably
>   this also includes routed flows, e.g. a packet ingressing swp0 on
>   VLAN 1 which is then routed over to VLAN 2 by the CPU and then
>   forwarded to swp1 is "locally originating" from br0's point of view.
>
> - Flows originating from "foreign" interfaces, i.e. an interface that
>   is not offloaded by a particular switchdev instance. This includes
>   ports belonging to other switchdev instances. A typical example
>   would be flows from eth0 towards swp{0,1,2}.
>
> The bridge still looks up its FDB/MDB as usual and then notifies the
> switchdev driver that a particular skb should be offloaded if it
> matches one of the classes above. It does so by using the _accel
> version of dev_queue_xmit, supplying its own netdev as the
> "subordinate" device. The driver can react to the presence of the
> subordinate in its .ndo_select_queue in what ever way it needs to make
> sure to forward the skb in much the same way that it would for packets
> ingressing on regular ports.
>
> Hardware domains to which a particular skb has been forwarded are
> recorded so that duplicates are avoided.
>
> The main performance benefit is thus seen on multicast flows. Imagine
> for example that:
>
> - An IP camera is connected to swp0 (VLAN 1)
>
> - The CPU is acting as a multicast router, routing the group from VLAN
>   1 to VLAN 2.
>
> - There are subscribers for the group in question behind both swp1 and
>   swp2 (VLAN 2).
>
> With this offloading in place, the bridge need only send a single skb
> to the driver, which will send it to the hardware marked in such a way
> that the switch will perform the multicast replication according to
> the MDB configuration. Naturally, the number of saved skb_clones
> increases linearly with the number of subscribed ports.
>
> As an extra benefit, on mv88e6xxx, this also allows the switch to
> perform source address learning on these flows, which avoids having to
> sync dynamic FDB entries over slow configuration interfaces like MDIO
> to avoid flows directed towards the CPU being flooded as unknown
> unicast by the switch.
>
>
> ## RFC
>
> - In general, what do you think about this idea?
>
> - hwdom. What do you think about this terminology? Personally I feel
>   that we had too many things called offload_fwd_mark, and that as the
>   use of the bridge internal ID (nbp->offload_fwd_mark) expands, it
>   might be useful to have a separate term for it.
>
> - .dfwd_{add,del}_station. Am I stretching this abstraction too far,
>   and if so do you have any suggestion/preference on how to signal the
>   offloading from the bridge down to the switchdev driver?
>
> - The way that flooding is implemented in br_forward.c (lazily cloning
>   skbs) means that you have to mark the forwarding as completed very
>   early (right after should_deliver in maybe_deliver) in order to
>   avoid duplicates. Is there some way to move this decision point to a
>   later stage that I am missing?
>
> - BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not
>   compatible with multicast-to-unicast being used on a port. Then
>   again, I think that this would also be broken for regular switchdev
>   bridge offloading as this flag is not offloaded to the switchdev
>   port, so there is no way for the driver to refuse it. Any ideas on
>   how to handle this?
>
>
> ## mv88e6xxx Specifics
>
> Since we are now only receiving a single skb for both unicast and
> multicast flows, we can tag the packets with the FORWARD command
> instead of FROM_CPU. The switch(es) will then forward the packet in
> accordance with its ATU, VTU, STU, and PVT configuration - just like
> for packets ingressing on user ports.
>
> Crucially, FROM_CPU is still used for:
>
> - Ports in standalone mode.
>
> - Flows that are trapped to the CPU and software-forwarded by a
>   bridge. Note that these flows match neither of the classes discussed
>   in the overview.
>
> - Packets that are sent directly to a port netdev without going
>   through the bridge, e.g. lldpd sending out PDU via an AF_PACKET
>   socket.
>
> We thus have a pretty clean separation where the data plane uses
> FORWARDs and the control plane uses TO_/FROM_CPU.
>
> The barrier between different bridges is enforced by port based VLANs
> on mv88e6xxx, which in essence is a mapping from a source device/port
> pair to an allowed set of egress ports.

Unless I am missing something, it turns out that the PVT is not enough
to support multiple (non-VLAN filtering) bridges in multi-chip
setups. While the isolation barrier works, there is no way of correctly
managing automatic learning.

> In order to have a FORWARD
> frame (which carries a _source_ device/port) correctly mapped by the
> PVT, we must use a unique pair for each bridge.
>
> Fortunately, there is typically lots of unused address space in most
> switch trees. When was the last time you saw an mv88e6xxx product
> using more than 4 chips? Even if you found one with 16 (!) devices,
> you would still have room to allocate 16*16 virtual ports to software
> bridges.
>
> Therefore, the mv88e6xxx driver will allocate a virtual device/port
> pair to each bridge that it offloads. All members of the same bridge
> are then configured to allow packets from this virtual port in their
> PVTs.

So while this solution is cute, it does not work in this example:

 CPU
  | .-----.
.-0-1-. .-0-1-.
| sw0 | | sw1 |
'-2-3-' '-2-3-'

- [sw0p2, sw1p2] are attached to one bridge
- [sw0p3, sw1p3] are attached to another bridge
- Neither bridge uses VLAN filtering

Since no VLAN information is available in the frames, the source addresses
of FORWARDs sent over the DSA link (sw0p1, sw1p0) cannot possibly be
separated into different FIDs. They will all be placed in the respective
port's default FID. Thus, the two bridges are not isolated with respect
to their FDBs.

My current plan is therefore to start by reworking how bridges are
isolated on mv88e6xxx. Roughly by allocating a reserved VID/FID pair for
each non-filtering bridge. Two of these can be easily managed since both
VID 0 and 4095 are illegal on the wire but allowed in the VTU - after
that it gets tricky. The best scheme I have come up with is to just grab
an unused VID when adding any subsequent non-filtering bridge; in the
event that that VID is requested by a filtering bridge or a VLAN upper,
you move the non-filtering bridge to another currently unused VID.
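
Roughly, I imagine the allocation looking something like this (purely
illustrative, there are no real driver structures behind it):

#include <linux/bitmap.h>
#include <linux/errno.h>
#include <linux/if_vlan.h>

struct foo_switch {					/* illustrative */
	DECLARE_BITMAP(used_vids, VLAN_N_VID);		/* VIDs in the VTU */
};

static int foo_reserve_unaware_bridge_vid(struct foo_switch *sw)
{
	unsigned long vid;

	/* VID 0 and 4095 are illegal on the wire but fine in the VTU,
	 * so hand those out first.
	 */
	if (!test_and_set_bit(0, sw->used_vids))
		return 0;
	if (!test_and_set_bit(4095, sw->used_vids))
		return 4095;

	/* Otherwise grab any currently unused VID. If a filtering
	 * bridge or a VLAN upper later claims it, the non-filtering
	 * bridge gets migrated to another free VID.
	 */
	vid = find_next_zero_bit(sw->used_vids, VLAN_N_VID, 1);
	if (vid >= VLAN_N_VID)
		return -ENOSPC;

	set_bit(vid, sw->used_vids);
	return vid;
}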

Does that sound reasonable?

> ====================
>
> Tobias Waldekranz (5):
>   net: dfwd: constrain existing users to macvlan subordinates
>   net: bridge: disambiguate offload_fwd_mark
>   net: bridge: switchdev: recycle unused hwdoms
>   net: bridge: switchdev: allow the data plane forwarding to be
>     offloaded
>   net: dsa: tag_dsa: offload the bridge forwarding process
>
> Vladimir Oltean (5):
>   net: extract helpers for binding a subordinate device to TX queues
>   net: allow ndo_select_queue to go beyond dev->num_real_tx_queues
>   net: dsa: track the number of switches in a tree
>   net: dsa: add support for bridge forwarding offload
>   net: dsa: mv88e6xxx: map virtual bridges with forwarding offload in
>     the PVT
>
>  drivers/net/dsa/mv88e6xxx/chip.c              | 106 +++++++++++-
>  .../net/ethernet/intel/fm10k/fm10k_netdev.c   |   3 +
>  drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   3 +
>  include/linux/if_bridge.h                     |   1 +
>  include/linux/netdevice.h                     |  13 +-
>  include/net/dsa.h                             |  37 ++++
>  net/bridge/br_forward.c                       |  18 +-
>  net/bridge/br_if.c                            |   4 +-
>  net/bridge/br_private.h                       |  49 +++++-
>  net/bridge/br_switchdev.c                     | 163 +++++++++++++++---
>  net/bridge/br_vlan.c                          |  10 +-
>  net/core/dev.c                                |  31 +++-
>  net/dsa/dsa2.c                                |   3 +
>  net/dsa/dsa_priv.h                            |  28 +++
>  net/dsa/port.c                                |  35 ++++
>  net/dsa/slave.c                               | 134 +++++++++++++-
>  net/dsa/switch.c                              |  58 +++++++
>  net/dsa/tag_dsa.c                             |  60 ++++++-
>  19 files changed, 700 insertions(+), 59 deletions(-)
>
> -- 
> 2.25.1
Vladimir Oltean July 4, 2021, 8:11 a.m. UTC | #2
Hi Tobias,

On Sun, Jul 04, 2021 at 12:04:26AM +0200, Tobias Waldekranz wrote:
> On Sat, Jul 03, 2021 at 14:56, Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> > For this series I have taken Tobias' work from here:
> > https://patchwork.kernel.org/project/netdevbpf/cover/20210426170411.1789186-1-tobias@waldekranz.com/
> > and made the following changes:
> > - I collected and integrated (hopefully all of) Nikolay's, Ido's and my
> >   feedback on the bridge driver changes. Otherwise, the structure of the
> >   bridge changes is pretty much the same as Tobias left it.
> > - I basically rewrote the DSA infrastructure for the data plane
> >   forwarding offload, based on the commonalities with another switch
> >   driver for which I implemented this feature (not submitted here)
> > - I adapted mv88e6xxx to use the new infrastructure, hopefully it still
> >   works but I didn't test that
>
> Hi Vladimir,
>
> Sorry that I have dropped the ball on this series. I have actually had a
> v1 of this queued up for a while. Unfortunately I ran into mv88e6xxx
> specific problems. (See below)
>
> > The data plane of the software bridge can be partially offloaded to
> > switchdev, in the sense that we can trust the accelerator to:
> > (a) look up its FDB (which is more or less in sync with the software
> >     bridge FDB) for selecting the destination ports for a packet
> > (b) replicate the frame in hardware in case it's a multicast/broadcast,
> >     instead of the software bridge having to clone it and send the
> >     clones to each net device one at a time. This reduces the bandwidth
> >     needed between the CPU and the accelerator, as well as the CPU time
> >     spent.
> >
> > The data path forwarding offload is managed per "hardware domain" - a
> > generalization of the "offload_fwd_mark" concept which is being
> > introduced in this series. Every packet is delivered only once to each
> > hardware domain.
> >
> > In addition, Tobias said in the original cover letter:
> >
> > ====================
> > ## Overview
> >
> >    vlan1   vlan2
> >        \   /
> >    .-----------.
> >    |    br0    |
> >    '-----------'
> >    /   /   \   \
> > swp0 swp1 swp2 eth0
> >   :   :   :
> >   (hwdom 1)
> >
> > Up to this point, switchdevs have been trusted with offloading
> > forwarding between bridge ports, e.g. forwarding a unicast from swp0
> > to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This
> > series extends forward offloading to include some new classes of
> > traffic:
> >
> > - Locally originating flows, i.e. packets that ingress on br0 that are
> >   to be forwarded to one or several of the ports swp{0,1,2}. Notably
> >   this also includes routed flows, e.g. a packet ingressing swp0 on
> >   VLAN 1 which is then routed over to VLAN 2 by the CPU and then
> >   forwarded to swp1 is "locally originating" from br0's point of view.
> >
> > - Flows originating from "foreign" interfaces, i.e. an interface that
> >   is not offloaded by a particular switchdev instance. This includes
> >   ports belonging to other switchdev instances. A typical example
> >   would be flows from eth0 towards swp{0,1,2}.
> >
> > The bridge still looks up its FDB/MDB as usual and then notifies the
> > switchdev driver that a particular skb should be offloaded if it
> > matches one of the classes above. It does so by using the _accel
> > version of dev_queue_xmit, supplying its own netdev as the
> > "subordinate" device. The driver can react to the presence of the
> > subordinate in its .ndo_select_queue in what ever way it needs to make
> > sure to forward the skb in much the same way that it would for packets
> > ingressing on regular ports.
> >
> > Hardware domains to which a particular skb has been forwarded are
> > recorded so that duplicates are avoided.
> >
> > The main performance benefit is thus seen on multicast flows. Imagine
> > for example that:
> >
> > - An IP camera is connected to swp0 (VLAN 1)
> >
> > - The CPU is acting as a multicast router, routing the group from VLAN
> >   1 to VLAN 2.
> >
> > - There are subscribers for the group in question behind both swp1 and
> >   swp2 (VLAN 2).
> >
> > With this offloading in place, the bridge need only send a single skb
> > to the driver, which will send it to the hardware marked in such a way
> > that the switch will perform the multicast replication according to
> > the MDB configuration. Naturally, the number of saved skb_clones
> > increases linearly with the number of subscribed ports.
> >
> > As an extra benefit, on mv88e6xxx, this also allows the switch to
> > perform source address learning on these flows, which avoids having to
> > sync dynamic FDB entries over slow configuration interfaces like MDIO
> > to avoid flows directed towards the CPU being flooded as unknown
> > unicast by the switch.
> >
> >
> > ## RFC
> >
> > - In general, what do you think about this idea?
> >
> > - hwdom. What do you think about this terminology? Personally I feel
> >   that we had too many things called offload_fwd_mark, and that as the
> >   use of the bridge internal ID (nbp->offload_fwd_mark) expands, it
> >   might be useful to have a separate term for it.
> >
> > - .dfwd_{add,del}_station. Am I stretching this abstraction too far,
> >   and if so do you have any suggestion/preference on how to signal the
> >   offloading from the bridge down to the switchdev driver?
> >
> > - The way that flooding is implemented in br_forward.c (lazily cloning
> >   skbs) means that you have to mark the forwarding as completed very
> >   early (right after should_deliver in maybe_deliver) in order to
> >   avoid duplicates. Is there some way to move this decision point to a
> >   later stage that I am missing?
> >
> > - BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not
> >   compatible with multicast-to-unicast being used on a port. Then
> >   again, I think that this would also be broken for regular switchdev
> >   bridge offloading as this flag is not offloaded to the switchdev
> >   port, so there is no way for the driver to refuse it. Any ideas on
> >   how to handle this?
> >
> >
> > ## mv88e6xxx Specifics
> >
> > Since we are now only receiving a single skb for both unicast and
> > multicast flows, we can tag the packets with the FORWARD command
> > instead of FROM_CPU. The switch(es) will then forward the packet in
> > accordance with its ATU, VTU, STU, and PVT configuration - just like
> > for packets ingressing on user ports.
> >
> > Crucially, FROM_CPU is still used for:
> >
> > - Ports in standalone mode.
> >
> > - Flows that are trapped to the CPU and software-forwarded by a
> >   bridge. Note that these flows match neither of the classes discussed
> >   in the overview.
> >
> > - Packets that are sent directly to a port netdev without going
> >   through the bridge, e.g. lldpd sending out PDU via an AF_PACKET
> >   socket.
> >
> > We thus have a pretty clean separation where the data plane uses
> > FORWARDs and the control plane uses TO_/FROM_CPU.
> >
> > The barrier between different bridges is enforced by port based VLANs
> > on mv88e6xxx, which in essence is a mapping from a source device/port
> > pair to an allowed set of egress ports.
>
> Unless I am missing something, it turns out that the PVT is not enough
> to support multiple (non-VLAN filtering) bridges in multi-chip
> setups. While the isolation barrier works, there is no way of correctly
> managing automatic learning.
>
> > In order to have a FORWARD
> > frame (which carries a _source_ device/port) correctly mapped by the
> > PVT, we must use a unique pair for each bridge.
> >
> > Fortunately, there is typically lots of unused address space in most
> > switch trees. When was the last time you saw an mv88e6xxx product
> > using more than 4 chips? Even if you found one with 16 (!) devices,
> > you would still have room to allocate 16*16 virtual ports to software
> > bridges.
> >
> > Therefore, the mv88e6xxx driver will allocate a virtual device/port
> > pair to each bridge that it offloads. All members of the same bridge
> > are then configured to allow packets from this virtual port in their
> > PVTs.
>
> So while this solution is cute, it does not work in this example:
>
>  CPU
>   | .-----.
> .-0-1-. .-0-1-.
> | sw0 | | sw1 |
> '-2-3-' '-2-3-'
>
> - [sw0p2, sw1p2] are attached to one bridge
> - [sw0p3, sw1p3] are attached to another bridge
> - Neither bridge uses VLAN filtering
>
> Since no VLAN information is available in the frames, the source addresses
> of FORWARDs sent over the DSA link (sw0p1, sw1p0) cannot possibly be
> separated into different FIDs. They will all be placed in the respective
> port's default FID. Thus, the two bridges are not isolated with respect
> to their FDBs.
>
> My current plan is therefore to start by reworking how bridges are
> isolated on mv88e6xxx. Roughly by allocating a reserved VID/FID pair for
> each non-filtering bridge. Two of these can be easily managed since both
> VID 0 and 4095 are illegal on the wire but allowed in the VTU - after
> that it gets tricky. The best scheme I have come up with is to just grab
> an unused VID when adding any subsequent non-filtering bridge; in the
> event that that VID is requested by a filtering bridge or a VLAN upper,
> you move the non-filtering bridge to another currently unused VID.
>
> Does that sound reasonable?

I don't think this patch series makes the problem you are describing any
worse than it already is in mainline, does it?

I mean even with multiple VLAN-unaware bridges spanning the same single
switch chip today, it is still true that you can not have two stations
with the same MAC address, one in one bridge and another in the other
bridge, right?

Do you have an example when this causes issues that need to be addressed
immediately?

I thought the only case where this is a real problem is when you have
multiple CPU ports or multiple DSA links between 2 switches, because
then, if learning is enabled, that same MAC address will bounce between
the 2 ports. For that case, the consensus was that you just can't enable
address learning on those ports, and you let the software manage the FDB
in a way that is compatible with multiple CPU ports / DSA links (install
the MAC DA as a sort of multicast address and let the port forwarding
matrix choose only one of the 2 destinations based on source port).

Lack of FDB partitioning also used to be a problem when the standalone
ports were left to do address learning, but that changed too.

The hardware I am working with simply does not have any way to solve
this either - the FDB is simply not partitionable without VLAN
filtering (we have simple shared VLAN filtering, where the VID is
ignored and the FDB lookup is performed with VID 0, but not anything
more complex). So the simple solution I've been advising for people who
want their MAC addresses to be isolated is to create a single VLAN-aware
bridge and manage the VLAN broadcast domains themselves - that seems to
work and is simple to understand and flexible (note that I am going to
send a patch at some point to prevent the user from partitioning a
sja1105 switch tree into multiple VLAN-aware bridges).

Basically, unless I'm misunderstanding something, I think what you're
proposing makes theoretical sense, but without a use case behind it, it
might just be too much work with no real-life benefit.
Qingfang Deng July 5, 2021, 4:20 a.m. UTC | #3
Hi Vladimir,

On Sat, Jul 03, 2021 at 02:56:55PM +0300, Vladimir Oltean wrote:
> For this series I have taken Tobias' work from here:
> https://patchwork.kernel.org/project/netdevbpf/cover/20210426170411.1789186-1-tobias@waldekranz.com/
> and made the following changes:
> - I collected and integrated (hopefully all of) Nikolay's, Ido's and my
>   feedback on the bridge driver changes. Otherwise, the structure of the
>   bridge changes is pretty much the same as Tobias left it.
> - I basically rewrote the DSA infrastructure for the data plane
>   forwarding offload, based on the commonalities with another switch
>   driver for which I implemented this feature (not submitted here)
> - I adapted mv88e6xxx to use the new infrastructure, hopefully it still
>   works but I didn't test that
> 
> The data plane of the software bridge can be partially offloaded to
> switchdev, in the sense that we can trust the accelerator to:
> (a) look up its FDB (which is more or less in sync with the software
>     bridge FDB) for selecting the destination ports for a packet
> (b) replicate the frame in hardware in case it's a multicast/broadcast,
>     instead of the software bridge having to clone it and send the
>     clones to each net device one at a time. This reduces the bandwidth
>     needed between the CPU and the accelerator, as well as the CPU time
>     spent.

Many DSA taggers use a destination port bit field in their TX tags,
which allows replication in hardware (multiple bits set = send to
multiple ports). I wonder if the tagger API can be updated to support
this.

> 
> The data path forwarding offload is managed per "hardware domain" - a
> generalization of the "offload_fwd_mark" concept which is being
> introduced in this series. Every packet is delivered only once to each
> hardware domain.
> 
> In addition, Tobias said in the original cover letter:
> 
> ====================
> ## Overview
> 
>    vlan1   vlan2
>        \   /
>    .-----------.
>    |    br0    |
>    '-----------'
>    /   /   \   \
> swp0 swp1 swp2 eth0
>   :   :   :
>   (hwdom 1)
> 
> Up to this point, switchdevs have been trusted with offloading
> forwarding between bridge ports, e.g. forwarding a unicast from swp0
> to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This
> series extends forward offloading to include some new classes of
> traffic:
> 
> - Locally originating flows, i.e. packets that ingress on br0 that are
>   to be forwarded to one or several of the ports swp{0,1,2}. Notably
>   this also includes routed flows, e.g. a packet ingressing swp0 on
>   VLAN 1 which is then routed over to VLAN 2 by the CPU and then
>   forwarded to swp1 is "locally originating" from br0's point of view.
> 
> - Flows originating from "foreign" interfaces, i.e. an interface that
>   is not offloaded by a particular switchdev instance. This includes
>   ports belonging to other switchdev instances. A typical example
>   would be flows from eth0 towards swp{0,1,2}.
> 
> The bridge still looks up its FDB/MDB as usual and then notifies the
> switchdev driver that a particular skb should be offloaded if it
> matches one of the classes above. It does so by using the _accel
> version of dev_queue_xmit, supplying its own netdev as the
> "subordinate" device. The driver can react to the presence of the
> subordinate in its .ndo_select_queue in what ever way it needs to make
> sure to forward the skb in much the same way that it would for packets
> ingressing on regular ports.
> 
> Hardware domains to which a particular skb has been forwarded are
> recorded so that duplicates are avoided.
> 
> The main performance benefit is thus seen on multicast flows. Imagine
> for example that:
> 
> - An IP camera is connected to swp0 (VLAN 1)
> 
> - The CPU is acting as a multicast router, routing the group from VLAN
>   1 to VLAN 2.
> 
> - There are subscribers for the group in question behind both swp1 and
>   swp2 (VLAN 2).
> 
> With this offloading in place, the bridge need only send a single skb
> to the driver, which will send it to the hardware marked in such a way
> that the switch will perform the multicast replication according to
> the MDB configuration. Naturally, the number of saved skb_clones
> increases linearly with the number of subscribed ports.
> 
> As an extra benefit, on mv88e6xxx, this also allows the switch to
> perform source address learning on these flows, which avoids having to
> sync dynamic FDB entries over slow configuration interfaces like MDIO
> to avoid flows directed towards the CPU being flooded as unknown
> unicast by the switch.
> 
> 
> ## RFC
> 
> - In general, what do you think about this idea?
> 
> - hwdom. What do you think about this terminology? Personally I feel
>   that we had too many things called offload_fwd_mark, and that as the
>   use of the bridge internal ID (nbp->offload_fwd_mark) expands, it
>   might be useful to have a separate term for it.
> 
> - .dfwd_{add,del}_station. Am I stretching this abstraction too far,
>   and if so do you have any suggestion/preference on how to signal the
>   offloading from the bridge down to the switchdev driver?
> 
> - The way that flooding is implemented in br_forward.c (lazily cloning
>   skbs) means that you have to mark the forwarding as completed very
>   early (right after should_deliver in maybe_deliver) in order to
>   avoid duplicates. Is there some way to move this decision point to a
>   later stage that I am missing?
> 
> - BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not
>   compatible with multicast-to-unicast being used on a port. Then
>   again, I think that this would also be broken for regular switchdev
>   bridge offloading as this flag is not offloaded to the switchdev
>   port, so there is no way for the driver to refuse it. Any ideas on
>   how to handle this?
> 
> 
> ## mv88e6xxx Specifics
> 
> Since we are now only receiving a single skb for both unicast and
> multicast flows, we can tag the packets with the FORWARD command
> instead of FROM_CPU. The switch(es) will then forward the packet in
> accordance with its ATU, VTU, STU, and PVT configuration - just like
> for packets ingressing on user ports.
> 
> Crucially, FROM_CPU is still used for:
> 
> - Ports in standalone mode.
> 
> - Flows that are trapped to the CPU and software-forwarded by a
>   bridge. Note that these flows match neither of the classes discussed
>   in the overview.
> 
> - Packets that are sent directly to a port netdev without going
>   through the bridge, e.g. lldpd sending out PDU via an AF_PACKET
>   socket.
> 
> We thus have a pretty clean separation where the data plane uses
> FORWARDs and the control plane uses TO_/FROM_CPU.
> 
> The barrier between different bridges is enforced by port based VLANs
> on mv88e6xxx, which in essence is a mapping from a source device/port
> pair to an allowed set of egress ports. In order to have a FORWARD
> frame (which carries a _source_ device/port) correctly mapped by the
> PVT, we must use a unique pair for each bridge.
> 
> Fortunately, there is typically lots of unused address space in most
> switch trees. When was the last time you saw an mv88e6xxx product
> using more than 4 chips? Even if you found one with 16 (!) devices,
> you would still have room to allocate 16*16 virtual ports to software
> bridges.
> 
> Therefore, the mv88e6xxx driver will allocate a virtual device/port
> pair to each bridge that it offloads. All members of the same bridge
> are then configured to allow packets from this virtual port in their
> PVTs.
> ====================
> 
> Tobias Waldekranz (5):
>   net: dfwd: constrain existing users to macvlan subordinates
>   net: bridge: disambiguate offload_fwd_mark
>   net: bridge: switchdev: recycle unused hwdoms
>   net: bridge: switchdev: allow the data plane forwarding to be
>     offloaded
>   net: dsa: tag_dsa: offload the bridge forwarding process
> 
> Vladimir Oltean (5):
>   net: extract helpers for binding a subordinate device to TX queues
>   net: allow ndo_select_queue to go beyond dev->num_real_tx_queues
>   net: dsa: track the number of switches in a tree
>   net: dsa: add support for bridge forwarding offload
>   net: dsa: mv88e6xxx: map virtual bridges with forwarding offload in
>     the PVT
> 
>  drivers/net/dsa/mv88e6xxx/chip.c              | 106 +++++++++++-
>  .../net/ethernet/intel/fm10k/fm10k_netdev.c   |   3 +
>  drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   3 +
>  include/linux/if_bridge.h                     |   1 +
>  include/linux/netdevice.h                     |  13 +-
>  include/net/dsa.h                             |  37 ++++
>  net/bridge/br_forward.c                       |  18 +-
>  net/bridge/br_if.c                            |   4 +-
>  net/bridge/br_private.h                       |  49 +++++-
>  net/bridge/br_switchdev.c                     | 163 +++++++++++++++---
>  net/bridge/br_vlan.c                          |  10 +-
>  net/core/dev.c                                |  31 +++-
>  net/dsa/dsa2.c                                |   3 +
>  net/dsa/dsa_priv.h                            |  28 +++
>  net/dsa/port.c                                |  35 ++++
>  net/dsa/slave.c                               | 134 +++++++++++++-
>  net/dsa/switch.c                              |  58 +++++++
>  net/dsa/tag_dsa.c                             |  60 ++++++-
>  19 files changed, 700 insertions(+), 59 deletions(-)
> 
> -- 
> 2.25.1
>
Tobias Waldekranz July 5, 2021, 8:09 a.m. UTC | #4
On Sun, Jul 04, 2021 at 11:11, Vladimir Oltean <olteanv@gmail.com> wrote:
> Hi Tobias,
>
> On Sun, Jul 04, 2021 at 12:04:26AM +0200, Tobias Waldekranz wrote:
>> On Sat, Jul 03, 2021 at 14:56, Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
>> > For this series I have taken Tobias' work from here:
>> > https://patchwork.kernel.org/project/netdevbpf/cover/20210426170411.1789186-1-tobias@waldekranz.com/
>> > and made the following changes:
>> > - I collected and integrated (hopefully all of) Nikolay's, Ido's and my
>> >   feedback on the bridge driver changes. Otherwise, the structure of the
>> >   bridge changes is pretty much the same as Tobias left it.
>> > - I basically rewrote the DSA infrastructure for the data plane
>> >   forwarding offload, based on the commonalities with another switch
>> >   driver for which I implemented this feature (not submitted here)
>> > - I adapted mv88e6xxx to use the new infrastructure, hopefully it still
>> >   works but I didn't test that
>>
>> Hi Vladimir,
>>
>> Sorry that I have dropped the ball on this series. I have actually had a
>> v1 of this queued up for a while. Unfortunately I ran into mv88e6xxx
>> specific problems. (See below)
>>
>> > The data plane of the software bridge can be partially offloaded to
>> > switchdev, in the sense that we can trust the accelerator to:
>> > (a) look up its FDB (which is more or less in sync with the software
>> >     bridge FDB) for selecting the destination ports for a packet
>> > (b) replicate the frame in hardware in case it's a multicast/broadcast,
>> >     instead of the software bridge having to clone it and send the
>> >     clones to each net device one at a time. This reduces the bandwidth
>> >     needed between the CPU and the accelerator, as well as the CPU time
>> >     spent.
>> >
>> > The data path forwarding offload is managed per "hardware domain" - a
>> > generalization of the "offload_fwd_mark" concept which is being
>> > introduced in this series. Every packet is delivered only once to each
>> > hardware domain.
>> >
>> > In addition, Tobias said in the original cover letter:
>> >
>> > ====================
>> > ## Overview
>> >
>> >    vlan1   vlan2
>> >        \   /
>> >    .-----------.
>> >    |    br0    |
>> >    '-----------'
>> >    /   /   \   \
>> > swp0 swp1 swp2 eth0
>> >   :   :   :
>> >   (hwdom 1)
>> >
>> > Up to this point, switchdevs have been trusted with offloading
>> > forwarding between bridge ports, e.g. forwarding a unicast from swp0
>> > to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This
>> > series extends forward offloading to include some new classes of
>> > traffic:
>> >
>> > - Locally originating flows, i.e. packets that ingress on br0 that are
>> >   to be forwarded to one or several of the ports swp{0,1,2}. Notably
>> >   this also includes routed flows, e.g. a packet ingressing swp0 on
>> >   VLAN 1 which is then routed over to VLAN 2 by the CPU and then
>> >   forwarded to swp1 is "locally originating" from br0's point of view.
>> >
>> > - Flows originating from "foreign" interfaces, i.e. an interface that
>> >   is not offloaded by a particular switchdev instance. This includes
>> >   ports belonging to other switchdev instances. A typical example
>> >   would be flows from eth0 towards swp{0,1,2}.
>> >
>> > The bridge still looks up its FDB/MDB as usual and then notifies the
>> > switchdev driver that a particular skb should be offloaded if it
>> > matches one of the classes above. It does so by using the _accel
>> > version of dev_queue_xmit, supplying its own netdev as the
>> > "subordinate" device. The driver can react to the presence of the
>> > subordinate in its .ndo_select_queue in what ever way it needs to make
>> > sure to forward the skb in much the same way that it would for packets
>> > ingressing on regular ports.
>> >
>> > Hardware domains to which a particular skb has been forwarded are
>> > recorded so that duplicates are avoided.
>> >
>> > The main performance benefit is thus seen on multicast flows. Imagine
>> > for example that:
>> >
>> > - An IP camera is connected to swp0 (VLAN 1)
>> >
>> > - The CPU is acting as a multicast router, routing the group from VLAN
>> >   1 to VLAN 2.
>> >
>> > - There are subscribers for the group in question behind both swp1 and
>> >   swp2 (VLAN 2).
>> >
>> > With this offloading in place, the bridge need only send a single skb
>> > to the driver, which will send it to the hardware marked in such a way
>> > that the switch will perform the multicast replication according to
>> > the MDB configuration. Naturally, the number of saved skb_clones
>> > increases linearly with the number of subscribed ports.
>> >
>> > As an extra benefit, on mv88e6xxx, this also allows the switch to
>> > perform source address learning on these flows, which avoids having to
>> > sync dynamic FDB entries over slow configuration interfaces like MDIO
>> > to avoid flows directed towards the CPU being flooded as unknown
>> > unicast by the switch.
>> >
>> >
>> > ## RFC
>> >
>> > - In general, what do you think about this idea?
>> >
>> > - hwdom. What do you think about this terminology? Personally I feel
>> >   that we had too many things called offload_fwd_mark, and that as the
>> >   use of the bridge internal ID (nbp->offload_fwd_mark) expands, it
>> >   might be useful to have a separate term for it.
>> >
>> > - .dfwd_{add,del}_station. Am I stretching this abstraction too far,
>> >   and if so do you have any suggestion/preference on how to signal the
>> >   offloading from the bridge down to the switchdev driver?
>> >
>> > - The way that flooding is implemented in br_forward.c (lazily cloning
>> >   skbs) means that you have to mark the forwarding as completed very
>> >   early (right after should_deliver in maybe_deliver) in order to
>> >   avoid duplicates. Is there some way to move this decision point to a
>> >   later stage that I am missing?
>> >
>> > - BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not
>> >   compatible with multicast-to-unicast being used on a port. Then
>> >   again, I think that this would also be broken for regular switchdev
>> >   bridge offloading as this flag is not offloaded to the switchdev
>> >   port, so there is no way for the driver to refuse it. Any ideas on
>> >   how to handle this?
>> >
>> >
>> > ## mv88e6xxx Specifics
>> >
>> > Since we are now only receiving a single skb for both unicast and
>> > multicast flows, we can tag the packets with the FORWARD command
>> > instead of FROM_CPU. The switch(es) will then forward the packet in
>> > accordance with its ATU, VTU, STU, and PVT configuration - just like
>> > for packets ingressing on user ports.
>> >
>> > Crucially, FROM_CPU is still used for:
>> >
>> > - Ports in standalone mode.
>> >
>> > - Flows that are trapped to the CPU and software-forwarded by a
>> >   bridge. Note that these flows match neither of the classes discussed
>> >   in the overview.
>> >
>> > - Packets that are sent directly to a port netdev without going
>> >   through the bridge, e.g. lldpd sending out PDU via an AF_PACKET
>> >   socket.
>> >
>> > We thus have a pretty clean separation where the data plane uses
>> > FORWARDs and the control plane uses TO_/FROM_CPU.
>> >
>> > The barrier between different bridges is enforced by port based VLANs
>> > on mv88e6xxx, which in essence is a mapping from a source device/port
>> > pair to an allowed set of egress ports.
>>
>> Unless I am missing something, it turns out that the PVT is not enough
>> to support multiple (non-VLAN filtering) bridges in multi-chip
>> setups. While the isolation barrier works, there is no way of correctly
>> managing automatic learning.
>>
>> > In order to have a FORWARD
>> > frame (which carries a _source_ device/port) correctly mapped by the
>> > PVT, we must use a unique pair for each bridge.
>> >
>> > Fortunately, there is typically lots of unused address space in most
>> > switch trees. When was the last time you saw an mv88e6xxx product
>> > using more than 4 chips? Even if you found one with 16 (!) devices,
>> > you would still have room to allocate 16*16 virtual ports to software
>> > bridges.
>> >
>> > Therefore, the mv88e6xxx driver will allocate a virtual device/port
>> > pair to each bridge that it offloads. All members of the same bridge
>> > are then configured to allow packets from this virtual port in their
>> > PVTs.
>>
>> So while this solution is cute, it does not work in this example:
>>
>>  CPU
>>   | .-----.
>> .-0-1-. .-0-1-.
>> | sw0 | | sw1 |
>> '-2-3-' '-2-3-'
>>
>> - [sw0p2, sw1p2] are attached to one bridge
>> - [sw0p3, sw1p3] are attached to another bridge
>> - Neither bridge uses VLAN filtering
>>
>> Since no VLAN information is available in the frames, the source addresses
>> of FORWARDs sent over the DSA link (sw0p1, sw1p0) cannot possibly be
>> separated into different FIDs. They will all be placed in the respective
>> port's default FID. Thus, the two bridges are not isolated with respect
>> to their FDBs.
>>
>> My current plan is therefore to start by reworking how bridges are
>> isolated on mv88e6xxx. Roughly by allocating a reserved VID/FID pair for
>> each non-filtering bridge. Two of these can be easily managed since both
>> VID 0 and 4095 are illegal on the wire but allowed in the VTU - after
>> that it gets tricky. The best scheme I have come up with is to just grab
>> an unused VID when adding any subsequent non-filtering bridge; in the
>> event that that VID is requested by a filtering bridge or a VLAN upper,
>> you move the non-filtering bridge to another currently unused VID.
>>
>> Does that sound reasonable?
>
> I don't think this patch series makes the problem you are describing any
> worse than it already is in mainline, does it?

It does not make it worse, no. But assuming that mv88e6xxx will handle
multi-bridge using the VTU in the future (i.e. my suggestion above),
there is no need for inventing virtual DSA dev/port tuples - we can just
use the physical port info as the source and use the VID to signal the
source bridge. So I am hesitant to merge the mv88e6xxx-specific changes.

> I mean even with multiple VLAN-unaware bridges spanning the same single
> switch chip today, it is still true that you can not have two stations
> with the same MAC address, one in one bridge and another in the other
> bridge, right?

That is correct.

> Do you have an example when this causes issues that need to be addressed
> immediately?
>
> I thought the only case where this is a real problem is when you have
> multiple CPU ports or multiple DSA links between 2 switches, because
> then, if learning is enabled, that same MAC address will bounce between
> the 2 ports. For that case, the consensus was that you just can't enable
> address learning on those ports, and you let the software manage the FDB
> in a way that is compatible with multiple CPU ports / DSA links (install
> the MAC DA as a sort of multicast address and let the port forwarding
> matrix choose only one of the 2 destinations based on source port).
>
> Lack of FDB partitioning also used to be a problem when the standalone
> ports were left to do address learning, but that changed too.

Funny you should mention that. The presence of standalone ports is
actually what first shone a light on this issue for me. I was running a
kselftest-like setup like this:

   br0
   / \
swp1 swp3  swp2  swp4

Physically, [swp1, swp2] and [swp3, swp4] were looped externally:

    CPU
     |
.----0----.
|   sw0   |
'-1-2-3-4-'
  '-' '-'

I was testing automatic learning by sending out a broadcast from br0 and
verifying that br0's MAC was learned on port 0 in the ATU - alas, it was
not. The MAC was nowhere to be found.

Moving back to a topology without these loops, I could see that learning
worked as expected.

Not sure how familiar you are with mv88e6xxx at this point, but the way
learning is disabled is by clearing a port's port association vector
(PAV). This does not, however, "disable" learning really. It just
updates the ATU with an all-zero vector, which means "invalidate the
entry".

So, in the example above, when the broadcast is looped back to the
standalone ports, the port's default FID (0) will be used to invalidate
the MAC for br0. Since they all use FID 0, the standalone ports will
nuke br0's FDB.

> The hardware I am working with simply does not have any way to solve
> this either - the FDB is simply not partitionable without VLAN
> filtering (we have simple shared VLAN filtering, where the VID is
> ignored and the FDB lookup is performed with VID 0, but not anything
> more complex). So the simple solution I've been advising for people who
> want their MAC addresses to be isolated is to create a single VLAN-aware
> bridge and manage the VLAN broadcast domains themselves - that seems to
> work and is simple to understand and flexible (note that I am going to
> send a patch at some point to prevent the user from partitioning a
> sja1105 switch tree into multiple VLAN-aware bridges).

Essentially what I am proposing is to always run mv88e6xxx VLAN-aware
internally. Then if you have bridges that disable VLAN filtering, you
change the ingress policy on the member ports to classify all incoming
traffic as untagged and assign them to the port's PVID. (Note: this is
different from "force PVID" in that you never pop any tags from the
frame)

Is your device capable of operating in that mode?

> Basically unless I'm misunderstanding something, I think what you're
> proposing makes theoretical sense, but without a use case behind it it
> might just be too much work with no real life benefit.

I have not done the tests to prove it, but I am pretty sure that if you
have two bridges where the same MACs are used (which does happen in
redundant topologies, and is why IVL is a thing) you will add MACs to
the ATU with a destination that is not allowed by the PVT. This will
lead to sporadic drops while the address is on the "wrong" port from the
view of one bridge, until that station sends some return traffic. At
that point of course, it will be on the wrong port according to the
other bridge.

I guess I would just rather spend some more time up front to make sure
that we support full isolation between bridges, than spend that same
time debugging issues in production environments later on.
Tobias Waldekranz July 5, 2021, 8:32 a.m. UTC | #5
On Mon, Jul 05, 2021 at 12:20, DENG Qingfang <dqfext@gmail.com> wrote:
> Hi Vladimir,
>
> On Sat, Jul 03, 2021 at 02:56:55PM +0300, Vladimir Oltean wrote:
>> For this series I have taken Tobias' work from here:
>> https://patchwork.kernel.org/project/netdevbpf/cover/20210426170411.1789186-1-tobias@waldekranz.com/
>> and made the following changes:
>> - I collected and integrated (hopefully all of) Nikolay's, Ido's and my
>>   feedback on the bridge driver changes. Otherwise, the structure of the
>>   bridge changes is pretty much the same as Tobias left it.
>> - I basically rewrote the DSA infrastructure for the data plane
>>   forwarding offload, based on the commonalities with another switch
>>   driver for which I implemented this feature (not submitted here)
>> - I adapted mv88e6xxx to use the new infrastructure, hopefully it still
>>   works but I didn't test that
>> 
>> The data plane of the software bridge can be partially offloaded to
>> switchdev, in the sense that we can trust the accelerator to:
>> (a) look up its FDB (which is more or less in sync with the software
>>     bridge FDB) for selecting the destination ports for a packet
>> (b) replicate the frame in hardware in case it's a multicast/broadcast,
>>     instead of the software bridge having to clone it and send the
>>     clones to each net device one at a time. This reduces the bandwidth
>>     needed between the CPU and the accelerator, as well as the CPU time
>>     spent.
>
> Many DSA taggers use port bit field in their TX tags, which allows
> replication in hardware. (multiple bits set = send to multiple ports)
> I wonder if the tagger API can be updated to support this.

I think you could, but it would be tricky.

The bridge does not operate using vectors/bitfields; rather, it is
procedural code that you have to loop through before knowing the set of
destination ports.

This series just sends the skb to the first port in the hardware domain
and trusts the HW to calculate the same port set as the code in
br_forward.c would have.

To do what you suggest, the bridge would have to translate each nbp into
a position in a bitfield (or call out to the underlying driver to do it)
as it is looping through ports, then send the aggregated mask along with
the skb. Knowing if a port is the first one you have come across for a
given domain is very easy (just maintain a bitfield); knowing if it is
the last one is harder. So you would likely end up having to queue up
the actual transmission until after the loop has been executed, which
is hard to combine with the "lazy cloning" that you really want in
order to get decent performance.
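
To make it concrete, the flood path would have to become something like
this (a hand-written sketch, not a patch; nbp_hw_port_bit() and
br_hw_flood() are hypothetical):

/* Accumulate a destination mask while walking the port list, then do a
 * single offloaded transmission once the mask is complete. The real
 * br_flood() also handles local delivery, per-port flood flags, etc.
 */
static void br_flood_with_mask(struct net_bridge *br, struct sk_buff *skb,
			       bool local_orig)
{
	struct net_bridge_port *p;
	unsigned long mask = 0;

	list_for_each_entry_rcu(p, &br->port_list, list) {
		if (!should_deliver(p, skb))
			continue;

		/* Cannot transmit here: a later port may belong to the
		 * same hardware domain, so the full mask is only known
		 * after the loop - which defeats the lazy skb cloning
		 * that maybe_deliver() relies on today.
		 */
		mask |= nbp_hw_port_bit(p);
	}

	if (mask)
		br_hw_flood(br, skb, mask, local_orig);
}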

>> 
>> The data path forwarding offload is managed per "hardware domain" - a
>> generalization of the "offload_fwd_mark" concept which is being
>> introduced in this series. Every packet is delivered only once to each
>> hardware domain.
>> 
>> In addition, Tobias said in the original cover letter:
>> 
>> ====================
>> ## Overview
>> 
>>    vlan1   vlan2
>>        \   /
>>    .-----------.
>>    |    br0    |
>>    '-----------'
>>    /   /   \   \
>> swp0 swp1 swp2 eth0
>>   :   :   :
>>   (hwdom 1)
>> 
>> Up to this point, switchdevs have been trusted with offloading
>> forwarding between bridge ports, e.g. forwarding a unicast from swp0
>> to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This
>> series extends forward offloading to include some new classes of
>> traffic:
>> 
>> - Locally originating flows, i.e. packets that ingress on br0 that are
>>   to be forwarded to one or several of the ports swp{0,1,2}. Notably
>>   this also includes routed flows, e.g. a packet ingressing swp0 on
>>   VLAN 1 which is then routed over to VLAN 2 by the CPU and then
>>   forwarded to swp1 is "locally originating" from br0's point of view.
>> 
>> - Flows originating from "foreign" interfaces, i.e. an interface that
>>   is not offloaded by a particular switchdev instance. This includes
>>   ports belonging to other switchdev instances. A typical example
>>   would be flows from eth0 towards swp{0,1,2}.
>> 
>> The bridge still looks up its FDB/MDB as usual and then notifies the
>> switchdev driver that a particular skb should be offloaded if it
>> matches one of the classes above. It does so by using the _accel
>> version of dev_queue_xmit, supplying its own netdev as the
>> "subordinate" device. The driver can react to the presence of the
>> subordinate in its .ndo_select_queue in what ever way it needs to make
>> sure to forward the skb in much the same way that it would for packets
>> ingressing on regular ports.
>> 
>> Hardware domains to which a particular skb has been forwarded are
>> recorded so that duplicates are avoided.
>> 
>> The main performance benefit is thus seen on multicast flows. Imagine
>> for example that:
>> 
>> - An IP camera is connected to swp0 (VLAN 1)
>> 
>> - The CPU is acting as a multicast router, routing the group from VLAN
>>   1 to VLAN 2.
>> 
>> - There are subscribers for the group in question behind both swp1 and
>>   swp2 (VLAN 2).
>> 
>> With this offloading in place, the bridge need only send a single skb
>> to the driver, which will send it to the hardware marked in such a way
>> that the switch will perform the multicast replication according to
>> the MDB configuration. Naturally, the number of saved skb_clones
>>   increases linearly with the number of subscribed ports.
>> 
>> As an extra benefit, on mv88e6xxx, this also allows the switch to
>> perform source address learning on these flows, which avoids having to
>> sync dynamic FDB entries over slow configuration interfaces like MDIO
>> to avoid flows directed towards the CPU being flooded as unknown
>> unicast by the switch.
>> 
>> 
>> ## RFC
>> 
>> - In general, what do you think about this idea?
>> 
>> - hwdom. What do you think about this terminology? Personally I feel
>>   that we had too many things called offload_fwd_mark, and that as the
>>   use of the bridge internal ID (nbp->offload_fwd_mark) expands, it
>>   might be useful to have a separate term for it.
>> 
>> - .dfwd_{add,del}_station. Am I stretching this abstraction too far,
>>   and if so do you have any suggestion/preference on how to signal the
>>   offloading from the bridge down to the switchdev driver?
>> 
>> - The way that flooding is implemented in br_forward.c (lazily cloning
>>   skbs) means that you have to mark the forwarding as completed very
>>   early (right after should_deliver in maybe_deliver) in order to
>>   avoid duplicates. Is there some way to move this decision point to a
>>   later stage that I am missing?
>> 
>> - BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not
>>   compatible with unicast-to-multicast being used on a port. Then
>>   again, I think that this would also be broken for regular switchdev
>>   bridge offloading as this flag is not offloaded to the switchdev
>>   port, so there is no way for the driver to refuse it. Any ideas on
>>   how to handle this?
>> 
>> 
>> ## mv88e6xxx Specifics
>> 
>> Since we are now only receiving a single skb for both unicast and
>> multicast flows, we can tag the packets with the FORWARD command
>> instead of FROM_CPU. The switch(es) will then forward the packet in
>> accordance with its ATU, VTU, STU, and PVT configuration - just like
>> for packets ingressing on user ports.
>> 
>> Crucially, FROM_CPU is still used for:
>> 
>> - Ports in standalone mode.
>> 
>> - Flows that are trapped to the CPU and software-forwarded by a
>>   bridge. Note that these flows match neither of the classes discussed
>>   in the overview.
>> 
>> - Packets that are sent directly to a port netdev without going
>>   through the bridge, e.g. lldpd sending out PDU via an AF_PACKET
>>   socket.
>> 
>> We thus have a pretty clean separation where the data plane uses
>> FORWARDs and the control plane uses TO_/FROM_CPU.
>> 
>> The barrier between different bridges is enforced by port based VLANs
>> on mv88e6xxx, which in essence is a mapping from a source device/port
>> pair to an allowed set of egress ports. In order to have a FORWARD
>> frame (which carries a _source_ device/port) correctly mapped by the
>> PVT, we must use a unique pair for each bridge.
>> 
>> Fortunately, there is typically lots of unused address space in most
>> switch trees. When was the last time you saw an mv88e6xxx product
>> using more than 4 chips? Even if you found one with 16 (!) devices,
>> you would still have room to allocate 16*16 virtual ports to software
>> bridges.
>> 
>> Therefore, the mv88e6xxx driver will allocate a virtual device/port
>> pair to each bridge that it offloads. All members of the same bridge
>> are then configured to allow packets from this virtual port in their
>> PVTs.
>> ====================
>> 
>> Tobias Waldekranz (5):
>>   net: dfwd: constrain existing users to macvlan subordinates
>>   net: bridge: disambiguate offload_fwd_mark
>>   net: bridge: switchdev: recycle unused hwdoms
>>   net: bridge: switchdev: allow the data plane forwarding to be
>>     offloaded
>>   net: dsa: tag_dsa: offload the bridge forwarding process
>> 
>> Vladimir Oltean (5):
>>   net: extract helpers for binding a subordinate device to TX queues
>>   net: allow ndo_select_queue to go beyond dev->num_real_tx_queues
>>   net: dsa: track the number of switches in a tree
>>   net: dsa: add support for bridge forwarding offload
>>   net: dsa: mv88e6xxx: map virtual bridges with forwarding offload in
>>     the PVT
>> 
>>  drivers/net/dsa/mv88e6xxx/chip.c              | 106 +++++++++++-
>>  .../net/ethernet/intel/fm10k/fm10k_netdev.c   |   3 +
>>  drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   3 +
>>  include/linux/if_bridge.h                     |   1 +
>>  include/linux/netdevice.h                     |  13 +-
>>  include/net/dsa.h                             |  37 ++++
>>  net/bridge/br_forward.c                       |  18 +-
>>  net/bridge/br_if.c                            |   4 +-
>>  net/bridge/br_private.h                       |  49 +++++-
>>  net/bridge/br_switchdev.c                     | 163 +++++++++++++++---
>>  net/bridge/br_vlan.c                          |  10 +-
>>  net/core/dev.c                                |  31 +++-
>>  net/dsa/dsa2.c                                |   3 +
>>  net/dsa/dsa_priv.h                            |  28 +++
>>  net/dsa/port.c                                |  35 ++++
>>  net/dsa/slave.c                               | 134 +++++++++++++-
>>  net/dsa/switch.c                              |  58 +++++++
>>  net/dsa/tag_dsa.c                             |  60 ++++++-
>>  19 files changed, 700 insertions(+), 59 deletions(-)
>> 
>> -- 
>> 2.25.1
>>
Vladimir Oltean July 5, 2021, 8:54 a.m. UTC | #6
On Mon, Jul 05, 2021 at 10:09:16AM +0200, Tobias Waldekranz wrote:
> On Sun, Jul 04, 2021 at 11:11, Vladimir Oltean <olteanv@gmail.com> wrote:
> > Hi Tobias,
> >
> > On Sun, Jul 04, 2021 at 12:04:26AM +0200, Tobias Waldekranz wrote:
> >> On Sat, Jul 03, 2021 at 14:56, Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> >> > For this series I have taken Tobias' work from here:
> >> > https://patchwork.kernel.org/project/netdevbpf/cover/20210426170411.1789186-1-tobias@waldekranz.com/
> >> > and made the following changes:
> >> > - I collected and integrated (hopefully all of) Nikolay's, Ido's and my
> >> >   feedback on the bridge driver changes. Otherwise, the structure of the
> >> >   bridge changes is pretty much the same as Tobias left it.
> >> > - I basically rewrote the DSA infrastructure for the data plane
> >> >   forwarding offload, based on the commonalities with another switch
> >> >   driver for which I implemented this feature (not submitted here)
> >> > - I adapted mv88e6xxx to use the new infrastructure, hopefully it still
> >> >   works but I didn't test that
> >>
> >> Hi Vladimir,
> >>
> >> Sorry that I have dropped the ball on this series. I have actually had a
> >> v1 of this queued up for a while. Unfortunately I ran into mv88e6xxx
> >> specific problems. (See below)
> >>
> >> > The data plane of the software bridge can be partially offloaded to
> >> > switchdev, in the sense that we can trust the accelerator to:
> >> > (a) look up its FDB (which is more or less in sync with the software
> >> >     bridge FDB) for selecting the destination ports for a packet
> >> > (b) replicate the frame in hardware in case it's a multicast/broadcast,
> >> >     instead of the software bridge having to clone it and send the
> >> >     clones to each net device one at a time. This reduces the bandwidth
> >> >     needed between the CPU and the accelerator, as well as the CPU time
> >> >     spent.
> >> >
> >> > The data path forwarding offload is managed per "hardware domain" - a
> >> > generalization of the "offload_fwd_mark" concept which is being
> >> > introduced in this series. Every packet is delivered only once to each
> >> > hardware domain.
> >> >
> >> > In addition, Tobias said in the original cover letter:
> >> >
> >> > ====================
> >> > ## Overview
> >> >
> >> >    vlan1   vlan2
> >> >        \   /
> >> >    .-----------.
> >> >    |    br0    |
> >> >    '-----------'
> >> >    /   /   \   \
> >> > swp0 swp1 swp2 eth0
> >> >   :   :   :
> >> >   (hwdom 1)
> >> >
> >> > Up to this point, switchdevs have been trusted with offloading
> >> > forwarding between bridge ports, e.g. forwarding a unicast from swp0
> >> > to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This
> >> > series extends forward offloading to include some new classes of
> >> > traffic:
> >> >
> >> > - Locally originating flows, i.e. packets that ingress on br0 that are
> >> >   to be forwarded to one or several of the ports swp{0,1,2}. Notably
> >> >   this also includes routed flows, e.g. a packet ingressing swp0 on
> >> >   VLAN 1 which is then routed over to VLAN 2 by the CPU and then
> >> >   forwarded to swp1 is "locally originating" from br0's point of view.
> >> >
> >> > - Flows originating from "foreign" interfaces, i.e. an interface that
> >> >   is not offloaded by a particular switchdev instance. This includes
> >> >   ports belonging to other switchdev instances. A typical example
> >> >   would be flows from eth0 towards swp{0,1,2}.
> >> >
> >> > The bridge still looks up its FDB/MDB as usual and then notifies the
> >> > switchdev driver that a particular skb should be offloaded if it
> >> > matches one of the classes above. It does so by using the _accel
> >> > version of dev_queue_xmit, supplying its own netdev as the
> >> > "subordinate" device. The driver can react to the presence of the
> >> > subordinate in its .ndo_select_queue in what ever way it needs to make
> >> > sure to forward the skb in much the same way that it would for packets
> >> > ingressing on regular ports.
> >> >
> >> > Hardware domains to which a particular skb has been forwarded are
> >> > recorded so that duplicates are avoided.
> >> >
> >> > The main performance benefit is thus seen on multicast flows. Imagine
> >> > for example that:
> >> >
> >> > - An IP camera is connected to swp0 (VLAN 1)
> >> >
> >> > - The CPU is acting as a multicast router, routing the group from VLAN
> >> >   1 to VLAN 2.
> >> >
> >> > - There are subscribers for the group in question behind both swp1 and
> >> >   swp2 (VLAN 2).
> >> >
> >> > With this offloading in place, the bridge need only send a single skb
> >> > to the driver, which will send it to the hardware marked in such a way
> >> > that the switch will perform the multicast replication according to
> >> > the MDB configuration. Naturally, the number of saved skb_clones
> >> >   increases linearly with the number of subscribed ports.
> >> >
> >> > As an extra benefit, on mv88e6xxx, this also allows the switch to
> >> > perform source address learning on these flows, which avoids having to
> >> > sync dynamic FDB entries over slow configuration interfaces like MDIO
> >> > to avoid flows directed towards the CPU being flooded as unknown
> >> > unicast by the switch.
> >> >
> >> >
> >> > ## RFC
> >> >
> >> > - In general, what do you think about this idea?
> >> >
> >> > - hwdom. What do you think about this terminology? Personally I feel
> >> >   that we had too many things called offload_fwd_mark, and that as the
> >> >   use of the bridge internal ID (nbp->offload_fwd_mark) expands, it
> >> >   might be useful to have a separate term for it.
> >> >
> >> > - .dfwd_{add,del}_station. Am I stretching this abstraction too far,
> >> >   and if so do you have any suggestion/preference on how to signal the
> >> >   offloading from the bridge down to the switchdev driver?
> >> >
> >> > - The way that flooding is implemented in br_forward.c (lazily cloning
> >> >   skbs) means that you have to mark the forwarding as completed very
> >> >   early (right after should_deliver in maybe_deliver) in order to
> >> >   avoid duplicates. Is there some way to move this decision point to a
> >> >   later stage that I am missing?
> >> >
> >> > - BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not
> >> >   compatible with unicast-to-multicast being used on a port. Then
> >> >   again, I think that this would also be broken for regular switchdev
> >> >   bridge offloading as this flag is not offloaded to the switchdev
> >> >   port, so there is no way for the driver to refuse it. Any ideas on
> >> >   how to handle this?
> >> >
> >> >
> >> > ## mv88e6xxx Specifics
> >> >
> >> > Since we are now only receiving a single skb for both unicast and
> >> > multicast flows, we can tag the packets with the FORWARD command
> >> > instead of FROM_CPU. The switch(es) will then forward the packet in
> >> > accordance with its ATU, VTU, STU, and PVT configuration - just like
> >> > for packets ingressing on user ports.
> >> >
> >> > Crucially, FROM_CPU is still used for:
> >> >
> >> > - Ports in standalone mode.
> >> >
> >> > - Flows that are trapped to the CPU and software-forwarded by a
> >> >   bridge. Note that these flows match neither of the classes discussed
> >> >   in the overview.
> >> >
> >> > - Packets that are sent directly to a port netdev without going
> >> >   through the bridge, e.g. lldpd sending out PDU via an AF_PACKET
> >> >   socket.
> >> >
> >> > We thus have a pretty clean separation where the data plane uses
> >> > FORWARDs and the control plane uses TO_/FROM_CPU.
> >> >
> >> > The barrier between different bridges is enforced by port based VLANs
> >> > on mv88e6xxx, which in essence is a mapping from a source device/port
> >> > pair to an allowed set of egress ports.
> >>
> >> Unless I am missing something, it turns out that the PVT is not enough
> >> to support multiple (non-VLAN filtering) bridges in multi-chip
> >> setups. While the isolation barrier works, there is no way of correctly
> >> managing automatic learning.
> >>
> >> > In order to have a FORWARD
> >> > frame (which carries a _source_ device/port) correctly mapped by the
> >> > PVT, we must use a unique pair for each bridge.
> >> >
> >> > Fortunately, there is typically lots of unused address space in most
> >> > switch trees. When was the last time you saw an mv88e6xxx product
> >> > using more than 4 chips? Even if you found one with 16 (!) devices,
> >> > you would still have room to allocate 16*16 virtual ports to software
> >> > bridges.
> >> >
> >> > Therefore, the mv88e6xxx driver will allocate a virtual device/port
> >> > pair to each bridge that it offloads. All members of the same bridge
> >> > are then configured to allow packets from this virtual port in their
> >> > PVTs.
> >>
> >> So while this solution is cute, it does not work in this example:
> >>
> >>  CPU
> >>   | .-----.
> >> .-0-1-. .-0-1-.
> >> | sw0 | | sw1 |
> >> '-2-3-' '-2-3-'
> >>
> >> - [sw0p2, sw1p2] are attached to one bridge
> >> - [sw0p3, sw1p3] are attached to another bridge
> >> - Neither bridge uses VLAN filtering
> >>
>> Since no VLAN information is available in the frames, the source addresses
> >> of FORWARDs sent over the DSA link (sw0p1, sw1p0) cannot possibly be
> >> separated into different FIDs. They will all be placed in the respective
> >> port's default FID. Thus, the two bridges are not isolated with respect
> >> to their FDBs.
> >>
> >> My current plan is therefore to start by reworking how bridges are
> >> isolated on mv88e6xxx. Roughly by allocating a reserved VID/FID pair for
> >> each non-filtering bridge. Two of these can be easily managed since both
> >> VID 0 and 4095 are illegal on the wire but allowed in the VTU - after
> >> that it gets tricky. The best scheme I have come up with is to just grab
> >> an unused VID when adding any subsequent non-filtering bridge; in the
> >> event that that VID is requested by a filtering bridge or a VLAN upper,
> >> you move the non-filtering bridge to another currently unused VID.
> >>
> >> Does that sound reasonable?
> >
> > I don't think this patch series makes the problem you are describing any
> > worse than it already is in mainline, does it?
> 
> It does not make it worse, no. But assuming that mv88e6xxx will handle
> multi-bridge using the VTU in the future (i.e. my suggestion above),
> there is no need for inventing virtual DSA dev/port tuples - we can just
> use the physical port info as the source and use the VID to signal the
> source bridge. So I am hesitant to merge the mv88e6xxx-specific changes.
> 
> > I mean even with multiple VLAN-unaware bridges spanning the same single
> > switch chip today, it is still true that you can not have two stations
> > with the same MAC address, one in one bridge and another in the other
> > bridge, right?
> 
> That is correct.
> 
> > Do you have an example when this causes issues that need to be addressed
> > immediately?
> >
> > I thought the only case where this is a real problem is when you have
> > multiple CPU ports or multiple DSA links between 2 switches, because
> > then, if learning is enabled, that same MAC address will bounce between
> > the 2 ports. For that case, the consensus was that you just can't enable
> > address learning on those ports, and you let the software manage the FDB
> > in a way that is compatible with multiple CPU ports / DSA links (install
> > the MAC DA as a sort of multicast address and let the port forwarding
> > matrix choose only one of the 2 destinations based on source port).
> >
> > Lack of FDB partitioning also used to be a problem when the standalone
> > ports were left to do address learning, but that changed too.
> 
> Funny you should mention that. The presence of standalone ports is
> actually what first shone a light on this issue for me. I was running a
> kselftest-like setup like this:
> 
>    br0
>    / \
> swp1 swp3  swp2  swp4
> 
> Physically, [swp1, swp2] and [swp3, swp4] were looped externally:
> 
>     CPU
>      |
> .----0----.
> |   sw0   |
> '-1-2-3-4-'
>   '-' '-'
> 
> I was testing automatic learning by sending out a broadcast from br0 and
> verifying that br0's MAC was learned on port 0 in the ATU - alas, it was
> not. The MAC was nowhere to be found.
> 
> Moving back to a topology without these loops, I could see that learning
> worked as expected.
> 
> Not sure how familiar you are with mv88e6xxx at this point, but the way
> learning is disabled is by clearing a port's port association vector
> (PAV). This does not, however, "disable" learning really. It just
> updates the ATU with an all-zero vector, which means "invalidate the
> entry".
> 
> So, in the example above, when the broadcast is looped back to the
> standalone ports, the port's default FID (0) will be used to invalidate
> the MAC for br0. Since they all use FID 0, the standalone ports will
> nuke br0's FDB.

That's broken IMO. The same kind of setup works with sja1105 and felix/ocelot
(with ocelot I use tools/testing/selftests/drivers/net/ocelot/tc_flower_chains.sh
as a test case for this, with the "DUT ports" (bridged ports) part of
the same switch as the "Generator ports" (standalone ports)).
"Don't learn" means just "don't learn", the FDB entry remains where it
was, in this case on the CPU port.

I imagine that if the "no learning" bit is being set on an access port
as a security measure (you don't trust that whoever connects the cable
won't attempt to spoof MAC addresses and fill the FDB), then it won't be
effective at all on mv88e6xxx, since the attacker will at least manage to
invalidate the FDB entries pointing to other ports in the switch. File a
ticket with Marvell maybe?

With your RX filtering patches now merged, and the br0 MAC address
installed as a static entry, this is not really a problem for the setup
you described, is it? Maybe we should just keep using assisted_learning_on_cpu_port
for mv88e6xxx.
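
For reference, opting in is just a flag in the driver's setup path - a
minimal sketch (the rest of the driver's setup code is omitted):

static int example_driver_setup(struct dsa_switch *ds)
{
	/* Ask DSA to install/remove FDB entries on the CPU port based on
	 * what the software bridge learns (including on foreign
	 * interfaces), instead of relying on hardware learning there.
	 */
	ds->assisted_learning_on_cpu_port = true;

	return 0;
}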

> > The hardware I am working with simply does not have any way to solve
> > this either - the FDB is simply not partitionable without VLAN
> > filtering (we have simple shared VLAN filtering, where the VID is
> > ignored and the FDB lookup is performed with VID 0, but not anything
> > more complex). So the simple solution I've been advising for people who
> > want their MAC addresses to be isolated is to create a single VLAN-aware
> > bridge and manage the VLAN broadcast domains themselves - that seems to
> > work and is simple to understand and flexible (note that I am going to
> > send a patch at some point to prevent the user from partitioning a
> > sja1105 switch tree into multiple VLAN-aware bridges).
> 
> Essentially what I am proposing is to always run mv88e6xxx VLAN-aware
> internally. Then if you have bridges that disable VLAN filtering, you
> change the ingress policy on the member ports to classify all incoming
> traffic as untagged and assign it to the port's PVID. (Note: this is
> different from "force PVID" in that you never pop any tags from the
> frame)
> 
> Is your device capable of operating in that mode?

Yes, this is how the switches I maintain work in VLAN-unaware mode.
The sja1105 is always VLAN-aware, but I change the TPID by which it
recognizes VLAN tags to a bogus value (0xdadb) and all packets get
classified to the port pvid.
The ocelot/felix switch also has the concept of a "classified VLAN",
which can be derived from the port-based default with an option to look
at the VLAN header in the frame, or TCAM rules can also change it, etc.
In VLAN-unaware mode I configure the switch to not look at the VLAN
header in the frame when setting the classified VLAN.
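
In pseudo-C, the classification step boils down to something like this
(a conceptual sketch, not the actual ocelot/felix or sja1105 register
programming; struct example_port_cfg is made up):

static u16 classified_vid(const struct example_port_cfg *cfg,
			  const struct sk_buff *skb)
{
	/* VLAN-aware: trust the VID carried in the frame, if any. */
	if (cfg->vlan_aware && skb_vlan_tag_present(skb))
		return skb_vlan_tag_get_id(skb);

	/* VLAN-unaware: ignore whatever tag is in the frame and fall
	 * back to the port-based default (pvid).
	 */
	return cfg->pvid;
}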

Note that due to external reasons, in VLAN-unaware mode the sja1105 and
ocelot/felix switches use different classified VLANs:
- sja1105 uses VID 1024 as pvid for switch ID 0, port 0; 1025 for switch
  ID 0, port 1, and so on. See net/dsa/tag_8021q.c for details.
- ocelot uses VID 0 for all frames.

On sja1105, because every port has a unique pvid, I necessarily have to
configure it for shared address learning and ignore the VID during FDB
lookup (see commit 6d7c7d948a2e ("net: dsa: sja1105: Fix broken learning
with vlan_filtering disabled")). This also means that the VID is always
looked up as zero, which means I cannot do proper FDB partitioning.
Maybe if it had the option of a more complex shared VLAN learning, where
the VID could be 0 in the FDB lookup of some ports, 1 in others, and so
on, then it could be made to work. But it doesn't have that option.
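
The practical consequence for the FDB key, as a sketch (the type is made
up):

struct example_fdb_key {
	u8  mac[ETH_ALEN];
	u16 vid;	/* independent VLAN learning: the classified VID;
			 * shared VLAN learning (sja1105): always 0
			 */
};

With shared learning, traffic from two VLAN-unaware bridges collapses
into the same { mac, 0 } keys, so their FDBs cannot be partitioned.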

On ocelot/felix, I suppose that can be done: the classified VLAN in
VLAN-unaware mode can be derived from the "bridge ID", and 0 can be used
just for standalone ports. But I'm not really interested in adding
support for that; I don't see a use case where it would make a
significant difference.

> > Basically unless I'm misunderstanding something, I think what you're
> > proposing makes theoretical sense, but without a use case behind it it
> > might just be too much work with no real life benefit.
> 
> I have not done the tests to prove it, but I am pretty sure that if you
> have two bridges where the same MACs are used (which does happen in
> redundant topologies, and is why IVL is a thing) you will add MACs to
> the ATU with a destination that is not allowed by the PVT. This will
> lead to sporadic drops while the address is on the "wrong" port from the
> view of one bridge, until that station sends some return traffic. At
> that point of course, it will be on the wrong port according to the
> other bridge.

The redundant topologies I am familiar with (IEEE 802.1CB) will disable
address learning. HSR too, probably. There's no point in learning where
a packet came from if it is consistently going to come from multiple
sources.

Your VLAN-aware bridge should have independent VLAN learning enabled.
There is really no disadvantage to having a single VLAN-aware bridge as
opposed to multiple VLAN-unaware bridges. You can manage your forwarding
domains on a per-VLAN basis with even more flexibility.

Even if you add support for VLAN-unaware FDB partitioning via unique
classified VLANs per bridge domain, you will probably still have a
single ageing timer for your entire FDB. DSA chooses the
dsa_switch_fastest_ageing_time() for you, on the assumption that this is
the case. You are unlikely to find hardware which is fully partitionable;
you will always find some corner cases.
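
(That helper essentially just takes the minimum over the ports of a
switch - roughly, and simplified from memory:)

static unsigned int fastest_ageing_time(struct dsa_switch *ds,
					unsigned int ageing_time)
{
	struct dsa_port *dp;
	int port;

	/* Keep the smallest ageing time requested by any port, since a
	 * single timer has to serve the whole (non-partitionable) FDB.
	 */
	for (port = 0; port < ds->num_ports; port++) {
		dp = dsa_to_port(ds, port);
		if (dp->ageing_time && dp->ageing_time < ageing_time)
			ageing_time = dp->ageing_time;
	}

	return ageing_time;
}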

> I guess I would just rather spend some more time up front to make sure
> that we support full isolation between bridges, than spend that same
> time debugging issues in production environments later on.

I have nothing against dropping the mv88e6xxx patches and replacing them
with sja1105 support (even if that will have the same "issues" that you
describe here).

On sja1105, there is actually no DSA tagging support for data plane
packets, only for control packets. This is both a good and a bad thing:
bad because it is hard to work with given the current control-only
infrastructure, but good because everything is a VLAN, really.

Currently, a packet that is not PTP or STP is sent using a tag_8021q
"TX VLAN", whose broadcast domain contains only 2 ports: the CPU port
and the desired egress port. The TX VLAN is popped on egress, so it acts
as a de facto DSA tag.

With bridge data plane offload, all that needs to be done is for
tag_8021q to set up a bridge forwarding offload TX VLAN corresponding to
each bridge number, and this is a multicast VLAN: it contains the CPU
port as well as all ports that have joined that bridge. It will not leak
outside of the software bridge's broadcast domain, and the packet will
be looked up in the FDB.
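
As a sketch (the helper names are made up, this is not the final
tag_8021q API), joining an offloaded bridge would look roughly like:

static int example_bridge_fwd_offload_join(struct dsa_switch *ds, int port,
					   unsigned int bridge_num)
{
	/* One tag_8021q VID per offloaded software bridge. */
	u16 vid = example_bridge_fwd_offload_vid(bridge_num);
	int err;

	/* The CPU port joins so the bridge can inject data plane packets
	 * into this VLAN...
	 */
	err = example_tag_8021q_vid_join(ds, dsa_upstream_port(ds, port), vid);
	if (err)
		return err;

	/* ...and the user port joins (with the VID popped on egress), so
	 * the switch can flood/forward within the bridge on its own.
	 */
	return example_tag_8021q_vid_join(ds, port, vid);
}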

Actually what I described above is only true for VLAN-unaware bridging.
When we offload a VLAN-aware bridge, we let the packet slide into the
switch with precisely the VLAN ID that came from the bridge. Basically
the packet looks the same as if it had been sent through a socket on the
DSA master; we don't pretend to construct a DSA tag at all, because we
don't. But we now have all the advantages of the packet coming from the
bridge device, like being able to bridge the sja1105 with foreign
interfaces. This is why we need to restrict the user to a single
VLAN-aware bridge, otherwise port separation could not be maintained.
When some ports of a sja1105 device are part of a VLAN-aware bridge, the
standalone ports are still targeted with the precise TX VLAN, and the
tag_8021q VLANs (the 1024-3071 range) cannot be installed in the bridge
VLAN database or as 8021q uppers.
Vladimir Oltean July 5, 2021, 9:57 a.m. UTC | #7
On Mon, Jul 05, 2021 at 10:32:04AM +0200, Tobias Waldekranz wrote:
> > Many DSA taggers use port bit field in their TX tags, which allows
> > replication in hardware. (multiple bits set = send to multiple ports)
> > I wonder if the tagger API can be updated to support this.
>
> I think you could, but it would be tricky.
>
> The bridge does not operate using vectors/bitfields; rather, it is
> procedural code that you have to loop through before knowing the set of
> destination ports.
>
> This series just sends the skb to the first port in the hardware domain
> and trusts the HW to calculate the same port set as the code in
> br_forward.c would have.
>
> To do what you suggest, the bridge would have to translate each nbp into
> a position in a bitfield (or call out to the underlying driver to do it)
> as it is looping through ports, then send the aggregated mask along with
> the skb. Knowing if a port is the first one you have come across for a
> given domain is very easy (just maintain a bitfield); knowing if it is
> the last one is harder. So you would likely end up having to queue up
> the actual transmission until after the loop has been executed, which
> is hard to combine with the "lazy cloning" that you really want in
> order to get decent performance.

In addition to changing the bridge in order to get the entire bit mask,
one also has to somehow propagate that bit mask per skb down to the
driver, which might be tricky in itself. There is currently no
bridge-specific data structure passed between the bridge and the
switchdev driver; it is just the struct net_device *sb_dev. A hacky
solution I might imagine is for the bridge to kzalloc() a small data
structure like:

struct bridge_fwd_offload_accel_priv {
	struct net_device *sb_dev; /* Must be first! */
	unsigned long port_mask;
};

and call as follows:

int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	struct bridge_fwd_offload_accel_priv *accel_priv = NULL;

	if (br_switchdev_accels_skb(skb)) {
		accel_priv = kzalloc(sizeof(*accel_priv), GFP_ATOMIC);
		if (!accel_priv) {
			kfree_skb(skb);
			return -ENOMEM;
		}

		accel_priv->sb_dev = BR_INPUT_SKB_CB(skb)->brdev;
		/* port_mask would have been accumulated by the caller
		 * while walking the bridge port list.
		 */
		accel_priv->port_mask = port_mask;
	}

	/* net/core/dev.c only ever dereferences the sb_dev member, which
	 * is why it must stay the first field of the structure.
	 */
	return dev_queue_xmit_accel(skb, (struct net_device *)accel_priv);
}

This way, the code in net/core/dev.c can be left unmodified. We give it
an accel_priv pointer but it will think it is only looking at an sb_dev
pointer, since that is the first element in the structure.

But then the switchdev driver must kfree(accel_priv) in the xmit function.

Not really nice, but for a cuter solution, I think we would need to extend struct sk_buff.