mbox series

[RFC,v3,net-next,00/24] Allow forwarding for the software bridge data path to be offloaded to capable devices

Message ID 20210712152142.800651-1-vladimir.oltean@nxp.com (mailing list archive)
Headers show
Series Allow forwarding for the software bridge data path to be offloaded to capable devices | expand

Message

Vladimir Oltean July 12, 2021, 3:21 p.m. UTC
Message for v3:

In this submission I have introduced a "native switchdev" driver API to
signal whether the TX forwarding offload is supported or not. This comes
after a third person has said that the macvlan offload framework used
for v2 and v1 was simply too convoluted.

This large patch set is submitted for discussion purposes (it is
provided in its entirety so it can be applied & tested on net-next).
It is only minimally tested, and yet I will not copy all switchdev
driver maintainers until we agree on the viability of this approach.

The major changes compared to v2:
- The introduction of switchdev_bridge_port_offload() and
  switchdev_bridge_port_unoffload() as two major API changes from the
  perspective of a switchdev driver. All drivers were converted to call
  these.
- Augment switchdev_bridge_port_{,un}offload to also handle the
  switchdev object replays on port join/leave.
- Augment switchdev_bridge_port_offload to also signal whether the TX
  forwarding offload is supported.

Message for v2:

For this series I have taken Tobias' work from here:
https://patchwork.kernel.org/project/netdevbpf/cover/20210426170411.1789186-1-tobias@waldekranz.com/
and made the following changes:
- I collected and integrated (hopefully all of) Nikolay's, Ido's and my
  feedback on the bridge driver changes. Otherwise, the structure of the
  bridge changes is pretty much the same as Tobias left it.
- I basically rewrote the DSA infrastructure for the data plane
  forwarding offload, based on the commonalities with another switch
  driver for which I implemented this feature (not submitted here)
- I adapted mv88e6xxx to use the new infrastructure, hopefully it still
  works but I didn't test that

The data plane of the software bridge can be partially offloaded to
switchdev, in the sense that we can trust the accelerator to:
(a) look up its FDB (which is more or less in sync with the software
    bridge FDB) for selecting the destination ports for a packet
(b) replicate the frame in hardware in case it's a multicast/broadcast,
    instead of the software bridge having to clone it and send the
    clones to each net device one at a time. This reduces the bandwidth
    needed between the CPU and the accelerator, as well as the CPU time
    spent.

The data path forwarding offload is managed per "hardware domain" - a
generalization of the "offload_fwd_mark" concept which is being
introduced in this series. Every packet is delivered only once to each
hardware domain.

In addition, Tobias said in the original cover letter:

====================
## Overview

   vlan1   vlan2
       \   /
   .-----------.
   |    br0    |
   '-----------'
   /   /   \   \
swp0 swp1 swp2 eth0
  :   :   :
  (hwdom 1)

Up to this point, switchdevs have been trusted with offloading
forwarding between bridge ports, e.g. forwarding a unicast from swp0
to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This
series extends forward offloading to include some new classes of
traffic:

- Locally originating flows, i.e. packets that ingress on br0 that are
  to be forwarded to one or several of the ports swp{0,1,2}. Notably
  this also includes routed flows, e.g. a packet ingressing swp0 on
  VLAN 1 which is then routed over to VLAN 2 by the CPU and then
  forwarded to swp1 is "locally originating" from br0's point of view.

- Flows originating from "foreign" interfaces, i.e. an interface that
  is not offloaded by a particular switchdev instance. This includes
  ports belonging to other switchdev instances. A typical example
  would be flows from eth0 towards swp{0,1,2}.

The bridge still looks up its FDB/MDB as usual and then notifies the
switchdev driver that a particular skb should be offloaded if it
matches one of the classes above. It does so by using the _accel
version of dev_queue_xmit, supplying its own netdev as the
"subordinate" device. The driver can react to the presence of the
subordinate in its .ndo_select_queue in what ever way it needs to make
sure to forward the skb in much the same way that it would for packets
ingressing on regular ports.

Hardware domains to which a particular skb has been forwarded are
recorded so that duplicates are avoided.

The main performance benefit is thus seen on multicast flows. Imagine
for example that:

- An IP camera is connected to swp0 (VLAN 1)

- The CPU is acting as a multicast router, routing the group from VLAN
  1 to VLAN 2.

- There are subscribers for the group in question behind both swp1 and
  swp2 (VLAN 2).

With this offloading in place, the bridge need only send a single skb
to the driver, which will send it to the hardware marked in such a way
that the switch will perform the multicast replication according to
the MDB configuration. Naturally, the number of saved skb_clones
increase linearly with the number of subscribed ports.

As an extra benefit, on mv88e6xxx, this also allows the switch to
perform source address learning on these flows, which avoids having to
sync dynamic FDB entries over slow configuration interfaces like MDIO
to avoid flows directed towards the CPU being flooded as unknown
unicast by the switch.


## RFC

- In general, what do you think about this idea?

- hwdom. What do you think about this terminology? Personally I feel
  that we had too many things called offload_fwd_mark, and that as the
  use of the bridge internal ID (nbp->offload_fwd_mark) expands, it
  might be useful to have a separate term for it.

- .dfwd_{add,del}_station. Am I stretching this abstraction too far,
  and if so do you have any suggestion/preference on how to signal the
  offloading from the bridge down to the switchdev driver?

- The way that flooding is implemented in br_forward.c (lazily cloning
  skbs) means that you have to mark the forwarding as completed very
  early (right after should_deliver in maybe_deliver) in order to
  avoid duplicates. Is there some way to move this decision point to a
  later stage that I am missing?

- BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not
  compatible with unicast-to-multicast being used on a port. Then
  again, I think that this would also be broken for regular switchdev
  bridge offloading as this flag is not offloaded to the switchdev
  port, so there is no way for the driver to refuse it. Any ideas on
  how to handle this?


## mv88e6xxx Specifics

Since we are now only receiving a single skb for both unicast and
multicast flows, we can tag the packets with the FORWARD command
instead of FROM_CPU. The swich(es) will then forward the packet in
accordance with its ATU, VTU, STU, and PVT configuration - just like
for packets ingressing on user ports.

Crucially, FROM_CPU is still used for:

- Ports in standalone mode.

- Flows that are trapped to the CPU and software-forwarded by a
  bridge. Note that these flows match neither of the classes discussed
  in the overview.

- Packets that are sent directly to a port netdev without going
  through the bridge, e.g. lldpd sending out PDU via an AF_PACKET
  socket.

We thus have a pretty clean separation where the data plane uses
FORWARDs and the control plane uses TO_/FROM_CPU.

The barrier between different bridges is enforced by port based VLANs
on mv88e6xxx, which in essence is a mapping from a source device/port
pair to an allowed set of egress ports. In order to have a FORWARD
frame (which carries a _source_ device/port) correctly mapped by the
PVT, we must use a unique pair for each bridge.

Fortunately, there is typically lots of unused address space in most
switch trees. When was the last time you saw an mv88e6xxx product
using more than 4 chips? Even if you found one with 16 (!) devices,
you would still have room to allocate 16*16 virtual ports to software
bridges.

Therefore, the mv88e6xxx driver will allocate a virtual device/port
pair to each bridge that it offloads. All members of the same bridge
are then configured to allow packets from this virtual port in their
PVTs.
====================

Tobias Waldekranz (4):
  net: bridge: disambiguate offload_fwd_mark
  net: bridge: switchdev: recycle unused hwdoms
  net: bridge: switchdev: allow the TX data plane forwarding to be
    offloaded
  net: dsa: tag_dsa: offload the bridge forwarding process

Vladimir Oltean (20):
  net: dpaa2-switch: use extack in dpaa2_switch_port_bridge_join
  net: dpaa2-switch: refactor prechangeupper sanity checks
  net: mlxsw: refactor prechangeupper sanity checks
  net: ocelot: fix switchdev objects synced for wrong netdev with LAG
    offload
  net: prestera: if the LAG that we're joining is under a bridge, join
    it
  net: prestera: refactor prechangeupper sanity checks
  net: bridge: switchdev: let drivers inform which bridge ports are
    offloaded
  net: prestera: guard against multiple switchdev obj replays on same
    bridge port
  net: mlxsw: guard against multiple switchdev obj replays on same
    bridge port
  net: bridge: drop context pointer from br_fdb_replay
  net: bridge: use the public notifier chain for br_fdb_replay
  net: bridge: unexport call_switchdev_blocking_notifiers
  net: bridge: propagate ctx to switchdev_port_obj_{add,del}
  net: bridge: propagate ctx to br_switchdev_port_vlan_{add,del}
  net: bridge: replay mdb entries on the public switchdev notifier chain
  net: bridge: replay vlan entries on the public switchdev notifier
  net: bridge: switchdev object replay helpers for everybody
  net: dsa: track the number of switches in a tree
  net: dsa: add support for bridge TX forwarding offload
  net: dsa: mv88e6xxx: map virtual bridges with forwarding offload in
    the PVT

 drivers/net/dsa/mv88e6xxx/chip.c              |  78 ++++-
 .../ethernet/freescale/dpaa2/dpaa2-switch.c   |  61 +++-
 .../ethernet/marvell/prestera/prestera_main.c | 107 +++++--
 .../marvell/prestera/prestera_switchdev.c     |  36 ++-
 .../marvell/prestera/prestera_switchdev.h     |   7 +-
 .../net/ethernet/mellanox/mlxsw/spectrum.c    | 275 ++++++++++-------
 .../net/ethernet/mellanox/mlxsw/spectrum.h    |   4 +
 .../mellanox/mlxsw/spectrum_switchdev.c       |  21 +-
 .../microchip/sparx5/sparx5_switchdev.c       |  41 ++-
 drivers/net/ethernet/mscc/ocelot_net.c        | 116 +++++--
 drivers/net/ethernet/rocker/rocker.h          |   6 +-
 drivers/net/ethernet/rocker/rocker_main.c     |  30 +-
 drivers/net/ethernet/rocker/rocker_ofdpa.c    |  37 ++-
 drivers/net/ethernet/ti/am65-cpsw-nuss.c      |  28 +-
 drivers/net/ethernet/ti/cpsw_new.c            |  26 +-
 include/linux/if_bridge.h                     |  55 ++--
 include/net/dsa.h                             |  22 +-
 include/net/switchdev.h                       |  20 +-
 net/bridge/br_fdb.c                           |  46 +--
 net/bridge/br_forward.c                       |   9 +
 net/bridge/br_if.c                            |  11 +-
 net/bridge/br_mdb.c                           |  51 ++--
 net/bridge/br_mrp_switchdev.c                 |  20 +-
 net/bridge/br_private.h                       |  87 +++++-
 net/bridge/br_switchdev.c                     | 284 ++++++++++++++++--
 net/bridge/br_vlan.c                          |  62 ++--
 net/dsa/dsa2.c                                |   4 +
 net/dsa/dsa_priv.h                            |   8 +-
 net/dsa/port.c                                | 172 +++++++----
 net/dsa/slave.c                               |  17 +-
 net/dsa/tag_dsa.c                             |  52 +++-
 net/switchdev/switchdev.c                     |  58 ++--
 32 files changed, 1369 insertions(+), 482 deletions(-)

Comments

Marek Behun July 12, 2021, 3:40 p.m. UTC | #1
Vladimir, on what mv88e6xxx devices are you developing this stuff?
Do you use Turris MOX for this?

Marek
Vladimir Oltean July 12, 2021, 5:01 p.m. UTC | #2
Hi Marek,

On Mon, Jul 12, 2021 at 05:40:59PM +0200, Marek Behun wrote:
> Vladimir, on what mv88e6xxx devices are you developing this stuff?
> Do you use Turris MOX for this?

I didn't develop the Marvell stuff nor did I come up with the idea
there, Tobias did. I also am not particularly interested in supporting
this for Marvell beyond making sure that the patches look simple and
okay, and pave the way for other drivers to do the same thing.

I did my testing using a different DSA driver and extra patches which
I did not post here yet. I just reposted/adapted Tobias' work because
mv88e6xxx needs less work to support the TX data plane offload, and this
framework does need a first user to get accepted, so why delay it any
further if mv88e6xxx needs 2 patches whereas other drivers need 30.

I did use the MOX for some minimal testing however, at least as far as
I could - there is this COMPHY SERDES bug in the bootloader which makes
the board fail to boot, and now the device tree workaround you gave me
does not appear to bypass the issue any longer or I didn't reaply it
properly. The point is that it isn't as easy as I would have liked to
use the MOX for testing. I was able to validate the "net: dsa: track the
number of switches in a tree" patch on MOX, though, since DSA probes
earlier than xhci, and even though the boot hung, I did put some prints
and got the expected results.
Marek Behun July 12, 2021, 5:27 p.m. UTC | #3
Hi Vladimir,

On Mon, 12 Jul 2021 17:01:21 +0000
Vladimir Oltean <vladimir.oltean@nxp.com> wrote:

> Hi Marek,
> 
> On Mon, Jul 12, 2021 at 05:40:59PM +0200, Marek Behun wrote:
> > Vladimir, on what mv88e6xxx devices are you developing this stuff?
> > Do you use Turris MOX for this?  
> 
> I didn't develop the Marvell stuff nor did I come up with the idea
> there, Tobias did. I also am not particularly interested in supporting
> this for Marvell beyond making sure that the patches look simple and
> okay, and pave the way for other drivers to do the same thing.
> 
> I did my testing using a different DSA driver and extra patches which
> I did not post here yet. I just reposted/adapted Tobias' work because
> mv88e6xxx needs less work to support the TX data plane offload, and this
> framework does need a first user to get accepted, so why delay it any
> further if mv88e6xxx needs 2 patches whereas other drivers need 30.
> 
> I did use the MOX for some minimal testing however, at least as far as
> I could - there is this COMPHY SERDES bug in the bootloader which makes
> the board fail to boot, and now the device tree workaround you gave me
> does not appear to bypass the issue any longer or I didn't reaply it
> properly.

Sorry about that. Current upstream U-Boot solves this issue, but we did
not release official update yet because there are still some things that
need to be done. We have some RCs, though.

If you are interested, it is pretty easy to upgrade:
- MTD partition "secure-firmware" needs to be flashed with
    https://gitlab.nic.cz/turris/mox-boot-builder/uploads/8d5a17fae8f6e14ca11968ee5174c7ca/trusted-secure-firmware.bin
  (this file needs to be signed by CZ.NIC)
- MTD partition "a53-firmware" (or "u-boot" in older DTS) needs to be
  flashed with
    https://secure.nic.cz/files/mbehun/a53-firmware.bin
  (this file can be built by users themselves)

> The point is that it isn't as easy as I would have liked to
> use the MOX for testing. I was able to validate the "net: dsa: track the
> number of switches in a tree" patch on MOX, though, since DSA probes
> earlier than xhci, and even though the boot hung, I did put some prints
> and got the expected results.

I will try to do some tests with this patch series.

Marek
Vladimir Oltean July 22, 2021, 9:50 a.m. UTC | #4
On Mon, Jul 12, 2021 at 07:27:11PM +0200, Marek Behun wrote:
> Hi Vladimir,
> 
> On Mon, 12 Jul 2021 17:01:21 +0000
> Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> 
> > Hi Marek,
> > 
> > On Mon, Jul 12, 2021 at 05:40:59PM +0200, Marek Behun wrote:
> > > Vladimir, on what mv88e6xxx devices are you developing this stuff?
> > > Do you use Turris MOX for this?  
> > 
> > I didn't develop the Marvell stuff nor did I come up with the idea
> > there, Tobias did. I also am not particularly interested in supporting
> > this for Marvell beyond making sure that the patches look simple and
> > okay, and pave the way for other drivers to do the same thing.
> > 
> > I did my testing using a different DSA driver and extra patches which
> > I did not post here yet. I just reposted/adapted Tobias' work because
> > mv88e6xxx needs less work to support the TX data plane offload, and this
> > framework does need a first user to get accepted, so why delay it any
> > further if mv88e6xxx needs 2 patches whereas other drivers need 30.
> > 
> > I did use the MOX for some minimal testing however, at least as far as
> > I could - there is this COMPHY SERDES bug in the bootloader which makes
> > the board fail to boot, and now the device tree workaround you gave me
> > does not appear to bypass the issue any longer or I didn't reaply it
> > properly.
> 
> Sorry about that. Current upstream U-Boot solves this issue, but we did
> not release official update yet because there are still some things that
> need to be done. We have some RCs, though.
> 
> If you are interested, it is pretty easy to upgrade:
> - MTD partition "secure-firmware" needs to be flashed with
>     https://gitlab.nic.cz/turris/mox-boot-builder/uploads/8d5a17fae8f6e14ca11968ee5174c7ca/trusted-secure-firmware.bin
>   (this file needs to be signed by CZ.NIC)
> - MTD partition "a53-firmware" (or "u-boot" in older DTS) needs to be
>   flashed with
>     https://secure.nic.cz/files/mbehun/a53-firmware.bin
>   (this file can be built by users themselves)

Thanks. This worked in the sense that I could flash the trust zone
firmware and U-Boot, and net-next will boot without hanging, but now the
board is in a boot loop, due to what appears to be watchdog timer
expiration. This happens regardless of whether CONFIG_ARMADA_37XX_WATCHDOG
is set to y or n in the booted kernel.