Message ID | 20210504182259.5042-1-Joseph.Huang@garmin.com (mailing list archive) |
---|---|
Headers | show |
Series | bridge: Fix snooping in multi-bridge config with switchdev | expand |
On 04/05/2021 21:22, Joseph Huang wrote: > This series of patches contains the following fixes: > > 1. In a distributed system with multiple hardware-offloading bridges, > if a multicast source is attached to a Non-Querier bridge, the bridge > will not forward any multicast packets from that source to the Querier. > > +--------------------+ > | | > | Snooping | +------------+ > | Bridge 1 |----| Listener 1 | > | (Querier) | +------------+ > | | > +--------------------+ > | > | > +----+---------+-----+ > | | mrouter | | > +-----------+ | +---------+ | +------------+ > | MC Source |----| Snooping |----| Listener 2 | > +-----------| | Bridge 2 | +------------+ > | (Non-Querier) | > +--------------------+ > > In this scenario, Listener 1 will never receive multicast traffic > from MC Source since Snooping Bridge 2 does not forward multicast > packets to the mrouter port. Patches 0001, 0002, and 0003 address > this issue. > > 2. If mcast_flood is disabled on a bridge port, some of the snooping > functions stop working properly. > > a. Consider the following scenario: > > +--------------------+ > | | > | Snooping | +------------+ > | Bridge 1 |----| Listener 1 | > | (Querier) | +------------+ > | | > +--------------------+ > | > | > +--------------------+ > | | mrouter | | > +-----------+ | +---------+ | > | MC Source |----| Snooping | > +-----------| | Bridge 2 | > | (Non-Querier) | > +--------------------+ > > In this scenario, Listener 1 will never receive multicast traffic > from MC Source if mcast_flood is disabled on the mrouter port on > Snooping Bridge 2. Patch 0004 addresses this issue. > > b. For a Non-Querier bridge, if mcast_flood is disabled on a bridge > port, Queries received from other Querier will not be forwarded > out of that bridge port. Patch 0005 addresses this issue. > > 3. After a system boots up, the first couple Reports are not handled > properly: > > 1) the Report from the Host is being flooded (via br_flood) to all > bridge ports, and > 2) if the mrouter port's mcast_flood is disabled, the Reports received > from other hosts will not be forwarded to the Querier. > > Patch 0006 addresses this issue. > > These patches were developed and verified initially against 5.4 kernel > (due to hardware platform limitation) and forward-patched to 5.12. > Snooping code introduced between 5.4 and 5.12 are not extensively tested > (only IGMPv2/MLDv1 were tested). The hardware platform used were two > bridges utilizing a single Marvell 88E6352 Ethernet switch chip (i.e., > no cross-chip bridging involved). > > Joseph Huang (6): > bridge: Refactor br_mdb_notify > bridge: Offload mrouter port forwarding to switchdev > bridge: Avoid traffic disruption when Querier state changes > bridge: Force mcast_flooding for mrouter ports > bridge: Flood Queries even when mcast_flood is disabled > bridge: Always multicast_flood Reports > > net/bridge/br_device.c | 5 +- > net/bridge/br_forward.c | 3 +- > net/bridge/br_input.c | 5 +- > net/bridge/br_mdb.c | 70 +++++++++++++--------- > net/bridge/br_multicast.c | 121 ++++++++++++++++++++++++++++++++++---- > net/bridge/br_private.h | 11 +++- > 6 files changed, 169 insertions(+), 46 deletions(-) > > > base-commit: 5e321ded302da4d8c5d5dd953423d9b748ab3775 > Hi, This patch-set is inappropriate for -net, if at all. It's quite late over here and I'll review the rest later, but I can say from a quick peek that patch 02 is unacceptable for it increases the complexity with 1 order of magnitude of all add/del call paths and some of them can be invoked on user packets. A lot of this functionality should be "hidden" in the driver or done by a user-space daemon/helper. Most of the flooding behaviour changes must be hidden behind some new option otherwise they'll break user setups that rely on the current. I'll review the patches in detail over the following few days, net-next is closed anyway. Cheers, Nik
> Hi, > This patch-set is inappropriate for -net, if at all. It's quite late over here and I'll > review the rest later, but I can say from a quick peek that patch 02 is > unacceptable for it increases the complexity with 1 order of magnitude of all > add/del call paths and some of them can be invoked on user packets. A lot of > this functionality should be "hidden" in the driver or done by a user-space > daemon/helper. > Most of the flooding behaviour changes must be hidden behind some new > option otherwise they'll break user setups that rely on the current. I'll review > the patches in detail over the following few days, net-next is closed anyway. > > Cheers, > Nik Hi Nik, Thanks for your quick response! Once you have a chance to review the set, please let me know how I can improve them to make them acceptable. These are real problems and we do need to fix them. Thanks, Joseph
On Tue, May 04, 2021 at 20:37, "Huang, Joseph" <Joseph.Huang@garmin.com> wrote: >> Hi, >> This patch-set is inappropriate for -net, if at all. It's quite late over here and I'll >> review the rest later, but I can say from a quick peek that patch 02 is >> unacceptable for it increases the complexity with 1 order of magnitude of all >> add/del call paths and some of them can be invoked on user packets. A lot of >> this functionality should be "hidden" in the driver or done by a user-space >> daemon/helper. >> Most of the flooding behaviour changes must be hidden behind some new >> option otherwise they'll break user setups that rely on the current. I'll review >> the patches in detail over the following few days, net-next is closed anyway. >> >> Cheers, >> Nik > > Hi Nik, > > Thanks for your quick response! > Once you have a chance to review the set, please let me know how I can improve them to make them acceptable. These are real problems and we do need to fix them. If I may make a suggestion: I also work with mv88e6xxx systems, and we have the same issues with known multicast not being flooded to router ports. Knowing that chipset, I see what you are trying to do. But other chips may work differently. Imagine for example a switch where there is a separate vector of router ports that the hardware can OR in after looking up the group in the ATU. This implementation would render the performance gains possible on that device useless. As another example, you could imagine a device where an ATU operation exists that sets a bit in the vector of every group in a particular database; instead of having to update each entry individually. I think we (mv88e6xxx) will have to accept that we need to add the proper scaffolding to manage this on the driver side. That way the bridge can stay generic. The bridge could just provide some MDB iterator to save us from having to cache all the configured groups. So basically: - In mv88e6xxx, maintain a per-switch vector of router ports. - When a ports router state is toggled: 1. Update the vector. 2. Ask the bridge to iterate through all applicable groups and update the corresponding ATU entries. - When a new MDB entry is updated, make sure to also OR in the current vector of router ports in the DPV of the ATU entry. I would be happy to help out with testing of this!
> If I may make a suggestion: I also work with mv88e6xxx systems, and we > have the same issues with known multicast not being flooded to router > ports. Knowing that chipset, I see what you are trying to do. > > But other chips may work differently. Imagine for example a switch where > there is a separate vector of router ports that the hardware can OR in after > looking up the group in the ATU. This implementation would render the > performance gains possible on that device useless. As another example, you > could imagine a device where an ATU operation exists that sets a bit in the > vector of every group in a particular database; instead of having to update > each entry individually. > > I think we (mv88e6xxx) will have to accept that we need to add the proper > scaffolding to manage this on the driver side. That way the bridge can stay > generic. The bridge could just provide some MDB iterator to save us from > having to cache all the configured groups. > > So basically: > > - In mv88e6xxx, maintain a per-switch vector of router ports. > > - When a ports router state is toggled: > 1. Update the vector. > 2. Ask the bridge to iterate through all applicable groups and update > the corresponding ATU entries. > > - When a new MDB entry is updated, make sure to also OR in the current > vector of router ports in the DPV of the ATU entry. > > > I would be happy to help out with testing of this! Thanks for the suggestion/offer! What patch 0002 does is that: - When an mrouter port is added/deleted, it iterates over the list of mdb's to add/delete that port to/from the group in the hardware (I think this is what your bullet #2 does as well, except that one is done in the bridge, and the other is done in the driver) - When a group is added/deleted, it iterates over the list of mrouter ports to add/delete the switchdev programming I think what Nik is objecting to is that with this approach, there's now a for-loop in the call paths (thus it "increases the complexity with 1 order of magnitude), however I can't think of a way to avoid the looping (whether done inside the bridge or in the driver) but still achieve the same result (for Marvell at least). I suspect that other SOHO switches might have this problem as well (Broadcom comes to mind). Thanks, Joseph
On Tue, May 04, 2021 at 23:26, "Huang, Joseph" <Joseph.Huang@garmin.com> wrote: >> If I may make a suggestion: I also work with mv88e6xxx systems, and we >> have the same issues with known multicast not being flooded to router >> ports. Knowing that chipset, I see what you are trying to do. >> >> But other chips may work differently. Imagine for example a switch where >> there is a separate vector of router ports that the hardware can OR in after >> looking up the group in the ATU. This implementation would render the >> performance gains possible on that device useless. As another example, you >> could imagine a device where an ATU operation exists that sets a bit in the >> vector of every group in a particular database; instead of having to update >> each entry individually. >> >> I think we (mv88e6xxx) will have to accept that we need to add the proper >> scaffolding to manage this on the driver side. That way the bridge can stay >> generic. The bridge could just provide some MDB iterator to save us from >> having to cache all the configured groups. >> >> So basically: >> >> - In mv88e6xxx, maintain a per-switch vector of router ports. >> >> - When a ports router state is toggled: >> 1. Update the vector. >> 2. Ask the bridge to iterate through all applicable groups and update >> the corresponding ATU entries. >> >> - When a new MDB entry is updated, make sure to also OR in the current >> vector of router ports in the DPV of the ATU entry. >> >> >> I would be happy to help out with testing of this! > > Thanks for the suggestion/offer! > > What patch 0002 does is that: > > - When an mrouter port is added/deleted, it iterates over the list of mdb's > to add/delete that port to/from the group in the hardware (I think this is > what your bullet #2 does as well, except that one is done in the bridge, > and the other is done in the driver) > > - When a group is added/deleted, it iterates over the list of mrouter ports > to add/delete the switchdev programming > > I think what Nik is objecting to is that with this approach, there's now > a for-loop in the call paths (thus it "increases the complexity with 1 order > of magnitude), however I can't think of a way to avoid the looping (whether > done inside the bridge or in the driver) but still achieve the same result > (for Marvell at least). (I will stop trying to read Nikolay's mind and go forward with my own opinions now :)) The problem with solving this at the bridge layer is that you miss out on optimizations that are available at lower layers. As an example: br0 / | \ swp0 swp1 swp2 (R) (R) With two router ports, any new group added/removed to/from swp0 would incur 3 individual ATU operations: First adding swp0, then each router port individually in your loop. If you have the vector prepared in the driver, you can batch them together in one operation. This also atomically transitions the group from unknown to known without disrupting any streams towards a router. In the bridge-layer solution, flows will still be blocked in the (admittedly small) window between adding swp0 and swp{1,2}. > I suspect that other SOHO switches might have this problem as well (Broadcom > comes to mind). I suspect you are right. That is why I suggested implementing the iterator in the bridge that can be reused by all drivers. Something along the lines of br_fdb_replay. The rest should mostly be hardware specific anyway.
On 05/05/2021 02:26, Huang, Joseph wrote: >> If I may make a suggestion: I also work with mv88e6xxx systems, and we >> have the same issues with known multicast not being flooded to router >> ports. Knowing that chipset, I see what you are trying to do. >> >> But other chips may work differently. Imagine for example a switch where >> there is a separate vector of router ports that the hardware can OR in after >> looking up the group in the ATU. This implementation would render the >> performance gains possible on that device useless. As another example, you >> could imagine a device where an ATU operation exists that sets a bit in the >> vector of every group in a particular database; instead of having to update >> each entry individually. >> >> I think we (mv88e6xxx) will have to accept that we need to add the proper >> scaffolding to manage this on the driver side. That way the bridge can stay >> generic. The bridge could just provide some MDB iterator to save us from >> having to cache all the configured groups. >> >> So basically: >> >> - In mv88e6xxx, maintain a per-switch vector of router ports. >> >> - When a ports router state is toggled: >> 1. Update the vector. >> 2. Ask the bridge to iterate through all applicable groups and update >> the corresponding ATU entries. >> >> - When a new MDB entry is updated, make sure to also OR in the current >> vector of router ports in the DPV of the ATU entry. >> >> >> I would be happy to help out with testing of this! > > Thanks for the suggestion/offer! > > What patch 0002 does is that: > > - When an mrouter port is added/deleted, it iterates over the list of mdb's > to add/delete that port to/from the group in the hardware (I think this is > what your bullet #2 does as well, except that one is done in the bridge, > and the other is done in the driver) > > - When a group is added/deleted, it iterates over the list of mrouter ports > to add/delete the switchdev programming > > I think what Nik is objecting to is that with this approach, there's now > a for-loop in the call paths (thus it "increases the complexity with 1 order > of magnitude), however I can't think of a way to avoid the looping (whether > done inside the bridge or in the driver) but still achieve the same result > (for Marvell at least). > Note that I did not say to avoid it in the switchdev driver. :) I said it should be in the driver or in some user-space helper, but it mustn't affect non-switchdev software use cases so much. You can check how mlxsw[1] deals with mdbs and router ports. [1] drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c > I suspect that other SOHO switches might have this problem as well (Broadcom > comes to mind). > > Thanks, > Joseph >