mbox series

[net,0/6] bridge: Fix snooping in multi-bridge config with switchdev

Message ID 20210504182259.5042-1-Joseph.Huang@garmin.com (mailing list archive)
Headers show
Series bridge: Fix snooping in multi-bridge config with switchdev | expand

Message

Joseph Huang May 4, 2021, 6:22 p.m. UTC
This series of patches contains the following fixes:

1. In a distributed system with multiple hardware-offloading bridges,
   if a multicast source is attached to a Non-Querier bridge, the bridge
   will not forward any multicast packets from that source to the Querier.

                    +--------------------+
                    |                    |
                    |      Snooping      |    +------------+
                    |      Bridge 1      |----| Listener 1 |
                    |     (Querier)      |    +------------+
                    |                    |
                    +--------------------+
                              |
                              |
                    +----+---------+-----+
                    |    | mrouter |     |
   +-----------+    |    +---------+     |    +------------+
   | MC Source |----|      Snooping      |----| Listener 2 |
   +-----------|    |      Bridge 2      |    +------------+
                    |    (Non-Querier)   |
                    +--------------------+

   In this scenario, Listener 1 will never receive multicast traffic
   from MC Source since Snooping Bridge 2 does not forward multicast
   packets to the mrouter port. Patches 0001, 0002, and 0003 address
   this issue.

2. If mcast_flood is disabled on a bridge port, some of the snooping
   functions stop working properly.

   a. Consider the following scenario:

                       +--------------------+
                       |                    |
                       |      Snooping      |    +------------+
                       |      Bridge 1      |----| Listener 1 |
                       |     (Querier)      |    +------------+
                       |                    |
                       +--------------------+
                                 |
                                 |
                       +--------------------+
                       |    | mrouter |     |
      +-----------+    |    +---------+     |
      | MC Source |----|      Snooping      |
      +-----------|    |      Bridge 2      |
                       |    (Non-Querier)   |
                       +--------------------+

      In this scenario, Listener 1 will never receive multicast traffic
      from MC Source if mcast_flood is disabled on the mrouter port on
      Snooping Bridge 2. Patch 0004 addresses this issue.

   b. For a Non-Querier bridge, if mcast_flood is disabled on a bridge
      port, Queries received from other Querier will not be forwarded
      out of that bridge port. Patch 0005 addresses this issue.

3. After a system boots up, the first couple Reports are not handled
   properly:

   1) the Report from the Host is being flooded (via br_flood) to all
      bridge ports, and
   2) if the mrouter port's mcast_flood is disabled, the Reports received
      from other hosts will not be forwarded to the Querier.

   Patch 0006 addresses this issue.

These patches were developed and verified initially against 5.4 kernel
(due to hardware platform limitation) and forward-patched to 5.12.
Snooping code introduced between 5.4 and 5.12 are not extensively tested
(only IGMPv2/MLDv1 were tested). The hardware platform used were two
bridges utilizing a single Marvell 88E6352 Ethernet switch chip (i.e.,
no cross-chip bridging involved).

Joseph Huang (6):
  bridge: Refactor br_mdb_notify
  bridge: Offload mrouter port forwarding to switchdev
  bridge: Avoid traffic disruption when Querier state changes
  bridge: Force mcast_flooding for mrouter ports
  bridge: Flood Queries even when mcast_flood is disabled
  bridge: Always multicast_flood Reports

 net/bridge/br_device.c    |   5 +-
 net/bridge/br_forward.c   |   3 +-
 net/bridge/br_input.c     |   5 +-
 net/bridge/br_mdb.c       |  70 +++++++++++++---------
 net/bridge/br_multicast.c | 121 ++++++++++++++++++++++++++++++++++----
 net/bridge/br_private.h   |  11 +++-
 6 files changed, 169 insertions(+), 46 deletions(-)


base-commit: 5e321ded302da4d8c5d5dd953423d9b748ab3775

Comments

Nikolay Aleksandrov May 4, 2021, 8:05 p.m. UTC | #1
On 04/05/2021 21:22, Joseph Huang wrote:
> This series of patches contains the following fixes:
> 
> 1. In a distributed system with multiple hardware-offloading bridges,
>    if a multicast source is attached to a Non-Querier bridge, the bridge
>    will not forward any multicast packets from that source to the Querier.
> 
>                     +--------------------+
>                     |                    |
>                     |      Snooping      |    +------------+
>                     |      Bridge 1      |----| Listener 1 |
>                     |     (Querier)      |    +------------+
>                     |                    |
>                     +--------------------+
>                               |
>                               |
>                     +----+---------+-----+
>                     |    | mrouter |     |
>    +-----------+    |    +---------+     |    +------------+
>    | MC Source |----|      Snooping      |----| Listener 2 |
>    +-----------|    |      Bridge 2      |    +------------+
>                     |    (Non-Querier)   |
>                     +--------------------+
> 
>    In this scenario, Listener 1 will never receive multicast traffic
>    from MC Source since Snooping Bridge 2 does not forward multicast
>    packets to the mrouter port. Patches 0001, 0002, and 0003 address
>    this issue.
> 
> 2. If mcast_flood is disabled on a bridge port, some of the snooping
>    functions stop working properly.
> 
>    a. Consider the following scenario:
> 
>                        +--------------------+
>                        |                    |
>                        |      Snooping      |    +------------+
>                        |      Bridge 1      |----| Listener 1 |
>                        |     (Querier)      |    +------------+
>                        |                    |
>                        +--------------------+
>                                  |
>                                  |
>                        +--------------------+
>                        |    | mrouter |     |
>       +-----------+    |    +---------+     |
>       | MC Source |----|      Snooping      |
>       +-----------|    |      Bridge 2      |
>                        |    (Non-Querier)   |
>                        +--------------------+
> 
>       In this scenario, Listener 1 will never receive multicast traffic
>       from MC Source if mcast_flood is disabled on the mrouter port on
>       Snooping Bridge 2. Patch 0004 addresses this issue.
> 
>    b. For a Non-Querier bridge, if mcast_flood is disabled on a bridge
>       port, Queries received from other Querier will not be forwarded
>       out of that bridge port. Patch 0005 addresses this issue.
> 
> 3. After a system boots up, the first couple Reports are not handled
>    properly:
> 
>    1) the Report from the Host is being flooded (via br_flood) to all
>       bridge ports, and
>    2) if the mrouter port's mcast_flood is disabled, the Reports received
>       from other hosts will not be forwarded to the Querier.
> 
>    Patch 0006 addresses this issue.
> 
> These patches were developed and verified initially against 5.4 kernel
> (due to hardware platform limitation) and forward-patched to 5.12.
> Snooping code introduced between 5.4 and 5.12 are not extensively tested
> (only IGMPv2/MLDv1 were tested). The hardware platform used were two
> bridges utilizing a single Marvell 88E6352 Ethernet switch chip (i.e.,
> no cross-chip bridging involved).
> 
> Joseph Huang (6):
>   bridge: Refactor br_mdb_notify
>   bridge: Offload mrouter port forwarding to switchdev
>   bridge: Avoid traffic disruption when Querier state changes
>   bridge: Force mcast_flooding for mrouter ports
>   bridge: Flood Queries even when mcast_flood is disabled
>   bridge: Always multicast_flood Reports
> 
>  net/bridge/br_device.c    |   5 +-
>  net/bridge/br_forward.c   |   3 +-
>  net/bridge/br_input.c     |   5 +-
>  net/bridge/br_mdb.c       |  70 +++++++++++++---------
>  net/bridge/br_multicast.c | 121 ++++++++++++++++++++++++++++++++++----
>  net/bridge/br_private.h   |  11 +++-
>  6 files changed, 169 insertions(+), 46 deletions(-)
> 
> 
> base-commit: 5e321ded302da4d8c5d5dd953423d9b748ab3775
> 

Hi,
This patch-set is inappropriate for -net, if at all. It's quite late over here
and I'll review the rest later, but I can say from a quick peek that patch 02
is unacceptable for it increases the complexity with 1 order of magnitude of all
add/del call paths and some of them can be invoked on user packets. A lot of this
functionality should be "hidden" in the driver or done by a user-space daemon/helper.
Most of the flooding behaviour changes must be hidden behind some new option
otherwise they'll break user setups that rely on the current. I'll review the patches
in detail over the following few days, net-next is closed anyway.

Cheers,
 Nik
Joseph Huang May 4, 2021, 8:37 p.m. UTC | #2
> Hi,
> This patch-set is inappropriate for -net, if at all. It's quite late over here and I'll
> review the rest later, but I can say from a quick peek that patch 02 is
> unacceptable for it increases the complexity with 1 order of magnitude of all
> add/del call paths and some of them can be invoked on user packets. A lot of
> this functionality should be "hidden" in the driver or done by a user-space
> daemon/helper.
> Most of the flooding behaviour changes must be hidden behind some new
> option otherwise they'll break user setups that rely on the current. I'll review
> the patches in detail over the following few days, net-next is closed anyway.
> 
> Cheers,
>  Nik

Hi Nik,

Thanks for your quick response!
Once you have a chance to review the set, please let me know how I can improve them to make them acceptable. These are real problems and we do need to fix them.

Thanks,
Joseph
Tobias Waldekranz May 4, 2021, 10:29 p.m. UTC | #3
On Tue, May 04, 2021 at 20:37, "Huang, Joseph" <Joseph.Huang@garmin.com> wrote:
>> Hi,
>> This patch-set is inappropriate for -net, if at all. It's quite late over here and I'll
>> review the rest later, but I can say from a quick peek that patch 02 is
>> unacceptable for it increases the complexity with 1 order of magnitude of all
>> add/del call paths and some of them can be invoked on user packets. A lot of
>> this functionality should be "hidden" in the driver or done by a user-space
>> daemon/helper.
>> Most of the flooding behaviour changes must be hidden behind some new
>> option otherwise they'll break user setups that rely on the current. I'll review
>> the patches in detail over the following few days, net-next is closed anyway.
>> 
>> Cheers,
>>  Nik
>
> Hi Nik,
>
> Thanks for your quick response!
> Once you have a chance to review the set, please let me know how I can improve them to make them acceptable. These are real problems and we do need to fix them.

If I may make a suggestion: I also work with mv88e6xxx systems, and we
have the same issues with known multicast not being flooded to router
ports. Knowing that chipset, I see what you are trying to do.

But other chips may work differently. Imagine for example a switch where
there is a separate vector of router ports that the hardware can OR in
after looking up the group in the ATU. This implementation would render
the performance gains possible on that device useless. As another
example, you could imagine a device where an ATU operation exists that
sets a bit in the vector of every group in a particular database;
instead of having to update each entry individually.

I think we (mv88e6xxx) will have to accept that we need to add the
proper scaffolding to manage this on the driver side. That way the
bridge can stay generic. The bridge could just provide some MDB iterator
to save us from having to cache all the configured groups.

So basically:

- In mv88e6xxx, maintain a per-switch vector of router ports.

- When a ports router state is toggled:
  1. Update the vector.
  2. Ask the bridge to iterate through all applicable groups and update
     the corresponding ATU entries.

- When a new MDB entry is updated, make sure to also OR in the current
  vector of router ports in the DPV of the ATU entry.


I would be happy to help out with testing of this!
Joseph Huang May 4, 2021, 11:26 p.m. UTC | #4
> If I may make a suggestion: I also work with mv88e6xxx systems, and we
> have the same issues with known multicast not being flooded to router
> ports. Knowing that chipset, I see what you are trying to do.
> 
> But other chips may work differently. Imagine for example a switch where
> there is a separate vector of router ports that the hardware can OR in after
> looking up the group in the ATU. This implementation would render the
> performance gains possible on that device useless. As another example, you
> could imagine a device where an ATU operation exists that sets a bit in the
> vector of every group in a particular database; instead of having to update
> each entry individually.
> 
> I think we (mv88e6xxx) will have to accept that we need to add the proper
> scaffolding to manage this on the driver side. That way the bridge can stay
> generic. The bridge could just provide some MDB iterator to save us from
> having to cache all the configured groups.
> 
> So basically:
> 
> - In mv88e6xxx, maintain a per-switch vector of router ports.
> 
> - When a ports router state is toggled:
>   1. Update the vector.
>   2. Ask the bridge to iterate through all applicable groups and update
>      the corresponding ATU entries.
> 
> - When a new MDB entry is updated, make sure to also OR in the current
>   vector of router ports in the DPV of the ATU entry.
> 
> 
> I would be happy to help out with testing of this!

Thanks for the suggestion/offer!

What patch 0002 does is that:

- When an mrouter port is added/deleted, it iterates over the list of mdb's
  to add/delete that port to/from the group in the hardware (I think this is
  what your bullet #2 does as well, except that one is done in the bridge,
  and the other is done in the driver)

- When a group is added/deleted, it iterates over the list of mrouter ports
  to add/delete the switchdev programming

I think what Nik is objecting to is that with this approach, there's now
a for-loop in the call paths (thus it "increases the complexity with 1 order
of magnitude), however I can't think of a way to avoid the looping (whether
done inside the bridge or in the driver) but still achieve the same result
(for Marvell at least).

I suspect that other SOHO switches might have this problem as well (Broadcom
comes to mind).

Thanks,
Joseph
Tobias Waldekranz May 5, 2021, 6:52 a.m. UTC | #5
On Tue, May 04, 2021 at 23:26, "Huang, Joseph" <Joseph.Huang@garmin.com> wrote:
>> If I may make a suggestion: I also work with mv88e6xxx systems, and we
>> have the same issues with known multicast not being flooded to router
>> ports. Knowing that chipset, I see what you are trying to do.
>> 
>> But other chips may work differently. Imagine for example a switch where
>> there is a separate vector of router ports that the hardware can OR in after
>> looking up the group in the ATU. This implementation would render the
>> performance gains possible on that device useless. As another example, you
>> could imagine a device where an ATU operation exists that sets a bit in the
>> vector of every group in a particular database; instead of having to update
>> each entry individually.
>> 
>> I think we (mv88e6xxx) will have to accept that we need to add the proper
>> scaffolding to manage this on the driver side. That way the bridge can stay
>> generic. The bridge could just provide some MDB iterator to save us from
>> having to cache all the configured groups.
>> 
>> So basically:
>> 
>> - In mv88e6xxx, maintain a per-switch vector of router ports.
>> 
>> - When a ports router state is toggled:
>>   1. Update the vector.
>>   2. Ask the bridge to iterate through all applicable groups and update
>>      the corresponding ATU entries.
>> 
>> - When a new MDB entry is updated, make sure to also OR in the current
>>   vector of router ports in the DPV of the ATU entry.
>> 
>> 
>> I would be happy to help out with testing of this!
>
> Thanks for the suggestion/offer!
>
> What patch 0002 does is that:
>
> - When an mrouter port is added/deleted, it iterates over the list of mdb's
>   to add/delete that port to/from the group in the hardware (I think this is
>   what your bullet #2 does as well, except that one is done in the bridge,
>   and the other is done in the driver)
>
> - When a group is added/deleted, it iterates over the list of mrouter ports
>   to add/delete the switchdev programming
>
> I think what Nik is objecting to is that with this approach, there's now
> a for-loop in the call paths (thus it "increases the complexity with 1 order
> of magnitude), however I can't think of a way to avoid the looping (whether
> done inside the bridge or in the driver) but still achieve the same result
> (for Marvell at least).

(I will stop trying to read Nikolay's mind and go forward with my own
opinions now :))

The problem with solving this at the bridge layer is that you miss out
on optimizations that are available at lower layers. As an example:

      br0
    /  |  \
swp0 swp1 swp2
     (R)  (R)

With two router ports, any new group added/removed to/from swp0 would
incur 3 individual ATU operations: First adding swp0, then each router
port individually in your loop. If you have the vector prepared in the
driver, you can batch them together in one operation.

This also atomically transitions the group from unknown to known without
disrupting any streams towards a router. In the bridge-layer solution,
flows will still be blocked in the (admittedly small) window between
adding swp0 and swp{1,2}.

> I suspect that other SOHO switches might have this problem as well (Broadcom
> comes to mind).

I suspect you are right. That is why I suggested implementing the
iterator in the bridge that can be reused by all drivers. Something
along the lines of br_fdb_replay. The rest should mostly be hardware
specific anyway.
Nikolay Aleksandrov May 5, 2021, 6:59 a.m. UTC | #6
On 05/05/2021 02:26, Huang, Joseph wrote:
>> If I may make a suggestion: I also work with mv88e6xxx systems, and we
>> have the same issues with known multicast not being flooded to router
>> ports. Knowing that chipset, I see what you are trying to do.
>>
>> But other chips may work differently. Imagine for example a switch where
>> there is a separate vector of router ports that the hardware can OR in after
>> looking up the group in the ATU. This implementation would render the
>> performance gains possible on that device useless. As another example, you
>> could imagine a device where an ATU operation exists that sets a bit in the
>> vector of every group in a particular database; instead of having to update
>> each entry individually.
>>
>> I think we (mv88e6xxx) will have to accept that we need to add the proper
>> scaffolding to manage this on the driver side. That way the bridge can stay
>> generic. The bridge could just provide some MDB iterator to save us from
>> having to cache all the configured groups.
>>
>> So basically:
>>
>> - In mv88e6xxx, maintain a per-switch vector of router ports.
>>
>> - When a ports router state is toggled:
>>   1. Update the vector.
>>   2. Ask the bridge to iterate through all applicable groups and update
>>      the corresponding ATU entries.
>>
>> - When a new MDB entry is updated, make sure to also OR in the current
>>   vector of router ports in the DPV of the ATU entry.
>>
>>
>> I would be happy to help out with testing of this!
> 
> Thanks for the suggestion/offer!
> 
> What patch 0002 does is that:
> 
> - When an mrouter port is added/deleted, it iterates over the list of mdb's
>   to add/delete that port to/from the group in the hardware (I think this is
>   what your bullet #2 does as well, except that one is done in the bridge,
>   and the other is done in the driver)
> 
> - When a group is added/deleted, it iterates over the list of mrouter ports
>   to add/delete the switchdev programming
> 
> I think what Nik is objecting to is that with this approach, there's now
> a for-loop in the call paths (thus it "increases the complexity with 1 order
> of magnitude), however I can't think of a way to avoid the looping (whether
> done inside the bridge or in the driver) but still achieve the same result
> (for Marvell at least).
> 

Note that I did not say to avoid it in the switchdev driver. :)
I said it should be in the driver or in some user-space helper, but it mustn't
affect non-switchdev software use cases so much.

You can check how mlxsw[1] deals with mdbs and router ports.

[1] drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c

> I suspect that other SOHO switches might have this problem as well (Broadcom
> comes to mind).
> 
> Thanks,
> Joseph
>