[RFC,net-next,00/10] MC Flood disable and snooping

Message ID	20240402001137.2980589-1-Joseph.Huang@garmin.com (mailing list archive)
Headers	show Received: from mx0b-000eb902.pphosted.com (mx0b-000eb902.pphosted.com [205.220.177.212]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4FF9510E4; Tue, 2 Apr 2024 00:12:53 +0000 (UTC) Received-SPF: Pass (protection.outlook.com: domain of garmin.com designates 204.77.163.244 as permitted sender) receiver=protection.outlook.com; client-ip=204.77.163.244; helo=edgetransport.garmin.com; pr=C From: Joseph Huang <Joseph.Huang@garmin.com> To: <netdev@vger.kernel.org> CC: Joseph Huang <Joseph.Huang@garmin.com>, Andrew Lunn <andrew@lunn.ch>, Florian Fainelli <f.fainelli@gmail.com>, Vladimir Oltean <olteanv@gmail.com>, "David S. Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, "Roopa Prabhu" <roopa@nvidia.com>, Nikolay Aleksandrov <razor@blackwall.org>, =?utf-8?q?Linus_L=C3=BCssing?= <linus.luessing@c0d3.blue>, <linux-kernel@vger.kernel.org>, <bridge@lists.linux.dev> Subject: [PATCH RFC net-next 00/10] MC Flood disable and snooping Date: Mon, 1 Apr 2024 20:10:59 -0400 Message-ID: <20240402001137.2980589-1-Joseph.Huang@garmin.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain
Series	MC Flood disable and snooping \| expand [RFC,net-next,00/10] MC Flood disable and snooping [RFC,net-next,01/10] net: bridge: Flood Queries even when mc flood is disabled [RFC,net-next,02/10] net: bridge: Always multicast_flood Reports [RFC,net-next,03/10] net: bridge: Always flood local subnet mc packets [RFC,net-next,04/10] net: dsa: mv88e6xxx: Add all hosts mc addr to ATU [RFC,net-next,05/10] net: dsa: Add support for PORT_MROUTER attribute [RFC,net-next,06/10] net: dsa: mv88e6xxx: Track soft bridge objects [RFC,net-next,07/10] net: dsa: mv88e6xxx: Track bridge mdb objects [RFC,net-next,08/10] net: dsa: mv88e6xxx: Convert MAB to use bit flags [RFC,net-next,09/10] net: dsa: mv88e6xxx: Enable mc flood for mrouter port [RFC,net-next,10/10] net: dsa: mv88e6xxx: Offload mrouter port

Joseph Huang April 2, 2024, 12:10 a.m. UTC

There is a use case where one would like to enable multicast snooping
on a bridge but disable multicast flooding on all bridge ports so that
registered multicast traffic will only reach the intended recipients and
unregistered multicast traffic will be dropped. However, with existing
bridge ports' mcast_flood flag implementation, it doesn't work as desired.

This patchset aims to make multicast snooping work even when multicast
flooding is disabled on the bridge ports, without changing the semantic of
the mcast_flood flag too much. Patches 1 to 4 attempt to address this issue.

Also, in a network where more than one multicast snooping capable bridges
are interconnected without multicast routers being present, multicast
snooping fails if:

  1. The source is not directly attached to the Querier
  2. The listener is beyond the mrouter port of the bridge where the
     source is directly attached
  3. A hardware offloading switch is involved

When all of the conditions are met, the listener will not receive any
multicast packets from the source. Patches 5 to 10 attempt to address this
issue. Specifically, patches 5 to 8 set up the infrastructure, patch 9
handles unregistered multicast packets forwarding, and patch 10 handles
registered multicast packets forwarding to the mrouter port.

The patches were developed against 5.15, and forward-ported to 6.8.
Tests were done on a Pi 4B + Marvell 6393X Eval board with a single
switch chip with no VLAN.

V1 -> V2:
- Moved the bulk of the change from the bridge to the mv88e6xxx driver.
- Added more patches (specifically 3 and 4) to workaround some more
  issues with multicast flooding being disabled.

v1 here:
https://patchwork.kernel.org/project/netdevbpf/cover/20210504182259.5042-1-Joseph.Huang@garmin.com/


Joseph Huang (10):
  net: bridge: Flood Queries even when mc flood is disabled
  net: bridge: Always multicast_flood Reports
  net: bridge: Always flood local subnet mc packets
  net: dsa: mv88e6xxx: Add all hosts mc addr to ATU
  net: dsa: Add support for PORT_MROUTER attribute
  net: dsa: mv88e6xxx: Track soft bridge objects
  net: dsa: mv88e6xxx: Track bridge mdb objects
  net: dsa: mv88e6xxx: Convert MAB to use bit flags
  net: dsa: mv88e6xxx: Enable mc flood for mrouter port
  net: dsa: mv88e6xxx: Offload mrouter port

 drivers/net/dsa/mv88e6xxx/Kconfig       |  12 +
 drivers/net/dsa/mv88e6xxx/chip.c        | 439 +++++++++++++++++++++++-
 drivers/net/dsa/mv88e6xxx/chip.h        |  16 +-
 drivers/net/dsa/mv88e6xxx/global1_atu.c |   3 +-
 include/net/dsa.h                       |   2 +
 net/bridge/br_device.c                  |   5 +-
 net/bridge/br_forward.c                 |   3 +-
 net/bridge/br_input.c                   |   5 +-
 net/bridge/br_multicast.c               |  21 +-
 net/bridge/br_private.h                 |   6 +
 net/dsa/port.c                          |  11 +
 net/dsa/port.h                          |   2 +
 net/dsa/user.c                          |   6 +
 13 files changed, 502 insertions(+), 29 deletions(-)

Nikolay Aleksandrov April 2, 2024, 9:28 a.m. UTC | #1

On 4/2/24 03:10, Joseph Huang wrote:
> There is a use case where one would like to enable multicast snooping
> on a bridge but disable multicast flooding on all bridge ports so that
> registered multicast traffic will only reach the intended recipients and
> unregistered multicast traffic will be dropped. However, with existing
> bridge ports' mcast_flood flag implementation, it doesn't work as desired.
> 
> This patchset aims to make multicast snooping work even when multicast
> flooding is disabled on the bridge ports, without changing the semantic of
> the mcast_flood flag too much. Patches 1 to 4 attempt to address this issue.
> 
> Also, in a network where more than one multicast snooping capable bridges
> are interconnected without multicast routers being present, multicast
> snooping fails if:
> 
>    1. The source is not directly attached to the Querier
>    2. The listener is beyond the mrouter port of the bridge where the
>       source is directly attached
>    3. A hardware offloading switch is involved
> 
> When all of the conditions are met, the listener will not receive any
> multicast packets from the source. Patches 5 to 10 attempt to address this
> issue. Specifically, patches 5 to 8 set up the infrastructure, patch 9
> handles unregistered multicast packets forwarding, and patch 10 handles
> registered multicast packets forwarding to the mrouter port.
> 
> The patches were developed against 5.15, and forward-ported to 6.8.
> Tests were done on a Pi 4B + Marvell 6393X Eval board with a single
> switch chip with no VLAN.
> 
> V1 -> V2:
> - Moved the bulk of the change from the bridge to the mv88e6xxx driver.
> - Added more patches (specifically 3 and 4) to workaround some more
>    issues with multicast flooding being disabled.
> 
> v1 here:
> https://patchwork.kernel.org/project/netdevbpf/cover/20210504182259.5042-1-Joseph.Huang@garmin.com/
> 

For the bridge patches:
Nacked-by: Nikolay Aleksandrov <razor@blackwall.org>

You cannot break the multicast flood flag to add support for a custom
use-case. This is unacceptable. The current bridge behaviour is correct
your patch 02 doesn't fix anything, you should configure the bridge
properly to avoid all those problems, not break protocols.

Your special use case can easily be solved by a user-space helper or
eBPF and nftables. You can set the mcast flood flag and bypass the
bridge for these packets. I basically said the same in 2021, if this is
going to be in the bridge it should be hidden behind an option that is
default off. But in my opinion adding an option to solve such special
cases is undesirable, they can be easily solved with what's currently
available.

Andrew Lunn April 2, 2024, 12:43 p.m. UTC | #2

On Mon, Apr 01, 2024 at 08:10:59PM -0400, Joseph Huang wrote:
> There is a use case where one would like to enable multicast snooping
> on a bridge but disable multicast flooding on all bridge ports so that
> registered multicast traffic will only reach the intended recipients and
> unregistered multicast traffic will be dropped. However, with existing
> bridge ports' mcast_flood flag implementation, it doesn't work as desired.
> 
> This patchset aims to make multicast snooping work even when multicast
> flooding is disabled on the bridge ports, without changing the semantic of
> the mcast_flood flag too much. Patches 1 to 4 attempt to address this issue.
> 
> Also, in a network where more than one multicast snooping capable bridges
> are interconnected without multicast routers being present, multicast
> snooping fails if:
> 
>   1. The source is not directly attached to the Querier
>   2. The listener is beyond the mrouter port of the bridge where the
>      source is directly attached
>   3. A hardware offloading switch is involved

I've not studied the details here, but that last point makes me think
the offload driver is broken. There should not be any difference
between software bridging and hardware bridging. The whole idea is
that you take what Linux can do in software and accelerate it by
offloading to hardware. Doing acceleration should not change the
behaviour.

> The patches were developed against 5.15, and forward-ported to 6.8.
> Tests were done on a Pi 4B + Marvell 6393X Eval board with a single
> switch chip with no VLAN.

So what is the mv88e6xxx driver doing wrong?

	Andrew

Vladimir Oltean April 2, 2024, 5:43 p.m. UTC | #3

Hi Nikolai,

On Tue, Apr 02, 2024 at 12:28:38PM +0300, Nikolay Aleksandrov wrote:
> For the bridge patches:
> Nacked-by: Nikolay Aleksandrov <razor@blackwall.org>
> 
> You cannot break the multicast flood flag to add support for a custom
> use-case. This is unacceptable. The current bridge behaviour is correct
> your patch 02 doesn't fix anything, you should configure the bridge
> properly to avoid all those problems, not break protocols.
> 
> Your special use case can easily be solved by a user-space helper or
> eBPF and nftables. You can set the mcast flood flag and bypass the
> bridge for these packets. I basically said the same in 2021, if this is
> going to be in the bridge it should be hidden behind an option that is
> default off. But in my opinion adding an option to solve such special
> cases is undesirable, they can be easily solved with what's currently
> available.

I appreciate your time is limited, but could you please translate your
suggestion, and detail your proposed alternative a bit, for those of us
who are not very familiar with IP multicast snooping?

Bypass the bridge for which packets? General IGMP/MLD queries? Wouldn't
that break snooping? And then do what with the packets, forward them in
another software layer than the bridge?

I also don't quite understand the suggestion of turning on mcast flooding:
isn't Joseph saying that he wants it off for the unregistered multicast
data traffic?

Nikolay Aleksandrov April 2, 2024, 6:50 p.m. UTC | #4

On 4/2/24 20:43, Vladimir Oltean wrote:
> Hi Nikolai,
> 
> On Tue, Apr 02, 2024 at 12:28:38PM +0300, Nikolay Aleksandrov wrote:
>> For the bridge patches:
>> Nacked-by: Nikolay Aleksandrov <razor@blackwall.org>
>>
>> You cannot break the multicast flood flag to add support for a custom
>> use-case. This is unacceptable. The current bridge behaviour is correct
>> your patch 02 doesn't fix anything, you should configure the bridge
>> properly to avoid all those problems, not break protocols.
>>
>> Your special use case can easily be solved by a user-space helper or
>> eBPF and nftables. You can set the mcast flood flag and bypass the
>> bridge for these packets. I basically said the same in 2021, if this is
>> going to be in the bridge it should be hidden behind an option that is
>> default off. But in my opinion adding an option to solve such special
>> cases is undesirable, they can be easily solved with what's currently
>> available.
> 
> I appreciate your time is limited, but could you please translate your
> suggestion, and detail your proposed alternative a bit, for those of us
> who are not very familiar with IP multicast snooping?
> 

My suggestion is not related to snooping really, but to the goal of
patches 01-03. The bridge patches in this set are trying to forward
traffic that is not supposed to be forwarded with the proposed
configuration, so that can be done by a user-space helper that installs
rules to bypass the bridge specifically for those packets while
monitoring the bridge state to implement a policy and manage these rules
in order to keep snooping working.

> Bypass the bridge for which packets? General IGMP/MLD queries? Wouldn't
> that break snooping? And then do what with the packets, forward them in
> another software layer than the bridge?
> 

The ones that are not supposed to be forwarded in the proposed config
and are needed for this use case (control traffic and link-local). 
Obviously to have proper snooping you'd need to manage these bypass
rules and use them only while needed.

> I also don't quite understand the suggestion of turning on mcast flooding:
> isn't Joseph saying that he wants it off for the unregistered multicast
> data traffic?

Ah my bad, I meant to turn off flooding and bypass the bridge for those
packets and ports while necessary, under necessary can be any policy
that the user-space helper wants to implement.

In any case, if this is going to be yet another kernel solution then it
must be a new option that is default off, and doesn't break current 
mcast flood flag behaviour.

In general my opinion is that the whole snooping control must be in
user-space and only have the dataplane in the kernel, but that is beyond
the scope of this set.

Vladimir Oltean April 2, 2024, 8:46 p.m. UTC | #5

On Tue, Apr 02, 2024 at 09:50:51PM +0300, Nikolay Aleksandrov wrote:
> On 4/2/24 20:43, Vladimir Oltean wrote:
> > Hi Nikolai,
> > 
> > On Tue, Apr 02, 2024 at 12:28:38PM +0300, Nikolay Aleksandrov wrote:
> > > For the bridge patches:
> > > Nacked-by: Nikolay Aleksandrov <razor@blackwall.org>
> > > 
> > > You cannot break the multicast flood flag to add support for a custom
> > > use-case. This is unacceptable. The current bridge behaviour is correct
> > > your patch 02 doesn't fix anything, you should configure the bridge
> > > properly to avoid all those problems, not break protocols.
> > > 
> > > Your special use case can easily be solved by a user-space helper or
> > > eBPF and nftables. You can set the mcast flood flag and bypass the
> > > bridge for these packets. I basically said the same in 2021, if this is
> > > going to be in the bridge it should be hidden behind an option that is
> > > default off. But in my opinion adding an option to solve such special
> > > cases is undesirable, they can be easily solved with what's currently
> > > available.
> > 
> > I appreciate your time is limited, but could you please translate your
> > suggestion, and detail your proposed alternative a bit, for those of us
> > who are not very familiar with IP multicast snooping?
> > 
> 
> My suggestion is not related to snooping really, but to the goal of
> patches 01-03. The bridge patches in this set are trying to forward
> traffic that is not supposed to be forwarded with the proposed
> configuration,

Correct up to a point. Reinterpreting the given user space configuration
and trying to make it do something else seems like a mistake, but in
principle one could also look at alternative bridge configurations like
the one I described here:
https://lore.kernel.org/netdev/20240402180805.yhhwj2f52sdc4dl2@skbuf/

> so that can be done by a user-space helper that installs
> rules to bypass the bridge specifically for those packets while
> monitoring the bridge state to implement a policy and manage these rules
> in order to keep snooping working.
> 
> > Bypass the bridge for which packets? General IGMP/MLD queries? Wouldn't
> > that break snooping? And then do what with the packets, forward them in
> > another software layer than the bridge?
> > 
> 
> The ones that are not supposed to be forwarded in the proposed config
> and are needed for this use case (control traffic and link-local). Obviously
> to have proper snooping you'd need to manage these bypass
> rules and use them only while needed.

I think Joseph will end up in a situation where he needs IGMP control
messages both in the bridge data path and outside of it :)

Also, your proposal eliminates the possibility of cooperating with a
hardware accelerator which can forward the IGMP messages where they need
to go.

As far as I understand, I don't think Joseph has a very "special" use case.
Disabling flooding of unregistered multicast in the data plane sounds
reasonable. There seems to be a gap in the bridge API, in that this
operation also affects the control plane, which he is trying to fix with
this "force flooding", because of insufficiently fine grained control.

> > I also don't quite understand the suggestion of turning on mcast flooding:
> > isn't Joseph saying that he wants it off for the unregistered multicast
> > data traffic?
> 
> Ah my bad, I meant to turn off flooding and bypass the bridge for those
> packets and ports while necessary, under necessary can be any policy
> that the user-space helper wants to implement.
> 
> In any case, if this is going to be yet another kernel solution then it
> must be a new option that is default off, and doesn't break current mcast
> flood flag behaviour.

Yeah, maybe something like this, simple and with clear offload
semantics, as seen in existing hardware (not Marvell though):

mcast_flood == off:
- mcast_ipv4_ctrl_flood: don't care (maybe can force to "off")
- mcast_ipv4_data_flood: don't care
- mcast_ipv6_ctrl_flood: don't care
- mcast_ipv6_data_flood: don't care
- mcast_l2_flood: don't care
mcast_flood == on:
- Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
- Flood all other IPv4 multicast according to mcast_ipv4_data_flood
- Flood ff02::/16 according to mcast_ipv6_ctrl_flood
- Flood all other IPv6 multicast according to mcast_ipv6_data_flood
- Flood L2 according to mcast_l2_flood

Nikolay Aleksandrov April 2, 2024, 9:59 p.m. UTC | #6

On 4/2/24 23:46, Vladimir Oltean wrote:
> On Tue, Apr 02, 2024 at 09:50:51PM +0300, Nikolay Aleksandrov wrote:
>> On 4/2/24 20:43, Vladimir Oltean wrote:
>>> Hi Nikolai,
>>>
>>> On Tue, Apr 02, 2024 at 12:28:38PM +0300, Nikolay Aleksandrov wrote:
>>>> For the bridge patches:
>>>> Nacked-by: Nikolay Aleksandrov <razor@blackwall.org>
>>>>
>>>> You cannot break the multicast flood flag to add support for a custom
>>>> use-case. This is unacceptable. The current bridge behaviour is correct
>>>> your patch 02 doesn't fix anything, you should configure the bridge
>>>> properly to avoid all those problems, not break protocols.
>>>>
>>>> Your special use case can easily be solved by a user-space helper or
>>>> eBPF and nftables. You can set the mcast flood flag and bypass the
>>>> bridge for these packets. I basically said the same in 2021, if this is
>>>> going to be in the bridge it should be hidden behind an option that is
>>>> default off. But in my opinion adding an option to solve such special
>>>> cases is undesirable, they can be easily solved with what's currently
>>>> available.
>>>
>>> I appreciate your time is limited, but could you please translate your
>>> suggestion, and detail your proposed alternative a bit, for those of us
>>> who are not very familiar with IP multicast snooping?
>>>
>>
>> My suggestion is not related to snooping really, but to the goal of
>> patches 01-03. The bridge patches in this set are trying to forward
>> traffic that is not supposed to be forwarded with the proposed
>> configuration,
> 
> Correct up to a point. Reinterpreting the given user space configuration
> and trying to make it do something else seems like a mistake, but in
> principle one could also look at alternative bridge configurations like
> the one I described here:
> https://lore.kernel.org/netdev/20240402180805.yhhwj2f52sdc4dl2@skbuf/
> 
>> so that can be done by a user-space helper that installs
>> rules to bypass the bridge specifically for those packets while
>> monitoring the bridge state to implement a policy and manage these rules
>> in order to keep snooping working.
>>
>>> Bypass the bridge for which packets? General IGMP/MLD queries? Wouldn't
>>> that break snooping? And then do what with the packets, forward them in
>>> another software layer than the bridge?
>>>
>>
>> The ones that are not supposed to be forwarded in the proposed config
>> and are needed for this use case (control traffic and link-local). Obviously
>> to have proper snooping you'd need to manage these bypass
>> rules and use them only while needed.
> 
> I think Joseph will end up in a situation where he needs IGMP control
> messages both in the bridge data path and outside of it :)
> 

My solution does not exclude such scenario. With all unregistered mcast
disabled it will be handled the same as with this proposed solution.

> Also, your proposal eliminates the possibility of cooperating with a
> hardware accelerator which can forward the IGMP messages where they need
> to go.
> 
> As far as I understand, I don't think Joseph has a very "special" use case.
> Disabling flooding of unregistered multicast in the data plane sounds
> reasonable. There seems to be a gap in the bridge API, in that this

This we already have, but..

> operation also affects the control plane, which he is trying to fix with
> this "force flooding", because of insufficiently fine grained control.
> 

Right, this is the part that is more special, I'm not saying it's
unreasonable. The proposition to make it optional and break it down to
type of traffic sounds good to me.

>>> I also don't quite understand the suggestion of turning on mcast flooding:
>>> isn't Joseph saying that he wants it off for the unregistered multicast
>>> data traffic?
>>
>> Ah my bad, I meant to turn off flooding and bypass the bridge for those
>> packets and ports while necessary, under necessary can be any policy
>> that the user-space helper wants to implement.
>>
>> In any case, if this is going to be yet another kernel solution then it
>> must be a new option that is default off, and doesn't break current mcast
>> flood flag behaviour.
> 
> Yeah, maybe something like this, simple and with clear offload
> semantics, as seen in existing hardware (not Marvell though):
> 
> mcast_flood == off:
> - mcast_ipv4_ctrl_flood: don't care (maybe can force to "off")
> - mcast_ipv4_data_flood: don't care
> - mcast_ipv6_ctrl_flood: don't care
> - mcast_ipv6_data_flood: don't care
> - mcast_l2_flood: don't care
> mcast_flood == on:
> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
> - Flood all other IPv4 multicast according to mcast_ipv4_data_flood
> - Flood ff02::/16 according to mcast_ipv6_ctrl_flood
> - Flood all other IPv6 multicast according to mcast_ipv6_data_flood
> - Flood L2 according to mcast_l2_flood

Yep, sounds good to me. I was thinking about something in these lines
as well if doing a kernel solution in order to make it simpler and more
generic. The ctrl flood bits need to be handled more carefully to make
sure they match only control traffic and not link-local data.
I think the old option can be converted to use this fine-grained
control, that is if anyone sets the old flood on/off then the flood
mask gets set properly so we can do just 1 & in the fast path and avoid
adding more tests. It will also make it symmetric - if it can override
the on case, then it will be able to override the off case. And to be
more explicit you can pass a mask variable to br_multicast_rcv() which
will get populated and then you can pass it down to br_flood(). That
will also avoid adding new bits to the skb's bridge CB.

Joseph Huang April 4, 2024, 9:35 p.m. UTC | #7

Hi Andrew,

On 4/2/2024 8:43 AM, Andrew Lunn wrote:
> On Mon, Apr 01, 2024 at 08:10:59PM -0400, Joseph Huang wrote:
>> There is a use case where one would like to enable multicast snooping
>> on a bridge but disable multicast flooding on all bridge ports so that
>> registered multicast traffic will only reach the intended recipients and
>> unregistered multicast traffic will be dropped. However, with existing
>> bridge ports' mcast_flood flag implementation, it doesn't work as desired.
>> 
>> This patchset aims to make multicast snooping work even when multicast
>> flooding is disabled on the bridge ports, without changing the semantic of
>> the mcast_flood flag too much. Patches 1 to 4 attempt to address this issue.
>> 
>> Also, in a network where more than one multicast snooping capable bridges
>> are interconnected without multicast routers being present, multicast
>> snooping fails if:
>> 
>>   1. The source is not directly attached to the Querier
>>   2. The listener is beyond the mrouter port of the bridge where the
>>      source is directly attached
>>   3. A hardware offloading switch is involved
> 
> I've not studied the details here, but that last point makes me think
> the offload driver is broken. There should not be any difference
> between software bridging and hardware bridging. The whole idea is
> that you take what Linux can do in software and accelerate it by
> offloading to hardware. Doing acceleration should not change the
> behaviour.
> 

In patch 10 I gave a little more detail about the fix, but basically 
this is what happened.

Assuming we have a soft bridge like the following:

             bp1 +------------+
   Querier <---- |   bridge   |
                 +------------+
                bp2 |      | bp3
                    |      |
                    v      v
             MC Source    MC Listener

Here bp1 is the mrouter port, bp2 is connected to the multicast source, 
and bp3 is connected to the multicast listener who wishes to receive 
multicast traffic for that group.

After some Query/Report exchange, the snooping code in the bridge is 
going to learn about the Listener from bp3, and is going to create an 
MDB group which includes bp3 as the sole member. When the bridge 
receives a multicast packet for that group from bp2, the bridge will do 
a look up to find the members of that group (in this case, bp3) and 
forward the packet to every single member in that group. At the same 
time, the bridge will also forward the packet to every mrouter port so 
that listeners beyond mrouter ports can receive that multicast packet as 
well.

Now consider the same scenario, but with a hardware-offloaded switch:

                 +------------+
                 |   bridge   |
                 +------------+
                       ^
                       |
                       | p6 (Host CPU port)
          p1/bp1 +------------+
   Querier <---- |     sw     |
                 +------------+
             p2/bp2 |      | p3/bp3
                    |      |
                    v      v
             MC Source    MC Listener

Same Query/Report exchange, same MDB group, except that this time around 
the MDB group will be offloaded to the switch as well. So in the 
switch's ATU we will now have an entry for the multicast group and with 
p3 being the only member of that ATU. When the multicast packet arrives 
at the switch from p2, the switch will do an ATU lookup, and forward the 
packet to p3 only. This means that the Host CPU (p6) will not get a copy 
of the packet, and so the soft bridge will not have the opportunity to 
forward that packet to the mrouter port. This is what patch 10 attempts 
to address.

One possible solution of course is to add p6 to every MDB group, however 
that's probably not very desirable. Besides, it will be more efficient 
if the packet is forwarded to the mrouter port by the switch in hardware 
(i.e., offload mrouter forwarding), vs. being forwarded in the bridge by 
software.

>> The patches were developed against 5.15, and forward-ported to 6.8.
>> Tests were done on a Pi 4B + Marvell 6393X Eval board with a single
>> switch chip with no VLAN.
> 
> So what is the mv88e6xxx driver doing wrong?
> 
> 	Andrew
> 

The mv88e6xxx driver did nothing differently than the other DSA drivers. 
I chose the mv88e6xxx driver to implement the fix simply because that's 
the only platform I have at hand to verify the solution.

Andrew Lunn April 4, 2024, 10:11 p.m. UTC | #8

On Thu, Apr 04, 2024 at 05:35:41PM -0400, Joseph Huang wrote:
> Hi Andrew,
> 
> On 4/2/2024 8:43 AM, Andrew Lunn wrote:
> > On Mon, Apr 01, 2024 at 08:10:59PM -0400, Joseph Huang wrote:
> > > There is a use case where one would like to enable multicast snooping
> > > on a bridge but disable multicast flooding on all bridge ports so that
> > > registered multicast traffic will only reach the intended recipients and
> > > unregistered multicast traffic will be dropped. However, with existing
> > > bridge ports' mcast_flood flag implementation, it doesn't work as desired.
> > > 
> > > This patchset aims to make multicast snooping work even when multicast
> > > flooding is disabled on the bridge ports, without changing the semantic of
> > > the mcast_flood flag too much. Patches 1 to 4 attempt to address this issue.
> > > 
> > > Also, in a network where more than one multicast snooping capable bridges
> > > are interconnected without multicast routers being present, multicast
> > > snooping fails if:
> > > 
> > >   1. The source is not directly attached to the Querier
> > >   2. The listener is beyond the mrouter port of the bridge where the
> > >      source is directly attached
> > >   3. A hardware offloading switch is involved
> > 
> > I've not studied the details here, but that last point makes me think
> > the offload driver is broken. There should not be any difference
> > between software bridging and hardware bridging. The whole idea is
> > that you take what Linux can do in software and accelerate it by
> > offloading to hardware. Doing acceleration should not change the
> > behaviour.
> > 
> 
> In patch 10 I gave a little more detail about the fix, but basically this is
> what happened.
> 
> Assuming we have a soft bridge like the following:
> 
>             bp1 +------------+
>   Querier <---- |   bridge   |
>                 +------------+
>                bp2 |      | bp3
>                    |      |
>                    v      v
>             MC Source    MC Listener
> 
> Here bp1 is the mrouter port, bp2 is connected to the multicast source, and
> bp3 is connected to the multicast listener who wishes to receive multicast
> traffic for that group.
> 
> After some Query/Report exchange, the snooping code in the bridge is going
> to learn about the Listener from bp3, and is going to create an MDB group
> which includes bp3 as the sole member. When the bridge receives a multicast
> packet for that group from bp2, the bridge will do a look up to find the
> members of that group (in this case, bp3) and forward the packet to every
> single member in that group. At the same time, the bridge will also forward
> the packet to every mrouter port so that listeners beyond mrouter ports can
> receive that multicast packet as well.
> 
> Now consider the same scenario, but with a hardware-offloaded switch:
> 
>                 +------------+
>                 |   bridge   |
>                 +------------+
>                       ^
>                       |
>                       | p6 (Host CPU port)
>          p1/bp1 +------------+
>   Querier <---- |     sw     |
>                 +------------+
>             p2/bp2 |      | p3/bp3
>                    |      |
>                    v      v
>             MC Source    MC Listener
> 
> Same Query/Report exchange, same MDB group, except that this time around the
> MDB group will be offloaded to the switch as well. So in the switch's ATU we
> will now have an entry for the multicast group and with p3 being the only
> member of that ATU. When the multicast packet arrives at the switch from p2,
> the switch will do an ATU lookup, and forward the packet to p3 only. This
> means that the Host CPU (p6) will not get a copy of the packet, and so the
> soft bridge will not have the opportunity to forward that packet to the
> mrouter port. This is what patch 10 attempts to address.
> 
> One possible solution of course is to add p6 to every MDB group, however
> that's probably not very desirable. Besides, it will be more efficient if
> the packet is forwarded to the mrouter port by the switch in hardware (i.e.,
> offload mrouter forwarding), vs. being forwarded in the bridge by software.

Thanks for the explanation. So i think the key part which you said
above is:

  At the same time, the bridge will also forward the packet to every
  mrouter port so that listeners beyond mrouter ports can receive that
  multicast packet as well.

How does the bridge know about the mrouter port? It seems like the
bridge needs to pass that information down to the switch. Is the
mrouter itself performing a join on the group when it has members of
the group on the other side of it?

	Andrew

Joseph Huang April 4, 2024, 10:16 p.m. UTC | #9

On 4/2/2024 5:59 PM, Nikolay Aleksandrov wrote:
> On 4/2/24 23:46, Vladimir Oltean wrote:
>> On Tue, Apr 02, 2024 at 09:50:51PM +0300, Nikolay Aleksandrov wrote:
>>> On 4/2/24 20:43, Vladimir Oltean wrote:
>>>> Hi Nikolai,
>>>>
>>>> On Tue, Apr 02, 2024 at 12:28:38PM +0300, Nikolay Aleksandrov wrote:
>>>>> For the bridge patches:
>>>>> Nacked-by: Nikolay Aleksandrov <razor@blackwall.org>
>>>>>
>>>>> You cannot break the multicast flood flag to add support for a custom
>>>>> use-case. This is unacceptable. The current bridge behaviour is correct
>>>>> your patch 02 doesn't fix anything, you should configure the bridge
>>>>> properly to avoid all those problems, not break protocols.
>>>>>
>>>>> Your special use case can easily be solved by a user-space helper or
>>>>> eBPF and nftables. You can set the mcast flood flag and bypass the
>>>>> bridge for these packets. I basically said the same in 2021, if this is
>>>>> going to be in the bridge it should be hidden behind an option that is
>>>>> default off. But in my opinion adding an option to solve such special
>>>>> cases is undesirable, they can be easily solved with what's currently
>>>>> available.
>>>>
>>>> I appreciate your time is limited, but could you please translate your
>>>> suggestion, and detail your proposed alternative a bit, for those of us
>>>> who are not very familiar with IP multicast snooping?
>>>>
>>>
>>> My suggestion is not related to snooping really, but to the goal of
>>> patches 01-03. The bridge patches in this set are trying to forward
>>> traffic that is not supposed to be forwarded with the proposed
>>> configuration,
>> 
>> Correct up to a point. Reinterpreting the given user space configuration
>> and trying to make it do something else seems like a mistake, but in
>> principle one could also look at alternative bridge configurations like
>> the one I described here:
>> https://lore.kernel.org/netdev/20240402180805.yhhwj2f52sdc4dl2@skbuf/
>> 
>>> so that can be done by a user-space helper that installs
>>> rules to bypass the bridge specifically for those packets while
>>> monitoring the bridge state to implement a policy and manage these rules
>>> in order to keep snooping working.
>>>
>>>> Bypass the bridge for which packets? General IGMP/MLD queries? Wouldn't
>>>> that break snooping? And then do what with the packets, forward them in
>>>> another software layer than the bridge?
>>>>
>>>
>>> The ones that are not supposed to be forwarded in the proposed config
>>> and are needed for this use case (control traffic and link-local). Obviously
>>> to have proper snooping you'd need to manage these bypass
>>> rules and use them only while needed.
>> 
>> I think Joseph will end up in a situation where he needs IGMP control
>> messages both in the bridge data path and outside of it :)
>> 
> 
> My solution does not exclude such scenario. With all unregistered mcast
> disabled it will be handled the same as with this proposed solution.
> 
>> Also, your proposal eliminates the possibility of cooperating with a
>> hardware accelerator which can forward the IGMP messages where they need
>> to go.
>> 
>> As far as I understand, I don't think Joseph has a very "special" use case.
>> Disabling flooding of unregistered multicast in the data plane sounds
>> reasonable. There seems to be a gap in the bridge API, in that this
> 
> This we already have, but..
> 
>> operation also affects the control plane, which he is trying to fix with
>> this "force flooding", because of insufficiently fine grained control.
>> 
> 
> Right, this is the part that is more special, I'm not saying it's
> unreasonable. The proposition to make it optional and break it down to
> type of traffic sounds good to me.
> 
>>>> I also don't quite understand the suggestion of turning on mcast flooding:
>>>> isn't Joseph saying that he wants it off for the unregistered multicast
>>>> data traffic?
>>>
>>> Ah my bad, I meant to turn off flooding and bypass the bridge for those
>>> packets and ports while necessary, under necessary can be any policy
>>> that the user-space helper wants to implement.
>>>
>>> In any case, if this is going to be yet another kernel solution then it
>>> must be a new option that is default off, and doesn't break current mcast
>>> flood flag behaviour.
>> 
>> Yeah, maybe something like this, simple and with clear offload
>> semantics, as seen in existing hardware (not Marvell though):
>> 
>> mcast_flood == off:
>> - mcast_ipv4_ctrl_flood: don't care (maybe can force to "off")
>> - mcast_ipv4_data_flood: don't care
>> - mcast_ipv6_ctrl_flood: don't care
>> - mcast_ipv6_data_flood: don't care
>> - mcast_l2_flood: don't care
>> mcast_flood == on:
>> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
>> - Flood all other IPv4 multicast according to mcast_ipv4_data_flood
>> - Flood ff02::/16 according to mcast_ipv6_ctrl_flood
>> - Flood all other IPv6 multicast according to mcast_ipv6_data_flood
>> - Flood L2 according to mcast_l2_flood

Did you mean

if mcast_flood == on (meaning mcast_flood is ENABLED)
- mcast_ipv4_ctrl_flood: don't care (since 224.0.0.x will be flooded anyway)
...

if mcast_flood == off (meaning mcast_flood is DISABLED)
- Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
...

? Otherwise the problem is still not solved when mcast_flood is disabled.

> 
> Yep, sounds good to me. I was thinking about something in these lines
> as well if doing a kernel solution in order to make it simpler and more
> generic. The ctrl flood bits need to be handled more carefully to make
> sure they match only control traffic and not link-local data.

Do we consider 224.0.0.251 (mDNS) to be control or data? What qualifies 
as control I guess that's my question.

> I think the old option can be converted to use this fine-grained
> control, that is if anyone sets the old flood on/off then the flood
> mask gets set properly so we can do just 1 & in the fast path and avoid
> adding more tests. It will also make it symmetric - if it can override
> the on case, then it will be able to override the off case. And to be
> more explicit you can pass a mask variable to br_multicast_rcv() which
> will get populated and then you can pass it down to br_flood(). That
> will also avoid adding new bits to the skb's bridge CB.
> 
>

Joseph Huang April 4, 2024, 10:40 p.m. UTC | #10

Hi Andrew,

On 4/4/2024 6:11 PM, Andrew Lunn wrote:
> On Thu, Apr 04, 2024 at 05:35:41PM -0400, Joseph Huang wrote:
>> Hi Andrew,
>>
>> On 4/2/2024 8:43 AM, Andrew Lunn wrote:
>>> On Mon, Apr 01, 2024 at 08:10:59PM -0400, Joseph Huang wrote:
>>>> There is a use case where one would like to enable multicast snooping
>>>> on a bridge but disable multicast flooding on all bridge ports so that
>>>> registered multicast traffic will only reach the intended recipients and
>>>> unregistered multicast traffic will be dropped. However, with existing
>>>> bridge ports' mcast_flood flag implementation, it doesn't work as desired.
>>>>
>>>> This patchset aims to make multicast snooping work even when multicast
>>>> flooding is disabled on the bridge ports, without changing the semantic of
>>>> the mcast_flood flag too much. Patches 1 to 4 attempt to address this issue.
>>>>
>>>> Also, in a network where more than one multicast snooping capable bridges
>>>> are interconnected without multicast routers being present, multicast
>>>> snooping fails if:
>>>>
>>>>    1. The source is not directly attached to the Querier
>>>>    2. The listener is beyond the mrouter port of the bridge where the
>>>>       source is directly attached
>>>>    3. A hardware offloading switch is involved
>>>
>>> I've not studied the details here, but that last point makes me think
>>> the offload driver is broken. There should not be any difference
>>> between software bridging and hardware bridging. The whole idea is
>>> that you take what Linux can do in software and accelerate it by
>>> offloading to hardware. Doing acceleration should not change the
>>> behaviour.
>>>
>>
>> In patch 10 I gave a little more detail about the fix, but basically this is
>> what happened.
>>
>> Assuming we have a soft bridge like the following:
>>
>>              bp1 +------------+
>>    Querier <---- |   bridge   |
>>                  +------------+
>>                 bp2 |      | bp3
>>                     |      |
>>                     v      v
>>              MC Source    MC Listener
>>
>> Here bp1 is the mrouter port, bp2 is connected to the multicast source, and
>> bp3 is connected to the multicast listener who wishes to receive multicast
>> traffic for that group.
>>
>> After some Query/Report exchange, the snooping code in the bridge is going
>> to learn about the Listener from bp3, and is going to create an MDB group
>> which includes bp3 as the sole member. When the bridge receives a multicast
>> packet for that group from bp2, the bridge will do a look up to find the
>> members of that group (in this case, bp3) and forward the packet to every
>> single member in that group. At the same time, the bridge will also forward
>> the packet to every mrouter port so that listeners beyond mrouter ports can
>> receive that multicast packet as well.
>>
>> Now consider the same scenario, but with a hardware-offloaded switch:
>>
>>                  +------------+
>>                  |   bridge   |
>>                  +------------+
>>                        ^
>>                        |
>>                        | p6 (Host CPU port)
>>           p1/bp1 +------------+
>>    Querier <---- |     sw     |
>>                  +------------+
>>              p2/bp2 |      | p3/bp3
>>                     |      |
>>                     v      v
>>              MC Source    MC Listener
>>
>> Same Query/Report exchange, same MDB group, except that this time around the
>> MDB group will be offloaded to the switch as well. So in the switch's ATU we
>> will now have an entry for the multicast group and with p3 being the only
>> member of that ATU. When the multicast packet arrives at the switch from p2,
>> the switch will do an ATU lookup, and forward the packet to p3 only. This
>> means that the Host CPU (p6) will not get a copy of the packet, and so the
>> soft bridge will not have the opportunity to forward that packet to the
>> mrouter port. This is what patch 10 attempts to address.
>>
>> One possible solution of course is to add p6 to every MDB group, however
>> that's probably not very desirable. Besides, it will be more efficient if
>> the packet is forwarded to the mrouter port by the switch in hardware (i.e.,
>> offload mrouter forwarding), vs. being forwarded in the bridge by software.
> 
> Thanks for the explanation. So i think the key part which you said
> above is:
> 
>    At the same time, the bridge will also forward the packet to every
>    mrouter port so that listeners beyond mrouter ports can receive that
>    multicast packet as well.
> 
> How does the bridge know about the mrouter port? It seems like the

The bridge learns about the existence of the Querier by the reception of 
Queries. The bridge will mark the port which it received Queries from as 
the mrouter port.

> bridge needs to pass that information down to the switch. Is the

The bridge does pass that information down to switchdev. Patch 5 adds 
DSA handling of that event as well. Patches 9 and 10 adds the support in 
the mv88e6xxx driver.

> mrouter itself performing a join on the group when it has members of
> the group on the other side of it?

You hit a key point here. The Querier does not send Report (not with 
IGMPv2 anyway), so the bridge will never add the mrouter port to any MDB 
group. That's why patch 10 is needed. Prestera driver does something 
similar (which is what patches 6,7, and 10 are modeled after).

> 
> 	Andrew

Vladimir Oltean April 5, 2024, 10:20 a.m. UTC | #11

On Thu, Apr 04, 2024 at 06:16:12PM -0400, Joseph Huang wrote:
> > > mcast_flood == off:
> > > - mcast_ipv4_ctrl_flood: don't care (maybe can force to "off")
> > > - mcast_ipv4_data_flood: don't care
> > > - mcast_ipv6_ctrl_flood: don't care
> > > - mcast_ipv6_data_flood: don't care
> > > - mcast_l2_flood: don't care
> > > mcast_flood == on:
> > > - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
> > > - Flood all other IPv4 multicast according to mcast_ipv4_data_flood
> > > - Flood ff02::/16 according to mcast_ipv6_ctrl_flood
> > > - Flood all other IPv6 multicast according to mcast_ipv6_data_flood
> > > - Flood L2 according to mcast_l2_flood
> 
> Did you mean
> 
> if mcast_flood == on (meaning mcast_flood is ENABLED)
> - mcast_ipv4_ctrl_flood: don't care (since 224.0.0.x will be flooded anyway)
> ...
> 
> if mcast_flood == off (meaning mcast_flood is DISABLED)
> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
> ...
> 
> ? Otherwise the problem is still not solved when mcast_flood is disabled.

No, I mean exactly as I said. My goal was not to "solve the problem"
when mcast_flood is disabled, but to give you an option to configure the
bridge to achieve what you want, in a way which I think is more acceptable.

AFAIU, there is not really any "problem" - the bridge behaves exactly as
instructed given the limited language available to instruct it ("mcast_flood"
covers all multicast). So the other knobs have the role of fine-tuning
what gets flooded when mcast_flood is on. Like "yes, but..."

You can't "solve the problem" when it involves changing an established
behavior that somebody probably depended on to be just like that.

> > Yep, sounds good to me. I was thinking about something in these lines
> > as well if doing a kernel solution in order to make it simpler and more
> > generic. The ctrl flood bits need to be handled more carefully to make
> > sure they match only control traffic and not link-local data.
> 
> Do we consider 224.0.0.251 (mDNS) to be control or data? What qualifies as
> control I guess that's my question.

Well, as I said, I'm proposing that 224.0.0.x qualifies as control and
the rest of IPv4 multicast as data. Which means that, applied to your
case, "mcast_flood on mcast_ipv4_ctrl_flood on mcast_ipv4_data_flood off"
will "force flood" mDNS just like the IGMP traffic from your patches.
I'm not aware if this could be considered problematic (I don't think so).

The reason behind this proposal is that, AFAIU, endpoints may choose to
join IGMP groups in the 224.0.0.x range or not, but RFC4541 says that
switches shouldn't prune the destinations towards endpoints that don't
join this range anyway: https://www.rfc-editor.org/rfc/rfc4541#page-6

Whereas for IP multicast traffic towards an address outside 224.0.0.x,
pruning will happen as per the IGMP join tracking mechanism.

Nikolay Aleksandrov April 5, 2024, 11 a.m. UTC | #12

On 4/5/24 13:20, Vladimir Oltean wrote:
> On Thu, Apr 04, 2024 at 06:16:12PM -0400, Joseph Huang wrote:
>>>> mcast_flood == off:
>>>> - mcast_ipv4_ctrl_flood: don't care (maybe can force to "off")
>>>> - mcast_ipv4_data_flood: don't care
>>>> - mcast_ipv6_ctrl_flood: don't care
>>>> - mcast_ipv6_data_flood: don't care
>>>> - mcast_l2_flood: don't care
>>>> mcast_flood == on:
>>>> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
>>>> - Flood all other IPv4 multicast according to mcast_ipv4_data_flood
>>>> - Flood ff02::/16 according to mcast_ipv6_ctrl_flood
>>>> - Flood all other IPv6 multicast according to mcast_ipv6_data_flood
>>>> - Flood L2 according to mcast_l2_flood
>>
>> Did you mean
>>
>> if mcast_flood == on (meaning mcast_flood is ENABLED)
>> - mcast_ipv4_ctrl_flood: don't care (since 224.0.0.x will be flooded anyway)
>> ...
>>
>> if mcast_flood == off (meaning mcast_flood is DISABLED)
>> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
>> ...
>>
>> ? Otherwise the problem is still not solved when mcast_flood is disabled.
> 
> No, I mean exactly as I said. My goal was not to "solve the problem"
> when mcast_flood is disabled, but to give you an option to configure the
> bridge to achieve what you want, in a way which I think is more acceptable.
> 
> AFAIU, there is not really any "problem" - the bridge behaves exactly as
> instructed given the limited language available to instruct it ("mcast_flood"
> covers all multicast). So the other knobs have the role of fine-tuning
> what gets flooded when mcast_flood is on. Like "yes, but..."
> 
> You can't "solve the problem" when it involves changing an established
> behavior that somebody probably depended on to be just like that.
> 
>>> Yep, sounds good to me. I was thinking about something in these lines
>>> as well if doing a kernel solution in order to make it simpler and more
>>> generic. The ctrl flood bits need to be handled more carefully to make
>>> sure they match only control traffic and not link-local data.
>>
>> Do we consider 224.0.0.251 (mDNS) to be control or data? What qualifies as
>> control I guess that's my question.
> 
> Well, as I said, I'm proposing that 224.0.0.x qualifies as control and
> the rest of IPv4 multicast as data. Which means that, applied to your
> case, "mcast_flood on mcast_ipv4_ctrl_flood on mcast_ipv4_data_flood off"
> will "force flood" mDNS just like the IGMP traffic from your patches.
> I'm not aware if this could be considered problematic (I don't think so).
> 
> The reason behind this proposal is that, AFAIU, endpoints may choose to
> join IGMP groups in the 224.0.0.x range or not, but RFC4541 says that
> switches shouldn't prune the destinations towards endpoints that don't
> join this range anyway: https://www.rfc-editor.org/rfc/rfc4541#page-6
> 
> Whereas for IP multicast traffic towards an address outside 224.0.0.x,
> pruning will happen as per the IGMP join tracking mechanism.

+1, non-IGMP traffic to 224.0.0.x must be flooded to all anyway
so this should allow for a better control over it, but perhaps
the naming should be link_local instead because control usually
means IGMP, maybe something like mcast_ip_link_local_flood

Andrew Lunn April 5, 2024, 1:09 p.m. UTC | #13

> > Thanks for the explanation. So i think the key part which you said
> > above is:
> > 
> >    At the same time, the bridge will also forward the packet to every
> >    mrouter port so that listeners beyond mrouter ports can receive that
> >    multicast packet as well.
> > 
> > How does the bridge know about the mrouter port? It seems like the
> 
> The bridge learns about the existence of the Querier by the reception of
> Queries. The bridge will mark the port which it received Queries from as the
> mrouter port.
> 
> > bridge needs to pass that information down to the switch. Is the
> 
> The bridge does pass that information down to switchdev. Patch 5 adds DSA
> handling of that event as well. Patches 9 and 10 adds the support in the
> mv88e6xxx driver.

I've not been looking at the details too much for this patchset. It
does however seem that some parts are controversial, and others just
seem like a bug fix. Does it make sense to split this into two
patchsets?

	Andrew

Joseph Huang April 5, 2024, 8:22 p.m. UTC | #14

On 4/5/2024 7:00 AM, Nikolay Aleksandrov wrote:
> On 4/5/24 13:20, Vladimir Oltean wrote:
>> On Thu, Apr 04, 2024 at 06:16:12PM -0400, Joseph Huang wrote:
>>>>> mcast_flood == off:
>>>>> - mcast_ipv4_ctrl_flood: don't care (maybe can force to "off")
>>>>> - mcast_ipv4_data_flood: don't care
>>>>> - mcast_ipv6_ctrl_flood: don't care
>>>>> - mcast_ipv6_data_flood: don't care
>>>>> - mcast_l2_flood: don't care
>>>>> mcast_flood == on:
>>>>> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
>>>>> - Flood all other IPv4 multicast according to mcast_ipv4_data_flood
>>>>> - Flood ff02::/16 according to mcast_ipv6_ctrl_flood
>>>>> - Flood all other IPv6 multicast according to mcast_ipv6_data_flood
>>>>> - Flood L2 according to mcast_l2_flood
>>>
>>> Did you mean
>>>
>>> if mcast_flood == on (meaning mcast_flood is ENABLED)
>>> - mcast_ipv4_ctrl_flood: don't care (since 224.0.0.x will be flooded 
>>> anyway)
>>> ...
>>>
>>> if mcast_flood == off (meaning mcast_flood is DISABLED)
>>> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
>>> ...
>>>
>>> ? Otherwise the problem is still not solved when mcast_flood is 
>>> disabled.
>>
>> No, I mean exactly as I said. My goal was not to "solve the problem"
>> when mcast_flood is disabled, but to give you an option to configure the
>> bridge to achieve what you want, in a way which I think is more 
>> acceptable.
>>
>> AFAIU, there is not really any "problem" - the bridge behaves exactly as
>> instructed given the limited language available to instruct it 
>> ("mcast_flood"
>> covers all multicast). So the other knobs have the role of fine-tuning
>> what gets flooded when mcast_flood is on. Like "yes, but..."
>>
>> You can't "solve the problem" when it involves changing an established
>> behavior that somebody probably depended on to be just like that.
>>
>>>> Yep, sounds good to me. I was thinking about something in these lines
>>>> as well if doing a kernel solution in order to make it simpler and more
>>>> generic. The ctrl flood bits need to be handled more carefully to make
>>>> sure they match only control traffic and not link-local data.
>>>
>>> Do we consider 224.0.0.251 (mDNS) to be control or data? What 
>>> qualifies as
>>> control I guess that's my question.
>>
>> Well, as I said, I'm proposing that 224.0.0.x qualifies as control and
>> the rest of IPv4 multicast as data. Which means that, applied to your
>> case, "mcast_flood on mcast_ipv4_ctrl_flood on mcast_ipv4_data_flood off"
>> will "force flood" mDNS just like the IGMP traffic from your patches.
>> I'm not aware if this could be considered problematic (I don't think so).
>>
>> The reason behind this proposal is that, AFAIU, endpoints may choose to
>> join IGMP groups in the 224.0.0.x range or not, but RFC4541 says that
>> switches shouldn't prune the destinations towards endpoints that don't
>> join this range anyway: https://www.rfc-editor.org/rfc/rfc4541#page-6
>>
>> Whereas for IP multicast traffic towards an address outside 224.0.0.x,
>> pruning will happen as per the IGMP join tracking mechanism.
> 
> +1, non-IGMP traffic to 224.0.0.x must be flooded to all anyway
> so this should allow for a better control over it, but perhaps
> the naming should be link_local instead because control usually
> means IGMP, maybe something like mcast_ip_link_local_flood
> 

Like this?

bridge link set dev swp0 mcast_flood off
   - all flooding disabled

bridge link set dev swp0 mcast_flood on
   - all flooding enabled

bridge link set dev swp0 mcast_flood on mcast_ipv4_data_flood off 
mcast_ipv6_data_flood off
   - IPv4 data packets flooding disabled, IPv6 data packets flooding 
disabled, everything else floods (that is to say, only allow IPv4 local 
subnet and IPv6 link-local to flood)

?

The syntax seems to be counterintuitive.

Or like this?

bridge link set dev swp0 mcast_flood on mcast_ipv4_ctrl_flood on
   - only allow IPv4 local subnet to flood, everything else off

?

So basically the question is, what should the behavior be when something 
is omitted from the command line?

Vladimir Oltean April 5, 2024, 9:15 p.m. UTC | #15

On Fri, Apr 05, 2024 at 04:22:43PM -0400, Joseph Huang wrote:
> Like this?
> 
> bridge link set dev swp0 mcast_flood off
>   - all flooding disabled
> 
> bridge link set dev swp0 mcast_flood on
>   - all flooding enabled
> 
> bridge link set dev swp0 mcast_flood on mcast_ipv4_data_flood off
> mcast_ipv6_data_flood off
>   - IPv4 data packets flooding disabled, IPv6 data packets flooding
> disabled, everything else floods (that is to say, only allow IPv4 local
> subnet and IPv6 link-local to flood)
> 
> ?

Yeah.

> The syntax seems to be counterintuitive.
> 
> Or like this?
> 
> bridge link set dev swp0 mcast_flood on mcast_ipv4_ctrl_flood on
>   - only allow IPv4 local subnet to flood, everything else off
> 
> ?

Nope.

> So basically the question is, what should the behavior be when something is
> omitted from the command line?

The answer is always: "new options should default to behaving exactly
like before". It's not just about the command line arguments, but also
about the actual netlink attributes that iproute2 (and other tooling)
creates when communicating with the kernel. Old user space has no idea
about the existence of mcast_ipv4_ctrl_flood et. al. So, if netlink
attributes specifying their value are not sent by user space, their
value in the kernel must mimic the value of mcast_flood.

Joseph Huang April 29, 2024, 8:14 p.m. UTC | #16

On 4/5/2024 5:15 PM, Vladimir Oltean wrote:
> On Fri, Apr 05, 2024 at 04:22:43PM -0400, Joseph Huang wrote:
>> Like this?
>>
>> bridge link set dev swp0 mcast_flood off
>>    - all flooding disabled
>>
>> bridge link set dev swp0 mcast_flood on
>>    - all flooding enabled
>>
>> bridge link set dev swp0 mcast_flood on mcast_ipv4_data_flood off
>> mcast_ipv6_data_flood off
>>    - IPv4 data packets flooding disabled, IPv6 data packets flooding
>> disabled, everything else floods (that is to say, only allow IPv4 local
>> subnet and IPv6 link-local to flood)
>>
>> ?
> 
> Yeah.
> 
>> The syntax seems to be counterintuitive.
>>
>> Or like this?
>>
>> bridge link set dev swp0 mcast_flood on mcast_ipv4_ctrl_flood on
>>    - only allow IPv4 local subnet to flood, everything else off
>>
>> ?
> 
> Nope.
> 
>> So basically the question is, what should the behavior be when something is
>> omitted from the command line?
> 
> The answer is always: "new options should default to behaving exactly
> like before". It's not just about the command line arguments, but also
> about the actual netlink attributes that iproute2 (and other tooling)
> creates when communicating with the kernel. Old user space has no idea
> about the existence of mcast_ipv4_ctrl_flood et. al. So, if netlink
> attributes specifying their value are not sent by user space, their
> value in the kernel must mimic the value of mcast_flood.

How about the following syntax? I think it satisfies all the "not 
breaking existing behavior" requirements (new option defaults to off, 
and missing user space netlink attributes does not change the existing 
behavior):

mcast_flood off
   all off
mcast_flood off mcast_flood_rfc4541 off
   all off
mcast_flood off mcast_flood_rfc4541 on
   224.0.0.X and ff02::1 on, the rest off
mcast_flood on
   all on
mcast_flood on mcast_flood_rfc4541 off
   all on (mcast_flood on overrides mcast_flood_rfc4541)
mcast_flood on mcast_flood_rfc4541 on
   all on
mcast_flood_rfc4541 off
   invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] 
is specified first)
mcast_flood_rfc4541 on
   invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] 
is specified first)

Think of mcast_flood_rfc4541 like a pet door if you will.

Vladimir Oltean April 30, 2024, 1:21 a.m. UTC | #17

On Mon, Apr 29, 2024 at 04:14:03PM -0400, Joseph Huang wrote:
> How about the following syntax? I think it satisfies all the "not breaking
> existing behavior" requirements (new option defaults to off, and missing
> user space netlink attributes does not change the existing behavior):
> 
> mcast_flood off
>   all off
> mcast_flood off mcast_flood_rfc4541 off
>   all off
> mcast_flood off mcast_flood_rfc4541 on
>   224.0.0.X and ff02::1 on, the rest off
> mcast_flood on
>   all on
> mcast_flood on mcast_flood_rfc4541 off
>   all on (mcast_flood on overrides mcast_flood_rfc4541)
> mcast_flood on mcast_flood_rfc4541 on
>   all on
> mcast_flood_rfc4541 off
>   invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is
> specified first)
> mcast_flood_rfc4541 on
>   invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is
> specified first)

A bridge port defaults to having BR_MCAST_FLOOD set - see new_nbp().
Netlink attributes are only there to _change_ the state of properties in
the kernel. They don't need to be specified by user space if there's
nothing to be changed. "Only valid if another netlink attribute comes
first" makes no sense. You can alter 2 bridge port flags as part of the
same netlink message, or as part of different netlink messages (sent
over sockets of other processes).

> 
> Think of mcast_flood_rfc4541 like a pet door if you will.

Ultimately, as far as I see it, both the OR-based and the AND-based UAPI
addition could be made to work in a way that's perhaps similarly backwards
compatible. It needs to be worked out with the bridge maintainers. Given
that I'm not doing great with my spare time, I will take a back seat on
that.

However, what I don't quite understand in your proposal is how many IPv4
addresses lie beyond the "224.0.0.X" notation? 256? Explain why there is
such a large discrepancy in the number of IPv4 addresses you are in
control of (256), vs only 1 IPv6 address with the new rfc4541 flag?

Joseph Huang April 30, 2024, 5:01 p.m. UTC | #18

On 4/29/2024 9:21 PM, Vladimir Oltean wrote:
> On Mon, Apr 29, 2024 at 04:14:03PM -0400, Joseph Huang wrote:
>> How about the following syntax? I think it satisfies all the "not breaking
>> existing behavior" requirements (new option defaults to off, and missing
>> user space netlink attributes does not change the existing behavior):
>>
>> mcast_flood off
>>    all off
>> mcast_flood off mcast_flood_rfc4541 off
>>    all off
>> mcast_flood off mcast_flood_rfc4541 on
>>    224.0.0.X and ff02::1 on, the rest off
>> mcast_flood on
>>    all on
>> mcast_flood on mcast_flood_rfc4541 off
>>    all on (mcast_flood on overrides mcast_flood_rfc4541)
>> mcast_flood on mcast_flood_rfc4541 on
>>    all on
>> mcast_flood_rfc4541 off
>>    invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is
>> specified first)
>> mcast_flood_rfc4541 on
>>    invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is
>> specified first)
> 
> A bridge port defaults to having BR_MCAST_FLOOD set - see new_nbp().
> Netlink attributes are only there to _change_ the state of properties in
> the kernel. They don't need to be specified by user space if there's
> nothing to be changed. "Only valid if another netlink attribute comes
> first" makes no sense. You can alter 2 bridge port flags as part of the
> same netlink message, or as part of different netlink messages (sent
> over sockets of other processes).
> 
>>
>> Think of mcast_flood_rfc4541 like a pet door if you will.
> 
> Ultimately, as far as I see it, both the OR-based and the AND-based UAPI
> addition could be made to work in a way that's perhaps similarly backwards
> compatible. It needs to be worked out with the bridge maintainers. Given
> that I'm not doing great with my spare time, I will take a back seat on
> that.

Nik, do you have any objection to the following proposal?

mcast_flood ->          default/    off         on
(existing flag)         missing     (specified/ (specified/
                         (on)        nlmsg)      nlmsg)

mcast_flood_rfc4541
(proposed new flag)
      |
      v
default/                flood all   no flood    flood all
missing
(off)

off                     flood all   no flood    flood all
(specified/nlmsg)

on                      flood all   flood 4541  flood all
(specified/nlmsg)                   ^^^^^^^^^^
                                     only behavior change


Basically the attributes are OR'ed together to form the final flooding 
decision.


> 
> However, what I don't quite understand in your proposal is how many IPv4
> addresses lie beyond the "224.0.0.X" notation? 256? Explain why there is
> such a large discrepancy in the number of IPv4 addresses you are in
> control of (256), vs only 1 IPv6 address with the new rfc4541 flag?

That's straight out of RFC-4541 ("coincidentally", these are also the IP 
addresses for which the bridge will not create mdb's):

2.1.2.

    2) Packets with a destination IP (DIP) address in the 224.0.0.X range
       which are not IGMP must be forwarded on all ports.

       This recommendation is based on the fact that many host systems do
       not send Join IP multicast addresses in this range before sending
       or listening to IP multicast packets.  Furthermore, since the
       224.0.0.X address range is defined as link-local (not to be
       routed), it seems unnecessary to keep the state for each address
       in this range.  Additionally, some routers operate in the
       224.0.0.X address range without issuing IGMP Joins, and these
       applications would break if the switch were to prune them due to
       not having seen a Join Group message from the router.

and

3.

In IPv6, the data forwarding rules are more straight forward because
    MLD is mandated for addresses with scope 2 (link-scope) or greater.
    The only exception is the address FF02::1 which is the all hosts
    link-scope address for which MLD messages are never sent.  Packets
    with the all hosts link-scope address should be forwarded on all
    ports.

Nikolay Aleksandrov May 2, 2024, 12:12 p.m. UTC | #19

On 30/04/2024 20:01, Joseph Huang wrote:
> On 4/29/2024 9:21 PM, Vladimir Oltean wrote:
>> On Mon, Apr 29, 2024 at 04:14:03PM -0400, Joseph Huang wrote:
>>> How about the following syntax? I think it satisfies all the "not breaking
>>> existing behavior" requirements (new option defaults to off, and missing
>>> user space netlink attributes does not change the existing behavior):
>>>
>>> mcast_flood off
>>>    all off
>>> mcast_flood off mcast_flood_rfc4541 off
>>>    all off
>>> mcast_flood off mcast_flood_rfc4541 on
>>>    224.0.0.X and ff02::1 on, the rest off
>>> mcast_flood on
>>>    all on
>>> mcast_flood on mcast_flood_rfc4541 off
>>>    all on (mcast_flood on overrides mcast_flood_rfc4541)
>>> mcast_flood on mcast_flood_rfc4541 on
>>>    all on
>>> mcast_flood_rfc4541 off
>>>    invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is
>>> specified first)
>>> mcast_flood_rfc4541 on
>>>    invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is
>>> specified first)
>>
>> A bridge port defaults to having BR_MCAST_FLOOD set - see new_nbp().
>> Netlink attributes are only there to _change_ the state of properties in
>> the kernel. They don't need to be specified by user space if there's
>> nothing to be changed. "Only valid if another netlink attribute comes
>> first" makes no sense. You can alter 2 bridge port flags as part of the
>> same netlink message, or as part of different netlink messages (sent
>> over sockets of other processes).
>>
>>>
>>> Think of mcast_flood_rfc4541 like a pet door if you will.
>>
>> Ultimately, as far as I see it, both the OR-based and the AND-based UAPI
>> addition could be made to work in a way that's perhaps similarly backwards
>> compatible. It needs to be worked out with the bridge maintainers. Given
>> that I'm not doing great with my spare time, I will take a back seat on
>> that.
> 
> Nik, do you have any objection to the following proposal?
> 
> mcast_flood ->          default/    off         on
> (existing flag)         missing     (specified/ (specified/
>                         (on)        nlmsg)      nlmsg)
> 
> mcast_flood_rfc4541
> (proposed new flag)
>      |
>      v
> default/                flood all   no flood    flood all
> missing
> (off)
> 
> off                     flood all   no flood    flood all
> (specified/nlmsg)
> 
> on                      flood all   flood 4541  flood all
> (specified/nlmsg)                   ^^^^^^^^^^
>                                     only behavior change
> 
> 
> Basically the attributes are OR'ed together to form the final flooding decision.
> 
> 

Looks good to me. Please make use of the boolopt uapi to avoid adding new
nl attributes.

Thanks,
 Nik

[RFC,net-next,00/10] MC Flood disable and snooping

Message

Comments