
[RFC,net-next,00/13] vxlan: Add MDB support

Message ID 20230204170801.3897900-1-idosch@nvidia.com (mailing list archive)
Series vxlan: Add MDB support

Message

Ido Schimmel Feb. 4, 2023, 5:07 p.m. UTC
tl;dr
=====

This patchset implements MDB support in the VXLAN driver, allowing it to
selectively forward IP multicast traffic to VTEPs with interested
receivers instead of flooding it to all the VTEPs as BUM. The motivating
use case is intra- and inter-subnet multicast forwarding using EVPN
[1][2], which means that MDB entries are only installed by the user
space control plane and no snooping is implemented, thereby avoiding a
lot of unnecessary complexity in the kernel.

Background
==========

Both the bridge and VXLAN drivers have an FDB that allows them to
forward Ethernet frames based on their destination MAC addresses and
VLAN/VNI. These FDBs are managed using the same PF_BRIDGE/RTM_*NEIGH
netlink messages and bridge(8) utility.

However, only the bridge driver has an MDB that allows it to selectively
forward IP multicast packets to bridge ports with interested receivers
behind them, based on (S, G) and (*, G) MDB entries. When these packets
reach the VXLAN driver they are flooded using the "all-zeros" FDB entry
(00:00:00:00:00:00). The entry either includes the list of all the VTEPs
in the tenant domain (when ingress replication is used) or the multicast
address of the BUM tunnel (when P2MP tunnels are used), which all the
VTEPs join.

Networks that make heavy use of multicast in the overlay can benefit
from a solution that allows them to selectively forward IP multicast
traffic only to VTEPs with interested receivers. Such a solution is
described in the next section.

Motivation
==========

RFC 7432 [3] defines a "MAC/IP Advertisement route" (type 2) [4] that
allows VTEPs in the EVPN network to advertise and learn reachability
information for unicast MAC addresses. Traffic destined to a unicast MAC
address can therefore be selectively forwarded to a single VTEP behind
which the MAC is located.

The same is not true for IP multicast traffic. Such traffic is simply
flooded as BUM to all VTEPs in the broadcast domain (BD) / subnet,
regardless of whether a VTEP has interested receivers for the multicast
stream. This is especially problematic for overlay networks that make
heavy use of multicast.

The issue is addressed by RFC 9251 [1], which defines a "Selective
Multicast Ethernet Tag Route" (type 6) [5] that allows VTEPs in the
EVPN network to advertise multicast streams that they are interested in.
This is done by having each VTEP suppress IGMP/MLD packets from being
transmitted to the NVE network and instead communicate the information
over BGP to other VTEPs.

The draft in [2] further extends RFC 9251 with procedures to allow
efficient forwarding of IP multicast traffic not only in a given subnet,
but also between different subnets in a tenant domain.

The required changes in the bridge driver to support the above were
already merged in merge commit 8150f0cfb24f ("Merge branch
'bridge-mcast-extensions-for-evpn'"). However, full support entails MDB
support in the VXLAN driver so that it will be able to selectively
forward IP multicast traffic only to VTEPs with interested receivers.
The implementation of this MDB is described in the next section.

Implementation
==============

The user interface is extended to allow user space to specify the
destination VTEP(s) and related parameters. Example usage:

 # bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent dst 198.51.100.1
 # bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent dst 192.0.2.1

 $ bridge -d -s mdb show
 dev vxlan0 port vxlan0 grp 239.1.1.1 permanent filter_mode exclude proto static dst 192.0.2.1    0.00
 dev vxlan0 port vxlan0 grp 239.1.1.1 permanent filter_mode exclude proto static dst 198.51.100.1    0.00

Since the MDB is fully managed by user space and since snooping is not
implemented, only permanent entries can be installed and temporary
entries are rejected by the kernel.
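
For example, a request for a temporary entry such as the following
(an illustrative command; note the "temp" keyword instead of
"permanent") would be rejected:

 # bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 temp dst 198.51.100.1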

The netlink interface is extended with a few new attributes in the
RTM_NEWMDB / RTM_DELMDB request messages:

[ struct nlmsghdr ]
[ struct br_port_msg ]
[ MDBA_SET_ENTRY ]
	struct br_mdb_entry
[ MDBA_SET_ENTRY_ATTRS ]
	[ MDBE_ATTR_SOURCE ]
		struct in_addr / struct in6_addr
	[ MDBE_ATTR_SRC_LIST ]
		[ MDBE_SRC_LIST_ENTRY ]
			[ MDBE_SRCATTR_ADDRESS ]
				struct in_addr / struct in6_addr
		[ ...]
	[ MDBE_ATTR_GROUP_MODE ]
		u8
	[ MDBE_ATTR_RTPORT ]
		u8
	[ MDBE_ATTR_DST ]	// new
		struct in_addr / struct in6_addr
	[ MDBE_ATTR_DST_PORT ]	// new
		u16
	[ MDBE_ATTR_VNI ]	// new
		u32
	[ MDBE_ATTR_IFINDEX ]	// new
		s32
	[ MDBE_ATTR_SRC_VNI ]	// new
		u32

RTM_NEWMDB / RTM_DELMDB responses and notifications are extended with
corresponding attributes.
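
To illustrate how the new attributes map to the user interface, they are
exposed as additional command line options in the iproute2 branch in [6].
A hypothetical example (option names are illustrative and may differ in
the final version):

 # bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent dst 198.51.100.1 dst_port 4789 vni 10 src_vni 20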

One MDB entry that can be installed in the VXLAN MDB, but not in the
bridge MDB, is the catchall entry (0.0.0.0 / ::). It is used to transmit
unregistered multicast traffic that is not link-local and is especially
useful when inter-subnet multicast forwarding is required. See patch #12
for a detailed explanation and motivation. It is similar to the
"all-zeros" FDB entry that can be installed in the VXLAN FDB, but not
the bridge FDB.
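
For example, the following (hypothetical) commands install IPv4 and IPv6
catchall entries that forward unregistered multicast traffic to a single
VTEP:

 # bridge mdb add dev vxlan0 port vxlan0 grp 0.0.0.0 permanent dst 198.51.100.1
 # bridge mdb add dev vxlan0 port vxlan0 grp :: permanent dst 198.51.100.1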

"added_by_star_ex" entries?
---------------------------

The bridge driver automatically installs (S, G) MDB port group entries
marked as "added_by_star_ex" whenever it detects that an (S, G) entry
can prevent traffic from being forwarded via a port associated with an
EXCLUDE (*, G) entry. The bridge will add the port to the port group of
the (S, G) entry, thereby creating a new port group entry. The
complexity associated with these entries is not trivial, but it needs to
reside in the bridge driver because the bridge automatically installs MDB
entries in response to snooped IGMP / MLD packets.

The same is not true for the VXLAN MDB, which is entirely managed by
user space; user space is fully capable of forming the correct
replication lists on its own. In addition, the complexity associated
with the "added_by_star_ex" entries in the VXLAN driver is higher
compared to the bridge: Whenever a remote VTEP is added to the catchall
entry, it needs to be added to all the existing MDB entries, as such a
remote has requested that all multicast traffic be forwarded to it.
Similarly, whenever a (*, G) or (S, G) entry is added, all the remotes
associated with the catchall entry need to be added to it.

Given the above, this RFC does not implement support for such entries.
One argument against this decision could be that in the future someone
might want to populate the VXLAN MDB in response to decapsulated IGMP /
MLD packets rather than according to EVPN routes. Despite my doubts
about this possibility, it is unclear to me why the snooping
functionality cannot be implemented in user space by opening an
AF_PACKET socket on the VXLAN device and sniffing IGMP / MLD packets.

I believe that the decision to place snooping functionality in the
bridge driver was made without appreciation for the complexity that
IGMPv3 support would bring and that a more informed decision should be
made for the VXLAN driver.

Testing
=======

No regressions in existing VXLAN / MDB selftests. Will add dedicated
selftests in v1.

Patchset overview
=================

Patches #1-#3 are small preparations in the bridge driver. I plan to
submit them separately together with an MDB dump test case.

Patches #4-#6 are additional preparations centered around the extraction
of the MDB netlink handlers from the bridge driver to the common
rtnetlink code. This allows reusing the existing MDB netlink messages
for the configuration of the VXLAN MDB.

Patches #7-#9 include more small preparations in the common rtnetlink
code and the VXLAN driver.

Patch #10 implements the MDB control path in the VXLAN driver, which
will allow user space to create, delete, replace and dump MDB entries.

Patches #11-#12 implement the MDB data path in the VXLAN driver,
allowing it to selectively forward IP multicast traffic according to the
matched MDB entry.

Patch #13 finally enables MDB support in the VXLAN driver.

iproute2 patches can be found here [6].

Note that in order to fully support the specifications in [1] and [2],
additional functionality is required from the data path. However, it can
be achieved using existing kernel interfaces, which is why it is not
described here.

[1] https://datatracker.ietf.org/doc/html/rfc9251
[2] https://datatracker.ietf.org/doc/html/draft-ietf-bess-evpn-irb-mcast
[3] https://datatracker.ietf.org/doc/html/rfc7432
[4] https://datatracker.ietf.org/doc/html/rfc7432#section-7.2
[5] https://datatracker.ietf.org/doc/html/rfc9251#section-9.1
[6] https://github.com/idosch/iproute2/commits/submit/mdb_vxlan_rfc_v1

Ido Schimmel (13):
  bridge: mcast: Use correct define in MDB dump
  bridge: mcast: Remove pointless sequence generation counter assignment
  bridge: mcast: Move validation to a policy
  net: Add MDB net device operations
  bridge: mcast: Implement MDB net device operations
  rtnetlink: bridge: mcast: Move MDB handlers out of bridge driver
  rtnetlink: bridge: mcast: Relax group address validation in common
    code
  vxlan: Move address helpers to private headers
  vxlan: Expose vxlan_xmit_one()
  vxlan: mdb: Add MDB control path support
  vxlan: mdb: Add an internal flag to indicate MDB usage
  vxlan: Add MDB data path support
  vxlan: Enable MDB support

 drivers/net/vxlan/Makefile        |    2 +-
 drivers/net/vxlan/vxlan_core.c    |   78 +-
 drivers/net/vxlan/vxlan_mdb.c     | 1484 +++++++++++++++++++++++++++++
 drivers/net/vxlan/vxlan_private.h |   84 ++
 include/linux/netdevice.h         |   21 +
 include/net/vxlan.h               |    6 +
 include/uapi/linux/if_bridge.h    |   10 +
 net/bridge/br_device.c            |    3 +
 net/bridge/br_mdb.c               |  214 +----
 net/bridge/br_netlink.c           |    3 -
 net/bridge/br_private.h           |   22 +-
 net/core/rtnetlink.c              |  215 +++++
 12 files changed, 1907 insertions(+), 235 deletions(-)
 create mode 100644 drivers/net/vxlan/vxlan_mdb.c

Comments

Nikolay Aleksandrov Feb. 6, 2023, 11:24 p.m. UTC | #1
On 2/4/23 19:07, Ido Schimmel wrote:
> tl;dr
> =====
> 
> This patchset implements MDB support in the VXLAN driver, allowing it to
> selectively forward IP multicast traffic to VTEPs with interested
> receivers instead of flooding it to all the VTEPs as BUM. The motivating
> use case is intra and inter subnet multicast forwarding using EVPN
> [1][2], which means that MDB entries are only installed by the user
> space control plane and no snooping is implemented, thereby avoiding a
> lot of unnecessary complexity in the kernel.
> 
> Background
> ==========
> 
> Both the bridge and VXLAN drivers have an FDB that allows them to
> forward Ethernet frames based on their destination MAC addresses and
> VLAN/VNI. These FDBs are managed using the same PF_BRIDGE/RTM_*NEIGH
> netlink messages and bridge(8) utility.
> 
> However, only the bridge driver has an MDB that allows it to selectively
> forward IP multicast packets to bridge ports with interested receivers
> behind them, based on (S, G) and (*, G) MDB entries. When these packets
> reach the VXLAN driver they are flooded using the "all-zeros" FDB entry
> (00:00:00:00:00:00). The entry either includes the list of all the VTEPs
> in the tenant domain (when ingress replication is used) or the multicast
> address of the BUM tunnel (when P2MP tunnels are used), to which all the
> VTEPs join.
> 
> Networks that make heavy use of multicast in the overlay can benefit
> from a solution that allows them to selectively forward IP multicast
> traffic only to VTEPs with interested receivers. Such a solution is
> described in the next section.
> 
> Motivation
> ==========
> 
> RFC 7432 [3] defines a "MAC/IP Advertisement route" (type 2) [4] that
> allows VTEPs in the EVPN network to advertise and learn reachability
> information for unicast MAC addresses. Traffic destined to a unicast MAC
> address can therefore be selectively forwarded to a single VTEP behind
> which the MAC is located.
> 
> The same is not true for IP multicast traffic. Such traffic is simply
> flooded as BUM to all VTEPs in the broadcast domain (BD) / subnet,
> regardless if a VTEP has interested receivers for the multicast stream
> or not. This is especially problematic for overlay networks that make
> heavy use of multicast.
> 
> The issue is addressed by RFC 9251 [1] that defines a "Selective
> Multicast Ethernet Tag Route" (type 6) [5] which allows VTEPs in the
> EVPN network to advertise multicast streams that they are interested in.
> This is done by having each VTEP suppress IGMP/MLD packets from being
> transmitted to the NVE network and instead communicate the information
> over BGP to other VTEPs.
> 
> The draft in [2] further extends RFC 9251 with procedures to allow
> efficient forwarding of IP multicast traffic not only in a given subnet,
> but also between different subnets in a tenant domain.
> 
> The required changes in the bridge driver to support the above were
> already merged in merge commit 8150f0cfb24f ("Merge branch
> 'bridge-mcast-extensions-for-evpn'"). However, full support entails MDB
> support in the VXLAN driver so that it will be able to selectively
> forward IP multicast traffic only to VTEPs with interested receivers.
> The implementation of this MDB is described in the next section.
> 
> Implementation
> ==============
> 
> The user interface is extended to allow user space to specify the
> destination VTEP(s) and related parameters. Example usage:
> 
>   # bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent dst 198.51.100.1
>   # bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent dst 192.0.2.1
> 
>   $ bridge -d -s mdb show
>   dev vxlan0 port vxlan0 grp 239.1.1.1 permanent filter_mode exclude proto static dst 192.0.2.1    0.00
>   dev vxlan0 port vxlan0 grp 239.1.1.1 permanent filter_mode exclude proto static dst 198.51.100.1    0.00
> 
> Since the MDB is fully managed by user space and since snooping is not
> implemented, only permanent entries can be installed and temporary
> entries are rejected by the kernel.
> 
> The netlink interface is extended with a few new attributes in the
> RTM_NEWMDB / RTM_DELMDB request messages:
> 
> [ struct nlmsghdr ]
> [ struct br_port_msg ]
> [ MDBA_SET_ENTRY ]
> 	struct br_mdb_entry
> [ MDBA_SET_ENTRY_ATTRS ]
> 	[ MDBE_ATTR_SOURCE ]
> 		struct in_addr / struct in6_addr
> 	[ MDBE_ATTR_SRC_LIST ]
> 		[ MDBE_SRC_LIST_ENTRY ]
> 			[ MDBE_SRCATTR_ADDRESS ]
> 				struct in_addr / struct in6_addr
> 		[ ...]
> 	[ MDBE_ATTR_GROUP_MODE ]
> 		u8
> 	[ MDBE_ATTR_RTPORT ]
> 		u8
> 	[ MDBE_ATTR_DST ]	// new
> 		struct in_addr / struct in6_addr
> 	[ MDBE_ATTR_DST_PORT ]	// new
> 		u16
> 	[ MDBE_ATTR_VNI ]	// new
> 		u32
> 	[ MDBE_ATTR_IFINDEX ]	// new
> 		s32
> 	[ MDBE_ATTR_SRC_VNI ]	// new
> 		u32
> 
> RTM_NEWMDB / RTM_DELMDB responses and notifications are extended with
> corresponding attributes.
> 
> One MDB entry that can be installed in the VXLAN MDB, but not in the
> bridge MDB is the catchall entry (0.0.0.0 / ::). It is used to transmit
> unregistered multicast traffic that is not link-local and is especially
> useful when inter-subnet multicast forwarding is required. See patch #12
> for a detailed explanation and motivation. It is similar to the
> "all-zeros" FDB entry that can be installed in the VXLAN FDB, but not
> the bridge FDB.
> 
> "added_by_star_ex" entries?
> ---------------------------
> 
> The bridge driver automatically installs (S, G) MDB port group entries
> marked as "added_by_star_ex" whenever it detects that an (S, G) entry
> can prevent traffic from being forwarded via a port associated with an
> EXCLUDE (*, G) entry. The bridge will add the port to the port group of
> the (S, G) entry, thereby creating a new port group entry. The
> complexity associated with these entries is not trivial, but it needs to
> reside in the bridge driver because it automatically installs MDB
> entries in response to snooped IGMP / MLD packets.
> 
> The same in not true for the VXLAN MDB which is entirely managed by user
> space who is fully capable of forming the correct replication lists on
> its own. In addition, the complexity associated with the
> "added_by_star_ex" entries in the VXLAN driver is higher compared to the
> bridge: Whenever a remote VTEP is added to the catchall entry, it needs
> to be added to all the existing MDB entries, as such a remote requested
> all the multicast traffic to be forwarded to it. Similarly, whenever an
> (*, G) or (S, G) entry is added, all the remotes associated with the
> catchall entry need to be added to it.
> 
> Given the above, this RFC does not implement support for such entries.
> One argument against this decision can be that in the future someone
> might want to populate the VXLAN MDB in response to decapsulated IGMP /
> MLD packets and not according to EVPN routes. Regardless of my doubts
> regarding this possibility, it is unclear to me why the snooping
> functionality cannot be implemented in user space by opening an
> AF_PACKET socket on the VXLAN device and sniffing IGMP / MLD packets.
> 
> I believe that the decision to place snooping functionality in the
> bridge driver was made without appreciation for the complexity that
> IGMPv3 support would bring and that a more informed decision should be
> made for the VXLAN driver.
> 

Hmm, while I agree that having the control plane in user-space is nice,
I do like having a relatively straightforward and well-maintained
protocol implementation in the kernel too, similar to the kernel's
IGMPv3 client support, which doesn't need third party packages or
external software libraries to work. That being said, I do have (an
unfinished) patch-set that adds a bridge daemon to FRR; I think we can
always add a knob to switch to some more advanced user-space daemon
that can snoop.

Anyway, to the point - this patch-set looks ok to me, from the bridge
PoV it's mostly code shuffling, and the new vxlan code is fairly
straightforward.

Cheers,
  Nik
Ido Schimmel Feb. 7, 2023, 9:25 a.m. UTC | #2
On Tue, Feb 07, 2023 at 12:24:25AM +0100, Nikolay Aleksandrov wrote:
> Hmm, while I agree that having the control plane in user-space is nice,
> I do like having a relatively straight-forward and well maintained
> protocol implementation in the kernel too, similar to its IGMPv3 client
> support which doesn't need third party packages or external software
> libraries to work. That being said, I do have (an unfinished) patch-set
> that adds a bridge daemon to FRR, I think we can always add a knob to
> switch to some more advanced user-space daemon which can snoop.
> 
> Anyway to the point - this patch-set looks ok to me, from bridge PoV
> it's mostly code shuffling, and the new vxlan code is fairly straight-
> forward.

Thanks for taking a look. I was hoping you would comment on this
section... :)

After sending the RFC I realized that what I wrote about the user space
implementation is not accurate. An AF_PACKET socket opened on the VXLAN
device will only give you the decapsulated IGMP / MLD packets. You
wouldn't know from which remote VTEP they arrived. However, my point
still stands: As long as the kernel is not performing snooping, we can
defer the forming of the replication lists to user space and avoid the
complexity of the "added_by_star_ex" entries (among many other things).
If in the future we need to implement snooping in the kernel, then we
will expose a new knob (e.g., "mcast_snooping", default off), which will
also enable the "added_by_star_ex" entries.

I tried looking at what other implementations are doing and my impression
is that by "VXLAN IGMP snooping" they all refer to the snooping done in
the bridge driver. That is, instead of treating the VXLAN port as a
router port, the bridge will only forward specific groups to the VXLAN
port, but this multicast traffic will be forwarded to all the VTEPs.
This is already supported by the kernel.

Regarding what you wrote about a new knob in the bridge driver, do you
mean that this knob would enable MDB lookup regardless of
"mcast_snooping"?
Currently this knob enables both snooping and MDB lookup. Note that I
didn't add a new knob to the VXLAN device because I figured that if user
space doesn't want MDB lookup, then it will not configure MDB entries.

Thanks!
Nikolay Aleksandrov Feb. 7, 2023, 9:02 p.m. UTC | #3
On 2/7/23 11:25, Ido Schimmel wrote:
> On Tue, Feb 07, 2023 at 12:24:25AM +0100, Nikolay Aleksandrov wrote:
>> Hmm, while I agree that having the control plane in user-space is nice,
>> I do like having a relatively straight-forward and well maintained
>> protocol implementation in the kernel too, similar to its IGMPv3 client
>> support which doesn't need third party packages or external software
>> libraries to work. That being said, I do have (an unfinished) patch-set
>> that adds a bridge daemon to FRR, I think we can always add a knob to
>> switch to some more advanced user-space daemon which can snoop.
>>
>> Anyway to the point - this patch-set looks ok to me, from bridge PoV
>> it's mostly code shuffling, and the new vxlan code is fairly straight-
>> forward.
> 
> Thanks for taking a look. I was hoping you would comment on this
> section... :)
>

:)

> After sending the RFC I realized that what I wrote about the user space
> implementation is not accurate. An AF_PACKET socket opened on the VXLAN
> device will only give you the decapsulated IGMP / MLD packets. You
> wouldn't know from which remote VTEP they arrived. However, my point
> still stands: As long as the kernel is not performing snooping we can
> defer the forming of the replication lists to user space and avoid the
> complexity of the "added_by_star_ex" entries (among many other things).
> If in the future we need to implement snooping in the kernel, then we
> will expose a new knob (e.g., "mcast_snooping", default off), which will
> also enable the "added_by_star_ex" entries.
> 

Yep, I agree that it would be best for this case and we don't need the 
extra complexity in the kernel. I was referring more to the standard
IGMPv3 implementation (both client and bridge).

> I tried looking what other implementations are doing and my impression
> is that by "VXLAN IGMP snooping" they all refer to the snooping done in
> the bridge driver. That is, instead of treating the VXLAN port as a
> router port, the bridge will only forward specific groups to the VXLAN
> port, but this multicast traffic will be forwarded to all the VTEPs.
> This is already supported by the kernel.
> 
> Regarding what you wrote about a new knob in the bridge driver, you mean
> that this knob will enable MDB lookup regardless of "mcast_snooping"?

Yep, we can implement the snooping logic in user-space and use the
bridge only as a dataplane (that's what my bridge daemon in FRR was
going to do for IGMPv3 and also explicit host tracking).

> Currently this knob enables both snooping and MDB lookup. Note that I
> didn't add a new knob to the VXLAN device because I figured that if user
> space doesn't want MDB lookup, then it will not configure MDB entries.
>

Yeah, of course. The set makes sense as it is since vxlan's logic would
be in user-space.

> Thanks!