Message ID | 20240131123544.462597-3-tobias@waldekranz.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | net: bridge: switchdev: Skip MDB replays of pending events | expand |
On Wed, Jan 31, 2024 at 01:35:44PM +0100, Tobias Waldekranz wrote: > Generating the list of events MDB to replay races against the IGMP/MLD > snooping logic, which may concurrently enqueue events to the switchdev > deferred queue, leading to duplicate events being sent to drivers. > > Avoid this by grabbing the write-side lock of the MDB, and make sure > that a deferred version of a replay event is not already enqueued to > the switchdev deferred queue before adding it to the replay list. > > An easy way to reproduce this issue, on an mv88e6xxx system, was to > create a snooping bridge, and immediately add a port to it: > > root@infix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && \ > > ip link set dev x3 up master br0 > root@infix-06-0b-00:~$ ip link del dev br0 > root@infix-06-0b-00:~$ mvls atu > ADDRESS FID STATE Q F 0 1 2 3 4 5 6 7 8 9 a > DEV:0 Marvell 88E6393X > 33:33:00:00:00:6a 1 static - - 0 . . . . . . . . . . > 33:33:ff:87:e4:3f 1 static - - 0 . . . . . . . . . . > ff:ff:ff:ff:ff:ff 1 static - - 0 1 2 3 4 5 6 7 8 9 a > root@infix-06-0b-00:~$ > > The two IPv6 groups remain in the hardware database because the > port (x3) is notified of the host's membership twice: once in the > original event and once in a replay. Since DSA tracks host addresses > using reference counters, and only a single delete notification is > sent, the count remains at 1 when the bridge is destroyed. > It's not really my business as to how the network maintainers handle this, but if you intend this to go to 'net', you should provide a Fixes: tag. And to make a compelling case for a submission to 'net', you should start off by explaining what the user-visible impact of the bug is. > Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> > --- > net/bridge/br_switchdev.c | 44 ++++++++++++++++++++++++--------------- > 1 file changed, 27 insertions(+), 17 deletions(-) > > diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c > index ee84e783e1df..a3481190d5e6 100644 > --- a/net/bridge/br_switchdev.c > +++ b/net/bridge/br_switchdev.c > @@ -595,6 +595,8 @@ br_switchdev_mdb_replay_one(struct notifier_block *nb, struct net_device *dev, > } > > static int br_switchdev_mdb_queue_one(struct list_head *mdb_list, > + struct net_device *dev, > + unsigned long action, > enum switchdev_obj_id id, > const struct net_bridge_mdb_entry *mp, > struct net_device *orig_dev) > @@ -608,8 +610,17 @@ static int br_switchdev_mdb_queue_one(struct list_head *mdb_list, > mdb->obj.id = id; > mdb->obj.orig_dev = orig_dev; > br_switchdev_mdb_populate(mdb, mp); > - list_add_tail(&mdb->obj.list, mdb_list); > > + if (switchdev_port_obj_is_deferred(dev, action, &mdb->obj)) { > + /* This event is already in the deferred queue of > + * events, so this replay must be elided, lest the > + * driver receives duplicate events for it. > + */ > + kfree(mdb); Would it make sense to make "mdb" a local on-stack variable, and kmemdup() it only if it needs to be queued? > + return 0; > + } > + > + list_add_tail(&mdb->obj.list, mdb_list); > return 0; > } > > @@ -677,22 +688,26 @@ br_switchdev_mdb_replay(struct net_device *br_dev, struct net_device *dev, > if (!br_opt_get(br, BROPT_MULTICAST_ENABLED)) > return 0; > > - /* We cannot walk over br->mdb_list protected just by the rtnl_mutex, > - * because the write-side protection is br->multicast_lock. But we > - * need to emulate the [ blocking ] calling context of a regular > - * switchdev event, so since both br->multicast_lock and RCU read side > - * critical sections are atomic, we have no choice but to pick the RCU > - * read side lock, queue up all our events, leave the critical section > - * and notify switchdev from blocking context. > + if (adding) > + action = SWITCHDEV_PORT_OBJ_ADD; > + else > + action = SWITCHDEV_PORT_OBJ_DEL; > + > + /* br_switchdev_mdb_queue_one will take care to not queue a () after function names > + * replay of an event that is already pending in the switchdev > + * deferred queue. In order to safely determine that, there > + * must be no new deferred MDB notifications enqueued for the > + * duration of the MDB scan. Therefore, grab the write-side > + * lock to avoid racing with any concurrent IGMP/MLD snooping. > */ > - rcu_read_lock(); > + spin_lock_bh(&br->multicast_lock); > > hlist_for_each_entry_rcu(mp, &br->mdb_list, mdb_node) { hlist_for_each_entry() > struct net_bridge_port_group __rcu * const *pp; > const struct net_bridge_port_group *p; > > if (mp->host_joined) { > - err = br_switchdev_mdb_queue_one(&mdb_list, > + err = br_switchdev_mdb_queue_one(&mdb_list, dev, action, > SWITCHDEV_OBJ_ID_HOST_MDB, > mp, br_dev); > if (err) { > @@ -706,7 +721,7 @@ br_switchdev_mdb_replay(struct net_device *br_dev, struct net_device *dev, > if (p->key.port->dev != dev) > continue; > > - err = br_switchdev_mdb_queue_one(&mdb_list, > + err = br_switchdev_mdb_queue_one(&mdb_list, dev, action, > SWITCHDEV_OBJ_ID_PORT_MDB, > mp, dev); > if (err) { > @@ -716,12 +731,7 @@ br_switchdev_mdb_replay(struct net_device *br_dev, struct net_device *dev, > } > } > > - rcu_read_unlock(); > - > - if (adding) > - action = SWITCHDEV_PORT_OBJ_ADD; > - else > - action = SWITCHDEV_PORT_OBJ_DEL; > + spin_unlock_bh(&br->multicast_lock); > > list_for_each_entry(obj, &mdb_list, list) { > err = br_switchdev_mdb_replay_one(nb, dev, > -- > 2.34.1 >
On Wed, Jan 31, 2024 at 01:35:44PM +0100, Tobias Waldekranz wrote: > list_for_each_entry(obj, &mdb_list, list) { > err = br_switchdev_mdb_replay_one(nb, dev, > -- > 2.34.1 > I think there's one more race to deal with. If the switchdev driver has signaled SWITCHDEV_BRPORT_UNOFFLOADED, it may be that there are still deferred port object deletions. If the switchdev port is under a LAG which is under the bridge AND is leaving the LAG, those deferred deletions might run too late, aka after it will no longer process the deletions, since it has left the bridge constellation. To fix that, we need another switchdev_deferred_process() call, after the br_switchdev_mdb_replay_one() calls, while still under rtnl_lock(). The existing switchdev_deferred_process() call from del_nbp() will not help, since the net_bridge_port (the LAG) does _not_ disappear.
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c index ee84e783e1df..a3481190d5e6 100644 --- a/net/bridge/br_switchdev.c +++ b/net/bridge/br_switchdev.c @@ -595,6 +595,8 @@ br_switchdev_mdb_replay_one(struct notifier_block *nb, struct net_device *dev, } static int br_switchdev_mdb_queue_one(struct list_head *mdb_list, + struct net_device *dev, + unsigned long action, enum switchdev_obj_id id, const struct net_bridge_mdb_entry *mp, struct net_device *orig_dev) @@ -608,8 +610,17 @@ static int br_switchdev_mdb_queue_one(struct list_head *mdb_list, mdb->obj.id = id; mdb->obj.orig_dev = orig_dev; br_switchdev_mdb_populate(mdb, mp); - list_add_tail(&mdb->obj.list, mdb_list); + if (switchdev_port_obj_is_deferred(dev, action, &mdb->obj)) { + /* This event is already in the deferred queue of + * events, so this replay must be elided, lest the + * driver receives duplicate events for it. + */ + kfree(mdb); + return 0; + } + + list_add_tail(&mdb->obj.list, mdb_list); return 0; } @@ -677,22 +688,26 @@ br_switchdev_mdb_replay(struct net_device *br_dev, struct net_device *dev, if (!br_opt_get(br, BROPT_MULTICAST_ENABLED)) return 0; - /* We cannot walk over br->mdb_list protected just by the rtnl_mutex, - * because the write-side protection is br->multicast_lock. But we - * need to emulate the [ blocking ] calling context of a regular - * switchdev event, so since both br->multicast_lock and RCU read side - * critical sections are atomic, we have no choice but to pick the RCU - * read side lock, queue up all our events, leave the critical section - * and notify switchdev from blocking context. + if (adding) + action = SWITCHDEV_PORT_OBJ_ADD; + else + action = SWITCHDEV_PORT_OBJ_DEL; + + /* br_switchdev_mdb_queue_one will take care to not queue a + * replay of an event that is already pending in the switchdev + * deferred queue. In order to safely determine that, there + * must be no new deferred MDB notifications enqueued for the + * duration of the MDB scan. Therefore, grab the write-side + * lock to avoid racing with any concurrent IGMP/MLD snooping. */ - rcu_read_lock(); + spin_lock_bh(&br->multicast_lock); hlist_for_each_entry_rcu(mp, &br->mdb_list, mdb_node) { struct net_bridge_port_group __rcu * const *pp; const struct net_bridge_port_group *p; if (mp->host_joined) { - err = br_switchdev_mdb_queue_one(&mdb_list, + err = br_switchdev_mdb_queue_one(&mdb_list, dev, action, SWITCHDEV_OBJ_ID_HOST_MDB, mp, br_dev); if (err) { @@ -706,7 +721,7 @@ br_switchdev_mdb_replay(struct net_device *br_dev, struct net_device *dev, if (p->key.port->dev != dev) continue; - err = br_switchdev_mdb_queue_one(&mdb_list, + err = br_switchdev_mdb_queue_one(&mdb_list, dev, action, SWITCHDEV_OBJ_ID_PORT_MDB, mp, dev); if (err) { @@ -716,12 +731,7 @@ br_switchdev_mdb_replay(struct net_device *br_dev, struct net_device *dev, } } - rcu_read_unlock(); - - if (adding) - action = SWITCHDEV_PORT_OBJ_ADD; - else - action = SWITCHDEV_PORT_OBJ_DEL; + spin_unlock_bh(&br->multicast_lock); list_for_each_entry(obj, &mdb_list, list) { err = br_switchdev_mdb_replay_one(nb, dev,