Message ID | 20240201161045.1956074-1-tobias@waldekranz.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Series | [v3,net] net: bridge: switchdev: Skip MDB replays of pending events |
Thu, Feb 01, 2024 at 05:10:45PM CET, tobias@waldekranz.com wrote:
>Before this change, generation of the list of events MDB to replay
>would race against the IGMP/MLD snooping logic, which could concurrently
>enqueue events to the switchdev deferred queue, leading to duplicate
>events being sent to drivers. As a consequence of this, drivers which
>reference count memberships (at least DSA), would be left with orphan
>groups in their hardware database when the bridge was destroyed.
>
>Avoid this by grabbing the write-side lock of the MDB while generating
>the replay list, making sure that no deferred version of a replay
>event is already enqueued to the switchdev deferred queue, before
>adding it to the replay list.
>
>An easy way to reproduce this issue, on an mv88e6xxx system, was to
>create a snooping bridge, and immediately add a port to it:
>
> root@infix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && \
> > ip link set dev x3 up master br0
> root@infix-06-0b-00:~$ ip link del dev br0
> root@infix-06-0b-00:~$ mvls atu
> ADDRESS              FID  STATE      Q  F  0  1  2  3  4  5  6  7  8  9  a
> DEV:0  Marvell 88E6393X
> 33:33:00:00:00:6a      1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
> 33:33:ff:87:e4:3f      1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
> ff:ff:ff:ff:ff:ff      1  static     -  -  0  1  2  3  4  5  6  7  8  9  a
> root@infix-06-0b-00:~$
>
>The two IPv6 groups remain in the hardware database because the
>port (x3) is notified of the host's membership twice: once via the
>original event and once via a replay. Since only a single delete
>notification is sent, the count remains at 1 when the bridge is
>destroyed.
>
>Fixes: 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries")
>Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>

Could you please maintain a 24-hour period between sending another patch
version?

https://www.kernel.org/doc/html/v6.7/process/maintainer-netdev.html#tl-dr
On tor, feb 01, 2024 at 17:24, Jiri Pirko <jiri@resnulli.us> wrote:
> Thu, Feb 01, 2024 at 05:10:45PM CET, tobias@waldekranz.com wrote:
>>Before this change, generation of the list of events MDB to replay
>>would race against the IGMP/MLD snooping logic, which could concurrently
>>enqueue events to the switchdev deferred queue, leading to duplicate
>>events being sent to drivers. As a consequence of this, drivers which
>>reference count memberships (at least DSA), would be left with orphan
>>groups in their hardware database when the bridge was destroyed.
>>
>>Avoid this by grabbing the write-side lock of the MDB while generating
>>the replay list, making sure that no deferred version of a replay
>>event is already enqueued to the switchdev deferred queue, before
>>adding it to the replay list.
>>
>>An easy way to reproduce this issue, on an mv88e6xxx system, was to
>>create a snooping bridge, and immediately add a port to it:
>>
>> root@infix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && \
>> > ip link set dev x3 up master br0
>> root@infix-06-0b-00:~$ ip link del dev br0
>> root@infix-06-0b-00:~$ mvls atu
>> ADDRESS              FID  STATE      Q  F  0  1  2  3  4  5  6  7  8  9  a
>> DEV:0  Marvell 88E6393X
>> 33:33:00:00:00:6a      1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
>> 33:33:ff:87:e4:3f      1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
>> ff:ff:ff:ff:ff:ff      1  static     -  -  0  1  2  3  4  5  6  7  8  9  a
>> root@infix-06-0b-00:~$
>>
>>The two IPv6 groups remain in the hardware database because the
>>port (x3) is notified of the host's membership twice: once via the
>>original event and once via a replay. Since only a single delete
>>notification is sent, the count remains at 1 when the bridge is
>>destroyed.
>>
>>Fixes: 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries")
>>Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
>
> Could you please maintain a 24-hour period between sending another patch
> version?
>
> https://www.kernel.org/doc/html/v6.7/process/maintainer-netdev.html#tl-dr

Sorry, I will avoid that going forward.
On Thu, Feb 01, 2024 at 05:10:45PM +0100, Tobias Waldekranz wrote:
> Before this change, generation of the list of events MDB to replay

s/events MDB/MDB events/

> would race against the IGMP/MLD snooping logic, which could concurrently

logic. This could (...)

> enqueue events to the switchdev deferred queue, leading to duplicate
> events being sent to drivers. As a consequence of this, drivers which
> reference count memberships (at least DSA), would be left with orphan
> groups in their hardware database when the bridge was destroyed.

Still missing the user impact description, aka "when would this be
noticed by, and actively bother an user?". Something that would justify
handling this via net.git rather than net-next.git.

>
> Avoid this by grabbing the write-side lock of the MDB while generating
> the replay list, making sure that no deferred version of a replay
> event is already enqueued to the switchdev deferred queue, before
> adding it to the replay list.

The description of the solution is actually not very satisfactory to me.
I would have liked to see a more thorough analysis.

The race has 2 components, one comes from the fact that during replay,
we iterate using RCU, which does not halt concurrent updates, and the
other comes from the fact that the MDB addition procedure is non-atomic.
Elements are first added to the br->mdb_list, but are notified to
switchdev in a deferred context.

Grabbing the bridge multicast spinlock only solves the first problem: it
stops new enqueues of deferred events. We also need special handling of
the pending deferred events. The idea there is that we cannot cancel
them, since that would lead to complications for other potential
listeners on the switchdev chain. And we cannot flush them either, since
that wouldn't actually help: flushing needs sleepable context, which is
incompatible with holding br->multicast_lock, and without
br->multicast_lock held, we haven't actually solved anything, since new
deferred events can still be enqueued at any time.

So the only simple solution is to let the pending deferred events
execute (eventually), but during event replay on joining port, exclude
replaying those multicast elements which are in the bridge's multicast
list, but were not yet added through switchdev. Eventually they will be.

(side note: the handling code for excluding replays on pending event
deletion seems to not actually help, because)

Event replays on a switchdev port leaving the bridge are also
problematic, but in a different way. The deletion procedure is also
non-atomic, they are first removed from br->mdb_list then the switchdev
notification is deferred. So, the replay procedure cannot enter a
condition where it replays the deletion twice. But, there is no
switchdev_deferred_process() call when the switchdev port unoffloads an
intermediary LAG bridge port, and this can lead to the situation where
neither the replay, nor the deferred switchdev object, trickle down to a
call in the switchdev driver. So for event deletion, we need to force a
synchronous call to switchdev_deferred_process().

See how the analysis in the commit message changes the patch?

>
> An easy way to reproduce this issue, on an mv88e6xxx system, was to

s/was/is/

> create a snooping bridge, and immediately add a port to it:
>
> root@infix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && \
> > ip link set dev x3 up master br0
> root@infix-06-0b-00:~$ ip link del dev br0
> root@infix-06-0b-00:~$ mvls atu
> ADDRESS              FID  STATE      Q  F  0  1  2  3  4  5  6  7  8  9  a
> DEV:0  Marvell 88E6393X
> 33:33:00:00:00:6a      1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
> 33:33:ff:87:e4:3f      1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
> ff:ff:ff:ff:ff:ff      1  static     -  -  0  1  2  3  4  5  6  7  8  9  a
> root@infix-06-0b-00:~$
>
> The two IPv6 groups remain in the hardware database because the
> port (x3) is notified of the host's membership twice: once via the
> original event and once via a replay. Since only a single delete
> notification is sent, the count remains at 1 when the bridge is
> destroyed.

These 2 paragraphs, plus the mvls output, should be placed before the
"Avoid this" paragraph.

>
> Fixes: 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries")
> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
> ---
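The addition half of this analysis can be pictured as an interleaving. The sketch below is an illustration of the pre-patch behavior described above, not code from the thread or the patch:

/*
 * Illustration of the MDB addition race under the pre-patch scheme
 * (sketch only):
 *
 *   IGMP/MLD snooping                     replay on port join
 *   -----------------                     -------------------
 *   spin_lock(&br->multicast_lock)
 *   add entry to br->mdb_list
 *   enqueue deferred switchdev add
 *   spin_unlock(&br->multicast_lock)
 *                                         rcu_read_lock()
 *                                         walk br->mdb_list; queue the
 *                                         same entry for replay
 *                                         rcu_read_unlock()
 *                                         notify replayed add to driver
 *   switchdev_deferred_process()
 *     -> notify deferred add to driver
 *
 * The driver's refcount for the membership is now 2, but only a single
 * delete will ever be notified, leaving an orphan entry in hardware.
 */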
On mån, feb 05, 2024 at 13:41, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Thu, Feb 01, 2024 at 05:10:45PM +0100, Tobias Waldekranz wrote:
>> Before this change, generation of the list of events MDB to replay
>
> s/events MDB/MDB events/
>
>> would race against the IGMP/MLD snooping logic, which could concurrently
>
> logic. This could (...)
>
>> enqueue events to the switchdev deferred queue, leading to duplicate
>> events being sent to drivers. As a consequence of this, drivers which
>> reference count memberships (at least DSA), would be left with orphan
>> groups in their hardware database when the bridge was destroyed.
>
> Still missing the user impact description, aka "when would this be
> noticed by, and actively bother an user?". Something that would justify
> handling this via net.git rather than net-next.git.

Other than moving up the example to this point, what do you feel is
missing? Or are you saying that you don't think this belongs on net.git?

>>
>> Avoid this by grabbing the write-side lock of the MDB while generating
>> the replay list, making sure that no deferred version of a replay
>> event is already enqueued to the switchdev deferred queue, before
>> adding it to the replay list.
>
> The description of the solution is actually not very satisfactory to me.
> I would have liked to see a more thorough analysis.

Sorry to disappoint.

> The race has 2 components, one comes from the fact that during replay,
> we iterate using RCU, which does not halt concurrent updates, and the
> other comes from the fact that the MDB addition procedure is non-atomic.
> Elements are first added to the br->mdb_list, but are notified to
> switchdev in a deferred context.
>
> Grabbing the bridge multicast spinlock only solves the first problem: it
> stops new enqueues of deferred events. We also need special handling of
> the pending deferred events. The idea there is that we cannot cancel
> them, since that would lead to complications for other potential
> listeners on the switchdev chain. And we cannot flush them either, since
> that wouldn't actually help: flushing needs sleepable context, which is
> incompatible with holding br->multicast_lock, and without
> br->multicast_lock held, we haven't actually solved anything, since new
> deferred events can still be enqueued at any time.
>
> So the only simple solution is to let the pending deferred events
> execute (eventually), but during event replay on joining port, exclude
> replaying those multicast elements which are in the bridge's multicast
> list, but were not yet added through switchdev. Eventually they will be.

In my opinion, your three paragraphs above say pretty much the same
thing as my single paragraph. But sure, I will try to provide more
details in v4.

> (side note: the handling code for excluding replays on pending event
> deletion seems to not actually help, because)
>
> Event replays on a switchdev port leaving the bridge are also
> problematic, but in a different way. The deletion procedure is also
> non-atomic, they are first removed from br->mdb_list then the switchdev
> notification is deferred. So, the replay procedure cannot enter a
> condition where it replays the deletion twice. But, there is no
> switchdev_deferred_process() call when the switchdev port unoffloads an
> intermediary LAG bridge port, and this can lead to the situation where
> neither the replay, nor the deferred switchdev object, trickle down to a
> call in the switchdev driver. So for event deletion, we need to force a
> synchronous call to switchdev_deferred_process().

Good catch!

What does this mean for v4? Is this still one logical change that
ensures that MDB events are always delivered exactly once, or a series
where one patch fixes the duplicates and one ensures that no events
are missed?

> See how the analysis in the commit message changes the patch?

Suddenly the novice was enlightened.

>>
>> An easy way to reproduce this issue, on an mv88e6xxx system, was to
>
> s/was/is/
>
>> create a snooping bridge, and immediately add a port to it:
>>
>> root@infix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && \
>> > ip link set dev x3 up master br0
>> root@infix-06-0b-00:~$ ip link del dev br0
>> root@infix-06-0b-00:~$ mvls atu
>> ADDRESS              FID  STATE      Q  F  0  1  2  3  4  5  6  7  8  9  a
>> DEV:0  Marvell 88E6393X
>> 33:33:00:00:00:6a      1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
>> 33:33:ff:87:e4:3f      1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
>> ff:ff:ff:ff:ff:ff      1  static     -  -  0  1  2  3  4  5  6  7  8  9  a
>> root@infix-06-0b-00:~$
>>
>> The two IPv6 groups remain in the hardware database because the
>> port (x3) is notified of the host's membership twice: once via the
>> original event and once via a replay. Since only a single delete
>> notification is sent, the count remains at 1 when the bridge is
>> destroyed.
>
> These 2 paragraphs, plus the mvls output, should be placed before the
> "Avoid this" paragraph.
>
>>
>> Fixes: 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries")
>> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
>> ---
On Tue, Feb 06, 2024 at 03:54:25PM +0100, Tobias Waldekranz wrote:
> On mån, feb 05, 2024 at 13:41, Vladimir Oltean <olteanv@gmail.com> wrote:
> > On Thu, Feb 01, 2024 at 05:10:45PM +0100, Tobias Waldekranz wrote:
> >> Before this change, generation of the list of events MDB to replay
> >
> > s/events MDB/MDB events/
> >
> >> would race against the IGMP/MLD snooping logic, which could concurrently
> >
> > logic. This could (...)
> >
> >> enqueue events to the switchdev deferred queue, leading to duplicate
> >> events being sent to drivers. As a consequence of this, drivers which
> >> reference count memberships (at least DSA), would be left with orphan
> >> groups in their hardware database when the bridge was destroyed.
> >
> > Still missing the user impact description, aka "when would this be
> > noticed by, and actively bother an user?". Something that would justify
> > handling this via net.git rather than net-next.git.
>
> Other than moving up the example to this point, what do you feel is
> missing? Or are you saying that you don't think this belongs on net.git?

I just don't have enough data to be able to judge (it's missing from the
commit message). I'm looking at Documentation/process/stable-kernel-rules.rst
as a reference. It lists a series of guidelines from which I gather,
generally speaking, that there should exist a user impact on
functionality. The "mvls atu" command is a debug program, it just dumps
hardware tables. It is enough to point out that multicast entries leak,
but not what will be broken as a result of that.

> >> Avoid this by grabbing the write-side lock of the MDB while generating
> >> the replay list, making sure that no deferred version of a replay
> >> event is already enqueued to the switchdev deferred queue, before
> >> adding it to the replay list.
> >
> > The description of the solution is actually not very satisfactory to me.
> > I would have liked to see a more thorough analysis.
>
> Sorry to disappoint.
>
> > The race has 2 components, one comes from the fact that during replay,
> > we iterate using RCU, which does not halt concurrent updates, and the
> > other comes from the fact that the MDB addition procedure is non-atomic.
> > Elements are first added to the br->mdb_list, but are notified to
> > switchdev in a deferred context.
> >
> > Grabbing the bridge multicast spinlock only solves the first problem: it
> > stops new enqueues of deferred events. We also need special handling of
> > the pending deferred events. The idea there is that we cannot cancel
> > them, since that would lead to complications for other potential
> > listeners on the switchdev chain. And we cannot flush them either, since
> > that wouldn't actually help: flushing needs sleepable context, which is
> > incompatible with holding br->multicast_lock, and without
> > br->multicast_lock held, we haven't actually solved anything, since new
> > deferred events can still be enqueued at any time.
> >
> > So the only simple solution is to let the pending deferred events
> > execute (eventually), but during event replay on joining port, exclude
> > replaying those multicast elements which are in the bridge's multicast
> > list, but were not yet added through switchdev. Eventually they will be.
>
> In my opinion, your three paragraphs above say pretty much the same
> thing as my single paragraph. But sure, I will try to provide more
> details in v4.

It's not about word count, I'm notoriously bad at summarizing. If you
could still say in a single paragraph what I did in 3, it would be great.

The difference is that the above 3 paragraphs only explore the race on
MDB addition events. It is of a different nature than the one on deletion.
Your analysis clumps them together, and the code follows suit. There is
code to supposedly handle the race between deferred deletion events and
replay on port removal, but I don't think it does.

> > (side note: the handling code for excluding replays on pending event
> > deletion seems to not actually help, because)
> >
> > Event replays on a switchdev port leaving the bridge are also
> > problematic, but in a different way. The deletion procedure is also
> > non-atomic, they are first removed from br->mdb_list then the switchdev
> > notification is deferred. So, the replay procedure cannot enter a
> > condition where it replays the deletion twice. But, there is no
> > switchdev_deferred_process() call when the switchdev port unoffloads an
> > intermediary LAG bridge port, and this can lead to the situation where
> > neither the replay, nor the deferred switchdev object, trickle down to a
> > call in the switchdev driver. So for event deletion, we need to force a
> > synchronous call to switchdev_deferred_process().
>
> Good catch!
>
> What does this mean for v4? Is this still one logical change that
> ensures that MDB events are always delivered exactly once, or a series
> where one patch fixes the duplicates and one ensures that no events
> are missed?

I'm not saying I know how you can satisfy everyone's review requests.

Once it is clear that deferred additions and deferred removals need
separate treatment, it would probably be preferable to treat them in
separate patches.

> > See how the analysis in the commit message changes the patch?
>
> Suddenly the novice was enlightened.

This, to me, reads like unnecessary sarcasm. I just mean that I don't
think you spent enough time to explore the problem space in the commit
message. It is a very important element of a patch, because it is very
rare for someone to solve a problem which is not even formulated. I
actually tried to be helpful by pointing out where things might not work
out. I'm sorry if it did not appear that way.
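The removal problem singled out here can be pictured the same way; again an illustration of the analysis above, not code from the thread or the patch:

/*
 * Illustration of the deletion problem on a port leaving the bridge
 * (sketch only): no duplicate is possible, but an event can be lost.
 *
 *   IGMP/MLD snooping                     port leaves bridge
 *   -----------------                     ------------------
 *   remove entry from br->mdb_list
 *   enqueue deferred switchdev del
 *                                         replay deletions: the entry
 *                                         is already gone from
 *                                         br->mdb_list, so nothing is
 *                                         replayed
 *                                         port unoffloads; the driver
 *                                         stops receiving switchdev
 *                                         notifications
 *   switchdev_deferred_process()
 *     -> the deferred del no longer
 *        reaches the driver; the entry
 *        leaks in hardware
 */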
On tis, feb 06, 2024 at 21:31, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Tue, Feb 06, 2024 at 03:54:25PM +0100, Tobias Waldekranz wrote:
>> On mån, feb 05, 2024 at 13:41, Vladimir Oltean <olteanv@gmail.com> wrote:
>> > On Thu, Feb 01, 2024 at 05:10:45PM +0100, Tobias Waldekranz wrote:
>> >> Before this change, generation of the list of events MDB to replay
>> >
>> > s/events MDB/MDB events/
>> >
>> >> would race against the IGMP/MLD snooping logic, which could concurrently
>> >
>> > logic. This could (...)
>> >
>> >> enqueue events to the switchdev deferred queue, leading to duplicate
>> >> events being sent to drivers. As a consequence of this, drivers which
>> >> reference count memberships (at least DSA), would be left with orphan
>> >> groups in their hardware database when the bridge was destroyed.
>> >
>> > Still missing the user impact description, aka "when would this be
>> > noticed by, and actively bother an user?". Something that would justify
>> > handling this via net.git rather than net-next.git.
>>
>> Other than moving up the example to this point, what do you feel is
>> missing? Or are you saying that you don't think this belongs on net.git?
>
> I just don't have enough data to be able to judge (it's missing from the
> commit message). I'm looking at Documentation/process/stable-kernel-rules.rst
> as a reference. It lists a series of guidelines from which I gather,
> generally speaking, that there should exist a user impact on
> functionality. The "mvls atu" command is a debug program, it just dumps
> hardware tables. It is enough to point out that multicast entries leak,
> but not what will be broken as a result of that.

Fair enough.

I originally discovered the issue while developing a kselftest for
another improvement I want to make to switchdev multicast offloading
(related to router ports). I thought my new code was causing these
orphan entries, but then realized that the issue existed on a plain -net
and -net-next kernel.

I can imagine scenarios in which a user _could_ be impacted by this, but
they are all quite esoteric. Maybe that is an indication that I should
just target net-next instead.

>> >> Avoid this by grabbing the write-side lock of the MDB while generating
>> >> the replay list, making sure that no deferred version of a replay
>> >> event is already enqueued to the switchdev deferred queue, before
>> >> adding it to the replay list.
>> >
>> > The description of the solution is actually not very satisfactory to me.
>> > I would have liked to see a more thorough analysis.
>>
>> Sorry to disappoint.
>>
>> > The race has 2 components, one comes from the fact that during replay,
>> > we iterate using RCU, which does not halt concurrent updates, and the
>> > other comes from the fact that the MDB addition procedure is non-atomic.
>> > Elements are first added to the br->mdb_list, but are notified to
>> > switchdev in a deferred context.
>> >
>> > Grabbing the bridge multicast spinlock only solves the first problem: it
>> > stops new enqueues of deferred events. We also need special handling of
>> > the pending deferred events. The idea there is that we cannot cancel
>> > them, since that would lead to complications for other potential
>> > listeners on the switchdev chain. And we cannot flush them either, since
>> > that wouldn't actually help: flushing needs sleepable context, which is
>> > incompatible with holding br->multicast_lock, and without
>> > br->multicast_lock held, we haven't actually solved anything, since new
>> > deferred events can still be enqueued at any time.
>> >
>> > So the only simple solution is to let the pending deferred events
>> > execute (eventually), but during event replay on joining port, exclude
>> > replaying those multicast elements which are in the bridge's multicast
>> > list, but were not yet added through switchdev. Eventually they will be.
>>
>> In my opinion, your three paragraphs above say pretty much the same
>> thing as my single paragraph. But sure, I will try to provide more
>> details in v4.
>
> It's not about word count, I'm notoriously bad at summarizing. If you
> could still say in a single paragraph what I did in 3, it would be great.
>
> The difference is that the above 3 paragraphs only explore the race on
> MDB addition events. It is of a different nature than the one on deletion.
> Your analysis clumps them together, and the code follows suit. There is
> code to supposedly handle the race between deferred deletion events and
> replay on port removal, but I don't think it does.

That's the thing though: I did not lump them together, I simply did not
think about the deletion issue at all. An issue which you yourself state
should probably be fixed in a separate patch, presumably because you
think of them as two very closely related, but ultimately separate,
issues. Which is why I think it was needlessly harsh to say that my
analysis of the duplicate-events issue was too simplistic, because it
did not address a related issue that I had not considered.

>> > (side note: the handling code for excluding replays on pending event
>> > deletion seems to not actually help, because)
>> >
>> > Event replays on a switchdev port leaving the bridge are also
>> > problematic, but in a different way. The deletion procedure is also
>> > non-atomic, they are first removed from br->mdb_list then the switchdev
>> > notification is deferred. So, the replay procedure cannot enter a
>> > condition where it replays the deletion twice. But, there is no
>> > switchdev_deferred_process() call when the switchdev port unoffloads an
>> > intermediary LAG bridge port, and this can lead to the situation where
>> > neither the replay, nor the deferred switchdev object, trickle down to a
>> > call in the switchdev driver. So for event deletion, we need to force a
>> > synchronous call to switchdev_deferred_process().
>>
>> Good catch!
>>
>> What does this mean for v4? Is this still one logical change that
>> ensures that MDB events are always delivered exactly once, or a series
>> where one patch fixes the duplicates and one ensures that no events
>> are missed?
>
> I'm not saying I know how you can satisfy everyone's review requests.

Of course, I was merely trying to avoid an extra round of patches.

> Once it is clear that deferred additions and deferred removals need
> separate treatment, it would probably be preferable to treat them in
> separate patches.

I should have been more clear in my response. When I said "Good catch!",
that was because I had first reproduced the issue, and then verified
that adding a call to switchdev_deferred_process() at the end of
nbp_switchdev_unsync_objs() solves the issue.

>> > See how the analysis in the commit message changes the patch?
>>
>> Suddenly the novice was enlightened.
>
> This, to me, reads like unnecessary sarcasm.

Your question, to me, read like unnecessary condescension.

> I just mean that I don't think you spent enough time to explore the
> problem space in the commit message. It is a very important element of
> a patch, because it is very rare for someone to solve a problem which
> is not even formulated. I actually tried to be helpful by pointing out
> where things might not work out. I'm sorry if it did not appear that way.

I am extremely grateful for your help, truly! Which is why I complimented
you for pointing out the second issue to me. Something about the phrasing
of that question really rubbed me the wrong way though. I should have
just let it go, I'm sorry.
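A minimal sketch of the fix verified above, assuming it lands at the end of nbp_switchdev_unsync_objs() as stated in that message; the rest of the function body is elided, and this is not part of the posted v3 diff:

/* Sketch only: force deferred events out while the leaving port's
 * driver is still offloaded and listening on the switchdev chain.
 */
static void nbp_switchdev_unsync_objs(struct net_bridge_port *p,
				      const void *ctx,
				      struct notifier_block *atomic_nb,
				      struct notifier_block *blocking_nb)
{
	/* ... existing replays of FDB/MDB/VLAN deletions ... */

	/* Deferred deletions enqueued before (or during) the replay
	 * would otherwise only be processed after the port has
	 * unoffloaded, and would then never reach the switchdev
	 * driver. This function already runs under rtnl_lock, which
	 * switchdev_deferred_process() asserts.
	 */
	switchdev_deferred_process();
}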
On Tue, Feb 06, 2024 at 10:58:18PM +0100, Tobias Waldekranz wrote:
> On tis, feb 06, 2024 at 21:31, Vladimir Oltean <olteanv@gmail.com> wrote:
> > On Tue, Feb 06, 2024 at 03:54:25PM +0100, Tobias Waldekranz wrote:
> >> On mån, feb 05, 2024 at 13:41, Vladimir Oltean <olteanv@gmail.com> wrote:
> >> > On Thu, Feb 01, 2024 at 05:10:45PM +0100, Tobias Waldekranz wrote:
> >> >> Before this change, generation of the list of events MDB to replay
> >> >
> >> > s/events MDB/MDB events/
> >> >
> >> >> would race against the IGMP/MLD snooping logic, which could concurrently
> >> >
> >> > logic. This could (...)
> >> >
> >> >> enqueue events to the switchdev deferred queue, leading to duplicate
> >> >> events being sent to drivers. As a consequence of this, drivers which
> >> >> reference count memberships (at least DSA), would be left with orphan
> >> >> groups in their hardware database when the bridge was destroyed.
> >> >
> >> > Still missing the user impact description, aka "when would this be
> >> > noticed by, and actively bother an user?". Something that would justify
> >> > handling this via net.git rather than net-next.git.
> >>
> >> Other than moving up the example to this point, what do you feel is
> >> missing? Or are you saying that you don't think this belongs on net.git?
> >
> > I just don't have enough data to be able to judge (it's missing from the
> > commit message). I'm looking at Documentation/process/stable-kernel-rules.rst
> > as a reference. It lists a series of guidelines from which I gather,
> > generally speaking, that there should exist a user impact on
> > functionality. The "mvls atu" command is a debug program, it just dumps
> > hardware tables. It is enough to point out that multicast entries leak,
> > but not what will be broken as a result of that.
>
> Fair enough.
>
> I originally discovered the issue while developing a kselftest for
> another improvement I want to make to switchdev multicast offloading
> (related to router ports). I thought my new code was causing these
> orphan entries, but then realized that the issue existed on a plain -net
> and -net-next kernel.
>
> I can imagine scenarios in which a user _could_ be impacted by this, but
> they are all quite esoteric. Maybe that is an indication that I should
> just target net-next instead.

Again, I'm not discouraging you from sending this patch to net.git. I've
been saying (since v1, apparently:
https://lore.kernel.org/netdev/20240131135157.ddrtt4swvz5y3nbz@skbuf/)
that if there is a reason to do it, just say it in the commit message,
so that we're all on the same page. Be verbose with the cost/benefit
calculation (preferably also in the commit message).

I would analyze the complexity of the change as "medium", since it does
change the locking scheme to fix an underdesigned mechanism. And the
fact that mlxsw also uses replays through a slightly different code path
means it's something you (probably) can't test, and potentially risky.

> >> > The race has 2 components, one comes from the fact that during replay,
> >> > we iterate using RCU, which does not halt concurrent updates, and the
> >> > other comes from the fact that the MDB addition procedure is non-atomic.
> >> > Elements are first added to the br->mdb_list, but are notified to
> >> > switchdev in a deferred context.
> >> >
> >> > Grabbing the bridge multicast spinlock only solves the first problem: it
> >> > stops new enqueues of deferred events. We also need special handling of
> >> > the pending deferred events. The idea there is that we cannot cancel
> >> > them, since that would lead to complications for other potential
> >> > listeners on the switchdev chain. And we cannot flush them either, since
> >> > that wouldn't actually help: flushing needs sleepable context, which is
> >> > incompatible with holding br->multicast_lock, and without
> >> > br->multicast_lock held, we haven't actually solved anything, since new
> >> > deferred events can still be enqueued at any time.
> >> >
> >> > So the only simple solution is to let the pending deferred events
> >> > execute (eventually), but during event replay on joining port, exclude
> >> > replaying those multicast elements which are in the bridge's multicast
> >> > list, but were not yet added through switchdev. Eventually they will be.
> >>
> >> In my opinion, your three paragraphs above say pretty much the same
> >> thing as my single paragraph. But sure, I will try to provide more
> >> details in v4.
> >
> > It's not about word count, I'm notoriously bad at summarizing. If you
> > could still say in a single paragraph what I did in 3, it would be great.
> >
> > The difference is that the above 3 paragraphs only explore the race on
> > MDB addition events. It is of a different nature than the one on deletion.
> > Your analysis clumps them together, and the code follows suit. There is
> > code to supposedly handle the race between deferred deletion events and
> > replay on port removal, but I don't think it does.
>
> That's the thing though: I did not lump them together, I simply did not
> think about the deletion issue at all. An issue which you yourself state
> should probably be fixed in a separate patch, presumably because you
> think of them as two very closely related, but ultimately separate,
> issues. Which is why I think it was needlessly harsh to say that my
> analysis of the duplicate-events issue was too simplistic, because it
> did not address a related issue that I had not considered.

So in the comments for your v1, one of the things I told you was that
"there's one more race to deal with" on removal.
https://lore.kernel.org/netdev/20240131150642.mxcssv7l5qfiejkl@skbuf/

At the time, the idea had not crystallized in my mind either that it's
not "one more" race on removal, but that on removal it's _the only_
race. You then submitted v2 and v3, not taking this comment into
consideration, not even to explore it. Exploring it thoroughly would
have revealed that your approach - to apply the algorithm of excluding
objects from replay if events on them are pending in switchdev - is
completely unnecessary when "adding=false".

In light of my comment and the lack of a detailed analysis on your side
on removal, it _appears_, by looking at this change, that the patch
handles a race on removal, but the code actually runs through the
algorithm that handles a race that doesn't exist, and doesn't handle
the race that _does_ exist.

My part of the fault is doing spotty review, giving ideas piecemeal, and
getting frustrated you aren't picking all of them up. It's still going
to be a really busy few weeks/months for me ahead, and there's nothing I
can do about that. I'm faced with the alternatives of either doing a
shit job reviewing and leaving comments/ideas in the little time I have,
or not reviewing at all. I'll think about my options some more.
diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index a43062d4c734..8346b0d29542 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -308,6 +308,9 @@ void switchdev_deferred_process(void);
 int switchdev_port_attr_set(struct net_device *dev,
 			    const struct switchdev_attr *attr,
 			    struct netlink_ext_ack *extack);
+bool switchdev_port_obj_act_is_deferred(struct net_device *dev,
+					enum switchdev_notifier_type nt,
+					const struct switchdev_obj *obj);
 int switchdev_port_obj_add(struct net_device *dev,
 			   const struct switchdev_obj *obj,
 			   struct netlink_ext_ack *extack);
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index ee84e783e1df..6d3fb4292071 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -595,21 +595,35 @@ br_switchdev_mdb_replay_one(struct notifier_block *nb, struct net_device *dev,
 }
 
 static int br_switchdev_mdb_queue_one(struct list_head *mdb_list,
+				      struct net_device *dev,
+				      unsigned long action,
 				      enum switchdev_obj_id id,
 				      const struct net_bridge_mdb_entry *mp,
 				      struct net_device *orig_dev)
 {
-	struct switchdev_obj_port_mdb *mdb;
+	struct switchdev_obj_port_mdb mdb = {
+		.obj = {
+			.id = id,
+			.orig_dev = orig_dev,
+		},
+	};
+	struct switchdev_obj_port_mdb *pmdb;
 
-	mdb = kzalloc(sizeof(*mdb), GFP_ATOMIC);
-	if (!mdb)
-		return -ENOMEM;
+	br_switchdev_mdb_populate(&mdb, mp);
 
-	mdb->obj.id = id;
-	mdb->obj.orig_dev = orig_dev;
-	br_switchdev_mdb_populate(mdb, mp);
-	list_add_tail(&mdb->obj.list, mdb_list);
+	if (switchdev_port_obj_act_is_deferred(dev, action, &mdb.obj)) {
+		/* This event is already in the deferred queue of
+		 * events, so this replay must be elided, lest the
+		 * driver receive duplicate events for it.
+		 */
+		return 0;
+	}
+
+	pmdb = kmemdup(&mdb, sizeof(mdb), GFP_ATOMIC);
+	if (!pmdb)
+		return -ENOMEM;
 
+	list_add_tail(&pmdb->obj.list, mdb_list);
 	return 0;
 }
 
@@ -677,51 +691,50 @@ br_switchdev_mdb_replay(struct net_device *br_dev, struct net_device *dev,
 	if (!br_opt_get(br, BROPT_MULTICAST_ENABLED))
 		return 0;
 
-	/* We cannot walk over br->mdb_list protected just by the rtnl_mutex,
-	 * because the write-side protection is br->multicast_lock. But we
-	 * need to emulate the [ blocking ] calling context of a regular
-	 * switchdev event, so since both br->multicast_lock and RCU read side
-	 * critical sections are atomic, we have no choice but to pick the RCU
-	 * read side lock, queue up all our events, leave the critical section
-	 * and notify switchdev from blocking context.
+	if (adding)
+		action = SWITCHDEV_PORT_OBJ_ADD;
+	else
+		action = SWITCHDEV_PORT_OBJ_DEL;
+
+	/* br_switchdev_mdb_queue_one() will take care to not queue a
+	 * replay of an event that is already pending in the switchdev
+	 * deferred queue. In order to safely determine that, there
+	 * must be no new deferred MDB notifications enqueued for the
+	 * duration of the MDB scan. Therefore, grab the write-side
+	 * lock to avoid racing with any concurrent IGMP/MLD snooping.
 	 */
-	rcu_read_lock();
+	spin_lock_bh(&br->multicast_lock);
 
-	hlist_for_each_entry_rcu(mp, &br->mdb_list, mdb_node) {
+	hlist_for_each_entry(mp, &br->mdb_list, mdb_node) {
 		struct net_bridge_port_group __rcu * const *pp;
 		const struct net_bridge_port_group *p;
 
 		if (mp->host_joined) {
-			err = br_switchdev_mdb_queue_one(&mdb_list,
+			err = br_switchdev_mdb_queue_one(&mdb_list, dev, action,
 							 SWITCHDEV_OBJ_ID_HOST_MDB,
 							 mp, br_dev);
 			if (err) {
-				rcu_read_unlock();
+				spin_unlock_bh(&br->multicast_lock);
 				goto out_free_mdb;
 			}
 		}
 
-		for (pp = &mp->ports; (p = rcu_dereference(*pp)) != NULL;
+		for (pp = &mp->ports; (p = mlock_dereference(*pp, br)) != NULL;
 		     pp = &p->next) {
 			if (p->key.port->dev != dev)
 				continue;
 
-			err = br_switchdev_mdb_queue_one(&mdb_list,
+			err = br_switchdev_mdb_queue_one(&mdb_list, dev, action,
 							 SWITCHDEV_OBJ_ID_PORT_MDB,
 							 mp, dev);
 			if (err) {
-				rcu_read_unlock();
+				spin_unlock_bh(&br->multicast_lock);
 				goto out_free_mdb;
 			}
 		}
 	}
 
-	rcu_read_unlock();
-
-	if (adding)
-		action = SWITCHDEV_PORT_OBJ_ADD;
-	else
-		action = SWITCHDEV_PORT_OBJ_DEL;
+	spin_unlock_bh(&br->multicast_lock);
 
 	list_for_each_entry(obj, &mdb_list, list) {
 		err = br_switchdev_mdb_replay_one(nb, dev,
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 5b045284849e..7d11f31820df 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -19,6 +19,35 @@
 #include <linux/rtnetlink.h>
 #include <net/switchdev.h>
 
+static bool switchdev_obj_eq(const struct switchdev_obj *a,
+			     const struct switchdev_obj *b)
+{
+	const struct switchdev_obj_port_vlan *va, *vb;
+	const struct switchdev_obj_port_mdb *ma, *mb;
+
+	if (a->id != b->id || a->orig_dev != b->orig_dev)
+		return false;
+
+	switch (a->id) {
+	case SWITCHDEV_OBJ_ID_PORT_VLAN:
+		va = SWITCHDEV_OBJ_PORT_VLAN(a);
+		vb = SWITCHDEV_OBJ_PORT_VLAN(b);
+		return va->flags == vb->flags &&
+		       va->vid == vb->vid &&
+		       va->changed == vb->changed;
+	case SWITCHDEV_OBJ_ID_PORT_MDB:
+	case SWITCHDEV_OBJ_ID_HOST_MDB:
+		ma = SWITCHDEV_OBJ_PORT_MDB(a);
+		mb = SWITCHDEV_OBJ_PORT_MDB(b);
+		return ma->vid == mb->vid &&
+		       ether_addr_equal(ma->addr, mb->addr);
+	default:
+		break;
+	}
+
+	BUG();
+}
+
 static LIST_HEAD(deferred);
 static DEFINE_SPINLOCK(deferred_lock);
 
@@ -307,6 +336,50 @@ int switchdev_port_obj_del(struct net_device *dev,
 }
 EXPORT_SYMBOL_GPL(switchdev_port_obj_del);
 
+/**
+ * switchdev_port_obj_act_is_deferred - Is object action pending?
+ *
+ * @dev: port device
+ * @nt: type of action; add or delete
+ * @obj: object to test
+ *
+ * Returns true if a deferred item exists which is equivalent
+ * to the action @nt of an object @obj.
+ *
+ * rtnl_lock must be held.
+ */
+bool switchdev_port_obj_act_is_deferred(struct net_device *dev,
+					enum switchdev_notifier_type nt,
+					const struct switchdev_obj *obj)
+{
+	struct switchdev_deferred_item *dfitem;
+	bool found = false;
+
+	ASSERT_RTNL();
+
+	spin_lock_bh(&deferred_lock);
+
+	list_for_each_entry(dfitem, &deferred, list) {
+		if (dfitem->dev != dev)
+			continue;
+
+		if ((dfitem->func == switchdev_port_obj_add_deferred &&
+		     nt == SWITCHDEV_PORT_OBJ_ADD) ||
+		    (dfitem->func == switchdev_port_obj_del_deferred &&
+		     nt == SWITCHDEV_PORT_OBJ_DEL)) {
+			if (switchdev_obj_eq((const void *)dfitem->data, obj)) {
+				found = true;
+				break;
+			}
+		}
+	}
+
+	spin_unlock_bh(&deferred_lock);
+
+	return found;
+}
+EXPORT_SYMBOL_GPL(switchdev_port_obj_act_is_deferred);
+
 static ATOMIC_NOTIFIER_HEAD(switchdev_notif_chain);
 static BLOCKING_NOTIFIER_HEAD(switchdev_blocking_notif_chain);
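For reference, a usage sketch of the new helper from the perspective of a replay-list builder, mirroring br_switchdev_mdb_queue_one() in the diff above; example_queue_replay is a hypothetical name, not kernel code:

/* Hypothetical caller: skip queueing a replay when an equivalent add
 * already sits in switchdev's deferred queue. Per the helper's
 * contract, rtnl_lock must be held; br->multicast_lock must also be
 * held so that no new deferred MDB events can be enqueued meanwhile.
 */
static int example_queue_replay(struct net_device *dev,
				const struct switchdev_obj_port_mdb *mdb,
				struct list_head *replay_list)
{
	struct switchdev_obj_port_mdb *pmdb;

	if (switchdev_port_obj_act_is_deferred(dev, SWITCHDEV_PORT_OBJ_ADD,
					       &mdb->obj))
		return 0;	/* the driver will see the deferred add */

	pmdb = kmemdup(mdb, sizeof(*mdb), GFP_ATOMIC);
	if (!pmdb)
		return -ENOMEM;

	list_add_tail(&pmdb->obj.list, replay_list);
	return 0;
}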