
[v3,net] net: bridge: switchdev: Skip MDB replays of pending events

Message ID 20240201161045.1956074-1-tobias@waldekranz.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success SINGLE THREAD; Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1110 this patch: 1110
netdev/build_tools success Errors and warnings before: 2 this patch: 0
netdev/cc_maintainers success CCed 0 of 0 maintainers
netdev/build_clang success Errors and warnings before: 1087 this patch: 1087
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 1144 this patch: 1144
netdev/checkpatch warning WARNING: Do not crash the kernel unless it is absolutely unavoidable--use WARN_ON_ONCE() plus recovery code (if feasible) instead of BUG() or variants
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-02-04--21-00 (tests: 721)

Commit Message

Tobias Waldekranz Feb. 1, 2024, 4:10 p.m. UTC
Before this change, generation of the list of events MDB to replay
would race against the IGMP/MLD snooping logic, which could concurrently
enqueue events to the switchdev deferred queue, leading to duplicate
events being sent to drivers. As a consequence of this, drivers which
reference count memberships (at least DSA), would be left with orphan
groups in their hardware database when the bridge was destroyed.

Avoid this by grabbing the write-side lock of the MDB while generating
the replay list, making sure that no deferred version of a replay
event is already enqueued to the switchdev deferred queue, before
adding it to the replay list.

An easy way to reproduce this issue, on an mv88e6xxx system, was to
create a snooping bridge, and immediately add a port to it:

    root@infix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && \
    > ip link set dev x3 up master br0
    root@infix-06-0b-00:~$ ip link del dev br0
    root@infix-06-0b-00:~$ mvls atu
    ADDRESS             FID  STATE      Q  F  0  1  2  3  4  5  6  7  8  9  a
    DEV:0 Marvell 88E6393X
    33:33:00:00:00:6a     1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
    33:33:ff:87:e4:3f     1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
    ff:ff:ff:ff:ff:ff     1  static     -  -  0  1  2  3  4  5  6  7  8  9  a
    root@infix-06-0b-00:~$

The two IPv6 groups remain in the hardware database because the
port (x3) is notified of the host's membership twice: once via the
original event and once via a replay. Since only a single delete
notification is sent, the count remains at 1 when the bridge is
destroyed.
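
As a toy illustration of the reference counting at the driver end (plain
userspace C, not the DSA code; hw_group and its helpers are invented for
this example), two adds matched by a single delete leave the entry
stranded:

    /* Toy model of a refcounted hardware database entry (illustrative
     * only; names are made up).
     */
    #include <stdio.h>

    struct hw_group { int refcnt; };

    static void hw_group_add(struct hw_group *g)
    {
            g->refcnt++;
    }

    static void hw_group_del(struct hw_group *g)
    {
            if (--g->refcnt == 0)
                    printf("entry removed from hardware\n");
    }

    int main(void)
    {
            struct hw_group g = { 0 };

            hw_group_add(&g);   /* original deferred event */
            hw_group_add(&g);   /* duplicate event from the replay */
            hw_group_del(&g);   /* single delete when the bridge is destroyed */
            printf("refcnt = %d -> orphan entry\n", g.refcnt);   /* refcnt = 1 */
            return 0;
    }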

Fixes: 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries")
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
---

Notes:
    v1 -> v2:
    
    - Squash the previously separate addition of
      switchdev_port_obj_act_is_deferred into this patch.
    - Use ether_addr_equal to compare MAC addresses.
    - Document switchdev_port_obj_act_is_deferred (renamed from
      switchdev_port_obj_is_deferred in v1, to indicate that we also match
      on the action).
    - Delay allocations of MDB objects until we know they're needed.
    - Use non-RCU version of the hash list iterator, now that the MDB is
      not scanned while holding the RCU read lock.
    - Add Fixes tag to commit message
    
    v2 -> v3:
    
    - Fix unlocking in error paths
    - Access RCU protected port list via mlock_dereference, since MDB is
      guaranteed to remain constant for the duration of the scan.

 include/net/switchdev.h   |  3 ++
 net/bridge/br_switchdev.c | 69 +++++++++++++++++++++---------------
 net/switchdev/switchdev.c | 73 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 117 insertions(+), 28 deletions(-)

Comments

Jiri Pirko Feb. 1, 2024, 4:24 p.m. UTC | #1
Thu, Feb 01, 2024 at 05:10:45PM CET, tobias@waldekranz.com wrote:
>Before this change, generation of the list of events MDB to replay
>would race against the IGMP/MLD snooping logic, which could concurrently
>enqueue events to the switchdev deferred queue, leading to duplicate
>events being sent to drivers. As a consequence of this, drivers which
>reference count memberships (at least DSA), would be left with orphan
>groups in their hardware database when the bridge was destroyed.
>
>Avoid this by grabbing the write-side lock of the MDB while generating
>the replay list, making sure that no deferred version of a replay
>event is already enqueued to the switchdev deferred queue, before
>adding it to the replay list.
>
>An easy way to reproduce this issue, on an mv88e6xxx system, was to
>create a snooping bridge, and immediately add a port to it:
>
>    root@infix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && \
>    > ip link set dev x3 up master br0
>    root@infix-06-0b-00:~$ ip link del dev br0
>    root@infix-06-0b-00:~$ mvls atu
>    ADDRESS             FID  STATE      Q  F  0  1  2  3  4  5  6  7  8  9  a
>    DEV:0 Marvell 88E6393X
>    33:33:00:00:00:6a     1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
>    33:33:ff:87:e4:3f     1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
>    ff:ff:ff:ff:ff:ff     1  static     -  -  0  1  2  3  4  5  6  7  8  9  a
>    root@infix-06-0b-00:~$
>
>The two IPv6 groups remain in the hardware database because the
>port (x3) is notified of the host's membership twice: once via the
>original event and once via a replay. Since only a single delete
>notification is sent, the count remains at 1 when the bridge is
>destroyed.
>
>Fixes: 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries")
>Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>

Could you please maintain a 24-hour period before sending another patch
version?

https://www.kernel.org/doc/html/v6.7/process/maintainer-netdev.html#tl-dr
Tobias Waldekranz Feb. 2, 2024, 7:09 a.m. UTC | #2
On tor, feb 01, 2024 at 17:24, Jiri Pirko <jiri@resnulli.us> wrote:
> Thu, Feb 01, 2024 at 05:10:45PM CET, tobias@waldekranz.com wrote:
>>Before this change, generation of the list of events MDB to replay
>>would race against the IGMP/MLD snooping logic, which could concurrently
>>enqueue events to the switchdev deferred queue, leading to duplicate
>>events being sent to drivers. As a consequence of this, drivers which
>>reference count memberships (at least DSA), would be left with orphan
>>groups in their hardware database when the bridge was destroyed.
>>
>>Avoid this by grabbing the write-side lock of the MDB while generating
>>the replay list, making sure that no deferred version of a replay
>>event is already enqueued to the switchdev deferred queue, before
>>adding it to the replay list.
>>
>>An easy way to reproduce this issue, on an mv88e6xxx system, was to
>>create a snooping bridge, and immediately add a port to it:
>>
>>    root@infix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && \
>>    > ip link set dev x3 up master br0
>>    root@infix-06-0b-00:~$ ip link del dev br0
>>    root@infix-06-0b-00:~$ mvls atu
>>    ADDRESS             FID  STATE      Q  F  0  1  2  3  4  5  6  7  8  9  a
>>    DEV:0 Marvell 88E6393X
>>    33:33:00:00:00:6a     1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
>>    33:33:ff:87:e4:3f     1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
>>    ff:ff:ff:ff:ff:ff     1  static     -  -  0  1  2  3  4  5  6  7  8  9  a
>>    root@infix-06-0b-00:~$
>>
>>The two IPv6 groups remain in the hardware database because the
>>port (x3) is notified of the host's membership twice: once via the
>>original event and once via a replay. Since only a single delete
>>notification is sent, the count remains at 1 when the bridge is
>>destroyed.
>>
>>Fixes: 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries")
>>Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
>
> Could you please maintain 24 hours period between sending another patch
> version?
>
> https://www.kernel.org/doc/html/v6.7/process/maintainer-netdev.html#tl-dr

Sorry, I will avoid that going forward.
Vladimir Oltean Feb. 5, 2024, 11:41 a.m. UTC | #3
On Thu, Feb 01, 2024 at 05:10:45PM +0100, Tobias Waldekranz wrote:
> Before this change, generation of the list of events MDB to replay

s/events MDB/MDB events/

> would race against the IGMP/MLD snooping logic, which could concurrently

logic. This could (...)

> enqueue events to the switchdev deferred queue, leading to duplicate
> events being sent to drivers. As a consequence of this, drivers which
> reference count memberships (at least DSA), would be left with orphan
> groups in their hardware database when the bridge was destroyed.

Still missing the user impact description, aka "when would this be
noticed by, and actively bother a user?". Something that would justify
handling this via net.git rather than net-next.git.

> 
> Avoid this by grabbing the write-side lock of the MDB while generating
> the replay list, making sure that no deferred version of a replay
> event is already enqueued to the switchdev deferred queue, before
> adding it to the replay list.

The description of the solution is actually not very satisfactory to me.
I would have liked to see a more thorough analysis.

The race has 2 components, one comes from the fact that during replay,
we iterate using RCU, which does not halt concurrent updates, and the
other comes from the fact that the MDB addition procedure is non-atomic.
Elements are first added to the br->mdb_list, but are notified to
switchdev in a deferred context.

Grabbing the bridge multicast spinlock only solves the first problem: it
stops new enqueues of deferred events. We also need special handling of
the pending deferred events. The idea there is that we cannot cancel
them, since that would lead to complications for other potential
listeners on the switchdev chain. And we cannot flush them either, since
that wouldn't actually help: flushing needs sleepable context, which is
incompatible with holding br->multicast_lock, and without
br->multicast_lock held, we haven't actually solved anything, since new
deferred events can still be enqueued at any time.

So the only simple solution is to let the pending deferred events
execute (eventually), but during event replay on joining port, exclude
replaying those multicast elements which are in the bridge's multicast
list, but were not yet added through switchdev. Eventually they will be.
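
In sketch form (this condenses what the patch below does in
br_switchdev_mdb_queue_one(), which now runs with br->multicast_lock
held), the exclusion is:

    br_switchdev_mdb_populate(&mdb, mp);

    if (switchdev_port_obj_act_is_deferred(dev, action, &mdb.obj)) {
            /* An add/del for this object is already sitting in the
             * switchdev deferred queue and will reach the driver on
             * its own; replaying it here would duplicate the event.
             */
            return 0;
    }

    pmdb = kmemdup(&mdb, sizeof(mdb), GFP_ATOMIC);
    if (!pmdb)
            return -ENOMEM;

    list_add_tail(&pmdb->obj.list, mdb_list);
    return 0;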

(side note: the handling code for excluding replays on pending event
deletion seems to not actually help, because)

Event replays on a switchdev port leaving the bridge are also
problematic, but in a different way. The deletion procedure is also
non-atomic, they are first removed from br->mdb_list then the switchdev
notification is deferred. So, the replay procedure cannot enter a
condition where it replays the deletion twice. But, there is no
switchdev_deferred_process() call when the switchdev port unoffloads an
intermediary LAG bridge port, and this can lead to the situation where
neither the replay, nor the deferred switchdev object, trickle down to a
call in the switchdev driver. So for event deletion, we need to force a
synchronous call to switchdev_deferred_process().

See how the analysis in the commit message changes the patch?

> 
> An easy way to reproduce this issue, on an mv88e6xxx system, was to

s/was/is/

> create a snooping bridge, and immediately add a port to it:
> 
>     root@infix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && \
>     > ip link set dev x3 up master br0
>     root@infix-06-0b-00:~$ ip link del dev br0
>     root@infix-06-0b-00:~$ mvls atu
>     ADDRESS             FID  STATE      Q  F  0  1  2  3  4  5  6  7  8  9  a
>     DEV:0 Marvell 88E6393X
>     33:33:00:00:00:6a     1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
>     33:33:ff:87:e4:3f     1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
>     ff:ff:ff:ff:ff:ff     1  static     -  -  0  1  2  3  4  5  6  7  8  9  a
>     root@infix-06-0b-00:~$
> 
> The two IPv6 groups remain in the hardware database because the
> port (x3) is notified of the host's membership twice: once via the
> original event and once via a replay. Since only a single delete
> notification is sent, the count remains at 1 when the bridge is
> destroyed.

These 2 paragraphs, plus the mvls output, should be placed before the
"Avoid this" paragraph.

> 
> Fixes: 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries")
> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
> ---
Tobias Waldekranz Feb. 6, 2024, 2:54 p.m. UTC | #4
On mån, feb 05, 2024 at 13:41, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Thu, Feb 01, 2024 at 05:10:45PM +0100, Tobias Waldekranz wrote:
>> Before this change, generation of the list of events MDB to replay
>
> s/events MDB/MDB events/
>
>> would race against the IGMP/MLD snooping logic, which could concurrently
>
> logic. This could (...)
>
>> enqueue events to the switchdev deferred queue, leading to duplicate
>> events being sent to drivers. As a consequence of this, drivers which
>> reference count memberships (at least DSA), would be left with orphan
>> groups in their hardware database when the bridge was destroyed.
>
> Still missing the user impact description, aka "when would this be
> noticed by, and actively bother an user?". Something that would justify
> handling this via net.git rather than net-next.git.

Other than moving up the example to this point, what do you feel is
missing? Or are you saying that you don't think this belongs on net.git?

>> 
>> Avoid this by grabbing the write-side lock of the MDB while generating
>> the replay list, making sure that no deferred version of a replay
>> event is already enqueued to the switchdev deferred queue, before
>> adding it to the replay list.
>
> The description of the solution is actually not very satisfactory to me.
> I would have liked to see a more thorough analysis.

Sorry to disappoint.

> The race has 2 components, one comes from the fact that during replay,
> we iterate using RCU, which does not halt concurrent updates, and the
> other comes from the fact that the MDB addition procedure is non-atomic.
> Elements are first added to the br->mdb_list, but are notified to
> switchdev in a deferred context.
>
> Grabbing the bridge multicast spinlock only solves the first problem: it
> stops new enqueues of deferred events. We also need special handling of
> the pending deferred events. The idea there is that we cannot cancel
> them, since that would lead to complications for other potential
> listeners on the switchdev chain. And we cannot flush them either, since
> that wouldn't actually help: flushing needs sleepable context, which is
> incompatible with holding br->multicast_lock, and without
> br->multicast_lock held, we haven't actually solved anything, since new
> deferred events can still be enqueued at any time.
>
> So the only simple solution is to let the pending deferred events
> execute (eventually), but during event replay on joining port, exclude
> replaying those multicast elements which are in the bridge's multicast
> list, but were not yet added through switchdev. Eventually they will be.

In my opinion, your three paragraphs above say pretty much the same
thing as my single paragraph. But sure, I will try to provide more
details in v4.

> (side note: the handling code for excluding replays on pending event
> deletion seems to not actually help, because)
>
> Event replays on a switchdev port leaving the bridge are also
> problematic, but in a different way. The deletion procedure is also
> non-atomic, they are first removed from br->mdb_list then the switchdev
> notification is deferred. So, the replay procedure cannot enter a
> condition where it replays the deletion twice. But, there is no
> switchdev_deferred_process() call when the switchdev port unoffloads an
> intermediary LAG bridge port, and this can lead to the situation where
> neither the replay, nor the deferred switchdev object, trickle down to a
> call in the switchdev driver. So for event deletion, we need to force a
> synchronous call to switchdev_deferred_process().

Good catch!

What does this mean for v4? Is this still one logical change that
ensures that MDB events are always delivered exactly once, or a series
where one patch fixes the duplicates and one ensures that no events are
missed?

> See how the analysis in the commit message changes the patch?

Suddenly the novice was enlightened.

>> 
>> An easy way to reproduce this issue, on an mv88e6xxx system, was to
>
> s/was/is/
>
>> create a snooping bridge, and immediately add a port to it:
>> 
>>     root@infix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && \
>>     > ip link set dev x3 up master br0
>>     root@infix-06-0b-00:~$ ip link del dev br0
>>     root@infix-06-0b-00:~$ mvls atu
>>     ADDRESS             FID  STATE      Q  F  0  1  2  3  4  5  6  7  8  9  a
>>     DEV:0 Marvell 88E6393X
>>     33:33:00:00:00:6a     1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
>>     33:33:ff:87:e4:3f     1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
>>     ff:ff:ff:ff:ff:ff     1  static     -  -  0  1  2  3  4  5  6  7  8  9  a
>>     root@infix-06-0b-00:~$
>> 
>> The two IPv6 groups remain in the hardware database because the
>> port (x3) is notified of the host's membership twice: once via the
>> original event and once via a replay. Since only a single delete
>> notification is sent, the count remains at 1 when the bridge is
>> destroyed.
>
> These 2 paragraphs, plus the mvls output, should be placed before the
> "Avoid this" paragraph.
>
>> 
>> Fixes: 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries")
>> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
>> ---
Vladimir Oltean Feb. 6, 2024, 7:31 p.m. UTC | #5
On Tue, Feb 06, 2024 at 03:54:25PM +0100, Tobias Waldekranz wrote:
> On mån, feb 05, 2024 at 13:41, Vladimir Oltean <olteanv@gmail.com> wrote:
> > On Thu, Feb 01, 2024 at 05:10:45PM +0100, Tobias Waldekranz wrote:
> >> Before this change, generation of the list of events MDB to replay
> >
> > s/events MDB/MDB events/
> >
> >> would race against the IGMP/MLD snooping logic, which could concurrently
> >
> > logic. This could (...)
> >
> >> enqueue events to the switchdev deferred queue, leading to duplicate
> >> events being sent to drivers. As a consequence of this, drivers which
> >> reference count memberships (at least DSA), would be left with orphan
> >> groups in their hardware database when the bridge was destroyed.
> >
> > Still missing the user impact description, aka "when would this be
> > noticed by, and actively bother an user?". Something that would justify
> > handling this via net.git rather than net-next.git.
> 
> Other than moving up the example to this point, what do you feel is
> missing? Or are you saying that you don't think this belongs on net.git?

I just don't have enough data to be able to judge (it's missing from the
commit message). I'm looking at Documentation/process/stable-kernel-rules.rst
as a reference. It lists a series of guidelines from which I gather,
generally speaking, that there should exist a user impact on
functionality. The "mvls atu" command is a debug program, it just dumps
hardware tables. It is enough to point out that multicast entries leak,
but not what will be broken as a result of that.

> >> Avoid this by grabbing the write-side lock of the MDB while generating
> >> the replay list, making sure that no deferred version of a replay
> >> event is already enqueued to the switchdev deferred queue, before
> >> adding it to the replay list.
> >
> > The description of the solution is actually not very satisfactory to me.
> > I would have liked to see a more thorough analysis.
> 
> Sorry to disappoint.
> 
> > The race has 2 components, one comes from the fact that during replay,
> > we iterate using RCU, which does not halt concurrent updates, and the
> > other comes from the fact that the MDB addition procedure is non-atomic.
> > Elements are first added to the br->mdb_list, but are notified to
> > switchdev in a deferred context.
> >
> > Grabbing the bridge multicast spinlock only solves the first problem: it
> > stops new enqueues of deferred events. We also need special handling of
> > the pending deferred events. The idea there is that we cannot cancel
> > them, since that would lead to complications for other potential
> > listeners on the switchdev chain. And we cannot flush them either, since
> > that wouldn't actually help: flushing needs sleepable context, which is
> > incompatible with holding br->multicast_lock, and without
> > br->multicast_lock held, we haven't actually solved anything, since new
> > deferred events can still be enqueued at any time.
> >
> > So the only simple solution is to let the pending deferred events
> > execute (eventually), but during event replay on joining port, exclude
> > replaying those multicast elements which are in the bridge's multicast
> > list, but were not yet added through switchdev. Eventually they will be.
> 
> In my opinion, your three paragraphs above say pretty much the same
> thing as my single paragraph. But sure, I will try to provide more
> details in v4.

It's not about word count, I'm notoriously bad at summarizing. If you
could still say in a single paragraph what I did in 3, it would be great.

The difference is that the above 3 paragraphs only explore the race on
MDB addition events. It is of a different nature than the one on deletion.
Your analysis clumps them together, and the code follows suit. There is
code to supposedly handle the race between deferred deletion events and
replay on port removal, but I don't think it does.

> > (side note: the handling code for excluding replays on pending event
> > deletion seems to not actually help, because)
> >
> > Event replays on a switchdev port leaving the bridge are also
> > problematic, but in a different way. The deletion procedure is also
> > non-atomic, they are first removed from br->mdb_list then the switchdev
> > notification is deferred. So, the replay procedure cannot enter a
> > condition where it replays the deletion twice. But, there is no
> > switchdev_deferred_process() call when the switchdev port unoffloads an
> > intermediary LAG bridge port, and this can lead to the situation where
> > neither the replay, nor the deferred switchdev object, trickle down to a
> > call in the switchdev driver. So for event deletion, we need to force a
> > synchronous call to switchdev_deferred_process().
> 
> Good catch!
> 
> What does this mean for v4? Is this still one logical change that
> ensures that MDB events are always delivered exactly once, or a series
> where one patch fixes the duplicates and one ensures that no events
> are missed?

I'm not saying I know how you can satisfy everyone's review requests.
Once it is clear that deferred additions and deferred removals need
separate treatment, it would probably be preferable to treat them in
separate patches.

> > See how the analysis in the commit message changes the patch?
> 
> Suddenly the novice was enlightened.

This, to me, reads like unnecessary sarcasm.

I just mean that I don't think you spent enough time to explore the
problem space in the commit message. It is a very important element of
a patch, because it is very rare for someone to solve a problem which
is not even formulated. I actually tried to be helpful by pointing out
where things might not work out. I'm sorry if it did not appear that way.
Tobias Waldekranz Feb. 6, 2024, 9:58 p.m. UTC | #6
On tis, feb 06, 2024 at 21:31, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Tue, Feb 06, 2024 at 03:54:25PM +0100, Tobias Waldekranz wrote:
>> On mån, feb 05, 2024 at 13:41, Vladimir Oltean <olteanv@gmail.com> wrote:
>> > On Thu, Feb 01, 2024 at 05:10:45PM +0100, Tobias Waldekranz wrote:
>> >> Before this change, generation of the list of events MDB to replay
>> >
>> > s/events MDB/MDB events/
>> >
>> >> would race against the IGMP/MLD snooping logic, which could concurrently
>> >
>> > logic. This could (...)
>> >
>> >> enqueue events to the switchdev deferred queue, leading to duplicate
>> >> events being sent to drivers. As a consequence of this, drivers which
>> >> reference count memberships (at least DSA), would be left with orphan
>> >> groups in their hardware database when the bridge was destroyed.
>> >
>> > Still missing the user impact description, aka "when would this be
>> > noticed by, and actively bother an user?". Something that would justify
>> > handling this via net.git rather than net-next.git.
>> 
>> Other than moving up the example to this point, what do you feel is
>> missing? Or are you saying that you don't think this belongs on net.git?
>
> I just don't have enough data to be able to judge (it's missing from the
> commit message). I'm looking at Documentation/process/stable-kernel-rules.rst
> as a reference. It lists a series of guidelines from which I gather,
> generally speaking, that there should exist a user impact on
> functionality. The "mvls atu" command is a debug program, it just dumps
> hardware tables. It is enough to point out that multicast entries leak,
> but not what will be broken as a result of that.

Fair enough.

I originally discovered the issue while developing a kselftest for
another improvement I want to make to switchdev multicast offloading
(related to router ports). I thought my new code was causing these
orphan entries, but then realized that the issue existed on a plain -net
and -net-next kernel.

I can imagine scenarios in which a user _could_ be impacted by this, but
they are all quite esoteric. Maybe that is an indication that I should
just target net-next instead.

>> >> Avoid this by grabbing the write-side lock of the MDB while generating
>> >> the replay list, making sure that no deferred version of a replay
>> >> event is already enqueued to the switchdev deferred queue, before
>> >> adding it to the replay list.
>> >
>> > The description of the solution is actually not very satisfactory to me.
>> > I would have liked to see a more thorough analysis.
>> 
>> Sorry to disappoint.
>> 
>> > The race has 2 components, one comes from the fact that during replay,
>> > we iterate using RCU, which does not halt concurrent updates, and the
>> > other comes from the fact that the MDB addition procedure is non-atomic.
>> > Elements are first added to the br->mdb_list, but are notified to
>> > switchdev in a deferred context.
>> >
>> > Grabbing the bridge multicast spinlock only solves the first problem: it
>> > stops new enqueues of deferred events. We also need special handling of
>> > the pending deferred events. The idea there is that we cannot cancel
>> > them, since that would lead to complications for other potential
>> > listeners on the switchdev chain. And we cannot flush them either, since
>> > that wouldn't actually help: flushing needs sleepable context, which is
>> > incompatible with holding br->multicast_lock, and without
>> > br->multicast_lock held, we haven't actually solved anything, since new
>> > deferred events can still be enqueued at any time.
>> >
>> > So the only simple solution is to let the pending deferred events
>> > execute (eventually), but during event replay on joining port, exclude
>> > replaying those multicast elements which are in the bridge's multicast
>> > list, but were not yet added through switchdev. Eventually they will be.
>> 
>> In my opinion, your three paragraphs above say pretty much the same
>> thing as my single paragraph. But sure, I will try to provide more
>> details in v4.
>
> It's not about word count, I'm notoriously bad at summarizing. If you
> could still say in a single paragraph what I did in 3, it would be great.
>
> The difference is that the above 3 paragraphs only explore the race on
> MDB addition events. It is of a different nature that the one on deletion.
> Your analysis clumps them together, and the code follows suit. There is
> code to supposedly handle the race between deferred deletion events and
> replay on port removal, but I don't think it does.

That's the thing though: I did not lump them together, I simply did not
think about the deletion issue at all. An issue which you yourself state
should probably be fixed in a separate patch, presumably because you
think of them as two very closely related, but ultimately separate,
issues. Which is why I think it was needlessly harsh to say that my
analysis of the duplicate-events-issue was too simplistic, because it
did not address a related issue that I had not considered.

>> > (side note: the handling code for excluding replays on pending event
>> > deletion seems to not actually help, because)
>> >
>> > Event replays on a switchdev port leaving the bridge are also
>> > problematic, but in a different way. The deletion procedure is also
>> > non-atomic, they are first removed from br->mdb_list then the switchdev
>> > notification is deferred. So, the replay procedure cannot enter a
>> > condition where it replays the deletion twice. But, there is no
>> > switchdev_deferred_process() call when the switchdev port unoffloads an
>> > intermediary LAG bridge port, and this can lead to the situation where
>> > neither the replay, nor the deferred switchdev object, trickle down to a
>> > call in the switchdev driver. So for event deletion, we need to force a
>> > synchronous call to switchdev_deferred_process().
>> 
>> Good catch!
>> 
>> What does this mean for v4? Is this still one logical change that
>> ensures that MDB events are always delivered exactly once, or a series
>> where one patch fixes the duplicates and one ensures that no events
>> are missed?
>
> I'm not saying I know how you can satisfy everyone's review requests.

Of course, I was merely trying to avoid an extra round of patches.

> Once it is clear that deferred additions and deferred removals need
> separate treatment, it would probably be preferable to treat them in
> separate patches.

I should have been more clear in my response. When I said "Good catch!",
that was because I had first reproduced the issue, and then verified
that adding a call to switchdev_deferred_process() at the end of
nbp_switchdev_unsync_objs() solves the issue.
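
Roughly like this (a sketch only: the existing replay calls are elided,
and the prototype is quoted from memory, so it may not match the tree
exactly):

    static void nbp_switchdev_unsync_objs(struct net_bridge_port *p,
                                          const void *ctx,
                                          struct notifier_block *atomic_nb,
                                          struct notifier_block *blocking_nb)
    {
            /* ... existing fdb/mdb/vlan replays with adding == false ... */

            /* Flush the deferred queue synchronously, so that deletions
             * queued before or during the replay are guaranteed to reach
             * the driver before the port stops offloading the bridge.
             */
            switchdev_deferred_process();
    }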

>> > See how the analysis in the commit message changes the patch?
>> 
>> Suddenly the novice was enlightened.
>
> This, to me, reads like unnecessary sarcasm.

Your question, to me, read like unnecessary condescension.

> I just mean that I don't think you spent enough time to explore the
> problem space in the commit message. It is a very important element of
> a patch, because it is very rare for someone to solve a problem which
> is not even formulated. I actually tried to be helpful by pointing out
> where things might not work out. I'm sorry if it did not appear that way.

I am extremely grateful for your help, truly! Which is why I
complimented you for pointing out the second issue. Something
about the phrasing of that question really rubbed me the wrong way
though. I should have just let it go, I'm sorry.
Vladimir Oltean Feb. 7, 2024, 4:36 p.m. UTC | #7
On Tue, Feb 06, 2024 at 10:58:18PM +0100, Tobias Waldekranz wrote:
> On tis, feb 06, 2024 at 21:31, Vladimir Oltean <olteanv@gmail.com> wrote:
> > On Tue, Feb 06, 2024 at 03:54:25PM +0100, Tobias Waldekranz wrote:
> >> On mån, feb 05, 2024 at 13:41, Vladimir Oltean <olteanv@gmail.com> wrote:
> >> > On Thu, Feb 01, 2024 at 05:10:45PM +0100, Tobias Waldekranz wrote:
> >> >> Before this change, generation of the list of events MDB to replay
> >> >
> >> > s/events MDB/MDB events/
> >> >
> >> >> would race against the IGMP/MLD snooping logic, which could concurrently
> >> >
> >> > logic. This could (...)
> >> >
> >> >> enqueue events to the switchdev deferred queue, leading to duplicate
> >> >> events being sent to drivers. As a consequence of this, drivers which
> >> >> reference count memberships (at least DSA), would be left with orphan
> >> >> groups in their hardware database when the bridge was destroyed.
> >> >
> >> > Still missing the user impact description, aka "when would this be
> >> > noticed by, and actively bother an user?". Something that would justify
> >> > handling this via net.git rather than net-next.git.
> >> 
> >> Other than moving up the example to this point, what do you feel is
> >> missing? Or are you saying that you don't think this belongs on net.git?
> >
> > I just don't have enough data to be able to judge (it's missing from the
> > commit message). I'm looking at Documentation/process/stable-kernel-rules.rst
> > as a reference. It lists a series of guidelines from which I gather,
> > generally speaking, that there should exist a user impact on
> > functionality. The "mvls atu" command is a debug program, it just dumps
> > hardware tables. It is enough to point out that multicast entries leak,
> > but not what will be broken as a result of that.
> 
> Fair enough.
> 
> I originally discovered the issue while developing a kselftest for
> another improvement I want to make to switchdev multicast offloading
> (related to router ports). I thought my new code was causing these
> orphan entries, but then realized that the issue existed on a plain -net
> and -net-next kernel.
> 
> I can imagine scenarios in which a user _could_ be impacted by this, but
> they are all quite esoteric. Maybe that is an indication that I should
> just target net-next instead.

Again, I'm not discouraging you from sending this patch to net.git. I've been
saying (since v1, apparently: https://lore.kernel.org/netdev/20240131135157.ddrtt4swvz5y3nbz@skbuf/)
that if there is a reason to do it, just say it in the commit message,
so that we're all on the same page.

Be verbose when weighing the cost/benefit ratio (preferably also in the
commit message). I would rate the complexity of the change as "medium",
since it does change the locking scheme to fix an underdesigned
mechanism. And the fact that mlxsw also uses replays through a slightly
different code path means there is something you (probably) can't test,
which is potentially risky.

> >> > The race has 2 components, one comes from the fact that during replay,
> >> > we iterate using RCU, which does not halt concurrent updates, and the
> >> > other comes from the fact that the MDB addition procedure is non-atomic.
> >> > Elements are first added to the br->mdb_list, but are notified to
> >> > switchdev in a deferred context.
> >> >
> >> > Grabbing the bridge multicast spinlock only solves the first problem: it
> >> > stops new enqueues of deferred events. We also need special handling of
> >> > the pending deferred events. The idea there is that we cannot cancel
> >> > them, since that would lead to complications for other potential
> >> > listeners on the switchdev chain. And we cannot flush them either, since
> >> > that wouldn't actually help: flushing needs sleepable context, which is
> >> > incompatible with holding br->multicast_lock, and without
> >> > br->multicast_lock held, we haven't actually solved anything, since new
> >> > deferred events can still be enqueued at any time.
> >> >
> >> > So the only simple solution is to let the pending deferred events
> >> > execute (eventually), but during event replay on joining port, exclude
> >> > replaying those multicast elements which are in the bridge's multicast
> >> > list, but were not yet added through switchdev. Eventually they will be.
> >> 
> >> In my opinion, your three paragraphs above say pretty much the same
> >> thing as my single paragraph. But sure, I will try to provide more
> >> details in v4.
> >
> > It's not about word count, I'm notoriously bad at summarizing. If you
> > could still say in a single paragraph what I did in 3, it would be great.
> >
> > The difference is that the above 3 paragraphs only explore the race on
> > MDB addition events. It is of a different nature that the one on deletion.
> > Your analysis clumps them together, and the code follows suit. There is
> > code to supposedly handle the race between deferred deletion events and
> > replay on port removal, but I don't think it does.
> 
> That's the thing though: I did not lump them together, I simply did not
> think about the deletion issue at all. An issue which you yourself state
> should probably be fixed in a separate patch, presumably becuase you
> think of them as two very closely related, but ultimately separate,
> issues. Which is why I think it was needlessly harsh to say that my
> analysis of the duplicate-events-issue was too simplistic, because it
> did not address a related issue that I had not considered.

So in the comments for your v1, one of the things I told you was that
"there's one more race to deal with" on removal.
https://lore.kernel.org/netdev/20240131150642.mxcssv7l5qfiejkl@skbuf/

At the time, the idea had not crystalized in my mind either that it's
not "one more" race on removal, but that on removal it's _the only_
race.

You then submitted v2 and v3, not taking this comment into consideration,
not even to explore it. Exploring it thoroughly would have revealed that
your approach - to apply the algorithm of excluding objects from replay
if events on them are pending in switchdev - is completely unnecessary
when "adding=false".

In light of my comment and the lack of a detailed analysis on your side
on removal, it _appears_, by looking at this change, that the patch
handles a race on removal, but the code actually runs through the
algorithm that handles a race that doesn't exist, and doesn't handle the
race that _does_ exist.

My part of the fault is doing spotty review, giving ideas piecemeal, and
getting frustrated you aren't picking all of them up. It's still going
to be a really busy few weeks/months for me ahead, and there's nothing I
can do about that. I'm faced with the alternatives of either doing a
shit job reviewing and leaving comments/ideas in the little time I have,
or not reviewing at all. I'll think about my options some more.

Patch

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index a43062d4c734..8346b0d29542 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -308,6 +308,9 @@  void switchdev_deferred_process(void);
 int switchdev_port_attr_set(struct net_device *dev,
 			    const struct switchdev_attr *attr,
 			    struct netlink_ext_ack *extack);
+bool switchdev_port_obj_act_is_deferred(struct net_device *dev,
+					enum switchdev_notifier_type nt,
+					const struct switchdev_obj *obj);
 int switchdev_port_obj_add(struct net_device *dev,
 			   const struct switchdev_obj *obj,
 			   struct netlink_ext_ack *extack);
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index ee84e783e1df..6d3fb4292071 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -595,21 +595,35 @@  br_switchdev_mdb_replay_one(struct notifier_block *nb, struct net_device *dev,
 }
 
 static int br_switchdev_mdb_queue_one(struct list_head *mdb_list,
+				      struct net_device *dev,
+				      unsigned long action,
 				      enum switchdev_obj_id id,
 				      const struct net_bridge_mdb_entry *mp,
 				      struct net_device *orig_dev)
 {
-	struct switchdev_obj_port_mdb *mdb;
+	struct switchdev_obj_port_mdb mdb = {
+		.obj = {
+			.id = id,
+			.orig_dev = orig_dev,
+		},
+	};
+	struct switchdev_obj_port_mdb *pmdb;
 
-	mdb = kzalloc(sizeof(*mdb), GFP_ATOMIC);
-	if (!mdb)
-		return -ENOMEM;
+	br_switchdev_mdb_populate(&mdb, mp);
 
-	mdb->obj.id = id;
-	mdb->obj.orig_dev = orig_dev;
-	br_switchdev_mdb_populate(mdb, mp);
-	list_add_tail(&mdb->obj.list, mdb_list);
+	if (switchdev_port_obj_act_is_deferred(dev, action, &mdb.obj)) {
+		/* This event is already in the deferred queue of
+		 * events, so this replay must be elided, lest the
+		 * driver receives duplicate events for it.
+		 */
+		return 0;
+	}
+
+	pmdb = kmemdup(&mdb, sizeof(mdb), GFP_ATOMIC);
+	if (!pmdb)
+		return -ENOMEM;
 
+	list_add_tail(&pmdb->obj.list, mdb_list);
 	return 0;
 }
 
@@ -677,51 +691,50 @@  br_switchdev_mdb_replay(struct net_device *br_dev, struct net_device *dev,
 	if (!br_opt_get(br, BROPT_MULTICAST_ENABLED))
 		return 0;
 
-	/* We cannot walk over br->mdb_list protected just by the rtnl_mutex,
-	 * because the write-side protection is br->multicast_lock. But we
-	 * need to emulate the [ blocking ] calling context of a regular
-	 * switchdev event, so since both br->multicast_lock and RCU read side
-	 * critical sections are atomic, we have no choice but to pick the RCU
-	 * read side lock, queue up all our events, leave the critical section
-	 * and notify switchdev from blocking context.
+	if (adding)
+		action = SWITCHDEV_PORT_OBJ_ADD;
+	else
+		action = SWITCHDEV_PORT_OBJ_DEL;
+
+	/* br_switchdev_mdb_queue_one() will take care to not queue a
+	 * replay of an event that is already pending in the switchdev
+	 * deferred queue. In order to safely determine that, there
+	 * must be no new deferred MDB notifications enqueued for the
+	 * duration of the MDB scan. Therefore, grab the write-side
+	 * lock to avoid racing with any concurrent IGMP/MLD snooping.
 	 */
-	rcu_read_lock();
+	spin_lock_bh(&br->multicast_lock);
 
-	hlist_for_each_entry_rcu(mp, &br->mdb_list, mdb_node) {
+	hlist_for_each_entry(mp, &br->mdb_list, mdb_node) {
 		struct net_bridge_port_group __rcu * const *pp;
 		const struct net_bridge_port_group *p;
 
 		if (mp->host_joined) {
-			err = br_switchdev_mdb_queue_one(&mdb_list,
+			err = br_switchdev_mdb_queue_one(&mdb_list, dev, action,
 							 SWITCHDEV_OBJ_ID_HOST_MDB,
 							 mp, br_dev);
 			if (err) {
-				rcu_read_unlock();
+				spin_unlock_bh(&br->multicast_lock);
 				goto out_free_mdb;
 			}
 		}
 
-		for (pp = &mp->ports; (p = rcu_dereference(*pp)) != NULL;
+		for (pp = &mp->ports; (p = mlock_dereference(*pp, br)) != NULL;
 		     pp = &p->next) {
 			if (p->key.port->dev != dev)
 				continue;
 
-			err = br_switchdev_mdb_queue_one(&mdb_list,
+			err = br_switchdev_mdb_queue_one(&mdb_list, dev, action,
 							 SWITCHDEV_OBJ_ID_PORT_MDB,
 							 mp, dev);
 			if (err) {
-				rcu_read_unlock();
+				spin_unlock_bh(&br->multicast_lock);
 				goto out_free_mdb;
 			}
 		}
 	}
 
-	rcu_read_unlock();
-
-	if (adding)
-		action = SWITCHDEV_PORT_OBJ_ADD;
-	else
-		action = SWITCHDEV_PORT_OBJ_DEL;
+	spin_unlock_bh(&br->multicast_lock);
 
 	list_for_each_entry(obj, &mdb_list, list) {
 		err = br_switchdev_mdb_replay_one(nb, dev,
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 5b045284849e..7d11f31820df 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -19,6 +19,35 @@ 
 #include <linux/rtnetlink.h>
 #include <net/switchdev.h>
 
+static bool switchdev_obj_eq(const struct switchdev_obj *a,
+			     const struct switchdev_obj *b)
+{
+	const struct switchdev_obj_port_vlan *va, *vb;
+	const struct switchdev_obj_port_mdb *ma, *mb;
+
+	if (a->id != b->id || a->orig_dev != b->orig_dev)
+		return false;
+
+	switch (a->id) {
+	case SWITCHDEV_OBJ_ID_PORT_VLAN:
+		va = SWITCHDEV_OBJ_PORT_VLAN(a);
+		vb = SWITCHDEV_OBJ_PORT_VLAN(b);
+		return va->flags == vb->flags &&
+			va->vid == vb->vid &&
+			va->changed == vb->changed;
+	case SWITCHDEV_OBJ_ID_PORT_MDB:
+	case SWITCHDEV_OBJ_ID_HOST_MDB:
+		ma = SWITCHDEV_OBJ_PORT_MDB(a);
+		mb = SWITCHDEV_OBJ_PORT_MDB(b);
+		return ma->vid == mb->vid &&
+			ether_addr_equal(ma->addr, mb->addr);
+	default:
+		break;
+	}
+
+	BUG();
+}
+
 static LIST_HEAD(deferred);
 static DEFINE_SPINLOCK(deferred_lock);
 
@@ -307,6 +336,50 @@  int switchdev_port_obj_del(struct net_device *dev,
 }
 EXPORT_SYMBOL_GPL(switchdev_port_obj_del);
 
+/**
+ *	switchdev_port_obj_act_is_deferred - Is object action pending?
+ *
+ *	@dev: port device
+ *	@nt: type of action; add or delete
+ *	@obj: object to test
+ *
+ *	Returns true if a deferred item exists which is equivalent to
+ *	performing the action @nt on the object @obj.
+ *
+ *	rtnl_lock must be held.
+ */
+bool switchdev_port_obj_act_is_deferred(struct net_device *dev,
+					enum switchdev_notifier_type nt,
+					const struct switchdev_obj *obj)
+{
+	struct switchdev_deferred_item *dfitem;
+	bool found = false;
+
+	ASSERT_RTNL();
+
+	spin_lock_bh(&deferred_lock);
+
+	list_for_each_entry(dfitem, &deferred, list) {
+		if (dfitem->dev != dev)
+			continue;
+
+		if ((dfitem->func == switchdev_port_obj_add_deferred &&
+		     nt == SWITCHDEV_PORT_OBJ_ADD) ||
+		    (dfitem->func == switchdev_port_obj_del_deferred &&
+		     nt == SWITCHDEV_PORT_OBJ_DEL)) {
+			if (switchdev_obj_eq((const void *)dfitem->data, obj)) {
+				found = true;
+				break;
+			}
+		}
+	}
+
+	spin_unlock_bh(&deferred_lock);
+
+	return found;
+}
+EXPORT_SYMBOL_GPL(switchdev_port_obj_act_is_deferred);
+
 static ATOMIC_NOTIFIER_HEAD(switchdev_notif_chain);
 static BLOCKING_NOTIFIER_HEAD(switchdev_blocking_notif_chain);