diff mbox series

[RFC,net-next,07/10] net: dsa: mv88e6xxx: Track bridge mdb objects

Message ID 20240402001137.2980589-8-Joseph.Huang@garmin.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series MC Flood disable and snooping | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 943 this patch: 943
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 7 of 7 maintainers
netdev/build_clang success Errors and warnings before: 954 this patch: 954
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 954 this patch: 954
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 161 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Joseph Huang April 2, 2024, 12:11 a.m. UTC
Keep track of bridge mdb objects in the driver.

Similar to the previous patch, since the driver doesn't get explicit
notifications about mdb group creation or destruction, just create
the mdb group when the first port joins the group via
mv88e6xxx_port_mdb_add(), and destroy the group when the last port
leaves the group via mv88e6xxx_port_mdb_del().

Use the group's L2 address together with the VLAN ID as the key to the list.
Port membership is again stored in a bitmask.

Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 117 +++++++++++++++++++++++++++++++
 1 file changed, 117 insertions(+)

Comments

Vladimir Oltean April 2, 2024, 12:23 p.m. UTC | #1
On Mon, Apr 01, 2024 at 08:11:06PM -0400, Joseph Huang wrote:
> Keep track of bridge mdb objects in the driver.
> 
> Similar to the previous patch, since the driver doesn't get explicit
> notifications about mdb group creation or destruction, just create
> the mdb group when the first port joins the group via
> mv88e6xxx_port_mdb_add(), and destroys the group when the last port left
> the group via mv88e6xxx_port_mdb_del().
> 
> Use the group's L2 address together with the VLAN ID as the key to the list.
> Port membership is again stored in a bitmask.
> 
> Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
> ---

Can you comment on the feasibility/infeasibility of Tobias' proposal of:
"The bridge could just provide some MDB iterator to save us from having
to cache all the configured groups."?
https://lore.kernel.org/netdev/87sg31n04a.fsf@waldekranz.com/

What is done here will have to be scaled to many drivers - potentially
all existing DSA ones, as far as I'm aware.
Joseph Huang April 4, 2024, 8:43 p.m. UTC | #2
Hi Vladimir,

On 4/2/2024 8:23 AM, Vladimir Oltean wrote:
> Can you comment on the feasibility/infeasibility of Tobias' proposal of:
> "The bridge could just provide some MDB iterator to save us from having
> to cache all the configured groups."?
> https://lore.kernel.org/netdev/87sg31n04a.fsf@waldekranz.com/
> 
> What is done here will have to be scaled to many drivers - potentially
> all existing DSA ones, as far as I'm aware.
> 

I thought about implementing an MDB iterator as suggested by Tobias, but 
I'm a bit concerned about the coherence of these MDB objects. In theory, 
when the device driver is trying to act on an event, the source of the 
trigger may have changed its state in the bridge already. If, upon 
receiving an event in the device driver, we iterate over what the bridge 
has at that instant, the differences between the worlds as seen by the 
bridge and the device driver might lead to some unexpected results. 
However, if we cache the MDB objects in the device driver, at least the 
order in which the events took place will be coherent and at any given
time the state of the MDB objects in the device driver can be guaranteed 
to be sane. This is also the approach the prestera device driver took.

Thanks,
Joseph
Vladimir Oltean April 5, 2024, 11:07 a.m. UTC | #3
On Thu, Apr 04, 2024 at 04:43:38PM -0400, Joseph Huang wrote:
> Hi Vladimir,
> 
> On 4/2/2024 8:23 AM, Vladimir Oltean wrote:
> > Can you comment on the feasibility/infeasibility of Tobias' proposal of:
> > "The bridge could just provide some MDB iterator to save us from having
> > to cache all the configured groups."?
> > https://lore.kernel.org/netdev/87sg31n04a.fsf@waldekranz.com/
> > 
> > What is done here will have to be scaled to many drivers - potentially
> > all existing DSA ones, as far as I'm aware.
> > 
> 
> I thought about implementing an MDB iterator as suggested by Tobias, but I'm
> a bit concerned about the coherence of these MDB objects. In theory, when
> the device driver is trying to act on an event, the source of the trigger
> may have changed its state in the bridge already.

Yes, this is the result of SWITCHDEV_F_DEFER, used by both
SWITCHDEV_ATTR_ID_PORT_MROUTER and SWITCHDEV_OBJ_ID_PORT_MDB.

> If, upon receiving an event in the device driver, we iterate over what
> the bridge has at that instant, the differences between the worlds as
> seen by the bridge and the device driver might lead to some unexpected
> results.

Translated: iterating over bridge MDB objects needs to be serialized
with new switchdev events by acquiring rtnl_lock(). Then, once switchdev
events are temporarily blocked, the pending ones need to be flushed
using switchdev_deferred_process(), so resync the bridge state with the
driver state. Once the resync is done, the iteration is safe until
rtnl_unlock().

Applied to our case, the MDB iterator is needed in mv88e6xxx_port_mrouter().
This is already called with rtnl_lock() acquired. The resync procedure
will indirectly call mv88e6xxx_port_mdb_add()/mv88e6xxx_port_mdb_del()
through switchdev_deferred_process(), and then the walk is consistent
for the remainder of the mv88e6xxx_port_mrouter() function.

A helper which does this is what would be required - an iterator
function which calls an int (*cb)(struct net_device *brport, const struct switchdev_obj_port_mdb *mdb)
for each MDB entry. The DSA core could then offer some post-processing
services over this API, to recover the struct dsa_port associated with
the bridge port (in the LAG case they aren't the same) and the address
database associated with the bridge.

Do you think there would be unexpected results even if we did this?
br_switchdev_mdb_replay() needs to handle a similarly complicated
situation of synchronizing with deferred MDB events.

> However, if we cache the MDB objects in the device driver, at least
> the order in which the events took place will be coherent and at any
> give time the state of the MDB objects in the device driver can be
> guaranteed to be sane. This is also the approach the prestera device
> driver took.

Not contesting this, but I wouldn't like to see MDBs cached in each
device driver just for this. Switchdev is not very high on the list of
APIs which are easy to use, and making MDB caching a requirement
(for the common case that MDB entry destinations need software fixups
with the mrouter ports) isn't exactly going to make that any better.
Others' opinion may differ, but mine is that core offload APIs need to
consider what hardware is available in the real world, make the common
case easy, and the advanced cases possible. Rather than make every case
"advanced" :)
Joseph Huang April 5, 2024, 6:58 p.m. UTC | #4
Hi Vladimir,

On 4/5/2024 7:07 AM, Vladimir Oltean wrote:
> On Thu, Apr 04, 2024 at 04:43:38PM -0400, Joseph Huang wrote:
>> Hi Vladimir,
>>
>> On 4/2/2024 8:23 AM, Vladimir Oltean wrote:
>>> Can you comment on the feasibility/infeasibility of Tobias' proposal of:
>>> "The bridge could just provide some MDB iterator to save us from having
>>> to cache all the configured groups."?
>>> https://lore.kernel.org/netdev/87sg31n04a.fsf@waldekranz.com/
>>>
>>> What is done here will have to be scaled to many drivers - potentially
>>> all existing DSA ones, as far as I'm aware.
>>>
>>
>> I thought about implementing an MDB iterator as suggested by Tobias, but I'm
>> a bit concerned about the coherence of these MDB objects. In theory, when
>> the device driver is trying to act on an event, the source of the trigger
>> may have changed its state in the bridge already.
> 
> Yes, this is the result of SWITCHDEV_F_DEFER, used by both
> SWITCHDEV_ATTR_ID_PORT_MROUTER and SWITCHDEV_OBJ_ID_PORT_MDB.
> 
>> If, upon receiving an event in the device driver, we iterate over what
>> the bridge has at that instant, the differences between the worlds as
>> seen by the bridge and the device driver might lead to some unexpected
>> results.
> 
> Translated: iterating over bridge MDB objects needs to be serialized
> with new switchdev events by acquiring rtnl_lock(). Then, once switchdev
> events are temporarily blocked, the pending ones need to be flushed
> using switchdev_deferred_process(), so resync the bridge state with the
> driver state. Once the resync is done, the iteration is safe until
> rtnl_unlock().
> 
> Applied to our case, the MDB iterator is needed in mv88e6xxx_port_mrouter().
> This is already called with rtnl_lock() acquired. The resync procedure
> will indirectly call mv88e6xxx_port_mdb_add()/mv88e6xxx_port_mdb_del()
> through switchdev_deferred_process(), and then the walk is consistent
> for the remainder of the mv88e6xxx_port_mrouter() function.
> 
> A helper which does this is what would be required - an iterator
> function which calls an int (*cb)(struct net_device *brport, const struct switchdev_obj_port_mdb *mdb)
> for each MDB entry. The DSA core could then offer some post-processing
> services over this API, to recover the struct dsa_port associated with
> the bridge port (in the LAG case they aren't the same) and the address
> database associated with the bridge.
> 
> Do you think there would be unexpected results even if we did this?
> br_switchdev_mdb_replay() needs to handle a similarly complicated
> situation of synchronizing with deferred MDB events.
>
>> However, if we cache the MDB objects in the device driver, at least
>> the order in which the events took place will be coherent and at any
>> give time the state of the MDB objects in the device driver can be
>> guaranteed to be sane. This is also the approach the prestera device
>> driver took.
> 
> Not contesting this, but I wouldn't like to see MDBs cached in each
> device driver just for this. Switchdev is not very high on the list of
> APIs which are easy to use, and making MDB caching a requirement
> (for the common case that MDB entry destinations need software fixups
> with the mrouter ports) isn't exactly going to make that any better.
> Others' opinion may differ, but mine is that core offload APIs need to
> consider what hardware is available in the real world, make the common
> case easy, and the advanced cases possible. Rather than make every case
> "advanced" :)

Just throwing some random ideas out there. Do you think it would make 
more sense if this whole solution (rtnl_lock, iterator cb, etc.) were 
moved up to DSA so that other DSA drivers could benefit from it? I 
thought about implementing it (not the iterator, the current form) in 
DSA at first, but I'm not sure how other drivers would behave, so I did 
it with mv instead.

I guess the question is, is the current limitation (mrouter not properly 
offloaded) an issue specific to mv or is it a limitation of hardware 
offloading in general? I tend to think it's the latter.

But then again, if we move it to DSA, we would lose the benefit of the 
optimization of consolidating multiple register writes into just one (as 
done in patch 10 currently), unless we add a new switch op which takes a 
portvec instead of a port when modifying mdb's.
Joseph Huang April 29, 2024, 10:07 p.m. UTC | #5
On 4/5/2024 2:58 PM, Joseph Huang wrote:
> Hi Vladimir,
> 
> On 4/5/2024 7:07 AM, Vladimir Oltean wrote:
>> On Thu, Apr 04, 2024 at 04:43:38PM -0400, Joseph Huang wrote:
>>> Hi Vladimir,
>>>
>>> On 4/2/2024 8:23 AM, Vladimir Oltean wrote:
>>>> Can you comment on the feasibility/infeasibility of Tobias' proposal 
>>>> of:
>>>> "The bridge could just provide some MDB iterator to save us from having
>>>> to cache all the configured groups."?
>>>> https://lore.kernel.org/netdev/87sg31n04a.fsf@waldekranz.com/
>>>>
>>>> What is done here will have to be scaled to many drivers - potentially
>>>> all existing DSA ones, as far as I'm aware.
>>>>
>>>
>>> I thought about implementing an MDB iterator as suggested by Tobias, 
>>> but I'm
>>> a bit concerned about the coherence of these MDB objects. In theory, 
>>> when
>>> the device driver is trying to act on an event, the source of the 
>>> trigger
>>> may have changed its state in the bridge already.
>>
>> Yes, this is the result of SWITCHDEV_F_DEFER, used by both
>> SWITCHDEV_ATTR_ID_PORT_MROUTER and SWITCHDEV_OBJ_ID_PORT_MDB.
>>
>>> If, upon receiving an event in the device driver, we iterate over what
>>> the bridge has at that instant, the differences between the worlds as
>>> seen by the bridge and the device driver might lead to some unexpected
>>> results.
>>
>> Translated: iterating over bridge MDB objects needs to be serialized
>> with new switchdev events by acquiring rtnl_lock(). Then, once switchdev
>> events are temporarily blocked, the pending ones need to be flushed
>> using switchdev_deferred_process(), so resync the bridge state with the
>> driver state. Once the resync is done, the iteration is safe until
>> rtnl_unlock().
>>
>> Applied to our case, the MDB iterator is needed in 
>> mv88e6xxx_port_mrouter().
>> This is already called with rtnl_lock() acquired. The resync procedure
>> will indirectly call mv88e6xxx_port_mdb_add()/mv88e6xxx_port_mdb_del()
>> through switchdev_deferred_process(), and then the walk is consistent
>> for the remainder of the mv88e6xxx_port_mrouter() function.
>>
>> A helper which does this is what would be required - an iterator
>> function which calls an int (*cb)(struct net_device *brport, const 
>> struct switchdev_obj_port_mdb *mdb)
>> for each MDB entry. The DSA core could then offer some post-processing
>> services over this API, to recover the struct dsa_port associated with
>> the bridge port (in the LAG case they aren't the same) and the address
>> database associated with the bridge.

Something like this (some layers omitted for brevity)?

                                       +br_iterator
                                       |  for each mdb
                                       |    _br_switchdev_mdb_notify
rtnl_lock                             |      without F_DEFER flag
  |                                    |      |
  +switchdev_port_attr_set_deferred    |      +switchdev_port_obj_notify
    |                                  |        |
    +dsa_port_mrouter                  |        +dsa_user_port_obj_a/d
      |                                |          |
      +mv88e6xxx_port_mrouter----------+          +mv88e6xxx_port_obj_a/d
                                         |
  +--------------------------------------+
  |
rtnl_unlock

Note that on the system I tested, each register read/write takes about 
100us to complete. For 100's of mdb groups, this would mean that we will 
be holding rtnl lock for 10's of ms. I don't know if it's considered too 
long.


>>
>> Do you think there would be unexpected results even if we did this?
>> br_switchdev_mdb_replay() needs to handle a similarly complicated
>> situation of synchronizing with deferred MDB events.
>>
>>> However, if we cache the MDB objects in the device driver, at least
>>> the order in which the events took place will be coherent and at any
>>> give time the state of the MDB objects in the device driver can be
>>> guaranteed to be sane. This is also the approach the prestera device
>>> driver took.
>>
>> Not contesting this, but I wouldn't like to see MDBs cached in each
>> device driver just for this. Switchdev is not very high on the list of
>> APIs which are easy to use, and making MDB caching a requirement
>> (for the common case that MDB entry destinations need software fixups
>> with the mrouter ports) isn't exactly going to make that any better.
>> Others' opinion may differ, but mine is that core offload APIs need to
>> consider what hardware is available in the real world, make the common
>> case easy, and the advanced cases possible. Rather than make every case
>> "advanced" :)
> 
> Just throwing some random ideas out there. Do you think it would make 
> more sense if this whole solution (rtnl_lock, iterator cb,...etc.) is 
> moved up to DSA so that other DSA drivers could benefit from it? I 
> thought about implement it (not the iterator, the current form) in DSA 
> at first, but I'm not sure how other drivers would behave, so I did it 
> with mv instead.
> 
> I guess the question is, is the current limitation (mrouter not properly 
> offloaded) an issue specific to mv or is it a limitation of hardware 
> offloading in general? I tend to think it's the latter.
> 
> But then again, if we move it to DSA, we would lose the benefit of the 
> optimization of consolidating multiple register writes into just one (as 
> done in patch 10 currently), unless we add a new switch op which takes a 
> portvec instead of a port when modifying mdb's.
Vladimir Oltean April 30, 2024, 12:59 a.m. UTC | #6
On Mon, Apr 29, 2024 at 06:07:25PM -0400, Joseph Huang wrote:
> Something like this (some layers omitted for brevity)?
> 
>                                       +br_iterator
>                                       |  for each mdb
>                                       |    _br_switchdev_mdb_notify
> rtnl_lock                             |      without F_DEFER flag
>  |                                    |      |
>  +switchdev_port_attr_set_deferred    |      +switchdev_port_obj_notify
>    |                                  |        |
>    +dsa_port_mrouter                  |        +dsa_user_port_obj_a/d
>      |                                |          |
>      +mv88e6xxx_port_mrouter----------+          +mv88e6xxx_port_obj_a/d
>                                         |
>  +--------------------------------------+
>  |
> rtnl_unlock

At a _very_ superficial glance, I don't think you are properly
accounting for the fact that even with rtnl_lock() held, there are still
SWITCHDEV_OBJ_ID_PORT_MDB events which may be pending on the switchdev
chain. Without a switchdev_deferred_process() flush call, you won't be
getting rid of them, so when you rtnl_unlock(), they will still run.

Even worse, holding rtnl_lock() will not stop the bridge multicast layer
from modifying its br->mdb_list; only br->multicast_lock will.

So you may be better off also acquiring br->multicast_lock, and
notifying the MDB entries to the switchdev chain _with_the F_DEFER flag.

> Note that on the system I tested, each register read/write takes about 100us
> to complete. For 100's of mdb groups, this would mean that we will be
> holding rtnl lock for 10's of ms. I don't know if it's considered too long.

Not sure how this is going to be any better if the iteration over MDB
entries is done 100% in the driver, though - since its hook,
dsa_port_mrouter(), runs entirely under rtnl_lock().

Anyway, with the SWITCHDEV_F_DEFER flag, maybe the mdb object
notifications can be made to run by switchdev only a few at a time, to
give the network stack time to do other things as well.
Joseph Huang April 30, 2024, 4:27 p.m. UTC | #7
On 4/29/2024 8:59 PM, Vladimir Oltean wrote:
> On Mon, Apr 29, 2024 at 06:07:25PM -0400, Joseph Huang wrote:
>> Something like this (some layers omitted for brevity)?
>>
>>                                        +br_iterator
>>                                        |  for each mdb
>>                                        |    _br_switchdev_mdb_notify
>> rtnl_lock                             |      without F_DEFER flag
>>   |                                    |      |
>>   +switchdev_port_attr_set_deferred    |      +switchdev_port_obj_notify
>>     |                                  |        |
>>     +dsa_port_mrouter                  |        +dsa_user_port_obj_a/d
>>       |                                |          |
>>       +mv88e6xxx_port_mrouter----------+          +mv88e6xxx_port_obj_a/d
>>                                          |
>>   +--------------------------------------+
>>   |
>> rtnl_unlock
> 
> At a _very_ superficial glance, I don't think you are properly
> accounting for the fact that even with rtnl_lock() held, there are still
> SWITCHDEV_OBJ_ID_PORT_MDB events which may be pending on the switchdev
> chain. Without a switchdev_deferred_process() flush call, you won't be
> getting rid of them, so when you rtnl_unlock(), they will still run.
> 
> Even worse, holding rtnl_lock() will not stop the bridge multicast layer
> from modifying its br->mdb_list; only br->multicast_lock will.
> 
> So you may be better off also acquiring br->multicast_lock, and
> notifying the MDB entries to the switchdev chain _with_the F_DEFER flag.

Like this?

                                       +br_iterator(dsa_cb)
                                       |  lock br->multicast_lock
                                       |  for each mdb
                                       |    br_switchdev_mdb_notify
rtnl_lock                             |      |
  |                                    |      +switchdev_port_obj_._defer
  +switchdev_port_attr_set_deferred    |  unlock br->multicast_lock
    |                                  |
    +dsa_port_mrouter                  |
      |                                |
      +mv88e6xxx_port_mrouter----------+
                                         |
  +--------------------------------------+
  |
rtnl_unlock

(potential task change)

rtnl_lock
  |
  +switchdev_deferred_process
  | flush all queued dfitems in queuing order
  |
rtnl_unlock

I'm not that familiar with the bridge code, but is there any concern 
with potential deadlock here (between rtnl_lock and br->multicast_lock)?

> 
>> Note that on the system I tested, each register read/write takes about 100us
>> to complete. For 100's of mdb groups, this would mean that we will be
>> holding rtnl lock for 10's of ms. I don't know if it's considered too long.
> 
> Not sure how this is going to be any better if the iteration over MDB
> entries is done 100% in the driver, though - since its hook,
> dsa_port_mrouter(), runs entirely under rtnl_lock().
>
> Anyway, with the SWITCHDEV_F_DEFER flag, maybe the mdb object
> notifications can be made to run by switchdev only a few at a time, to
> give the network stack time to do other things as well.
Joseph Huang May 2, 2024, 8:37 p.m. UTC | #8
On 4/30/2024 12:27 PM, Joseph Huang wrote:
> On 4/29/2024 8:59 PM, Vladimir Oltean wrote:
>> On Mon, Apr 29, 2024 at 06:07:25PM -0400, Joseph Huang wrote:
>>> Something like this (some layers omitted for brevity)?
>>>
>>>                                        +br_iterator
>>>                                        |  for each mdb
>>>                                        |    _br_switchdev_mdb_notify
>>>  rtnl_lock                             |      without F_DEFER flag
>>>   |                                    |      |
>>>   +switchdev_port_attr_set_deferred    |      +switchdev_port_obj_notify
>>>     |                                  |        |
>>>     +dsa_port_mrouter                  |        +dsa_user_port_obj_a/d
>>>       |                                |          |
>>>       +mv88e6xxx_port_mrouter----------+          +mv88e6xxx_port_obj_a/d
>>>                                          |
>>>   +--------------------------------------+
>>>   |
>>> rtnl_unlock
>>
>> At a _very_ superficial glance, I don't think you are properly
>> accounting for the fact that even with rtnl_lock() held, there are still
>> SWITCHDEV_OBJ_ID_PORT_MDB events which may be pending on the switchdev
>> chain. Without a switchdev_deferred_process() flush call, you won't be
>> getting rid of them, so when you rtnl_unlock(), they will still run.
>>
>> Even worse, holding rtnl_lock() will not stop the bridge multicast layer
>> from modifying its br->mdb_list; only br->multicast_lock will.
>>
>> So you may be better off also acquiring br->multicast_lock, and
>> notifying the MDB entries to the switchdev chain _with_the F_DEFER flag.
> 
> Like this?
> 
>                                        +br_iterator(dsa_cb)
>                                        |  lock br->multicask_lock
>                                        |  for each mdb
>                                        |    br_switchdev_mdb_notify
>  rtnl_lock                             |      |
>   |                                    |      +switchdev_port_obj_._defer
>   +switchdev_port_attr_set_deferred    |  unlock br->multicast_lock
>     |                                  |
>     +dsa_port_mrouter                  |
>       |                                |
>       +mv88e6xxx_port_mrouter----------+
>                                          |
>   +--------------------------------------+
>   |
> rtnl_unlock
> 
> (potential task change)
> 
> rtnl_lock
>   |
>   +switchdev_deferred_process
>   | flush all queued dfitems in queuing order
>   |
> rtnl_unlock
> 
> I'm not that familiar with the bridge code, but is there any concern 
> with potential deadlock here (between rtnl_lock and br->multicast_lock)?

Hi Nik,

Do you know if it's safe to acquire rtnl_lock and br->multicast_lock in 
the following sequence? Is there any potential possibility for a deadlock?

rtnl_lock
spin_lock(br->multicast_lock)
spin_unlock(br->multicast_lock)
rtnl_unlock

If the operation is safe, the next question is should it be spin_lock or 
spin_lock_bh?

Thanks,
Joseph

Patch

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index f66ddde484dc..32a613c965b1 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -47,6 +47,14 @@  struct mv88e6xxx_bridge {
 	struct list_head head;
 	struct net_device *br_dev;
 	u16 ports;
+	struct list_head br_mdb_list;
+};
+
+struct mv88e6xxx_br_mdb {
+	struct list_head head;
+	unsigned char addr[ETH_ALEN];
+	u16 vid;
+	u16 portvec;
 };
 
 static void assert_reg_lock(struct mv88e6xxx_chip *chip)
@@ -2974,6 +2982,7 @@  mv88e6xxx_bridge_create(struct mv88e6xxx_chip *chip, struct net_device *br_dev)
 		return ERR_PTR(-ENOMEM);
 
 	mv_bridge->br_dev = br_dev;
+	INIT_LIST_HEAD(&mv_bridge->br_mdb_list);
 	list_add(&mv_bridge->head, &chip->bridge_list);
 
 	return mv_bridge;
@@ -2984,6 +2993,7 @@  static void mv88e6xxx_bridge_destroy(struct mv88e6xxx_bridge *mv_bridge)
 	list_del(&mv_bridge->head);
 
 	WARN_ON(mv_bridge->ports);
+	WARN_ON(!list_empty(&mv_bridge->br_mdb_list));
 	kfree(mv_bridge);
 }
 
@@ -6583,16 +6593,101 @@  static int mv88e6xxx_change_tag_protocol(struct dsa_switch *ds,
 	return err;
 }
 
+static struct mv88e6xxx_br_mdb *
+mv88e6xxx_br_mdb_create(struct mv88e6xxx_bridge *mv_bridge,
+			const struct switchdev_obj_port_mdb *mdb)
+{
+	struct mv88e6xxx_br_mdb *mv_br_mdb;
+
+	mv_br_mdb = kzalloc(sizeof(*mv_br_mdb), GFP_KERNEL);
+	if (!mv_br_mdb)
+		return ERR_PTR(-ENOMEM);
+
+	ether_addr_copy(mv_br_mdb->addr, mdb->addr);
+	mv_br_mdb->vid = mdb->vid;
+	list_add(&mv_br_mdb->head, &mv_bridge->br_mdb_list);
+
+	return mv_br_mdb;
+}
+
+static void mv88e6xxx_br_mdb_destroy(struct mv88e6xxx_br_mdb *mv_br_mdb)
+{
+	list_del(&mv_br_mdb->head);
+
+	WARN_ON(mv_br_mdb->portvec);
+	kfree(mv_br_mdb);
+}
+
+static struct mv88e6xxx_br_mdb *
+mv88e6xxx_br_mdb_find(struct mv88e6xxx_bridge *mv_bridge,
+		      const struct switchdev_obj_port_mdb *mdb)
+{
+	struct mv88e6xxx_br_mdb *mv_br_mdb;
+
+	list_for_each_entry(mv_br_mdb, &mv_bridge->br_mdb_list, head)
+		if (ether_addr_equal(mv_br_mdb->addr, mdb->addr) &&
+		    mv_br_mdb->vid == mdb->vid)
+			return mv_br_mdb;
+
+	return NULL;
+}
+
+static struct mv88e6xxx_br_mdb *
+mv88e6xxx_br_mdb_get(struct mv88e6xxx_bridge *mv_bridge,
+		     const struct switchdev_obj_port_mdb *mdb)
+{
+	struct mv88e6xxx_br_mdb *mv_br_mdb;
+
+	mv_br_mdb = mv88e6xxx_br_mdb_find(mv_bridge, mdb);
+	if (!mv_br_mdb)
+		mv_br_mdb = mv88e6xxx_br_mdb_create(mv_bridge, mdb);
+
+	return mv_br_mdb;
+}
+
+static void mv88e6xxx_br_mdb_put(struct mv88e6xxx_br_mdb *mv_br_mdb)
+{
+	if (!mv_br_mdb->portvec)
+		mv88e6xxx_br_mdb_destroy(mv_br_mdb);
+}
+
 static int mv88e6xxx_port_mdb_add(struct dsa_switch *ds, int port,
 				  const struct switchdev_obj_port_mdb *mdb,
 				  struct dsa_db db)
 {
 	struct mv88e6xxx_chip *chip = ds->priv;
+	struct mv88e6xxx_bridge *mv_bridge;
+	struct mv88e6xxx_br_mdb *mv_br_mdb;
+	struct net_device *orig_dev;
+	struct net_device *br_dev;
 	int err;
 
+	orig_dev = mdb->obj.orig_dev;
+	br_dev = netdev_master_upper_dev_get(orig_dev);
+	if (!br_dev)
+		br_dev = orig_dev;
+
+	mv_bridge = mv88e6xxx_bridge_by_dev(chip, br_dev);
+	if (!mv_bridge)
+		return -EINVAL;
+
+	mv_br_mdb = mv88e6xxx_br_mdb_get(mv_bridge, mdb);
+	if (IS_ERR(mv_br_mdb))
+		return PTR_ERR(mv_br_mdb);
+
+	if (mv_br_mdb->portvec & BIT(port))
+		return -EEXIST;
+
 	mv88e6xxx_reg_lock(chip);
 	err = mv88e6xxx_port_db_load_purge(chip, port, mdb->addr, mdb->vid,
 					   MV88E6XXX_G1_ATU_DATA_STATE_MC_STATIC);
+
+	if (err)
+		goto out;
+
+	mv_br_mdb->portvec |= BIT(port);
+
+out:
 	mv88e6xxx_reg_unlock(chip);
 
 	return err;
@@ -6603,10 +6698,32 @@  static int mv88e6xxx_port_mdb_del(struct dsa_switch *ds, int port,
 				  struct dsa_db db)
 {
 	struct mv88e6xxx_chip *chip = ds->priv;
+	struct mv88e6xxx_bridge *mv_bridge;
+	struct mv88e6xxx_br_mdb *mv_br_mdb;
+	struct net_device *orig_dev;
+	struct net_device *br_dev;
 	int err;
 
+	orig_dev = mdb->obj.orig_dev;
+	br_dev = netdev_master_upper_dev_get(orig_dev);
+	if (!br_dev)
+		br_dev = orig_dev;
+
+	mv_bridge = mv88e6xxx_bridge_by_dev(chip, br_dev);
+	if (!mv_bridge)
+		return -EINVAL;
+
+	mv_br_mdb = mv88e6xxx_br_mdb_find(mv_bridge, mdb);
+	if (!mv_br_mdb)
+		return -ENOENT;
+
+	if (!(mv_br_mdb->portvec & BIT(port)))
+		return -ENOENT;
+
 	mv88e6xxx_reg_lock(chip);
 	err = mv88e6xxx_port_db_load_purge(chip, port, mdb->addr, mdb->vid, 0);
+	mv_br_mdb->portvec &= ~BIT(port);
+	mv88e6xxx_br_mdb_put(mv_br_mdb);
 	mv88e6xxx_reg_unlock(chip);
 
 	return err;