diff mbox series

[v3,net-next] net: dsa: reference count the host mdb addresses

Message ID 20201212203901.351331-1-vladimir.oltean@nxp.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [v3,net-next] net: dsa: reference count the host mdb addresses | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net-next
netdev/subject_prefix success Link
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 20 this patch: 20
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch warning WARNING: line length of 81 exceeds 80 columns
netdev/build_allmodconfig_warn success Errors and warnings before: 20 this patch: 20
netdev/header_inline success Link
netdev/stable success Stable not CCed

Commit Message

Vladimir Oltean Dec. 12, 2020, 8:39 p.m. UTC
Currently any DSA switch that is strict when implementing the mdb
operations prints these benign errors after the addresses expire, with
at least 2 ports bridged:

[  286.013814] mscc_felix 0000:00:00.5 swp3: failed (err=-2) to del object (id=3)

The reason has to do with this piece of code:

	netdev_for_each_lower_dev(dev, lower_dev, iter)
		br_mdb_switchdev_host_port(dev, lower_dev, mp, type);

called from:

br_multicast_group_expired
-> br_multicast_host_leave
   -> br_mdb_notify
      -> br_mdb_switchdev_host

Basically, that code is correct. It tells each switchdev port that the
host can leave that multicast group. But in the case of DSA, all user
ports are connected to the host through the same pipe. So, because DSA
translates a host MDB to a normal MDB on the CPU port, this means that
when all user ports leave a multicast group, DSA tries to remove it N
times from the CPU port.

We should be reference-counting these addresses. Otherwise, the first
port on which the MDB expires will cause an entry removal from the CPU
port, which will break the host MDB for the remaining ports.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
Changes in v3:
- Allocating memory for host mdb in prepare phase, but setting refcount
  to 1 in commit phase. This complicates the implementation of the state
  machine a little bit.

Changes in v2:
- Re-targeted against net-next, since this is not breaking any use case
  that I know of.
- Re-did the refcounting logic. The problem is that the MDB addition is
  two-phase, but the deletion is one-phase. So refcounting on addition
  needs to be done only on one phase - the commit one. Before, we had a
  problem there, and for host MDB additions where an entry already existed
  on the CPU port, we would call the prepare phase but never commit.
  This would break drivers that allocate memory on prepare, and then
  expect the commit phase to actually apply. So we're not doing this any
  longer. Both prepare and commit phases are now stubbed out for additions
  of host MDB entries that are already present on the CPU port.
- Renamed dsa_host_mdb_find into dsa_host_addr_find, and we're now
  passing it the host_mdb list rather than struct dsa_switch *ds. This
  is a generic function and we might be able to reuse it in the future
  for host FDB entries (such as slave net_device MAC addresses).
- Left the allocation as GFP_KERNEL, since that is fine - the switchdev
  notifier runs as deferred, therefore in process context.

 include/net/dsa.h |   9 ++++
 net/dsa/dsa2.c    |   2 +
 net/dsa/slave.c   | 128 ++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 130 insertions(+), 9 deletions(-)

Comments

Andrew Lunn Dec. 12, 2020, 10:06 p.m. UTC | #1
> +	/* Complication created by the fact that addition has two phases, but
> +	 * deletion only has one phase, and we need reference counting.
> +	 * The strategy is to do the memory allocation in the prepare phase,
> +	 * but initialize the refcount in the commit phase.
> +	 *
> +	 * Have mdb	| mdb has refcount > 0	| Commit phase	| Resolution
> +	 * -------------+-----------------------+---------------+---------------
> +	 * no		| -			| no		| Alloc & proceed

This does not look right.

The point of the prepare phase is to allow all the different layers
involved to allocate whatever they need and to validate they can do
the requested action. Any layer can say, No, stop, i cannot do
this. The commit phase will then not happen. But that also means the
prepare phase should not do any state changes. So you should not be
proceeding here, just allocating.

And you need some way to cleanup the allocated memory when the commit
never happens because some other layer has said No!

     Andrew
Vladimir Oltean Dec. 12, 2020, 10:18 p.m. UTC | #2
On Sat, Dec 12, 2020 at 11:06:41PM +0100, Andrew Lunn wrote:
> > +	/* Complication created by the fact that addition has two phases, but
> > +	 * deletion only has one phase, and we need reference counting.
> > +	 * The strategy is to do the memory allocation in the prepare phase,
> > +	 * but initialize the refcount in the commit phase.
> > +	 *
> > +	 * Have mdb	| mdb has refcount > 0	| Commit phase	| Resolution
> > +	 * -------------+-----------------------+---------------+---------------
> > +	 * no		| -			| no		| Alloc & proceed
>
> This does not look right.
>
> The point of the prepare phase is to allow all the different layers
> involved to allocate whatever they need and to validate they can do
> the requested action. Any layer can say, No, stop, i cannot do
> this. The commit phase will then not happen. But that also means the
> prepare phase should not do any state changes. So you should not be
> proceeding here, just allocating.

Are you commenting based on the code, or just based on the comment?
If just based on the comment, then yeah, sorry. I was limited to 80
characters, and I couldn't specify "proceed to what". It's just "proceed
to call the prepare phase of the driver". Which is... normal and
expected, and does not contradict what you said above.

> And you need some way to cleanup the allocated memory when the commit
> never happens because some other layer has said No!

So this would be a fatal problem with the switchdev transactional model
if I am not misunderstanding it. On one hand there's this nice, bubbly
idea that you should preallocate memory in the prepare phase, so that
there's one reason less to fail at commit time. But on the other hand,
if "the commit phase might never happen" is even a remove possibility,
all of that goes to trash - how are you even supposed to free the
preallocated memory.

Sorry, I don't think that there's any possibility for the commit phase
to not happen as long as everybody is in agreement that the preparation
phase went ok. If you look at the code, I even allocated the memory at
preparation time _before_ calling into the driver, to ensure that we're
not giving the driver the false impression that it gave switchdev the
green light but the commit never came. If our allocation in the DSA core
fails during the prepare phase, the prepare phase of the driver is not
even called.

That being said, please let me know if you spot bugs in the actual code.
I tested it and it appeared to work ok (I also put debugging prints to
make sure that the refcounting works ok and the entries are removed
after all of them expire).
Andrew Lunn Dec. 13, 2020, 12:08 a.m. UTC | #3
On Sat, Dec 12, 2020 at 10:18:59PM +0000, Vladimir Oltean wrote:
> On Sat, Dec 12, 2020 at 11:06:41PM +0100, Andrew Lunn wrote:
> > > +	/* Complication created by the fact that addition has two phases, but
> > > +	 * deletion only has one phase, and we need reference counting.
> > > +	 * The strategy is to do the memory allocation in the prepare phase,
> > > +	 * but initialize the refcount in the commit phase.
> > > +	 *
> > > +	 * Have mdb	| mdb has refcount > 0	| Commit phase	| Resolution
> > > +	 * -------------+-----------------------+---------------+---------------
> > > +	 * no		| -			| no		| Alloc & proceed
> >
> > This does not look right.
> >
> > The point of the prepare phase is to allow all the different layers
> > involved to allocate whatever they need and to validate they can do
> > the requested action. Any layer can say, No, stop, i cannot do
> > this. The commit phase will then not happen. But that also means the
> > prepare phase should not do any state changes. So you should not be
> > proceeding here, just allocating.
> 
> Are you commenting based on the code, or just based on the comment?
> If just based on the comment, then yeah, sorry. I was limited to 80
> characters, and I couldn't specify "proceed to what". It's just "proceed
> to call the prepare phase of the driver"

Ah, O.K.

> > And you need some way to cleanup the allocated memory when the commit
> > never happens because some other layer has said No!
> 
> So this would be a fatal problem with the switchdev transactional model
> if I am not misunderstanding it. On one hand there's this nice, bubbly
> idea that you should preallocate memory in the prepare phase, so that
> there's one reason less to fail at commit time. But on the other hand,
> if "the commit phase might never happen" is even a remove possibility,
> all of that goes to trash - how are you even supposed to free the
> preallocated memory.

It can definitely happen, that commit is never called:

static int switchdev_port_obj_add_now(struct net_device *dev,
                                      const struct switchdev_obj *obj,
                                      struct netlink_ext_ack *extack)
{

       /* Phase I: prepare for obj add. Driver/device should fail
         * here if there are going to be issues in the commit phase,
         * such as lack of resources or support.  The driver/device
         * should reserve resources needed for the commit phase here,
         * but should not commit the obj.
         */

        trans.ph_prepare = true;
        err = switchdev_port_obj_notify(SWITCHDEV_PORT_OBJ_ADD,
                                        dev, obj, &trans, extack);
        if (err)
                return err;

        /* Phase II: commit obj add.  This cannot fail as a fault
         * of driver/device.  If it does, it's a bug in the driver/device
         * because the driver said everythings was OK in phase I.
         */

        trans.ph_prepare = false;
        err = switchdev_port_obj_notify(SWITCHDEV_PORT_OBJ_ADD,
                                        dev, obj, &trans, extack);
        WARN(err, "%s: Commit of object (id=%d) failed.\n", dev->name, obj->id);

        return err;

So if any notifier returns an error during prepare, the commit is
never called.

So the memory you allocated and added to the list may never get
used. Its refcount stays zero.  Which is why i suggested making the
MDB remove call do a general garbage collect. It is not perfect, the
cleanup could be deferred a long time, but is should get removed
eventually.

	Andrew
Vladimir Oltean Dec. 13, 2020, 12:14 a.m. UTC | #4
On Sun, Dec 13, 2020 at 01:08:55AM +0100, Andrew Lunn wrote:
> > > And you need some way to cleanup the allocated memory when the commit
> > > never happens because some other layer has said No!
> >
> > So this would be a fatal problem with the switchdev transactional model
> > if I am not misunderstanding it. On one hand there's this nice, bubbly
> > idea that you should preallocate memory in the prepare phase, so that
> > there's one reason less to fail at commit time. But on the other hand,
> > if "the commit phase might never happen" is even a remove possibility,
> > all of that goes to trash - how are you even supposed to free the
> > preallocated memory.
>
> It can definitely happen, that commit is never called:
>
> static int switchdev_port_obj_add_now(struct net_device *dev,
>                                       const struct switchdev_obj *obj,
>                                       struct netlink_ext_ack *extack)
> {
>
>        /* Phase I: prepare for obj add. Driver/device should fail
>          * here if there are going to be issues in the commit phase,
>          * such as lack of resources or support.  The driver/device
>          * should reserve resources needed for the commit phase here,
>          * but should not commit the obj.
>          */
>
>         trans.ph_prepare = true;
>         err = switchdev_port_obj_notify(SWITCHDEV_PORT_OBJ_ADD,
>                                         dev, obj, &trans, extack);
>         if (err)
>                 return err;
>
>         /* Phase II: commit obj add.  This cannot fail as a fault
>          * of driver/device.  If it does, it's a bug in the driver/device
>          * because the driver said everythings was OK in phase I.
>          */
>
>         trans.ph_prepare = false;
>         err = switchdev_port_obj_notify(SWITCHDEV_PORT_OBJ_ADD,
>                                         dev, obj, &trans, extack);
>         WARN(err, "%s: Commit of object (id=%d) failed.\n", dev->name, obj->id);
>
>         return err;
>
> So if any notifier returns an error during prepare, the commit is
> never called.
>
> So the memory you allocated and added to the list may never get
> used. Its refcount stays zero.  Which is why i suggested making the
> MDB remove call do a general garbage collect. It is not perfect, the
> cleanup could be deferred a long time, but is should get removed
> eventually.

What would the garbage collection look like?
Andrew Lunn Dec. 13, 2020, 12:34 a.m. UTC | #5
On Sun, Dec 13, 2020 at 12:14:19AM +0000, Vladimir Oltean wrote:
> On Sun, Dec 13, 2020 at 01:08:55AM +0100, Andrew Lunn wrote:
> > > > And you need some way to cleanup the allocated memory when the commit
> > > > never happens because some other layer has said No!
> > >
> > > So this would be a fatal problem with the switchdev transactional model
> > > if I am not misunderstanding it. On one hand there's this nice, bubbly
> > > idea that you should preallocate memory in the prepare phase, so that
> > > there's one reason less to fail at commit time. But on the other hand,
> > > if "the commit phase might never happen" is even a remove possibility,
> > > all of that goes to trash - how are you even supposed to free the
> > > preallocated memory.
> >
> > It can definitely happen, that commit is never called:
> >
> > static int switchdev_port_obj_add_now(struct net_device *dev,
> >                                       const struct switchdev_obj *obj,
> >                                       struct netlink_ext_ack *extack)
> > {
> >
> >        /* Phase I: prepare for obj add. Driver/device should fail
> >          * here if there are going to be issues in the commit phase,
> >          * such as lack of resources or support.  The driver/device
> >          * should reserve resources needed for the commit phase here,
> >          * but should not commit the obj.
> >          */
> >
> >         trans.ph_prepare = true;
> >         err = switchdev_port_obj_notify(SWITCHDEV_PORT_OBJ_ADD,
> >                                         dev, obj, &trans, extack);
> >         if (err)
> >                 return err;
> >
> >         /* Phase II: commit obj add.  This cannot fail as a fault
> >          * of driver/device.  If it does, it's a bug in the driver/device
> >          * because the driver said everythings was OK in phase I.
> >          */
> >
> >         trans.ph_prepare = false;
> >         err = switchdev_port_obj_notify(SWITCHDEV_PORT_OBJ_ADD,
> >                                         dev, obj, &trans, extack);
> >         WARN(err, "%s: Commit of object (id=%d) failed.\n", dev->name, obj->id);
> >
> >         return err;
> >
> > So if any notifier returns an error during prepare, the commit is
> > never called.
> >
> > So the memory you allocated and added to the list may never get
> > used. Its refcount stays zero.  Which is why i suggested making the
> > MDB remove call do a general garbage collect. It is not perfect, the
> > cleanup could be deferred a long time, but is should get removed
> > eventually.
> 
> What would the garbage collection look like?

        struct dsa_host_addr *a;

        list_for_each_entry_safe(a, addr_list, list)
		if (refcount_read(&a->refcount) == 0) {
			list_del(&a->list);
			free(a);
		}
	}

	Andrew
Vladimir Oltean Dec. 13, 2020, 12:49 a.m. UTC | #6
On Sun, Dec 13, 2020 at 01:34:10AM +0100, Andrew Lunn wrote:
> On Sun, Dec 13, 2020 at 12:14:19AM +0000, Vladimir Oltean wrote:
> > On Sun, Dec 13, 2020 at 01:08:55AM +0100, Andrew Lunn wrote:
> > > > > And you need some way to cleanup the allocated memory when the commit
> > > > > never happens because some other layer has said No!
> > > >
> > > > So this would be a fatal problem with the switchdev transactional model
> > > > if I am not misunderstanding it. On one hand there's this nice, bubbly
> > > > idea that you should preallocate memory in the prepare phase, so that
> > > > there's one reason less to fail at commit time. But on the other hand,
> > > > if "the commit phase might never happen" is even a remove possibility,
> > > > all of that goes to trash - how are you even supposed to free the
> > > > preallocated memory.
> > >
> > > It can definitely happen, that commit is never called:
> > >
> > > static int switchdev_port_obj_add_now(struct net_device *dev,
> > >                                       const struct switchdev_obj *obj,
> > >                                       struct netlink_ext_ack *extack)
> > > {
> > >
> > >        /* Phase I: prepare for obj add. Driver/device should fail
> > >          * here if there are going to be issues in the commit phase,
> > >          * such as lack of resources or support.  The driver/device
> > >          * should reserve resources needed for the commit phase here,
> > >          * but should not commit the obj.
> > >          */
> > >
> > >         trans.ph_prepare = true;
> > >         err = switchdev_port_obj_notify(SWITCHDEV_PORT_OBJ_ADD,
> > >                                         dev, obj, &trans, extack);
> > >         if (err)
> > >                 return err;
> > >
> > >         /* Phase II: commit obj add.  This cannot fail as a fault
> > >          * of driver/device.  If it does, it's a bug in the driver/device
> > >          * because the driver said everythings was OK in phase I.
> > >          */
> > >
> > >         trans.ph_prepare = false;
> > >         err = switchdev_port_obj_notify(SWITCHDEV_PORT_OBJ_ADD,
> > >                                         dev, obj, &trans, extack);
> > >         WARN(err, "%s: Commit of object (id=%d) failed.\n", dev->name, obj->id);
> > >
> > >         return err;
> > >
> > > So if any notifier returns an error during prepare, the commit is
> > > never called.
> > >
> > > So the memory you allocated and added to the list may never get
> > > used. Its refcount stays zero.  Which is why i suggested making the
> > > MDB remove call do a general garbage collect. It is not perfect, the
> > > cleanup could be deferred a long time, but is should get removed
> > > eventually.
> > 
> > What would the garbage collection look like?
> 
>         struct dsa_host_addr *a;
> 
>         list_for_each_entry_safe(a, addr_list, list)
> 		if (refcount_read(&a->refcount) == 0) {
> 			list_del(&a->list);
> 			free(a);
> 		}
> 	}

Sorry, this seems a bit absurd. The code is already absurdly complicated
as is. I don't want to go against the current and add more unjustified
nonsense instead of taking a step back, which I should have done earlier.
I thought this transactional API was supposed to help. Though I scanned
the kernel tree a bit and I fail to understand whom it helps exactly.
What I see is that the whole 'transaction' spans only the length of the
switchdev_port_attr_set_now function.

Am I right to say that there is no in-kernel code that depends upon the
switchdev transaction model right now, as it's completely hidden away
from callers? As in, we could just squash the two switchdev_port_attr_notify
calls into one and nothing would functionally change for the callers of
switchdev_port_attr_set?
Why don't we do just that? I might be completely blind, but I am getting
the feeling that this idea has not aged very well.

Florian, has anything happened in the meantime since this commit of yours?

commit 91cf8eceffc131d41f098351e1b290bdaab45ea7
Author: Florian Fainelli <f.fainelli@gmail.com>
Date:   Wed Feb 27 16:29:16 2019 -0800

    switchdev: Remove unused transaction item queue

    There are no more in tree users of the
    switchdev_trans_item_{dequeue,enqueue} or switchdev_trans_item structure
    in the kernel since commit 00fc0c51e35b ("rocker: Change world_ops API
    and implementation to be switchdev independant").

    Remove this unused code and update the documentation accordingly since.

    Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
    Acked-by: Jiri Pirko <jiri@mellanox.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

There isn't an API to hold this stuff any longer. So let's go back to
the implementation from v2, with memory allocation in the commit phase.
The way forward anyway is probably not to add a garbage collector in
DSA, but to fold the prepare and commit phases into one.
Florian Fainelli Dec. 13, 2020, 1:42 a.m. UTC | #7
On 12/12/2020 4:49 PM, Vladimir Oltean wrote:
> On Sun, Dec 13, 2020 at 01:34:10AM +0100, Andrew Lunn wrote:
>> On Sun, Dec 13, 2020 at 12:14:19AM +0000, Vladimir Oltean wrote:
>>> On Sun, Dec 13, 2020 at 01:08:55AM +0100, Andrew Lunn wrote:
>>>>>> And you need some way to cleanup the allocated memory when the commit
>>>>>> never happens because some other layer has said No!
>>>>>
>>>>> So this would be a fatal problem with the switchdev transactional model
>>>>> if I am not misunderstanding it. On one hand there's this nice, bubbly
>>>>> idea that you should preallocate memory in the prepare phase, so that
>>>>> there's one reason less to fail at commit time. But on the other hand,
>>>>> if "the commit phase might never happen" is even a remove possibility,
>>>>> all of that goes to trash - how are you even supposed to free the
>>>>> preallocated memory.
>>>>
>>>> It can definitely happen, that commit is never called:
>>>>
>>>> static int switchdev_port_obj_add_now(struct net_device *dev,
>>>>                                       const struct switchdev_obj *obj,
>>>>                                       struct netlink_ext_ack *extack)
>>>> {
>>>>
>>>>        /* Phase I: prepare for obj add. Driver/device should fail
>>>>          * here if there are going to be issues in the commit phase,
>>>>          * such as lack of resources or support.  The driver/device
>>>>          * should reserve resources needed for the commit phase here,
>>>>          * but should not commit the obj.
>>>>          */
>>>>
>>>>         trans.ph_prepare = true;
>>>>         err = switchdev_port_obj_notify(SWITCHDEV_PORT_OBJ_ADD,
>>>>                                         dev, obj, &trans, extack);
>>>>         if (err)
>>>>                 return err;
>>>>
>>>>         /* Phase II: commit obj add.  This cannot fail as a fault
>>>>          * of driver/device.  If it does, it's a bug in the driver/device
>>>>          * because the driver said everythings was OK in phase I.
>>>>          */
>>>>
>>>>         trans.ph_prepare = false;
>>>>         err = switchdev_port_obj_notify(SWITCHDEV_PORT_OBJ_ADD,
>>>>                                         dev, obj, &trans, extack);
>>>>         WARN(err, "%s: Commit of object (id=%d) failed.\n", dev->name, obj->id);
>>>>
>>>>         return err;
>>>>
>>>> So if any notifier returns an error during prepare, the commit is
>>>> never called.
>>>>
>>>> So the memory you allocated and added to the list may never get
>>>> used. Its refcount stays zero.  Which is why i suggested making the
>>>> MDB remove call do a general garbage collect. It is not perfect, the
>>>> cleanup could be deferred a long time, but is should get removed
>>>> eventually.
>>>
>>> What would the garbage collection look like?
>>
>>         struct dsa_host_addr *a;
>>
>>         list_for_each_entry_safe(a, addr_list, list)
>> 		if (refcount_read(&a->refcount) == 0) {
>> 			list_del(&a->list);
>> 			free(a);
>> 		}
>> 	}
> 
> Sorry, this seems a bit absurd. The code is already absurdly complicated
> as is. I don't want to go against the current and add more unjustified
> nonsense instead of taking a step back, which I should have done earlier.
> I thought this transactional API was supposed to help. Though I scanned
> the kernel tree a bit and I fail to understand whom it helps exactly.
> What I see is that the whole 'transaction' spans only the length of the
> switchdev_port_attr_set_now function.
> 
> Am I right to say that there is no in-kernel code that depends upon the
> switchdev transaction model right now, as it's completely hidden away
> from callers? As in, we could just squash the two switchdev_port_attr_notify
> calls into one and nothing would functionally change for the callers of
> switchdev_port_attr_set?
> Why don't we do just that? I might be completely blind, but I am getting
> the feeling that this idea has not aged very well.

IIRC that was the conclusion that Ido and I had reached as well way back
when doing the commit you cited below.

> 
> Florian, has anything happened in the meantime since this commit of yours?

This is where I stopped, mainly because the series that had motivated
this work was the one bringing management mode to bcm_sf2 and CPU RX
filtering had me wire up yet another switched attribute that drivers
like b53 wanted to veto (namely the disabling of IGMP snooping). We did
not agree on the approach to use switchdev for notifying drivers about
UC, MC lists down to drivers and so the series stalled.

IIRC Jiri and Ido were also keen on merging the switchdev with the
bridge code but I did not do that part, nor did I completely remove the
transaction model, but those were the next steps had I not been side
tracked with work on other topics.

> 
> commit 91cf8eceffc131d41f098351e1b290bdaab45ea7
> Author: Florian Fainelli <f.fainelli@gmail.com>
> Date:   Wed Feb 27 16:29:16 2019 -0800
> 
>     switchdev: Remove unused transaction item queue
> 
>     There are no more in tree users of the
>     switchdev_trans_item_{dequeue,enqueue} or switchdev_trans_item structure
>     in the kernel since commit 00fc0c51e35b ("rocker: Change world_ops API
>     and implementation to be switchdev independant").
> 
>     Remove this unused code and update the documentation accordingly since.
> 
>     Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
>     Acked-by: Jiri Pirko <jiri@mellanox.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> There isn't an API to hold this stuff any longer. So let's go back to
> the implementation from v2, with memory allocation in the commit phase.
> The way forward anyway is probably not to add a garbage collector in
> DSA, but to fold the prepare and commit phases into one.

Agreed.
diff mbox series

Patch

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 4e60d2610f20..e639db28e238 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -253,6 +253,13 @@  struct dsa_link {
 	struct list_head list;
 };
 
+struct dsa_host_addr {
+	unsigned char addr[ETH_ALEN];
+	u16 vid;
+	refcount_t refcount;
+	struct list_head list;
+};
+
 struct dsa_switch {
 	bool setup;
 
@@ -335,6 +342,8 @@  struct dsa_switch {
 	 */
 	bool			mtu_enforcement_ingress;
 
+	struct list_head	host_mdb;
+
 	size_t num_ports;
 };
 
diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 183003e45762..52b3ef34a2cb 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -413,6 +413,8 @@  static int dsa_switch_setup(struct dsa_switch *ds)
 	if (ds->setup)
 		return 0;
 
+	INIT_LIST_HEAD(&ds->host_mdb);
+
 	/* Initialize ds->phys_mii_mask before registering the slave MDIO bus
 	 * driver and before ops->setup() has run, since the switch drivers and
 	 * the slave MDIO bus driver rely on these values for probing PHY
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 4a0498bf6c65..e0667be7d5ed 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -376,6 +376,123 @@  static int dsa_slave_vlan_add(struct net_device *dev,
 	return 0;
 }
 
+static struct dsa_host_addr *
+dsa_host_addr_find(struct list_head *addr_list,
+		   const struct switchdev_obj_port_mdb *mdb)
+{
+	struct dsa_host_addr *a;
+
+	list_for_each_entry(a, addr_list, list)
+		if (ether_addr_equal(a->addr, mdb->addr) && a->vid == mdb->vid)
+			return a;
+
+	return NULL;
+}
+
+/* DSA can directly translate this to a normal MDB add, but on the CPU port.
+ * But because multiple user ports can join the same multicast group and the
+ * bridge will emit a notification for each port, we need to add/delete the
+ * entry towards the host only once, so we reference count it.
+ */
+static int dsa_host_mdb_add(struct dsa_port *dp,
+			    const struct switchdev_obj_port_mdb *mdb,
+			    struct switchdev_trans *trans)
+{
+	struct dsa_port *cpu_dp = dp->cpu_dp;
+	struct dsa_switch *ds = dp->ds;
+	struct dsa_host_addr *a;
+	int err;
+
+	/* Complication created by the fact that addition has two phases, but
+	 * deletion only has one phase, and we need reference counting.
+	 * The strategy is to do the memory allocation in the prepare phase,
+	 * but initialize the refcount in the commit phase.
+	 *
+	 * Have mdb	| mdb has refcount > 0	| Commit phase	| Resolution
+	 * -------------+-----------------------+---------------+---------------
+	 * no		| -			| no		| Alloc & proceed
+	 * no		| -			| yes		| Error
+	 * yes		| no			| no		| Error
+	 * yes		| no			| yes		| Proceed
+	 * yes		| yes			| no		| Ignore
+	 * yes		| yes			| yes		| Add refcount
+	 */
+	a = dsa_host_addr_find(&ds->host_mdb, mdb);
+	if (!a) {
+		if (WARN_ON(switchdev_trans_ph_commit(trans)))
+			return -EINVAL;
+
+		a = kzalloc(sizeof(*a), GFP_KERNEL);
+		if (!a)
+			return -ENOMEM;
+
+		err = dsa_port_mdb_add(cpu_dp, mdb, trans);
+		if (err) {
+			kfree(a);
+			return err;
+		}
+
+		ether_addr_copy(a->addr, mdb->addr);
+		a->vid = mdb->vid;
+		refcount_set(&a->refcount, 0);
+		list_add_tail(&a->list, &ds->host_mdb);
+
+		return 0;
+	}
+
+	/* If we are in the prepare phase, and a host mdb exists,
+	 * then ignore it. The refcount will be incremented during
+	 * commit, if propagation to hardware went well.
+	 */
+	if (switchdev_trans_ph_prepare(trans))
+		return 0;
+
+	/* In the commit phase, a host mdb might exist either left as
+	 * unfinished work by the previous prepare phase (this will have
+	 * refcount 0), or as a complete entry installed for another port.
+	 */
+	if (refcount_read(&a->refcount) > 0) {
+		refcount_inc(&a->refcount);
+		return 0;
+	}
+
+	err = dsa_port_mdb_add(cpu_dp, mdb, trans);
+	if (err) {
+		list_del(&a->list);
+		kfree(a);
+		return err;
+	}
+
+	refcount_set(&a->refcount, 1);
+
+	return 0;
+}
+
+static int dsa_host_mdb_del(struct dsa_port *dp,
+			    const struct switchdev_obj_port_mdb *mdb)
+{
+	struct dsa_port *cpu_dp = dp->cpu_dp;
+	struct dsa_switch *ds = dp->ds;
+	struct dsa_host_addr *a;
+	int err;
+
+	a = dsa_host_addr_find(&ds->host_mdb, mdb);
+	if (!a)
+		return -ENOENT;
+
+	if (!refcount_dec_and_test(&a->refcount))
+		return 0;
+
+	err = dsa_port_mdb_del(cpu_dp, mdb);
+	if (err)
+		return err;
+
+	list_del(&a->list);
+	kfree(a);
+
+	return 0;
+}
+
 static int dsa_slave_port_obj_add(struct net_device *dev,
 				  const struct switchdev_obj *obj,
 				  struct switchdev_trans *trans,
@@ -396,11 +513,7 @@  static int dsa_slave_port_obj_add(struct net_device *dev,
 		err = dsa_port_mdb_add(dp, SWITCHDEV_OBJ_PORT_MDB(obj), trans);
 		break;
 	case SWITCHDEV_OBJ_ID_HOST_MDB:
-		/* DSA can directly translate this to a normal MDB add,
-		 * but on the CPU port.
-		 */
-		err = dsa_port_mdb_add(dp->cpu_dp, SWITCHDEV_OBJ_PORT_MDB(obj),
-				       trans);
+		err = dsa_host_mdb_add(dp, SWITCHDEV_OBJ_PORT_MDB(obj), trans);
 		break;
 	case SWITCHDEV_OBJ_ID_PORT_VLAN:
 		err = dsa_slave_vlan_add(dev, obj, trans);
@@ -455,10 +568,7 @@  static int dsa_slave_port_obj_del(struct net_device *dev,
 		err = dsa_port_mdb_del(dp, SWITCHDEV_OBJ_PORT_MDB(obj));
 		break;
 	case SWITCHDEV_OBJ_ID_HOST_MDB:
-		/* DSA can directly translate this to a normal MDB add,
-		 * but on the CPU port.
-		 */
-		err = dsa_port_mdb_del(dp->cpu_dp, SWITCHDEV_OBJ_PORT_MDB(obj));
+		err = dsa_host_mdb_del(dp, SWITCHDEV_OBJ_PORT_MDB(obj));
 		break;
 	case SWITCHDEV_OBJ_ID_PORT_VLAN:
 		err = dsa_slave_vlan_del(dev, obj);