diff mbox series

[net] team: Fix ABBA deadlock caused by race in team_del_slave

Message ID 20240703145159.80128-1-aha310510@gmail.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [net] team: Fix ABBA deadlock caused by race in team_del_slave | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net
netdev/apply fail Patch does not apply to net-0

Commit Message

Jeongjun Park July 3, 2024, 2:51 p.m. UTC
CPU0                    CPU1
       ----                    ----
  lock(&rdev->wiphy.mtx);
                               lock(team->team_lock_key#4);
                               lock(&rdev->wiphy.mtx);
  lock(team->team_lock_key#4);

Deadlock occurs due to the above scenario. Therefore,
modify the code as shown in the patch below to prevent deadlock.

Regards,
Jeongjun Park.

Reported-and-tested-by: syzbot+705c61d60b091ef42c04@syzkaller.appspotmail.com
Fixes: 61dc3461b954 ("team: convert overall spinlock to mutex")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
---
 drivers/net/team/team_core.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

--

Comments

Michal Kubiak July 3, 2024, 3:18 p.m. UTC | #1
On Wed, Jul 03, 2024 at 11:51:59PM +0900, Jeongjun Park wrote:
>        CPU0                    CPU1
>        ----                    ----
>   lock(&rdev->wiphy.mtx);
>                                lock(team->team_lock_key#4);
>                                lock(&rdev->wiphy.mtx);
>   lock(team->team_lock_key#4);
> 
> Deadlock occurs due to the above scenario. Therefore,
> modify the code as shown in the patch below to prevent deadlock.
> 
> Regards,
> Jeongjun Park.

The commit message should contain the patch description only (without
salutations, etc.).

> 
> Reported-and-tested-by: syzbot+705c61d60b091ef42c04@syzkaller.appspotmail.com
> Fixes: 61dc3461b954 ("team: convert overall spinlock to mutex")
> Signed-off-by: Jeongjun Park <aha310510@gmail.com>
> ---
>  drivers/net/team/team_core.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/team/team_core.c b/drivers/net/team/team_core.c
> index ab1935a4aa2c..3ac82df876b0 100644
> --- a/drivers/net/team/team_core.c
> +++ b/drivers/net/team/team_core.c
> @@ -1970,11 +1970,12 @@ static int team_add_slave(struct net_device *dev, struct net_device *port_dev,
>                           struct netlink_ext_ack *extack)
>  {
>         struct team *team = netdev_priv(dev);
> -       int err;
> +       int err, locked;
>  
> -       mutex_lock(&team->lock);
> +       locked = mutex_trylock(&team->lock);
>         err = team_port_add(team, port_dev, extack);
> -       mutex_unlock(&team->lock);
> +       if (locked)
> +               mutex_unlock(&team->lock);

This is not correct usage of 'mutex_trylock()' API. In such a case you
could as well remove the lock completely from that part of code.
If "mutex_trylock()" returns false it means the mutex cannot be taken
(because it was already taken by other thread), so you should not modify
the resources that were expected to be protected by the mutex.
In other words, there is a risk of modifying resources using
"team_port_add()" by several threads at a time.

>  
>         if (!err)
>                 netdev_change_features(dev);
> @@ -1985,11 +1986,12 @@ static int team_add_slave(struct net_device *dev, struct net_device *port_dev,
>  static int team_del_slave(struct net_device *dev, struct net_device *port_dev)
>  {
>         struct team *team = netdev_priv(dev);
> -       int err;
> +       int err, locked;
>  
> -       mutex_lock(&team->lock);
> +       locked = mutex_trylock(&team->lock);
>         err = team_port_del(team, port_dev);
> -       mutex_unlock(&team->lock);
> +       if (locked)
> +               mutex_unlock(&team->lock);

The same story as in case of "team_add_slave()".

>  
>         if (err)
>                 return err;
> --
> 

The patch does not seem to be a correct solution to remove a deadlock.
Most probably a synchronization design needs an inspection.
If you really want to use "mutex_trylock()" API, please consider several
attempts of taking the mutex, but never modify the protected resources when
the mutex is not taken successfully.

Thanks,
Michal
Jeongjun Park July 3, 2024, 4:02 p.m. UTC | #2
>
> On Wed, Jul 03, 2024 at 11:51:59PM +0900, Jeongjun Park wrote:
> >        CPU0                    CPU1
> >        ----                    ----
> >   lock(&rdev->wiphy.mtx);
> >                                lock(team->team_lock_key#4);
> >                                lock(&rdev->wiphy.mtx);
> >   lock(team->team_lock_key#4);
> >
> > Deadlock occurs due to the above scenario. Therefore,
> > modify the code as shown in the patch below to prevent deadlock.
> >
> > Regards,
> > Jeongjun Park.
>
> The commit message should contain the patch description only (without
> salutations, etc.).
>
> >
> > Reported-and-tested-by: syzbot+705c61d60b091ef42c04@syzkaller.appspotmail.com
> > Fixes: 61dc3461b954 ("team: convert overall spinlock to mutex")
> > Signed-off-by: Jeongjun Park <aha310510@gmail.com>
> > ---
> >  drivers/net/team/team_core.c | 14 ++++++++------
> >  1 file changed, 8 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/net/team/team_core.c b/drivers/net/team/team_core.c
> > index ab1935a4aa2c..3ac82df876b0 100644
> > --- a/drivers/net/team/team_core.c
> > +++ b/drivers/net/team/team_core.c
> > @@ -1970,11 +1970,12 @@ static int team_add_slave(struct net_device *dev, struct net_device *port_dev,
> >                           struct netlink_ext_ack *extack)
> >  {
> >         struct team *team = netdev_priv(dev);
> > -       int err;
> > +       int err, locked;
> > 
> > -       mutex_lock(&team->lock);
> > +       locked = mutex_trylock(&team->lock);
> >         err = team_port_add(team, port_dev, extack);
> > -       mutex_unlock(&team->lock);
> > +       if (locked)
> > +               mutex_unlock(&team->lock);
>
> This is not correct usage of 'mutex_trylock()' API. In such a case you
> could as well remove the lock completely from that part of code.
> If "mutex_trylock()" returns false it means the mutex cannot be taken
> (because it was already taken by other thread), so you should not modify
> the resources that were expected to be protected by the mutex.
> In other words, there is a risk of modifying resources using
> "team_port_add()" by several threads at a time.
>
> > 
> >         if (!err)
> >                 netdev_change_features(dev);
> > @@ -1985,11 +1986,12 @@ static int team_add_slave(struct net_device *dev, struct net_device *port_dev,
> >  static int team_del_slave(struct net_device *dev, struct net_device *port_dev)
> >  {
> >         struct team *team = netdev_priv(dev);
> > -       int err;
> > +       int err, locked;
> > 
> > -       mutex_lock(&team->lock);
> > +       locked = mutex_trylock(&team->lock);
> >         err = team_port_del(team, port_dev);
> > -       mutex_unlock(&team->lock);
> > +       if (locked)
> > +               mutex_unlock(&team->lock);
>
> The same story as in case of "team_add_slave()".
>
> > 
> >         if (err)
> >                 return err;
> > --
> >
>
> The patch does not seem to be a correct solution to remove a deadlock.
> Most probably a synchronization design needs an inspection.
> If you really want to use "mutex_trylock()" API, please consider several
> attempts of taking the mutex, but never modify the protected resources when
> the mutex is not taken successfully.
>

Thanks for your comment. I rewrote the patch based on those comments. 
This time, we modified it to return an error so that resources are not 
modified when a race situation occurs. We would appreciate your 
feedback on what this patch would be like.

> Thanks,
> Michal
>
>

Regards,
Jeongjun Park

---
 drivers/net/team/team_core.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/team/team_core.c b/drivers/net/team/team_core.c
index ab1935a4aa2c..43d7c73b25aa 100644
--- a/drivers/net/team/team_core.c
+++ b/drivers/net/team/team_core.c
@@ -1972,7 +1972,8 @@ static int team_add_slave(struct net_device *dev, struct net_device *port_dev,
        struct team *team = netdev_priv(dev);
        int err;
 
-       mutex_lock(&team->lock);
+       if (!mutex_trylock(&team->lock))
+               return -EBUSY;
        err = team_port_add(team, port_dev, extack);
        mutex_unlock(&team->lock);
 
@@ -1987,7 +1988,8 @@ static int team_del_slave(struct net_device *dev, struct net_device *port_dev)
        struct team *team = netdev_priv(dev);
        int err;
 
-       mutex_lock(&team->lock);
+       if (!mutex_trylock(&team->lock))
+               return -EBUSY;
        err = team_port_del(team, port_dev);
        mutex_unlock(&team->lock);
 
--
Eric Dumazet July 3, 2024, 4:30 p.m. UTC | #3
On Wed, Jul 3, 2024 at 6:02 PM Jeongjun Park <aha310510@gmail.com> wrote:
>
> >
> > On Wed, Jul 03, 2024 at 11:51:59PM +0900, Jeongjun Park wrote:
> > >        CPU0                    CPU1
> > >        ----                    ----
> > >   lock(&rdev->wiphy.mtx);
> > >                                lock(team->team_lock_key#4);
> > >                                lock(&rdev->wiphy.mtx);
> > >   lock(team->team_lock_key#4);
> > >
> > > Deadlock occurs due to the above scenario. Therefore,
> > > modify the code as shown in the patch below to prevent deadlock.
> > >
> > > Regards,
> > > Jeongjun Park.
> >
> > The commit message should contain the patch description only (without
> > salutations, etc.).
> >
> > >
> > > Reported-and-tested-by: syzbot+705c61d60b091ef42c04@syzkaller.appspotmail.com
> > > Fixes: 61dc3461b954 ("team: convert overall spinlock to mutex")
> > > Signed-off-by: Jeongjun Park <aha310510@gmail.com>
> > > ---
> > >  drivers/net/team/team_core.c | 14 ++++++++------
> > >  1 file changed, 8 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/drivers/net/team/team_core.c b/drivers/net/team/team_core.c
> > > index ab1935a4aa2c..3ac82df876b0 100644
> > > --- a/drivers/net/team/team_core.c
> > > +++ b/drivers/net/team/team_core.c
> > > @@ -1970,11 +1970,12 @@ static int team_add_slave(struct net_device *dev, struct net_device *port_dev,
> > >                           struct netlink_ext_ack *extack)
> > >  {
> > >         struct team *team = netdev_priv(dev);
> > > -       int err;
> > > +       int err, locked;
> > >
> > > -       mutex_lock(&team->lock);
> > > +       locked = mutex_trylock(&team->lock);
> > >         err = team_port_add(team, port_dev, extack);
> > > -       mutex_unlock(&team->lock);
> > > +       if (locked)
> > > +               mutex_unlock(&team->lock);
> >
> > This is not correct usage of 'mutex_trylock()' API. In such a case you
> > could as well remove the lock completely from that part of code.
> > If "mutex_trylock()" returns false it means the mutex cannot be taken
> > (because it was already taken by other thread), so you should not modify
> > the resources that were expected to be protected by the mutex.
> > In other words, there is a risk of modifying resources using
> > "team_port_add()" by several threads at a time.
> >
> > >
> > >         if (!err)
> > >                 netdev_change_features(dev);
> > > @@ -1985,11 +1986,12 @@ static int team_add_slave(struct net_device *dev, struct net_device *port_dev,
> > >  static int team_del_slave(struct net_device *dev, struct net_device *port_dev)
> > >  {
> > >         struct team *team = netdev_priv(dev);
> > > -       int err;
> > > +       int err, locked;
> > >
> > > -       mutex_lock(&team->lock);
> > > +       locked = mutex_trylock(&team->lock);
> > >         err = team_port_del(team, port_dev);
> > > -       mutex_unlock(&team->lock);
> > > +       if (locked)
> > > +               mutex_unlock(&team->lock);
> >
> > The same story as in case of "team_add_slave()".
> >
> > >
> > >         if (err)
> > >                 return err;
> > > --
> > >
> >
> > The patch does not seem to be a correct solution to remove a deadlock.
> > Most probably a synchronization design needs an inspection.
> > If you really want to use "mutex_trylock()" API, please consider several
> > attempts of taking the mutex, but never modify the protected resources when
> > the mutex is not taken successfully.
> >
>
> Thanks for your comment. I rewrote the patch based on those comments.
> This time, we modified it to return an error so that resources are not
> modified when a race situation occurs. We would appreciate your
> feedback on what this patch would be like.
>
> > Thanks,
> > Michal
> >
> >
>
> Regards,
> Jeongjun Park
>
> ---
>  drivers/net/team/team_core.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/team/team_core.c b/drivers/net/team/team_core.c
> index ab1935a4aa2c..43d7c73b25aa 100644
> --- a/drivers/net/team/team_core.c
> +++ b/drivers/net/team/team_core.c
> @@ -1972,7 +1972,8 @@ static int team_add_slave(struct net_device *dev, struct net_device *port_dev,
>         struct team *team = netdev_priv(dev);
>         int err;
>
> -       mutex_lock(&team->lock);
> +       if (!mutex_trylock(&team->lock))
> +               return -EBUSY;
>         err = team_port_add(team, port_dev, extack);
>         mutex_unlock(&team->lock);
>
> @@ -1987,7 +1988,8 @@ static int team_del_slave(struct net_device *dev, struct net_device *port_dev)
>         struct team *team = netdev_priv(dev);
>         int err;
>
> -       mutex_lock(&team->lock);
> +       if (!mutex_trylock(&team->lock))
> +               return -EBUSY;
>         err = team_port_del(team, port_dev);
>         mutex_unlock(&team->lock);
>
> --

Failing team_del_slave() is not an option. It will add various issues.
Jeongjun Park July 5, 2024, 3:17 p.m. UTC | #4
> >
> > Thanks for your comment. I rewrote the patch based on those comments.
> > This time, we modified it to return an error so that resources are not
> > modified when a race situation occurs. We would appreciate your
> > feedback on what this patch would be like.
> >
> > > Thanks,
> > > Michal
> > >
> > >
> >
> > Regards,
> > Jeongjun Park
> >
> > ---
> >  drivers/net/team/team_core.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/team/team_core.c b/drivers/net/team/team_core.c
> > index ab1935a4aa2c..43d7c73b25aa 100644
> > --- a/drivers/net/team/team_core.c
> > +++ b/drivers/net/team/team_core.c
> > @@ -1972,7 +1972,8 @@ static int team_add_slave(struct net_device *dev, struct net_device *port_dev,
> >         struct team *team = netdev_priv(dev);
> >         int err;
> >
> > -       mutex_lock(&team->lock);
> > +       if (!mutex_trylock(&team->lock))
> > +               return -EBUSY;
> >         err = team_port_add(team, port_dev, extack);
> >         mutex_unlock(&team->lock);
> >
> > @@ -1987,7 +1988,8 @@ static int team_del_slave(struct net_device *dev, struct net_device *port_dev)
> >         struct team *team = netdev_priv(dev);
> >         int err;
> >
> > -       mutex_lock(&team->lock);
> > +       if (!mutex_trylock(&team->lock))
> > +               return -EBUSY;
> >         err = team_port_del(team, port_dev);
> >         mutex_unlock(&team->lock);
> >
> > --
>
> Failing team_del_slave() is not an option. It will add various issues.

Thank you for comment. 

So, how about briefly releasing the lock before calling dev_open()
in team_port_add() and then locking it again? dev_open() does not use
&team, so disabling it briefly will not cause any major problems.

Regards,
Jeongjun Park

---
 drivers/net/team/team_core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/team/team_core.c b/drivers/net/team/team_core.c
index ab1935a4aa2c..245566a1875d 100644
--- a/drivers/net/team/team_core.c
+++ b/drivers/net/team/team_core.c
@@ -1213,7 +1213,9 @@ static int team_port_add(struct team *team, struct net_device *port_dev,
 		goto err_port_enter;
 	}
 
+	mutex_unlock(&team->lock);
 	err = dev_open(port_dev, extack);
+	mutex_lock(&team->lock);
 	if (err) {
 		netdev_dbg(dev, "Device %s opening failed\n",
 			   portname);
--
Jeongjun Park July 5, 2024, 3:19 p.m. UTC | #5
> >
> > Thanks for your comment. I rewrote the patch based on those comments.
> > This time, we modified it to return an error so that resources are not
> > modified when a race situation occurs. We would appreciate your
> > feedback on what this patch would be like.
> >
> > > Thanks,
> > > Michal
> > >
> > >
> >
> > Regards,
> > Jeongjun Park
> >
> > ---
> >  drivers/net/team/team_core.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/team/team_core.c b/drivers/net/team/team_core.c
> > index ab1935a4aa2c..43d7c73b25aa 100644
> > --- a/drivers/net/team/team_core.c
> > +++ b/drivers/net/team/team_core.c
> > @@ -1972,7 +1972,8 @@ static int team_add_slave(struct net_device *dev, struct net_device *port_dev,
> >         struct team *team = netdev_priv(dev);
> >         int err;
> >
> > -       mutex_lock(&team->lock);
> > +       if (!mutex_trylock(&team->lock))
> > +               return -EBUSY;
> >         err = team_port_add(team, port_dev, extack);
> >         mutex_unlock(&team->lock);
> >
> > @@ -1987,7 +1988,8 @@ static int team_del_slave(struct net_device *dev, struct net_device *port_dev)
> >         struct team *team = netdev_priv(dev);
> >         int err;
> >
> > -       mutex_lock(&team->lock);
> > +       if (!mutex_trylock(&team->lock))
> > +               return -EBUSY;
> >         err = team_port_del(team, port_dev);
> >         mutex_unlock(&team->lock);
> >
> > --
>
> Failing team_del_slave() is not an option. It will add various issues.

Thank you for comment. 

So, how about briefly releasing the lock before calling dev_open()
in team_port_add() and then locking it again? dev_open() does not use
&team, so disabling it briefly will not cause any major problems.

Regards,
Jeongjun Park

---
 drivers/net/team/team_core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/team/team_core.c b/drivers/net/team/team_core.c
index ab1935a4aa2c..245566a1875d 100644
--- a/drivers/net/team/team_core.c
+++ b/drivers/net/team/team_core.c
@@ -1213,7 +1213,9 @@ static int team_port_add(struct team *team, struct net_device *port_dev,
 		goto err_port_enter;
 	}
 
+	mutex_unlock(&team->lock);
 	err = dev_open(port_dev, extack);
+	mutex_lock(&team->lock);
 	if (err) {
 		netdev_dbg(dev, "Device %s opening failed\n",
 			   portname);
--
diff mbox series

Patch

diff --git a/drivers/net/team/team_core.c b/drivers/net/team/team_core.c
index ab1935a4aa2c..3ac82df876b0 100644
--- a/drivers/net/team/team_core.c
+++ b/drivers/net/team/team_core.c
@@ -1970,11 +1970,12 @@  static int team_add_slave(struct net_device *dev, struct net_device *port_dev,
                          struct netlink_ext_ack *extack)
 {
        struct team *team = netdev_priv(dev);
-       int err;
+       int err, locked;
 
-       mutex_lock(&team->lock);
+       locked = mutex_trylock(&team->lock);
        err = team_port_add(team, port_dev, extack);
-       mutex_unlock(&team->lock);
+       if (locked)
+               mutex_unlock(&team->lock);
 
        if (!err)
                netdev_change_features(dev);
@@ -1985,11 +1986,12 @@  static int team_add_slave(struct net_device *dev, struct net_device *port_dev,
 static int team_del_slave(struct net_device *dev, struct net_device *port_dev)
 {
        struct team *team = netdev_priv(dev);
-       int err;
+       int err, locked;
 
-       mutex_lock(&team->lock);
+       locked = mutex_trylock(&team->lock);
        err = team_port_del(team, port_dev);
-       mutex_unlock(&team->lock);
+       if (locked)
+               mutex_unlock(&team->lock);
 
        if (err)
                return err;