diff mbox

[4/8] IPoIB: fix mcast_dev_flush/mcast_restart_task race

Message ID e1dbcfc25d8930b281aad12699ebf8fa82485b0e.1407885724.git.dledford@redhat.com (mailing list archive)
State Superseded, archived
Headers show

Commit Message

Doug Ledford Aug. 12, 2014, 11:38 p.m. UTC
Our mcast_dev_flush routine and our mcast_restart_task can race against
each other.  In particular, they both hold the priv->lock while
manipulating the rbtree and while removing mcast entries from the
multicast_list and while adding entries to the remove_list, but they
also both drop their locks prior to doing the actual removes.  The
mcast_dev_flush routine is run entirely under the rtnl lock and so has
at least some locking.  The actual race condition is like this:

Thread 1                                Thread 2
ifconfig ib0 up
  start multicast join for broadcast
  multicast join completes for broadcast
  start to add more multicast joins
    call mcast_restart_task to add new entries
                                        ifconfig ib0 down
					  mcast_dev_flush
					    mcast_leave(mcast A)
    mcast_leave(mcast A)

As mcast_leave calls ib_sa_multicast_leave, and as member in
core/multicast.c is ref counted, we run into an unbalanced refcount
issue.  To avoid stomping on each other's removes, take the rtnl lock
specifically when we are deleting the entries from the remove list.

Signed-off-by: Doug Ledford <dledford@redhat.com>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 37 ++++++++++++++++++++++----
 1 file changed, 32 insertions(+), 5 deletions(-)

Comments

Wendy Cheng Aug. 29, 2014, 9:53 p.m. UTC | #1
On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford@redhat.com> wrote:
> Our mcast_dev_flush routine and our mcast_restart_task can race against
> each other.  In particular, they both hold the priv->lock while
> manipulating the rbtree and while removing mcast entries from the
> multicast_list and while adding entries to the remove_list, but they
> also both drop their locks prior to doing the actual removes.  The
> mcast_dev_flush routine is run entirely under the rtnl lock and so has
> at least some locking.  The actual race condition is like this:
>
> Thread 1                                Thread 2
> ifconfig ib0 up
>   start multicast join for broadcast
>   multicast join completes for broadcast
>   start to add more multicast joins
>     call mcast_restart_task to add new entries
>                                         ifconfig ib0 down
>                                           mcast_dev_flush
>                                             mcast_leave(mcast A)
>     mcast_leave(mcast A)
>
> As mcast_leave calls ib_sa_multicast_leave, and as member in
> core/multicast.c is ref counted, we run into an unbalanced refcount
> issue.  To avoid stomping on each others removes, take the rtnl lock
> specifically when we are deleting the entries from the remove list.

Isn't "test_and_clear_bit()" atomic, so it is unlikely that
ib_sa_free_multicast() can run multiple times?

    638 static int ipoib_mcast_leave(struct net_device *dev, struct
ipoib_mcast         *mcast)
    639 {
    640         struct ipoib_dev_priv *priv = netdev_priv(dev);
    641         int ret = 0;
    642
    643         if (test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
    644                 ib_sa_free_multicast(mcast->mc);
    645
    646         if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED,
&mcast->flags)        ) {


-- Wendy

>
> Signed-off-by: Doug Ledford <dledford@redhat.com>
> ---
>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 37 ++++++++++++++++++++++----
>  1 file changed, 32 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index f5e8da530d9..19e3fe75ebf 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -810,7 +810,10 @@ void ipoib_mcast_dev_flush(struct net_device *dev)
>
>         spin_unlock_irqrestore(&priv->lock, flags);
>
> -       /* seperate between the wait to the leave*/
> +       /*
> +        * make sure the in-flight joins have finished before we attempt
> +        * to leave
> +        */
>         list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
>                 if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
>                         wait_for_completion(&mcast->done);
> @@ -931,14 +934,38 @@ void ipoib_mcast_restart_task(struct work_struct *work)
>         netif_addr_unlock(dev);
>         local_irq_restore(flags);
>
> -       /* We have to cancel outside of the spinlock */
> +       /*
> +        * make sure the in-flight joins have finished before we attempt
> +        * to leave
> +        */
> +       list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
> +               if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
> +                       wait_for_completion(&mcast->done);
> +
> +       /*
> +        * We have to cancel outside of the spinlock, but we have to
> +        * take the rtnl lock or else we race with the removal of
> +        * entries from the remove list in mcast_dev_flush as part
> +        * of ipoib_stop() which will call mcast_stop_thread with
> +        * flush == 1 while holding the rtnl lock, and the
> +        * flush_workqueue won't complete until this restart_mcast_task
> +        * completes.  So do like the carrier on task and attempt to
> +        * take the rtnl lock, but if we can't before the ADMIN_UP flag
> +        * goes away, then just return and know that the remove list will
> +        * get flushed later by mcast_dev_flush.
> +        */
> +       while (!rtnl_trylock()) {
> +               if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
> +                       return;
> +               else
> +                       msleep(20);
> +       }
>         list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
>                 ipoib_mcast_leave(mcast->dev, mcast);
>                 ipoib_mcast_free(mcast);
>         }
> -
> -       if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
> -               ipoib_mcast_start_thread(dev);
> +       ipoib_mcast_start_thread(dev);
> +       rtnl_unlock();
>  }
>
>  #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
> --
> 1.9.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Wendy Cheng Aug. 30, 2014, 3:39 p.m. UTC | #2
On Fri, Aug 29, 2014 at 2:53 PM, Wendy Cheng <s.wendy.cheng@gmail.com> wrote:
> On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford@redhat.com> wrote:
>> Our mcast_dev_flush routine and our mcast_restart_task can race against
>> each other.  In particular, they both hold the priv->lock while
>> manipulating the rbtree and while removing mcast entries from the
>> multicast_list and while adding entries to the remove_list, but they
>> also both drop their locks prior to doing the actual removes.  The
>> mcast_dev_flush routine is run entirely under the rtnl lock and so has
>> at least some locking.  The actual race condition is like this:
>>
>> Thread 1                                Thread 2
>> ifconfig ib0 up
>>   start multicast join for broadcast
>>   multicast join completes for broadcast
>>   start to add more multicast joins
>>     call mcast_restart_task to add new entries
>>                                         ifconfig ib0 down
>>                                           mcast_dev_flush
>>                                             mcast_leave(mcast A)
>>     mcast_leave(mcast A)
>>
>> As mcast_leave calls ib_sa_multicast_leave, and as member in
>> core/multicast.c is ref counted, we run into an unbalanced refcount
>> issue.  To avoid stomping on each others removes, take the rtnl lock
>> specifically when we are deleting the entries from the remove list.
>
> Isn't "test_and_clear_bit()" atomic so it is unlikely that
> ib_sa_free_multicast() can run multiple times  ?

Oops .. how about if the structure itself gets freed ? My bad !

However, isn't the remove_list a local list on the caller's stack?
And since the original list entry moving (to remove_list) is protected
by the spin lock (priv->lock), isn't it unlikely that
ib_sa_free_multicast() can operate on the same entry?

The patch itself is harmless though .. but adding the rtnl_lock is
really not ideal.

-- Wendy
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Doug Ledford Sept. 3, 2014, 5:49 p.m. UTC | #3
On Fri, 2014-08-29 at 14:53 -0700, Wendy Cheng wrote:
> On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford@redhat.com> wrote:
> > Our mcast_dev_flush routine and our mcast_restart_task can race against
> > each other.  In particular, they both hold the priv->lock while
> > manipulating the rbtree and while removing mcast entries from the
> > multicast_list and while adding entries to the remove_list, but they
> > also both drop their locks prior to doing the actual removes.  The
> > mcast_dev_flush routine is run entirely under the rtnl lock and so has
> > at least some locking.  The actual race condition is like this:
> >
> > Thread 1                                Thread 2
> > ifconfig ib0 up
> >   start multicast join for broadcast
> >   multicast join completes for broadcast
> >   start to add more multicast joins
> >     call mcast_restart_task to add new entries
> >                                         ifconfig ib0 down
> >                                           mcast_dev_flush
> >                                             mcast_leave(mcast A)
> >     mcast_leave(mcast A)
> >
> > As mcast_leave calls ib_sa_multicast_leave, and as member in
> > core/multicast.c is ref counted, we run into an unbalanced refcount
> > issue.  To avoid stomping on each others removes, take the rtnl lock
> > specifically when we are deleting the entries from the remove list.
> 
> Isn't "test_and_clear_bit()" atomic so it is unlikely that
> ib_sa_free_multicast() can run multiple times  ?
> 
>     638 static int ipoib_mcast_leave(struct net_device *dev, struct
> ipoib_mcast         *mcast)
>     639 {
>     640         struct ipoib_dev_priv *priv = netdev_priv(dev);
>     641         int ret = 0;
>     642
>     643         if (test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
>     644                 ib_sa_free_multicast(mcast->mc);
>     645
>     646         if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED,
> &mcast->flags)        ) {

This is the original code, and you would be right, but it had other
problems.  Even though test_and_clear_bit is atomic, the original usage
of MCAST_FLAG_BUSY was really that the mcast join had started, there was
no guarantee it had completed (we would set the flag, call
ib_sa_multicast_join, and we could then race with our leave where it was
possible to leave prior to the join completing).

My patches change this around now, and the code only checks for
MCAST_FLAG_BUSY as a double check that we aren't leaving an in-process
join.

> 
> -- Wendy
> 
> >
> > Signed-off-by: Doug Ledford <dledford@redhat.com>
> > ---
> >  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 37 ++++++++++++++++++++++----
> >  1 file changed, 32 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> > index f5e8da530d9..19e3fe75ebf 100644
> > --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> > +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> > @@ -810,7 +810,10 @@ void ipoib_mcast_dev_flush(struct net_device *dev)
> >
> >         spin_unlock_irqrestore(&priv->lock, flags);
> >
> > -       /* seperate between the wait to the leave*/
> > +       /*
> > +        * make sure the in-flight joins have finished before we attempt
> > +        * to leave
> > +        */
> >         list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
> >                 if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
> >                         wait_for_completion(&mcast->done);
> > @@ -931,14 +934,38 @@ void ipoib_mcast_restart_task(struct work_struct *work)
> >         netif_addr_unlock(dev);
> >         local_irq_restore(flags);
> >
> > -       /* We have to cancel outside of the spinlock */
> > +       /*
> > +        * make sure the in-flight joins have finished before we attempt
> > +        * to leave
> > +        */
> > +       list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
> > +               if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
> > +                       wait_for_completion(&mcast->done);
> > +
> > +       /*
> > +        * We have to cancel outside of the spinlock, but we have to
> > +        * take the rtnl lock or else we race with the removal of
> > +        * entries from the remove list in mcast_dev_flush as part
> > +        * of ipoib_stop() which will call mcast_stop_thread with
> > +        * flush == 1 while holding the rtnl lock, and the
> > +        * flush_workqueue won't complete until this restart_mcast_task
> > +        * completes.  So do like the carrier on task and attempt to
> > +        * take the rtnl lock, but if we can't before the ADMIN_UP flag
> > +        * goes away, then just return and know that the remove list will
> > +        * get flushed later by mcast_dev_flush.
> > +        */
> > +       while (!rtnl_trylock()) {
> > +               if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
> > +                       return;
> > +               else
> > +                       msleep(20);
> > +       }
> >         list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
> >                 ipoib_mcast_leave(mcast->dev, mcast);
> >                 ipoib_mcast_free(mcast);
> >         }
> > -
> > -       if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
> > -               ipoib_mcast_start_thread(dev);
> > +       ipoib_mcast_start_thread(dev);
> > +       rtnl_unlock();
> >  }
> >
> >  #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
> > --
> > 1.9.3
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
Doug Ledford Sept. 3, 2014, 6:06 p.m. UTC | #4
On Sat, 2014-08-30 at 08:39 -0700, Wendy Cheng wrote:
> On Fri, Aug 29, 2014 at 2:53 PM, Wendy Cheng <s.wendy.cheng@gmail.com> wrote:
> > On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford@redhat.com> wrote:
> >> Our mcast_dev_flush routine and our mcast_restart_task can race against
> >> each other.  In particular, they both hold the priv->lock while
> >> manipulating the rbtree and while removing mcast entries from the
> >> multicast_list and while adding entries to the remove_list, but they
> >> also both drop their locks prior to doing the actual removes.  The
> >> mcast_dev_flush routine is run entirely under the rtnl lock and so has
> >> at least some locking.  The actual race condition is like this:
> >>
> >> Thread 1                                Thread 2
> >> ifconfig ib0 up
> >>   start multicast join for broadcast
> >>   multicast join completes for broadcast
> >>   start to add more multicast joins
> >>     call mcast_restart_task to add new entries
> >>                                         ifconfig ib0 down
> >>                                           mcast_dev_flush
> >>                                             mcast_leave(mcast A)
> >>     mcast_leave(mcast A)
> >>
> >> As mcast_leave calls ib_sa_multicast_leave, and as member in
> >> core/multicast.c is ref counted, we run into an unbalanced refcount
> >> issue.  To avoid stomping on each others removes, take the rtnl lock
> >> specifically when we are deleting the entries from the remove list.
> >
> > Isn't "test_and_clear_bit()" atomic so it is unlikely that
> > ib_sa_free_multicast() can run multiple times  ?
> 
> Oops .. how about if the structure itself gets freed ? My bad !

Well, just like the last email, the code you are referring to is in the
original code, and had other issues.  After my patches it does not look
like that.

> However, isn't that the remove_list a local list on the caller's stack
> ? .. and  the original list entry moving (to remove_list) is protected
> by the spin lock (priv->lock), it is unlikely that the
> ib_sa_free_multicast() can operate on the same entry ?

Yes, you're right.  I had it in my mind that the remove_list was part of
the ipoib private dev, not local on the stack.  So you are right that we
could probably get away with removing the rtnl lock there (although it
would need to be in a later patch than the one you are reviewing
here...here there would still be a race between the restart task and the
downing of the interface because they all still share the same work
queue, but once we switch to the per device work queues in patch #6,
this can happen in parallel safely with the flush task I think).

> The patch itself is harmless though .. but adding the rtnl_lock is
> really not ideal.
> 
> -- Wendy
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Wendy Cheng Sept. 3, 2014, 7:45 p.m. UTC | #5
On Wed, Sep 3, 2014 at 11:06 AM, Doug Ledford <dledford@redhat.com> wrote:
> On Sat, 2014-08-30 at 08:39 -0700, Wendy Cheng wrote:
>> On Fri, Aug 29, 2014 at 2:53 PM, Wendy Cheng <s.wendy.cheng@gmail.com> wrote:
>> > On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford@redhat.com> wrote:
>> >> Our mcast_dev_flush routine and our mcast_restart_task can race against
>> >> each other.  In particular, they both hold the priv->lock while
>> >> manipulating the rbtree and while removing mcast entries from the
>> >> multicast_list and while adding entries to the remove_list, but they
>> >> also both drop their locks prior to doing the actual removes.  The
>> >> mcast_dev_flush routine is run entirely under the rtnl lock and so has
>> >> at least some locking.  The actual race condition is like this:
>> >>
>> >> Thread 1                                Thread 2
>> >> ifconfig ib0 up
>> >>   start multicast join for broadcast
>> >>   multicast join completes for broadcast
>> >>   start to add more multicast joins
>> >>     call mcast_restart_task to add new entries
>> >>                                         ifconfig ib0 down
>> >>                                           mcast_dev_flush
>> >>                                             mcast_leave(mcast A)
>> >>     mcast_leave(mcast A)
>> >>
>> >> As mcast_leave calls ib_sa_multicast_leave, and as member in
>> >> core/multicast.c is ref counted, we run into an unbalanced refcount
>> >> issue.  To avoid stomping on each others removes, take the rtnl lock
>> >> specifically when we are deleting the entries from the remove list.
>> >

[snip] ..

>> However, isn't that the remove_list a local list on the caller's stack
>> ? .. and  the original list entry moving (to remove_list) is protected
>> by the spin lock (priv->lock), it is unlikely that the
>> ib_sa_free_multicast() can operate on the same entry ?
>
> Yes, you're right.  I had it in my mind that the remove_list was part of
> the ipoib private dev, not local on the stack.  So you are right that we
> could probably get away with removing the rtnl lock there (although it
> would need to be in a later patch than the one you are reviewing
> here...here there would still be a race between the restart task and the
> downing of the interface because they all still share the same work
> queue, but once we switch to the per device work queues in patch #6,
> this can happen in parallel safely with the flush task I think).
>

Yep, that local "remove_list" is easy to miss.

BTW, you might want to modify the text description of this patch to
make your point clear.

-- Wendy
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index f5e8da530d9..19e3fe75ebf 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -810,7 +810,10 @@  void ipoib_mcast_dev_flush(struct net_device *dev)
 
 	spin_unlock_irqrestore(&priv->lock, flags);
 
-	/* seperate between the wait to the leave*/
+	/*
+	 * make sure the in-flight joins have finished before we attempt
+	 * to leave
+	 */
 	list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
 		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
 			wait_for_completion(&mcast->done);
@@ -931,14 +934,38 @@  void ipoib_mcast_restart_task(struct work_struct *work)
 	netif_addr_unlock(dev);
 	local_irq_restore(flags);
 
-	/* We have to cancel outside of the spinlock */
+	/*
+	 * make sure the in-flight joins have finished before we attempt
+	 * to leave
+	 */
+	list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
+		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
+			wait_for_completion(&mcast->done);
+
+	/*
+	 * We have to cancel outside of the spinlock, but we have to
+	 * take the rtnl lock or else we race with the removal of
+	 * entries from the remove list in mcast_dev_flush as part
+	 * of ipoib_stop() which will call mcast_stop_thread with
+	 * flush == 1 while holding the rtnl lock, and the
+	 * flush_workqueue won't complete until this restart_mcast_task
+	 * completes.  So do like the carrier on task and attempt to
+	 * take the rtnl lock, but if we can't before the ADMIN_UP flag
+	 * goes away, then just return and know that the remove list will
+	 * get flushed later by mcast_dev_flush.
+	 */
+	while (!rtnl_trylock()) {
+		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+			return;
+		else
+			msleep(20);
+	}
 	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
 		ipoib_mcast_leave(mcast->dev, mcast);
 		ipoib_mcast_free(mcast);
 	}
-
-	if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
-		ipoib_mcast_start_thread(dev);
+	ipoib_mcast_start_thread(dev);
+	rtnl_unlock();
 }
 
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG