
[rdma-cm] IB/core: Fix use after free of ifa

Message ID 1444910463-5688-2-git-send-email-matanb@mellanox.com (mailing list archive)
State Accepted

Commit Message

Matan Barak Oct. 15, 2015, 12:01 p.m. UTC
When using ifup/ifdown while executing enum_netdev_ipv4_ips,
ifa could become invalid and cause a use-after-free error.
Fix it by protecting the ifa list traversal with the RCU lock.

Fixes: 03db3a2d81e6 ('IB/core: Add RoCE GID table management')
Signed-off-by: Matan Barak <matanb@mellanox.com>
---

Hi Doug,

This patch fixes a bug in the RoCE GID table implementation. Under stress conditions
where ifup/ifdown are used, the ifa pointer could become invalid. We take an RCU
lock to avoid the ifa node being freed while we work on it, as done in other inet
functions (for example, inet_addr_onlink).

Our QA team verified that this patch fixes this issue.

Thanks,
Matan

 drivers/infiniband/core/roce_gid_mgmt.c | 35 +++++++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 8 deletions(-)

Comments

Doug Ledford Oct. 15, 2015, 5:37 p.m. UTC | #1
On 10/15/2015 08:01 AM, Matan Barak wrote:
> When using ifup/ifdown while executing enum_netdev_ipv4_ips,
> ifa could become invalid and cause use after free error.
> Fixing it by protecting with RCU lock.
> 
> Fixes: 03db3a2d81e6 ('IB/core: Add RoCE GID table management')
> Signed-off-by: Matan Barak <matanb@mellanox.com>
> ---
> 
> Hi Doug,
> 
> This patch fixes a bug in RoCE GID table implementation. Under stress conditions
> where ifup/ifdown are used, the ifa pointer could become invalid. Using a
> RCU lock in order to avoid freeing the ifa node (as done in other inet functions
> (for example, inet_addr_onlink).
> 
> Our QA team verified that this patch fixes this issue.

This doesn't look like a good fix to me.  In particular, I think you
merely shifted the bug around, you didn't actually resolve it.

In the original code, you called update_gid_ip while holding a reference
to in_dev.  The reference to in_dev was not enough to keep the ifa list
from changing while you were doing your work.  It's not surprising that
you hit a race with the ifa list because update_gid_ip being called
synchronously can both A) sleep because of the mutexes it takes and B)
be slow because of how many locks it takes (and it can really take a lot
due to find_gid) and C) be slow again because of updating the gid table
calling into the low level driver and actually writing a mailbox command
to the card.  So, for all those reasons, not only did you hit this race,
but you were *likely* to hit this race.

Now, you've put an rcu_read_lock on ndev instead.  And you're no longer
seeing the race.  However, does taking the rcu_read_lock on ndev
actually protect the ifa list on ndev, or is the real fix the fact that
you moved update_gid_ip out of the main loop?  Before, you blocked while
processing the ifa list, making hitting your race likely.  Now you
process the ifa list very fast and build your own sin_list that is no
longer impacted by changes to the ifa list, but I don't know that the
rcu_read_lock you have taken actually makes you for sure safe here
versus the possibility that you have just made the race much harder to
hit and hidden it.

And even if the rcu_read_lock is for sure safe in terms of accessing the
ifa list, these changes may have just introduced a totally new bug that
your QE tests haven't exposed but might exist none the less.  In
particular, we have now queued up adding a bunch of gids to the ndev.
But we drop our reference to the rcu lock, then we invoke a (possibly
large number) of sleeping iterations.  What's to prevent a situation
where we get enum_netdev_ipv4_ips() called on say a vlan child interface
of a primary RoCE netdev, create our address list, release our lock,
then the user destroys our vlan device, and we race with del_netdev_ips
on the vlan device, such that del_netdev_ips completes and removes all
the gids for that netdev, but we still have backlogged gid add events in
enum_netdev_ipv4_ips and so we add back in what will become permanently
stale gids?  I don't think we hold rtnl_lock while running in
enum_netdev_ipv4_ips and that's probably the only lock that would
exclude the user from deleting the vlan device, so as far as I can tell
we can easily call del_netdev_ips while the tail end of
enum_netdev_ipv4_ips is sleeping.  Am I wrong here?  A test would be to
take whatever QE test you have that hit this bug in the first place, and
on a different terminal add a while loop of adding/removing the same
vlan interface that you are updating gids on and see if the gid table
starts filling up with stale, unremovable entries.

> Thanks,
> Matan
> 
>  drivers/infiniband/core/roce_gid_mgmt.c | 35 +++++++++++++++++++++++++--------
>  1 file changed, 27 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/infiniband/core/roce_gid_mgmt.c b/drivers/infiniband/core/roce_gid_mgmt.c
> index 6b24cba..178f984 100644
> --- a/drivers/infiniband/core/roce_gid_mgmt.c
> +++ b/drivers/infiniband/core/roce_gid_mgmt.c
> @@ -250,25 +250,44 @@ static void enum_netdev_ipv4_ips(struct ib_device *ib_dev,
>  				 u8 port, struct net_device *ndev)
>  {
>  	struct in_device *in_dev;
> +	struct sin_list {
> +		struct list_head	list;
> +		struct sockaddr_in	ip;
> +	};
> +	struct sin_list *sin_iter;
> +	struct sin_list *sin_temp;
>  
> +	LIST_HEAD(sin_list);
>  	if (ndev->reg_state >= NETREG_UNREGISTERING)
>  		return;
>  
> -	in_dev = in_dev_get(ndev);
> -	if (!in_dev)
> +	rcu_read_lock();
> +	in_dev = __in_dev_get_rcu(ndev);
> +	if (!in_dev) {
> +		rcu_read_unlock();
>  		return;
> +	}
>  
>  	for_ifa(in_dev) {
> -		struct sockaddr_in ip;
> +		struct sin_list *entry = kzalloc(sizeof(*entry), GFP_ATOMIC);
>  
> -		ip.sin_family = AF_INET;
> -		ip.sin_addr.s_addr = ifa->ifa_address;
> -		update_gid_ip(GID_ADD, ib_dev, port, ndev,
> -			      (struct sockaddr *)&ip);
> +		if (!entry) {
> +			pr_warn("roce_gid_mgmt: couldn't allocate entry for IPv4 update\n");
> +			continue;
> +		}
> +		entry->ip.sin_family = AF_INET;
> +		entry->ip.sin_addr.s_addr = ifa->ifa_address;
> +		list_add_tail(&entry->list, &sin_list);
>  	}
>  	endfor_ifa(in_dev);
> +	rcu_read_unlock();
>  
> -	in_dev_put(in_dev);
> +	list_for_each_entry_safe(sin_iter, sin_temp, &sin_list, list) {
> +		update_gid_ip(GID_ADD, ib_dev, port, ndev,
> +			      (struct sockaddr *)&sin_iter->ip);
> +		list_del(&sin_iter->list);
> +		kfree(sin_iter);
> +	}
>  }
>  
>  static void enum_netdev_ipv6_ips(struct ib_device *ib_dev,
>
Jason Gunthorpe Oct. 15, 2015, 5:53 p.m. UTC | #2
On Thu, Oct 15, 2015 at 01:37:22PM -0400, Doug Ledford wrote:

> Now, you've put an rcu_read_lock on ndev instead.  And you're no longer
> seeing the race.  However, does taking the rcu_read_lock on ndev
> actually protect the ifa list on ndev

It looks OK, the free side is using call_rcu - though I'm not quite
sure why for_ifa isn't using rcu_dereference...

> And even if the rcu_read_lock is for sure safe in terms of accessing the
> ifa list, these changes may have just introduced a totally new bug that
> your QE tests haven't exposed but might exist none the less.

This event driven design is super complex/frail, if it actually
doesn't work, it should probably be changed to a level based system.

- Record incoming events indicating a change in ip list has happened
  for an ib device, Schedule a task to refresh the list
- Refresh task holds rtnl locks and builds a list of IP addresses associated
  with the ib device. This is atomic with user changes to the IP list.
- Drops locks and then directly synchs the discovered list above with
  the device's record of associated IPs. (ie run set intersection)

It is much less subtle than hoping all the async event driven stuff
in here works out, probably a lot less code too.
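
As an illustration only, such a level-triggered refresh could be shaped
roughly like the sketch below; build_ip_list_for_ibdev(), sync_gid_table()
and free_ip_list() are hypothetical helpers, not functions that exist in
the tree:

/* Needs <linux/workqueue.h>, <linux/rtnetlink.h>, <rdma/ib_verbs.h>. */
struct roce_refresh_work {
    struct work_struct work;
    struct ib_device *ib_dev;
    u8 port;
};

/* Sketch: every inet/netdev event only marks the device dirty and schedules
 * this work; the work rebuilds the full IP list and reconciles the GID
 * table with it.
 */
static void refresh_gid_table(struct work_struct *_work)
{
    struct roce_refresh_work *work =
        container_of(_work, struct roce_refresh_work, work);
    LIST_HEAD(current_ips);

    rtnl_lock();        /* freeze netdev/IP changes while we snapshot */
    build_ip_list_for_ibdev(work->ib_dev, work->port, &current_ips);
    rtnl_unlock();

    /* Reconcile outside the lock: add missing GIDs, drop stale ones. */
    sync_gid_table(work->ib_dev, work->port, &current_ips);

    free_ip_list(&current_ips);
    kfree(work);        /* allocated when the work was scheduled */
}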

Jason
Matan Barak Oct. 18, 2015, 7:49 a.m. UTC | #3
On Thu, Oct 15, 2015 at 8:37 PM, Doug Ledford <dledford@redhat.com> wrote:
> On 10/15/2015 08:01 AM, Matan Barak wrote:
>> When using ifup/ifdown while executing enum_netdev_ipv4_ips,
>> ifa could become invalid and cause use after free error.
>> Fixing it by protecting with RCU lock.
>>
>> Fixes: 03db3a2d81e6 ('IB/core: Add RoCE GID table management')
>> Signed-off-by: Matan Barak <matanb@mellanox.com>
>> ---
>>
>> Hi Doug,
>>
>> This patch fixes a bug in RoCE GID table implementation. Under stress conditions
>> where ifup/ifdown are used, the ifa pointer could become invalid. Using a
>> RCU lock in order to avoid freeing the ifa node (as done in other inet functions
>> (for example, inet_addr_onlink).
>>
>> Our QA team verified that this patch fixes this issue.
>
> This doesn't look like a good fix to me.  In particular, I think you
> merely shifted the bug around, you didn't actually resolve it.
>
> In the original code, you called update_gid_ip while holding a reference
> to in_dev.  The reference to in_dev was not enough to keep the ifa list
> from changing while you were doing your work.  It's not surprising that
> you hit a race with the ifa list because update_gid_ip being called
> synchronously can both A) sleep because of the mutexes it takes and B)
> be slow because of how many locks it takes (and it can really take a lot
> due to find_gid) and C) be slow again because of updating the gid table
> calling into the low level driver and actually writing a mailbox command
> to the card.  So, for all those reasons, not only did you hit this race,
> but you were *likely* to hit this race.
>

I don't mind that the list could be changing between the inet event
and the work handler.
I do mind the ifa is released while working on it. I think the major
reason for possible slowness is the vendor call. Most locks are
per-entry and are read-write locks.

> Now, you've put an rcu_read_lock on ndev instead.  And you're no longer
> seeing the race.  However, does taking the rcu_read_lock on ndev
> actually protect the ifa list on ndev, or is the real fix the fact that
> you moved update_gid_ip out of the main loop?  Before, you blocked while
> processing the ifa list, making hitting your race likely.  Now you
> process the ifa list very fast and build your own sin_list that is no
> longer impacted by changes to the ifa list, but I don't know that the
> rcu_read_lock you have taken actually makes you for sure safe here
> versus the possibility that you have just made the race much harder to
> hit and hidden it.
>

As Jason wrote, the release of the ifa is protected by call_rcu. So
protecting the usage of ifa with RCU should be enough to eliminate
this bug.


> And even if the rcu_read_lock is for sure safe in terms of accessing the
> ifa list, these changes may have just introduced a totally new bug that
> your QE tests haven't exposed but might exist none the less.  In
> particular, we have now queued up adding a bunch of gids to the ndev.
> But we drop our reference to the rcu lock, then we invoke a (possibly
> large number) of sleeping iterations.  What's to prevent a situation
> where we get enum_netdev_ipv4_ips() called on say a vlan child interface
> of a primary RoCE netdev, create our address list, release our lock,
> then the user destroys our vlan device, and we race with del_netdev_ips
> on the vlan device, such that del_netdev_ips completes and removes all
> the gids for that netdev, but we still have backlogged gid add events in
> enum_netdev_ipv4_ips and so we add back in what will become permanently
> stale gids?  I don't think we hold rtnl_lock while running in
> enum_netdev_ipv4_ips and that's probably the only lock that would
> exclude the user from deleting the vlan device, so as far as I can tell
> we can easily call del_netdev_ips while the tail end of
> enum_netdev_ipv4_ips is sleeping.  Am I wrong here?  A test would be to
> take whatever QE test you have that hit this bug in the first place, and
> on a different terminal add a while loop of adding/removing the same
> vlan interface that you are updating gids on and see if the gid table
> starts filling up with stale, unremovable entries.
>

The RoCE GID management design uses events handlers and one workqueue.
When an event (inet/net) is handled, we hold the net device and
execute a work in the workqueue.
The works are executed in a queue - thus first-come first-served.
That's why if you add/del a vlan (or its IP)
we do dev_hold in the event itself. Since the ndev is available in the
event and is held when executing the event - it can't be deleted until
we handle this
event in the workqueue. If the user tries to delete the vlan before
our add (inet/ndev) work was completed, we'll get an UNREGISTER event,
but since the dev is held, the stack will have to wait until we free
all our ref counts to this device. Using a queue guarantees us the
order - we'll first complete adding the vlan and then delete it. Only
after all reference counts are dropped, the net device could be
deleted.
Anyway, I'll ask the QA team here to add this test.
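
For illustration, the hold-then-queue pattern described above boils down to
something like the following sketch (simplified; struct netdev_gid_work, the
handler and gid_wq are placeholder names rather than the exact code in
roce_gid_mgmt.c):

/* Needs <linux/netdevice.h>, <linux/workqueue.h>, <linux/slab.h>. */
static struct workqueue_struct *gid_wq;  /* created elsewhere with
                                          * alloc_ordered_workqueue() so
                                          * works run in submission order */

struct netdev_gid_work {
    struct work_struct work;
    struct net_device *ndev;
};

/* Work side: runs later, in process context, in the order it was queued. */
static void netdev_gid_work_handler(struct work_struct *_work)
{
    struct netdev_gid_work *work =
        container_of(_work, struct netdev_gid_work, work);

    /* ... add or delete GIDs for work->ndev here ... */

    dev_put(work->ndev);    /* only now can the netdev teardown complete */
    kfree(work);
}

/* Event side: runs in notifier context, possibly atomic. */
static int netdev_event(struct notifier_block *nb, unsigned long event,
                        void *ptr)
{
    struct net_device *ndev = netdev_notifier_info_to_dev(ptr);
    struct netdev_gid_work *work = kmalloc(sizeof(*work), GFP_ATOMIC);

    if (!work)
        return NOTIFY_DONE;

    dev_hold(ndev);         /* keep ndev alive until the work has run */
    work->ndev = ndev;
    INIT_WORK(&work->work, netdev_gid_work_handler);
    queue_work(gid_wq, &work->work);

    return NOTIFY_DONE;
}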

Thanks for taking a look at this patch.

Matan Barak Oct. 18, 2015, 7:51 a.m. UTC | #4
On Thu, Oct 15, 2015 at 8:53 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Thu, Oct 15, 2015 at 01:37:22PM -0400, Doug Ledford wrote:
>
>> Now, you've put an rcu_read_lock on ndev instead.  And you're no longer
>> seeing the race.  However, does taking the rcu_read_lock on ndev
>> actually protect the ifa list on ndev
>
> It looks OK, the free side is using call_rcu - though I'm not quite
> sure why for_ifa isn't using rcu_dereference...
>

I was wondering the same regarding the rcu_dereference.

>> And even if the rcu_read_lock is for sure safe in terms of accessing the
>> ifa list, these changes may have just introduced a totally new bug that
>> your QE tests haven't exposed but might exist none the less.
>
> This event driven design is super complex/frail, if it actually
> doesn't work, it should probably be changed to a level based system.
>
> - Record incoming events indicating a change in ip list has happened
>   for an ib device, Schedule a task to refresh the list
> - Refresh task holds rtnl locks and builds a list of IP addresses associated
>   with the ib device. This is atomic with user changes to the IP list.
> - Drops locks and then directly synchs the discovered list above with
>   the device's record of associated IPs. (ie run set intersection)
>
> It is much less subtle than hoping all the async event driven stuff
> in here works out, probably a lot less code too.
>

It's simpler, but every IP change could make us refresh the whole list
and possibly the GID indices as well.

> Jason
> --

Matan

Doug Ledford Oct. 19, 2015, 12:23 p.m. UTC | #5
On 10/18/2015 03:49 AM, Matan Barak wrote:
> On Thu, Oct 15, 2015 at 8:37 PM, Doug Ledford <dledford@redhat.com> wrote:
>> On 10/15/2015 08:01 AM, Matan Barak wrote:
>>> When using ifup/ifdown while executing enum_netdev_ipv4_ips,
>>> ifa could become invalid and cause use after free error.
>>> Fixing it by protecting with RCU lock.
>>>
>>> Fixes: 03db3a2d81e6 ('IB/core: Add RoCE GID table management')
>>> Signed-off-by: Matan Barak <matanb@mellanox.com>
>>> ---
>>>
>>> Hi Doug,
>>>
>>> This patch fixes a bug in RoCE GID table implementation. Under stress conditions
>>> where ifup/ifdown are used, the ifa pointer could become invalid. Using a
>>> RCU lock in order to avoid freeing the ifa node (as done in other inet functions
>>> (for example, inet_addr_onlink).
>>>
>>> Our QA team verified that this patch fixes this issue.
>>
>> This doesn't look like a good fix to me.  In particular, I think you
>> merely shifted the bug around, you didn't actually resolve it.
>>
>> In the original code, you called update_gid_ip while holding a reference
>> to in_dev.  The reference to in_dev was not enough to keep the ifa list
>> from changing while you were doing your work.  It's not surprising that
>> you hit a race with the ifa list because update_gid_ip being called
>> synchronously can both A) sleep because of the mutexes it takes and B)
>> be slow because of how many locks it takes (and it can really take a lot
>> due to find_gid) and C) be slow again because of updating the gid table
>> calling into the low level driver and actually writing a mailbox command
>> to the card.  So, for all those reasons, not only did you hit this race,
>> but you were *likely* to hit this race.
>>
> 
> I don't mind that the list could be changing between the inet event
> and the work handler.
> I do mind the ifa is released while working on it. I think the major
> reason for possible slowness is the vendor call.

No, it's not.

> Most locks are
> per-entry and are read-write locks.

This is a major cause of the slowness.  Unless you have a specific need
of them, per-entry rwlocks are *NOT* a good idea.  I was going to bring
this up separately, so I'll just mention it here.  Per entry locks help
reduce contention when you have lots of multiple accessors.  Using
rwlocks help reduce contention when you have a read-mostly entry that is
only occasionally changed.  But every lock and every unlock (just like
every atomic access) still requires a locked memory cycle.  That means
every lock acquire and every lock release requires a synchronization
event between all CPUs.  Using per-entry locks means that every entry
triggers two of those synchronization events.  On modern Intel CPUs,
they cost about 32 cycles per event.  If you are going to do something,
and it can't be done without a lock, then grab a single lock and do it
as fast as you can.  Only in rare cases would you want per-entry locks.

>> Now, you've put an rcu_read_lock on ndev instead.  And you're no longer
>> seeing the race.  However, does taking the rcu_read_lock on ndev
>> actually protect the ifa list on ndev, or is the real fix the fact that
>> you moved update_gid_ip out of the main loop?  Before, you blocked while
>> processing the ifa list, making hitting your race likely.  Now you
>> process the ifa list very fast and build your own sin_list that is no
>> longer impacted by changes to the ifa list, but I don't know that the
>> rcu_read_lock you have taken actually makes you for sure safe here
>> versus the possibility that you have just made the race much harder to
>> hit and hidden it.
>>
> 
> As Jason wrote, the release of the ifa is protected by call_rcu. So
> protecting the usage of ifa with RCU should be enough to eliminate
> this bug.

OK, I'm happy enough with the explanation to take this patch.  But
please think about the per-entry locks you mention above.  Those need to
go if possible.  I'm relatively certain that you will be able to
demonstrate a *drastic* speed up in this code with them removed (try
running the cmtime application from librdmacm-utils and see what the
results are with per-entry locks and a per table lock instead).

> 
>> And even if the rcu_read_lock is for sure safe in terms of accessing the
>> ifa list, these changes may have just introduced a totally new bug that
>> your QE tests haven't exposed but might exist none the less.  In
>> particular, we have now queued up adding a bunch of gids to the ndev.
>> But we drop our reference to the rcu lock, then we invoke a (possibly
>> large number) of sleeping iterations.  What's to prevent a situation
>> where we get enum_netdev_ipv4_ips() called on say a vlan child interface
>> of a primary RoCE netdev, create our address list, release our lock,
>> then the user destroys our vlan device, and we race with del_netdev_ips
>> on the vlan device, such that del_netdev_ips completes and removes all
>> the gids for that netdev, but we still have backlogged gid add events in
>> enum_netdev_ipv4_ips and so we add back in what will become permanently
>> stale gids?  I don't think we hold rtnl_lock while running in
>> enum_netdev_ipv4_ips and that's probably the only lock that would
>> exclude the user from deleting the vlan device, so as far as I can tell
>> we can easily call del_netdev_ips while the tail end of
>> enum_netdev_ipv4_ips is sleeping.  Am I wrong here?  A test would be to
>> take whatever QE test you have that hit this bug in the first place, and
>> on a different terminal add a while loop of adding/removing the same
>> vlan interface that you are updating gids on and see if the gid table
>> starts filling up with stale, unremovable entries.
>>
> 
> The RoCE GID management design uses events handlers and one workqueue.
> When an event (inet/net) is handled, we hold the net device and
> execute a work in the workqueue.
> The works are executed in a queue - thus first-come first-served.
> That's why if you add/del a vlan (or its IP)
> we do dev_hold in the event itself. Since the ndev is available in the
> event and is held when executing the event - it can't be deleted until
> we handle this
> event in the workqueue. If the user tries to delete the vlan before
> our add (inet/ndev) work was completed, we'll get an UNREGISTER event,
> but since the dev is held, the stack will have to wait until we free
> all our ref counts to this device. Using a queue guarantees us the
> order - we'll first complete adding the vlan and then delete it. Only
> after all reference counts are dropped, the net device could be
> deleted.
> Anyway, I'll ask the QA team here to add this test.

OK.  And it works for now, but if there is ever added an additional
means of running one of these functions that isn't on the work queue,
then it can break subtly.

Matan Barak Oct. 19, 2015, 2:20 p.m. UTC | #6
On 10/19/2015 3:23 PM, Doug Ledford wrote:
> On 10/18/2015 03:49 AM, Matan Barak wrote:
>> On Thu, Oct 15, 2015 at 8:37 PM, Doug Ledford <dledford@redhat.com> wrote:
>>> On 10/15/2015 08:01 AM, Matan Barak wrote:
>>>> When using ifup/ifdown while executing enum_netdev_ipv4_ips,
>>>> ifa could become invalid and cause use after free error.
>>>> Fixing it by protecting with RCU lock.
>>>>
>>>> Fixes: 03db3a2d81e6 ('IB/core: Add RoCE GID table management')
>>>> Signed-off-by: Matan Barak <matanb@mellanox.com>
>>>> ---
>>>>
>>>> Hi Doug,
>>>>
>>>> This patch fixes a bug in RoCE GID table implementation. Under stress conditions
>>>> where ifup/ifdown are used, the ifa pointer could become invalid. Using a
>>>> RCU lock in order to avoid freeing the ifa node (as done in other inet functions
>>>> (for example, inet_addr_onlink).
>>>>
>>>> Our QA team verified that this patch fixes this issue.
>>>
>>> This doesn't look like a good fix to me.  In particular, I think you
>>> merely shifted the bug around, you didn't actually resolve it.
>>>
>>> In the original code, you called update_gid_ip while holding a reference
>>> to in_dev.  The reference to in_dev was not enough to keep the ifa list
>>> from changing while you were doing your work.  It's not surprising that
>>> you hit a race with the ifa list because update_gid_ip being called
>>> synchronously can both A) sleep because of the mutexes it takes and B)
>>> be slow because of how many locks it takes (and it can really take a lot
>>> due to find_gid) and C) be slow again because of updating the gid table
>>> calling into the low level driver and actually writing a mailbox command
>>> to the card.  So, for all those reasons, not only did you hit this race,
>>> but you were *likely* to hit this race.
>>>
>>
>> I don't mind that the list could be changing between the inet event
>> and the work handler.
>> I do mind the ifa is released while working on it. I think the major
>> reason for possible slowness is the vendor call.
>
> No, it's not.
>
>> Most locks are
>> per-entry and are read-write locks.
>
> This is a major cause of the slowness.  Unless you have a specific need
> of them, per-entry rwlocks are *NOT* a good idea.  I was going to bring
> this up separately, so I'll just mention it here.  Per entry locks help
> reduce contention when you have lots of multiple accessors.  Using
> rwlocks help reduce contention when you have a read-mostly entry that is
> only occasionally changed.  But every lock and every unlock (just like
> every atomic access) still requires a locked memory cycle.  That means
> every lock acquire and every lock release requires a synchronization
> event between all CPUs.  Using per-entry locks means that every entry
> triggers two of those synchronization events.  On modern Intel CPUs,
> they cost about 32 cycles per event.  If you are going to do something,
> and it can't be done without a lock, then grab a single lock and do it
> as fast as you can.  Only in rare cases would you want per-entry locks.
>

I agree that every rwlock costs us locked access. However, lets look at 
the common scenario here. I think that in production stable systems, the 
IPs our rdma-cm stack uses (and thus GIDs) should be pretty stable. 
Thus, most accesses should be read calls. That's why IMHO read-write 
access makes sense here.

Regarding single-lock vs per entry lock, it really depends on how common it
is for an entry to be updated while another entry is being used by an
application. In a (future) dynamic system, you might want to create
containers dynamically, which will add a net device and change the 
hardware GID table while other application (maybe in another container) 
uses other GIDs, but this might be rare scenario.

>>> Now, you've put an rcu_read_lock on ndev instead.  And you're no longer
>>> seeing the race.  However, does taking the rcu_read_lock on ndev
>>> actually protect the ifa list on ndev, or is the real fix the fact that
>>> you moved update_gid_ip out of the main loop?  Before, you blocked while
>>> processing the ifa list, making hitting your race likely.  Now you
>>> process the ifa list very fast and build your own sin_list that is no
>>> longer impacted by changes to the ifa list, but I don't know that the
>>> rcu_read_lock you have taken actually makes you for sure safe here
>>> versus the possibility that you have just made the race much harder to
>>> hit and hidden it.
>>>
>>
>> As Jason wrote, the release of the ifa is protected by call_rcu. So
>> protecting the usage of ifa with RCU should be enough to eliminate
>> this bug.
>
> OK, I'm happy enough with the explanation to take this patch.  But
> please think about the per-entry locks you mention above.  Those need to
> go if possible.  I'm relatively certain that you will be able to
> demonstrate a *drastic* speed up in this code with them removed (try
> running the cmtime application from librdmacm-utils and see what the
> results are with per-entry locks and a per table lock instead).
>

Ok, refactoring this code for a single lock shouldn't be problematic. 
Regarding performance, I think the results here are vastly impacted by 
the rate we'll add/remove IPs or upper net-devices in the background.

>>
>>> And even if the rcu_read_lock is for sure safe in terms of accessing the
>>> ifa list, these changes may have just introduced a totally new bug that
>>> your QE tests haven't exposed but might exist none the less.  In
>>> particular, we have now queued up adding a bunch of gids to the ndev.
>>> But we drop our reference to the rcu lock, then we invoke a (possibly
>>> large number) of sleeping iterations.  What's to prevent a situation
>>> where we get enum_netdev_ipv4_ips() called on say a vlan child interface
>>> of a primary RoCE netdev, create our address list, release our lock,
>>> then the user destroys our vlan device, and we race with del_netdev_ips
>>> on the vlan device, such that del_netdev_ips completes and removes all
>>> the gids for that netdev, but we still have backlogged gid add events in
>>> enum_netdev_ipv4_ips and so we add back in what will become permanently
>>> stale gids?  I don't think we hold rtnl_lock while running in
>>> enum_netdev_ipv4_ips and that's probably the only lock that would
>>> exclude the user from deleting the vlan device, so as far as I can tell
>>> we can easily call del_netdev_ips while the tail end of
>>> enum_netdev_ipv4_ips is sleeping.  Am I wrong here?  A test would be to
>>> take whatever QE test you have that hit this bug in the first place, and
>>> on a different terminal add a while loop of adding/removing the same
>>> vlan interface that you are updating gids on and see if the gid table
>>> starts filling up with stale, unremovable entries.
>>>
>>
>> The RoCE GID management design uses events handlers and one workqueue.
>> When an event (inet/net) is handled, we hold the net device and
>> execute a work in the workqueue.
>> The works are executed in a queue - thus first-come first-served.
>> That's why if you add/del a vlan (or its IP)
>> we do dev_hold in the event itself. Since the ndev is available in the
>> event and is held when executing the event - it can't be deleted until
>> we handle this
>> event in the workqueue. If the user tries to delete the vlan before
>> our add (inet/ndev) work was completed, we'll get an UNREGISTER event,
>> but since the dev is held, the stack will have to wait until we free
>> all our ref counts to this device. Using a queue guarantees us the
>> order - we'll first complete adding the vlan and then delete it. Only
>> after all reference counts are dropped, the net device could be
>> deleted.
>> Anyway, I'll ask the QA team here to add this test.
>
> OK.  And it works for now, but if there is ever added an additional
> means of running one of these functions that isn't on the work queue,
> then it can break subtly.
>

This workqueue is an important part of this design. We need to impose an 
order, but handle events in a different context (as some IPv6 events 
come from an atomic context).

Doug Ledford Oct. 19, 2015, 3:27 p.m. UTC | #7
On 10/19/2015 10:20 AM, Matan Barak wrote:
> On 10/19/2015 3:23 PM, Doug Ledford wrote:

>> This is a major cause of the slowness.  Unless you have a specific need
>> of them, per-entry rwlocks are *NOT* a good idea.  I was going to bring
>> this up separately, so I'll just mention it here.  Per entry locks help
>> reduce contention when you have lots of multiple accessors.  Using
>> rwlocks help reduce contention when you have a read-mostly entry that is
>> only occasionally changed.  But every lock and every unlock (just like
>> every atomic access) still requires a locked memory cycle.  That means
>> every lock acquire and every lock release requires a synchronization
>> event between all CPUs.  Using per-entry locks means that every entry
>> triggers two of those synchronization events.  On modern Intel CPUs,
>> they cost about 32 cycles per event.  If you are going to do something,
>> and it can't be done without a lock, then grab a single lock and do it
>> as fast as you can.  Only in rare cases would you want per-entry locks.
>>
> 
> I agree that every rwlock costs us locked access. However, lets look at
> the common scenario here.

No, let's not.  Especially if your "common scenario" causes you to
ignore pathological bad behavior.  Optimizing for the common case is not
an excuse to enable pathological behavior.

> I think that in production stable systems, the
> IPs our rdma-cm stack uses (and thus GIDs) should be pretty stable.

Probably true.

> Thus, most accesses should be read calls.

Probably true.

> That's why IMHO read-write
> access makes sense here.

Probably true.

> Regarding single-lock vs per entry lock, it really depends on how common it
> is for an entry to be updated while another entry is being used by an
> application.

No, it depends on more than that.  In addition to the frequency of
updates versus lookups, there is also the issue of lookup cost and
update cost.  Using per entry locks makes both the lookup cost and the
update cost of anything other than the first gid in the table grow
exponentially.  A good way to demonstrate this would be to create 100
vlans on a RoCE device, then run the cmtime test between two hosts using
the 100th vlan IP at each end.  Using per-entry locks, the performance
of this test relative to using the first IP on the device will be
pathologically bad.

> In a (future) dynamic system, you might want to create
> containers dynamically, which will add a net device and change the
> hardware GID table while other application (maybe in another container)
> uses other GIDs, but this might be rare scenario.

This is a non-issue.  Creating a container and an application inside
that container to use an RDMA interface is already a heavy weight
operation (minimum fork/exec/program startup).  Table lookups can be
needed thousands of times per second under certain conditions and must
be fast.  Per-entry locks are only fast for the first item in the list.
 After that, they progressively get slower and slower than usage with a
single lock.

> Ok, refactoring this code for a single lock shouldn't be problematic.
> Regarding performance, I think the results here are vastly impacted by
> the rate we'll add/remove IPs or upper net-devices in the background.

Not really.  You will never add/remove IPs fast enough to justify
per-entry locks, especially when you consider that ib_cache_gid_add
calls find_gid twice, once going over the entire table if the gid isn't
found, before ever calling add_gid.  The ocrdma table isn't bad because
it's only 16 entries.  But the mlx4 table is going to be 128 I suspect,
so you can do the test I mentioned above and use 100 vlans.  In that
scenario, the 101st gid will require that you take and release 128 locks
to scan for the entry in the list, it will come back empty, then you
will take and release 101 locks until you find an empty entry, and then
you will take a write lock as you write the gid.  That's 230 total
locks, for 460 total locked memory operations at 32 cycles each, so a
total of 14,720 cycles spent doing nothing but locking the memory bus.
And each of those operations slows down all of the CPUs in the system.

Where I've seen this sort of behavior totally wreck a production system
is when the GID you need to look up is on port 2.  Even if it is the
firstmost GID on port 2, if you have to search all of port 1's table
first, then by the time you get to port 2's table, you've already lost.

So, here's some suggestions:

1)  Use a single lock on the table per port.
2)  Change find_gid so that it will take an additional element, int
*empty_idx, and as it is scanning the gid table, when it runs across an
empty entry, if empty_idx != NULL, *empty_idx = idx;.  This prevents
needing to scan the array twice.
3)  Consider optimizing the code by requiring that any time a GID is
added, it must take the first empty spot, and any time one is deleted,
any valid entries after it must be moved up to fill the empty spot, then
optimize find_gid to quit once it finds an empty slot because it knows
the rest of the table is empty.  Given that mlx4 keeps a huge 128 table
array, this can be very helpful.
4)  Does the table port struct have to use a mutex?  Could it be changed
to a rwlock instead?  If so, you could change write_gid so that it
assumes it is called with the read lock on the port already held and it
merely updates itself to a write lock when it is ready to write the
entry to the array and update the card.  It would then downgrade back to
a read lock on return and the calling function would hold the single
read lock from the time it is ready to call find_gid until write_gid has
returned.
5)  I noticed that ib_cache_gid_find_by_port calls find_gid without even
taking the read lock, that needs fixed.
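
As a rough sketch of suggestion 2 above (simplified signature and
approximate field names; the real find_gid() in cache.c takes more context
than this):

/* Single scan that also reports the first free slot.  The caller
 * initializes *empty_idx to -1 and can hand it straight to add_gid()
 * when the search comes back empty.
 */
static int find_gid(struct ib_gid_table *table, const union ib_gid *gid,
                    int *empty_idx)
{
    int i;

    for (i = 0; i < table->sz; i++) {
        struct ib_gid_table_entry *entry = &table->data_vec[i];

        if (entry->props & GID_TABLE_ENTRY_INVALID) {
            if (empty_idx && *empty_idx < 0)
                *empty_idx = i;    /* remember the first free slot */
            continue;
        }
        if (!memcmp(gid, &entry->gid, sizeof(*gid)))
            return i;              /* found */
    }

    return -1;    /* not found; *empty_idx (if set) holds a free slot */
}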
Jason Gunthorpe Oct. 19, 2015, 6:26 p.m. UTC | #8
On Sun, Oct 18, 2015 at 10:51:04AM +0300, Matan Barak wrote:

> It's simpler, but every IP change could make us refresh the whole list
> and possibly the GID indices as well.

No, it is trivial to use a set intersection algorithm to convert two
full current state lists into a sequence of add/remove's for the
driver and preserve the indices.
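
A minimal sketch of that reconciliation step (the sync_gid_table() from the
earlier sketch); gid_table_size(), gid_table_read(), gid_in_wanted_list(),
gid_table_del(), gid_table_find() and gid_table_add() are hypothetical
stand-ins for the real per-port table operations:

struct wanted_gid {
    struct list_head node;
    union ib_gid gid;
};

/* Diff the desired GID list (built under rtnl_lock) against the current
 * per-port table: entries present in both keep their index, stale entries
 * are deleted, and new ones are written into free slots.
 */
static void sync_gid_table(struct ib_device *ib_dev, u8 port,
                           struct list_head *wanted)
{
    struct wanted_gid *w;
    int i;

    /* Pass 1: delete table entries that are no longer wanted. */
    for (i = 0; i < gid_table_size(ib_dev, port); i++) {
        union ib_gid cur;

        if (gid_table_read(ib_dev, port, i, &cur))
            continue;                          /* empty slot */
        if (!gid_in_wanted_list(&cur, wanted))
            gid_table_del(ib_dev, port, i);
    }

    /* Pass 2: add wanted entries that are missing, into free slots. */
    list_for_each_entry(w, wanted, node)
        if (gid_table_find(ib_dev, port, &w->gid) < 0)
            gid_table_add(ib_dev, port, &w->gid);
}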

Jason
Matan Barak Oct. 20, 2015, 2:50 p.m. UTC | #9
On 10/19/2015 6:27 PM, Doug Ledford wrote:
> On 10/19/2015 10:20 AM, Matan Barak wrote:
>> On 10/19/2015 3:23 PM, Doug Ledford wrote:
>
>>> This is a major cause of the slowness.  Unless you have a specific need
>>> of them, per-entry rwlocks are *NOT* a good idea.  I was going to bring
>>> this up separately, so I'll just mention it here.  Per entry locks help
>>> reduce contention when you have lots of multiple accessors.  Using
>>> rwlocks help reduce contention when you have a read-mostly entry that is
>>> only occasionally changed.  But every lock and every unlock (just like
>>> every atomic access) still requires a locked memory cycle.  That means
>>> every lock acquire and every lock release requires a synchronization
>>> event between all CPUs.  Using per-entry locks means that every entry
>>> triggers two of those synchronization events.  On modern Intel CPUs,
>>> they cost about 32 cycles per event.  If you are going to do something,
>>> and it can't be done without a lock, then grab a single lock and do it
>>> as fast as you can.  Only in rare cases would you want per-entry locks.
>>>
>>
>> I agree that every rwlock costs us locked access. However, lets look at
>> the common scenario here.
>
> No, let's not.  Especially if your "common scenario" causes you to
> ignore pathological bad behavior.  Optimizing for the common case is not
> an excuse to enable pathological behavior.
>
>> I think that in production stable systems, the
>> IPs our rdma-cm stack uses (and thus GIDs) should be pretty stable.
>
> Probably true.
>
>> Thus, most accesses should be read calls.
>
> Probably true.
>
>> That's why IMHO read-write
>> access makes sense here.
>
> Probably true.
>
>> Regarding single-lock vs per entry lock, it really depends on how common it
>> is for an entry to be updated while another entry is being used by an
>> application.
>
> No, it depends on more than that.  In addition to the frequency of
> updates versus lookups, there is also the issue of lookup cost and
> update cost.  Using per entry locks makes both the lookup cost and the
> update cost of anything other than the first gid in the table grow
> exponentially.  A good way to demonstrate this would be to create 100
> vlans on a RoCE device, then run the cmtime test between two hosts using
> the 100th vlan IP at each end.  Using per-entry locks, the performance
> of this test relative to using the first IP on the device will be
> pathologically bad.
>

That's correct, we "waste" #entries * <time of lock and unlock>. But we 
have to note this is mainly during the connecting part.

>> In a (future) dynamic system, you might want to create
>> containers dynamically, which will add a net device and change the
>> hardware GID table while other application (maybe in another container)
>> uses other GIDs, but this might be rare scenario.
>
> This is a non-issue.  Creating a container and an application inside
> that container to use an RDMA interface is already a heavy weight
> operation (minimum fork/exec/program startup).  Table lookups can be
> needed thousands of times per second under certain conditions and must
> be fast.  Per-entry locks are only fast for the first item in the list.
>   After that, they progressively get slower and slower than usage with a
> single lock.
>

We first used RCU and seqcount in order to protect each entry. When we 
changed that to rwlock, we added some per-entry work.

>> Ok, refactoring this code for a single lock shouldn't be problematic.
>> Regarding performance, I think the results here are vastly impacted by
>> the rate we'll add/remove IPs or upper net-devices in the background.
>
> Not really.  You will never add/remove IPs fast enough to justify
> per-entry locks, especially when you consider that ib_cache_gid_add
> calls find_gid twice, once going over the entire table if the gid isn't
> found, before ever calling add_gid.  The ocrdma table isn't bad because
> it's only 16 entries.  But the mlx4 table is going to be 128 I suspect,
> so you can do the test I mentioned above and use 100 vlans.  In that
> scenario, the 101st gid will require that you take and release 128 locks
> to scan for the entry in the list, it will come back empty, then you
> will take and release 101 locks until you find an empty entry, and then
> you will take a write lock as you write the gid.  That's 230 total
> locks, for 460 total locked memory operations at 32 cycles each, so a
> total of 14,720 cycles spent doing nothing but locking the memory bus.
> And each of those operations slows down all of the CPUs in the system.
>
> Where I've seen this sort of behavior totally wreck a production system
> is when the GID you need to look up is on port 2.  Even if it is the
> firstmost GID on port 2, if you have to search all of port 1's table
> first, then by the time you get to port 2's table, you've already lost.
>
> So, here's some suggestions:
>
> 1)  Use a single lock on the table per port.

Makes sense as an enhancement.

> 2)  Change find_gid so that it will take an additional element, int
> *empty_idx, and as it is scanning the gid table, when it runs across an
> empty entry, if empty_idx != NULL, *empty_idx = idx;.  This prevents
> needing to scan the array twice.

Makes sense as an enhancement.

> 3)  Consider optimizing the code by requiring that any time a GID is
> added, it must take the first empty spot, and any time one is deleted,
> any valid entries after it must be moved up to fill the empty spot, then
> optimize find_gid to quit once it finds an empty slot because it knows
> the rest of the table is empty.  Given that mlx4 keeps a huge 128 table
> array, this can be very helpful.

I don't think we want to change existing GID entries. Entities sometimes 
want to refer to GIDs via GID indices. Changing the GIDs order under 
their feet could be problematic.

> 4)  Does the table port struct have to use a mutex?  Could it be changed
> to a rwlock instead?  If so, you could changing write_gid so that it
> assumes it is called with the read lock on the port already held and it
> merely updates itself to a write lock when it is ready to write the
> entry to the array and update the card.  It would then downgrade back to
> a read lock on return and the calling function would hold the single
> read lock from the time it is ready to call find_gid until write_gid has
> returned.

AFAIK, read_locks can't be "upgraded" to write_locks. In addition, the 
cost of the locks here is negligible (assuming it's a per-table rwlock 
and not per entry lock). In all code paths that use the table mutex, we 
call the vendor's command to write the GID table. Updating the GID table 
usually costs a lot more than the mutex locks.

> 5)  I noticed that ib_cache_gid_find_by_port calls find_gid without even
> taking the read lock, that needs fixed.
>
>

find_gid takes a per-entry read_lock. Which lock do you think is missing 
here?

Although it's mostly used in the connection part, I think the locking 
refactoring here is a good performance enhancement.
Because it is an enhancement, do you agree we could work here by gradual 
changes and change the locking schema later?
Doug Ledford Oct. 20, 2015, 4:52 p.m. UTC | #10
On 10/20/2015 10:50 AM, Matan Barak wrote:
> 
> 
> On 10/19/2015 6:27 PM, Doug Ledford wrote:
>> On 10/19/2015 10:20 AM, Matan Barak wrote:
>>> On 10/19/2015 3:23 PM, Doug Ledford wrote:
>>
>>>> This is a major cause of the slowness.  Unless you have a specific need
>>>> of them, per-entry rwlocks are *NOT* a good idea.  I was going to bring
>>>> this up separately, so I'll just mention it here.  Per entry locks help
>>>> reduce contention when you have lots of multiple accessors.  Using
>>>> rwlocks help reduce contention when you have a read-mostly entry
>>>> that is
>>>> only occasionally changed.  But every lock and every unlock (just like
>>>> every atomic access) still requires a locked memory cycle.  That means
>>>> every lock acquire and every lock release requires a synchronization
>>>> event between all CPUs.  Using per-entry locks means that every entry
>>>> triggers two of those synchronization events.  On modern Intel CPUs,
>>>> they cost about 32 cycles per event.  If you are going to do something,
>>>> and it can't be done without a lock, then grab a single lock and do it
>>>> as fast as you can.  Only in rare cases would you want per-entry locks.
>>>>
>>>
>>> I agree that every rwlock costs us locked access. However, lets look at
>>> the common scenario here.
>>
>> No, let's not.  Especially if your "common scenario" causes you to
>> ignore pathological bad behavior.  Optimizing for the common case is not
>> an excuse to enable pathological behavior.
>>
>>> I think that in production stable systems, the
>>> IPs our rdma-cm stack uses (and thus GIDs) should be pretty stable.
>>
>> Probably true.
>>
>>> Thus, most accesses should be read calls.
>>
>> Probably true.
>>
>>> That's why IMHO read-write
>>> access makes sense here.
>>
>> Probably true.
>>
>>> Regarding single-lock vs per entry lock, it really depends on how common it
>>> is for an entry to be updated while another entry is being used by an
>>> application.
>>
>> No, it depends on more than that.  In addition to the frequency of
>> updates versus lookups, there is also the issue of lookup cost and
>> update cost.  Using per entry locks makes both the lookup cost and the
>> update cost of anything other than the first gid in the table grow
>> exponentially.  A good way to demonstrate this would be to create 100
>> vlans on a RoCE device, then run the cmtime test between two hosts using
>> the 100th vlan IP at each end.  Using per-entry locks, the performance
>> of this test relative to using the first IP on the device will be
>> pathologically bad.
>>
> 
> That's correct, we "waste" #entries * <time of lock and unlock>. But we
> have to note this is mainly during the connecting part.

Yes, and I spent a *month* of my life tracking down why a customer's
real-world application was failing in production (but not in their
staging environment).  It all boiled down to this: their app had to start
up more than about 700-900 connections at startup, their setup included
dual-port controllers with both ports in use, and some of those
connections had to look up GIDs on the second port.  The abysmal per-gid
lookup rate meant that the kernel had a hard ceiling on how many GIDs it
could look up per second.  Once that was hit, the app went into a
permanent cycle of dropping failed connections and re-opening new ones,
and it would simply never catch up.

So it concerns me greatly to hear you say "But we have to note this is
mainly during the connecting part", implying that some wasted cycles on
connection startup are OK.  They aren't.  Not ever.  Production systems
fall over dead precisely because of things like this.

>> 3)  Consider optimizing the code by requiring that any time a GID is
>> added, it must take the first empty spot, and any time one is deleted,
>> any valid entries after it must be moved up to fill the empty spot, then
>> optimize find_gid to quit once it finds an empty slot because it knows
>> the rest of the table is empty.  Given that mlx4 keeps a huge 128 table
>> array, this can be very helpful.
> 
> I don't think we want to change existing GID entries. Entities sometimes
> want to refer to GIDs via GID indices. Changing the GIDs order under
> their feet could be problematic.

I had a feeling this one might be an issue.  There is, however, a
solution that would resolve the problem.  The issue here is that you
have an array of GIDs that you want to be able to both scan efficiently
and access directly.  The array is great for direct access, but the scan
efficiently part is a total looser because you scan so many empty
entries most of the time.  You could change it so that a gid port table
is actually a combination of 1) an array of gid_port_table_element
pointers and 2) a linked list of gid_port_table_element structs.  Then
you only allocate a struct when needed, link it into the list, assign it
an index, set the pointer in the array at that index to point to your
struct, and that way you can still direct access the elements, but can
also scan efficiently.  It's more complex, but would speed up the
scanning considerably.
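
A minimal sketch of that layout, with hypothetical names and fields rather
than the actual ib_cache structures, could look like this:

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/string.h>
#include <rdma/ib_verbs.h>

/* Hypothetical element, allocated only when a slot is actually in use. */
struct gid_port_table_element {
	struct list_head	node;	/* links the live entries for scanning */
	union ib_gid		gid;
	int			index;	/* stable index handed out to users */
};

/* Hypothetical table: direct access by index plus a compact scan list. */
struct gid_port_table {
	rwlock_t			  lock;
	int				  sz;
	struct gid_port_table_element	**slots;   /* slots[index] or NULL */
	struct list_head		  entries; /* only the live elements */
};

/* A lookup walks only the allocated entries, never the empty slots. */
static int gid_port_table_find(struct gid_port_table *table,
			       const union ib_gid *gid)
{
	struct gid_port_table_element *elem;
	int index = -1;

	read_lock(&table->lock);
	list_for_each_entry(elem, &table->entries, node) {
		if (!memcmp(&elem->gid, gid, sizeof(*gid))) {
			index = elem->index;
			break;
		}
	}
	read_unlock(&table->lock);

	return index;
}

Direct access by index stays O(1) through slots[], while the cost of a scan
tracks the number of live GIDs instead of the size of the table.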

>> 4)  Does the table port struct have to use a mutex?  Could it be changed
>> to an rwlock instead?  If so, you could change write_gid so that it
>> assumes it is called with the read lock on the port already held and it
>> merely updates itself to a write lock when it is ready to write the
>> entry to the array and update the card.  It would then downgrade back to
>> a read lock on return and the calling function would hold the single
>> read lock from the time it is ready to call find_gid until write_gid has
>> returned.
> 
> AFAIK, read_locks can't be "upgraded" to write_locks. In addition, the
> cost of the locks here is negligible (assuming it's a per-table rwlock
> and not a per-entry lock). In all code paths that use the table mutex, we
> call the vendor's command to write the GID table. Updating the GID table
> usually costs a lot more than the mutex locks.

I thought the kernel allowed upgrades, but I admit I didn't check.
Anyway, it was merely to prevent the case where you drop the read lock
and try to take the write lock and other readers (or writers) rush in.  You could
also take the write lock from the beginning, but that might hurt
concurrency.
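
Roughly, the pattern being described would look like the following sketch
(made-up names; since rwlock_t has no upgrade, the slot found under the
read lock has to be re-validated once the lock is retaken for write):

#include <linux/errno.h>
#include <linux/spinlock.h>
#include <linux/string.h>
#include <rdma/ib_verbs.h>

/* Illustrative table with a single rwlock, not the real ib_cache code. */
struct demo_table {
	rwlock_t	lock;
	int		sz;
	union ib_gid	*gids;
};

/* Must be called with table->lock held (read or write). */
static int demo_find_locked(struct demo_table *table, const union ib_gid *gid)
{
	int i;

	for (i = 0; i < table->sz; i++)
		if (!memcmp(&table->gids[i], gid, sizeof(*gid)))
			return i;
	return -1;
}

/* Find the slot under the read lock, drop it, then retake the lock for
 * write.  The window between read_unlock() and write_lock() is exactly
 * where other readers or writers can rush in, so the slot is re-checked
 * before it is overwritten.
 */
static int demo_replace_gid(struct demo_table *table,
			    const union ib_gid *old_gid,
			    const union ib_gid *new_gid)
{
	int ix;

retry:
	read_lock(&table->lock);
	ix = demo_find_locked(table, old_gid);
	read_unlock(&table->lock);

	if (ix < 0)
		return -ENOENT;

	write_lock(&table->lock);
	if (memcmp(&table->gids[ix], old_gid, sizeof(*old_gid))) {
		write_unlock(&table->lock);
		goto retry;	/* the slot changed in the window, start over */
	}
	table->gids[ix] = *new_gid;	/* stand-in for the vendor GID write */
	write_unlock(&table->lock);

	return 0;
}

Note that the real vendor GID write can sleep, so it could not sit under a
spinning rwlock as the stand-in assignment does here, which is presumably
part of why the current code uses a mutex.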

>> 5)  I noticed that ib_cache_gid_find_by_port calls find_gid without even
>> taking the read lock; that needs to be fixed.
> 
> find_gid takes a per-entry read_lock. Which lock do you think is missing
> here?

The other places hold the table mutex.

> Although it's mostly used in the connection part,

Please don't say that as though it makes the inefficiencies in this code
OK.  It doesn't.  Being on the connection startup or teardown path is
not an excuse.

> I think the locking
> refactoring here is a good performance enhancement.
> Because it is an enhancement, do you agree that we could proceed here
> with gradual changes and change the locking scheme later?

I never intended for this to happen during the -rc phase.  But it would be
good to tackle it soon.
Doug Ledford Oct. 20, 2015, 8:17 p.m. UTC | #11
On 10/15/2015 08:01 AM, Matan Barak wrote:
> When using ifup/ifdown while executing enum_netdev_ipv4_ips,
> ifa could become invalid and cause use after free error.
> Fixing it by protecting with RCU lock.
> 
> Fixes: 03db3a2d81e6 ('IB/core: Add RoCE GID table management')
> Signed-off-by: Matan Barak <matanb@mellanox.com>

This is in my tree for -rc.  Thanks.

> ---
> 
> Hi Doug,
> 
> This patch fixes a bug in RoCE GID table implementation. Under stress conditions
> where ifup/ifdown are used, the ifa pointer could become invalid. Using a
> RCU lock in order to avoid freeing the ifa node (as done in other inet functions
> (for example, inet_addr_onlink).
> 
> Our QA team verified that this patch fixes this issue.
> 
> Thanks,
> Matan
> 
>  drivers/infiniband/core/roce_gid_mgmt.c | 35 +++++++++++++++++++++++++--------
>  1 file changed, 27 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/infiniband/core/roce_gid_mgmt.c b/drivers/infiniband/core/roce_gid_mgmt.c
> index 6b24cba..178f984 100644
> --- a/drivers/infiniband/core/roce_gid_mgmt.c
> +++ b/drivers/infiniband/core/roce_gid_mgmt.c
> @@ -250,25 +250,44 @@ static void enum_netdev_ipv4_ips(struct ib_device *ib_dev,
>  				 u8 port, struct net_device *ndev)
>  {
>  	struct in_device *in_dev;
> +	struct sin_list {
> +		struct list_head	list;
> +		struct sockaddr_in	ip;
> +	};
> +	struct sin_list *sin_iter;
> +	struct sin_list *sin_temp;
>  
> +	LIST_HEAD(sin_list);
>  	if (ndev->reg_state >= NETREG_UNREGISTERING)
>  		return;
>  
> -	in_dev = in_dev_get(ndev);
> -	if (!in_dev)
> +	rcu_read_lock();
> +	in_dev = __in_dev_get_rcu(ndev);
> +	if (!in_dev) {
> +		rcu_read_unlock();
>  		return;
> +	}
>  
>  	for_ifa(in_dev) {
> -		struct sockaddr_in ip;
> +		struct sin_list *entry = kzalloc(sizeof(*entry), GFP_ATOMIC);
>  
> -		ip.sin_family = AF_INET;
> -		ip.sin_addr.s_addr = ifa->ifa_address;
> -		update_gid_ip(GID_ADD, ib_dev, port, ndev,
> -			      (struct sockaddr *)&ip);
> +		if (!entry) {
> +			pr_warn("roce_gid_mgmt: couldn't allocate entry for IPv4 update\n");
> +			continue;
> +		}
> +		entry->ip.sin_family = AF_INET;
> +		entry->ip.sin_addr.s_addr = ifa->ifa_address;
> +		list_add_tail(&entry->list, &sin_list);
>  	}
>  	endfor_ifa(in_dev);
> +	rcu_read_unlock();
>  
> -	in_dev_put(in_dev);
> +	list_for_each_entry_safe(sin_iter, sin_temp, &sin_list, list) {
> +		update_gid_ip(GID_ADD, ib_dev, port, ndev,
> +			      (struct sockaddr *)&sin_iter->ip);
> +		list_del(&sin_iter->list);
> +		kfree(sin_iter);
> +	}
>  }
>  
>  static void enum_netdev_ipv6_ips(struct ib_device *ib_dev,
>
Matan Barak Nov. 16, 2015, 1:17 p.m. UTC | #12
On Tue, Oct 20, 2015 at 10:17 PM, Doug Ledford <dledford@redhat.com> wrote:
> On 10/15/2015 08:01 AM, Matan Barak wrote:
>> When using ifup/ifdown while executing enum_netdev_ipv4_ips,
>> ifa could become invalid and cause use after free error.
>> Fixing it by protecting with RCU lock.
>>
>> Fixes: 03db3a2d81e6 ('IB/core: Add RoCE GID table management')
>> Signed-off-by: Matan Barak <matanb@mellanox.com>
>
> This is in my tree for -rc.  Thanks.
>

The series "[PATCH for-next 0/3] Change per-entry locks in GID cache
to table lock" that I posted a while ago addresses the locking concerns.
Thanks.

diff mbox

Patch

diff --git a/drivers/infiniband/core/roce_gid_mgmt.c b/drivers/infiniband/core/roce_gid_mgmt.c
index 6b24cba..178f984 100644
--- a/drivers/infiniband/core/roce_gid_mgmt.c
+++ b/drivers/infiniband/core/roce_gid_mgmt.c
@@ -250,25 +250,44 @@  static void enum_netdev_ipv4_ips(struct ib_device *ib_dev,
 				 u8 port, struct net_device *ndev)
 {
 	struct in_device *in_dev;
+	struct sin_list {
+		struct list_head	list;
+		struct sockaddr_in	ip;
+	};
+	struct sin_list *sin_iter;
+	struct sin_list *sin_temp;
 
+	LIST_HEAD(sin_list);
 	if (ndev->reg_state >= NETREG_UNREGISTERING)
 		return;
 
-	in_dev = in_dev_get(ndev);
-	if (!in_dev)
+	rcu_read_lock();
+	in_dev = __in_dev_get_rcu(ndev);
+	if (!in_dev) {
+		rcu_read_unlock();
 		return;
+	}
 
 	for_ifa(in_dev) {
-		struct sockaddr_in ip;
+		struct sin_list *entry = kzalloc(sizeof(*entry), GFP_ATOMIC);
 
-		ip.sin_family = AF_INET;
-		ip.sin_addr.s_addr = ifa->ifa_address;
-		update_gid_ip(GID_ADD, ib_dev, port, ndev,
-			      (struct sockaddr *)&ip);
+		if (!entry) {
+			pr_warn("roce_gid_mgmt: couldn't allocate entry for IPv4 update\n");
+			continue;
+		}
+		entry->ip.sin_family = AF_INET;
+		entry->ip.sin_addr.s_addr = ifa->ifa_address;
+		list_add_tail(&entry->list, &sin_list);
 	}
 	endfor_ifa(in_dev);
+	rcu_read_unlock();
 
-	in_dev_put(in_dev);
+	list_for_each_entry_safe(sin_iter, sin_temp, &sin_list, list) {
+		update_gid_ip(GID_ADD, ib_dev, port, ndev,
+			      (struct sockaddr *)&sin_iter->ip);
+		list_del(&sin_iter->list);
+		kfree(sin_iter);
+	}
 }
 
 static void enum_netdev_ipv6_ips(struct ib_device *ib_dev,