diff mbox

[v3,for-next,02/33] IB/core: Add kref to IB devices

Message ID 9f65de5e-ed5f-48d2-bff2-03ffbe4f4876@CMEXHTCAS2.ad.emulex.com (mailing list archive)
State Rejected
Headers show

Commit Message

Somnath Kotur March 25, 2015, 9:19 p.m. UTC
From: Matan Barak <matanb@mellanox.com>

Previously. we used device_mutex lock in order to protect
the device's list. That means that in order to guarantee a
device isn't freed while we use it, we had to lock all
devices.

Adding a kref per IB device. Before an IB device
is unregistered, we wait before its not held anymore.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
---
 drivers/infiniband/core/device.c | 41 ++++++++++++++++++++++++++++++++++++++++
 include/rdma/ib_verbs.h          |  6 ++++++
 2 files changed, 47 insertions(+)

Comments

Bart Van Assche March 25, 2015, 11:46 p.m. UTC | #1
On 03/25/2015 02:19 PM, Somnath Kotur wrote:
> +static void ib_device_complete_cb(struct kref *kref)
> +{
> +	struct ib_device *device = container_of(kref, struct ib_device,
> +						refcount);
> +
> +	if (device->reg_state >= IB_DEV_UNREGISTERING)
> +		complete(&device->free);
> +}

> @@ -355,6 +393,9 @@ void ib_unregister_device(struct ib_device *device)
>
>   	ib_device_unregister_sysfs(device);
>
> +	ib_device_put(device);
> +	wait_for_completion(&device->free);

Why is it necessary here to wait until the last reference is gone ? Why 
doesn't ib_device_complete_cb() free any memory ?

Bart.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matan Barak April 14, 2015, 1:27 p.m. UTC | #2
On 3/26/2015 1:46 AM, Bart Van Assche wrote:
> On 03/25/2015 02:19 PM, Somnath Kotur wrote:
>> +static void ib_device_complete_cb(struct kref *kref)
>> +{
>> +    struct ib_device *device = container_of(kref, struct ib_device,
>> +                        refcount);
>> +
>> +    if (device->reg_state >= IB_DEV_UNREGISTERING)
>> +        complete(&device->free);
>> +}
>
>> @@ -355,6 +393,9 @@ void ib_unregister_device(struct ib_device *device)
>>
>>       ib_device_unregister_sysfs(device);
>>
>> +    ib_device_put(device);
>> +    wait_for_completion(&device->free);
>
> Why is it necessary here to wait until the last reference is gone ? Why
> doesn't ib_device_complete_cb() free any memory ?
>

IMHO, ib_unregister_device should be a blocking call. The caller would 
like to be certain that any usage of the IB device is completed before 
it frees other resources/memory it allocated.

> Bart.
>

Matan
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz April 26, 2015, 8:10 p.m. UTC | #3
On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur
<somnath.kotur@emulex.com> wrote:
> From: Matan Barak <matanb@mellanox.com>
>
> Previously. we used device_mutex lock in order to protect
> the device's list. That means that in order to guarantee a
> device isn't freed while we use it, we had to lock all
> devices.

Matan, looking on the cover letter, it says: "[...] Patch 0002 adds a
reference count mechanism to IB devices. This mechanism is similar to
dev_hold and dev_put available for net devices. This is mandatory for
later patches [...]"

So in that respect, saying here "Previously. we used device_mutex
lock" is a bit cryptic, @ least one typo must exist in this sentence,
right? did you want to say "Currently we use device_mutex lock for XXX
[...] and this should be replaced as of a YYY change which is to be
introduced [...]" please clarify

> Adding a kref per IB device. Before an IB device
> is unregistered, we wait before its not held anymore.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matan Barak April 27, 2015, 8:25 a.m. UTC | #4
On 4/26/2015 11:10 PM, Or Gerlitz wrote:
> On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur
> <somnath.kotur@emulex.com> wrote:
>> From: Matan Barak <matanb@mellanox.com>
>>
>> Previously. we used device_mutex lock in order to protect
>> the device's list. That means that in order to guarantee a
>> device isn't freed while we use it, we had to lock all
>> devices.
>
> Matan, looking on the cover letter, it says: "[...] Patch 0002 adds a
> reference count mechanism to IB devices. This mechanism is similar to
> dev_hold and dev_put available for net devices. This is mandatory for
> later patches [...]"
>
> So in that respect, saying here "Previously. we used device_mutex
> lock" is a bit cryptic, @ least one typo must exist in this sentence,
> right? did you want to say "Currently we use device_mutex lock for XXX
> [...] and this should be replaced as of a YYY change which is to be
> introduced [...]" please clarify

Correct, I'll change that into:

Currently we use device_mutex lock for protecting the device's list. In 
the current approach, in order to guarantee a device isn't freed we have 
to lock all devices.

Adding a kref per IB device. Before an IB device is unregistered, we 
wait until it's not held anymore.

>
>> Adding a kref per IB device. Before an IB device
>> is unregistered, we wait before its not held anymore.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe April 27, 2015, 4:22 p.m. UTC | #5
On Mon, Apr 27, 2015 at 11:25:56AM +0300, Matan Barak wrote:
> 
> 
> On 4/26/2015 11:10 PM, Or Gerlitz wrote:
> >On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur
> ><somnath.kotur@emulex.com> wrote:
> >>From: Matan Barak <matanb@mellanox.com>
> >>
> >>Previously. we used device_mutex lock in order to protect
> >>the device's list. That means that in order to guarantee a
> >>device isn't freed while we use it, we had to lock all
> >>devices.
> >
> >Matan, looking on the cover letter, it says: "[...] Patch 0002 adds a
> >reference count mechanism to IB devices. This mechanism is similar to
> >dev_hold and dev_put available for net devices. This is mandatory for
> >later patches [...]"
> >
> >So in that respect, saying here "Previously. we used device_mutex
> >lock" is a bit cryptic, @ least one typo must exist in this sentence,
> >right? did you want to say "Currently we use device_mutex lock for XXX
> >[...] and this should be replaced as of a YYY change which is to be
> >introduced [...]" please clarify
> 
> Correct, I'll change that into:
> 
> Currently we use device_mutex lock for protecting the device's list.
> In the current approach, in order to guarantee a device isn't freed
> we have to lock all devices.
> 
> Adding a kref per IB device. Before an IB device is unregistered, we
> wait until it's not held anymore.

Why do we need two krefs for this structure? There is already a kref
inside the embedded 'struct device dev'

Sounds wrong to me without a lot more explanation.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matan Barak April 28, 2015, 8:32 a.m. UTC | #6
On 4/27/2015 7:22 PM, Jason Gunthorpe wrote:
> On Mon, Apr 27, 2015 at 11:25:56AM +0300, Matan Barak wrote:
>>
>>
>> On 4/26/2015 11:10 PM, Or Gerlitz wrote:
>>> On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur
>>> <somnath.kotur@emulex.com> wrote:
>>>> From: Matan Barak <matanb@mellanox.com>
>>>>
>>>> Previously. we used device_mutex lock in order to protect
>>>> the device's list. That means that in order to guarantee a
>>>> device isn't freed while we use it, we had to lock all
>>>> devices.
>>>
>>> Matan, looking on the cover letter, it says: "[...] Patch 0002 adds a
>>> reference count mechanism to IB devices. This mechanism is similar to
>>> dev_hold and dev_put available for net devices. This is mandatory for
>>> later patches [...]"
>>>
>>> So in that respect, saying here "Previously. we used device_mutex
>>> lock" is a bit cryptic, @ least one typo must exist in this sentence,
>>> right? did you want to say "Currently we use device_mutex lock for XXX
>>> [...] and this should be replaced as of a YYY change which is to be
>>> introduced [...]" please clarify
>>
>> Correct, I'll change that into:
>>
>> Currently we use device_mutex lock for protecting the device's list.
>> In the current approach, in order to guarantee a device isn't freed
>> we have to lock all devices.
>>
>> Adding a kref per IB device. Before an IB device is unregistered, we
>> wait until it's not held anymore.
>
> Why do we need two krefs for this structure? There is already a kref
> inside the embedded 'struct device dev'
>
> Sounds wrong to me without a lot more explanation.
>

This was already asked by Haggai Eran awhile ago and was answered. 
Anyway, in ib_unregister_device we delete all client's related data.
We would like to ensure that all references were released before this
data is being deleted. Meaning, we would like to ensure the device is 
still functioning but isn't referenced rather than just to avoid freeing 
the IB device's memory.
ib_device_get and ib_device_hold are APIs for the clients, similar to
dev_hold and dev_put.

> Jason
>

Regards,
Matan
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz April 28, 2015, 11:51 a.m. UTC | #7
On Mon, Apr 27, 2015 at 11:25 AM, Matan Barak <matanb@mellanox.com> wrote:
> On 4/26/2015 11:10 PM, Or Gerlitz wrote:
>> On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur
>> <somnath.kotur@emulex.com> wrote:
>>>
>>> From: Matan Barak <matanb@mellanox.com>
>>>
>>> Previously. we used device_mutex lock in order to protect
>>> the device's list. That means that in order to guarantee a
>>> device isn't freed while we use it, we had to lock all
>>> devices.
>>
>>
>> Matan, looking on the cover letter, it says: "[...] Patch 0002 adds a
>> reference count mechanism to IB devices. This mechanism is similar to
>> dev_hold and dev_put available for net devices. This is mandatory for
>> later patches [...]"

> Correct, I'll change that into:
> Currently we use device_mutex lock for protecting the device's list. In the
> current approach, in order to guarantee a device isn't freed we have to lock
> all devices.
> Adding a kref per IB device. Before an IB device is unregistered, we wait
> until it's not held anymore.

Why is this change mandatory for the proposed design?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matan Barak April 28, 2015, 2:03 p.m. UTC | #8
On 4/28/2015 2:51 PM, Or Gerlitz wrote:
> On Mon, Apr 27, 2015 at 11:25 AM, Matan Barak <matanb@mellanox.com> wrote:
>> On 4/26/2015 11:10 PM, Or Gerlitz wrote:
>>> On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur
>>> <somnath.kotur@emulex.com> wrote:
>>>>
>>>> From: Matan Barak <matanb@mellanox.com>
>>>>
>>>> Previously. we used device_mutex lock in order to protect
>>>> the device's list. That means that in order to guarantee a
>>>> device isn't freed while we use it, we had to lock all
>>>> devices.
>>>
>>>
>>> Matan, looking on the cover letter, it says: "[...] Patch 0002 adds a
>>> reference count mechanism to IB devices. This mechanism is similar to
>>> dev_hold and dev_put available for net devices. This is mandatory for
>>> later patches [...]"
>
>> Correct, I'll change that into:
>> Currently we use device_mutex lock for protecting the device's list. In the
>> current approach, in order to guarantee a device isn't freed we have to lock
>> all devices.
>> Adding a kref per IB device. Before an IB device is unregistered, we wait
>> until it's not held anymore.
>
> Why is this change mandatory for the proposed design?
>

The cleanup of roce_gid_cache is done in a different context, so we need 
to make sure the device is still alive while doing so. In addition, we 
don't want that the unregistration process of ib_core will free our 
context data.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe April 28, 2015, 4:03 p.m. UTC | #9
On Tue, Apr 28, 2015 at 11:32:08AM +0300, Matan Barak wrote:

> This was already asked by Haggai Eran awhile ago and was answered.
> Anyway, in ib_unregister_device we delete all client's related data.
> We would like to ensure that all references were released before this
> data is being deleted. Meaning, we would like to ensure the device
> is still functioning but isn't referenced rather than just to avoid
> freeing the IB device's memory.

A kref is the wrong datastructure for that purpose.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matan Barak April 28, 2015, 4:17 p.m. UTC | #10
On 4/28/2015 7:03 PM, Jason Gunthorpe wrote:
> On Tue, Apr 28, 2015 at 11:32:08AM +0300, Matan Barak wrote:
>
>> This was already asked by Haggai Eran awhile ago and was answered.
>> Anyway, in ib_unregister_device we delete all client's related data.
>> We would like to ensure that all references were released before this
>> data is being deleted. Meaning, we would like to ensure the device
>> is still functioning but isn't referenced rather than just to avoid
>> freeing the IB device's memory.
>
> A kref is the wrong datastructure for that purpose.

What is the right data structure in your opinion?

>
> Jason
>

Matan
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe April 28, 2015, 5:43 p.m. UTC | #11
On Tue, Apr 28, 2015 at 05:03:11PM +0300, Matan Barak wrote:
 
> The cleanup of roce_gid_cache is done in a different context, so we
> need to make sure the device is still alive while doing so. 

This explanation doesn't look right to me. I don't see anything like
that under roce_gid_cache_cleanup_one ?

Although, there must be a call to the driver's modify_gid to free
context before freeing, and I don't see that obviously happening..

However, all the other async work launched doesn't look safe at all.

So, did you mean that the device must still be alive while all the
other work is running? And the point of this scheme is to guarentee
all the work is flushed? (at least I hope it is, otherwise there are
bigger problems here)

It is just fundamentally wrong to return from ib_client.remove while
async work is still outstanding, the client is expected to deal with
this internally and synchronously.

You don't need IB core help to do this.

Or: This should have been fixed after Haggai brought it up...

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz April 28, 2015, 7:04 p.m. UTC | #12
On Tue, Apr 28, 2015 at 8:43 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Tue, Apr 28, 2015 at 05:03:11PM +0300, Matan Barak wrote:

[...]
> Or: This should have been fixed after Haggai brought it up...

Jason, looking again on the correspondence between Matan and Haggai, I
think this one was sort of left in the air (or actually fell on the
floor),  happens, and indeed we should strive to do better and avoid
that, thanks for the 2nd eye.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matan Barak April 29, 2015, 9:16 a.m. UTC | #13
On 4/28/2015 8:43 PM, Jason Gunthorpe wrote:
> On Tue, Apr 28, 2015 at 05:03:11PM +0300, Matan Barak wrote:
>
>> The cleanup of roce_gid_cache is done in a different context, so we
>> need to make sure the device is still alive while doing so.
>
> This explanation doesn't look right to me. I don't see anything like
> that under roce_gid_cache_cleanup_one ?
>
> Although, there must be a call to the driver's modify_gid to free
> context before freeing, and I don't see that obviously happening..
>

Of course there is:
roce_gid_cache_client_cleanup_one -> 
roce_gid_cache_client_cleanup_one_work -> *work* -> 
roce_gid_cache_client_cleanup_work_handler -> roce_gid_cache_cleanup_one 
-> free_roce_gid_cache -> write_gid -> modify_gid

> However, all the other async work launched doesn't look safe at all.
>

I think it's safe - roce_gid_cache_client_cleanup_one deactivates the 
cache (no new events are handled on the cache). Ongoing events are 
flushed in roce_gid_cache_client_cleanup_one_work 
(flush_workqueue(roce_gid_mgmt_wq)). When roce_gid_cache_cleanup_one is 
called - no work will be done or is currently in process on this cache.

> So, did you mean that the device must still be alive while all the
> other work is running? And the point of this scheme is to guarentee
> all the work is flushed? (at least I hope it is, otherwise there are
> bigger problems here)
>

It's not safe to free the client's context in ib_unregister_device while 
the client isn't done. The obvious solution is to wait in client->remove 
(like you suggested) until the client has finished cleaning up things. 
This doesn't fit our case - since client->remove is called under 
device_lock, but it's possible (for example) that 
roce_rescan_device_work_handler is currently running and waits to grab 
this exact mutex - DEAD LOCK.
Freeing thing asynchronously is a possible solution - we don't lock all 
IB devices but just our device. Moreover, it's also more future proof as 
other clients might want to run other things concurrently.

> It is just fundamentally wrong to return from ib_client.remove while
> async work is still outstanding, the client is expected to deal with
> this internally and synchronously.
>
> You don't need IB core help to do this.
>

netdev is taking a similar approach - please take a look at 
netdev_wait_allrefs

> Or: This should have been fixed after Haggai brought it up...
>
> Jason
>

Matan
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matan Barak April 29, 2015, 3:29 p.m. UTC | #14
On 4/28/2015 8:43 PM, Jason Gunthorpe wrote:
> On Tue, Apr 28, 2015 at 05:03:11PM +0300, Matan Barak wrote:
>
>> The cleanup of roce_gid_cache is done in a different context, so we
>> need to make sure the device is still alive while doing so.
>
> This explanation doesn't look right to me. I don't see anything like
> that under roce_gid_cache_cleanup_one ?
>
> Although, there must be a call to the driver's modify_gid to free
> context before freeing, and I don't see that obviously happening..
>

Of course there is:
roce_gid_cache_client_cleanup_one -> 
roce_gid_cache_client_cleanup_one_work -> *work* -> 
roce_gid_cache_client_cleanup_work_handler -> roce_gid_cache_cleanup_one 
-> free_roce_gid_cache -> write_gid -> modify_gid

> However, all the other async work launched doesn't look safe at all.
>

I think it's safe - roce_gid_cache_client_cleanup_one deactivates the 
cache (no new events are handled on the cache). Ongoing events are 
flushed in roce_gid_cache_client_cleanup_one_work 
(flush_workqueue(roce_gid_mgmt_wq)). When roce_gid_cache_cleanup_one is 
called - no work will be done or is currently in process on this cache.

> So, did you mean that the device must still be alive while all the
> other work is running? And the point of this scheme is to guarentee
> all the work is flushed? (at least I hope it is, otherwise there are
> bigger problems here)
>

It's not safe to free the client's context in ib_unregister_device while 
the client isn't done. The obvious solution is to wait in client->remove 
(like you suggested) until the client has finished cleaning up things. 
This doesn't fit our case - since client->remove is called under 
device_lock, but it's possible (for example) that 
roce_rescan_device_work_handler is currently running and waits to grab 
this exact mutex - DEAD LOCK.
Freeing thing asynchronously is a possible solution - we don't lock all 
IB devices but just our device. Moreover, it's also more future proof as 
other clients might want to run other things concurrently.

> It is just fundamentally wrong to return from ib_client.remove while
> async work is still outstanding, the client is expected to deal with
> this internally and synchronously.
>
> You don't need IB core help to do this.
>

netdev is taking a similar approach - please take a look at 
netdev_wait_allrefs

> Or: This should have been fixed after Haggai brought it up...
>
> Jason
>

Matan
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe April 29, 2015, 4:48 p.m. UTC | #15
> >Although, there must be a call to the driver's modify_gid to free
> >context before freeing, and I don't see that obviously happening..
 
> Of course there is:
> roce_gid_cache_client_cleanup_one ->
> roce_gid_cache_client_cleanup_one_work -> *work* ->
> roce_gid_cache_client_cleanup_work_handler ->
> roce_gid_cache_cleanup_one -> free_roce_gid_cache -> write_gid ->
> modify_gid

Ah, this wasn't there in the earlier versions. Good.
 
> >However, all the other async work launched doesn't look safe at all.

> I think it's safe - roce_gid_cache_client_cleanup_one deactivates
> the cache (no new events are handled on the cache). Ongoing events
> are flushed in roce_gid_cache_client_cleanup_one_work

Do you mean roce_gid_cache_client_cleanup_work_handler? 

> It's not safe to free the client's context in ib_unregister_device
> while the client isn't done. The obvious solution is to wait in
> client->remove (like you suggested) until the client has finished
> cleaning up things.

This isn't just the obvious solution, it is the *expected* solution.
In the kernel the add/remove style idiom always works like this.

> This doesn't fit our case - since client->remove is called under
> device_lock, but it's possible (for example) that
> roce_rescan_device_work_handler is currently running and waits to
> grab this exact mutex - DEAD LOCK.

Uh, client->remove is most obviously called with
drivers/infiniband/core/device.c:device_mutex held, is that what you
mean? But that can't be right because only four functions hold that
lock and none of them are obviously called from this patch?

If these patches have a locking problem then breaking the add/remove
idiom is not the way to solve it.

Look, four people have asked about this patch, and I still have yet to
see an accurate and convincing answer from you what is actually going
on here. Please actually spend some time to properly research and
describe why the remove callback can't be synchronous.

> >It is just fundamentally wrong to return from ib_client.remove while
> >async work is still outstanding, the client is expected to deal with
> >this internally and synchronously.
> >
> >You don't need IB core help to do this.
> 
> netdev is taking a similar approach - please take a look at
> netdev_wait_allrefs

No, it really isn't, from the attached drivers perspective after
unregister_netdevice returns the driver is not allowed to operate the
net device anymore. netdev_wait_allrefs is part of making
unregister_netdevice synchronous.

The same is true of our IB client attaches, after a client returns
from remove it is not allowed to operate the IB device anymore.

That is a standard idiom, and we'd need a huge compelling reason to go
away from that.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matan Barak April 30, 2015, 8:21 a.m. UTC | #16
On 4/29/2015 7:48 PM, Jason Gunthorpe wrote:
>
>>> Although, there must be a call to the driver's modify_gid to free
>>> context before freeing, and I don't see that obviously happening..
>
>> Of course there is:
>> roce_gid_cache_client_cleanup_one ->
>> roce_gid_cache_client_cleanup_one_work -> *work* ->
>> roce_gid_cache_client_cleanup_work_handler ->
>> roce_gid_cache_cleanup_one -> free_roce_gid_cache -> write_gid ->
>> modify_gid
>
> Ah, this wasn't there in the earlier versions. Good.
>
>>> However, all the other async work launched doesn't look safe at all.
>
>> I think it's safe - roce_gid_cache_client_cleanup_one deactivates
>> the cache (no new events are handled on the cache). Ongoing events
>> are flushed in roce_gid_cache_client_cleanup_one_work
>
> Do you mean roce_gid_cache_client_cleanup_work_handler?
>

Yeah, the flush_workqueue is done in 
roce_gid_cache_client_cleanup_work_handler

>> It's not safe to free the client's context in ib_unregister_device
>> while the client isn't done. The obvious solution is to wait in
>> client->remove (like you suggested) until the client has finished
>> cleaning up things.
>
> This isn't just the obvious solution, it is the *expected* solution.
> In the kernel the add/remove style idiom always works like this.
>
>> This doesn't fit our case - since client->remove is called under
>> device_lock, but it's possible (for example) that
>> roce_rescan_device_work_handler is currently running and waits to
>> grab this exact mutex - DEAD LOCK.
>
> Uh, client->remove is most obviously called with
> drivers/infiniband/core/device.c:device_mutex held, is that what you
> mean? But that can't be right because only four functions hold that
> lock and none of them are obviously called from this patch?
>
> If these patches have a locking problem then breaking the add/remove
> idiom is not the way to solve it.
>
> Look, four people have asked about this patch, and I still have yet to
> see an accurate and convincing answer from you what is actually going
> on here. Please actually spend some time to properly research and
> describe why the remove callback can't be synchronous.
>

ib_unregister_device calls the remove function with device_mutex help. 
In addition, ib_enum_roce_ports_of_netdev does the same. Every 
interesting netdev/inet/inet6 event that's handled in roce_gid_mgmt 
triggers ib_enum_roce_ports_of_netdev by using the workqueue (for 
example, 
netdevice_event->*work*->netdevice_event_work_handler->ib_enum_roce_ports_of_netdev).

When a device is being unregistered, the remove function is called under 
device_mutex:
ib_unregister_device->roce_gid_cache_client_cleanup_one->*work*->roce_gid_cache_client_cleanup_work_handler->flush_workqueue

This flush_workqueue could wait for ib_enum_roce_ports_of_netdev. If we 
would have called it straight from the remove function, we're waiting 
for a work which waits for a mutex that will be unlocked only after the 
iteration over all remove functions is completed -> *DEADLOCK*

You could argue that flush_workqueue isn't needed, but that let's look 
at the following flow:

roce_gid_cache_client_setup_one->roce_rescan_device->*work (with the 
exact ib_dev)*->....
We need to make sure the ib_dev isn't free'd before this work is done.

There might be some ways around it - for example, introduce another 
workqueue for roce_rescan_device and flush this workqueue only. Every 
way has its advantages and disadvantages. I think it's problematic that 
device_mutex can't be held in a work as *most* client works are 
synchronized when the a device is being unregistered. It could affect 
future clients as well.

I'll be happy to explain/fix any issue you have regarding this code, but 
of course it needs to be concrete.

>>> It is just fundamentally wrong to return from ib_client.remove while
>>> async work is still outstanding, the client is expected to deal with
>>> this internally and synchronously.
>>>
>>> You don't need IB core help to do this.
>>
>> netdev is taking a similar approach - please take a look at
>> netdev_wait_allrefs
>
> No, it really isn't, from the attached drivers perspective after
> unregister_netdevice returns the driver is not allowed to operate the
> net device anymore. netdev_wait_allrefs is part of making
> unregister_netdevice synchronous.
>
> The same is true of our IB client attaches, after a client returns
> from remove it is not allowed to operate the IB device anymore.
>

ib_unregister_device is synchronous in the exact same manner - after it 
returns, no client operate on the IB device. 
wait_for_completion(&device->free) was added for this exact reason.

> That is a standard idiom, and we'd need a huge compelling reason to go
> away from that.

A device can be remove if and only if it's reference count is zero. 
That's the only point where we guarantee nobody uses it anymore. That's 
a standard idiom as well.

>
> Jason
>

I really appreciate the review and thanks for looking at this patch,
Matan
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hefty, Sean April 30, 2015, 4:56 p.m. UTC | #17
> >> roce_gid_cache_client_cleanup_one ->
> >> roce_gid_cache_client_cleanup_one_work -> *work* ->
> >> roce_gid_cache_client_cleanup_work_handler ->
> >> roce_gid_cache_cleanup_one -> free_roce_gid_cache -> write_gid ->
> >> modify_gid
> >
> > Ah, this wasn't there in the earlier versions. Good.
> >
> >>> However, all the other async work launched doesn't look safe at all.
> >
> >> I think it's safe - roce_gid_cache_client_cleanup_one deactivates
> >> the cache (no new events are handled on the cache). Ongoing events
> >> are flushed in roce_gid_cache_client_cleanup_one_work
> >
> > Do you mean roce_gid_cache_client_cleanup_work_handler?
> >
> 
> Yeah, the flush_workqueue is done in
> roce_gid_cache_client_cleanup_work_handler

This entire cache seems completely over-architected.  Why is this complexity needed?  How frequently do we really expect these "GIDs" to change?

I still don't even buy that a cache is needed at all.  The RDMA CM can accept as input UDP/IP addresses.  RoCEv2 puts UDP/IP addresses on the wire.  Why should there be a conversion from an IP address to a GID to get back to the same IP address?  Why aren't IP addresses passed directly to the drivers?  Just change how the RDMA CM associates an IP address with the RDMA device.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe April 30, 2015, 5:26 p.m. UTC | #18
On Thu, Apr 30, 2015 at 11:21:01AM +0300, Matan Barak wrote:

> ib_unregister_device calls the remove function with device_mutex
> help. In addition, ib_enum_roce_ports_of_netdev does the same. Every
> interesting netdev/inet/inet6 event that's handled in roce_gid_mgmt
> triggers ib_enum_roce_ports_of_netdev by using the workqueue (for
> example, netdevice_event->*work*->netdevice_event_work_handler->ib_enum_roce_ports_of_netdev).

So, okay, now it is very clear. This should have been described
explicitly in the kref commit message, for instance:

 ... 

 Later commits in this series are going to extend the use of
 device_mutex via ib_enum_roce_ports_of_netdev resulting in ...
 .. To solve this deadlock introduce ....

Part of the job of the patch author is to make review work better by
highlighting the most troublesome areas, remember we have more patch
authors than reviewers so work must be pushed onto the author side.

Lets look at the original commit message provided:

 Previously. we used device_mutex lock in order to protect
 the device's list. 

  ** 'previously' is wrong, this patch does nothing to change what
      device_mutex covers, it still protects the device_list and still
      protects the client_list

 That means that in order to guarantee a
 device isn't freed while we use it, we had to lock all
 devices.

  ** No, locking the device_mutex does nothing to protect a device
     from being freed. The existing kref does that. The device_mutex
     protects the device_list from mutation, and
     ib_enum_roce_ports_of_netdev must hold it when it iterates over
     that list.

     It prevents a device from being unregistered.
     Accurate specificity is important in these commit messages.
     Otherwise nobody understands what is being described.

 Adding a kref per IB device. Before an IB device
 is unregistered, we wait before its not held anymore.

  ** Well, that is what the patch did, but the commit message is
     supposed to explain *why* too

Do you understand why this is so confusing?

> You could argue that flush_workqueue isn't needed, but that let's
> look at the following flow:

No, I wouldn't argue that, all the async work obviously needs to
cancel or complete during remove(), that's what I've been saying.

> There might be some ways around it - for example, introduce another
> workqueue for roce_rescan_device and flush this workqueue only.
> Every way has its advantages and disadvantages.

I don't see that either, like I keep saying, all work must complete,
more work queues don't really seem to help that.

> I think it's problematic that device_mutex can't be held in a work
> as *most* client works are synchronized when the a device is being
> unregistered. It could affect future clients as well.

But until this patch set added ib_enum_roce_ports_of_netdev the
device_mutex would never conflict with anything a client could do.

So, ultimtely, this is really a self created problem, and infact, the
problem lies with the introduction of ib_enum_roce_ports_of_netdev -
adding a new use of the device_mutex that is not register/unregister
related exposes the design limitations of that mutex *that Roland
clearly explained in a giant comment!*

So we need to fix that problem, not add a new mechanism. Off the top
of my head:
 - Split device_mutex into client_mutex and device_mutex,
   hold only the client_mutex when working with the client_list.
 - Convert device_mutex into a r/w sem
 - Use a different scheme to match netdevs for
   ib_enum_roce_ports_of_netdev, that doesn't rely on holding
   device_mutex during query

The first two seem really simple to me. I'd do that.
 
> >No, it really isn't, from the attached drivers perspective after
> >unregister_netdevice returns the driver is not allowed to operate the
> >net device anymore. netdev_wait_allrefs is part of making
> >unregister_netdevice synchronous.
> >
> >The same is true of our IB client attaches, after a client returns
> >from remove it is not allowed to operate the IB device anymore.
> >
> 
> ib_unregister_device is synchronous in the exact same manner - after
> it returns, no client operate on the IB device.
> wait_for_completion(&device->free) was added for this exact reason.

No, you are missing the layering here.

In this context ib_client is the *driver*, and like all drivers once
it's remove method returns the driver can no longer expect the
attached device is operable, and must guarentee the driver's .text is
freeable.

> >That is a standard idiom, and we'd need a huge compelling reason to go
> >away from that.
> 
> A device can be remove if and only if it's reference count is zero.
> That's the only point where we guarantee nobody uses it anymore.
> That's a standard idiom as well.

No actually. The kref is tied to the memory lifetime, and is pretty
much universal that the memory lifetime and device operable lifetime
(ie still registered) are very different things.

This is why using a kref to describe anything other than memory
lifetime is not correct, and two krefs in a structure is obviously
nonsense.

Some places use an 'active count + completion', like netdev, kernfs,
etc to track active users and use that to block an unregister path,
but that isn't a kref.

Okay, lets break down and understand why this is an important standard
guarentee, for our specific case. Lets generalize this scenario a
bit. roce_gid_cache isn't modular, but lets assume it is (after all
the intent of this patch is to weaken the remove() invarient for
everyone.)

On module remove it will call ib_unregister_client, and when
ib_unregister_client returns the module will be become unloaded.

The invariant for module unload requires no module code to be
running and no module code to be runnable in future -- ie all work
must be complete or canceled.

module unload always calls the driver's remove function, that is a
standard idiom.

So, the first thing to notice, the kref patch didn't actually change
ib_unregister_client - it doesn't
'wait_for_completion(&device->free)'.  Immediately we have a
architectural bug, modules can exit while they have outstanding code
runnable in their .text. Oops.

Next, we realize we cannot fix this by waiting on device->free, it is
ostensibly a general lock that all clients use on the ib_device, we
need something isolated to this client.

Finally, we realize if we can isolate something to one client, then
the client can handle it during it's own remove() function, we don't
need core support.

Thus, the way out is to make ib_client.remove() completely synchronous
and *guarentee* that the module's .text will never be used again after
remove() returns. Obviously this means all work is complete or
canceled.

*THIS* is why driver remove is idiomatically always synchronous in the
kernel.

Please appreciate how much time it takes to explain all of this :(
Just fix it already :(

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe April 30, 2015, 5:52 p.m. UTC | #19
On Thu, Apr 30, 2015 at 04:56:13PM +0000, Hefty, Sean wrote:

> I still don't even buy that a cache is needed at all.  The RDMA CM
> can accept as input UDP/IP addresses.  RoCEv2 puts UDP/IP addresses
> on the wire.  Why should there be a conversion from an IP address to
> a GID to get back to the same IP address?  Why aren't IP addresses
> passed directly to the drivers?  Just change how the RDMA CM
> associates an IP address with the RDMA device.

I feel it looks overdesigned too..

But, how could RDMA CM work? Inbound CM messages will be filtered if
the IP is not in the HW GID table? How could UD work?

This current scheme is just so ugly, there are so many wonky
possibilities. What happens if I remove an IP and then add a new one?
The GID index will eventually be re-used, and QPs bound to that gid
index will silently change source IPs. Horrible.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hefty, Sean April 30, 2015, 7:21 p.m. UTC | #20
> But, how could RDMA CM work? Inbound CM messages will be filtered if
> the IP is not in the HW GID table?

I'm not understanding the issue.

If a device has some requirement to program its assigned IP addresses into some HW table, I don't see why upper layers should be made to care.  The IB CM is essentially handling 2 different type of CM messages -- one for IB and one for RoCE.  This just adds a third type.  All three types are similar, with some fields ignored and others formatted using different data.  The CM interface can be updated to better reflect reality, rather than pretending that RoCE has path records or anything other than IB-classic has LIDs.  The CM message definitions themselves could also be updated to indicate which fields matter.

> How could UD work?

I haven't given much thought to UD, but since AV creation goes directly to the driver, I still don't see why GIDs need to be used.  The driver can encode whatever it needs to (e.g. GID/IP index) into the AV.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe April 30, 2015, 9:28 p.m. UTC | #21
On Thu, Apr 30, 2015 at 07:21:08PM +0000, Hefty, Sean wrote:
> > But, how could RDMA CM work? Inbound CM messages will be filtered if
> > the IP is not in the HW GID table?
> 
> I'm not understanding the issue.
> 
> If a device has some requirement to program its assigned IP
> addresses into some HW table, I don't see why upper layers should be
> made to care.

Okay, I was only thinking about the first couple patches in this
series.  Three drivers will have this HW gid table, so having a driver
helper library in the common code makes sense to me.

But then the series it just seems to go crazy - what is with all these
net dev patches? Now we are doing something with bonding? And a lot of
screwy looking rocev2 gid type stuff? And driver updates? And 33 patches worth?

You are totally right, this GID index and GID type stuff is getting
*everywhere*, and it is hard to follow the *why* of all the changes
are really needed.

Matan, if you want to progress this, then split it up.  Make a patch
to library the existing roce GID table driver code and update the
drivers. Just that bit alone has already spawned a huge discussion.

Work the other ancillary unrelated patches in small chuncks through
their trees. 

Then come back to rocev2 with an actual core API proposal patch set
that can be discussed.

Honestly, I'm not willing to even look at a patch set this big and
rambly.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matan Barak May 1, 2015, 6:28 a.m. UTC | #22
On Thu, Apr 30, 2015 at 8:26 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Thu, Apr 30, 2015 at 11:21:01AM +0300, Matan Barak wrote:
>
>> ib_unregister_device calls the remove function with device_mutex
>> help. In addition, ib_enum_roce_ports_of_netdev does the same. Every
>> interesting netdev/inet/inet6 event that's handled in roce_gid_mgmt
>> triggers ib_enum_roce_ports_of_netdev by using the workqueue (for
>> example, netdevice_event->*work*->netdevice_event_work_handler->ib_enum_roce_ports_of_netdev).
>
> So, okay, now it is very clear. This should have been described
> explicitly in the kref commit message, for instance:
>
>  ...
>
>  Later commits in this series are going to extend the use of
>  device_mutex via ib_enum_roce_ports_of_netdev resulting in ...
>  .. To solve this deadlock introduce ....
>
> Part of the job of the patch author is to make review work better by
> highlighting the most troublesome areas, remember we have more patch
> authors than reviewers so work must be pushed onto the author side.
>
> Lets look at the original commit message provided:
>
>  Previously. we used device_mutex lock in order to protect
>  the device's list.
>
>   ** 'previously' is wrong, this patch does nothing to change what
>       device_mutex covers, it still protects the device_list and still
>       protects the client_list
>
>  That means that in order to guarantee a
>  device isn't freed while we use it, we had to lock all
>  devices.
>
>   ** No, locking the device_mutex does nothing to protect a device
>      from being freed. The existing kref does that. The device_mutex
>      protects the device_list from mutation, and
>      ib_enum_roce_ports_of_netdev must hold it when it iterates over
>      that list.
>
>      It prevents a device from being unregistered.
>      Accurate specificity is important in these commit messages.
>      Otherwise nobody understands what is being described.
>
>  Adding a kref per IB device. Before an IB device
>  is unregistered, we wait before its not held anymore.
>
>   ** Well, that is what the patch did, but the commit message is
>      supposed to explain *why* too
>
> Do you understand why this is so confusing?
>

I agree, it should have been clarified better. The "why" is as important as the
"what" and an accurate reason could have make review's life easier.
I'll fix that. Thanks.

>> You could argue that flush_workqueue isn't needed, but that let's
>> look at the following flow:
>
> No, I wouldn't argue that, all the async work obviously needs to
> cancel or complete during remove(), that's what I've been saying.
>
>> There might be some ways around it - for example, introduce another
>> workqueue for roce_rescan_device and flush this workqueue only.
>> Every way has its advantages and disadvantages.
>
> I don't see that either, like I keep saying, all work must complete,
> more work queues don't really seem to help that.
>

I think this is a possible solution, as all works but the "rescan" work
aren't device specific. That means that after we move the rescan work
to another workqueue and synchronize it in the client->remove function,
we could be sure the device won't be used by this client anymore.

The rest of the works only iterate on the *existing* device list. Since the
ib_unregister_device
(a) lock (b) iterate over client->remove (c) remove from list (d) free
(e) unlock
all roce_gid_cache's works won't find it in the list and we'll be safe.

Regarding ib_unregister_client - if we guarantee that after each
client->remove(dev)
dev isn't used in this client, we actually guarantee that after removing all IB
devices no IB device will be used (induction).

What do you think?

>> I think it's problematic that device_mutex can't be held in a work
>> as *most* client works are synchronized when the a device is being
>> unregistered. It could affect future clients as well.
>
> But until this patch set added ib_enum_roce_ports_of_netdev the
> device_mutex would never conflict with anything a client could do.
>
> So, ultimtely, this is really a self created problem, and infact, the
> problem lies with the introduction of ib_enum_roce_ports_of_netdev -
> adding a new use of the device_mutex that is not register/unregister
> related exposes the design limitations of that mutex *that Roland
> clearly explained in a giant comment!*
>

I agree, while it's a general problem - it was first introduced by using
device_mutex in an asynchronous context (that should be flushed in
the remove function).

> So we need to fix that problem, not add a new mechanism. Off the top
> of my head:
>  - Split device_mutex into client_mutex and device_mutex,
>    hold only the client_mutex when working with the client_list.

Seems like a possible nice solution. I'll look into that.

>  - Convert device_mutex into a r/w sem

I'm not sure this will solve the problem, as besides the new enumerate
devices function, all existing functions update the list - so we'll have
read-while-write scenario and we'll be in the exact same condition.

>  - Use a different scheme to match netdevs for
>    ib_enum_roce_ports_of_netdev, that doesn't rely on holding
>    device_mutex during query
>

We could switch to RCU protected list or something similar,
but I honestly don't think it worth the complexity.

> The first two seem really simple to me. I'd do that.
>

Agree, but please consider also the addition of another
workqueue as a possible solution. It *does* (seem) to answer
all of your concerns and could be safer and cause less code changes.

>> >No, it really isn't, from the attached drivers perspective after
>> >unregister_netdevice returns the driver is not allowed to operate the
>> >net device anymore. netdev_wait_allrefs is part of making
>> >unregister_netdevice synchronous.
>> >
>> >The same is true of our IB client attaches, after a client returns
>> >from remove it is not allowed to operate the IB device anymore.
>> >
>>
>> ib_unregister_device is synchronous in the exact same manner - after
>> it returns, no client operate on the IB device.
>> wait_for_completion(&device->free) was added for this exact reason.
>
> No, you are missing the layering here.
>
> In this context ib_client is the *driver*, and like all drivers once
> it's remove method returns the driver can no longer expect the
> attached device is operable, and must guarentee the driver's .text is
> freeable.
>
>> >That is a standard idiom, and we'd need a huge compelling reason to go
>> >away from that.
>>
>> A device can be remove if and only if it's reference count is zero.
>> That's the only point where we guarantee nobody uses it anymore.
>> That's a standard idiom as well.
>
> No actually. The kref is tied to the memory lifetime, and is pretty
> much universal that the memory lifetime and device operable lifetime
> (ie still registered) are very different things.
>
> This is why using a kref to describe anything other than memory
> lifetime is not correct, and two krefs in a structure is obviously
> nonsense.
>
> Some places use an 'active count + completion', like netdev, kernfs,
> etc to track active users and use that to block an unregister path,
> but that isn't a kref.
>

kref just hides an atomic refcount - so you're actually saying using
the abstraction contradicts the kernel object's lifetime strategy?
I get that, I could have used an atomic refcount - but I agree that it's
worth exploring other solutions the will preserve the invariant of
client->remove being synchronous.

> Okay, lets break down and understand why this is an important standard
> guarentee, for our specific case. Lets generalize this scenario a
> bit. roce_gid_cache isn't modular, but lets assume it is (after all
> the intent of this patch is to weaken the remove() invarient for
> everyone.)
>
> On module remove it will call ib_unregister_client, and when
> ib_unregister_client returns the module will be become unloaded.
>
> The invariant for module unload requires no module code to be
> running and no module code to be runnable in future -- ie all work
> must be complete or canceled.
>
> module unload always calls the driver's remove function, that is a
> standard idiom.
>
> So, the first thing to notice, the kref patch didn't actually change
> ib_unregister_client - it doesn't
> 'wait_for_completion(&device->free)'.  Immediately we have a
> architectural bug, modules can exit while they have outstanding code
> runnable in their .text. Oops.
>

I tend to agree. The __exit function flushes the workqueue (so eventually
we guarentee that no code will execute before the module is unloaded).
IMHO, after ib_unregister_client returns, the module could still run code
(this is why the __exit handler exists), but it's not allowed to use any
data of the module which unregistered it, meaning - we can't allow it to
use any IB device.

> Next, we realize we cannot fix this by waiting on device->free, it is
> ostensibly a general lock that all clients use on the ib_device, we
> need something isolated to this client.
>
> Finally, we realize if we can isolate something to one client, then
> the client can handle it during it's own remove() function, we don't
> need core support.
>
> Thus, the way out is to make ib_client.remove() completely synchronous
> and *guarentee* that the module's .text will never be used again after
> remove() returns. Obviously this means all work is complete or
> canceled.
>
> *THIS* is why driver remove is idiomatically always synchronous in the
> kernel.
>

Ok, I get your point, thanks. I'll keep that call synchronous. The two
options which are on the table right now are:
(a) split the device_mutex to device_mutex and client_mutex.
      ib_unregister_device first grabs only client_mutex and after it called
      client->remove it grabs device->mutex to the rest of the work.
(b) Introduce another workqueue for all device-specific events
(actually, only rescan).
      This allows us to flush only this workqueue without waiting for any
      work which needs device_mutex.

> Please appreciate how much time it takes to explain all of this :(
> Just fix it already :(
>

I really appreciate it, again - thanks for looking at this and writing
this *long*
explanation.

> Jason
> --

Thanks a lot,
Matan

> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matan Barak May 1, 2015, 6:34 a.m. UTC | #23
On Thu, Apr 30, 2015 at 8:52 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Thu, Apr 30, 2015 at 04:56:13PM +0000, Hefty, Sean wrote:
>
>> I still don't even buy that a cache is needed at all.  The RDMA CM
>> can accept as input UDP/IP addresses.  RoCEv2 puts UDP/IP addresses
>> on the wire.  Why should there be a conversion from an IP address to
>> a GID to get back to the same IP address?  Why aren't IP addresses
>> passed directly to the drivers?  Just change how the RDMA CM
>> associates an IP address with the RDMA device.
>
> I feel it looks overdesigned too..
>
> But, how could RDMA CM work? Inbound CM messages will be filtered if
> the IP is not in the HW GID table? How could UD work?
>
> This current scheme is just so ugly, there are so many wonky
> possibilities. What happens if I remove an IP and then add a new one?
> The GID index will eventually be re-used, and QPs bound to that gid
> index will silently change source IPs. Horrible.

This should be handled by the vendor's driver/other future ib_core part.
This patchset introduces roce_gid_cache that manages the GID table and
notify vendors about GID changes.

The vendor needs to:
(a) Move all QPs that use GID x to error state when GID x is deleted from
      the table.
(b) Change all QPs that use GID x to use a special invalid GID entry.
(c) Don't delete GIDs that are being used by a QP.

>
> Jason

Matan

> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matan Barak May 1, 2015, 6:41 a.m. UTC | #24
On Fri, May 1, 2015 at 12:28 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Thu, Apr 30, 2015 at 07:21:08PM +0000, Hefty, Sean wrote:
>> > But, how could RDMA CM work? Inbound CM messages will be filtered if
>> > the IP is not in the HW GID table?
>>
>> I'm not understanding the issue.
>>
>> If a device has some requirement to program its assigned IP
>> addresses into some HW table, I don't see why upper layers should be
>> made to care.
>
> Okay, I was only thinking about the first couple patches in this
> series.  Three drivers will have this HW gid table, so having a driver
> helper library in the common code makes sense to me.
>
> But then the series it just seems to go crazy - what is with all these
> net dev patches? Now we are doing something with bonding? And a lot of
> screwy looking rocev2 gid type stuff? And driver updates? And 33 patches worth?
>

Actually, we're in the middle of splitting this series into two series
- the first introduces
the roce_gid_cache management and the second will introduce RoCE V2.
The bonding changes are crucial - the mlx4 driver currently support bonding
and changing to a new mechanism without supporting bonding will cause
a regression.

> You are totally right, this GID index and GID type stuff is getting
> *everywhere*, and it is hard to follow the *why* of all the changes
> are really needed.
>
> Matan, if you want to progress this, then split it up.  Make a patch
> to library the existing roce GID table driver code and update the
> drivers. Just that bit alone has already spawned a huge discussion.
>

It's already in progress. The new series is expected to be a lot smaller.

> Work the other ancillary unrelated patches in small chuncks through
> their trees.
>

The DRV patch was already sent to net. Regarding the bonding and ipv6
patch - we can't change them without a good reason which relies in this
patchset.

> Then come back to rocev2 with an actual core API proposal patch set
> that can be discussed.
>
> Honestly, I'm not willing to even look at a patch set this big and
> rambly.

Ok, hopefully we'll have something soon enough.

>
> Jason

Thanks a lot for the review and your comments.

Matan

> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe May 1, 2015, 5:31 p.m. UTC | #25
On Fri, May 01, 2015 at 09:28:40AM +0300, Matan Barak wrote:

> >> There might be some ways around it - for example, introduce another
> >> workqueue for roce_rescan_device and flush this workqueue only.
> >> Every way has its advantages and disadvantages.
> >
> > I don't see that either, like I keep saying, all work must complete,
> > more work queues don't really seem to help that.
> >
> 
> I think this is a possible solution, as all works but the "rescan" work
> aren't device specific. That means that after we move the rescan work
> to another workqueue and synchronize it in the client->remove function,
> we could be sure the device won't be used by this client anymore.
> 
> The rest of the works only iterate on the *existing* device list. Since the
> ib_unregister_device
> (a) lock (b) iterate over client->remove (c) remove from list (d) free
> (e) unlock
> all roce_gid_cache's works won't find it in the list and we'll be safe.

Well.. be carefull with the locking here, if device_mutex no longer
covers remove() then remove() and ib_enum_roce_ports_of_netdev() can
run concurrently and they will need locking against
free_roce_gid_cache(), which appears to implicitly rely on the
device_mutex (too subtle and fragile)

If there is a 'global' part and a 'device' specific part, it would
good to make this more obvious and clear: group the functions together,
annotate the difference with naming, a comment, or something. There
are *alot* of little functions in this module, it is hard to
understand the call graph without careful study.

> Regarding ib_unregister_client - if we guarantee that after each
> client->remove(dev)
> dev isn't used in this client, we actually guarantee that after removing all IB
> devices no IB device will be used (induction).

Yes, that is the idea..

> > So, ultimtely, this is really a self created problem, and infact, the
> > problem lies with the introduction of ib_enum_roce_ports_of_netdev -
> > adding a new use of the device_mutex that is not register/unregister
> > related exposes the design limitations of that mutex *that Roland
> > clearly explained in a giant comment!*

> I agree, while it's a general problem - it was first introduced by using
> device_mutex in an asynchronous context (that should be flushed in
> the remove function).

Well, no, generally speaking you can't use device_mutex in any client
facing API like ib_enum_roce_ports_of_netdev - it isn't just async
contexts, but it means the add() and remove() functions instantly
deadlock if they call a client facing API - that is not sane.

> >  - Convert device_mutex into a r/w sem
> 
> I'm not sure this will solve the problem, as besides the new enumerate
> devices function, all existing functions update the list - so we'll have
> read-while-write scenario and we'll be in the exact same condition.

You'd do something like hold the write sem and mutate the device list
to unlink the target device, then switch to a read sem to traverse the
client list. Two locks are clearer if contention is not an issue.

> Agree, but please consider also the addition of another
> workqueue as a possible solution. It *does* (seem) to answer
> all of your concerns and could be safer and cause less code changes.

As above, fundamentally, introducing a client API that holds
device_mutex is just wrong, so we need to address this.

I would also recommend splitting your per-device/global stuff anyhow,
global stuff needs a clear unwind during remove, and so on. It isn't
really optional to have this clarity.

> kref just hides an atomic refcount - so you're actually saying using
> the abstraction contradicts the kernel object's lifetime strategy?

kref has specific semantic meaning, it isn't a general use data
structure.

> I tend to agree. The __exit function flushes the workqueue (so eventually
> we guarentee that no code will execute before the module is unloaded).
> IMHO, after ib_unregister_client returns, the module could still run code
> (this is why the __exit handler exists), but it's not allowed to use any
> data of the module which unregistered it, meaning - we can't allow it to
> use any IB device.

Yes, I was being terse, of course the .text is still going to be used,
but the general idea of lifetime holds: the .text cannot remain in
service of the ib_device because the __exit handler cannot clean that
up.

> Ok, I get your point, thanks. I'll keep that call synchronous. The two
> options which are on the table right now are:
> (a) split the device_mutex to device_mutex and client_mutex.
>       ib_unregister_device first grabs only client_mutex and after it called
>       client->remove it grabs device->mutex to the rest of the work.
> (b) Introduce another workqueue for all device-specific events
> (actually, only rescan).
>       This allows us to flush only this workqueue without waiting for any
>       work which needs device_mutex.

The other option is to make rescan sense that the ib_device has been
freed and ignore it, and then flush the work queue that runs that on
module __exit. This might be needed anyhow due to the free race
discussed above. Be sure to add the necessary kref get/puts to hold
the ib device though.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe May 1, 2015, 5:36 p.m. UTC | #26
On Fri, May 01, 2015 at 09:34:24AM +0300, Matan Barak wrote:

> > This current scheme is just so ugly, there are so many wonky
> > possibilities. What happens if I remove an IP and then add a new one?
> > The GID index will eventually be re-used, and QPs bound to that gid
> > index will silently change source IPs. Horrible.
> 
> This should be handled by the vendor's driver/other future ib_core part.
> This patchset introduces roce_gid_cache that manages the GID table and
> notify vendors about GID changes.
> 
> The vendor needs to:
> (a) Move all QPs that use GID x to error state when GID x is deleted from
>       the table.
> (b) Change all QPs that use GID x to use a special invalid GID entry.
> (c) Don't delete GIDs that are being used by a QP.

What about AH's for UD?

What about clients that discover and then hold the GID index
internally?

What about the impossible to fix race of returing the GID index in the
work completion and translating that back to an IP?

It is a terrible scheme, Sean is right, the clients should work with
the actual sock addr, somehow, at least kernel side. Converting from a
sockaddr to a gid index cannot really be done without some kind of
lock and ref count scheme.

At the very least, that should be the starting point, if we can't get
there then patch on a case by case basis why.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matan Barak May 3, 2015, 9:05 a.m. UTC | #27
On 5/1/2015 8:36 PM, Jason Gunthorpe wrote:
> On Fri, May 01, 2015 at 09:34:24AM +0300, Matan Barak wrote:
>
>>> This current scheme is just so ugly, there are so many wonky
>>> possibilities. What happens if I remove an IP and then add a new one?
>>> The GID index will eventually be re-used, and QPs bound to that gid
>>> index will silently change source IPs. Horrible.
>>
>> This should be handled by the vendor's driver/other future ib_core part.
>> This patchset introduces roce_gid_cache that manages the GID table and
>> notify vendors about GID changes.
>>
>> The vendor needs to:
>> (a) Move all QPs that use GID x to error state when GID x is deleted from
>>        the table.
>> (b) Change all QPs that use GID x to use a special invalid GID entry.
>> (c) Don't delete GIDs that are being used by a QP.
>
> What about AH's for UD?
>

The plan is to have read-only memory-mapped AHs for UD. The kernel will
create AH with a sequence counter. This AH will be mapped as read-only
memory to the user-space. When sending, the user-space will atomically 
use this AH. If a GID is changed, the kernel will update this GID index 
internally. That's a long-term goal.

> What about clients that discover and then hold the GID index
> internally?
>
> What about the impossible to fix race of returing the GID index in the
> work completion and translating that back to an IP?
>
> It is a terrible scheme, Sean is right, the clients should work with
> the actual sock addr, somehow, at least kernel side. Converting from a
> sockaddr to a gid index cannot really be done without some kind of
> lock and ref count scheme.

This is the current behavior as well. The current patch-set doesn't make
it any worse or better. We don't expect to fix all world's problems.
We could add reference-count in a later patchset. Working with sockaddr
has its own (similar) problems - if the net-device's IP is changed -
using a sockaddr will just use an old incorrect IP address (which by
now could be prohibited by the administrator).

>
> At the very least, that should be the starting point, if we can't get
> there then patch on a case by case basis why.
>

I agree - we don't want to regress, but we only add the roce_gid_cache 
in this patchset (next version will postpone adding RoCE V2 to later 
patchset).

> Jason
>

Matan

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matan Barak May 5, 2015, 10:58 a.m. UTC | #28
On Fri, May 1, 2015 at 8:31 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Fri, May 01, 2015 at 09:28:40AM +0300, Matan Barak wrote:
>
>> >> There might be some ways around it - for example, introduce another
>> >> workqueue for roce_rescan_device and flush this workqueue only.
>> >> Every way has its advantages and disadvantages.
>> >
>> > I don't see that either, like I keep saying, all work must complete,
>> > more work queues don't really seem to help that.
>> >
>>
>> I think this is a possible solution, as all works but the "rescan" work
>> aren't device specific. That means that after we move the rescan work
>> to another workqueue and synchronize it in the client->remove function,
>> we could be sure the device won't be used by this client anymore.
>>
>> The rest of the works only iterate on the *existing* device list. Since the
>> ib_unregister_device
>> (a) lock (b) iterate over client->remove (c) remove from list (d) free
>> (e) unlock
>> all roce_gid_cache's works won't find it in the list and we'll be safe.
>
> Well.. be carefull with the locking here, if device_mutex no longer
> covers remove() then remove() and ib_enum_roce_ports_of_netdev() can
> run concurrently and they will need locking against
> free_roce_gid_cache(), which appears to implicitly rely on the
> device_mutex (too subtle and fragile)
>
> If there is a 'global' part and a 'device' specific part, it would
> good to make this more obvious and clear: group the functions together,
> annotate the difference with naming, a comment, or something. There
> are *alot* of little functions in this module, it is hard to
> understand the call graph without careful study.
>

I'll try to rearrange the code a bit better.

>> Regarding ib_unregister_client - if we guarantee that after each
>> client->remove(dev)
>> dev isn't used in this client, we actually guarantee that after removing all IB
>> devices no IB device will be used (induction).
>
> Yes, that is the idea..
>
>> > So, ultimtely, this is really a self created problem, and infact, the
>> > problem lies with the introduction of ib_enum_roce_ports_of_netdev -
>> > adding a new use of the device_mutex that is not register/unregister
>> > related exposes the design limitations of that mutex *that Roland
>> > clearly explained in a giant comment!*
>
>> I agree, while it's a general problem - it was first introduced by using
>> device_mutex in an asynchronous context (that should be flushed in
>> the remove function).
>
> Well, no, generally speaking you can't use device_mutex in any client
> facing API like ib_enum_roce_ports_of_netdev - it isn't just async
> contexts, but it means the add() and remove() functions instantly
> deadlock if they call a client facing API - that is not sane.
>

I still haven't looked at this thoroughly, but we might be able to use rwsem
(as you suggested) or a RCU protected device list. I think that changing to
one of these will make ib_enum_roce_ports_of_netdev safe in async context.

>> >  - Convert device_mutex into a r/w sem
>>
>> I'm not sure this will solve the problem, as besides the new enumerate
>> devices function, all existing functions update the list - so we'll have
>> read-while-write scenario and we'll be in the exact same condition.
>
> You'd do something like hold the write sem and mutate the device list
> to unlink the target device, then switch to a read sem to traverse the
> client list. Two locks are clearer if contention is not an issue.
>
>> Agree, but please consider also the addition of another
>> workqueue as a possible solution. It *does* (seem) to answer
>> all of your concerns and could be safer and cause less code changes.
>
> As above, fundamentally, introducing a client API that holds
> device_mutex is just wrong, so we need to address this.
>
> I would also recommend splitting your per-device/global stuff anyhow,
> global stuff needs a clear unwind during remove, and so on. It isn't
> really optional to have this clarity.
>

Ok, I'll look into that for the next version of this patchset -
probably splitting to
rwsem/RCU with clear separation of per-device and global functions.

>> kref just hides an atomic refcount - so you're actually saying using
>> the abstraction contradicts the kernel object's lifetime strategy?
>
> kref has specific semantic meaning, it isn't a general use data
> structure.
>
>> I tend to agree. The __exit function flushes the workqueue (so eventually
>> we guarentee that no code will execute before the module is unloaded).
>> IMHO, after ib_unregister_client returns, the module could still run code
>> (this is why the __exit handler exists), but it's not allowed to use any
>> data of the module which unregistered it, meaning - we can't allow it to
>> use any IB device.
>
> Yes, I was being terse, of course the .text is still going to be used,
> but the general idea of lifetime holds: the .text cannot remain in
> service of the ib_device because the __exit handler cannot clean that
> up.
>
>> Ok, I get your point, thanks. I'll keep that call synchronous. The two
>> options which are on the table right now are:
>> (a) split the device_mutex to device_mutex and client_mutex.
>>       ib_unregister_device first grabs only client_mutex and after it called
>>       client->remove it grabs device->mutex to the rest of the work.
>> (b) Introduce another workqueue for all device-specific events
>> (actually, only rescan).
>>       This allows us to flush only this workqueue without waiting for any
>>       work which needs device_mutex.
>
> The other option is to make rescan sense that the ib_device has been
> freed and ignore it, and then flush the work queue that runs that on
> module __exit. This might be needed anyhow due to the free race
> discussed above. Be sure to add the necessary kref get/puts to hold
> the ib device though.
>
> Jason

Thanks for the review. I'll look thoroughly on all the options you mentioned.

Matan
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 18c1ece..8616a95 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -261,6 +261,39 @@  out:
 	return ret;
 }
 
+static void ib_device_complete_cb(struct kref *kref)
+{
+	struct ib_device *device = container_of(kref, struct ib_device,
+						refcount);
+
+	if (device->reg_state >= IB_DEV_UNREGISTERING)
+		complete(&device->free);
+}
+
+/**
+ * ib_device_hold - increase the reference count of device
+ * @device: ib device to prevent from being free'd
+ *
+ * Prevent the device from being free'd.
+ */
+void ib_device_hold(struct ib_device *device)
+{
+	kref_get(&device->refcount);
+}
+EXPORT_SYMBOL(ib_device_hold);
+
+/**
+ * ib_device_put - decrease the reference count of device
+ * @device: allows this device to be free'd
+ *
+ * Puts the ib_device and allows it to be free'd.
+ */
+int ib_device_put(struct ib_device *device)
+{
+	return kref_put(&device->refcount, ib_device_complete_cb);
+}
+EXPORT_SYMBOL(ib_device_put);
+
 /**
  * ib_register_device - Register an IB device with IB core
  * @device:Device to register
@@ -312,6 +345,9 @@  int ib_register_device(struct ib_device *device,
 
 	list_add_tail(&device->core_list, &device_list);
 
+	kref_init(&device->refcount);
+	init_completion(&device->free);
+
 	device->reg_state = IB_DEV_REGISTERED;
 
 	{
@@ -342,6 +378,8 @@  void ib_unregister_device(struct ib_device *device)
 
 	mutex_lock(&device_mutex);
 
+	device->reg_state = IB_DEV_UNREGISTERING;
+
 	list_for_each_entry_reverse(client, &client_list, list)
 		if (client->remove)
 			client->remove(device);
@@ -355,6 +393,9 @@  void ib_unregister_device(struct ib_device *device)
 
 	ib_device_unregister_sysfs(device);
 
+	ib_device_put(device);
+	wait_for_completion(&device->free);
+
 	spin_lock_irqsave(&device->client_data_lock, flags);
 	list_for_each_entry_safe(context, tmp, &device->client_data_list, list)
 		kfree(context);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 1866595..a7593b0 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1716,6 +1716,7 @@  struct ib_device {
 	enum {
 		IB_DEV_UNINITIALIZED,
 		IB_DEV_REGISTERED,
+		IB_DEV_UNREGISTERING,
 		IB_DEV_UNREGISTERED
 	}                            reg_state;
 
@@ -1728,6 +1729,8 @@  struct ib_device {
 	u32			     local_dma_lkey;
 	u8                           node_type;
 	u8                           phys_port_cnt;
+	struct kref		     refcount;
+	struct completion	     free;
 };
 
 struct ib_client {
@@ -1741,6 +1744,9 @@  struct ib_client {
 struct ib_device *ib_alloc_device(size_t size);
 void ib_dealloc_device(struct ib_device *device);
 
+void ib_device_hold(struct ib_device *device);
+int ib_device_put(struct ib_device *device);
+
 int ib_register_device(struct ib_device *device,
 		       int (*port_callback)(struct ib_device *,
 					    u8, struct kobject *));