Message ID | 9f65de5e-ed5f-48d2-bff2-03ffbe4f4876@CMEXHTCAS2.ad.emulex.com (mailing list archive) |
---|---|
State | Rejected |
Headers | show |
On 03/25/2015 02:19 PM, Somnath Kotur wrote: > +static void ib_device_complete_cb(struct kref *kref) > +{ > + struct ib_device *device = container_of(kref, struct ib_device, > + refcount); > + > + if (device->reg_state >= IB_DEV_UNREGISTERING) > + complete(&device->free); > +} > @@ -355,6 +393,9 @@ void ib_unregister_device(struct ib_device *device) > > ib_device_unregister_sysfs(device); > > + ib_device_put(device); > + wait_for_completion(&device->free); Why is it necessary here to wait until the last reference is gone ? Why doesn't ib_device_complete_cb() free any memory ? Bart. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 3/26/2015 1:46 AM, Bart Van Assche wrote: > On 03/25/2015 02:19 PM, Somnath Kotur wrote: >> +static void ib_device_complete_cb(struct kref *kref) >> +{ >> + struct ib_device *device = container_of(kref, struct ib_device, >> + refcount); >> + >> + if (device->reg_state >= IB_DEV_UNREGISTERING) >> + complete(&device->free); >> +} > >> @@ -355,6 +393,9 @@ void ib_unregister_device(struct ib_device *device) >> >> ib_device_unregister_sysfs(device); >> >> + ib_device_put(device); >> + wait_for_completion(&device->free); > > Why is it necessary here to wait until the last reference is gone ? Why > doesn't ib_device_complete_cb() free any memory ? > IMHO, ib_unregister_device should be a blocking call. The caller would like to be certain that any usage of the IB device is completed before it frees other resources/memory it allocated. > Bart. > Matan -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur <somnath.kotur@emulex.com> wrote: > From: Matan Barak <matanb@mellanox.com> > > Previously. we used device_mutex lock in order to protect > the device's list. That means that in order to guarantee a > device isn't freed while we use it, we had to lock all > devices. Matan, looking on the cover letter, it says: "[...] Patch 0002 adds a reference count mechanism to IB devices. This mechanism is similar to dev_hold and dev_put available for net devices. This is mandatory for later patches [...]" So in that respect, saying here "Previously. we used device_mutex lock" is a bit cryptic, @ least one typo must exist in this sentence, right? did you want to say "Currently we use device_mutex lock for XXX [...] and this should be replaced as of a YYY change which is to be introduced [...]" please clarify > Adding a kref per IB device. Before an IB device > is unregistered, we wait before its not held anymore. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 4/26/2015 11:10 PM, Or Gerlitz wrote: > On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur > <somnath.kotur@emulex.com> wrote: >> From: Matan Barak <matanb@mellanox.com> >> >> Previously. we used device_mutex lock in order to protect >> the device's list. That means that in order to guarantee a >> device isn't freed while we use it, we had to lock all >> devices. > > Matan, looking on the cover letter, it says: "[...] Patch 0002 adds a > reference count mechanism to IB devices. This mechanism is similar to > dev_hold and dev_put available for net devices. This is mandatory for > later patches [...]" > > So in that respect, saying here "Previously. we used device_mutex > lock" is a bit cryptic, @ least one typo must exist in this sentence, > right? did you want to say "Currently we use device_mutex lock for XXX > [...] and this should be replaced as of a YYY change which is to be > introduced [...]" please clarify Correct, I'll change that into: Currently we use device_mutex lock for protecting the device's list. In the current approach, in order to guarantee a device isn't freed we have to lock all devices. Adding a kref per IB device. Before an IB device is unregistered, we wait until it's not held anymore. > >> Adding a kref per IB device. Before an IB device >> is unregistered, we wait before its not held anymore. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Apr 27, 2015 at 11:25:56AM +0300, Matan Barak wrote: > > > On 4/26/2015 11:10 PM, Or Gerlitz wrote: > >On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur > ><somnath.kotur@emulex.com> wrote: > >>From: Matan Barak <matanb@mellanox.com> > >> > >>Previously. we used device_mutex lock in order to protect > >>the device's list. That means that in order to guarantee a > >>device isn't freed while we use it, we had to lock all > >>devices. > > > >Matan, looking on the cover letter, it says: "[...] Patch 0002 adds a > >reference count mechanism to IB devices. This mechanism is similar to > >dev_hold and dev_put available for net devices. This is mandatory for > >later patches [...]" > > > >So in that respect, saying here "Previously. we used device_mutex > >lock" is a bit cryptic, @ least one typo must exist in this sentence, > >right? did you want to say "Currently we use device_mutex lock for XXX > >[...] and this should be replaced as of a YYY change which is to be > >introduced [...]" please clarify > > Correct, I'll change that into: > > Currently we use device_mutex lock for protecting the device's list. > In the current approach, in order to guarantee a device isn't freed > we have to lock all devices. > > Adding a kref per IB device. Before an IB device is unregistered, we > wait until it's not held anymore. Why do we need two krefs for this structure? There is already a kref inside the embedded 'struct device dev' Sounds wrong to me without a lot more explanation. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 4/27/2015 7:22 PM, Jason Gunthorpe wrote: > On Mon, Apr 27, 2015 at 11:25:56AM +0300, Matan Barak wrote: >> >> >> On 4/26/2015 11:10 PM, Or Gerlitz wrote: >>> On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur >>> <somnath.kotur@emulex.com> wrote: >>>> From: Matan Barak <matanb@mellanox.com> >>>> >>>> Previously. we used device_mutex lock in order to protect >>>> the device's list. That means that in order to guarantee a >>>> device isn't freed while we use it, we had to lock all >>>> devices. >>> >>> Matan, looking on the cover letter, it says: "[...] Patch 0002 adds a >>> reference count mechanism to IB devices. This mechanism is similar to >>> dev_hold and dev_put available for net devices. This is mandatory for >>> later patches [...]" >>> >>> So in that respect, saying here "Previously. we used device_mutex >>> lock" is a bit cryptic, @ least one typo must exist in this sentence, >>> right? did you want to say "Currently we use device_mutex lock for XXX >>> [...] and this should be replaced as of a YYY change which is to be >>> introduced [...]" please clarify >> >> Correct, I'll change that into: >> >> Currently we use device_mutex lock for protecting the device's list. >> In the current approach, in order to guarantee a device isn't freed >> we have to lock all devices. >> >> Adding a kref per IB device. Before an IB device is unregistered, we >> wait until it's not held anymore. > > Why do we need two krefs for this structure? There is already a kref > inside the embedded 'struct device dev' > > Sounds wrong to me without a lot more explanation. > This was already asked by Haggai Eran awhile ago and was answered. Anyway, in ib_unregister_device we delete all client's related data. We would like to ensure that all references were released before this data is being deleted. Meaning, we would like to ensure the device is still functioning but isn't referenced rather than just to avoid freeing the IB device's memory. ib_device_get and ib_device_hold are APIs for the clients, similar to dev_hold and dev_put. > Jason > Regards, Matan -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Apr 27, 2015 at 11:25 AM, Matan Barak <matanb@mellanox.com> wrote: > On 4/26/2015 11:10 PM, Or Gerlitz wrote: >> On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur >> <somnath.kotur@emulex.com> wrote: >>> >>> From: Matan Barak <matanb@mellanox.com> >>> >>> Previously. we used device_mutex lock in order to protect >>> the device's list. That means that in order to guarantee a >>> device isn't freed while we use it, we had to lock all >>> devices. >> >> >> Matan, looking on the cover letter, it says: "[...] Patch 0002 adds a >> reference count mechanism to IB devices. This mechanism is similar to >> dev_hold and dev_put available for net devices. This is mandatory for >> later patches [...]" > Correct, I'll change that into: > Currently we use device_mutex lock for protecting the device's list. In the > current approach, in order to guarantee a device isn't freed we have to lock > all devices. > Adding a kref per IB device. Before an IB device is unregistered, we wait > until it's not held anymore. Why is this change mandatory for the proposed design? -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 4/28/2015 2:51 PM, Or Gerlitz wrote: > On Mon, Apr 27, 2015 at 11:25 AM, Matan Barak <matanb@mellanox.com> wrote: >> On 4/26/2015 11:10 PM, Or Gerlitz wrote: >>> On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur >>> <somnath.kotur@emulex.com> wrote: >>>> >>>> From: Matan Barak <matanb@mellanox.com> >>>> >>>> Previously. we used device_mutex lock in order to protect >>>> the device's list. That means that in order to guarantee a >>>> device isn't freed while we use it, we had to lock all >>>> devices. >>> >>> >>> Matan, looking on the cover letter, it says: "[...] Patch 0002 adds a >>> reference count mechanism to IB devices. This mechanism is similar to >>> dev_hold and dev_put available for net devices. This is mandatory for >>> later patches [...]" > >> Correct, I'll change that into: >> Currently we use device_mutex lock for protecting the device's list. In the >> current approach, in order to guarantee a device isn't freed we have to lock >> all devices. >> Adding a kref per IB device. Before an IB device is unregistered, we wait >> until it's not held anymore. > > Why is this change mandatory for the proposed design? > The cleanup of roce_gid_cache is done in a different context, so we need to make sure the device is still alive while doing so. In addition, we don't want that the unregistration process of ib_core will free our context data. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Apr 28, 2015 at 11:32:08AM +0300, Matan Barak wrote: > This was already asked by Haggai Eran awhile ago and was answered. > Anyway, in ib_unregister_device we delete all client's related data. > We would like to ensure that all references were released before this > data is being deleted. Meaning, we would like to ensure the device > is still functioning but isn't referenced rather than just to avoid > freeing the IB device's memory. A kref is the wrong datastructure for that purpose. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 4/28/2015 7:03 PM, Jason Gunthorpe wrote: > On Tue, Apr 28, 2015 at 11:32:08AM +0300, Matan Barak wrote: > >> This was already asked by Haggai Eran awhile ago and was answered. >> Anyway, in ib_unregister_device we delete all client's related data. >> We would like to ensure that all references were released before this >> data is being deleted. Meaning, we would like to ensure the device >> is still functioning but isn't referenced rather than just to avoid >> freeing the IB device's memory. > > A kref is the wrong datastructure for that purpose. What is the right data structure in your opinion? > > Jason > Matan -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Apr 28, 2015 at 05:03:11PM +0300, Matan Barak wrote: > The cleanup of roce_gid_cache is done in a different context, so we > need to make sure the device is still alive while doing so. This explanation doesn't look right to me. I don't see anything like that under roce_gid_cache_cleanup_one ? Although, there must be a call to the driver's modify_gid to free context before freeing, and I don't see that obviously happening.. However, all the other async work launched doesn't look safe at all. So, did you mean that the device must still be alive while all the other work is running? And the point of this scheme is to guarentee all the work is flushed? (at least I hope it is, otherwise there are bigger problems here) It is just fundamentally wrong to return from ib_client.remove while async work is still outstanding, the client is expected to deal with this internally and synchronously. You don't need IB core help to do this. Or: This should have been fixed after Haggai brought it up... Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Apr 28, 2015 at 8:43 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > On Tue, Apr 28, 2015 at 05:03:11PM +0300, Matan Barak wrote: [...] > Or: This should have been fixed after Haggai brought it up... Jason, looking again on the correspondence between Matan and Haggai, I think this one was sort of left in the air (or actually fell on the floor), happens, and indeed we should strive to do better and avoid that, thanks for the 2nd eye. Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 4/28/2015 8:43 PM, Jason Gunthorpe wrote: > On Tue, Apr 28, 2015 at 05:03:11PM +0300, Matan Barak wrote: > >> The cleanup of roce_gid_cache is done in a different context, so we >> need to make sure the device is still alive while doing so. > > This explanation doesn't look right to me. I don't see anything like > that under roce_gid_cache_cleanup_one ? > > Although, there must be a call to the driver's modify_gid to free > context before freeing, and I don't see that obviously happening.. > Of course there is: roce_gid_cache_client_cleanup_one -> roce_gid_cache_client_cleanup_one_work -> *work* -> roce_gid_cache_client_cleanup_work_handler -> roce_gid_cache_cleanup_one -> free_roce_gid_cache -> write_gid -> modify_gid > However, all the other async work launched doesn't look safe at all. > I think it's safe - roce_gid_cache_client_cleanup_one deactivates the cache (no new events are handled on the cache). Ongoing events are flushed in roce_gid_cache_client_cleanup_one_work (flush_workqueue(roce_gid_mgmt_wq)). When roce_gid_cache_cleanup_one is called - no work will be done or is currently in process on this cache. > So, did you mean that the device must still be alive while all the > other work is running? And the point of this scheme is to guarentee > all the work is flushed? (at least I hope it is, otherwise there are > bigger problems here) > It's not safe to free the client's context in ib_unregister_device while the client isn't done. The obvious solution is to wait in client->remove (like you suggested) until the client has finished cleaning up things. This doesn't fit our case - since client->remove is called under device_lock, but it's possible (for example) that roce_rescan_device_work_handler is currently running and waits to grab this exact mutex - DEAD LOCK. Freeing thing asynchronously is a possible solution - we don't lock all IB devices but just our device. Moreover, it's also more future proof as other clients might want to run other things concurrently. > It is just fundamentally wrong to return from ib_client.remove while > async work is still outstanding, the client is expected to deal with > this internally and synchronously. > > You don't need IB core help to do this. > netdev is taking a similar approach - please take a look at netdev_wait_allrefs > Or: This should have been fixed after Haggai brought it up... > > Jason > Matan -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 4/28/2015 8:43 PM, Jason Gunthorpe wrote: > On Tue, Apr 28, 2015 at 05:03:11PM +0300, Matan Barak wrote: > >> The cleanup of roce_gid_cache is done in a different context, so we >> need to make sure the device is still alive while doing so. > > This explanation doesn't look right to me. I don't see anything like > that under roce_gid_cache_cleanup_one ? > > Although, there must be a call to the driver's modify_gid to free > context before freeing, and I don't see that obviously happening.. > Of course there is: roce_gid_cache_client_cleanup_one -> roce_gid_cache_client_cleanup_one_work -> *work* -> roce_gid_cache_client_cleanup_work_handler -> roce_gid_cache_cleanup_one -> free_roce_gid_cache -> write_gid -> modify_gid > However, all the other async work launched doesn't look safe at all. > I think it's safe - roce_gid_cache_client_cleanup_one deactivates the cache (no new events are handled on the cache). Ongoing events are flushed in roce_gid_cache_client_cleanup_one_work (flush_workqueue(roce_gid_mgmt_wq)). When roce_gid_cache_cleanup_one is called - no work will be done or is currently in process on this cache. > So, did you mean that the device must still be alive while all the > other work is running? And the point of this scheme is to guarentee > all the work is flushed? (at least I hope it is, otherwise there are > bigger problems here) > It's not safe to free the client's context in ib_unregister_device while the client isn't done. The obvious solution is to wait in client->remove (like you suggested) until the client has finished cleaning up things. This doesn't fit our case - since client->remove is called under device_lock, but it's possible (for example) that roce_rescan_device_work_handler is currently running and waits to grab this exact mutex - DEAD LOCK. Freeing thing asynchronously is a possible solution - we don't lock all IB devices but just our device. Moreover, it's also more future proof as other clients might want to run other things concurrently. > It is just fundamentally wrong to return from ib_client.remove while > async work is still outstanding, the client is expected to deal with > this internally and synchronously. > > You don't need IB core help to do this. > netdev is taking a similar approach - please take a look at netdev_wait_allrefs > Or: This should have been fixed after Haggai brought it up... > > Jason > Matan -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> >Although, there must be a call to the driver's modify_gid to free > >context before freeing, and I don't see that obviously happening.. > Of course there is: > roce_gid_cache_client_cleanup_one -> > roce_gid_cache_client_cleanup_one_work -> *work* -> > roce_gid_cache_client_cleanup_work_handler -> > roce_gid_cache_cleanup_one -> free_roce_gid_cache -> write_gid -> > modify_gid Ah, this wasn't there in the earlier versions. Good. > >However, all the other async work launched doesn't look safe at all. > I think it's safe - roce_gid_cache_client_cleanup_one deactivates > the cache (no new events are handled on the cache). Ongoing events > are flushed in roce_gid_cache_client_cleanup_one_work Do you mean roce_gid_cache_client_cleanup_work_handler? > It's not safe to free the client's context in ib_unregister_device > while the client isn't done. The obvious solution is to wait in > client->remove (like you suggested) until the client has finished > cleaning up things. This isn't just the obvious solution, it is the *expected* solution. In the kernel the add/remove style idiom always works like this. > This doesn't fit our case - since client->remove is called under > device_lock, but it's possible (for example) that > roce_rescan_device_work_handler is currently running and waits to > grab this exact mutex - DEAD LOCK. Uh, client->remove is most obviously called with drivers/infiniband/core/device.c:device_mutex held, is that what you mean? But that can't be right because only four functions hold that lock and none of them are obviously called from this patch? If these patches have a locking problem then breaking the add/remove idiom is not the way to solve it. Look, four people have asked about this patch, and I still have yet to see an accurate and convincing answer from you what is actually going on here. Please actually spend some time to properly research and describe why the remove callback can't be synchronous. > >It is just fundamentally wrong to return from ib_client.remove while > >async work is still outstanding, the client is expected to deal with > >this internally and synchronously. > > > >You don't need IB core help to do this. > > netdev is taking a similar approach - please take a look at > netdev_wait_allrefs No, it really isn't, from the attached drivers perspective after unregister_netdevice returns the driver is not allowed to operate the net device anymore. netdev_wait_allrefs is part of making unregister_netdevice synchronous. The same is true of our IB client attaches, after a client returns from remove it is not allowed to operate the IB device anymore. That is a standard idiom, and we'd need a huge compelling reason to go away from that. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 4/29/2015 7:48 PM, Jason Gunthorpe wrote: > >>> Although, there must be a call to the driver's modify_gid to free >>> context before freeing, and I don't see that obviously happening.. > >> Of course there is: >> roce_gid_cache_client_cleanup_one -> >> roce_gid_cache_client_cleanup_one_work -> *work* -> >> roce_gid_cache_client_cleanup_work_handler -> >> roce_gid_cache_cleanup_one -> free_roce_gid_cache -> write_gid -> >> modify_gid > > Ah, this wasn't there in the earlier versions. Good. > >>> However, all the other async work launched doesn't look safe at all. > >> I think it's safe - roce_gid_cache_client_cleanup_one deactivates >> the cache (no new events are handled on the cache). Ongoing events >> are flushed in roce_gid_cache_client_cleanup_one_work > > Do you mean roce_gid_cache_client_cleanup_work_handler? > Yeah, the flush_workqueue is done in roce_gid_cache_client_cleanup_work_handler >> It's not safe to free the client's context in ib_unregister_device >> while the client isn't done. The obvious solution is to wait in >> client->remove (like you suggested) until the client has finished >> cleaning up things. > > This isn't just the obvious solution, it is the *expected* solution. > In the kernel the add/remove style idiom always works like this. > >> This doesn't fit our case - since client->remove is called under >> device_lock, but it's possible (for example) that >> roce_rescan_device_work_handler is currently running and waits to >> grab this exact mutex - DEAD LOCK. > > Uh, client->remove is most obviously called with > drivers/infiniband/core/device.c:device_mutex held, is that what you > mean? But that can't be right because only four functions hold that > lock and none of them are obviously called from this patch? > > If these patches have a locking problem then breaking the add/remove > idiom is not the way to solve it. > > Look, four people have asked about this patch, and I still have yet to > see an accurate and convincing answer from you what is actually going > on here. Please actually spend some time to properly research and > describe why the remove callback can't be synchronous. > ib_unregister_device calls the remove function with device_mutex help. In addition, ib_enum_roce_ports_of_netdev does the same. Every interesting netdev/inet/inet6 event that's handled in roce_gid_mgmt triggers ib_enum_roce_ports_of_netdev by using the workqueue (for example, netdevice_event->*work*->netdevice_event_work_handler->ib_enum_roce_ports_of_netdev). When a device is being unregistered, the remove function is called under device_mutex: ib_unregister_device->roce_gid_cache_client_cleanup_one->*work*->roce_gid_cache_client_cleanup_work_handler->flush_workqueue This flush_workqueue could wait for ib_enum_roce_ports_of_netdev. If we would have called it straight from the remove function, we're waiting for a work which waits for a mutex that will be unlocked only after the iteration over all remove functions is completed -> *DEADLOCK* You could argue that flush_workqueue isn't needed, but that let's look at the following flow: roce_gid_cache_client_setup_one->roce_rescan_device->*work (with the exact ib_dev)*->.... We need to make sure the ib_dev isn't free'd before this work is done. There might be some ways around it - for example, introduce another workqueue for roce_rescan_device and flush this workqueue only. Every way has its advantages and disadvantages. I think it's problematic that device_mutex can't be held in a work as *most* client works are synchronized when the a device is being unregistered. It could affect future clients as well. I'll be happy to explain/fix any issue you have regarding this code, but of course it needs to be concrete. >>> It is just fundamentally wrong to return from ib_client.remove while >>> async work is still outstanding, the client is expected to deal with >>> this internally and synchronously. >>> >>> You don't need IB core help to do this. >> >> netdev is taking a similar approach - please take a look at >> netdev_wait_allrefs > > No, it really isn't, from the attached drivers perspective after > unregister_netdevice returns the driver is not allowed to operate the > net device anymore. netdev_wait_allrefs is part of making > unregister_netdevice synchronous. > > The same is true of our IB client attaches, after a client returns > from remove it is not allowed to operate the IB device anymore. > ib_unregister_device is synchronous in the exact same manner - after it returns, no client operate on the IB device. wait_for_completion(&device->free) was added for this exact reason. > That is a standard idiom, and we'd need a huge compelling reason to go > away from that. A device can be remove if and only if it's reference count is zero. That's the only point where we guarantee nobody uses it anymore. That's a standard idiom as well. > > Jason > I really appreciate the review and thanks for looking at this patch, Matan -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> >> roce_gid_cache_client_cleanup_one -> > >> roce_gid_cache_client_cleanup_one_work -> *work* -> > >> roce_gid_cache_client_cleanup_work_handler -> > >> roce_gid_cache_cleanup_one -> free_roce_gid_cache -> write_gid -> > >> modify_gid > > > > Ah, this wasn't there in the earlier versions. Good. > > > >>> However, all the other async work launched doesn't look safe at all. > > > >> I think it's safe - roce_gid_cache_client_cleanup_one deactivates > >> the cache (no new events are handled on the cache). Ongoing events > >> are flushed in roce_gid_cache_client_cleanup_one_work > > > > Do you mean roce_gid_cache_client_cleanup_work_handler? > > > > Yeah, the flush_workqueue is done in > roce_gid_cache_client_cleanup_work_handler This entire cache seems completely over-architected. Why is this complexity needed? How frequently do we really expect these "GIDs" to change? I still don't even buy that a cache is needed at all. The RDMA CM can accept as input UDP/IP addresses. RoCEv2 puts UDP/IP addresses on the wire. Why should there be a conversion from an IP address to a GID to get back to the same IP address? Why aren't IP addresses passed directly to the drivers? Just change how the RDMA CM associates an IP address with the RDMA device. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 30, 2015 at 11:21:01AM +0300, Matan Barak wrote: > ib_unregister_device calls the remove function with device_mutex > help. In addition, ib_enum_roce_ports_of_netdev does the same. Every > interesting netdev/inet/inet6 event that's handled in roce_gid_mgmt > triggers ib_enum_roce_ports_of_netdev by using the workqueue (for > example, netdevice_event->*work*->netdevice_event_work_handler->ib_enum_roce_ports_of_netdev). So, okay, now it is very clear. This should have been described explicitly in the kref commit message, for instance: ... Later commits in this series are going to extend the use of device_mutex via ib_enum_roce_ports_of_netdev resulting in ... .. To solve this deadlock introduce .... Part of the job of the patch author is to make review work better by highlighting the most troublesome areas, remember we have more patch authors than reviewers so work must be pushed onto the author side. Lets look at the original commit message provided: Previously. we used device_mutex lock in order to protect the device's list. ** 'previously' is wrong, this patch does nothing to change what device_mutex covers, it still protects the device_list and still protects the client_list That means that in order to guarantee a device isn't freed while we use it, we had to lock all devices. ** No, locking the device_mutex does nothing to protect a device from being freed. The existing kref does that. The device_mutex protects the device_list from mutation, and ib_enum_roce_ports_of_netdev must hold it when it iterates over that list. It prevents a device from being unregistered. Accurate specificity is important in these commit messages. Otherwise nobody understands what is being described. Adding a kref per IB device. Before an IB device is unregistered, we wait before its not held anymore. ** Well, that is what the patch did, but the commit message is supposed to explain *why* too Do you understand why this is so confusing? > You could argue that flush_workqueue isn't needed, but that let's > look at the following flow: No, I wouldn't argue that, all the async work obviously needs to cancel or complete during remove(), that's what I've been saying. > There might be some ways around it - for example, introduce another > workqueue for roce_rescan_device and flush this workqueue only. > Every way has its advantages and disadvantages. I don't see that either, like I keep saying, all work must complete, more work queues don't really seem to help that. > I think it's problematic that device_mutex can't be held in a work > as *most* client works are synchronized when the a device is being > unregistered. It could affect future clients as well. But until this patch set added ib_enum_roce_ports_of_netdev the device_mutex would never conflict with anything a client could do. So, ultimtely, this is really a self created problem, and infact, the problem lies with the introduction of ib_enum_roce_ports_of_netdev - adding a new use of the device_mutex that is not register/unregister related exposes the design limitations of that mutex *that Roland clearly explained in a giant comment!* So we need to fix that problem, not add a new mechanism. Off the top of my head: - Split device_mutex into client_mutex and device_mutex, hold only the client_mutex when working with the client_list. - Convert device_mutex into a r/w sem - Use a different scheme to match netdevs for ib_enum_roce_ports_of_netdev, that doesn't rely on holding device_mutex during query The first two seem really simple to me. I'd do that. > >No, it really isn't, from the attached drivers perspective after > >unregister_netdevice returns the driver is not allowed to operate the > >net device anymore. netdev_wait_allrefs is part of making > >unregister_netdevice synchronous. > > > >The same is true of our IB client attaches, after a client returns > >from remove it is not allowed to operate the IB device anymore. > > > > ib_unregister_device is synchronous in the exact same manner - after > it returns, no client operate on the IB device. > wait_for_completion(&device->free) was added for this exact reason. No, you are missing the layering here. In this context ib_client is the *driver*, and like all drivers once it's remove method returns the driver can no longer expect the attached device is operable, and must guarentee the driver's .text is freeable. > >That is a standard idiom, and we'd need a huge compelling reason to go > >away from that. > > A device can be remove if and only if it's reference count is zero. > That's the only point where we guarantee nobody uses it anymore. > That's a standard idiom as well. No actually. The kref is tied to the memory lifetime, and is pretty much universal that the memory lifetime and device operable lifetime (ie still registered) are very different things. This is why using a kref to describe anything other than memory lifetime is not correct, and two krefs in a structure is obviously nonsense. Some places use an 'active count + completion', like netdev, kernfs, etc to track active users and use that to block an unregister path, but that isn't a kref. Okay, lets break down and understand why this is an important standard guarentee, for our specific case. Lets generalize this scenario a bit. roce_gid_cache isn't modular, but lets assume it is (after all the intent of this patch is to weaken the remove() invarient for everyone.) On module remove it will call ib_unregister_client, and when ib_unregister_client returns the module will be become unloaded. The invariant for module unload requires no module code to be running and no module code to be runnable in future -- ie all work must be complete or canceled. module unload always calls the driver's remove function, that is a standard idiom. So, the first thing to notice, the kref patch didn't actually change ib_unregister_client - it doesn't 'wait_for_completion(&device->free)'. Immediately we have a architectural bug, modules can exit while they have outstanding code runnable in their .text. Oops. Next, we realize we cannot fix this by waiting on device->free, it is ostensibly a general lock that all clients use on the ib_device, we need something isolated to this client. Finally, we realize if we can isolate something to one client, then the client can handle it during it's own remove() function, we don't need core support. Thus, the way out is to make ib_client.remove() completely synchronous and *guarentee* that the module's .text will never be used again after remove() returns. Obviously this means all work is complete or canceled. *THIS* is why driver remove is idiomatically always synchronous in the kernel. Please appreciate how much time it takes to explain all of this :( Just fix it already :( Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 30, 2015 at 04:56:13PM +0000, Hefty, Sean wrote: > I still don't even buy that a cache is needed at all. The RDMA CM > can accept as input UDP/IP addresses. RoCEv2 puts UDP/IP addresses > on the wire. Why should there be a conversion from an IP address to > a GID to get back to the same IP address? Why aren't IP addresses > passed directly to the drivers? Just change how the RDMA CM > associates an IP address with the RDMA device. I feel it looks overdesigned too.. But, how could RDMA CM work? Inbound CM messages will be filtered if the IP is not in the HW GID table? How could UD work? This current scheme is just so ugly, there are so many wonky possibilities. What happens if I remove an IP and then add a new one? The GID index will eventually be re-used, and QPs bound to that gid index will silently change source IPs. Horrible. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> But, how could RDMA CM work? Inbound CM messages will be filtered if > the IP is not in the HW GID table? I'm not understanding the issue. If a device has some requirement to program its assigned IP addresses into some HW table, I don't see why upper layers should be made to care. The IB CM is essentially handling 2 different type of CM messages -- one for IB and one for RoCE. This just adds a third type. All three types are similar, with some fields ignored and others formatted using different data. The CM interface can be updated to better reflect reality, rather than pretending that RoCE has path records or anything other than IB-classic has LIDs. The CM message definitions themselves could also be updated to indicate which fields matter. > How could UD work? I haven't given much thought to UD, but since AV creation goes directly to the driver, I still don't see why GIDs need to be used. The driver can encode whatever it needs to (e.g. GID/IP index) into the AV. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 30, 2015 at 07:21:08PM +0000, Hefty, Sean wrote: > > But, how could RDMA CM work? Inbound CM messages will be filtered if > > the IP is not in the HW GID table? > > I'm not understanding the issue. > > If a device has some requirement to program its assigned IP > addresses into some HW table, I don't see why upper layers should be > made to care. Okay, I was only thinking about the first couple patches in this series. Three drivers will have this HW gid table, so having a driver helper library in the common code makes sense to me. But then the series it just seems to go crazy - what is with all these net dev patches? Now we are doing something with bonding? And a lot of screwy looking rocev2 gid type stuff? And driver updates? And 33 patches worth? You are totally right, this GID index and GID type stuff is getting *everywhere*, and it is hard to follow the *why* of all the changes are really needed. Matan, if you want to progress this, then split it up. Make a patch to library the existing roce GID table driver code and update the drivers. Just that bit alone has already spawned a huge discussion. Work the other ancillary unrelated patches in small chuncks through their trees. Then come back to rocev2 with an actual core API proposal patch set that can be discussed. Honestly, I'm not willing to even look at a patch set this big and rambly. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 30, 2015 at 8:26 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > On Thu, Apr 30, 2015 at 11:21:01AM +0300, Matan Barak wrote: > >> ib_unregister_device calls the remove function with device_mutex >> help. In addition, ib_enum_roce_ports_of_netdev does the same. Every >> interesting netdev/inet/inet6 event that's handled in roce_gid_mgmt >> triggers ib_enum_roce_ports_of_netdev by using the workqueue (for >> example, netdevice_event->*work*->netdevice_event_work_handler->ib_enum_roce_ports_of_netdev). > > So, okay, now it is very clear. This should have been described > explicitly in the kref commit message, for instance: > > ... > > Later commits in this series are going to extend the use of > device_mutex via ib_enum_roce_ports_of_netdev resulting in ... > .. To solve this deadlock introduce .... > > Part of the job of the patch author is to make review work better by > highlighting the most troublesome areas, remember we have more patch > authors than reviewers so work must be pushed onto the author side. > > Lets look at the original commit message provided: > > Previously. we used device_mutex lock in order to protect > the device's list. > > ** 'previously' is wrong, this patch does nothing to change what > device_mutex covers, it still protects the device_list and still > protects the client_list > > That means that in order to guarantee a > device isn't freed while we use it, we had to lock all > devices. > > ** No, locking the device_mutex does nothing to protect a device > from being freed. The existing kref does that. The device_mutex > protects the device_list from mutation, and > ib_enum_roce_ports_of_netdev must hold it when it iterates over > that list. > > It prevents a device from being unregistered. > Accurate specificity is important in these commit messages. > Otherwise nobody understands what is being described. > > Adding a kref per IB device. Before an IB device > is unregistered, we wait before its not held anymore. > > ** Well, that is what the patch did, but the commit message is > supposed to explain *why* too > > Do you understand why this is so confusing? > I agree, it should have been clarified better. The "why" is as important as the "what" and an accurate reason could have make review's life easier. I'll fix that. Thanks. >> You could argue that flush_workqueue isn't needed, but that let's >> look at the following flow: > > No, I wouldn't argue that, all the async work obviously needs to > cancel or complete during remove(), that's what I've been saying. > >> There might be some ways around it - for example, introduce another >> workqueue for roce_rescan_device and flush this workqueue only. >> Every way has its advantages and disadvantages. > > I don't see that either, like I keep saying, all work must complete, > more work queues don't really seem to help that. > I think this is a possible solution, as all works but the "rescan" work aren't device specific. That means that after we move the rescan work to another workqueue and synchronize it in the client->remove function, we could be sure the device won't be used by this client anymore. The rest of the works only iterate on the *existing* device list. Since the ib_unregister_device (a) lock (b) iterate over client->remove (c) remove from list (d) free (e) unlock all roce_gid_cache's works won't find it in the list and we'll be safe. Regarding ib_unregister_client - if we guarantee that after each client->remove(dev) dev isn't used in this client, we actually guarantee that after removing all IB devices no IB device will be used (induction). What do you think? >> I think it's problematic that device_mutex can't be held in a work >> as *most* client works are synchronized when the a device is being >> unregistered. It could affect future clients as well. > > But until this patch set added ib_enum_roce_ports_of_netdev the > device_mutex would never conflict with anything a client could do. > > So, ultimtely, this is really a self created problem, and infact, the > problem lies with the introduction of ib_enum_roce_ports_of_netdev - > adding a new use of the device_mutex that is not register/unregister > related exposes the design limitations of that mutex *that Roland > clearly explained in a giant comment!* > I agree, while it's a general problem - it was first introduced by using device_mutex in an asynchronous context (that should be flushed in the remove function). > So we need to fix that problem, not add a new mechanism. Off the top > of my head: > - Split device_mutex into client_mutex and device_mutex, > hold only the client_mutex when working with the client_list. Seems like a possible nice solution. I'll look into that. > - Convert device_mutex into a r/w sem I'm not sure this will solve the problem, as besides the new enumerate devices function, all existing functions update the list - so we'll have read-while-write scenario and we'll be in the exact same condition. > - Use a different scheme to match netdevs for > ib_enum_roce_ports_of_netdev, that doesn't rely on holding > device_mutex during query > We could switch to RCU protected list or something similar, but I honestly don't think it worth the complexity. > The first two seem really simple to me. I'd do that. > Agree, but please consider also the addition of another workqueue as a possible solution. It *does* (seem) to answer all of your concerns and could be safer and cause less code changes. >> >No, it really isn't, from the attached drivers perspective after >> >unregister_netdevice returns the driver is not allowed to operate the >> >net device anymore. netdev_wait_allrefs is part of making >> >unregister_netdevice synchronous. >> > >> >The same is true of our IB client attaches, after a client returns >> >from remove it is not allowed to operate the IB device anymore. >> > >> >> ib_unregister_device is synchronous in the exact same manner - after >> it returns, no client operate on the IB device. >> wait_for_completion(&device->free) was added for this exact reason. > > No, you are missing the layering here. > > In this context ib_client is the *driver*, and like all drivers once > it's remove method returns the driver can no longer expect the > attached device is operable, and must guarentee the driver's .text is > freeable. > >> >That is a standard idiom, and we'd need a huge compelling reason to go >> >away from that. >> >> A device can be remove if and only if it's reference count is zero. >> That's the only point where we guarantee nobody uses it anymore. >> That's a standard idiom as well. > > No actually. The kref is tied to the memory lifetime, and is pretty > much universal that the memory lifetime and device operable lifetime > (ie still registered) are very different things. > > This is why using a kref to describe anything other than memory > lifetime is not correct, and two krefs in a structure is obviously > nonsense. > > Some places use an 'active count + completion', like netdev, kernfs, > etc to track active users and use that to block an unregister path, > but that isn't a kref. > kref just hides an atomic refcount - so you're actually saying using the abstraction contradicts the kernel object's lifetime strategy? I get that, I could have used an atomic refcount - but I agree that it's worth exploring other solutions the will preserve the invariant of client->remove being synchronous. > Okay, lets break down and understand why this is an important standard > guarentee, for our specific case. Lets generalize this scenario a > bit. roce_gid_cache isn't modular, but lets assume it is (after all > the intent of this patch is to weaken the remove() invarient for > everyone.) > > On module remove it will call ib_unregister_client, and when > ib_unregister_client returns the module will be become unloaded. > > The invariant for module unload requires no module code to be > running and no module code to be runnable in future -- ie all work > must be complete or canceled. > > module unload always calls the driver's remove function, that is a > standard idiom. > > So, the first thing to notice, the kref patch didn't actually change > ib_unregister_client - it doesn't > 'wait_for_completion(&device->free)'. Immediately we have a > architectural bug, modules can exit while they have outstanding code > runnable in their .text. Oops. > I tend to agree. The __exit function flushes the workqueue (so eventually we guarentee that no code will execute before the module is unloaded). IMHO, after ib_unregister_client returns, the module could still run code (this is why the __exit handler exists), but it's not allowed to use any data of the module which unregistered it, meaning - we can't allow it to use any IB device. > Next, we realize we cannot fix this by waiting on device->free, it is > ostensibly a general lock that all clients use on the ib_device, we > need something isolated to this client. > > Finally, we realize if we can isolate something to one client, then > the client can handle it during it's own remove() function, we don't > need core support. > > Thus, the way out is to make ib_client.remove() completely synchronous > and *guarentee* that the module's .text will never be used again after > remove() returns. Obviously this means all work is complete or > canceled. > > *THIS* is why driver remove is idiomatically always synchronous in the > kernel. > Ok, I get your point, thanks. I'll keep that call synchronous. The two options which are on the table right now are: (a) split the device_mutex to device_mutex and client_mutex. ib_unregister_device first grabs only client_mutex and after it called client->remove it grabs device->mutex to the rest of the work. (b) Introduce another workqueue for all device-specific events (actually, only rescan). This allows us to flush only this workqueue without waiting for any work which needs device_mutex. > Please appreciate how much time it takes to explain all of this :( > Just fix it already :( > I really appreciate it, again - thanks for looking at this and writing this *long* explanation. > Jason > -- Thanks a lot, Matan > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 30, 2015 at 8:52 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > On Thu, Apr 30, 2015 at 04:56:13PM +0000, Hefty, Sean wrote: > >> I still don't even buy that a cache is needed at all. The RDMA CM >> can accept as input UDP/IP addresses. RoCEv2 puts UDP/IP addresses >> on the wire. Why should there be a conversion from an IP address to >> a GID to get back to the same IP address? Why aren't IP addresses >> passed directly to the drivers? Just change how the RDMA CM >> associates an IP address with the RDMA device. > > I feel it looks overdesigned too.. > > But, how could RDMA CM work? Inbound CM messages will be filtered if > the IP is not in the HW GID table? How could UD work? > > This current scheme is just so ugly, there are so many wonky > possibilities. What happens if I remove an IP and then add a new one? > The GID index will eventually be re-used, and QPs bound to that gid > index will silently change source IPs. Horrible. This should be handled by the vendor's driver/other future ib_core part. This patchset introduces roce_gid_cache that manages the GID table and notify vendors about GID changes. The vendor needs to: (a) Move all QPs that use GID x to error state when GID x is deleted from the table. (b) Change all QPs that use GID x to use a special invalid GID entry. (c) Don't delete GIDs that are being used by a QP. > > Jason Matan > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, May 1, 2015 at 12:28 AM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > On Thu, Apr 30, 2015 at 07:21:08PM +0000, Hefty, Sean wrote: >> > But, how could RDMA CM work? Inbound CM messages will be filtered if >> > the IP is not in the HW GID table? >> >> I'm not understanding the issue. >> >> If a device has some requirement to program its assigned IP >> addresses into some HW table, I don't see why upper layers should be >> made to care. > > Okay, I was only thinking about the first couple patches in this > series. Three drivers will have this HW gid table, so having a driver > helper library in the common code makes sense to me. > > But then the series it just seems to go crazy - what is with all these > net dev patches? Now we are doing something with bonding? And a lot of > screwy looking rocev2 gid type stuff? And driver updates? And 33 patches worth? > Actually, we're in the middle of splitting this series into two series - the first introduces the roce_gid_cache management and the second will introduce RoCE V2. The bonding changes are crucial - the mlx4 driver currently support bonding and changing to a new mechanism without supporting bonding will cause a regression. > You are totally right, this GID index and GID type stuff is getting > *everywhere*, and it is hard to follow the *why* of all the changes > are really needed. > > Matan, if you want to progress this, then split it up. Make a patch > to library the existing roce GID table driver code and update the > drivers. Just that bit alone has already spawned a huge discussion. > It's already in progress. The new series is expected to be a lot smaller. > Work the other ancillary unrelated patches in small chuncks through > their trees. > The DRV patch was already sent to net. Regarding the bonding and ipv6 patch - we can't change them without a good reason which relies in this patchset. > Then come back to rocev2 with an actual core API proposal patch set > that can be discussed. > > Honestly, I'm not willing to even look at a patch set this big and > rambly. Ok, hopefully we'll have something soon enough. > > Jason Thanks a lot for the review and your comments. Matan > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, May 01, 2015 at 09:28:40AM +0300, Matan Barak wrote: > >> There might be some ways around it - for example, introduce another > >> workqueue for roce_rescan_device and flush this workqueue only. > >> Every way has its advantages and disadvantages. > > > > I don't see that either, like I keep saying, all work must complete, > > more work queues don't really seem to help that. > > > > I think this is a possible solution, as all works but the "rescan" work > aren't device specific. That means that after we move the rescan work > to another workqueue and synchronize it in the client->remove function, > we could be sure the device won't be used by this client anymore. > > The rest of the works only iterate on the *existing* device list. Since the > ib_unregister_device > (a) lock (b) iterate over client->remove (c) remove from list (d) free > (e) unlock > all roce_gid_cache's works won't find it in the list and we'll be safe. Well.. be carefull with the locking here, if device_mutex no longer covers remove() then remove() and ib_enum_roce_ports_of_netdev() can run concurrently and they will need locking against free_roce_gid_cache(), which appears to implicitly rely on the device_mutex (too subtle and fragile) If there is a 'global' part and a 'device' specific part, it would good to make this more obvious and clear: group the functions together, annotate the difference with naming, a comment, or something. There are *alot* of little functions in this module, it is hard to understand the call graph without careful study. > Regarding ib_unregister_client - if we guarantee that after each > client->remove(dev) > dev isn't used in this client, we actually guarantee that after removing all IB > devices no IB device will be used (induction). Yes, that is the idea.. > > So, ultimtely, this is really a self created problem, and infact, the > > problem lies with the introduction of ib_enum_roce_ports_of_netdev - > > adding a new use of the device_mutex that is not register/unregister > > related exposes the design limitations of that mutex *that Roland > > clearly explained in a giant comment!* > I agree, while it's a general problem - it was first introduced by using > device_mutex in an asynchronous context (that should be flushed in > the remove function). Well, no, generally speaking you can't use device_mutex in any client facing API like ib_enum_roce_ports_of_netdev - it isn't just async contexts, but it means the add() and remove() functions instantly deadlock if they call a client facing API - that is not sane. > > - Convert device_mutex into a r/w sem > > I'm not sure this will solve the problem, as besides the new enumerate > devices function, all existing functions update the list - so we'll have > read-while-write scenario and we'll be in the exact same condition. You'd do something like hold the write sem and mutate the device list to unlink the target device, then switch to a read sem to traverse the client list. Two locks are clearer if contention is not an issue. > Agree, but please consider also the addition of another > workqueue as a possible solution. It *does* (seem) to answer > all of your concerns and could be safer and cause less code changes. As above, fundamentally, introducing a client API that holds device_mutex is just wrong, so we need to address this. I would also recommend splitting your per-device/global stuff anyhow, global stuff needs a clear unwind during remove, and so on. It isn't really optional to have this clarity. > kref just hides an atomic refcount - so you're actually saying using > the abstraction contradicts the kernel object's lifetime strategy? kref has specific semantic meaning, it isn't a general use data structure. > I tend to agree. The __exit function flushes the workqueue (so eventually > we guarentee that no code will execute before the module is unloaded). > IMHO, after ib_unregister_client returns, the module could still run code > (this is why the __exit handler exists), but it's not allowed to use any > data of the module which unregistered it, meaning - we can't allow it to > use any IB device. Yes, I was being terse, of course the .text is still going to be used, but the general idea of lifetime holds: the .text cannot remain in service of the ib_device because the __exit handler cannot clean that up. > Ok, I get your point, thanks. I'll keep that call synchronous. The two > options which are on the table right now are: > (a) split the device_mutex to device_mutex and client_mutex. > ib_unregister_device first grabs only client_mutex and after it called > client->remove it grabs device->mutex to the rest of the work. > (b) Introduce another workqueue for all device-specific events > (actually, only rescan). > This allows us to flush only this workqueue without waiting for any > work which needs device_mutex. The other option is to make rescan sense that the ib_device has been freed and ignore it, and then flush the work queue that runs that on module __exit. This might be needed anyhow due to the free race discussed above. Be sure to add the necessary kref get/puts to hold the ib device though. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, May 01, 2015 at 09:34:24AM +0300, Matan Barak wrote: > > This current scheme is just so ugly, there are so many wonky > > possibilities. What happens if I remove an IP and then add a new one? > > The GID index will eventually be re-used, and QPs bound to that gid > > index will silently change source IPs. Horrible. > > This should be handled by the vendor's driver/other future ib_core part. > This patchset introduces roce_gid_cache that manages the GID table and > notify vendors about GID changes. > > The vendor needs to: > (a) Move all QPs that use GID x to error state when GID x is deleted from > the table. > (b) Change all QPs that use GID x to use a special invalid GID entry. > (c) Don't delete GIDs that are being used by a QP. What about AH's for UD? What about clients that discover and then hold the GID index internally? What about the impossible to fix race of returing the GID index in the work completion and translating that back to an IP? It is a terrible scheme, Sean is right, the clients should work with the actual sock addr, somehow, at least kernel side. Converting from a sockaddr to a gid index cannot really be done without some kind of lock and ref count scheme. At the very least, that should be the starting point, if we can't get there then patch on a case by case basis why. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 5/1/2015 8:36 PM, Jason Gunthorpe wrote: > On Fri, May 01, 2015 at 09:34:24AM +0300, Matan Barak wrote: > >>> This current scheme is just so ugly, there are so many wonky >>> possibilities. What happens if I remove an IP and then add a new one? >>> The GID index will eventually be re-used, and QPs bound to that gid >>> index will silently change source IPs. Horrible. >> >> This should be handled by the vendor's driver/other future ib_core part. >> This patchset introduces roce_gid_cache that manages the GID table and >> notify vendors about GID changes. >> >> The vendor needs to: >> (a) Move all QPs that use GID x to error state when GID x is deleted from >> the table. >> (b) Change all QPs that use GID x to use a special invalid GID entry. >> (c) Don't delete GIDs that are being used by a QP. > > What about AH's for UD? > The plan is to have read-only memory-mapped AHs for UD. The kernel will create AH with a sequence counter. This AH will be mapped as read-only memory to the user-space. When sending, the user-space will atomically use this AH. If a GID is changed, the kernel will update this GID index internally. That's a long-term goal. > What about clients that discover and then hold the GID index > internally? > > What about the impossible to fix race of returing the GID index in the > work completion and translating that back to an IP? > > It is a terrible scheme, Sean is right, the clients should work with > the actual sock addr, somehow, at least kernel side. Converting from a > sockaddr to a gid index cannot really be done without some kind of > lock and ref count scheme. This is the current behavior as well. The current patch-set doesn't make it any worse or better. We don't expect to fix all world's problems. We could add reference-count in a later patchset. Working with sockaddr has its own (similar) problems - if the net-device's IP is changed - using a sockaddr will just use an old incorrect IP address (which by now could be prohibited by the administrator). > > At the very least, that should be the starting point, if we can't get > there then patch on a case by case basis why. > I agree - we don't want to regress, but we only add the roce_gid_cache in this patchset (next version will postpone adding RoCE V2 to later patchset). > Jason > Matan -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, May 1, 2015 at 8:31 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > On Fri, May 01, 2015 at 09:28:40AM +0300, Matan Barak wrote: > >> >> There might be some ways around it - for example, introduce another >> >> workqueue for roce_rescan_device and flush this workqueue only. >> >> Every way has its advantages and disadvantages. >> > >> > I don't see that either, like I keep saying, all work must complete, >> > more work queues don't really seem to help that. >> > >> >> I think this is a possible solution, as all works but the "rescan" work >> aren't device specific. That means that after we move the rescan work >> to another workqueue and synchronize it in the client->remove function, >> we could be sure the device won't be used by this client anymore. >> >> The rest of the works only iterate on the *existing* device list. Since the >> ib_unregister_device >> (a) lock (b) iterate over client->remove (c) remove from list (d) free >> (e) unlock >> all roce_gid_cache's works won't find it in the list and we'll be safe. > > Well.. be carefull with the locking here, if device_mutex no longer > covers remove() then remove() and ib_enum_roce_ports_of_netdev() can > run concurrently and they will need locking against > free_roce_gid_cache(), which appears to implicitly rely on the > device_mutex (too subtle and fragile) > > If there is a 'global' part and a 'device' specific part, it would > good to make this more obvious and clear: group the functions together, > annotate the difference with naming, a comment, or something. There > are *alot* of little functions in this module, it is hard to > understand the call graph without careful study. > I'll try to rearrange the code a bit better. >> Regarding ib_unregister_client - if we guarantee that after each >> client->remove(dev) >> dev isn't used in this client, we actually guarantee that after removing all IB >> devices no IB device will be used (induction). > > Yes, that is the idea.. > >> > So, ultimtely, this is really a self created problem, and infact, the >> > problem lies with the introduction of ib_enum_roce_ports_of_netdev - >> > adding a new use of the device_mutex that is not register/unregister >> > related exposes the design limitations of that mutex *that Roland >> > clearly explained in a giant comment!* > >> I agree, while it's a general problem - it was first introduced by using >> device_mutex in an asynchronous context (that should be flushed in >> the remove function). > > Well, no, generally speaking you can't use device_mutex in any client > facing API like ib_enum_roce_ports_of_netdev - it isn't just async > contexts, but it means the add() and remove() functions instantly > deadlock if they call a client facing API - that is not sane. > I still haven't looked at this thoroughly, but we might be able to use rwsem (as you suggested) or a RCU protected device list. I think that changing to one of these will make ib_enum_roce_ports_of_netdev safe in async context. >> > - Convert device_mutex into a r/w sem >> >> I'm not sure this will solve the problem, as besides the new enumerate >> devices function, all existing functions update the list - so we'll have >> read-while-write scenario and we'll be in the exact same condition. > > You'd do something like hold the write sem and mutate the device list > to unlink the target device, then switch to a read sem to traverse the > client list. Two locks are clearer if contention is not an issue. > >> Agree, but please consider also the addition of another >> workqueue as a possible solution. It *does* (seem) to answer >> all of your concerns and could be safer and cause less code changes. > > As above, fundamentally, introducing a client API that holds > device_mutex is just wrong, so we need to address this. > > I would also recommend splitting your per-device/global stuff anyhow, > global stuff needs a clear unwind during remove, and so on. It isn't > really optional to have this clarity. > Ok, I'll look into that for the next version of this patchset - probably splitting to rwsem/RCU with clear separation of per-device and global functions. >> kref just hides an atomic refcount - so you're actually saying using >> the abstraction contradicts the kernel object's lifetime strategy? > > kref has specific semantic meaning, it isn't a general use data > structure. > >> I tend to agree. The __exit function flushes the workqueue (so eventually >> we guarentee that no code will execute before the module is unloaded). >> IMHO, after ib_unregister_client returns, the module could still run code >> (this is why the __exit handler exists), but it's not allowed to use any >> data of the module which unregistered it, meaning - we can't allow it to >> use any IB device. > > Yes, I was being terse, of course the .text is still going to be used, > but the general idea of lifetime holds: the .text cannot remain in > service of the ib_device because the __exit handler cannot clean that > up. > >> Ok, I get your point, thanks. I'll keep that call synchronous. The two >> options which are on the table right now are: >> (a) split the device_mutex to device_mutex and client_mutex. >> ib_unregister_device first grabs only client_mutex and after it called >> client->remove it grabs device->mutex to the rest of the work. >> (b) Introduce another workqueue for all device-specific events >> (actually, only rescan). >> This allows us to flush only this workqueue without waiting for any >> work which needs device_mutex. > > The other option is to make rescan sense that the ib_device has been > freed and ignore it, and then flush the work queue that runs that on > module __exit. This might be needed anyhow due to the free race > discussed above. Be sure to add the necessary kref get/puts to hold > the ib device though. > > Jason Thanks for the review. I'll look thoroughly on all the options you mentioned. Matan -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index 18c1ece..8616a95 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -261,6 +261,39 @@ out: return ret; } +static void ib_device_complete_cb(struct kref *kref) +{ + struct ib_device *device = container_of(kref, struct ib_device, + refcount); + + if (device->reg_state >= IB_DEV_UNREGISTERING) + complete(&device->free); +} + +/** + * ib_device_hold - increase the reference count of device + * @device: ib device to prevent from being free'd + * + * Prevent the device from being free'd. + */ +void ib_device_hold(struct ib_device *device) +{ + kref_get(&device->refcount); +} +EXPORT_SYMBOL(ib_device_hold); + +/** + * ib_device_put - decrease the reference count of device + * @device: allows this device to be free'd + * + * Puts the ib_device and allows it to be free'd. + */ +int ib_device_put(struct ib_device *device) +{ + return kref_put(&device->refcount, ib_device_complete_cb); +} +EXPORT_SYMBOL(ib_device_put); + /** * ib_register_device - Register an IB device with IB core * @device:Device to register @@ -312,6 +345,9 @@ int ib_register_device(struct ib_device *device, list_add_tail(&device->core_list, &device_list); + kref_init(&device->refcount); + init_completion(&device->free); + device->reg_state = IB_DEV_REGISTERED; { @@ -342,6 +378,8 @@ void ib_unregister_device(struct ib_device *device) mutex_lock(&device_mutex); + device->reg_state = IB_DEV_UNREGISTERING; + list_for_each_entry_reverse(client, &client_list, list) if (client->remove) client->remove(device); @@ -355,6 +393,9 @@ void ib_unregister_device(struct ib_device *device) ib_device_unregister_sysfs(device); + ib_device_put(device); + wait_for_completion(&device->free); + spin_lock_irqsave(&device->client_data_lock, flags); list_for_each_entry_safe(context, tmp, &device->client_data_list, list) kfree(context); diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 1866595..a7593b0 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -1716,6 +1716,7 @@ struct ib_device { enum { IB_DEV_UNINITIALIZED, IB_DEV_REGISTERED, + IB_DEV_UNREGISTERING, IB_DEV_UNREGISTERED } reg_state; @@ -1728,6 +1729,8 @@ struct ib_device { u32 local_dma_lkey; u8 node_type; u8 phys_port_cnt; + struct kref refcount; + struct completion free; }; struct ib_client { @@ -1741,6 +1744,9 @@ struct ib_client { struct ib_device *ib_alloc_device(size_t size); void ib_dealloc_device(struct ib_device *device); +void ib_device_hold(struct ib_device *device); +int ib_device_put(struct ib_device *device); + int ib_register_device(struct ib_device *device, int (*port_callback)(struct ib_device *, u8, struct kobject *));