diff mbox series

Kernel panic when unbinding UVC gadget function

Message ID 4d7aa3f4-22d9-9f5a-3d70-1bd7148ff4ba@google.com (mailing list archive)
State New, archived
Headers show
Series Kernel panic when unbinding UVC gadget function | expand

Commit Message

Avichal Rakesh July 20, 2023, 1:28 a.m. UTC
Hey all,

I recently ran into a kernel panic when testing the UVC Gadget Driver.
The device ramdumps with the following stack when removing the UVC config from
configfs:

KP: Oops - BUG: Fatal exception: comm:Thread-6 PC:__list_del_entry_valid+0xb0/0xc4 LR:__list_del_entry_valid+0xb0/0xc4
PC: __list_del_entry_valid+0xb0 <FFFFFFE685330294>
LR: __list_del_entry_valid+0xb0 <FFFFFFE685330294>

[<FFFFFFE685330294>] __list_del_entry_valid+0xb0
[<FFFFFFE6857E50AC>] v4l2_fh_del+0x78
[<FFFFFFE685769774>] uvc_v4l2_release+0xd0
[<FFFFFFE6857D9B10>] v4l2_release+0xcc
[<FFFFFFE684EE192C>] __fput+0xf8
[<FFFFFFE684EE17CC>] ____fput+0xc
[<FFFFFFE684B5C9E0>] task_work_run+0x138

This looks like a side effect of
https://lore.kernel.org/lkml/20230608204517.105396-1-badhri@google.com/.
Effectively, UVC function tried to disconnect the gadget before 
cleaning up resources. However, usb_gadget_unregister_driver which is 
removing the function prevents the gadget from disconnecting until the 
function is unbound.

As of the patch mentioned above, gadget_unbind_driver holds
udc->connect_lock and calls both usb_gadget_disconnect_locked and
udc->driver->unbind one after the other.

usb_gadget_disconnect_locked calls into UVC Gadget driver as follows:

1. usb_gadget_disconnect_locked
2. configfs_composite_disconnect
3. __composite_disconnect
4. uvc_function_disable

udc->driver->unbind calls into UVC Gadget driver as follows:

1. udc->driver->unbind
2. configfs_composite_unbind
3. purge_configs_funcs
4. uvc_function_unbind

uvc_function_disable notifies the userspace application with
UVC_EVENT_DISCONNECTED which causes the V4L2 node to be released
(or unsubscribed to). Either on unsubscribe or on release, the UVC Gadget
Driver calls uvc_function_disconnect before cleaning up resources. Following
is the problematic stack from uvc_v4l2_disable.

1. uvc_v4l2_disable
2. uvc_function_disconnect
3. usb_function_deactivate
4. usb_gadget_deactivate

usb_gadget_deactivate attempts to lock udc->connect_lock as of the patch
mentioned above.

This means that attempting to unregister the UVC Gadget Driver results in the
V4L2 resource cleanup being stuck behind udc->connect_lock, which will only
be released after uvc_function_unbind finishes. This results in either the
gadget deactivating after the unbind process has finished, or in a Kernel Panic
as it tries to cleanup a V4L2 node that has been purged.

This leaves us with two options:
1. We "fix" the locking in core.c to restore old behavior, and let the
   usb_gadget_deactivate call go through without locking. However,
   I am not sure about the specifics of the patch were and what exact issue it
   was trying to fix. 

   Badhri, would you know if relaxing the constraints on 
   usb_gadget_deactivate is feasible? It is possible that other gadget drivers
   run into similar issues as UVC driver.

3. UVC Gadget Driver calls usb_function_deactivate to take down the gadget if
   the userspace application stops listening to the V4L2 node. However, AFAICT
   disable is called as a part of the gadget resetting. So, if the V4L2 node 
   is released because of UVC_EVENT_DISCONNECT, we can skip calling 
   usb_function_deactivate as the gadget will be reset anyway.

   usb_function documentation seems to agree that if 'disable' is called,
   the gadget will be reset/reconfigured shortly:

     @disable: (REQUIRED) Indicates the function should be disabled.  Reasons
      *	  include host resetting or reconfiguring the gadget, and disconnection.

A dirty Patch for option 2 is attached below which skips calling 
usb_function_deactivate if uvc_function_disable was called before. It seems 
to work okay in testing. Let me know if the analysis and solutions seems okay
and I can upload a formal patch.

Thank you!

---
 drivers/usb/gadget/function/f_uvc.c | 12 ++++++++++--
 drivers/usb/gadget/function/uvc.h   |  1 +
 2 files changed, 11 insertions(+), 2 deletions(-)

--

Comments

Dan Scally July 21, 2023, 3:57 p.m. UTC | #1
Hi Avichal - thanks for all the detail

On 20/07/2023 02:28, Avichal Rakesh wrote:
> Hey all,
>
> I recently ran into a kernel panic when testing the UVC Gadget Driver.
> The device ramdumps with the following stack when removing the UVC config from
> configfs:
>
> KP: Oops - BUG: Fatal exception: comm:Thread-6 PC:__list_del_entry_valid+0xb0/0xc4 LR:__list_del_entry_valid+0xb0/0xc4
> PC: __list_del_entry_valid+0xb0 <FFFFFFE685330294>
> LR: __list_del_entry_valid+0xb0 <FFFFFFE685330294>
>
> [<FFFFFFE685330294>] __list_del_entry_valid+0xb0
> [<FFFFFFE6857E50AC>] v4l2_fh_del+0x78
> [<FFFFFFE685769774>] uvc_v4l2_release+0xd0
> [<FFFFFFE6857D9B10>] v4l2_release+0xcc
> [<FFFFFFE684EE192C>] __fput+0xf8
> [<FFFFFFE684EE17CC>] ____fput+0xc
> [<FFFFFFE684B5C9E0>] task_work_run+0x138
>
> This looks like a side effect of
> https://lore.kernel.org/lkml/20230608204517.105396-1-badhri@google.com/.
> Effectively, UVC function tried to disconnect the gadget before
> cleaning up resources. However, usb_gadget_unregister_driver which is
> removing the function prevents the gadget from disconnecting until the
> function is unbound.
>
> As of the patch mentioned above, gadget_unbind_driver holds
> udc->connect_lock and calls both usb_gadget_disconnect_locked and
> udc->driver->unbind one after the other.
>
> usb_gadget_disconnect_locked calls into UVC Gadget driver as follows:
>
> 1. usb_gadget_disconnect_locked
> 2. configfs_composite_disconnect
> 3. __composite_disconnect
> 4. uvc_function_disable
>
> udc->driver->unbind calls into UVC Gadget driver as follows:
>
> 1. udc->driver->unbind
> 2. configfs_composite_unbind
> 3. purge_configs_funcs
> 4. uvc_function_unbind
>
> uvc_function_disable notifies the userspace application with
> UVC_EVENT_DISCONNECTED which causes the V4L2 node to be released
> (or unsubscribed to). Either on unsubscribe or on release, the UVC Gadget
> Driver calls uvc_function_disconnect before cleaning up resources. Following
> is the problematic stack from uvc_v4l2_disable.
>
> 1. uvc_v4l2_disable
> 2. uvc_function_disconnect
> 3. usb_function_deactivate
> 4. usb_gadget_deactivate
>
> usb_gadget_deactivate attempts to lock udc->connect_lock as of the patch
> mentioned above.
>
> This means that attempting to unregister the UVC Gadget Driver results in the
> V4L2 resource cleanup being stuck behind udc->connect_lock, which will only
> be released after uvc_function_unbind finishes. This results in either the
> gadget deactivating after the unbind process has finished, or in a Kernel Panic
> as it tries to cleanup a V4L2 node that has been purged.
>
> This leaves us with two options:
> 1. We "fix" the locking in core.c to restore old behavior, and let the
>     usb_gadget_deactivate call go through without locking. However,
>     I am not sure about the specifics of the patch were and what exact issue it
>     was trying to fix.
>
>     Badhri, would you know if relaxing the constraints on
>     usb_gadget_deactivate is feasible? It is possible that other gadget drivers
>     run into similar issues as UVC driver.
>
> 3. UVC Gadget Driver calls usb_function_deactivate to take down the gadget if
>     the userspace application stops listening to the V4L2 node. However, AFAICT
>     disable is called as a part of the gadget resetting. So, if the V4L2 node
>     is released because of UVC_EVENT_DISCONNECT, we can skip calling
>     usb_function_deactivate as the gadget will be reset anyway.
>
>     usb_function documentation seems to agree that if 'disable' is called,
>     the gadget will be reset/reconfigured shortly:
>
>       @disable: (REQUIRED) Indicates the function should be disabled.  Reasons
>        *	  include host resetting or reconfiguring the gadget, and disconnection.
>
> A dirty Patch for option 2 is attached below which skips calling
> usb_function_deactivate if uvc_function_disable was called before. It seems
> to work okay in testing. Let me know if the analysis and solutions seems okay
> and I can upload a formal patch.


For what it's worth the analysis makes sense; the patch looks ok to me so if the conclusion is to 
fix the problem that way I think it's fine, but I'm more inclined to consider this a locking problem 
in core - it'd be better to fix it there I think.

> Thank you!
>
> ---
>   drivers/usb/gadget/function/f_uvc.c | 12 ++++++++++--
>   drivers/usb/gadget/function/uvc.h   |  1 +
>   2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/usb/gadget/function/f_uvc.c b/drivers/usb/gadget/function/f_uvc.c
> index 5e919fb65833..cef92243f1f7 100644
> --- a/drivers/usb/gadget/function/f_uvc.c
> +++ b/drivers/usb/gadget/function/f_uvc.c
> @@ -385,7 +385,7 @@ uvc_function_disable(struct usb_function *f)
>   	v4l2_event.type = UVC_EVENT_DISCONNECT;
>   	v4l2_event_queue(&uvc->vdev, &v4l2_event);
>   
> -	uvc->state = UVC_STATE_DISCONNECTED;
> +	uvc->state = UVC_STATE_HOST_DISCONNECTED;
>   
>   	usb_ep_disable(uvc->video.ep);
>   	if (uvc->enable_interrupt_ep)
> @@ -410,8 +410,16 @@ uvc_function_disconnect(struct uvc_device *uvc)
>   {
>   	int ret;
>   
> -	if ((ret = usb_function_deactivate(&uvc->func)) < 0)
> +	if (uvc->state == UVC_STATE_HOST_DISCONNECTED) {
> +		/*
> +		 * Don't deactivate gadget as this is being called in
> +		 * response to the host resetting. Gadget will be deactivated
> +		 * anyway. Just update to state as acknowledgement
> +		 */
> +		uvc->state = UVC_STATE_DISCONNECTED;
> +	} else if ((ret = usb_function_deactivate(&uvc->func)) < 0) {
>   		uvcg_info(&uvc->func, "UVC disconnect failed with %d\n", ret);
> +	}
>   }
>   
>   /* --------------------------------------------------------------------------
> diff --git a/drivers/usb/gadget/function/uvc.h b/drivers/usb/gadget/function/uvc.h
> index 100475b1363e..f1e2bc98dc61 100644
> --- a/drivers/usb/gadget/function/uvc.h
> +++ b/drivers/usb/gadget/function/uvc.h
> @@ -120,6 +120,7 @@ struct uvc_video {
>   };
>   
>   enum uvc_state {
> +	UVC_STATE_HOST_DISCONNECTED,
>   	UVC_STATE_DISCONNECTED,
>   	UVC_STATE_CONNECTED,
>   	UVC_STATE_STREAMING,
Alan Stern July 21, 2023, 7:32 p.m. UTC | #2
On Fri, Jul 21, 2023 at 04:57:51PM +0100, Dan Scally wrote:
> Hi Avichal - thanks for all the detail
> 
> On 20/07/2023 02:28, Avichal Rakesh wrote:

> > This looks like a side effect of
> > https://lore.kernel.org/lkml/20230608204517.105396-1-badhri@google.com/.
> > Effectively, UVC function tried to disconnect the gadget before
> > cleaning up resources. However, usb_gadget_unregister_driver which is
> > removing the function prevents the gadget from disconnecting until the
> > function is unbound.

> > A dirty Patch for option 2 is attached below which skips calling
> > usb_function_deactivate if uvc_function_disable was called before. It seems
> > to work okay in testing. Let me know if the analysis and solutions seems okay
> > and I can upload a formal patch.
> 
> 
> For what it's worth the analysis makes sense; the patch looks ok to me so if
> the conclusion is to fix the problem that way I think it's fine, but I'm
> more inclined to consider this a locking problem in core - it'd be better to
> fix it there I think.

I'm not so sure that handling this in the core is feasible.  Removing 
the driver obviously needs to be synchronized with deactivation, since 
the two actions affect the same parts of the state (i.e., the pull-ups 
and the "connected" flag).

Consequently I don't see how to avoid a deadlock if the driver's unbind 
callback does a deactivate.  Besides, as the patch mentions, doing so is 
never necessary.

However, even with that call removed we could still have a problem.  I 
don't know much about how the UVC function driver works, but it would be 
reasonable for the driver to have a private mutex that gets held both 
during unbind and when the user application closes the V4L2 node.  Then 
there's an ABBA locking issue:

	Unbind: The UDC core holds connect_lock while calling the UVC
	unbind handler, which needs to acquire the private mutex.

	Close node: The UVC driver holds the private mutex while doing
	a deactivate, which needs to acquire connect_lock.

Any ideas on how to clear this up?

Alan Stern
Avichal Rakesh July 21, 2023, 10:44 p.m. UTC | #3
Thank you both, Dan and Alan, for your comments!

On Fri, Jul 21, 2023 at 12:32 PM Alan Stern <stern@rowland.harvard.edu> wrote:
>
> On Fri, Jul 21, 2023 at 04:57:51PM +0100, Dan Scally wrote:
> > Hi Avichal - thanks for all the detail
> >
> > > A dirty Patch for option 2 is attached below which skips calling
> > > usb_function_deactivate if uvc_function_disable was called before. It seems
> > > to work okay in testing. Let me know if the analysis and solutions seems okay
> > > and I can upload a formal patch.
> >
> >
> > For what it's worth the analysis makes sense; the patch looks ok to me so if
> > the conclusion is to fix the problem that way I think it's fine, but I'm
> > more inclined to consider this a locking problem in core - it'd be better to
> > fix it there I think.
>
> I'm not so sure that handling this in the core is feasible.  Removing
> the driver obviously needs to be synchronized with deactivation, since
> the two actions affect the same parts of the state (i.e., the pull-ups
> and the "connected" flag).

I don't have the full context on what caused the locking to be added,
but now that it
in place, it seems like there needs to be a clarification of
expectation between core
and the gadget drivers. Is it valid for the gadget drivers to call
usb_gadget_deactivate (and similar functions) as a part of disable/unbind
(in terms of API/expectations)?

1. If yes, maybe core can track when it is in the middle of resetting and
drop calls to usb_gadget_deactivate if called in the middle of the
disconnect--->unbind sequence. This is effectively what the patch above
does in UVC driver, but core might (with some extra state) have stronger
guarantees of when a call is redundant and can be safely dropped.

2. If no, then it becomes the gadget's responsibility to ensure that it doesn't
call any of the usb_gadget_* functions when disabling/unbinding. However, it
does require core to provide some concrete rules around when things are safe
to call, and when they aren't.

>
> Consequently I don't see how to avoid a deadlock if the driver's unbind
> callback does a deactivate.  Besides, as the patch mentions, doing so is
> never necessary.
>
> However, even with that call removed we could still have a problem.  I
> don't know much about how the UVC function driver works, but it would be
> reasonable for the driver to have a private mutex that gets held both
> during unbind and when the user application closes the V4L2 node.  Then
> there's an ABBA locking issue:
>
>         Unbind: The UDC core holds connect_lock while calling the UVC
>         unbind handler, which needs to acquire the private mutex.
>
>         Close node: The UVC driver holds the private mutex while doing
>         a deactivate, which needs to acquire connect_lock.
>
> Any ideas on how to clear this up?
>

I think my question above gives us two options out based on the answer:

1. Core handling redundant calls is the more bullet-proof solution IMO. It
means that the gadget driver never holds connect_lock when it shouldn't.
No more ABBA!

One potential implementation is to track when core is resetting in a protected
flag. All functions related to resetting/disconnecting would check the
flag before
locking connect_lock and would become no-ops if gadget is in the middle of
resetting.

2. Some stronger guarantees will let the gadget driver's state machine decide
if it can call usb_gadget_* functions. For example, if we can say for sure that
disable call will always be followed by the unbind call, and that usb_gadget_*
functions are disallowed between the two, then UVC driver can handle ABBA
situation by tracking when it is between a disable and unbind call and skip
calling usb_gadget_* function until unbind finishes.

The downside of (2), is that a poorly written (or malicious) gadget driver can
grind the gadget to a halt with a somewhat simple deadlock.

Unfortunately, I am travelling over the next week, but I'll try to
create and attach
a dirty patch for core to handle redundant calls to usb_gadget_* over the next
week.

I am fairly new and don't know the full semantics around core, so if I
am missing
something simple here, please do let me know!

Regards,
Avi
Alan Stern July 22, 2023, 3:27 p.m. UTC | #4
On Fri, Jul 21, 2023 at 03:44:52PM -0700, Avichal Rakesh wrote:
> Thank you both, Dan and Alan, for your comments!
> 
> On Fri, Jul 21, 2023 at 12:32 PM Alan Stern <stern@rowland.harvard.edu> wrote:
> >
> > On Fri, Jul 21, 2023 at 04:57:51PM +0100, Dan Scally wrote:
> > > Hi Avichal - thanks for all the detail
> > >
> > > > A dirty Patch for option 2 is attached below which skips calling
> > > > usb_function_deactivate if uvc_function_disable was called before. It seems
> > > > to work okay in testing. Let me know if the analysis and solutions seems okay
> > > > and I can upload a formal patch.
> > >
> > >
> > > For what it's worth the analysis makes sense; the patch looks ok to me so if
> > > the conclusion is to fix the problem that way I think it's fine, but I'm
> > > more inclined to consider this a locking problem in core - it'd be better to
> > > fix it there I think.
> >
> > I'm not so sure that handling this in the core is feasible.  Removing
> > the driver obviously needs to be synchronized with deactivation, since
> > the two actions affect the same parts of the state (i.e., the pull-ups
> > and the "connected" flag).
> 
> I don't have the full context on what caused the locking to be added,
> but now that it
> in place, it seems like there needs to be a clarification of
> expectation between core
> and the gadget drivers. Is it valid for the gadget drivers to call
> usb_gadget_deactivate (and similar functions) as a part of disable/unbind
> (in terms of API/expectations)?
> 
> 1. If yes, maybe core can track when it is in the middle of resetting and
> drop calls to usb_gadget_deactivate if called in the middle of the
> disconnect--->unbind sequence. This is effectively what the patch above
> does in UVC driver, but core might (with some extra state) have stronger
> guarantees of when a call is redundant and can be safely dropped.
> 
> 2. If no, then it becomes the gadget's responsibility to ensure that it doesn't
> call any of the usb_gadget_* functions when disabling/unbinding. However, it
> does require core to provide some concrete rules around when things are safe
> to call, and when they aren't.
> 
> >
> > Consequently I don't see how to avoid a deadlock if the driver's unbind
> > callback does a deactivate.  Besides, as the patch mentions, doing so is
> > never necessary.
> >
> > However, even with that call removed we could still have a problem.  I
> > don't know much about how the UVC function driver works, but it would be
> > reasonable for the driver to have a private mutex that gets held both
> > during unbind and when the user application closes the V4L2 node.  Then
> > there's an ABBA locking issue:
> >
> >         Unbind: The UDC core holds connect_lock while calling the UVC
> >         unbind handler, which needs to acquire the private mutex.
> >
> >         Close node: The UVC driver holds the private mutex while doing
> >         a deactivate, which needs to acquire connect_lock.
> >
> > Any ideas on how to clear this up?
> >
> 
> I think my question above gives us two options out based on the answer:
> 
> 1. Core handling redundant calls is the more bullet-proof solution IMO. It
> means that the gadget driver never holds connect_lock when it shouldn't.
> No more ABBA!
> 
> One potential implementation is to track when core is resetting in a protected
> flag. All functions related to resetting/disconnecting would check the
> flag before
> locking connect_lock and would become no-ops if gadget is in the middle of
> resetting.
> 
> 2. Some stronger guarantees will let the gadget driver's state machine decide
> if it can call usb_gadget_* functions. For example, if we can say for sure that
> disable call will always be followed by the unbind call, and that usb_gadget_*
> functions are disallowed between the two, then UVC driver can handle ABBA
> situation by tracking when it is between a disable and unbind call and skip
> calling usb_gadget_* function until unbind finishes.
> 
> The downside of (2), is that a poorly written (or malicious) gadget driver can
> grind the gadget to a halt with a somewhat simple deadlock.
> 
> Unfortunately, I am travelling over the next week, but I'll try to
> create and attach
> a dirty patch for core to handle redundant calls to usb_gadget_* over the next
> week.
> 
> I am fairly new and don't know the full semantics around core, so if I
> am missing
> something simple here, please do let me know!

Here's a proposal, along the lines of your first suggestion above.  The 
idea is to avoid holding the connect_lock mutex while invoking a gadget 
driver's callbacks.

Unfortunately, this is unavoidable in the case of the ->disconnect() 
callback, but that's okay because the kerneldoc already says that 
->disconnect() may be called in_interrupt, so it's not allowed to call 
any core routines that may sleep.  Just to make this perfectly clear, 
the patch adds a couple of comments along these lines.

As far as I can tell, there is no real reason to hold connect_lock 
during the ->unbind() callback.  Not doing so should fix the problem in 
the UVC function driver.

Let me know if this works.

Alan Stern



 drivers/usb/gadget/udc/core.c |    9 +++++++++
 1 file changed, 9 insertions(+)

Index: usb-devel/drivers/usb/gadget/udc/core.c
===================================================================
--- usb-devel.orig/drivers/usb/gadget/udc/core.c
+++ usb-devel/drivers/usb/gadget/udc/core.c
@@ -822,6 +822,9 @@ EXPORT_SYMBOL_GPL(usb_gadget_disconnect)
  * usb_gadget_activate() is called.  For example, user mode components may
  * need to be activated before the system can talk to hosts.
  *
+ * This routine may sleep; it must not be called in interrupt context
+ * (such as from within a gadget driver's disconnect() callback).
+ *
  * Returns zero on success, else negative errno.
  */
 int usb_gadget_deactivate(struct usb_gadget *gadget)
@@ -860,6 +863,8 @@ EXPORT_SYMBOL_GPL(usb_gadget_deactivate)
  * This routine activates gadget which was previously deactivated with
  * usb_gadget_deactivate() call. It calls usb_gadget_connect() if needed.
  *
+ * This routine may sleep; it must not be called in interrupt context.
+ *
  * Returns zero on success, else negative errno.
  */
 int usb_gadget_activate(struct usb_gadget *gadget)
@@ -1639,7 +1644,11 @@ static void gadget_unbind_driver(struct
 	usb_gadget_disable_async_callbacks(udc);
 	if (gadget->irq)
 		synchronize_irq(gadget->irq);
+	mutex_unlock(&udc->connect_lock);
+
 	udc->driver->unbind(gadget);
+
+	mutex_lock(&udc->connect_lock);
 	usb_gadget_udc_stop_locked(udc);
 	mutex_unlock(&udc->connect_lock);
Avichal Rakesh July 28, 2023, 8:27 a.m. UTC | #5
On Sat, Jul 22, 2023 at 8:57 PM Alan Stern <stern@rowland.harvard.edu> wrote:
>
> On Fri, Jul 21, 2023 at 03:44:52PM -0700, Avichal Rakesh wrote:
> > Thank you both, Dan and Alan, for your comments!
> >
> > On Fri, Jul 21, 2023 at 12:32 PM Alan Stern <stern@rowland.harvard.edu> wrote:
> > >
> > > On Fri, Jul 21, 2023 at 04:57:51PM +0100, Dan Scally wrote:
> > > > Hi Avichal - thanks for all the detail
> > > >
> > > > > A dirty Patch for option 2 is attached below which skips calling
> > > > > usb_function_deactivate if uvc_function_disable was called before. It seems
> > > > > to work okay in testing. Let me know if the analysis and solutions seems okay
> > > > > and I can upload a formal patch.
> > > >
> > > >
> > > > For what it's worth the analysis makes sense; the patch looks ok to me so if
> > > > the conclusion is to fix the problem that way I think it's fine, but I'm
> > > > more inclined to consider this a locking problem in core - it'd be better to
> > > > fix it there I think.
> > >
> > > I'm not so sure that handling this in the core is feasible.  Removing
> > > the driver obviously needs to be synchronized with deactivation, since
> > > the two actions affect the same parts of the state (i.e., the pull-ups
> > > and the "connected" flag).
> >
> > I don't have the full context on what caused the locking to be added,
> > but now that it
> > in place, it seems like there needs to be a clarification of
> > expectation between core
> > and the gadget drivers. Is it valid for the gadget drivers to call
> > usb_gadget_deactivate (and similar functions) as a part of disable/unbind
> > (in terms of API/expectations)?
> >
> > 1. If yes, maybe core can track when it is in the middle of resetting and
> > drop calls to usb_gadget_deactivate if called in the middle of the
> > disconnect--->unbind sequence. This is effectively what the patch above
> > does in UVC driver, but core might (with some extra state) have stronger
> > guarantees of when a call is redundant and can be safely dropped.
> >
> > 2. If no, then it becomes the gadget's responsibility to ensure that it doesn't
> > call any of the usb_gadget_* functions when disabling/unbinding. However, it
> > does require core to provide some concrete rules around when things are safe
> > to call, and when they aren't.
> >
> > >
> > > Consequently I don't see how to avoid a deadlock if the driver's unbind
> > > callback does a deactivate.  Besides, as the patch mentions, doing so is
> > > never necessary.
> > >
> > > However, even with that call removed we could still have a problem.  I
> > > don't know much about how the UVC function driver works, but it would be
> > > reasonable for the driver to have a private mutex that gets held both
> > > during unbind and when the user application closes the V4L2 node.  Then
> > > there's an ABBA locking issue:
> > >
> > >         Unbind: The UDC core holds connect_lock while calling the UVC
> > >         unbind handler, which needs to acquire the private mutex.
> > >
> > >         Close node: The UVC driver holds the private mutex while doing
> > >         a deactivate, which needs to acquire connect_lock.
> > >
> > > Any ideas on how to clear this up?
> > >
> >
> > I think my question above gives us two options out based on the answer:
> >
> > 1. Core handling redundant calls is the more bullet-proof solution IMO. It
> > means that the gadget driver never holds connect_lock when it shouldn't.
> > No more ABBA!
> >
> > One potential implementation is to track when core is resetting in a protected
> > flag. All functions related to resetting/disconnecting would check the
> > flag before
> > locking connect_lock and would become no-ops if gadget is in the middle of
> > resetting.
> >
> > 2. Some stronger guarantees will let the gadget driver's state machine decide
> > if it can call usb_gadget_* functions. For example, if we can say for sure that
> > disable call will always be followed by the unbind call, and that usb_gadget_*
> > functions are disallowed between the two, then UVC driver can handle ABBA
> > situation by tracking when it is between a disable and unbind call and skip
> > calling usb_gadget_* function until unbind finishes.
> >
> > The downside of (2), is that a poorly written (or malicious) gadget driver can
> > grind the gadget to a halt with a somewhat simple deadlock.
> >
> > Unfortunately, I am travelling over the next week, but I'll try to
> > create and attach
> > a dirty patch for core to handle redundant calls to usb_gadget_* over the next
> > week.
> >
> > I am fairly new and don't know the full semantics around core, so if I
> > am missing
> > something simple here, please do let me know!
>
> Here's a proposal, along the lines of your first suggestion above.  The
> idea is to avoid holding the connect_lock mutex while invoking a gadget
> driver's callbacks.
>
> Unfortunately, this is unavoidable in the case of the ->disconnect()
> callback, but that's okay because the kerneldoc already says that
> ->disconnect() may be called in_interrupt, so it's not allowed to call
> any core routines that may sleep.  Just to make this perfectly clear,
> the patch adds a couple of comments along these lines.
>
> As far as I can tell, there is no real reason to hold connect_lock
> during the ->unbind() callback.  Not doing so should fix the problem in
> the UVC function driver.

Thank you for the patch (and apologies for the delay)! This is a neat
fix I completely glossed over. Looked around at other places where
unbind is called, and it looks like the lock isn't held anywhere else
either, so dropping the lock before calling unbind shouldn't break any
existing assumptions around the callback.

>
> Let me know if this works.
>
>
>  drivers/usb/gadget/udc/core.c |    9 +++++++++
>  1 file changed, 9 insertions(+)
>
> Index: usb-devel/drivers/usb/gadget/udc/core.c
> ===================================================================
> --- usb-devel.orig/drivers/usb/gadget/udc/core.c
> +++ usb-devel/drivers/usb/gadget/udc/core.c
> @@ -822,6 +822,9 @@ EXPORT_SYMBOL_GPL(usb_gadget_disconnect)
>   * usb_gadget_activate() is called.  For example, user mode components may
>   * need to be activated before the system can talk to hosts.
>   *
> + * This routine may sleep; it must not be called in interrupt context
> + * (such as from within a gadget driver's disconnect() callback).
> + *
>   * Returns zero on success, else negative errno.
>   */
>  int usb_gadget_deactivate(struct usb_gadget *gadget)
> @@ -860,6 +863,8 @@ EXPORT_SYMBOL_GPL(usb_gadget_deactivate)
>   * This routine activates gadget which was previously deactivated with
>   * usb_gadget_deactivate() call. It calls usb_gadget_connect() if needed.
>   *
> + * This routine may sleep; it must not be called in interrupt context.
> + *
>   * Returns zero on success, else negative errno.
>   */
>  int usb_gadget_activate(struct usb_gadget *gadget)
> @@ -1639,7 +1644,11 @@ static void gadget_unbind_driver(struct
>         usb_gadget_disable_async_callbacks(udc);
>         if (gadget->irq)
>                 synchronize_irq(gadget->irq);
> +       mutex_unlock(&udc->connect_lock);
> +
>         udc->driver->unbind(gadget);
> +
> +       mutex_lock(&udc->connect_lock);
>         usb_gadget_udc_stop_locked(udc);
>         mutex_unlock(&udc->connect_lock);
>
>

Tried the patch, and it fixes the issue for UVC Gadget Driver! UVC
driver can now be unbound without locking up with the V4L2 release
callback. Logs confirm that the calls are still being interleaved, but
don't result in a deadlock now.

Regards,
Avi.
Alan Stern July 28, 2023, 2:05 p.m. UTC | #6
On Fri, Jul 28, 2023 at 01:57:15PM +0530, Avichal Rakesh wrote:
> On Sat, Jul 22, 2023 at 8:57 PM Alan Stern <stern@rowland.harvard.edu> wrote:

> > Here's a proposal, along the lines of your first suggestion above.  The
> > idea is to avoid holding the connect_lock mutex while invoking a gadget
> > driver's callbacks.
> >
> > Unfortunately, this is unavoidable in the case of the ->disconnect()
> > callback, but that's okay because the kerneldoc already says that
> > ->disconnect() may be called in_interrupt, so it's not allowed to call
> > any core routines that may sleep.  Just to make this perfectly clear,
> > the patch adds a couple of comments along these lines.
> >
> > As far as I can tell, there is no real reason to hold connect_lock
> > during the ->unbind() callback.  Not doing so should fix the problem in
> > the UVC function driver.
> 
> Thank you for the patch (and apologies for the delay)! This is a neat
> fix I completely glossed over. Looked around at other places where
> unbind is called, and it looks like the lock isn't held anywhere else
> either, so dropping the lock before calling unbind shouldn't break any
> existing assumptions around the callback.

> Tried the patch, and it fixes the issue for UVC Gadget Driver! UVC
> driver can now be unbound without locking up with the V4L2 release
> callback. Logs confirm that the calls are still being interleaved, but
> don't result in a deadlock now.

Thanks for trying it out.  Is it okay for me to add your "Tested-by:"  
tag when the patch is submitted?

Alan Stern
Alan Stern July 28, 2023, 3:37 p.m. UTC | #7
On Fri, Jul 28, 2023 at 10:05:39AM -0400, Alan Stern wrote:
> On Fri, Jul 28, 2023 at 01:57:15PM +0530, Avichal Rakesh wrote:
> > On Sat, Jul 22, 2023 at 8:57 PM Alan Stern <stern@rowland.harvard.edu> wrote:
> 
> > > Here's a proposal, along the lines of your first suggestion above.  The
> > > idea is to avoid holding the connect_lock mutex while invoking a gadget
> > > driver's callbacks.
> > >
> > > Unfortunately, this is unavoidable in the case of the ->disconnect()
> > > callback, but that's okay because the kerneldoc already says that
> > > ->disconnect() may be called in_interrupt, so it's not allowed to call
> > > any core routines that may sleep.  Just to make this perfectly clear,
> > > the patch adds a couple of comments along these lines.
> > >
> > > As far as I can tell, there is no real reason to hold connect_lock
> > > during the ->unbind() callback.  Not doing so should fix the problem in
> > > the UVC function driver.
> > 
> > Thank you for the patch (and apologies for the delay)! This is a neat
> > fix I completely glossed over. Looked around at other places where
> > unbind is called, and it looks like the lock isn't held anywhere else
> > either, so dropping the lock before calling unbind shouldn't break any
> > existing assumptions around the callback.
> 
> > Tried the patch, and it fixes the issue for UVC Gadget Driver! UVC
> > driver can now be unbound without locking up with the V4L2 release
> > callback. Logs confirm that the calls are still being interleaved, but
> > don't result in a deadlock now.
> 
> Thanks for trying it out.  Is it okay for me to add your "Tested-by:"  
> tag when the patch is submitted?

Another thing...

It's not clear that the patch will fix the problem entirely.  Your 
original analysis of the bug stated:

> This means that attempting to unregister the UVC Gadget Driver results in the
> V4L2 resource cleanup being stuck behind udc->connect_lock, which will only
> be released after uvc_function_unbind finishes. This results in either the
> gadget deactivating after the unbind process has finished, or in a Kernel Panic
> as it tries to cleanup a V4L2 node that has been purged.

My patch removes the locking issue.  But if an execution path can 
occur with a lock present, it can also occur when the lock has been 
removed.  That means it may still be possible for the UVC gadget driver 
to try deactivating the UDC after the unbind process has finished or for 
it to try cleaning up a V4L2 node that has been purged.

If either of those really could have happened (as opposed to just 
getting stuck in a deadlock, waiting for a mutex that would never be 
released), then it can still happen with the patch.  Fixing them would 
require changes to the UVC gadget driver.  So the problem may not be 
gone entirely.

Alan Stern
Avichal Rakesh July 28, 2023, 6:11 p.m. UTC | #8
On Fri, Jul 28, 2023 at 9:07 PM Alan Stern <stern@rowland.harvard.edu> wrote:
>
> On Fri, Jul 28, 2023 at 10:05:39AM -0400, Alan Stern wrote:
> > On Fri, Jul 28, 2023 at 01:57:15PM +0530, Avichal Rakesh wrote:
> > > On Sat, Jul 22, 2023 at 8:57 PM Alan Stern <stern@rowland.harvard.edu> wrote:
> >
> > > > Here's a proposal, along the lines of your first suggestion above.  The
> > > > idea is to avoid holding the connect_lock mutex while invoking a gadget
> > > > driver's callbacks.
> > > >
> > > > Unfortunately, this is unavoidable in the case of the ->disconnect()
> > > > callback, but that's okay because the kerneldoc already says that
> > > > ->disconnect() may be called in_interrupt, so it's not allowed to call
> > > > any core routines that may sleep.  Just to make this perfectly clear,
> > > > the patch adds a couple of comments along these lines.
> > > >
> > > > As far as I can tell, there is no real reason to hold connect_lock
> > > > during the ->unbind() callback.  Not doing so should fix the problem in
> > > > the UVC function driver.
> > >
> > > Thank you for the patch (and apologies for the delay)! This is a neat
> > > fix I completely glossed over. Looked around at other places where
> > > unbind is called, and it looks like the lock isn't held anywhere else
> > > either, so dropping the lock before calling unbind shouldn't break any
> > > existing assumptions around the callback.
> >
> > > Tried the patch, and it fixes the issue for UVC Gadget Driver! UVC
> > > driver can now be unbound without locking up with the V4L2 release
> > > callback. Logs confirm that the calls are still being interleaved, but
> > > don't result in a deadlock now.
> >
> > Thanks for trying it out.  Is it okay for me to add your "Tested-by:"
> > tag when the patch is submitted?

Oh, yes of course!

>
> Another thing...
>
> It's not clear that the patch will fix the problem entirely.  Your
> original analysis of the bug stated:
>
> > This means that attempting to unregister the UVC Gadget Driver results in the
> > V4L2 resource cleanup being stuck behind udc->connect_lock, which will only
> > be released after uvc_function_unbind finishes. This results in either the
> > gadget deactivating after the unbind process has finished, or in a Kernel Panic
> > as it tries to cleanup a V4L2 node that has been purged.
>
> My patch removes the locking issue.  But if an execution path can
> occur with a lock present, it can also occur when the lock has been
> removed.  That means it may still be possible for the UVC gadget driver
> to try deactivating the UDC after the unbind process has finished or for
> it to try cleaning up a V4L2 node that has been purged.
>
> If either of those really could have happened (as opposed to just
> getting stuck in a deadlock, waiting for a mutex that would never be
> released), then it can still happen with the patch.  Fixing them would
> require changes to the UVC gadget driver.  So the problem may not be
> gone entirely.
>
The current situation can theoretically happen without the deadlock,
yes, but shouldn't happen in practice. UVC's disable/unbind code flow
looks as follows:

1. When disable callback is called, the gadget driver signals the
userspace application to close the V4L2 node.
2. Closing the V4L2 node calls the release callback to clean up
resources. It is this callback that calls into gadget_deactivate and
gets blocked currently (without your patch).
3. Separately, the unbind callback waits on release callback to finish
for 500ms, assuming the userspace application to behave well and close
the node in a reasonable amount of time.
4. If the release callback still hasn't been called, the V4L2 node is
forcefully removed and UVC driver waits for another 1000ms for the
release callback to clean up any pending resources.
5. The unbind process continues regardless of the status of release
callback after waiting at most 1.5s for release.

So the only way to run into the current issue is if the release
callback fails to finish in both step 3 and step 4 (for example, due
to a deadlock) in the span of 1.5s. It is possible, but fairly
unlikely (at least in my limited understanding) for the release
callback to be delayed for quite that long.

Hope that makes some sense!

- Avi.
Alan Stern July 28, 2023, 9:03 p.m. UTC | #9
On Fri, Jul 28, 2023 at 11:41:19PM +0530, Avichal Rakesh wrote:
> On Fri, Jul 28, 2023 at 9:07 PM Alan Stern <stern@rowland.harvard.edu> wrote:
> > It's not clear that the patch will fix the problem entirely.  Your
> > original analysis of the bug stated:
> >
> > > This means that attempting to unregister the UVC Gadget Driver results in the
> > > V4L2 resource cleanup being stuck behind udc->connect_lock, which will only
> > > be released after uvc_function_unbind finishes. This results in either the
> > > gadget deactivating after the unbind process has finished, or in a Kernel Panic
> > > as it tries to cleanup a V4L2 node that has been purged.
> >
> > My patch removes the locking issue.  But if an execution path can
> > occur with a lock present, it can also occur when the lock has been
> > removed.  That means it may still be possible for the UVC gadget driver
> > to try deactivating the UDC after the unbind process has finished or for
> > it to try cleaning up a V4L2 node that has been purged.
> >
> > If either of those really could have happened (as opposed to just
> > getting stuck in a deadlock, waiting for a mutex that would never be
> > released), then it can still happen with the patch.  Fixing them would
> > require changes to the UVC gadget driver.  So the problem may not be
> > gone entirely.
> >
> The current situation can theoretically happen without the deadlock,
> yes, but shouldn't happen in practice. UVC's disable/unbind code flow
> looks as follows:
> 
> 1. When disable callback is called, the gadget driver signals the
> userspace application to close the V4L2 node.
> 2. Closing the V4L2 node calls the release callback to clean up
> resources. It is this callback that calls into gadget_deactivate and
> gets blocked currently (without your patch).
> 3. Separately, the unbind callback waits on release callback to finish
> for 500ms, assuming the userspace application to behave well and close
> the node in a reasonable amount of time.
> 4. If the release callback still hasn't been called, the V4L2 node is
> forcefully removed and UVC driver waits for another 1000ms for the
> release callback to clean up any pending resources.
> 5. The unbind process continues regardless of the status of release
> callback after waiting at most 1.5s for release.
> 
> So the only way to run into the current issue is if the release
> callback fails to finish in both step 3 and step 4 (for example, due
> to a deadlock) in the span of 1.5s. It is possible, but fairly
> unlikely (at least in my limited understanding) for the release
> callback to be delayed for quite that long.
> 
> Hope that makes some sense!

Yes, and it shows that there really is a bug in the UVC driver.  In 
kernel programming, fairly unlikely == not impossible == bound to happen 
eventually!

I don't know enough about the driver to make any detailed 
recommendations.  But you might consider, for example, that if the 
unbind routine can get along with forcibly removing the V4L2 node and 
not waiting for the release callback to finish, then why not have it 
behave that way all the time?  In other words, shorten the timeouts from 
500 ms and 1000 ms to 0 ms.

Whether you do that or not, someone definitely should fix up the release 
routine so that it won't get into trouble if it is called after (or 
concurrently with) all of the cleanup operations -- which is quite 
likely to happen if those timeouts are eliminated!  In particular, it 
shouldn't call gadget_deactivate unless it knows that an unbind hasn't 
happened yet.  And if that is the case, it should block the unbind 
routine until gadget_deactivate returns.  Basically, it's a bug for a 
function driver to call any gadget core routine after its unbind 
callback has returned.

Alan Stern
Avichal Rakesh July 31, 2023, 6:58 p.m. UTC | #10
On Sat, Jul 29, 2023 at 2:33 AM Alan Stern <stern@rowland.harvard.edu> wrote:
>
> On Fri, Jul 28, 2023 at 11:41:19PM +0530, Avichal Rakesh wrote:
> > On Fri, Jul 28, 2023 at 9:07 PM Alan Stern <stern@rowland.harvard.edu> wrote:
> > > It's not clear that the patch will fix the problem entirely.  Your
> > > original analysis of the bug stated:
> > >
> > > > This means that attempting to unregister the UVC Gadget Driver results in the
> > > > V4L2 resource cleanup being stuck behind udc->connect_lock, which will only
> > > > be released after uvc_function_unbind finishes. This results in either the
> > > > gadget deactivating after the unbind process has finished, or in a Kernel Panic
> > > > as it tries to cleanup a V4L2 node that has been purged.
> > >
> > > My patch removes the locking issue.  But if an execution path can
> > > occur with a lock present, it can also occur when the lock has been
> > > removed.  That means it may still be possible for the UVC gadget driver
> > > to try deactivating the UDC after the unbind process has finished or for
> > > it to try cleaning up a V4L2 node that has been purged.
> > >
> > > If either of those really could have happened (as opposed to just
> > > getting stuck in a deadlock, waiting for a mutex that would never be
> > > released), then it can still happen with the patch.  Fixing them would
> > > require changes to the UVC gadget driver.  So the problem may not be
> > > gone entirely.
> > >
> > The current situation can theoretically happen without the deadlock,
> > yes, but shouldn't happen in practice. UVC's disable/unbind code flow
> > looks as follows:
> >
> > 1. When disable callback is called, the gadget driver signals the
> > userspace application to close the V4L2 node.
> > 2. Closing the V4L2 node calls the release callback to clean up
> > resources. It is this callback that calls into gadget_deactivate and
> > gets blocked currently (without your patch).
> > 3. Separately, the unbind callback waits on release callback to finish
> > for 500ms, assuming the userspace application to behave well and close
> > the node in a reasonable amount of time.
> > 4. If the release callback still hasn't been called, the V4L2 node is
> > forcefully removed and UVC driver waits for another 1000ms for the
> > release callback to clean up any pending resources.
> > 5. The unbind process continues regardless of the status of release
> > callback after waiting at most 1.5s for release.
> >
> > So the only way to run into the current issue is if the release
> > callback fails to finish in both step 3 and step 4 (for example, due
> > to a deadlock) in the span of 1.5s. It is possible, but fairly
> > unlikely (at least in my limited understanding) for the release
> > callback to be delayed for quite that long.
> >
> > Hope that makes some sense!
>
> Yes, and it shows that there really is a bug in the UVC driver.  In
> kernel programming, fairly unlikely == not impossible == bound to happen
> eventually!
>
> I don't know enough about the driver to make any detailed
> recommendations.  But you might consider, for example, that if the
> unbind routine can get along with forcibly removing the V4L2 node and
> not waiting for the release callback to finish, then why not have it
> behave that way all the time?  In other words, shorten the timeouts from
> 500 ms and 1000 ms to 0 ms.

Forcibly removing the V4L2 node doesn't clean up the associated
resources (for example, the fd held by the userspace application),
so we risk running into kernel panics if the V4L2 node is forcibly
removed without a clean release from the userspace application.

I don't see an easy way to reduce or remove the timeouts entirely,
but I might be missing something simple again. Dan and Laurent,
if you have ideas around this, I am happy to test stuff out!

>
> Whether you do that or not, someone definitely should fix up the release
> routine so that it won't get into trouble if it is called after (or
> concurrently with) all of the cleanup operations -- which is quite
> likely to happen if those timeouts are eliminated!  In particular, it
> shouldn't call gadget_deactivate unless it knows that an unbind hasn't
> happened yet.  And if that is the case, it should block the unbind
> routine until gadget_deactivate returns.  Basically, it's a bug for a
> function driver to call any gadget core routine after its unbind
> callback has returned.
>

This seems like a reasonable check to have. I am out until next week,
but I can test and put up a patch once I get back!

Thank you!

- Avi.
diff mbox series

Patch

diff --git a/drivers/usb/gadget/function/f_uvc.c b/drivers/usb/gadget/function/f_uvc.c
index 5e919fb65833..cef92243f1f7 100644
--- a/drivers/usb/gadget/function/f_uvc.c
+++ b/drivers/usb/gadget/function/f_uvc.c
@@ -385,7 +385,7 @@  uvc_function_disable(struct usb_function *f)
 	v4l2_event.type = UVC_EVENT_DISCONNECT;
 	v4l2_event_queue(&uvc->vdev, &v4l2_event);
 
-	uvc->state = UVC_STATE_DISCONNECTED;
+	uvc->state = UVC_STATE_HOST_DISCONNECTED;
 
 	usb_ep_disable(uvc->video.ep);
 	if (uvc->enable_interrupt_ep)
@@ -410,8 +410,16 @@  uvc_function_disconnect(struct uvc_device *uvc)
 {
 	int ret;
 
-	if ((ret = usb_function_deactivate(&uvc->func)) < 0)
+	if (uvc->state == UVC_STATE_HOST_DISCONNECTED) {
+		/*
+		 * Don't deactivate gadget as this is being called in
+		 * response to the host resetting. Gadget will be deactivated
+		 * anyway. Just update to state as acknowledgement
+		 */
+		uvc->state = UVC_STATE_DISCONNECTED;
+	} else if ((ret = usb_function_deactivate(&uvc->func)) < 0) {
 		uvcg_info(&uvc->func, "UVC disconnect failed with %d\n", ret);
+	}
 }
 
 /* --------------------------------------------------------------------------
diff --git a/drivers/usb/gadget/function/uvc.h b/drivers/usb/gadget/function/uvc.h
index 100475b1363e..f1e2bc98dc61 100644
--- a/drivers/usb/gadget/function/uvc.h
+++ b/drivers/usb/gadget/function/uvc.h
@@ -120,6 +120,7 @@  struct uvc_video {
 };
 
 enum uvc_state {
+	UVC_STATE_HOST_DISCONNECTED,
 	UVC_STATE_DISCONNECTED,
 	UVC_STATE_CONNECTED,
 	UVC_STATE_STREAMING,