diff mbox series

[1/2] usb: xhci: Fix the NEC stop bug workaround

Message ID 20241014211005.07562933@foxbook (mailing list archive)
State New
Headers show
Series Fix the NEC stop bug workaround | expand

Commit Message

Michał Pecio Oct. 14, 2024, 7:10 p.m. UTC
The NEC uPD720200 has a bug, which prevents reliably stopping
an endpoint shortly after it has been restarted. This usually
happens when a driver kills many URBs in quick succession and
it results in concurrent execution and cancellation of TDs.

This is handled by stopping the endpoint again if in doubt.

This "doubt" turns out to be a problem, because Stop Endpoint
may be queued when the EP is already Stopped (for Set TR Deq
execution, for example) or becomes Stopped concurrently (by
Reset Endpoint, for example). If the EP is truly Stopped, the
command fails and further retries just keep failing forever.

This is easily triggered by modifying uvcvideo to unlink its
isochronous URBs in 100us intervals instead of poisoning them.
Any driver that unlinks URBs asynchronously may trigger this,
and any URB unlink during ongoing halt recovery also can.

Fix the problem by tracking redundant Stop Endpoint commands
which are sure to fail, and by not retrying them. It's easy,
because xhci_urb_dequeue() is the only user ever queuing the
command with the default handler and without ensuring that
the endpoint is Running and will not Halt before it Stops.
For this case, we assume that an endpoint with pending URBs
is always Running, unless certain operations are pending on
it which indicate known exceptions.

Note that we need to catch those exceptions when they occur,
because their flags may be cleared before our handler runs.

It's possible that other HCs have similar bugs (see also the
related "Running" case below), but the workaround is limited
to NEC because no such chips are currently known and tested.

Fixes: fd9d55d190c0 ("xhci: retry Stop Endpoint on buggy NEC controllers")
Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
---
 drivers/usb/host/xhci-ring.c | 44 +++++++++++++++++++++++++++++++++---
 drivers/usb/host/xhci.h      |  2 ++
 2 files changed, 43 insertions(+), 3 deletions(-)

Comments

Greg Kroah-Hartman Oct. 15, 2024, 10:38 a.m. UTC | #1
On Mon, Oct 14, 2024 at 09:10:05PM +0200, Michal Pecio wrote:
> The NEC uPD720200 has a bug, which prevents reliably stopping
> an endpoint shortly after it has been restarted. This usually
> happens when a driver kills many URBs in quick succession and
> it results in concurrent execution and cancellation of TDs.
> 
> This is handled by stopping the endpoint again if in doubt.
> 
> This "doubt" turns out to be a problem, because Stop Endpoint
> may be queued when the EP is already Stopped (for Set TR Deq
> execution, for example) or becomes Stopped concurrently (by
> Reset Endpoint, for example). If the EP is truly Stopped, the
> command fails and further retries just keep failing forever.
> 
> This is easily triggered by modifying uvcvideo to unlink its
> isochronous URBs in 100us intervals instead of poisoning them.
> Any driver that unlinks URBs asynchronously may trigger this,
> and any URB unlink during ongoing halt recovery also can.
> 
> Fix the problem by tracking redundant Stop Endpoint commands
> which are sure to fail, and by not retrying them. It's easy,
> because xhci_urb_dequeue() is the only user ever queuing the
> command with the default handler and without ensuring that
> the endpoint is Running and will not Halt before it Stops.
> For this case, we assume that an endpoint with pending URBs
> is always Running, unless certain operations are pending on
> it which indicate known exceptions.
> 
> Note that we need to catch those exceptions when they occur,
> because their flags may be cleared before our handler runs.
> 
> It's possible that other HCs have similar bugs (see also the
> related "Running" case below), but the workaround is limited
> to NEC because no such chips are currently known and tested.
> 
> Fixes: fd9d55d190c0 ("xhci: retry Stop Endpoint on buggy NEC controllers")
> Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
> ---
>  drivers/usb/host/xhci-ring.c | 44 +++++++++++++++++++++++++++++++++---
>  drivers/usb/host/xhci.h      |  2 ++
>  2 files changed, 43 insertions(+), 3 deletions(-)
> 

Hi,

This is the friendly patch-bot of Greg Kroah-Hartman.  You have sent him
a patch that has triggered this response.  He used to manually respond
to these common problems, but in order to save his sanity (he kept
writing the same thing over and over, yet to different people), I was
created.  Hopefully you will not take offence and will fix the problem
in your patch and resubmit it so that it can be accepted into the Linux
kernel tree.

You are receiving this message because of the following common error(s)
as indicated below:

- You have marked a patch with a "Fixes:" tag for a commit that is in an
  older released kernel, yet you do not have a cc: stable line in the
  signed-off-by area at all, which means that the patch will not be
  applied to any older kernel releases.  To properly fix this, please
  follow the documented rules in the
  Documentation/process/stable-kernel-rules.rst file for how to resolve
  this.

If you wish to discuss this problem further, or you have questions about
how to resolve this issue, please feel free to respond to this email and
Greg will reply once he has dug out from the pending patches received
from other developers.

thanks,

greg k-h's patch email bot
Mathias Nyman Oct. 15, 2024, 11:05 a.m. UTC | #2
On 14.10.2024 22.10, Michal Pecio wrote:
> The NEC uPD720200 has a bug, which prevents reliably stopping
> an endpoint shortly after it has been restarted. This usually
> happens when a driver kills many URBs in quick succession and
> it results in concurrent execution and cancellation of TDs.
> 
> This is handled by stopping the endpoint again if in doubt.
> 
> This "doubt" turns out to be a problem, because Stop Endpoint
> may be queued when the EP is already Stopped (for Set TR Deq
> execution, for example) or becomes Stopped concurrently (by
> Reset Endpoint, for example). If the EP is truly Stopped, the
> command fails and further retries just keep failing forever.
> 
> This is easily triggered by modifying uvcvideo to unlink its
> isochronous URBs in 100us intervals instead of poisoning them.
> Any driver that unlinks URBs asynchronously may trigger this,
> and any URB unlink during ongoing halt recovery also can.
> 
> Fix the problem by tracking redundant Stop Endpoint commands
> which are sure to fail, and by not retrying them. It's easy,
> because xhci_urb_dequeue() is the only user ever queuing the
> command with the default handler and without ensuring that
> the endpoint is Running and will not Halt before it Stops.
> For this case, we assume that an endpoint with pending URBs
> is always Running, unless certain operations are pending on
> it which indicate known exceptions.
> 
> Note that we need to catch those exceptions when they occur,
> because their flags may be cleared before our handler runs.
> 
> It's possible that other HCs have similar bugs (see also the
> related "Running" case below), but the workaround is limited
> to NEC because no such chips are currently known and tested.
> 
> Fixes: fd9d55d190c0 ("xhci: retry Stop Endpoint on buggy NEC controllers")
> Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
> ---
>   drivers/usb/host/xhci-ring.c | 44 +++++++++++++++++++++++++++++++++---
>   drivers/usb/host/xhci.h      |  2 ++
>   2 files changed, 43 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
> index 4d664ba53fe9..c0efb4d34ab9 100644
> --- a/drivers/usb/host/xhci-ring.c
> +++ b/drivers/usb/host/xhci-ring.c
> @@ -911,6 +911,21 @@ static int xhci_reset_halted_ep(struct xhci_hcd *xhci, unsigned int slot_id,
>   	return ret;
>   }
>   
> +/*
> + * A Stop Endpoint command is redundant if the EP is not in the Running state.
> + * It will fail with Context State Error. We sometimes queue redundant Stop EP
> + * commands when the EP is held Stopped for Set TR Deq execution, or Halted.
> + * A pending Stop Endpoint command *becomes* redundant if the EP halts before
> + * its completion, and this flag needs to be updated in those cases too.
> + */
> +static void xhci_update_stop_cmd_redundant(struct xhci_virt_ep *ep)
> +{
> +	if (ep->ep_state & (SET_DEQ_PENDING | EP_HALTED | EP_CLEARING_TT))
> +		ep->ep_state |= EP_STOP_CMD_REDUNDANT;
> +	else
> +		ep->ep_state &= ~EP_STOP_CMD_REDUNDANT;
> +}
> +
>   static int xhci_handle_halted_endpoint(struct xhci_hcd *xhci,
>   				struct xhci_virt_ep *ep,
>   				struct xhci_td *td,
> @@ -946,6 +961,7 @@ static int xhci_handle_halted_endpoint(struct xhci_hcd *xhci,
>   		return err;
>   
>   	ep->ep_state |= EP_HALTED;
> +	xhci_update_stop_cmd_redundant(ep);
>   
>   	xhci_ring_cmd_db(xhci);
>   
> @@ -1149,15 +1165,31 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
>   				break;
>   			ep->ep_state &= ~EP_STOP_CMD_PENDING;
>   			return;
> +
>   		case EP_STATE_STOPPED:
>   			/*
> -			 * NEC uPD720200 sometimes sets this state and fails with
> -			 * Context Error while continuing to process TRBs.
> -			 * Be conservative and trust EP_CTX_STATE on other chips.
> +			 * Per xHCI 4.6.9, Stop Endpoint command on a Stopped
> +			 * EP is a Context State Error, and EP stays Stopped.
> +			 * The EP could be stopped by some concurrent job, so
> +			 * ignore this error when that's the case.
> +			 */
> +			if (ep->ep_state & EP_STOP_CMD_REDUNDANT)
> +				break;

Can we skip the new flag and just check for the correct flags here directly?

if (ep->ep_state & (SET_DEQ_PENDING | EP_HALTED | EP_CLEARING_TT)
	break;

Thanks
Mathias
Michał Pecio Oct. 15, 2024, 1:27 p.m. UTC | #3
> Can we skip the new flag and just check for the correct flags here
> directly?
> 
> if (ep->ep_state & (SET_DEQ_PENDING | EP_HALTED | EP_CLEARING_TT)
> 	break;

Unfortunately not, because those pending operations may (and usually
will) complete before our handler runs. They will not restart the EP
because we set EP_STOP_CMD_PENDING, but they will clear their flags.
So we know that Stop Endpoint is guaranteed to fail, but its handler
will not see those flags and will have no clue why it failed, hence
we store this one bit of knowledge specially for its use.

But you raise a valid point. If Stop EP fails on a Halted endpoint and
somebody else resets it before Stop EP handler runs, the handler will
see EP_HALTED, because Reset EP handler must run later if the commands
were queued and executed in this order.

So if Stop EP handler tests for EP_HALTED, nobody needs to worry about
updating EP_STOP_CMD_REDUNDANT for us. The helper function can go out,
the patch is shorter, and the solution more robust against any changes
to halt recovery code that anyone might do. All that "redundant" logic
becomes concentrated in queue/handle _stop_endpoint() functions.

I think I will do a v2.


By the way, is this list of conditions complete? There are other flags
like GETTING_STREAMS or CLEAR_TOGGLE, but I'm under impression that they
are valid only with no queued URBs, so nothing can be cancelled then.

Regards,
Michal
diff mbox series

Patch

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 4d664ba53fe9..c0efb4d34ab9 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -911,6 +911,21 @@  static int xhci_reset_halted_ep(struct xhci_hcd *xhci, unsigned int slot_id,
 	return ret;
 }
 
+/*
+ * A Stop Endpoint command is redundant if the EP is not in the Running state.
+ * It will fail with Context State Error. We sometimes queue redundant Stop EP
+ * commands when the EP is held Stopped for Set TR Deq execution, or Halted.
+ * A pending Stop Endpoint command *becomes* redundant if the EP halts before
+ * its completion, and this flag needs to be updated in those cases too.
+ */
+static void xhci_update_stop_cmd_redundant(struct xhci_virt_ep *ep)
+{
+	if (ep->ep_state & (SET_DEQ_PENDING | EP_HALTED | EP_CLEARING_TT))
+		ep->ep_state |= EP_STOP_CMD_REDUNDANT;
+	else
+		ep->ep_state &= ~EP_STOP_CMD_REDUNDANT;
+}
+
 static int xhci_handle_halted_endpoint(struct xhci_hcd *xhci,
 				struct xhci_virt_ep *ep,
 				struct xhci_td *td,
@@ -946,6 +961,7 @@  static int xhci_handle_halted_endpoint(struct xhci_hcd *xhci,
 		return err;
 
 	ep->ep_state |= EP_HALTED;
+	xhci_update_stop_cmd_redundant(ep);
 
 	xhci_ring_cmd_db(xhci);
 
@@ -1149,15 +1165,31 @@  static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 				break;
 			ep->ep_state &= ~EP_STOP_CMD_PENDING;
 			return;
+
 		case EP_STATE_STOPPED:
 			/*
-			 * NEC uPD720200 sometimes sets this state and fails with
-			 * Context Error while continuing to process TRBs.
-			 * Be conservative and trust EP_CTX_STATE on other chips.
+			 * Per xHCI 4.6.9, Stop Endpoint command on a Stopped
+			 * EP is a Context State Error, and EP stays Stopped.
+			 * The EP could be stopped by some concurrent job, so
+			 * ignore this error when that's the case.
+			 */
+			if (ep->ep_state & EP_STOP_CMD_REDUNDANT)
+				break;
+			/*
+			 * NEC uPD720200 controllers have this bug: it takes one
+			 * service interval to restart a stopped EP, and in this
+			 * time its state remains Stopped. Any new Stop Endpoint
+			 * command fails and this handler may run quickly enough
+			 * to still observe the Stopped state, but it's bound to
+			 * soon change to Running, and all hell breaks loose.
+			 *
+			 * So keep retrying until the command clearly succeeds.
+			 * Not clear what to do if other HCs have similar bugs.
 			 */
 			if (!(xhci->quirks & XHCI_NEC_HOST))
 				break;
 			fallthrough;
+
 		case EP_STATE_RUNNING:
 			/* Race, HW handled stop ep cmd before ep was running */
 			xhci_dbg(xhci, "Stop ep completion ctx error, ep is running\n");
@@ -4383,11 +4415,17 @@  int xhci_queue_evaluate_context(struct xhci_hcd *xhci, struct xhci_command *cmd,
 int xhci_queue_stop_endpoint(struct xhci_hcd *xhci, struct xhci_command *cmd,
 			     int slot_id, unsigned int ep_index, int suspend)
 {
+	struct xhci_virt_ep *ep;
 	u32 trb_slot_id = SLOT_ID_FOR_TRB(slot_id);
 	u32 trb_ep_index = EP_INDEX_FOR_TRB(ep_index);
 	u32 type = TRB_TYPE(TRB_STOP_RING);
 	u32 trb_suspend = SUSPEND_PORT_FOR_TRB(suspend);
 
+	ep = xhci_get_virt_ep(xhci, slot_id, ep_index);
+	if (ep)
+		xhci_update_stop_cmd_redundant(ep);
+	/* else: don't care, the handler will do nothing anyway */
+
 	return queue_command(xhci, cmd, 0, 0, 0,
 			trb_slot_id | trb_ep_index | type | trb_suspend, false);
 }
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index 620502de971a..dd803b016c48 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -670,6 +670,8 @@  struct xhci_virt_ep {
 #define EP_SOFT_CLEAR_TOGGLE	(1 << 7)
 /* usb_hub_clear_tt_buffer is in progress */
 #define EP_CLEARING_TT		(1 << 8)
+/* the EP is kept Stopped or recovering from Halt/Error */
+#define EP_STOP_CMD_REDUNDANT	(1 << 9)
 	/* ----  Related to URB cancellation ---- */
 	struct list_head	cancelled_td_list;
 	struct xhci_hcd		*xhci;