[0/3] usb: gadget: uvc: allocate requests based on frame interval length and buffersize

Message ID	20240403-uvc_request_length_by_interval-v1-0-9436c4716233@pengutronix.de (mailing list archive)
Headers	show Received: from metis.whiteo.stw.pengutronix.de (metis.whiteo.stw.pengutronix.de [185.203.201.7]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C2D53158A02 for <linux-usb@vger.kernel.org>; Tue, 9 Apr 2024 21:25:02 +0000 (UTC) From: Michael Grzeschik <m.grzeschik@pengutronix.de> Subject: [PATCH 0/3] usb: gadget: uvc: allocate requests based on frame interval length and buffersize Date: Tue, 09 Apr 2024 23:24:56 +0200 Message-Id: <20240403-uvc_request_length_by_interval-v1-0-9436c4716233@pengutronix.de> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit To: Laurent Pinchart <laurent.pinchart@ideasonboard.com>, Daniel Scally <dan.scally@ideasonboard.com>, Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: linux-usb@vger.kernel.org, linux-kernel@vger.kernel.org, Michael Grzeschik <m.grzeschik@pengutronix.de>
Series	usb: gadget: uvc: allocate requests based on frame interval length and buffersize \| expand [0/3] usb: gadget: uvc: allocate requests based on frame interval length and buffersize [1/3] usb: gadget: function: uvc: set req_size once when the vb2 queue is calculated [2/3] usb: gadget: uvc: add g_parm and s_parm for frame interval [3/3] usb: gadget: uvc: set req_size and n_requests based on the frame interval

Michael Grzeschik April 9, 2024, 9:24 p.m. UTC

This patch series is improving the size calculation and allocation
of the uvc requests. Using the currenlty setup frame duration of the
stream it is possible to calculate the number of requests based on the
interval length.

Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de>
---
Michael Grzeschik (3):
      usb: gadget: function: uvc: set req_size once when the vb2 queue is calculated
      usb: gadget: uvc: add g_parm and s_parm for frame interval
      usb: gadget: uvc: set req_size and n_requests based on the frame interval

 drivers/usb/gadget/function/uvc.h       |  1 +
 drivers/usb/gadget/function/uvc_queue.c | 30 ++++++++++++++-----
 drivers/usb/gadget/function/uvc_v4l2.c  | 52 +++++++++++++++++++++++++++++++++
 drivers/usb/gadget/function/uvc_video.c | 17 ++---------
 4 files changed, 79 insertions(+), 21 deletions(-)
---
base-commit: 3295f1b866bfbcabd625511968e8a5c541f9ab32
change-id: 20240403-uvc_request_length_by_interval-a7efd587d963

Best regards,

Michael Grzeschik April 21, 2024, 11:25 p.m. UTC | #1

On Tue, Apr 09, 2024 at 11:24:56PM +0200, Michael Grzeschik wrote:
>This patch series is improving the size calculation and allocation
>of the uvc requests. Using the currenlty setup frame duration of the
>stream it is possible to calculate the number of requests based on the
>interval length.

The basic concept here is right. But unfortunatly we found out that
together with Patch [1] and the current zero length request pump
mechanism [2] and [3] this is not working as expected.

The conclusion that we can not queue more than one frame at once into
the hw led to [1]. The current implementation of zero length reqeusts
which will be queued while we are waiting for the frame to finish
transferring will enlarge the frame duration. Since every zero-length
request is still taking up at least one frame interval of 125 us.

This longer frameduration of each enqueued will therefor decrease the
absolut framerate.

Therefor to properly make those patches work, we will have to get rid of
the zero length pump mechanism again and make sure that the whole
business logic of what to be queued and when will only be done in the
pump worker. It is possible to let the dwc3 udc run dry, as we are
actively waiting for the frame to finish, the last request in the
prepared and started list will stop the current dwc3 stream and therfor
no underruns will occur with the next ep_queue.

Also keeping the prepared list and doing the preparation in any case
of the pump worker is still a good point we need to keep.

With all these pending patches the whole uvc saga of underruns and
flickering videostreams should come to an end™.

I already started with this but would be happy to see Avichal and others
to review the patches when they are ready in my eyes.

mgr

[1] https://lore.kernel.org/all/20240324-uvc-gadget-errorcheck-v1-1-5538c57bbeba@pengutronix.de/
[2] https://lore.kernel.org/all/99384044-0d14-4ebe-9109-8a5557e64449@google.com/
[3] https://lore.kernel.org/all/20230508231103.1621375-1-arakesh@google.com/

>Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de>
>---
>Michael Grzeschik (3):
>      usb: gadget: function: uvc: set req_size once when the vb2 queue is calculated
>      usb: gadget: uvc: add g_parm and s_parm for frame interval
>      usb: gadget: uvc: set req_size and n_requests based on the frame interval
>
> drivers/usb/gadget/function/uvc.h       |  1 +
> drivers/usb/gadget/function/uvc_queue.c | 30 ++++++++++++++-----
> drivers/usb/gadget/function/uvc_v4l2.c  | 52 +++++++++++++++++++++++++++++++++
> drivers/usb/gadget/function/uvc_video.c | 17 ++---------
> 4 files changed, 79 insertions(+), 21 deletions(-)
>---
>base-commit: 3295f1b866bfbcabd625511968e8a5c541f9ab32
>change-id: 20240403-uvc_request_length_by_interval-a7efd587d963
>
>Best regards,
>-- 
>Michael Grzeschik <m.grzeschik@pengutronix.de>
>
>

Avichal Rakesh April 23, 2024, 12:21 a.m. UTC | #2

On 4/21/24 16:25, Michael Grzeschik wrote:
> On Tue, Apr 09, 2024 at 11:24:56PM +0200, Michael Grzeschik wrote:
>> This patch series is improving the size calculation and allocation
>> of the uvc requests. Using the currenlty setup frame duration of the
>> stream it is possible to calculate the number of requests based on the
>> interval length.
> 
> The basic concept here is right. But unfortunatly we found out that
> together with Patch [1] and the current zero length request pump
> mechanism [2] and [3] this is not working as expected.
> 
> The conclusion that we can not queue more than one frame at once into
> the hw led to [1]. The current implementation of zero length reqeusts
> which will be queued while we are waiting for the frame to finish
> transferring will enlarge the frame duration. Since every zero-length
> request is still taking up at least one frame interval of 125 us.

I haven't taken a super close look at your patches, so please feel free
to correct me if I am misunderstanding something.

It looks like the goal of the patches is to determine a better number
and size of usb_requests from the given framerate such that we send exactly
nreqs requests per frame where nreqs is determined to be the exact number 
of requests that can be sent in one frame interval?

As the logic stands, we need some 0-length requests to be circulating to
ensure that we don't miss ISOC deadlines. The current logic unconditionally
sends half of all allocated requests to be circulated.

With those two things in mind, this means than video_pump can at encode
at most half a frame in one go, and then has to wait for complete 
callbacks to come in. In such cases, the theoretical worst case for 
encode time is  
125us * (number of requests needed per frame / 2) + scheduling delays
as after the first half of the frame has been encoded, the video_pump
thread will have to wait 125us for each of the zero length requests to
be returned.

The underlying assumption behind the "queue 0-length requests" approach
was that video_pump encodes the frames in as few requests as possible
and that there are spare requests to maintain a pressure on the 
ISOC queue without hindering the video_pump thread, and unfortunately
it seems like patch 3/3 is breaking both of them?

Assuming my understanding of your patches is correct, my question 
is: Why do we want to spread the frame uniformly over the requests
instead of encoding it in as few requests as possible. Spreading
the frame over more requests artificially increases the encode time
required by video_pump, and AFAICT there is no real benefit to it?

> Therefor to properly make those patches work, we will have to get rid of
> the zero length pump mechanism again and make sure that the whole
> business logic of what to be queued and when will only be done in the
> pump worker. It is possible to let the dwc3 udc run dry, as we are
> actively waiting for the frame to finish, the last request in the
> prepared and started list will stop the current dwc3 stream and therf> no underruns will occur with the next ep_queue.

One thing to note here: The reason we moved to queuing 0-length requests
from complete callback was because even with realtime priority, video_pump
thread doesn't always meet the ISOC queueing cadence. I think stopping and
starting the stream was briefly discussed in our initial discussion in 
https://lore.kernel.org/all/20230419001143.pdxflhzyecf4kvee@synopsys.com/
and Thinh mentioned that dwc3 controller does it if it detects an underrun,
but I am not sure if starting and stopping an ISOC stream is good practice.
Someone better versed in USB protocol can probably confirm, but it seems
somewhat hacky to stop the ISOC stream at the end of the frame and restart
with the next frame. 

> With all these pending patches the whole uvc saga of underruns and
> flickering videostreams should come to an end™.

This would indeed be nice!

> 
> I already started with this but would be happy to see Avichal and others
> to review the patches when they are ready in my eyes.

Of course!

- Avi.

Michael Grzeschik April 23, 2024, 2:25 p.m. UTC | #3

Ccing:

Michael Riesch <michael.riesch@wolfvision.net>
Thinh Nguyen <Thinh.Nguyen@synopsys.com>

On Mon, Apr 22, 2024 at 05:21:09PM -0700, Avichal Rakesh wrote:
>On 4/21/24 16:25, Michael Grzeschik wrote:
>> On Tue, Apr 09, 2024 at 11:24:56PM +0200, Michael Grzeschik wrote:
>>> This patch series is improving the size calculation and allocation
>>> of the uvc requests. Using the currenlty setup frame duration of the
>>> stream it is possible to calculate the number of requests based on the
>>> interval length.
>>
>> The basic concept here is right. But unfortunatly we found out that
>> together with Patch [1] and the current zero length request pump
>> mechanism [2] and [3] this is not working as expected.
>>
>> The conclusion that we can not queue more than one frame at once into
>> the hw led to [1]. The current implementation of zero length reqeusts
>> which will be queued while we are waiting for the frame to finish
>> transferring will enlarge the frame duration. Since every zero-length
>> request is still taking up at least one frame interval of 125 us.
>
>I haven't taken a super close look at your patches, so please feel free
>to correct me if I am misunderstanding something.
>
>It looks like the goal of the patches is to determine a better number
>and size of usb_requests from the given framerate such that we send exactly
>nreqs requests per frame where nreqs is determined to be the exact number
>of requests that can be sent in one frame interval?

It does not need to be the exact time, actually it may not be exact.
Scattering the data over all requests would not leave any headroom for
any latencies or overhead.

>As the logic stands, we need some 0-length requests to be circulating to
>ensure that we don't miss ISOC deadlines. The current logic unconditionally
>sends half of all allocated requests to be circulated.
>
>With those two things in mind, this means than video_pump can at encode
>at most half a frame in one go, and then has to wait for complete
>callbacks to come in. In such cases, the theoretical worst case for
>encode time is
>125us * (number of requests needed per frame / 2) + scheduling delays
>as after the first half of the frame has been encoded, the video_pump
>thread will have to wait 125us for each of the zero length requests to
>be returned.
>
>The underlying assumption behind the "queue 0-length requests" approach
>was that video_pump encodes the frames in as few requests as possible
>and that there are spare requests to maintain a pressure on the
>ISOC queue without hindering the video_pump thread, and unfortunately
>it seems like patch 3/3 is breaking both of them?

Right.

>Assuming my understanding of your patches is correct, my question
>is: Why do we want to spread the frame uniformly over the requests
>instead of encoding it in as few requests as possible. Spreading
>the frame over more requests artificially increases the encode time
>required by video_pump, and AFAICT there is no real benefit to it?

Thinh gave me the advise that it is better to use the isoc stream
constantly filled. Rather then streaming big amounts of data in the
beginning of an frameinterval and having then a lot of spare time
where the bandwidth is completely unsused.

In our reallife scenario streaming big requests had the impact, that
the dwc3 core could not keep up with reading the amount of data
from the memory bus, as the bus is already under heavy load. When the
HW was then not able to transfer the requested and actually available
amount of data in the interval, the hw did give us the usual missed
interrupt answer.

Using smaller requests solved the problem here, as it really was
unnecessary to stress the memory and usb bus in the beginning as
we had enough headroom in the temporal domain.

Which then led to the conclusion that the number of needed requests
per image frame interval is calculatable since we know the usb
interval length.

@Thinh: Correct me if I am saying something wrong here.

>> Therefor to properly make those patches work, we will have to get rid of
>> the zero length pump mechanism again and make sure that the whole
>> business logic of what to be queued and when will only be done in the
>> pump worker. It is possible to let the dwc3 udc run dry, as we are
>> actively waiting for the frame to finish, the last request in the
>> prepared and started list will stop the current dwc3 stream and  for
>> no underruns will occur with the next ep_queue.
>
>One thing to note here: The reason we moved to queuing 0-length requests
>from complete callback was because even with realtime priority, video_pump
>thread doesn't always meet the ISOC queueing cadence. I think stopping and
>starting the stream was briefly discussed in our initial discussion in
>https://lore.kernel.org/all/20230419001143.pdxflhzyecf4kvee@synopsys.com/
>and Thinh mentioned that dwc3 controller does it if it detects an underrun,
>but I am not sure if starting and stopping an ISOC stream is good practice.

The realtime latency aspect is not an issue anymore if we ensure that we
always keep only one frame in the hw ring buffer. When the pump worker
ensure that it will always run through one full frame the scheduler has
no chance to break our running dwc3 stream. Since the pump is running
under a while(1) this should be possible.

Also with the request amount precalculation we can always encode the
whole frame into all available requests and don't have to wait for
requests to be available again.

Together with the latest knowladge about the underlying hw we even need to only
keep one frame in the HW ring buffer. Since we have some interrupt latency,
keeping more frames in the ring buffer, would mean that we are not able to tag
the currently streamed frame properly as errornous if the dwc3 hw ring buffer
is already telling the host some data about the next frame. And as we already
need to wait for the end of the frame to finish, based on the assumption that
only one frame is enqueued in the ring buffer the hw will stop the stream and
the next requst will start a new stream. So there will no missed underruns be
happening.

So the main fact here is, that telling the host some status about a
frame in the past is impossible! Therefor the first request of the next
hw stream need to be the one that is telling the Host if the previous frame
is ment to be drawn or not.

>Someone better versed in USB protocol can probably confirm, but it seems
>somewhat hacky to stop the ISOC stream at the end of the frame and restart
>with the next frame.

All I know is that the HW mechanism that is reading from the trb ring buffer is
started or stopped I don't know if really the ISOC stream is stopped and
restarted here or what that means on the real wire. And if so, I am unsure if
that is really a problem or not. Thinh?

Regards,
Michael

Avichal Rakesh April 23, 2024, 11:23 p.m. UTC | #4

On 4/23/24 07:25, Michael Grzeschik wrote:
> Ccing:
> 
> Michael Riesch <michael.riesch@wolfvision.net>
> Thinh Nguyen <Thinh.Nguyen@synopsys.com>
> 
> On Mon, Apr 22, 2024 at 05:21:09PM -0700, Avichal Rakesh wrote:
>> On 4/21/24 16:25, Michael Grzeschik wrote:
>>> On Tue, Apr 09, 2024 at 11:24:56PM +0200, Michael Grzeschik wrote:
>>>> This patch series is improving the size calculation and allocation
>>>> of the uvc requests. Using the currenlty setup frame duration of the
>>>> stream it is possible to calculate the number of requests based on the
>>>> interval length.
>>>
>>> The basic concept here is right. But unfortunatly we found out that
>>> together with Patch [1] and the current zero length request pump
>>> mechanism [2] and [3] this is not working as expected.
>>>
>>> The conclusion that we can not queue more than one frame at once into
>>> the hw led to [1]. The current implementation of zero length reqeusts
>>> which will be queued while we are waiting for the frame to finish
>>> transferring will enlarge the frame duration. Since every zero-length
>>> request is still taking up at least one frame interval of 125 us.
>>
>> I haven't taken a super close look at your patches, so please feel free
>> to correct me if I am misunderstanding something.
>>
>> It looks like the goal of the patches is to determine a better number
>> and size of usb_requests from the given framerate such that we send exactly
>> nreqs requests per frame where nreqs is determined to be the exact number
>> of requests that can be sent in one frame interval?
> 
> It does not need to be the exact time, actually it may not be exact.
> Scattering the data over all requests would not leave any headroom for
> any latencies or overhead.

IIUC, patch 3/3 sets the number of requests to frameinterval / 125 us, 
which gives us the number of requests we can send in exactly one frame interval, 
and then sets the size of the request as max framesize / nreq, which means the 
frames will be evenly divided up into all available requests (with a little
fuzz factor here and there).

This effectively means that (assuming no other delays) one frame will take
~one frameinterval to be transmitted?

> 
>> As the logic stands, we need some 0-length requests to be circulating to
>> ensure that we don't miss ISOC deadlines. The current logic unconditionally
>> sends half of all allocated requests to be circulated.
>>
>> With those two things in mind, this means than video_pump can at encode
>> at most half a frame in one go, and then has to wait for complete
>> callbacks to come in. In such cases, the theoretical worst case for
>> encode time is
>> 125us * (number of requests needed per frame / 2) + scheduling delays
>> as after the first half of the frame has been encoded, the video_pump
>> thread will have to wait 125us for each of the zero length requests to
>> be returned.
>>
>> The underlying assumption behind the "queue 0-length requests" approach
>> was that video_pump encodes the frames in as few requests as possible
>> and that there are spare requests to maintain a pressure on the
>> ISOC queue without hindering the video_pump thread, and unfortunately
>> it seems like patch 3/3 is breaking both of them?
> 
> Right.
> 
>> Assuming my understanding of your patches is correct, my question
>> is: Why do we want to spread the frame uniformly over the requests
>> instead of encoding it in as few requests as possible. Spreading
>> the frame over more requests artificially increases the encode time
>> required by video_pump, and AFAICT there is no real benefit to it?
> 
> Thinh gave me the advise that it is better to use the isoc stream
> constantly filled. Rather then streaming big amounts of data in the
> beginning of an frameinterval and having then a lot of spare time
> where the bandwidth is completely unsused.
> 
> In our reallife scenario streaming big requests had the impact, that
> the dwc3 core could not keep up with reading the amount of data
> from the memory bus, as the bus is already under heavy load. When the
> HW was then not able to transfer the requested and actually available
> amount of data in the interval, the hw did give us the usual missed
> interrupt answer.
> 
> Using smaller requests solved the problem here, as it really was
> unnecessary to stress the memory and usb bus in the beginning as
> we had enough headroom in the temporal domain.

Ah, I see. This was not a consideration, and it makes sense if USB
bus is under contention from a few different streams. So the solution
seems to be to spread the frame of as many requests as we can transmit 
in one frameinterval?

As an experiment, while we wait for others to respond, could you try 
doubling (or 2.5x'ing to be extra safe) the number of requests allocated
by patch 3/3 without changing the request's buffer size? 

It won't help with the error reporting but should help with ensuring
that frames are sent out in one frameinterval with little to no
0-length requests between them.

The idea is that video_pump will have enough requests available to fully 
encode the frame in one burst, and another frame's worth of request will be 
re-added to req_free list for video_pump to fill up in the time that the next 
frame comes in.

> 
> Which then led to the conclusion that the number of needed requests
> per image frame interval is calculatable since we know the usb
> interval length.
> 
> @Thinh: Correct me if I am saying something wrong here.
> 
>>> Therefor to properly make those patches work, we will have to get rid of
>>> the zero length pump mechanism again and make sure that the whole
>>> business logic of what to be queued and when will only be done in the
>>> pump worker. It is possible to let the dwc3 udc run dry, as we are
>>> actively waiting for the frame to finish, the last request in the
>>> prepared and started list will stop the current dwc3 stream and  for
>>> no underruns will occur with the next ep_queue.
>>
>> One thing to note here: The reason we moved to queuing 0-length requests
>> from complete callback was because even with realtime priority, video_pump
>> thread doesn't always meet the ISOC queueing cadence. I think stopping and
>> starting the stream was briefly discussed in our initial discussion in
>> https://lore.kernel.org/all/20230419001143.pdxflhzyecf4kvee@synopsys.com/
>> and Thinh mentioned that dwc3 controller does it if it detects an underrun,
>> but I am not sure if starting and stopping an ISOC stream is good practice.
> 
> The realtime latency aspect is not an issue anymore if we ensure that we
> always keep only one frame in the hw ring buffer. When the pump worker
> ensure that it will always run through one full frame the scheduler has
> no chance to break our running dwc3 stream. Since the pump is running
> under a while(1) this should be possible.

I'll wait for your patch to see, but are you saying that we should have the 
pump worker busy spinning  when there are no frames available? Cameras cannot
produce frames fast enough for video_pump to be constantly encoding frames. 
IIRC, "encoding" a frame to usb_requests took less than a ms or two on my 
device, and frame interval is 33ms for a 30fps stream, so the CPU would be
busy spinning for ~30ms which is an unreasonable time for a CPU to be
idling.

> 
> Also with the request amount precalculation we can always encode the
> whole frame into all available requests and don't have to wait for
> requests to be available again.
> 
> Together with the latest knowladge about the underlying hw we even need to only
> keep one frame in the HW ring buffer. Since we have some interrupt latency,
> keeping more frames in the ring buffer, would mean that we are not able to tag
> the currently streamed frame properly as errornous if the dwc3 hw ring buffer
> is already telling the host some data about the next frame. And as we already
> need to wait for the end of the frame to finish, based on the assumption that
> only one frame is enqueued in the ring buffer the hw will stop the stream and
> the next requst will start a new stream. So there will no missed underruns be
> happening.
> 
> So the main fact here is, that telling the host some status about a
> frame in the past is impossible! Therefor the first request of the next
> hw stream need to be the one that is telling the Host if the previous frame
> is ment to be drawn or not.

This is a fair point, but the timing on this becomes a little difficult if 
the frame is sent over the entire frameinterval. If we wait for the entire 
frame to be transmitted, then we have 125us between the last request of a 
frame being transmitted and the first request of the next frame being 
queued. The userspace app producing the frames will have timing variations
larger than 125us, so we cannot rely on a frame being available exactly as
one frame is fully transmitted, or of us being notified of transmission
status by the time the next frame comes in.

> 
>> Someone better versed in USB protocol can probably confirm, but it seems
>> somewhat hacky to stop the ISOC stream at the end of the frame and restart
>> with the next frame.
> 
> All I know is that the HW mechanism that is reading from the trb ring buffer is
> started or stopped I don't know if really the ISOC stream is stopped and
> restarted here or what that means on the real wire. And if so, I am unsure if
> that is really a problem or not. Thinh?

Oh? That's great! If the controller can keep the ISOC stream from underruning
without the gadget feeding it 0-length requests, then we can simplify the 
gadget side implementation quite a bit!


Regards,
Avi.

Thinh Nguyen April 24, 2024, 2:28 a.m. UTC | #5

On Tue, Apr 23, 2024, Avichal Rakesh wrote:
> 
> 
> On 4/23/24 07:25, Michael Grzeschik wrote:
> > Ccing:
> > 
> > Michael Riesch <michael.riesch@wolfvision.net>
> > Thinh Nguyen <Thinh.Nguyen@synopsys.com>
> > 
> > On Mon, Apr 22, 2024 at 05:21:09PM -0700, Avichal Rakesh wrote:
> >> On 4/21/24 16:25, Michael Grzeschik wrote:
> >>> On Tue, Apr 09, 2024 at 11:24:56PM +0200, Michael Grzeschik wrote:
> >>>> This patch series is improving the size calculation and allocation
> >>>> of the uvc requests. Using the currenlty setup frame duration of the
> >>>> stream it is possible to calculate the number of requests based on the
> >>>> interval length.
> >>>
> >>> The basic concept here is right. But unfortunatly we found out that
> >>> together with Patch [1] and the current zero length request pump
> >>> mechanism [2] and [3] this is not working as expected.
> >>>
> >>> The conclusion that we can not queue more than one frame at once into
> >>> the hw led to [1]. The current implementation of zero length reqeusts
> >>> which will be queued while we are waiting for the frame to finish
> >>> transferring will enlarge the frame duration. Since every zero-length
> >>> request is still taking up at least one frame interval of 125 us.
> >>
> >> I haven't taken a super close look at your patches, so please feel free
> >> to correct me if I am misunderstanding something.
> >>
> >> It looks like the goal of the patches is to determine a better number
> >> and size of usb_requests from the given framerate such that we send exactly
> >> nreqs requests per frame where nreqs is determined to be the exact number
> >> of requests that can be sent in one frame interval?
> > 
> > It does not need to be the exact time, actually it may not be exact.
> > Scattering the data over all requests would not leave any headroom for
> > any latencies or overhead.
> 
> IIUC, patch 3/3 sets the number of requests to frameinterval / 125 us, 
> which gives us the number of requests we can send in exactly one frame interval, 
> and then sets the size of the request as max framesize / nreq, which means the 
> frames will be evenly divided up into all available requests (with a little
> fuzz factor here and there).
> 
> This effectively means that (assuming no other delays) one frame will take
> ~one frameinterval to be transmitted?
> 
> > 
> >> As the logic stands, we need some 0-length requests to be circulating to
> >> ensure that we don't miss ISOC deadlines. The current logic unconditionally
> >> sends half of all allocated requests to be circulated.
> >>
> >> With those two things in mind, this means than video_pump can at encode
> >> at most half a frame in one go, and then has to wait for complete
> >> callbacks to come in. In such cases, the theoretical worst case for
> >> encode time is
> >> 125us * (number of requests needed per frame / 2) + scheduling delays
> >> as after the first half of the frame has been encoded, the video_pump
> >> thread will have to wait 125us for each of the zero length requests to
> >> be returned.
> >>
> >> The underlying assumption behind the "queue 0-length requests" approach
> >> was that video_pump encodes the frames in as few requests as possible
> >> and that there are spare requests to maintain a pressure on the
> >> ISOC queue without hindering the video_pump thread, and unfortunately
> >> it seems like patch 3/3 is breaking both of them?
> > 
> > Right.
> > 
> >> Assuming my understanding of your patches is correct, my question
> >> is: Why do we want to spread the frame uniformly over the requests
> >> instead of encoding it in as few requests as possible. Spreading
> >> the frame over more requests artificially increases the encode time
> >> required by video_pump, and AFAICT there is no real benefit to it?
> > 
> > Thinh gave me the advise that it is better to use the isoc stream
> > constantly filled. Rather then streaming big amounts of data in the
> > beginning of an frameinterval and having then a lot of spare time
> > where the bandwidth is completely unsused.
> > 
> > In our reallife scenario streaming big requests had the impact, that
> > the dwc3 core could not keep up with reading the amount of data
> > from the memory bus, as the bus is already under heavy load. When the
> > HW was then not able to transfer the requested and actually available
> > amount of data in the interval, the hw did give us the usual missed
> > interrupt answer.
> > 
> > Using smaller requests solved the problem here, as it really was
> > unnecessary to stress the memory and usb bus in the beginning as
> > we had enough headroom in the temporal domain.
> 
> Ah, I see. This was not a consideration, and it makes sense if USB
> bus is under contention from a few different streams. So the solution
> seems to be to spread the frame of as many requests as we can transmit 
> in one frameinterval?
> 
> As an experiment, while we wait for others to respond, could you try 
> doubling (or 2.5x'ing to be extra safe) the number of requests allocated
> by patch 3/3 without changing the request's buffer size? 
> 
> It won't help with the error reporting but should help with ensuring
> that frames are sent out in one frameinterval with little to no
> 0-length requests between them.
> 
> The idea is that video_pump will have enough requests available to fully 
> encode the frame in one burst, and another frame's worth of request will be 
> re-added to req_free list for video_pump to fill up in the time that the next 
> frame comes in.
> 
> > 
> > Which then led to the conclusion that the number of needed requests
> > per image frame interval is calculatable since we know the usb
> > interval length.
> > 
> > @Thinh: Correct me if I am saying something wrong here.

Right, if you max out the data rate per uframe, there's less opportunity
for the host to schedule everything for that interval (e.g. affected
from other endpoint/device traffics, link commands etc). It also
increases the latency of DMA. In many cases, many other vendor hosts
can't handle 48KB/uframe for SuperSpeed and 96KB/uframe for SuperSpeed
Plus. So, you'd need to test your platform find the optimal request size
so it can work for most hosts.

> > 
> >>> Therefor to properly make those patches work, we will have to get rid of

Sorry if I may have missed the explaination, but why do we need to rid
of this?

> >>> the zero length pump mechanism again and make sure that the whole
> >>> business logic of what to be queued and when will only be done in the
> >>> pump worker. It is possible to let the dwc3 udc run dry, as we are
> >>> actively waiting for the frame to finish, the last request in the
> >>> prepared and started list will stop the current dwc3 stream and  for
> >>> no underruns will occur with the next ep_queue.
> >>
> >> One thing to note here: The reason we moved to queuing 0-length requests
> >> from complete callback was because even with realtime priority, video_pump
> >> thread doesn't always meet the ISOC queueing cadence. I think stopping and
> >> starting the stream was briefly discussed in our initial discussion in
> >> https://urldefense.com/v3/__https://lore.kernel.org/all/20230419001143.pdxflhzyecf4kvee@synopsys.com/__;!!A4F2R9G_pg!ZmfvrPq4rs7MIhxNrrEqmgGrlYTJ12WgdzaqQhfEehKfjKqxPr2bC1RzUqaa9tvdBtAvXdyK2GpxYzvslpV6$ 
> >> and Thinh mentioned that dwc3 controller does it if it detects an underrun,
> >> but I am not sure if starting and stopping an ISOC stream is good practice.

There's a workaround specific for UVC in dwc3 to "guess" when underrun
happen. It's not foolproof. dwc3 should not need to do that.

Isoc data is periodic and continuous. We should not expect this
unconventional re-synchronization.

> > 
> > The realtime latency aspect is not an issue anymore if we ensure that we
> > always keep only one frame in the hw ring buffer. When the pump worker
> > ensure that it will always run through one full frame the scheduler has
> > no chance to break our running dwc3 stream. Since the pump is running
> > under a while(1) this should be possible.
> 
> I'll wait for your patch to see, but are you saying that we should have the 
> pump worker busy spinning  when there are no frames available? Cameras cannot
> produce frames fast enough for video_pump to be constantly encoding frames. 
> IIRC, "encoding" a frame to usb_requests took less than a ms or two on my 
> device, and frame interval is 33ms for a 30fps stream, so the CPU would be
> busy spinning for ~30ms which is an unreasonable time for a CPU to be
> idling.
> 
> > 
> > Also with the request amount precalculation we can always encode the
> > whole frame into all available requests and don't have to wait for
> > requests to be available again.
> > 
> > Together with the latest knowladge about the underlying hw we even need to only
> > keep one frame in the HW ring buffer. Since we have some interrupt latency,
> > keeping more frames in the ring buffer, would mean that we are not able to tag
> > the currently streamed frame properly as errornous if the dwc3 hw ring buffer
> > is already telling the host some data about the next frame. And as we already
> > need to wait for the end of the frame to finish, based on the assumption that
> > only one frame is enqueued in the ring buffer the hw will stop the stream and
> > the next requst will start a new stream. So there will no missed underruns be
> > happening.
> > 
> > So the main fact here is, that telling the host some status about a
> > frame in the past is impossible! Therefor the first request of the next
> > hw stream need to be the one that is telling the Host if the previous frame
> > is ment to be drawn or not.
> 
> This is a fair point, but the timing on this becomes a little difficult if 
> the frame is sent over the entire frameinterval. If we wait for the entire 
> frame to be transmitted, then we have 125us between the last request of a 
> frame being transmitted and the first request of the next frame being 
> queued. The userspace app producing the frames will have timing variations
> larger than 125us, so we cannot rely on a frame being available exactly as
> one frame is fully transmitted, or of us being notified of transmission
> status by the time the next frame comes in.
> 
> > 
> >> Someone better versed in USB protocol can probably confirm, but it seems
> >> somewhat hacky to stop the ISOC stream at the end of the frame and restart
> >> with the next frame.
> > 
> > All I know is that the HW mechanism that is reading from the trb ring buffer is
> > started or stopped I don't know if really the ISOC stream is stopped and
> > restarted here or what that means on the real wire. And if so, I am unsure if
> > that is really a problem or not. Thinh?

For isoc IN endpoint, if the host requests for data while there's no TRB
prepared, the controller would respond with 0-length data. When we stop
and start again, we reschedule the prepared isoc data to go out on a new
interval.

> 
> Oh? That's great! If the controller can keep the ISOC stream from underruning
> without the gadget feeding it 0-length requests, then we can simplify the 
> gadget side implementation quite a bit!
> 

I'm not entirely clear on why we cannot use 0-length requests now. I
hope we can resolve this from UVC function driver as it seems to be a
proper place to handle this issue.

Thanks,
Thinh

Michael Grzeschik May 12, 2024, 10:10 p.m. UTC | #6

On Wed, Apr 24, 2024 at 02:28:10AM +0000, Thinh Nguyen wrote:
>On Tue, Apr 23, 2024, Avichal Rakesh wrote:
>>
>>
>> On 4/23/24 07:25, Michael Grzeschik wrote:
>> > Ccing:
>> >
>> > Michael Riesch <michael.riesch@wolfvision.net>
>> > Thinh Nguyen <Thinh.Nguyen@synopsys.com>
>> >
>> > On Mon, Apr 22, 2024 at 05:21:09PM -0700, Avichal Rakesh wrote:
>> >> On 4/21/24 16:25, Michael Grzeschik wrote:
>> >>> On Tue, Apr 09, 2024 at 11:24:56PM +0200, Michael Grzeschik wrote:
>> >>>> This patch series is improving the size calculation and allocation
>> >>>> of the uvc requests. Using the currenlty setup frame duration of the
>> >>>> stream it is possible to calculate the number of requests based on the
>> >>>> interval length.
>> >>>
>> >>> The basic concept here is right. But unfortunatly we found out that
>> >>> together with Patch [1] and the current zero length request pump
>> >>> mechanism [2] and [3] this is not working as expected.
>> >>>
>> >>> The conclusion that we can not queue more than one frame at once into
>> >>> the hw led to [1]. The current implementation of zero length reqeusts
>> >>> which will be queued while we are waiting for the frame to finish
>> >>> transferring will enlarge the frame duration. Since every zero-length
>> >>> request is still taking up at least one frame interval of 125 us.
>> >>
>> >> I haven't taken a super close look at your patches, so please feel free
>> >> to correct me if I am misunderstanding something.
>> >>
>> >> It looks like the goal of the patches is to determine a better number
>> >> and size of usb_requests from the given framerate such that we send exactly
>> >> nreqs requests per frame where nreqs is determined to be the exact number
>> >> of requests that can be sent in one frame interval?
>> >
>> > It does not need to be the exact time, actually it may not be exact.
>> > Scattering the data over all requests would not leave any headroom for
>> > any latencies or overhead.
>>
>> IIUC, patch 3/3 sets the number of requests to frameinterval / 125 us,
>> which gives us the number of requests we can send in exactly one frame interval,
>> and then sets the size of the request as max framesize / nreq, which means the
>> frames will be evenly divided up into all available requests (with a little
>> fuzz factor here and there).
>>
>> This effectively means that (assuming no other delays) one frame will take
>> ~one frameinterval to be transmitted?
>>
>> >
>> >> As the logic stands, we need some 0-length requests to be circulating to
>> >> ensure that we don't miss ISOC deadlines. The current logic unconditionally
>> >> sends half of all allocated requests to be circulated.
>> >>
>> >> With those two things in mind, this means than video_pump can at encode
>> >> at most half a frame in one go, and then has to wait for complete
>> >> callbacks to come in. In such cases, the theoretical worst case for
>> >> encode time is
>> >> 125us * (number of requests needed per frame / 2) + scheduling delays
>> >> as after the first half of the frame has been encoded, the video_pump
>> >> thread will have to wait 125us for each of the zero length requests to
>> >> be returned.
>> >>
>> >> The underlying assumption behind the "queue 0-length requests" approach
>> >> was that video_pump encodes the frames in as few requests as possible
>> >> and that there are spare requests to maintain a pressure on the
>> >> ISOC queue without hindering the video_pump thread, and unfortunately
>> >> it seems like patch 3/3 is breaking both of them?
>> >
>> > Right.
>> >
>> >> Assuming my understanding of your patches is correct, my question
>> >> is: Why do we want to spread the frame uniformly over the requests
>> >> instead of encoding it in as few requests as possible. Spreading
>> >> the frame over more requests artificially increases the encode time
>> >> required by video_pump, and AFAICT there is no real benefit to it?
>> >
>> > Thinh gave me the advise that it is better to use the isoc stream
>> > constantly filled. Rather then streaming big amounts of data in the
>> > beginning of an frameinterval and having then a lot of spare time
>> > where the bandwidth is completely unsused.
>> >
>> > In our reallife scenario streaming big requests had the impact, that
>> > the dwc3 core could not keep up with reading the amount of data
>> > from the memory bus, as the bus is already under heavy load. When the
>> > HW was then not able to transfer the requested and actually available
>> > amount of data in the interval, the hw did give us the usual missed
>> > interrupt answer.
>> >
>> > Using smaller requests solved the problem here, as it really was
>> > unnecessary to stress the memory and usb bus in the beginning as
>> > we had enough headroom in the temporal domain.
>>
>> Ah, I see. This was not a consideration, and it makes sense if USB
>> bus is under contention from a few different streams. So the solution
>> seems to be to spread the frame of as many requests as we can transmit
>> in one frameinterval?
>>
>> As an experiment, while we wait for others to respond, could you try
>> doubling (or 2.5x'ing to be extra safe) the number of requests allocated
>> by patch 3/3 without changing the request's buffer size?
>>
>> It won't help with the error reporting but should help with ensuring
>> that frames are sent out in one frameinterval with little to no
>> 0-length requests between them.
>>
>> The idea is that video_pump will have enough requests available to fully
>> encode the frame in one burst, and another frame's worth of request will be
>> re-added to req_free list for video_pump to fill up in the time that the next
>> frame comes in.
>>
>> >
>> > Which then led to the conclusion that the number of needed requests
>> > per image frame interval is calculatable since we know the usb
>> > interval length.
>> >
>> > @Thinh: Correct me if I am saying something wrong here.
>
>Right, if you max out the data rate per uframe, there's less opportunity
>for the host to schedule everything for that interval (e.g. affected
>from other endpoint/device traffics, link commands etc). It also
>increases the latency of DMA. In many cases, many other vendor hosts
>can't handle 48KB/uframe for SuperSpeed and 96KB/uframe for SuperSpeed
>Plus. So, you'd need to test your platform find the optimal request size
>so it can work for most hosts.
>
>> >
>> >>> Therefor to properly make those patches work, we will have to get rid of
>
>Sorry if I may have missed the explaination, but why do we need to rid
>of this?


The uvc_video gadget is queueing requests with ep_queue whenever they
are prepared. However for uvc we may not send EOF to the host until
we know that the frame was transmitted correct or wrong.

To ensure this the gadget is waiting for the last request to be
completed from dwc3. Until this request was not received, the current
workflow is to enqueue zero-length requests into the dwc3 hw. With that,
the final EOF request for the frame will be transmitted after the
zero-length requests have passed the hw. (They have no data, but they
still take one frameinterval durtion). This sparsed frame with
zero-requests inbetween will interfere with the precalculation for
request data we fill every request with based on the expected frame
duration.

I know this seems very interlocked. It is very complex indeed. Tell
me if you still have questions and I will come up with some more
details to the current uvc_video driver.

>> >>> the zero length pump mechanism again and make sure that the whole
>> >>> business logic of what to be queued and when will only be done in the
>> >>> pump worker. It is possible to let the dwc3 udc run dry, as we are
>> >>> actively waiting for the frame to finish, the last request in the
>> >>> prepared and started list will stop the current dwc3 stream and  for
>> >>> no underruns will occur with the next ep_queue.
>> >>
>> >> One thing to note here: The reason we moved to queuing 0-length requests
>> >> from complete callback was because even with realtime priority, video_pump
>> >> thread doesn't always meet the ISOC queueing cadence. I think stopping and
>> >> starting the stream was briefly discussed in our initial discussion in
>> >> https://urldefense.com/v3/__https://lore.kernel.org/all/20230419001143.pdxflhzyecf4kvee@synopsys.com/__;!!A4F2R9G_pg!ZmfvrPq4rs7MIhxNrrEqmgGrlYTJ12WgdzaqQhfEehKfjKqxPr2bC1RzUqaa9tvdBtAvXdyK2GpxYzvslpV6$
>> >> and Thinh mentioned that dwc3 controller does it if it detects an underrun,
>> >> but I am not sure if starting and stopping an ISOC stream is good practice.
>
>There's a workaround specific for UVC in dwc3 to "guess" when underrun
>happen. It's not foolproof. dwc3 should not need to do that.
>
>Isoc data is periodic and continuous. We should not expect this
>unconventional re-synchronization.

I think we have to discuss what is ment by resynchronization here. If
the trb ring buffer did run dry and the software is aware of this
(elemnt in the started and prepared list) then the interrupt handler
already is calling End Stream Command.

When the stream is stopped, what implications does this have on the bus?

When the Endpoint is enabled, will the hardware then send zero-length
requests on its own?

With the next ep_queue we start another stream and when we keep up with
this stream there is no underruns, right?

I picture this scenario in my mind:

thread 1: uvc->queue_buf is called:
   - we encode the frame buffer data into all available requests
     and put them into the per uvc_buffer perpared list
     (as we precalculated the amount of requests properly to the expected
      frame duration and buffer size there will be enough requests
      available)
   - wake up the pump thread

thread 2: pump_worker is triggered
   - take all requests from the prepared available buffer and enqueue them
     into the hardware
     (The pump worker is running with while(1) while it finds requests in
      the per buffer prepared list) and therefor will have a high chance
      to finish the pumping for one complete frame.
   - check for any errors reported from the complete handlers
     - on error
       - stop enqueing new requests from current frame
       - wait for the last request from errornous frame has returned
   - only start pumping new requests from the next buffer when the last
     request from the active frame has finished
   - In the beginning of the next frame send one extra request with
     EOF/ERR tag so the host knows that the last one was ok or not.

thread 3: complete handler (interrupt)
   - give back the requests into the empty_list
   - report EXDEV and errors
   - wake up the pump thread

With this method we will continously drain the hw trb stream of the dwc3
controller per frame and therefor will not shoot into one window where
the current stream could be missed. With the data spreading over the
many requests we also avoid the missed requests when the DMA was to
slow.

>> > The realtime latency aspect is not an issue anymore if we ensure that we
>> > always keep only one frame in the hw ring buffer. When the pump worker
>> > ensure that it will always run through one full frame the scheduler has
>> > no chance to break our running dwc3 stream. Since the pump is running
>> > under a while(1) this should be possible.
>>
>> I'll wait for your patch to see, but are you saying that we should have the
>> pump worker busy spinning  when there are no frames available? Cameras cannot
>> produce frames fast enough for video_pump to be constantly encoding frames.
>> IIRC, "encoding" a frame to usb_requests took less than a ms or two on my
>> device, and frame interval is 33ms for a 30fps stream, so the CPU would be
>> busy spinning for ~30ms which is an unreasonable time for a CPU to be
>> idling.
>>
>> >
>> > Also with the request amount precalculation we can always encode the
>> > whole frame into all available requests and don't have to wait for
>> > requests to be available again.
>> >
>> > Together with the latest knowladge about the underlying hw we even need to only
>> > keep one frame in the HW ring buffer. Since we have some interrupt latency,
>> > keeping more frames in the ring buffer, would mean that we are not able to tag
>> > the currently streamed frame properly as errornous if the dwc3 hw ring buffer
>> > is already telling the host some data about the next frame. And as we already
>> > need to wait for the end of the frame to finish, based on the assumption that
>> > only one frame is enqueued in the ring buffer the hw will stop the stream and
>> > the next requst will start a new stream. So there will no missed underruns be
>> > happening.
>> >
>> > So the main fact here is, that telling the host some status about a
>> > frame in the past is impossible! Therefor the first request of the next
>> > hw stream need to be the one that is telling the Host if the previous frame
>> > is ment to be drawn or not.
>>
>> This is a fair point, but the timing on this becomes a little difficult if
>> the frame is sent over the entire frameinterval. If we wait for the entire
>> frame to be transmitted, then we have 125us between the last request of a
>> frame being transmitted and the first request of the next frame being
>> queued. The userspace app producing the frames will have timing variations
>> larger than 125us, so we cannot rely on a frame being available exactly as
>> one frame is fully transmitted, or of us being notified of transmission
>> status by the time the next frame comes in.
>>
>> >
>> >> Someone better versed in USB protocol can probably confirm, but it seems
>> >> somewhat hacky to stop the ISOC stream at the end of the frame and restart
>> >> with the next frame.
>> >
>> > All I know is that the HW mechanism that is reading from the trb ring buffer is
>> > started or stopped I don't know if really the ISOC stream is stopped and
>> > restarted here or what that means on the real wire. And if so, I am unsure if
>> > that is really a problem or not. Thinh?
>
>For isoc IN endpoint, if the host requests for data while there's no TRB
>prepared, the controller would respond with 0-length data. When we stop
>and start again, we reschedule the prepared isoc data to go out on a new
>interval.
>
>>
>> Oh? That's great! If the controller can keep the ISOC stream from underruning
>> without the gadget feeding it 0-length requests, then we can simplify the
>> gadget side implementation quite a bit!

>I'm not entirely clear on why we cannot use 0-length requests now. I
>hope we can resolve this from UVC function driver as it seems to be a
>proper place to handle this issue.

See above.

Michael

Thinh Nguyen May 17, 2024, 1:44 a.m. UTC | #7

On Mon, May 13, 2024, Michael Grzeschik wrote:
> On Wed, Apr 24, 2024 at 02:28:10AM +0000, Thinh Nguyen wrote:
> > On Tue, Apr 23, 2024, Avichal Rakesh wrote:
> > > 
> > > 
> > > On 4/23/24 07:25, Michael Grzeschik wrote:
> > > > Ccing:
> > > >
> > > > Michael Riesch <michael.riesch@wolfvision.net>
> > > > Thinh Nguyen <Thinh.Nguyen@synopsys.com>
> > > >
> > > > On Mon, Apr 22, 2024 at 05:21:09PM -0700, Avichal Rakesh wrote:
> > > >> On 4/21/24 16:25, Michael Grzeschik wrote:
> > > >>> On Tue, Apr 09, 2024 at 11:24:56PM +0200, Michael Grzeschik wrote:
> > > >>>> This patch series is improving the size calculation and allocation
> > > >>>> of the uvc requests. Using the currenlty setup frame duration of the
> > > >>>> stream it is possible to calculate the number of requests based on the
> > > >>>> interval length.
> > > >>>
> > > >>> The basic concept here is right. But unfortunatly we found out that
> > > >>> together with Patch [1] and the current zero length request pump
> > > >>> mechanism [2] and [3] this is not working as expected.
> > > >>>
> > > >>> The conclusion that we can not queue more than one frame at once into
> > > >>> the hw led to [1]. The current implementation of zero length reqeusts
> > > >>> which will be queued while we are waiting for the frame to finish
> > > >>> transferring will enlarge the frame duration. Since every zero-length
> > > >>> request is still taking up at least one frame interval of 125 us.
> > > >>
> > > >> I haven't taken a super close look at your patches, so please feel free
> > > >> to correct me if I am misunderstanding something.
> > > >>
> > > >> It looks like the goal of the patches is to determine a better number
> > > >> and size of usb_requests from the given framerate such that we send exactly
> > > >> nreqs requests per frame where nreqs is determined to be the exact number
> > > >> of requests that can be sent in one frame interval?
> > > >
> > > > It does not need to be the exact time, actually it may not be exact.
> > > > Scattering the data over all requests would not leave any headroom for
> > > > any latencies or overhead.
> > > 
> > > IIUC, patch 3/3 sets the number of requests to frameinterval / 125 us,
> > > which gives us the number of requests we can send in exactly one frame interval,
> > > and then sets the size of the request as max framesize / nreq, which means the
> > > frames will be evenly divided up into all available requests (with a little
> > > fuzz factor here and there).
> > > 
> > > This effectively means that (assuming no other delays) one frame will take
> > > ~one frameinterval to be transmitted?
> > > 
> > > >
> > > >> As the logic stands, we need some 0-length requests to be circulating to
> > > >> ensure that we don't miss ISOC deadlines. The current logic unconditionally
> > > >> sends half of all allocated requests to be circulated.
> > > >>
> > > >> With those two things in mind, this means than video_pump can at encode
> > > >> at most half a frame in one go, and then has to wait for complete
> > > >> callbacks to come in. In such cases, the theoretical worst case for
> > > >> encode time is
> > > >> 125us * (number of requests needed per frame / 2) + scheduling delays
> > > >> as after the first half of the frame has been encoded, the video_pump
> > > >> thread will have to wait 125us for each of the zero length requests to
> > > >> be returned.
> > > >>
> > > >> The underlying assumption behind the "queue 0-length requests" approach
> > > >> was that video_pump encodes the frames in as few requests as possible
> > > >> and that there are spare requests to maintain a pressure on the
> > > >> ISOC queue without hindering the video_pump thread, and unfortunately
> > > >> it seems like patch 3/3 is breaking both of them?
> > > >
> > > > Right.
> > > >
> > > >> Assuming my understanding of your patches is correct, my question
> > > >> is: Why do we want to spread the frame uniformly over the requests
> > > >> instead of encoding it in as few requests as possible. Spreading
> > > >> the frame over more requests artificially increases the encode time
> > > >> required by video_pump, and AFAICT there is no real benefit to it?
> > > >
> > > > Thinh gave me the advise that it is better to use the isoc stream
> > > > constantly filled. Rather then streaming big amounts of data in the
> > > > beginning of an frameinterval and having then a lot of spare time
> > > > where the bandwidth is completely unsused.
> > > >
> > > > In our reallife scenario streaming big requests had the impact, that
> > > > the dwc3 core could not keep up with reading the amount of data
> > > > from the memory bus, as the bus is already under heavy load. When the
> > > > HW was then not able to transfer the requested and actually available
> > > > amount of data in the interval, the hw did give us the usual missed
> > > > interrupt answer.
> > > >
> > > > Using smaller requests solved the problem here, as it really was
> > > > unnecessary to stress the memory and usb bus in the beginning as
> > > > we had enough headroom in the temporal domain.
> > > 
> > > Ah, I see. This was not a consideration, and it makes sense if USB
> > > bus is under contention from a few different streams. So the solution
> > > seems to be to spread the frame of as many requests as we can transmit
> > > in one frameinterval?
> > > 
> > > As an experiment, while we wait for others to respond, could you try
> > > doubling (or 2.5x'ing to be extra safe) the number of requests allocated
> > > by patch 3/3 without changing the request's buffer size?
> > > 
> > > It won't help with the error reporting but should help with ensuring
> > > that frames are sent out in one frameinterval with little to no
> > > 0-length requests between them.
> > > 
> > > The idea is that video_pump will have enough requests available to fully
> > > encode the frame in one burst, and another frame's worth of request will be
> > > re-added to req_free list for video_pump to fill up in the time that the next
> > > frame comes in.
> > > 
> > > >
> > > > Which then led to the conclusion that the number of needed requests
> > > > per image frame interval is calculatable since we know the usb
> > > > interval length.
> > > >
> > > > @Thinh: Correct me if I am saying something wrong here.
> > 
> > Right, if you max out the data rate per uframe, there's less opportunity
> > for the host to schedule everything for that interval (e.g. affected
> > from other endpoint/device traffics, link commands etc). It also
> > increases the latency of DMA. In many cases, many other vendor hosts
> > can't handle 48KB/uframe for SuperSpeed and 96KB/uframe for SuperSpeed
> > Plus. So, you'd need to test your platform find the optimal request size
> > so it can work for most hosts.
> > 
> > > >
> > > >>> Therefor to properly make those patches work, we will have to get rid of
> > 
> > Sorry if I may have missed the explaination, but why do we need to rid
> > of this?
> 
> 
> The uvc_video gadget is queueing requests with ep_queue whenever they
> are prepared. However for uvc we may not send EOF to the host until
> we know that the frame was transmitted correct or wrong.
> 
> To ensure this the gadget is waiting for the last request to be
> completed from dwc3. Until this request was not received, the current
> workflow is to enqueue zero-length requests into the dwc3 hw. With that,
> the final EOF request for the frame will be transmitted after the
> zero-length requests have passed the hw. (They have no data, but they
> still take one frameinterval durtion). This sparsed frame with
> zero-requests inbetween will interfere with the precalculation for
> request data we fill every request with based on the expected frame
> duration.
> 
> I know this seems very interlocked. It is very complex indeed. Tell
> me if you still have questions and I will come up with some more
> details to the current uvc_video driver.
> 
> > > >>> the zero length pump mechanism again and make sure that the whole
> > > >>> business logic of what to be queued and when will only be done in the
> > > >>> pump worker. It is possible to let the dwc3 udc run dry, as we are
> > > >>> actively waiting for the frame to finish, the last request in the
> > > >>> prepared and started list will stop the current dwc3 stream and  for
> > > >>> no underruns will occur with the next ep_queue.
> > > >>
> > > >> One thing to note here: The reason we moved to queuing 0-length requests
> > > >> from complete callback was because even with realtime priority, video_pump
> > > >> thread doesn't always meet the ISOC queueing cadence. I think stopping and
> > > >> starting the stream was briefly discussed in our initial discussion in
> > > >> https://urldefense.com/v3/__https://lore.kernel.org/all/20230419001143.pdxflhzyecf4kvee@synopsys.com/__;!!A4F2R9G_pg!ZmfvrPq4rs7MIhxNrrEqmgGrlYTJ12WgdzaqQhfEehKfjKqxPr2bC1RzUqaa9tvdBtAvXdyK2GpxYzvslpV6$
> > > >> and Thinh mentioned that dwc3 controller does it if it detects an underrun,
> > > >> but I am not sure if starting and stopping an ISOC stream is good practice.
> > 
> > There's a workaround specific for UVC in dwc3 to "guess" when underrun
> > happen. It's not foolproof. dwc3 should not need to do that.
> > 
> > Isoc data is periodic and continuous. We should not expect this
> > unconventional re-synchronization.
> 
> I think we have to discuss what is ment by resynchronization here. If
> the trb ring buffer did run dry and the software is aware of this
> (elemnt in the started and prepared list) then the interrupt handler
> already is calling End Stream Command.

The driver only aware of this when the controller tells it, which may be
already too late.

> 
> When the stream is stopped, what implications does this have on the bus?
> 
> When the Endpoint is enabled, will the hardware then send zero-length
> requests on its own?

For isoc endpoint IN, yes. If the host requests for isoc data IN while
no TRB is prepared, then the controller will automatically send 0-length
packet respond.

> 
> With the next ep_queue we start another stream and when we keep up with
> this stream there is no underruns, right?
> 
> I picture this scenario in my mind:
> 
> thread 1: uvc->queue_buf is called:
>   - we encode the frame buffer data into all available requests
>     and put them into the per uvc_buffer perpared list
>     (as we precalculated the amount of requests properly to the expected
>      frame duration and buffer size there will be enough requests
>      available)
>   - wake up the pump thread
> 
> thread 2: pump_worker is triggered
>   - take all requests from the prepared available buffer and enqueue them
>     into the hardware
>     (The pump worker is running with while(1) while it finds requests in
>      the per buffer prepared list) and therefor will have a high chance
>      to finish the pumping for one complete frame.
>   - check for any errors reported from the complete handlers
>     - on error
>       - stop enqueing new requests from current frame
>       - wait for the last request from errornous frame has returned
>   - only start pumping new requests from the next buffer when the last
>     request from the active frame has finished
>   - In the beginning of the next frame send one extra request with
>     EOF/ERR tag so the host knows that the last one was ok or not.
> 
> thread 3: complete handler (interrupt)
>   - give back the requests into the empty_list
>   - report EXDEV and errors
>   - wake up the pump thread
> 
> With this method we will continously drain the hw trb stream of the dwc3
> controller per frame and therefor will not shoot into one window where
> the current stream could be missed. With the data spreading over the
> many requests we also avoid the missed requests when the DMA was to
> slow.
> 

This sounds good.

As long as we can maintain more than X number of requests enqueued to
accomodate for the worst latency, then we can avoid underrun. The driver
should monitor how many requests are enqueued and hopefully can keep up
the queue with zero-length requests.

Thanks,
Thinh

Michael Grzeschik May 22, 2024, 12:08 a.m. UTC | #8

On Fri, May 17, 2024 at 01:44:05AM +0000, Thinh Nguyen wrote:
>On Mon, May 13, 2024, Michael Grzeschik wrote:
>> On Wed, Apr 24, 2024 at 02:28:10AM +0000, Thinh Nguyen wrote:
>> > On Tue, Apr 23, 2024, Avichal Rakesh wrote:
>> > >
>> > >
>> > > On 4/23/24 07:25, Michael Grzeschik wrote:
>> > > > Ccing:
>> > > >
>> > > > Michael Riesch <michael.riesch@wolfvision.net>
>> > > > Thinh Nguyen <Thinh.Nguyen@synopsys.com>
>> > > >
>> > > > On Mon, Apr 22, 2024 at 05:21:09PM -0700, Avichal Rakesh wrote:
>> > > >> On 4/21/24 16:25, Michael Grzeschik wrote:
>> > > >>> On Tue, Apr 09, 2024 at 11:24:56PM +0200, Michael Grzeschik wrote:
>> > > >>>> This patch series is improving the size calculation and allocation
>> > > >>>> of the uvc requests. Using the currenlty setup frame duration of the
>> > > >>>> stream it is possible to calculate the number of requests based on the
>> > > >>>> interval length.
>> > > >>>
>> > > >>> The basic concept here is right. But unfortunatly we found out that
>> > > >>> together with Patch [1] and the current zero length request pump
>> > > >>> mechanism [2] and [3] this is not working as expected.
>> > > >>>
>> > > >>> The conclusion that we can not queue more than one frame at once into
>> > > >>> the hw led to [1]. The current implementation of zero length reqeusts
>> > > >>> which will be queued while we are waiting for the frame to finish
>> > > >>> transferring will enlarge the frame duration. Since every zero-length
>> > > >>> request is still taking up at least one frame interval of 125 us.
>> > > >>
>> > > >> I haven't taken a super close look at your patches, so please feel free
>> > > >> to correct me if I am misunderstanding something.
>> > > >>
>> > > >> It looks like the goal of the patches is to determine a better number
>> > > >> and size of usb_requests from the given framerate such that we send exactly
>> > > >> nreqs requests per frame where nreqs is determined to be the exact number
>> > > >> of requests that can be sent in one frame interval?
>> > > >
>> > > > It does not need to be the exact time, actually it may not be exact.
>> > > > Scattering the data over all requests would not leave any headroom for
>> > > > any latencies or overhead.
>> > >
>> > > IIUC, patch 3/3 sets the number of requests to frameinterval / 125 us,
>> > > which gives us the number of requests we can send in exactly one frame interval,
>> > > and then sets the size of the request as max framesize / nreq, which means the
>> > > frames will be evenly divided up into all available requests (with a little
>> > > fuzz factor here and there).
>> > >
>> > > This effectively means that (assuming no other delays) one frame will take
>> > > ~one frameinterval to be transmitted?
>> > >
>> > > >
>> > > >> As the logic stands, we need some 0-length requests to be circulating to
>> > > >> ensure that we don't miss ISOC deadlines. The current logic unconditionally
>> > > >> sends half of all allocated requests to be circulated.
>> > > >>
>> > > >> With those two things in mind, this means than video_pump can at encode
>> > > >> at most half a frame in one go, and then has to wait for complete
>> > > >> callbacks to come in. In such cases, the theoretical worst case for
>> > > >> encode time is
>> > > >> 125us * (number of requests needed per frame / 2) + scheduling delays
>> > > >> as after the first half of the frame has been encoded, the video_pump
>> > > >> thread will have to wait 125us for each of the zero length requests to
>> > > >> be returned.
>> > > >>
>> > > >> The underlying assumption behind the "queue 0-length requests" approach
>> > > >> was that video_pump encodes the frames in as few requests as possible
>> > > >> and that there are spare requests to maintain a pressure on the
>> > > >> ISOC queue without hindering the video_pump thread, and unfortunately
>> > > >> it seems like patch 3/3 is breaking both of them?
>> > > >
>> > > > Right.
>> > > >
>> > > >> Assuming my understanding of your patches is correct, my question
>> > > >> is: Why do we want to spread the frame uniformly over the requests
>> > > >> instead of encoding it in as few requests as possible. Spreading
>> > > >> the frame over more requests artificially increases the encode time
>> > > >> required by video_pump, and AFAICT there is no real benefit to it?
>> > > >
>> > > > Thinh gave me the advise that it is better to use the isoc stream
>> > > > constantly filled. Rather then streaming big amounts of data in the
>> > > > beginning of an frameinterval and having then a lot of spare time
>> > > > where the bandwidth is completely unsused.
>> > > >
>> > > > In our reallife scenario streaming big requests had the impact, that
>> > > > the dwc3 core could not keep up with reading the amount of data
>> > > > from the memory bus, as the bus is already under heavy load. When the
>> > > > HW was then not able to transfer the requested and actually available
>> > > > amount of data in the interval, the hw did give us the usual missed
>> > > > interrupt answer.
>> > > >
>> > > > Using smaller requests solved the problem here, as it really was
>> > > > unnecessary to stress the memory and usb bus in the beginning as
>> > > > we had enough headroom in the temporal domain.
>> > >
>> > > Ah, I see. This was not a consideration, and it makes sense if USB
>> > > bus is under contention from a few different streams. So the solution
>> > > seems to be to spread the frame of as many requests as we can transmit
>> > > in one frameinterval?
>> > >
>> > > As an experiment, while we wait for others to respond, could you try
>> > > doubling (or 2.5x'ing to be extra safe) the number of requests allocated
>> > > by patch 3/3 without changing the request's buffer size?
>> > >
>> > > It won't help with the error reporting but should help with ensuring
>> > > that frames are sent out in one frameinterval with little to no
>> > > 0-length requests between them.
>> > >
>> > > The idea is that video_pump will have enough requests available to fully
>> > > encode the frame in one burst, and another frame's worth of request will be
>> > > re-added to req_free list for video_pump to fill up in the time that the next
>> > > frame comes in.
>> > >
>> > > >
>> > > > Which then led to the conclusion that the number of needed requests
>> > > > per image frame interval is calculatable since we know the usb
>> > > > interval length.
>> > > >
>> > > > @Thinh: Correct me if I am saying something wrong here.
>> >
>> > Right, if you max out the data rate per uframe, there's less opportunity
>> > for the host to schedule everything for that interval (e.g. affected
>> > from other endpoint/device traffics, link commands etc). It also
>> > increases the latency of DMA. In many cases, many other vendor hosts
>> > can't handle 48KB/uframe for SuperSpeed and 96KB/uframe for SuperSpeed
>> > Plus. So, you'd need to test your platform find the optimal request size
>> > so it can work for most hosts.
>> >
>> > > >
>> > > >>> Therefor to properly make those patches work, we will have to get rid of
>> >
>> > Sorry if I may have missed the explaination, but why do we need to rid
>> > of this?
>>
>>
>> The uvc_video gadget is queueing requests with ep_queue whenever they
>> are prepared. However for uvc we may not send EOF to the host until
>> we know that the frame was transmitted correct or wrong.
>>
>> To ensure this the gadget is waiting for the last request to be
>> completed from dwc3. Until this request was not received, the current
>> workflow is to enqueue zero-length requests into the dwc3 hw. With that,
>> the final EOF request for the frame will be transmitted after the
>> zero-length requests have passed the hw. (They have no data, but they
>> still take one frameinterval durtion). This sparsed frame with
>> zero-requests inbetween will interfere with the precalculation for
>> request data we fill every request with based on the expected frame
>> duration.
>>
>> I know this seems very interlocked. It is very complex indeed. Tell
>> me if you still have questions and I will come up with some more
>> details to the current uvc_video driver.
>>
>> > > >>> the zero length pump mechanism again and make sure that the whole
>> > > >>> business logic of what to be queued and when will only be done in the
>> > > >>> pump worker. It is possible to let the dwc3 udc run dry, as we are
>> > > >>> actively waiting for the frame to finish, the last request in the
>> > > >>> prepared and started list will stop the current dwc3 stream and  for
>> > > >>> no underruns will occur with the next ep_queue.
>> > > >>
>> > > >> One thing to note here: The reason we moved to queuing 0-length requests
>> > > >> from complete callback was because even with realtime priority, video_pump
>> > > >> thread doesn't always meet the ISOC queueing cadence. I think stopping and
>> > > >> starting the stream was briefly discussed in our initial discussion in
>> > > >> https://urldefense.com/v3/__https://lore.kernel.org/all/20230419001143.pdxflhzyecf4kvee@synopsys.com/__;!!A4F2R9G_pg!ZmfvrPq4rs7MIhxNrrEqmgGrlYTJ12WgdzaqQhfEehKfjKqxPr2bC1RzUqaa9tvdBtAvXdyK2GpxYzvslpV6$
>> > > >> and Thinh mentioned that dwc3 controller does it if it detects an underrun,
>> > > >> but I am not sure if starting and stopping an ISOC stream is good practice.
>> >
>> > There's a workaround specific for UVC in dwc3 to "guess" when underrun
>> > happen. It's not foolproof. dwc3 should not need to do that.
>> >
>> > Isoc data is periodic and continuous. We should not expect this
>> > unconventional re-synchronization.
>>
>> I think we have to discuss what is ment by resynchronization here. If
>> the trb ring buffer did run dry and the software is aware of this
>> (elemnt in the started and prepared list) then the interrupt handler
>> already is calling End Stream Command.
>
>The driver only aware of this when the controller tells it, which may be
>already too late.

In our special case there should not be any too late any more. Since we
ensure that all requests will be enqueued for one transfer (which will
represent one frame) in time and we are not dependent on the complete
handler for nothing else than telling the uvc driver that the last
request came back or if there was some error in the current active
frame.

As already stated we also have to wait with enqueueing the next frame
to the hardware and only are allowed to enqueue one frame at a time.
Otherwise it is not possible to tell the host if the frame was broken or
not.

I have the following scheme in my mind. It is simplified to take frames
of only 4 requests into account. (>80 chars warning)


frameinterval:                |       125 us       |       125 us       |       125 us       |       125 us       |       125 us       |       125 us       |       125 us       |
                               |                    |                    |                    |                    |                    |                    |                    |
pump thread:   queue          |rqA1 rqA2 rqA3 rqA4'|                    |                    |                    |                    |rqB0 rqB1 rqB2 rqB3 |rqB4'               |
irq  thread:   complete       |                    |rqA1                |rqA2                |rqA3                |rqA4'               |                    |rqB0                | rqB1
qbuf thread:   encode         |rqB1 rqB2 rqB3 rqB4'|                    |                    |                    |                    |rqA1 rqA2 rqA3 rqA4'|                    |

dwc3 thread:   Hardware                            < Start Transfer                                                       End Transfer >                    < Start Transfer      ....

legend:
- rq'  : last request of a frame
- rqB0 : first request of the next transfer with no payload but the header only
          telling the host that the last frame was ok/broken

assumption:

- pump thread is never interrupted by a kernel thread but only by some short running irq
- if one request comes back with -EXDEV the rest of the enqueued requests should be flushed

In the no_interrupt case we would also only generate the interrupt for
the last request and giveback all four requests in the last interval.
This should still work fine.

We also only start streaming when one frame is totally available to be
enqueued in one run. So in case frames with rqA and rqB both did come back
with errors the start of the next frame will only begin after the next
frame was completely and fully encoded.

>> When the stream is stopped, what implications does this have on the bus?
>>
>> When the Endpoint is enabled, will the hardware then send zero-length
>> requests on its own?
>
>For isoc endpoint IN, yes. If the host requests for isoc data IN while
>no TRB is prepared, then the controller will automatically send 0-length
>packet respond.

Perfect! This will help a lot and will make active queueing of own
zero-length requests run unnecessary.

>> With the next ep_queue we start another stream and when we keep up with
>> this stream there is no underruns, right?
>>
>> I picture this scenario in my mind:
>>
>> thread 1: uvc->queue_buf is called:
>>   - we encode the frame buffer data into all available requests
>>     and put them into the per uvc_buffer perpared list
>>     (as we precalculated the amount of requests properly to the expected
>>      frame duration and buffer size there will be enough requests
>>      available)
>>   - wake up the pump thread
>>
>> thread 2: pump_worker is triggered
>>   - take all requests from the prepared available buffer and enqueue them
>>     into the hardware
>>     (The pump worker is running with while(1) while it finds requests in
>>      the per buffer prepared list) and therefor will have a high chance
>>      to finish the pumping for one complete frame.
>>   - check for any errors reported from the complete handlers
>>     - on error
>>       - stop enqueing new requests from current frame
>>       - wait for the last request from errornous frame has returned
>>   - only start pumping new requests from the next buffer when the last
>>     request from the active frame has finished
>>   - In the beginning of the next frame send one extra request with
>>     EOF/ERR tag so the host knows that the last one was ok or not.
>>
>> thread 3: complete handler (interrupt)
>>   - give back the requests into the empty_list
>>   - report EXDEV and errors
>>   - wake up the pump thread
>>
>> With this method we will continously drain the hw trb stream of the dwc3
>> controller per frame and therefor will not shoot into one window where
>> the current stream could be missed. With the data spreading over the
>> many requests we also avoid the missed requests when the DMA was to
>> slow.
>>
>
>This sounds good.
>
>As long as we can maintain more than X number of requests enqueued to
>accomodate for the worst latency, then we can avoid underrun. The driver
>should monitor how many requests are enqueued and hopefully can keep up
>the queue with zero-length requests.

Except I totally misunderstood something or did oversimplify to much,
the above explanation should run this unnecessary.

Although we are able to track the amount of enqueued requests in the udc
driver and compare that with the amount of completed requests.

Also we have the usb_gadget_frame_number callback to the udc controller
to ask in which frame it is operating at the moment. This way we would
be able to calculate not to enqueue requests into a transfer that did
not come back yet completely but would run into missed transfers.

Michael

Thinh Nguyen May 22, 2024, 1:41 a.m. UTC | #9

On Wed, May 22, 2024, Michael Grzeschik wrote:
> On Fri, May 17, 2024 at 01:44:05AM +0000, Thinh Nguyen wrote:
> > On Mon, May 13, 2024, Michael Grzeschik wrote:
> > > 
> > > I think we have to discuss what is ment by resynchronization here. If
> > > the trb ring buffer did run dry and the software is aware of this
> > > (elemnt in the started and prepared list) then the interrupt handler
> > > already is calling End Stream Command.
> > 
> > The driver only aware of this when the controller tells it, which may be
> > already too late.
> 
> In our special case there should not be any too late any more. Since we
> ensure that all requests will be enqueued for one transfer (which will
> represent one frame) in time and we are not dependent on the complete
> handler for nothing else than telling the uvc driver that the last
> request came back or if there was some error in the current active
> frame.
> 
> As already stated we also have to wait with enqueueing the next frame
> to the hardware and only are allowed to enqueue one frame at a time.
> Otherwise it is not possible to tell the host if the frame was broken or
> not.
> 
> I have the following scheme in my mind. It is simplified to take frames
> of only 4 requests into account. (>80 chars warning)
> 
> 
> frameinterval:                |       125 us       |       125 us       |       125 us       |       125 us       |       125 us       |       125 us       |       125 us       |
>                               |                    |                    |                    |                    |                    |                    |                    |
> pump thread:   queue          |rqA1 rqA2 rqA3 rqA4'|                    |                    |                    |                    |rqB0 rqB1 rqB2 rqB3 |rqB4'               |
> irq  thread:   complete       |                    |rqA1                |rqA2                |rqA3                |rqA4'               |                    |rqB0                | rqB1
> qbuf thread:   encode         |rqB1 rqB2 rqB3 rqB4'|                    |                    |                    |                    |rqA1 rqA2 rqA3 rqA4'|                    |
> 
> dwc3 thread:   Hardware                            < Start Transfer                                                       End Transfer >                    < Start Transfer      ....
> 
> legend:
> - rq'  : last request of a frame
> - rqB0 : first request of the next transfer with no payload but the header only
>          telling the host that the last frame was ok/broken
> 
> assumption:
> 
> - pump thread is never interrupted by a kernel thread but only by some short running irq
> - if one request comes back with -EXDEV the rest of the enqueued requests should be flushed
> 
> In the no_interrupt case we would also only generate the interrupt for
> the last request and giveback all four requests in the last interval.
> This should still work fine.
> 
> We also only start streaming when one frame is totally available to be
> enqueued in one run. So in case frames with rqA and rqB both did come back
> with errors the start of the next frame will only begin after the next
> frame was completely and fully encoded.

Yes. This is better.

> 
> > > When the stream is stopped, what implications does this have on the bus?
> > > 
> > > When the Endpoint is enabled, will the hardware then send zero-length
> > > requests on its own?
> > 
> > For isoc endpoint IN, yes. If the host requests for isoc data IN while
> > no TRB is prepared, then the controller will automatically send 0-length
> > packet respond.
> 
> Perfect! This will help a lot and will make active queueing of own
> zero-length requests run unnecessary.

Yes, if we rely on the current start/stop isoc transfer scheme for UVC,
then this will work.

> 
> > > With the next ep_queue we start another stream and when we keep up with
> > > this stream there is no underruns, right?
> > > 
> > > I picture this scenario in my mind:
> > > 
> > > thread 1: uvc->queue_buf is called:
> > >   - we encode the frame buffer data into all available requests
> > >     and put them into the per uvc_buffer perpared list
> > >     (as we precalculated the amount of requests properly to the expected
> > >      frame duration and buffer size there will be enough requests
> > >      available)
> > >   - wake up the pump thread
> > > 
> > > thread 2: pump_worker is triggered
> > >   - take all requests from the prepared available buffer and enqueue them
> > >     into the hardware
> > >     (The pump worker is running with while(1) while it finds requests in
> > >      the per buffer prepared list) and therefor will have a high chance
> > >      to finish the pumping for one complete frame.
> > >   - check for any errors reported from the complete handlers
> > >     - on error
> > >       - stop enqueing new requests from current frame
> > >       - wait for the last request from errornous frame has returned
> > >   - only start pumping new requests from the next buffer when the last
> > >     request from the active frame has finished
> > >   - In the beginning of the next frame send one extra request with
> > >     EOF/ERR tag so the host knows that the last one was ok or not.
> > > 
> > > thread 3: complete handler (interrupt)
> > >   - give back the requests into the empty_list
> > >   - report EXDEV and errors
> > >   - wake up the pump thread
> > > 
> > > With this method we will continously drain the hw trb stream of the dwc3
> > > controller per frame and therefor will not shoot into one window where
> > > the current stream could be missed. With the data spreading over the
> > > many requests we also avoid the missed requests when the DMA was to
> > > slow.
> > > 
> > 
> > This sounds good.
> > 
> > As long as we can maintain more than X number of requests enqueued to
> > accomodate for the worst latency, then we can avoid underrun. The driver
> > should monitor how many requests are enqueued and hopefully can keep up
> > the queue with zero-length requests.
> 
> Except I totally misunderstood something or did oversimplify to much,
> the above explanation should run this unnecessary.
> 
> Although we are able to track the amount of enqueued requests in the udc
> driver and compare that with the amount of completed requests.
> 
> Also we have the usb_gadget_frame_number callback to the udc controller
> to ask in which frame it is operating at the moment. This way we would
> be able to calculate not to enqueue requests into a transfer that did
> not come back yet completely but would run into missed transfers.
> 

I would not depend too much on usb_gadget_frame_number(). There's not
really a hard requirement for the output. It's controller specific. For
dwc3 controller, if operate in highspeed or higher, it returns
microframe number. For fullspeed, it returns frame number.

As you noted, if you can wait and queue all the requests of a video
frame at once, then this will work also.

Thanks,
Thinh

Michael Grzeschik May 22, 2024, 11:16 a.m. UTC | #10

On Tue, Apr 23, 2024 at 04:23:23PM -0700, Avichal Rakesh wrote:
>
>
>On 4/23/24 07:25, Michael Grzeschik wrote:
>> Ccing:
>>
>> Michael Riesch <michael.riesch@wolfvision.net>
>> Thinh Nguyen <Thinh.Nguyen@synopsys.com>
>>
>> On Mon, Apr 22, 2024 at 05:21:09PM -0700, Avichal Rakesh wrote:
>>> On 4/21/24 16:25, Michael Grzeschik wrote:
>>>> On Tue, Apr 09, 2024 at 11:24:56PM +0200, Michael Grzeschik wrote:
>>>>> This patch series is improving the size calculation and allocation
>>>>> of the uvc requests. Using the currenlty setup frame duration of the
>>>>> stream it is possible to calculate the number of requests based on the
>>>>> interval length.
>>>>
>>>> The basic concept here is right. But unfortunatly we found out that
>>>> together with Patch [1] and the current zero length request pump
>>>> mechanism [2] and [3] this is not working as expected.
>>>>
>>>> The conclusion that we can not queue more than one frame at once into
>>>> the hw led to [1]. The current implementation of zero length reqeusts
>>>> which will be queued while we are waiting for the frame to finish
>>>> transferring will enlarge the frame duration. Since every zero-length
>>>> request is still taking up at least one frame interval of 125 us.
>>>
>>> I haven't taken a super close look at your patches, so please feel free
>>> to correct me if I am misunderstanding something.
>>>
>>> It looks like the goal of the patches is to determine a better number
>>> and size of usb_requests from the given framerate such that we send exactly
>>> nreqs requests per frame where nreqs is determined to be the exact number
>>> of requests that can be sent in one frame interval?
>>
>> It does not need to be the exact time, actually it may not be exact.
>> Scattering the data over all requests would not leave any headroom for
>> any latencies or overhead.
>
>IIUC, patch 3/3 sets the number of requests to frameinterval / 125 us,
>which gives us the number of requests we can send in exactly one frame interval,
>and then sets the size of the request as max framesize / nreq, which means the
>frames will be evenly divided up into all available requests (with a little
>fuzz factor here and there).
>
>This effectively means that (assuming no other delays) one frame will take
>~one frameinterval to be transmitted?

In theory yes. But we have to add some headroom delay into the
calculation. The worst case we have to takle here is to leave the
encoding thread enough time to encode one full frame.

>>> As the logic stands, we need some 0-length requests to be circulating to
>>> ensure that we don't miss ISOC deadlines. The current logic unconditionally
>>> sends half of all allocated requests to be circulated.
>>>
>>> With those two things in mind, this means than video_pump can at encode
>>> at most half a frame in one go, and then has to wait for complete
>>> callbacks to come in. In such cases, the theoretical worst case for
>>> encode time is
>>> 125us * (number of requests needed per frame / 2) + scheduling delays
>>> as after the first half of the frame has been encoded, the video_pump
>>> thread will have to wait 125us for each of the zero length requests to
>>> be returned.
>>>
>>> The underlying assumption behind the "queue 0-length requests" approach
>>> was that video_pump encodes the frames in as few requests as possible
>>> and that there are spare requests to maintain a pressure on the
>>> ISOC queue without hindering the video_pump thread, and unfortunately
>>> it seems like patch 3/3 is breaking both of them?
>>
>> Right.
>>
>>> Assuming my understanding of your patches is correct, my question
>>> is: Why do we want to spread the frame uniformly over the requests
>>> instead of encoding it in as few requests as possible. Spreading
>>> the frame over more requests artificially increases the encode time
>>> required by video_pump, and AFAICT there is no real benefit to it?
>>
>> Thinh gave me the advise that it is better to use the isoc stream
>> constantly filled. Rather then streaming big amounts of data in the
>> beginning of an frameinterval and having then a lot of spare time
>> where the bandwidth is completely unsused.
>>
>> In our reallife scenario streaming big requests had the impact, that
>> the dwc3 core could not keep up with reading the amount of data
>> from the memory bus, as the bus is already under heavy load. When the
>> HW was then not able to transfer the requested and actually available
>> amount of data in the interval, the hw did give us the usual missed
>> interrupt answer.
>>
>> Using smaller requests solved the problem here, as it really was
>> unnecessary to stress the memory and usb bus in the beginning as
>> we had enough headroom in the temporal domain.
>
>Ah, I see. This was not a consideration, and it makes sense if USB
>bus is under contention from a few different streams. So the solution
>seems to be to spread the frame of as many requests as we can transmit
>in one frameinterval?

Right.

>As an experiment, while we wait for others to respond, could you try
>doubling (or 2.5x'ing to be extra safe) the number of requests allocated
>by patch 3/3 without changing the request's buffer size?
>
>It won't help with the error reporting but should help with ensuring
>that frames are sent out in one frameinterval with little to no
>0-length requests between them.

This is okay, but will not help since we have to wait for the frame to
finish.

>The idea is that video_pump will have enough requests available to fully
>encode the frame in one burst, and another frame's worth of request will be
>re-added to req_free list for video_pump to fill up in the time that the next
>frame comes in.

Right but because of the zero requests between the frame finish request
we will unnecessary blow up the frame duration. Although there are not
many zero-requests.

>> Which then led to the conclusion that the number of needed requests
>> per image frame interval is calculatable since we know the usb
>> interval length.
>>
>> @Thinh: Correct me if I am saying something wrong here.
>>
>>>> Therefor to properly make those patches work, we will have to get rid of
>>>> the zero length pump mechanism again and make sure that the whole
>>>> business logic of what to be queued and when will only be done in the
>>>> pump worker. It is possible to let the dwc3 udc run dry, as we are
>>>> actively waiting for the frame to finish, the last request in the
>>>> prepared and started list will stop the current dwc3 stream and  for
>>>> no underruns will occur with the next ep_queue.
>>>
>>> One thing to note here: The reason we moved to queuing 0-length requests
>>> from complete callback was because even with realtime priority, video_pump
>>> thread doesn't always meet the ISOC queueing cadence. I think stopping and
>>> starting the stream was briefly discussed in our initial discussion in
>>> https://lore.kernel.org/all/20230419001143.pdxflhzyecf4kvee@synopsys.com/
>>> and Thinh mentioned that dwc3 controller does it if it detects an underrun,
>>> but I am not sure if starting and stopping an ISOC stream is good practice.
>>
>> The realtime latency aspect is not an issue anymore if we ensure that we
>> always keep only one frame in the hw ring buffer. When the pump worker
>> ensure that it will always run through one full frame the scheduler has
>> no chance to break our running dwc3 stream. Since the pump is running
>> under a while(1) this should be possible.
>
>I'll wait for your patch to see, but are you saying that we should have the
>pump worker busy spinning  when there are no frames available? Cameras cannot
>produce frames fast enough for video_pump to be constantly encoding frames.
>IIRC, "encoding" a frame to usb_requests took less than a ms or two on my
>device, and frame interval is 33ms for a 30fps stream, so the CPU would be
>busy spinning for ~30ms which is an unreasonable time for a CPU to be
>idling.

I hope the whole Idea got much clearer to you while I was discussing the
topic and the roadmap with Thinh the last days.

>> Also with the request amount precalculation we can always encode the
>> whole frame into all available requests and don't have to wait for
>> requests to be available again.
>>
>> Together with the latest knowladge about the underlying hw we even need to only
>> keep one frame in the HW ring buffer. Since we have some interrupt latency,
>> keeping more frames in the ring buffer, would mean that we are not able to tag
>> the currently streamed frame properly as errornous if the dwc3 hw ring buffer
>> is already telling the host some data about the next frame. And as we already
>> need to wait for the end of the frame to finish, based on the assumption that
>> only one frame is enqueued in the ring buffer the hw will stop the stream and
>> the next requst will start a new stream. So there will no missed underruns be
>> happening.
>>
>> So the main fact here is, that telling the host some status about a
>> frame in the past is impossible! Therefor the first request of the next
>> hw stream need to be the one that is telling the Host if the previous frame
>> is ment to be drawn or not.
>
>This is a fair point, but the timing on this becomes a little difficult if
>the frame is sent over the entire frameinterval. If we wait for the entire
>frame to be transmitted, then we have 125us between the last request of a
>frame being transmitted and the first request of the next frame being
>queued. The userspace app producing the frames will have timing variations
>larger than 125us, so we cannot rely on a frame being available exactly as
>one frame is fully transmitted, or of us being notified of transmission
>status by the time the next frame comes in.
>
>>
>>> Someone better versed in USB protocol can probably confirm, but it seems
>>> somewhat hacky to stop the ISOC stream at the end of the frame and restart
>>> with the next frame.
>>
>> All I know is that the HW mechanism that is reading from the trb ring buffer is
>> started or stopped I don't know if really the ISOC stream is stopped and
>> restarted here or what that means on the real wire. And if so, I am unsure if
>> that is really a problem or not. Thinh?
>
>Oh? That's great! If the controller can keep the ISOC stream from underruning
>without the gadget feeding it 0-length requests, then we can simplify the
>gadget side implementation quite a bit!

Exactly that is the case.

Can you followup to my discussion with Thinh and add your thougts
to the whole idea.

Regards,
Michael

Alan Stern May 22, 2024, 2:50 p.m. UTC | #11

On Wed, May 22, 2024 at 01:41:42AM +0000, Thinh Nguyen wrote:
> On Wed, May 22, 2024, Michael Grzeschik wrote:
> > On Fri, May 17, 2024 at 01:44:05AM +0000, Thinh Nguyen wrote:
> > > For isoc endpoint IN, yes. If the host requests for isoc data IN while
> > > no TRB is prepared, then the controller will automatically send 0-length
> > > packet respond.
> > 
> > Perfect! This will help a lot and will make active queueing of own
> > zero-length requests run unnecessary.
> 
> Yes, if we rely on the current start/stop isoc transfer scheme for UVC,
> then this will work.

You shouldn't rely on this behavior.  Other device controllers might not 
behave this way; they might send no packet at all to the host (causing a 
USB protocol error) instead of sending a zero-length packet.

On the other hand, it may not make any difference.  The host's UVC 
driver most likely won't care about the difference between no packet and 
a 0-length packet.  :-)

Alan Stern

Thinh Nguyen May 22, 2024, 5:17 p.m. UTC | #12

On Wed, May 22, 2024, Alan Stern wrote:
> On Wed, May 22, 2024 at 01:41:42AM +0000, Thinh Nguyen wrote:
> > On Wed, May 22, 2024, Michael Grzeschik wrote:
> > > On Fri, May 17, 2024 at 01:44:05AM +0000, Thinh Nguyen wrote:
> > > > For isoc endpoint IN, yes. If the host requests for isoc data IN while
> > > > no TRB is prepared, then the controller will automatically send 0-length
> > > > packet respond.
> > > 
> > > Perfect! This will help a lot and will make active queueing of own
> > > zero-length requests run unnecessary.
> > 
> > Yes, if we rely on the current start/stop isoc transfer scheme for UVC,
> > then this will work.
> 
> You shouldn't rely on this behavior.  Other device controllers might not 
> behave this way; they might send no packet at all to the host (causing a 
> USB protocol error) instead of sending a zero-length packet.

I agree. The dwc3 driver has this workaround to somewhat work with the
UVC. This behavior is not something the controller expected, and this
workaround should not be a common behavior for different function
driver/protocol. Since this behavior was added a long time ago, it will
remain the default behavior in dwc3 to avoid regression with UVC (at
least until the UVC is changed). However, it would be nice for UVC to
not rely on this.

Side note, when the dwc3 driver reschedules/starts isoc transfer again,
the first transfer will be scheduled go out at some future interval and
not the next immediate microframe. For UVC, it probably won't be a
problem since it doesn't seem to need data going out every interval.

BR,
Thinh

> 
> On the other hand, it may not make any difference.  The host's UVC 
> driver most likely won't care about the difference between no packet and 
> a 0-length packet.  :-)
> 
> Alan Stern

Michael Grzeschik May 22, 2024, 5:37 p.m. UTC | #13

On Wed, May 22, 2024 at 05:17:02PM +0000, Thinh Nguyen wrote:
>On Wed, May 22, 2024, Alan Stern wrote:
>> On Wed, May 22, 2024 at 01:41:42AM +0000, Thinh Nguyen wrote:
>> > On Wed, May 22, 2024, Michael Grzeschik wrote:
>> > > On Fri, May 17, 2024 at 01:44:05AM +0000, Thinh Nguyen wrote:
>> > > > For isoc endpoint IN, yes. If the host requests for isoc data IN while
>> > > > no TRB is prepared, then the controller will automatically send 0-length
>> > > > packet respond.
>> > >
>> > > Perfect! This will help a lot and will make active queueing of own
>> > > zero-length requests run unnecessary.
>> >
>> > Yes, if we rely on the current start/stop isoc transfer scheme for UVC,
>> > then this will work.
>>
>> You shouldn't rely on this behavior.  Other device controllers might not
>> behave this way; they might send no packet at all to the host (causing a
>> USB protocol error) instead of sending a zero-length packet.
>
>I agree. The dwc3 driver has this workaround to somewhat work with the
>UVC. This behavior is not something the controller expected, and this
>workaround should not be a common behavior for different function
>driver/protocol. Since this behavior was added a long time ago, it will
>remain the default behavior in dwc3 to avoid regression with UVC (at
>least until the UVC is changed). However, it would be nice for UVC to
>not rely on this.

With "this" you mean exactly the following commit, right?

(f5e46aa4 usb: dwc3: gadget: when the started list is empty stop the active xfer)

When we start questioning this, then lets dig deeper here.

With the fast datarate of at least usb superspeed shouldn't they not all
completely work asynchronous with their in flight trbs?

In my understanding this validates that, with at least superspeed we are
unlikely to react fast enough to maintain a steady isoc dataflow, since
the driver above has to react to errors in the processing context.

This runs the above patch (f5e46aa4) a gadget independent solution
which has nothing to do with uvc in particular IMHO.

How do other controllers and their drivers work?

>Side note, when the dwc3 driver reschedules/starts isoc transfer again,
>the first transfer will be scheduled go out at some future interval and
>not the next immediate microframe. For UVC, it probably won't be a
>problem since it doesn't seem to need data going out every interval.

It should not make a difference. [TM]

>>
>> On the other hand, it may not make any difference.  The host's UVC
>> driver most likely won't care about the difference between no packet and
>> a 0-length packet.  :-)
>>
>> Alan Stern

Thinh Nguyen May 22, 2024, 6:23 p.m. UTC | #14

On Wed, May 22, 2024, Michael Grzeschik wrote:
> On Wed, May 22, 2024 at 05:17:02PM +0000, Thinh Nguyen wrote:
> > On Wed, May 22, 2024, Alan Stern wrote:
> > > On Wed, May 22, 2024 at 01:41:42AM +0000, Thinh Nguyen wrote:
> > > > On Wed, May 22, 2024, Michael Grzeschik wrote:
> > > > > On Fri, May 17, 2024 at 01:44:05AM +0000, Thinh Nguyen wrote:
> > > > > > For isoc endpoint IN, yes. If the host requests for isoc data IN while
> > > > > > no TRB is prepared, then the controller will automatically send 0-length
> > > > > > packet respond.
> > > > >
> > > > > Perfect! This will help a lot and will make active queueing of own
> > > > > zero-length requests run unnecessary.
> > > >
> > > > Yes, if we rely on the current start/stop isoc transfer scheme for UVC,
> > > > then this will work.
> > > 
> > > You shouldn't rely on this behavior.  Other device controllers might not
> > > behave this way; they might send no packet at all to the host (causing a
> > > USB protocol error) instead of sending a zero-length packet.
> > 
> > I agree. The dwc3 driver has this workaround to somewhat work with the
> > UVC. This behavior is not something the controller expected, and this
> > workaround should not be a common behavior for different function
> > driver/protocol. Since this behavior was added a long time ago, it will
> > remain the default behavior in dwc3 to avoid regression with UVC (at
> > least until the UVC is changed). However, it would be nice for UVC to
> > not rely on this.
> 
> With "this" you mean exactly the following commit, right?
> 
> (f5e46aa4 usb: dwc3: gadget: when the started list is empty stop the active xfer)

I believe that was the case. However, there were changes prior to that
that restarts the isoc on missed isoc, which shouldn't be the default
behavior.

> 
> When we start questioning this, then lets dig deeper here.
> 
> With the fast datarate of at least usb superspeed shouldn't they not all
> completely work asynchronous with their in flight trbs?

I'm not sure what you mean by asynchronous here.

> 
> In my understanding this validates that, with at least superspeed we are
> unlikely to react fast enough to maintain a steady isoc dataflow, since
> the driver above has to react to errors in the processing context.

The point is that we don't stop the isoc transfers unless there's a
change to the interface. The dwc3 driver should not need to do anything
else except reporting the missed request to the function driver. The
application/function driver should keep up with the continous data base
on the established periodic service interval. This is not bulk transfer.

> 
> This runs the above patch (f5e46aa4) a gadget independent solution
> which has nothing to do with uvc in particular IMHO.

If that's not the case, I stand corrected since I thought you sent
that patch in particular for UVC. Regardless, my other points still
stand.

> 
> How do other controllers and their drivers work?

If we stop and start the transfer because the function driver can't keep
up, then the data will go out in the wrong interval. The UVC host seems
to work base on the video frame rather than at the USB microframe
interval that it uses. It may not be the case for other protocols. If
isoc endpoint is used, then we expect timeliness matter.

BR,
Thinh

> 
> > Side note, when the dwc3 driver reschedules/starts isoc transfer again,
> > the first transfer will be scheduled go out at some future interval and
> > not the next immediate microframe. For UVC, it probably won't be a
> > problem since it doesn't seem to need data going out every interval.
> 
> It should not make a difference. [TM]
> 
> > > 
> > > On the other hand, it may not make any difference.  The host's UVC
> > > driver most likely won't care about the difference between no packet and
> > > a 0-length packet.  :-)
> > > 
> > > Alan Stern
> 
> -- 
> Pengutronix e.K.                           |                             |
> Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
> 31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
> Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

Alan Stern May 22, 2024, 6:58 p.m. UTC | #15

On Wed, May 22, 2024 at 07:37:59PM +0200, Michael Grzeschik wrote:
> On Wed, May 22, 2024 at 05:17:02PM +0000, Thinh Nguyen wrote:
> > I agree. The dwc3 driver has this workaround to somewhat work with the
> > UVC. This behavior is not something the controller expected, and this
> > workaround should not be a common behavior for different function
> > driver/protocol. Since this behavior was added a long time ago, it will
> > remain the default behavior in dwc3 to avoid regression with UVC (at
> > least until the UVC is changed). However, it would be nice for UVC to
> > not rely on this.
> 
> With "this" you mean exactly the following commit, right?
> 
> (f5e46aa4 usb: dwc3: gadget: when the started list is empty stop the active xfer)
> 
> When we start questioning this, then lets dig deeper here.
> 
> With the fast datarate of at least usb superspeed shouldn't they not all
> completely work asynchronous with their in flight trbs?
> 
> In my understanding this validates that, with at least superspeed we are
> unlikely to react fast enough to maintain a steady isoc dataflow, since
> the driver above has to react to errors in the processing context.
> 
> This runs the above patch (f5e46aa4) a gadget independent solution
> which has nothing to do with uvc in particular IMHO.
> 
> How do other controllers and their drivers work?

You should think of isochronous transfer requests as a pipeline flowing 
from the function driver to the UDC driver.  As long as the pipeline 
never becomes empty, transfers will occur with the desired timing: one 
packet (ignoring high-bandwidth issues) per isochronous interval.

But if the pipeline does become empty, because the function driver 
doesn't submit requests to the UDC driver quickly enough, the behavior 
is undetermined.  Obviously no data can be sent before the next request 
is submitted.  And even if the next request is submitted before the next 
time interval expires, the UDC driver might delay transferring the 
request's data until a later time interval.  Or it might not.  In short, 
when the function driver does submit its next request, there's no way to 
know in which interval its data will get sent to the host.  Furthermore, 
there's no way in the gadget framework for the function driver to ask 
that the request be sent in a particular transfer window.

This means that function drivers should do their best to make sure the 
pipeline never becomes empty, that there always is at least one request 
in progress at all times.  Even if this requires submitting zero-length 
requests because the necessary data isn't available yet.

Remember, isochronous transfers are meant only to be a best-effort 
attempt at low-latency data delivery, without guarantees (other than a 
reserved amount of bandwidth).  If a packet gets lost or dropped from 
time to time, whether because of transmission errors or because the data 
source was unable to keep up, it shouldn't matter very much in the end.
The receiver (i.e., the host) should be able to recover, resynchronize, 
and move on.

Alan Stern

Avichal Rakesh May 28, 2024, 5:30 p.m. UTC | #16

On 5/22/24 10:37, Michael Grzeschik wrote:
> On Wed, May 22, 2024 at 05:17:02PM +0000, Thinh Nguyen wrote:
>> On Wed, May 22, 2024, Alan Stern wrote:
>>> On Wed, May 22, 2024 at 01:41:42AM +0000, Thinh Nguyen wrote:
>>> > On Wed, May 22, 2024, Michael Grzeschik wrote:
>>> > > On Fri, May 17, 2024 at 01:44:05AM +0000, Thinh Nguyen wrote:
>>> > > > For isoc endpoint IN, yes. If the host requests for isoc data IN while
>>> > > > no TRB is prepared, then the controller will automatically send 0-length
>>> > > > packet respond.
>>> > >
>>> > > Perfect! This will help a lot and will make active queueing of own
>>> > > zero-length requests run unnecessary.
>>> >
>>> > Yes, if we rely on the current start/stop isoc transfer scheme for UVC,
>>> > then this will work.
>>>
>>> You shouldn't rely on this behavior.  Other device controllers might not
>>> behave this way; they might send no packet at all to the host (causing a
>>> USB protocol error) instead of sending a zero-length packet.
>>
>> I agree. The dwc3 driver has this workaround to somewhat work with the
>> UVC. This behavior is not something the controller expected, and this
>> workaround should not be a common behavior for different function
>> driver/protocol. Since this behavior was added a long time ago, it will
>> remain the default behavior in dwc3 to avoid regression with UVC (at
>> least until the UVC is changed). However, it would be nice for UVC to
>> not rely on this.
> 
> With "this" you mean exactly the following commit, right?
> 
> (f5e46aa4 usb: dwc3: gadget: when the started list is empty stop the active xfer)
> 
> When we start questioning this, then lets dig deeper here.
> 
> With the fast datarate of at least usb superspeed shouldn't they not all
> completely work asynchronous with their in flight trbs?
> 
> In my understanding this validates that, with at least superspeed we are
> unlikely to react fast enough to maintain a steady isoc dataflow, since
> the driver above has to react to errors in the processing context.
> 
> This runs the above patch (f5e46aa4) a gadget independent solution
> which has nothing to do with uvc in particular IMHO.
> 
> How do other controllers and their drivers work?
> 
>> Side note, when the dwc3 driver reschedules/starts isoc transfer again,
>> the first transfer will be scheduled go out at some future interval and
>> not the next immediate microframe. For UVC, it probably won't be a
>> problem since it doesn't seem to need data going out every interval.
> 
> It should not make a difference. [TM]
> 


Sorry for being absent for a lot of this discussion.

I want to take a step back from the details of how we're 
solving the problem to what problems we're trying to solve. 

So, question(s) for Michael, because I don't see an explicit 
answer in this thread (and my sincerest apologies if they've 
been answered already and I missed it):

What exactly is the bug (or bugs) we're trying to solve here?

So far, it looks like there are two main problems being 
discussed:

1. Reducing the bandwidth usage of individual usb_requests
2. Error handling for when transmission over the wire fails.

Is that correct, or are there other issues at play here?

(1) in isolation should be relatively easy to solve: Just
smear the encoded frame across some percentage 
(prefereably < 100%) of the usb_requests in one 
video frame interval.

(2) is more complicated, and your suggestion is to have a 
sentinel request between two video frames that tells the 
host if the transmission of "current" video frame was 
successful or not. (Is that correct?) 

Assuming my understanding of (2) is correct, how should
the host behave if the transmission of the sentinel
request fails after the video frame was sent 
successfully (isoc is best effort so transmission is 
never guaranteed)?


Best,
Avi.

Michael Grzeschik May 28, 2024, 8:22 p.m. UTC | #17

On Tue, May 28, 2024 at 10:30:30AM -0700, Avichal Rakesh wrote:
>
>
>On 5/22/24 10:37, Michael Grzeschik wrote:
>> On Wed, May 22, 2024 at 05:17:02PM +0000, Thinh Nguyen wrote:
>>> On Wed, May 22, 2024, Alan Stern wrote:
>>>> On Wed, May 22, 2024 at 01:41:42AM +0000, Thinh Nguyen wrote:
>>>> > On Wed, May 22, 2024, Michael Grzeschik wrote:
>>>> > > On Fri, May 17, 2024 at 01:44:05AM +0000, Thinh Nguyen wrote:
>>>> > > > For isoc endpoint IN, yes. If the host requests for isoc data IN while
>>>> > > > no TRB is prepared, then the controller will automatically send 0-length
>>>> > > > packet respond.
>>>> > >
>>>> > > Perfect! This will help a lot and will make active queueing of own
>>>> > > zero-length requests run unnecessary.
>>>> >
>>>> > Yes, if we rely on the current start/stop isoc transfer scheme for UVC,
>>>> > then this will work.
>>>>
>>>> You shouldn't rely on this behavior.  Other device controllers might not
>>>> behave this way; they might send no packet at all to the host (causing a
>>>> USB protocol error) instead of sending a zero-length packet.
>>>
>>> I agree. The dwc3 driver has this workaround to somewhat work with the
>>> UVC. This behavior is not something the controller expected, and this
>>> workaround should not be a common behavior for different function
>>> driver/protocol. Since this behavior was added a long time ago, it will
>>> remain the default behavior in dwc3 to avoid regression with UVC (at
>>> least until the UVC is changed). However, it would be nice for UVC to
>>> not rely on this.
>>
>> With "this" you mean exactly the following commit, right?
>>
>> (f5e46aa4 usb: dwc3: gadget: when the started list is empty stop the active xfer)
>>
>> When we start questioning this, then lets dig deeper here.
>>
>> With the fast datarate of at least usb superspeed shouldn't they not all
>> completely work asynchronous with their in flight trbs?
>>
>> In my understanding this validates that, with at least superspeed we are
>> unlikely to react fast enough to maintain a steady isoc dataflow, since
>> the driver above has to react to errors in the processing context.
>>
>> This runs the above patch (f5e46aa4) a gadget independent solution
>> which has nothing to do with uvc in particular IMHO.
>>
>> How do other controllers and their drivers work?
>>
>>> Side note, when the dwc3 driver reschedules/starts isoc transfer again,
>>> the first transfer will be scheduled go out at some future interval and
>>> not the next immediate microframe. For UVC, it probably won't be a
>>> problem since it doesn't seem to need data going out every interval.
>>
>> It should not make a difference. [TM]
>>
>
>
>Sorry for being absent for a lot of this discussion.
>
>I want to take a step back from the details of how we're
>solving the problem to what problems we're trying to solve.
>
>So, question(s) for Michael, because I don't see an explicit
>answer in this thread (and my sincerest apologies if they've
>been answered already and I missed it):
>
>What exactly is the bug (or bugs) we're trying to solve here?
>
>So far, it looks like there are two main problems being
>discussed:
>
>1. Reducing the bandwidth usage of individual usb_requests
>2. Error handling for when transmission over the wire fails.
>
>Is that correct, or are there other issues at play here?

That is correct.

>(1) in isolation should be relatively easy to solve: Just
>smear the encoded frame across some percentage
>(prefereably < 100%) of the usb_requests in one
>video frame interval.

Right.

>(2) is more complicated, and your suggestion is to have a
>sentinel request between two video frames that tells the
>host if the transmission of "current" video frame was
>successful or not. (Is that correct?)

Right.

>Assuming my understanding of (2) is correct, how should
>the host behave if the transmission of the sentinel
>request fails after the video frame was sent
>successfully (isoc is best effort so transmission is
>never guaranteed)?

If we have transmitted all requests of the frame but would only miss the
sentinel request to the host that includes the EOF, the host could
rather show it or drop it. The drop would not be necessary since the
host did see all data of the frame. The user would not see any
distortion in both cases.

If we have transmitted only partial data of the frame it would be
preferred if the host would drop the frame that did not finish with an
proper EOF tag.

AFAIK the linux kernel would still show the frame if the FID of the
currently handled request would change and would take this as the end of
frame indication. But I am not totally sure if this is by spec or
optional.

One option to be totally sure would be to resend the sentinel request to
be properly transmitted before starting the next frame. This resend
polling would probably include some extra zero-length requests. But also
if this resend keeps failing for n times, the driver should doubt there
is anything sane going on with the USB connection and bail out somehow.

Since we try to tackle case (1) to avoid transmit errors and also avoid
creating late enqueued requests in the running isoc transfer, the over
all chance to trigger missed transfers should be minimal.

Regards,
Michael

Avichal Rakesh May 28, 2024, 9:27 p.m. UTC | #18

On 5/28/24 13:22, Michael Grzeschik wrote:
> On Tue, May 28, 2024 at 10:30:30AM -0700, Avichal Rakesh wrote:
>>
>>
>> On 5/22/24 10:37, Michael Grzeschik wrote:
>>> On Wed, May 22, 2024 at 05:17:02PM +0000, Thinh Nguyen wrote:
>>>> On Wed, May 22, 2024, Alan Stern wrote:
>>>>> On Wed, May 22, 2024 at 01:41:42AM +0000, Thinh Nguyen wrote:
>>>>> > On Wed, May 22, 2024, Michael Grzeschik wrote:
>>>>> > > On Fri, May 17, 2024 at 01:44:05AM +0000, Thinh Nguyen wrote:
>>>>> > > > For isoc endpoint IN, yes. If the host requests for isoc data IN while
>>>>> > > > no TRB is prepared, then the controller will automatically send 0-length
>>>>> > > > packet respond.
>>>>> > >
>>>>> > > Perfect! This will help a lot and will make active queueing of own
>>>>> > > zero-length requests run unnecessary.
>>>>> >
>>>>> > Yes, if we rely on the current start/stop isoc transfer scheme for UVC,
>>>>> > then this will work.
>>>>>
>>>>> You shouldn't rely on this behavior.  Other device controllers might not
>>>>> behave this way; they might send no packet at all to the host (causing a
>>>>> USB protocol error) instead of sending a zero-length packet.
>>>>
>>>> I agree. The dwc3 driver has this workaround to somewhat work with the
>>>> UVC. This behavior is not something the controller expected, and this
>>>> workaround should not be a common behavior for different function
>>>> driver/protocol. Since this behavior was added a long time ago, it will
>>>> remain the default behavior in dwc3 to avoid regression with UVC (at
>>>> least until the UVC is changed). However, it would be nice for UVC to
>>>> not rely on this.
>>>
>>> With "this" you mean exactly the following commit, right?
>>>
>>> (f5e46aa4 usb: dwc3: gadget: when the started list is empty stop the active xfer)
>>>
>>> When we start questioning this, then lets dig deeper here.
>>>
>>> With the fast datarate of at least usb superspeed shouldn't they not all
>>> completely work asynchronous with their in flight trbs?
>>>
>>> In my understanding this validates that, with at least superspeed we are
>>> unlikely to react fast enough to maintain a steady isoc dataflow, since
>>> the driver above has to react to errors in the processing context.
>>>
>>> This runs the above patch (f5e46aa4) a gadget independent solution
>>> which has nothing to do with uvc in particular IMHO.
>>>
>>> How do other controllers and their drivers work?
>>>
>>>> Side note, when the dwc3 driver reschedules/starts isoc transfer again,
>>>> the first transfer will be scheduled go out at some future interval and
>>>> not the next immediate microframe. For UVC, it probably won't be a
>>>> problem since it doesn't seem to need data going out every interval.
>>>
>>> It should not make a difference. [TM]
>>>
>>
>>
>> Sorry for being absent for a lot of this discussion.
>>
>> I want to take a step back from the details of how we're
>> solving the problem to what problems we're trying to solve.
>>
>> So, question(s) for Michael, because I don't see an explicit
>> answer in this thread (and my sincerest apologies if they've
>> been answered already and I missed it):
>>
>> What exactly is the bug (or bugs) we're trying to solve here?
>>
>> So far, it looks like there are two main problems being
>> discussed:
>>
>> 1. Reducing the bandwidth usage of individual usb_requests
>> 2. Error handling for when transmission over the wire fails.
>>
>> Is that correct, or are there other issues at play here?
> 
> That is correct.
> 
>> (1) in isolation should be relatively easy to solve: Just
>> smear the encoded frame across some percentage
>> (prefereably < 100%) of the usb_requests in one
>> video frame interval.
> 
> Right.
> 
>> (2) is more complicated, and your suggestion is to have a
>> sentinel request between two video frames that tells the
>> host if the transmission of "current" video frame was
>> successful or not. (Is that correct?)
> 
> Right.
> 
>> Assuming my understanding of (2) is correct, how should
>> the host behave if the transmission of the sentinel
>> request fails after the video frame was sent
>> successfully (isoc is best effort so transmission is
>> never guaranteed)?
> 
> If we have transmitted all requests of the frame but would only miss the
> sentinel request to the host that includes the EOF, the host could
> rather show it or drop it. The drop would not be necessary since the
> host did see all data of the frame. The user would not see any
> distortion in both cases.
> 
> If we have transmitted only partial data of the frame it would be
> preferred if the host would drop the frame that did not finish with an
> proper EOF tag.
> 
> AFAIK the linux kernel would still show the frame if the FID of the
> currently handled request would change and would take this as the end of
> frame indication. But I am not totally sure if this is by spec or
> optional.
> 
> One option to be totally sure would be to resend the sentinel request to
> be properly transmitted before starting the next frame. This resend
> polling would probably include some extra zero-length requests. But also
> if this resend keeps failing for n times, the driver should doubt there
> is anything sane going on with the USB connection and bail out somehow.
> 
> Since we try to tackle case (1) to avoid transmit errors and also avoid
> creating late enqueued requests in the running isoc transfer, the over
> all chance to trigger missed transfers should be minimal.

Gotcha. It seems like the UVC gadget driver implicitly assumes that EOF 
flag will be used although the userspace application can technically 
make it optional.

Summarizing some of the discussions above: 
1. UVC gadget driver should _not_ rely on the usb controller to 
   enqueue 0-length requests on UVC gadget drivers behalf;
2. However keeping up the backpressure to the controller means the 
   EOF request will be delayed behind all the zero-length requests.

Out of curiosity: What is wrong with letting the host rely on 
FID alone? Decoding the jpeg payload _should_ fail if any of the 
usb_requests containing the payload failed to transmit.

Was there a situation where usb_requests were failing but jpeg
decoding succeeded  (i.e. the host was unaware of failure)? Looking at 
the error handling on the gadget driver side, I see there is space to 
improve it, but maybe we can do it in a way that doesn't add more time 
constraints!

- Avii.

Michael Grzeschik May 28, 2024, 10:43 p.m. UTC | #19

On Tue, May 28, 2024 at 02:27:34PM -0700, Avichal Rakesh wrote:
>
>
>On 5/28/24 13:22, Michael Grzeschik wrote:
>> On Tue, May 28, 2024 at 10:30:30AM -0700, Avichal Rakesh wrote:
>>>
>>>
>>> On 5/22/24 10:37, Michael Grzeschik wrote:
>>>> On Wed, May 22, 2024 at 05:17:02PM +0000, Thinh Nguyen wrote:
>>>>> On Wed, May 22, 2024, Alan Stern wrote:
>>>>>> On Wed, May 22, 2024 at 01:41:42AM +0000, Thinh Nguyen wrote:
>>>>>> > On Wed, May 22, 2024, Michael Grzeschik wrote:
>>>>>> > > On Fri, May 17, 2024 at 01:44:05AM +0000, Thinh Nguyen wrote:
>>>>>> > > > For isoc endpoint IN, yes. If the host requests for isoc data IN while
>>>>>> > > > no TRB is prepared, then the controller will automatically send 0-length
>>>>>> > > > packet respond.
>>>>>> > >
>>>>>> > > Perfect! This will help a lot and will make active queueing of own
>>>>>> > > zero-length requests run unnecessary.
>>>>>> >
>>>>>> > Yes, if we rely on the current start/stop isoc transfer scheme for UVC,
>>>>>> > then this will work.
>>>>>>
>>>>>> You shouldn't rely on this behavior.  Other device controllers might not
>>>>>> behave this way; they might send no packet at all to the host (causing a
>>>>>> USB protocol error) instead of sending a zero-length packet.
>>>>>
>>>>> I agree. The dwc3 driver has this workaround to somewhat work with the
>>>>> UVC. This behavior is not something the controller expected, and this
>>>>> workaround should not be a common behavior for different function
>>>>> driver/protocol. Since this behavior was added a long time ago, it will
>>>>> remain the default behavior in dwc3 to avoid regression with UVC (at
>>>>> least until the UVC is changed). However, it would be nice for UVC to
>>>>> not rely on this.
>>>>
>>>> With "this" you mean exactly the following commit, right?
>>>>
>>>> (f5e46aa4 usb: dwc3: gadget: when the started list is empty stop the active xfer)
>>>>
>>>> When we start questioning this, then lets dig deeper here.
>>>>
>>>> With the fast datarate of at least usb superspeed shouldn't they not all
>>>> completely work asynchronous with their in flight trbs?
>>>>
>>>> In my understanding this validates that, with at least superspeed we are
>>>> unlikely to react fast enough to maintain a steady isoc dataflow, since
>>>> the driver above has to react to errors in the processing context.
>>>>
>>>> This runs the above patch (f5e46aa4) a gadget independent solution
>>>> which has nothing to do with uvc in particular IMHO.
>>>>
>>>> How do other controllers and their drivers work?
>>>>
>>>>> Side note, when the dwc3 driver reschedules/starts isoc transfer again,
>>>>> the first transfer will be scheduled go out at some future interval and
>>>>> not the next immediate microframe. For UVC, it probably won't be a
>>>>> problem since it doesn't seem to need data going out every interval.
>>>>
>>>> It should not make a difference. [TM]
>>>>
>>>
>>>
>>> Sorry for being absent for a lot of this discussion.
>>>
>>> I want to take a step back from the details of how we're
>>> solving the problem to what problems we're trying to solve.
>>>
>>> So, question(s) for Michael, because I don't see an explicit
>>> answer in this thread (and my sincerest apologies if they've
>>> been answered already and I missed it):
>>>
>>> What exactly is the bug (or bugs) we're trying to solve here?
>>>
>>> So far, it looks like there are two main problems being
>>> discussed:
>>>
>>> 1. Reducing the bandwidth usage of individual usb_requests
>>> 2. Error handling for when transmission over the wire fails.
>>>
>>> Is that correct, or are there other issues at play here?
>>
>> That is correct.
>>
>>> (1) in isolation should be relatively easy to solve: Just
>>> smear the encoded frame across some percentage
>>> (prefereably < 100%) of the usb_requests in one
>>> video frame interval.
>>
>> Right.
>>
>>> (2) is more complicated, and your suggestion is to have a
>>> sentinel request between two video frames that tells the
>>> host if the transmission of "current" video frame was
>>> successful or not. (Is that correct?)
>>
>> Right.
>>
>>> Assuming my understanding of (2) is correct, how should
>>> the host behave if the transmission of the sentinel
>>> request fails after the video frame was sent
>>> successfully (isoc is best effort so transmission is
>>> never guaranteed)?
>>
>> If we have transmitted all requests of the frame but would only miss the
>> sentinel request to the host that includes the EOF, the host could
>> rather show it or drop it. The drop would not be necessary since the
>> host did see all data of the frame. The user would not see any
>> distortion in both cases.
>>
>> If we have transmitted only partial data of the frame it would be
>> preferred if the host would drop the frame that did not finish with an
>> proper EOF tag.
>>
>> AFAIK the linux kernel would still show the frame if the FID of the
>> currently handled request would change and would take this as the end of
>> frame indication. But I am not totally sure if this is by spec or
>> optional.
>>
>> One option to be totally sure would be to resend the sentinel request to
>> be properly transmitted before starting the next frame. This resend
>> polling would probably include some extra zero-length requests. But also
>> if this resend keeps failing for n times, the driver should doubt there
>> is anything sane going on with the USB connection and bail out somehow.
>>
>> Since we try to tackle case (1) to avoid transmit errors and also avoid
>> creating late enqueued requests in the running isoc transfer, the over
>> all chance to trigger missed transfers should be minimal.
>
>Gotcha. It seems like the UVC gadget driver implicitly assumes that EOF
>flag will be used although the userspace application can technically
>make it optional.

That is not all. The additional UVC_STREAM_ERR tag on the sentinel
request can be set optional by the host driver. But by spec the
userspace application has to drop the frame when the flag was set.

With my proposal this flag will be set, whenever we find out that
the currently transferred frame was erroneous.

>Summarizing some of the discussions above:
>1. UVC gadget driver should _not_ rely on the usb controller to
>   enqueue 0-length requests on UVC gadget drivers behalf;
>2. However keeping up the backpressure to the controller means the
>   EOF request will be delayed behind all the zero-length requests.

Exactly, this is why we have to somehow finetune the timedelay between
requests that trigger interrupts. And also monitor the amount of
requests currently enqueued in the hw ringbuffer. So that our drivers
enqueue dequeue mechanism is virtually adding only the minimum amount
of necessary zero-length requests in the hardware. This should be
possible.

I am currently thinking through the remaining steps the pump worker has
to do on each wakeup to maintain the minimum threshold while waiting
with submitting requests that contain actual image payload.

>Out of curiosity: What is wrong with letting the host rely on
>FID alone? Decoding the jpeg payload _should_ fail if any of the
>usb_requests containing the payload failed to transmit.

This is not totally true. We saw partially rendered jpeg frames on the
host stream. How the host behaves with broken data is totally undefined
if the typical uvc flags EOF/ERR are not used as specified. Then think
about uncompressed formats. So relying on the transferred image format
to solve our problems is just as wrong as relying on the gadgets
hardware behavior.

>Was there a situation where usb_requests were failing but jpeg
>decoding succeeded  (i.e. the host was unaware of failure)? Looking at
>the error handling on the gadget driver side, I see there is space to
>improve it, but maybe we can do it in a way that doesn't add more time
>constraints!

I think there is no way around time constraints. Since the gadget
hardware is introducing interrupt latency, we have to address with
the sentinel request and some well adjusted amount of zero-length
requests.

Regards,
Michael

Avichal Rakesh May 29, 2024, 12:33 a.m. UTC | #20

On 5/28/24 15:43, Michael Grzeschik wrote:
> On Tue, May 28, 2024 at 02:27:34PM -0700, Avichal Rakesh wrote:
>>
>>
>> On 5/28/24 13:22, Michael Grzeschik wrote:
>>> On Tue, May 28, 2024 at 10:30:30AM -0700, Avichal Rakesh wrote:
>>>>
>>>>
>>>> On 5/22/24 10:37, Michael Grzeschik wrote:
>>>>> On Wed, May 22, 2024 at 05:17:02PM +0000, Thinh Nguyen wrote:
>>>>>> On Wed, May 22, 2024, Alan Stern wrote:
>>>>>>> On Wed, May 22, 2024 at 01:41:42AM +0000, Thinh Nguyen wrote:
>>>>>>> > On Wed, May 22, 2024, Michael Grzeschik wrote:
>>>>>>> > > On Fri, May 17, 2024 at 01:44:05AM +0000, Thinh Nguyen wrote:
>>>>>>> > > > For isoc endpoint IN, yes. If the host requests for isoc data IN while
>>>>>>> > > > no TRB is prepared, then the controller will automatically send 0-length
>>>>>>> > > > packet respond.
>>>>>>> > >
>>>>>>> > > Perfect! This will help a lot and will make active queueing of own
>>>>>>> > > zero-length requests run unnecessary.
>>>>>>> >
>>>>>>> > Yes, if we rely on the current start/stop isoc transfer scheme for UVC,
>>>>>>> > then this will work.
>>>>>>>
>>>>>>> You shouldn't rely on this behavior.  Other device controllers might not
>>>>>>> behave this way; they might send no packet at all to the host (causing a
>>>>>>> USB protocol error) instead of sending a zero-length packet.
>>>>>>
>>>>>> I agree. The dwc3 driver has this workaround to somewhat work with the
>>>>>> UVC. This behavior is not something the controller expected, and this
>>>>>> workaround should not be a common behavior for different function
>>>>>> driver/protocol. Since this behavior was added a long time ago, it will
>>>>>> remain the default behavior in dwc3 to avoid regression with UVC (at
>>>>>> least until the UVC is changed). However, it would be nice for UVC to
>>>>>> not rely on this.
>>>>>
>>>>> With "this" you mean exactly the following commit, right?
>>>>>
>>>>> (f5e46aa4 usb: dwc3: gadget: when the started list is empty stop the active xfer)
>>>>>
>>>>> When we start questioning this, then lets dig deeper here.
>>>>>
>>>>> With the fast datarate of at least usb superspeed shouldn't they not all
>>>>> completely work asynchronous with their in flight trbs?
>>>>>
>>>>> In my understanding this validates that, with at least superspeed we are
>>>>> unlikely to react fast enough to maintain a steady isoc dataflow, since
>>>>> the driver above has to react to errors in the processing context.
>>>>>
>>>>> This runs the above patch (f5e46aa4) a gadget independent solution
>>>>> which has nothing to do with uvc in particular IMHO.
>>>>>
>>>>> How do other controllers and their drivers work?
>>>>>
>>>>>> Side note, when the dwc3 driver reschedules/starts isoc transfer again,
>>>>>> the first transfer will be scheduled go out at some future interval and
>>>>>> not the next immediate microframe. For UVC, it probably won't be a
>>>>>> problem since it doesn't seem to need data going out every interval.
>>>>>
>>>>> It should not make a difference. [TM]
>>>>>
>>>>
>>>>
>>>> Sorry for being absent for a lot of this discussion.
>>>>
>>>> I want to take a step back from the details of how we're
>>>> solving the problem to what problems we're trying to solve.
>>>>
>>>> So, question(s) for Michael, because I don't see an explicit
>>>> answer in this thread (and my sincerest apologies if they've
>>>> been answered already and I missed it):
>>>>
>>>> What exactly is the bug (or bugs) we're trying to solve here?
>>>>
>>>> So far, it looks like there are two main problems being
>>>> discussed:
>>>>
>>>> 1. Reducing the bandwidth usage of individual usb_requests
>>>> 2. Error handling for when transmission over the wire fails.
>>>>
>>>> Is that correct, or are there other issues at play here?
>>>
>>> That is correct.
>>>
>>>> (1) in isolation should be relatively easy to solve: Just
>>>> smear the encoded frame across some percentage
>>>> (prefereably < 100%) of the usb_requests in one
>>>> video frame interval.
>>>
>>> Right.
>>>
>>>> (2) is more complicated, and your suggestion is to have a
>>>> sentinel request between two video frames that tells the
>>>> host if the transmission of "current" video frame was
>>>> successful or not. (Is that correct?)
>>>
>>> Right.
>>>
>>>> Assuming my understanding of (2) is correct, how should
>>>> the host behave if the transmission of the sentinel
>>>> request fails after the video frame was sent
>>>> successfully (isoc is best effort so transmission is
>>>> never guaranteed)?
>>>
>>> If we have transmitted all requests of the frame but would only miss the
>>> sentinel request to the host that includes the EOF, the host could
>>> rather show it or drop it. The drop would not be necessary since the
>>> host did see all data of the frame. The user would not see any
>>> distortion in both cases.
>>>
>>> If we have transmitted only partial data of the frame it would be
>>> preferred if the host would drop the frame that did not finish with an
>>> proper EOF tag.
>>>
>>> AFAIK the linux kernel would still show the frame if the FID of the
>>> currently handled request would change and would take this as the end of
>>> frame indication. But I am not totally sure if this is by spec or
>>> optional.
>>>
>>> One option to be totally sure would be to resend the sentinel request to
>>> be properly transmitted before starting the next frame. This resend
>>> polling would probably include some extra zero-length requests. But also
>>> if this resend keeps failing for n times, the driver should doubt there
>>> is anything sane going on with the USB connection and bail out somehow.
>>>
>>> Since we try to tackle case (1) to avoid transmit errors and also avoid
>>> creating late enqueued requests in the running isoc transfer, the over
>>> all chance to trigger missed transfers should be minimal.
>>
>> Gotcha. It seems like the UVC gadget driver implicitly assumes that EOF
>> flag will be used although the userspace application can technically
>> make it optional.
> 
> That is not all. The additional UVC_STREAM_ERR tag on the sentinel
> request can be set optional by the host driver. But by spec the
> userspace application has to drop the frame when the flag was set.

Looking at the UVC specs, the ERR bit doesn't seem to refer to actual 
transmission error, only errors in frame generation (Section 4.3.1.7 
of UVC 1.5 Class Specification). Maybe "data discontinuity" can be 
used but the examples given are bad media, and encoder issues, which
suggests errors at higher level than the wire.

> 
> With my proposal this flag will be set, whenever we find out that
> the currently transferred frame was erroneous.
> 
>> Summarizing some of the discussions above:
>> 1. UVC gadget driver should _not_ rely on the usb controller to
>>   enqueue 0-length requests on UVC gadget drivers behalf;
>> 2. However keeping up the backpressure to the controller means the
>>   EOF request will be delayed behind all the zero-length requests.
> 
> Exactly, this is why we have to somehow finetune the timedelay between
> requests that trigger interrupts. And also monitor the amount of
> requests currently enqueued in the hw ringbuffer. So that our drivers
> enqueue dequeue mechanism is virtually adding only the minimum amount
> of necessary zero-length requests in the hardware. This should be
> possible.
> 
> I am currently thinking through the remaining steps the pump worker has
> to do on each wakeup to maintain the minimum threshold while waiting
> with submitting requests that contain actual image payload.
> 
>> Out of curiosity: What is wrong with letting the host rely on
>> FID alone? Decoding the jpeg payload _should_ fail if any of the
>> usb_requests containing the payload failed to transmit.
> 
> This is not totally true. We saw partially rendered jpeg frames on the
> host stream. How the host behaves with broken data is totally undefined
> if the typical uvc flags EOF/ERR are not used as specified. Then think
> about uncompressed formats. So relying on the transferred image format
> to solve our problems is just as wrong as relying on the gadgets
> hardware behavior.

Do you know if the partially rendered frames were valid JPEGs, or 
if the host was simply making a best effort at displaying a broken
JPEG? Perhaps the fix should go to the host instead?

Following is my opinion, feel free to disagree (and correct me if 
something is factually incorrect):

The fundamental issue here is that ISOC doesn't guarantee
delivery of usb_requests or even basic data consistency upon delivery.
So the gadget driver has no way to know the state of transmitted data. 
The gadget driver is notified of underruns but not of any other issues,
and ideally we should never have an underrun if the zero-length 
backpressure is working as intended.

So, UVC gadget driver can reduce the number of errors, but it'll never be 
able to guarantee that the data transmitted to the host isn't somehow
corrupted or missing unless a more reliable mode of transmission 
(bulk, for example) is used.

All of this to say: The host absolutely needs to be able to handle
all sorts of invalid and broken payloads. How the host handles it 
might be undefined, but the host can never rely on perfect knowledge 
about the transmission state. In cases like these, where the underlying 
transport is unreliable, the burden of enforcing consistency moves up 
a layer, i.e. to the encoded payload in this case. So it is perfectly 
fine for the host to rely on the encoding to determine if the payload 
is corrupt and handle it accordingly.

As for uncompressed format, you're correct that subtle corruptions
may not be caught, but outright missing usb_requests can be easily
checked by simply looking at the number of bytes in the payload. YUV 
frames are all of the same (predetermined) size for a given resolution.

So my recommendation is the following:
1. Fix the bandwidth problem by splitting the encoded video frame
   into more usb_requests (as your patch already does) making sure 
   there are enough free usb_request to encode the video frame in 
   one burst so we don't accidentally inflate the transmission 
   duration of a video frame by sneaking in zero-length requests in
   the middle.
2. Unless there is an unusually high rate of transmission failures
   when using the UVC gadget driver, it might be worth fixing the
   host side driver to handle broken frames better instead (assuming 
   host is linux as well).
2. Tighten up the error checking in UVC gadget driver -- We drop the 
   current frame whenever an EXDEV happens which is wrong. We should 
   only be dropping the current frame if the EXDEV corresponds to the 
   frame currently being encoded. If the frame is already fully queued 
   to the usb controller, the host can handle missing payload as it 
   sees fit.


- Avi.

Michael Grzeschik May 29, 2024, 9:24 p.m. UTC | #21

On Tue, May 28, 2024 at 05:33:46PM -0700, Avichal Rakesh wrote:
>
>
>On 5/28/24 15:43, Michael Grzeschik wrote:
>> On Tue, May 28, 2024 at 02:27:34PM -0700, Avichal Rakesh wrote:
>>>
>>>
>>> On 5/28/24 13:22, Michael Grzeschik wrote:
>>>> On Tue, May 28, 2024 at 10:30:30AM -0700, Avichal Rakesh wrote:
>>>>>
>>>>>
>>>>> On 5/22/24 10:37, Michael Grzeschik wrote:
>>>>>> On Wed, May 22, 2024 at 05:17:02PM +0000, Thinh Nguyen wrote:
>>>>>>> On Wed, May 22, 2024, Alan Stern wrote:
>>>>>>>> On Wed, May 22, 2024 at 01:41:42AM +0000, Thinh Nguyen wrote:
>>>>>>>> > On Wed, May 22, 2024, Michael Grzeschik wrote:
>>>>>>>> > > On Fri, May 17, 2024 at 01:44:05AM +0000, Thinh Nguyen wrote:
>>>>>>>> > > > For isoc endpoint IN, yes. If the host requests for isoc data IN while
>>>>>>>> > > > no TRB is prepared, then the controller will automatically send 0-length
>>>>>>>> > > > packet respond.
>>>>>>>> > >
>>>>>>>> > > Perfect! This will help a lot and will make active queueing of own
>>>>>>>> > > zero-length requests run unnecessary.
>>>>>>>> >
>>>>>>>> > Yes, if we rely on the current start/stop isoc transfer scheme for UVC,
>>>>>>>> > then this will work.
>>>>>>>>
>>>>>>>> You shouldn't rely on this behavior.  Other device controllers might not
>>>>>>>> behave this way; they might send no packet at all to the host (causing a
>>>>>>>> USB protocol error) instead of sending a zero-length packet.
>>>>>>>
>>>>>>> I agree. The dwc3 driver has this workaround to somewhat work with the
>>>>>>> UVC. This behavior is not something the controller expected, and this
>>>>>>> workaround should not be a common behavior for different function
>>>>>>> driver/protocol. Since this behavior was added a long time ago, it will
>>>>>>> remain the default behavior in dwc3 to avoid regression with UVC (at
>>>>>>> least until the UVC is changed). However, it would be nice for UVC to
>>>>>>> not rely on this.
>>>>>>
>>>>>> With "this" you mean exactly the following commit, right?
>>>>>>
>>>>>> (f5e46aa4 usb: dwc3: gadget: when the started list is empty stop the active xfer)
>>>>>>
>>>>>> When we start questioning this, then lets dig deeper here.
>>>>>>
>>>>>> With the fast datarate of at least usb superspeed shouldn't they not all
>>>>>> completely work asynchronous with their in flight trbs?
>>>>>>
>>>>>> In my understanding this validates that, with at least superspeed we are
>>>>>> unlikely to react fast enough to maintain a steady isoc dataflow, since
>>>>>> the driver above has to react to errors in the processing context.
>>>>>>
>>>>>> This runs the above patch (f5e46aa4) a gadget independent solution
>>>>>> which has nothing to do with uvc in particular IMHO.
>>>>>>
>>>>>> How do other controllers and their drivers work?
>>>>>>
>>>>>>> Side note, when the dwc3 driver reschedules/starts isoc transfer again,
>>>>>>> the first transfer will be scheduled go out at some future interval and
>>>>>>> not the next immediate microframe. For UVC, it probably won't be a
>>>>>>> problem since it doesn't seem to need data going out every interval.
>>>>>>
>>>>>> It should not make a difference. [TM]
>>>>>>
>>>>>
>>>>>
>>>>> Sorry for being absent for a lot of this discussion.
>>>>>
>>>>> I want to take a step back from the details of how we're
>>>>> solving the problem to what problems we're trying to solve.
>>>>>
>>>>> So, question(s) for Michael, because I don't see an explicit
>>>>> answer in this thread (and my sincerest apologies if they've
>>>>> been answered already and I missed it):
>>>>>
>>>>> What exactly is the bug (or bugs) we're trying to solve here?
>>>>>
>>>>> So far, it looks like there are two main problems being
>>>>> discussed:
>>>>>
>>>>> 1. Reducing the bandwidth usage of individual usb_requests
>>>>> 2. Error handling for when transmission over the wire fails.
>>>>>
>>>>> Is that correct, or are there other issues at play here?
>>>>
>>>> That is correct.
>>>>
>>>>> (1) in isolation should be relatively easy to solve: Just
>>>>> smear the encoded frame across some percentage
>>>>> (prefereably < 100%) of the usb_requests in one
>>>>> video frame interval.
>>>>
>>>> Right.
>>>>
>>>>> (2) is more complicated, and your suggestion is to have a
>>>>> sentinel request between two video frames that tells the
>>>>> host if the transmission of "current" video frame was
>>>>> successful or not. (Is that correct?)
>>>>
>>>> Right.
>>>>
>>>>> Assuming my understanding of (2) is correct, how should
>>>>> the host behave if the transmission of the sentinel
>>>>> request fails after the video frame was sent
>>>>> successfully (isoc is best effort so transmission is
>>>>> never guaranteed)?
>>>>
>>>> If we have transmitted all requests of the frame but would only miss the
>>>> sentinel request to the host that includes the EOF, the host could
>>>> rather show it or drop it. The drop would not be necessary since the
>>>> host did see all data of the frame. The user would not see any
>>>> distortion in both cases.
>>>>
>>>> If we have transmitted only partial data of the frame it would be
>>>> preferred if the host would drop the frame that did not finish with an
>>>> proper EOF tag.
>>>>
>>>> AFAIK the linux kernel would still show the frame if the FID of the
>>>> currently handled request would change and would take this as the end of
>>>> frame indication. But I am not totally sure if this is by spec or
>>>> optional.
>>>>
>>>> One option to be totally sure would be to resend the sentinel request to
>>>> be properly transmitted before starting the next frame. This resend
>>>> polling would probably include some extra zero-length requests. But also
>>>> if this resend keeps failing for n times, the driver should doubt there
>>>> is anything sane going on with the USB connection and bail out somehow.
>>>>
>>>> Since we try to tackle case (1) to avoid transmit errors and also avoid
>>>> creating late enqueued requests in the running isoc transfer, the over
>>>> all chance to trigger missed transfers should be minimal.
>>>
>>> Gotcha. It seems like the UVC gadget driver implicitly assumes that EOF
>>> flag will be used although the userspace application can technically
>>> make it optional.
>>
>> That is not all. The additional UVC_STREAM_ERR tag on the sentinel
>> request can be set optional by the host driver. But by spec the
>> userspace application has to drop the frame when the flag was set.
>
>Looking at the UVC specs, the ERR bit doesn't seem to refer to actual
>transmission error, only errors in frame generation (Section 4.3.1.7
>of UVC 1.5 Class Specification). Maybe "data discontinuity" can be
>used but the examples given are bad media, and encoder issues, which
>suggests errors at higher level than the wire.

Oh! That is a new perspective I did not consider.

With the definition of UVC_STREAM_ERR by spec, the uvc_video driver
would in no case set this header bit for the current frame on its own?
Is that correct?

>> With my proposal this flag will be set, whenever we find out that
>> the currently transferred frame was erroneous.
>>
>>> Summarizing some of the discussions above:
>>> 1. UVC gadget driver should _not_ rely on the usb controller to
>>>   enqueue 0-length requests on UVC gadget drivers behalf;
>>> 2. However keeping up the backpressure to the controller means the
>>>   EOF request will be delayed behind all the zero-length requests.
>>
>> Exactly, this is why we have to somehow finetune the timedelay between
>> requests that trigger interrupts. And also monitor the amount of
>> requests currently enqueued in the hw ringbuffer. So that our drivers
>> enqueue dequeue mechanism is virtually adding only the minimum amount
>> of necessary zero-length requests in the hardware. This should be
>> possible.
>>
>> I am currently thinking through the remaining steps the pump worker has
>> to do on each wakeup to maintain the minimum threshold while waiting
>> with submitting requests that contain actual image payload.
>>
>>> Out of curiosity: What is wrong with letting the host rely on
>>> FID alone? Decoding the jpeg payload _should_ fail if any of the
>>> usb_requests containing the payload failed to transmit.
>>
>> This is not totally true. We saw partially rendered jpeg frames on the
>> host stream. How the host behaves with broken data is totally undefined
>> if the typical uvc flags EOF/ERR are not used as specified. Then think
>> about uncompressed formats. So relying on the transferred image format
>> to solve our problems is just as wrong as relying on the gadgets
>> hardware behavior.
>
>Do you know if the partially rendered frames were valid JPEGs, or
>if the host was simply making a best effort at displaying a broken
>JPEG? Perhaps the fix should go to the host instead?

I can fully reproduce this with linux and windows hosts. For linux
machines I saw that the host was taking the FID change as a marker
to see the previous frame as ready and just rendered what got through.
This did not lead to garbage but only to partially displayed frames
with jpeg macroblock alignment.

>Following is my opinion, feel free to disagree (and correct me if
>something is factually incorrect):
>
>The fundamental issue here is that ISOC doesn't guarantee
>delivery of usb_requests or even basic data consistency upon delivery.
>So the gadget driver has no way to know the state of transmitted data.
>The gadget driver is notified of underruns but not of any other issues,
>and ideally we should never have an underrun if the zero-length
>backpressure is working as intended.
>
>So, UVC gadget driver can reduce the number of errors, but it'll never be
>able to guarantee that the data transmitted to the host isn't somehow
>corrupted or missing unless a more reliable mode of transmission
>(bulk, for example) is used.
>
>All of this to say: The host absolutely needs to be able to handle
>all sorts of invalid and broken payloads. How the host handles it
>might be undefined, but the host can never rely on perfect knowledge
>about the transmission state. In cases like these, where the underlying
>transport is unreliable, the burden of enforcing consistency moves up
>a layer, i.e. to the encoded payload in this case. So it is perfectly
>fine for the host to rely on the encoding to determine if the payload
>is corrupt and handle it accordingly.

Right.

>As for uncompressed format, you're correct that subtle corruptions
>may not be caught, but outright missing usb_requests can be easily
>checked by simply looking at the number of bytes in the payload. YUV
>frames are all of the same (predetermined) size for a given resolution.

That was also my thought about five minutes after I did send you the
previous mail. So sure, this is no real issue for the host.

>So my recommendation is the following:
>1. Fix the bandwidth problem by splitting the encoded video frame
>   into more usb_requests (as your patch already does) making sure
>   there are enough free usb_request to encode the video frame in
>   one burst so we don't accidentally inflate the transmission
>   duration of a video frame by sneaking in zero-length requests in
>   the middle.

Ack. This should already solve a lot of issues.

For this I would still suggest to move the usb_ep_queue to be done in
the pump worker again. Its a bit back and forth, but IMHO its worth the
extra mile since only this way we would respect the dwc3 interrupt
threads assumption to run *very* short.

>2. Unless there is an unusually high rate of transmission failures
>   when using the UVC gadget driver, it might be worth fixing the
>   host side driver to handle broken frames better instead (assuming
>   host is linux as well).

Agreed, but this needs a separate scoped undestanding of the host side
behaviour over all layers.

>2. Tighten up the error checking in UVC gadget driver -- We drop the
>   current frame whenever an EXDEV happens which is wrong. We should
>   only be dropping the current frame if the EXDEV corresponds to the
>   frame currently being encoded.

What do you mean by drop?

I would suggest to immediatly switch the uvc_buffer that is being
enqueued and start queueing prepared requests from the next buffers prep
list. As suggested, the idea is to have per uvc_buffer prep_list
requests which would make this task easy.

>   If the frame is already fully queued to the usb controller, the host
>   can handle missing payload as it sees fit.

Ack

This roadmap sounds like a good one.

Michael

[0/3] usb: gadget: uvc: allocate requests based on frame interval length and buffersize

Message

Comments