
usb: gadget: uvc: reduce the request size to increase the throughput

Message ID: 20231109233341.2461129-1-m.grzeschik@pengutronix.de (mailing list archive)
State: New, archived
Series: usb: gadget: uvc: reduce the request size to increase the throughput

Commit Message

Michael Grzeschik Nov. 9, 2023, 11:33 p.m. UTC
One misconception about queueing requests to the USB gadget controller
is that the amount of data one usb_request represents is the same as
what the hardware is able to transfer in one interval.

This misconception applies to the uvc_video gadget, which assumes that
the request needs to be sized exactly as the gadget's bandwidth
parameters (maxpacket, multiplier and burst) are configured.

Although queueing a big request is actually no problem for the
hardware, in the case of isochronous transfers it is impractical to
queue a big amount of data per request, especially if the necessary
bandwidth was increased on purpose to sustain high amounts of data.

E.g. in the dwc3 controller this has the negative impact that the
endpoint FIFO will not be filled fast enough by the internal hardware
engine. Since each transfer request block (TRB) represents one big
request, the hardware will need a long time to prefill the hardware
FIFO. This can be avoided by queueing more but smaller requests, which
lead to smaller TRBs that the hardware engine can keep up with.

This patch simply drops maxburst as a multiplier in the size
calculation and therefore decreases the overall maximum request size.
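
In effect, the request size calculation changes roughly as follows
(upper bounds for a SuperSpeed isochronous endpoint with maxpacket =
1024, maxburst = 16 and mult = 3):

	/* before: up to 1024 * 16 * 3 = 48KB per request and interval */
	req_size = maxpacket * max_t(unsigned int, maxburst, 1) * mult;

	/* after: up to 1024 * 3 = 3KB per request and interval */
	req_size = maxpacket * mult;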

Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de>
---
This patch was created as a result of the discussion in the following thread:

https://lore.kernel.org/linux-usb/20231031231841.dbhtrgqounu2rvgn@synopsys.com/

 drivers/usb/gadget/function/uvc_queue.c | 1 -
 drivers/usb/gadget/function/uvc_video.c | 1 -
 2 files changed, 2 deletions(-)

Comments

Thinh Nguyen Nov. 10, 2023, 2:16 a.m. UTC | #1
Hi,

On Fri, Nov 10, 2023, Michael Grzeschik wrote:
> One misconception about queueing requests to the USB gadget controller
> is that the amount of data one usb_request represents is the same as
> what the hardware is able to transfer in one interval.
> 
> This misconception applies to the uvc_video gadget, which assumes that
> the request needs to be sized exactly as the gadget's bandwidth
> parameters (maxpacket, multiplier and burst) are configured.
> 
> Although queueing a big request is actually no problem for the
> hardware, in the case of isochronous transfers it is impractical to
> queue a big amount of data per request, especially if the necessary
> bandwidth was increased on purpose to sustain high amounts of data.
> 
> E.g. in the dwc3 controller this has the negative impact that the
> endpoint FIFO will not be filled fast enough by the internal hardware
> engine. Since each transfer request block (TRB) represents one big
> request, the hardware will need a long time to prefill the hardware
> FIFO. This can be avoided by queueing more but smaller requests, which
> lead to smaller TRBs that the hardware engine can keep up with.

Just want to clarify here to avoid any confusion: the hardware TX FIFO
size is relatively small and can often be smaller than the TRB. It
should be fine whether the TRB is larger or smaller than the FIFO size.
The hardware does not "prefill" the FIFO; it just fills it from
whichever TRB it's currently processing (I think you may be mixing this
up with the TRB cache).

The performance impact of this change is that it reduces the USB
bandwidth usage. The current calculation in the uvc function can use
48KB/uframe for each request at SuperSpeed, the maximum size for isoc
at SuperSpeed. I know many hosts can't handle this kind of transfer
rate in their hardware (e.g. it gets worse when scheduling transfers
for multiple endpoints and under multiple tiers of hubs).

The bandwidth can be impacted by multiple factors, not just by the
gadget side, as noted in the discussion before.

> 
> This patch simply drops maxburst as a multiplier in the size
> calculation and therefore decreases the overall maximum request size.
> 
> Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de>
> ---
> This patch was created as a result of the discussion in the following thread:
> 
> https://lore.kernel.org/linux-usb/20231031231841.dbhtrgqounu2rvgn@synopsys.com/
> 
>  drivers/usb/gadget/function/uvc_queue.c | 1 -
>  drivers/usb/gadget/function/uvc_video.c | 1 -
>  2 files changed, 2 deletions(-)
> 
> diff --git a/drivers/usb/gadget/function/uvc_queue.c b/drivers/usb/gadget/function/uvc_queue.c
> index 0aa3d7e1f3cc32..1d3c3c09ff97cb 100644
> --- a/drivers/usb/gadget/function/uvc_queue.c
> +++ b/drivers/usb/gadget/function/uvc_queue.c
> @@ -55,7 +55,6 @@ static int uvc_queue_setup(struct vb2_queue *vq,
>  	sizes[0] = video->imagesize;
>  
>  	req_size = video->ep->maxpacket
> -		 * max_t(unsigned int, video->ep->maxburst, 1)

I think you're reducing a bit too much here? Also, take advantage of
burst. So, perhaps keep the request size to max(16K, MPS * maxburst)?

Can be more or less depending on how much video data needs to be
transferred over a video frame.

BR,
Thinh

>  		 * (video->ep->mult);
>  
>  	/* We divide by two, to increase the chance to run
> diff --git a/drivers/usb/gadget/function/uvc_video.c b/drivers/usb/gadget/function/uvc_video.c
> index 91af3b1ef0d412..c6b61843bad3d7 100644
> --- a/drivers/usb/gadget/function/uvc_video.c
> +++ b/drivers/usb/gadget/function/uvc_video.c
> @@ -329,7 +329,6 @@ uvc_video_alloc_requests(struct uvc_video *video)
>  	BUG_ON(video->req_size);
>  
>  	req_size = video->ep->maxpacket
> -		 * max_t(unsigned int, video->ep->maxburst, 1)
>  		 * (video->ep->mult);
>  
>  	video->ureq = kcalloc(video->uvc_num_requests, sizeof(struct uvc_request), GFP_KERNEL);
> -- 
> 2.39.2
> 
>
Thinh Nguyen Nov. 10, 2023, 10:42 p.m. UTC | #2
On Fri, Nov 10, 2023, Thinh Nguyen wrote:
> Hi,
> 
> On Fri, Nov 10, 2023, Michael Grzeschik wrote:
> > One misconception about queueing requests to the USB gadget controller
> > is that the amount of data one usb_request represents is the same as
> > what the hardware is able to transfer in one interval.
> > 
> > This misconception applies to the uvc_video gadget, which assumes that
> > the request needs to be sized exactly as the gadget's bandwidth
> > parameters (maxpacket, multiplier and burst) are configured.
> > 
> > Although queueing a big request is actually no problem for the
> > hardware, in the case of isochronous transfers it is impractical to
> > queue a big amount of data per request, especially if the necessary
> > bandwidth was increased on purpose to sustain high amounts of data.
> > 
> > E.g. in the dwc3 controller this has the negative impact that the
> > endpoint FIFO will not be filled fast enough by the internal hardware
> > engine. Since each transfer request block (TRB) represents one big
> > request, the hardware will need a long time to prefill the hardware
> > FIFO. This can be avoided by queueing more but smaller requests, which
> > lead to smaller TRBs that the hardware engine can keep up with.
> 
> Just want to clarify here to avoid any confusion: the hardware TX FIFO
> size is relatively small and can often be smaller than the TRB. It
> should be fine whether the TRB is larger or smaller than the FIFO size.
> The hardware does not "prefill" the FIFO; it just fills it from
> whichever TRB it's currently processing (I think you may be mixing this
> up with the TRB cache).
> 
> The performance impact of this change is that it reduces the USB
> bandwidth usage. The current calculation in the uvc function can use
> 48KB/uframe for each request at SuperSpeed, the maximum size for isoc
> at SuperSpeed. I know many hosts can't handle this kind of transfer
> rate in their hardware (e.g. it gets worse when scheduling transfers
> for multiple endpoints and under multiple tiers of hubs).
> 
> The bandwidth can be impacted by multiple factors, not just by the
> gadget side, as noted in the discussion before.
> 
> > 
> > This patch simply drops maxburst as a multiplier in the size
> > calculation and therefore decreases the overall maximum request size.
> > 
> > Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de>
> > ---
> > This patch was created as a result of the discussion in the following thread:
> > 
> > https://lore.kernel.org/linux-usb/20231031231841.dbhtrgqounu2rvgn@synopsys.com/
> > 
> >  drivers/usb/gadget/function/uvc_queue.c | 1 -
> >  drivers/usb/gadget/function/uvc_video.c | 1 -
> >  2 files changed, 2 deletions(-)
> > 
> > diff --git a/drivers/usb/gadget/function/uvc_queue.c b/drivers/usb/gadget/function/uvc_queue.c
> > index 0aa3d7e1f3cc32..1d3c3c09ff97cb 100644
> > --- a/drivers/usb/gadget/function/uvc_queue.c
> > +++ b/drivers/usb/gadget/function/uvc_queue.c
> > @@ -55,7 +55,6 @@ static int uvc_queue_setup(struct vb2_queue *vq,
> >  	sizes[0] = video->imagesize;
> >  
> >  	req_size = video->ep->maxpacket
> > -		 * max_t(unsigned int, video->ep->maxburst, 1)
> 
> I think you're reducing a bit too much here? Also, take advantage of
> burst. So, perhaps keep the request size to max(16K, MPS * maxburst)?

I meant to potentially cap at 16K; it didn't translate correctly to
code. I meant something like this:

max_size = mult * maxburst * MPS;
max_size = clamp(max_size, mult * MPS, SZ_16K);
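
Applied to the uvc function, that could look something like this
(untested sketch; SZ_16K is from linux/sizes.h, and the lower bound is
already guaranteed because maxburst is clamped to at least 1):

	unsigned int max_size, req_size;

	max_size = video->ep->maxpacket * video->ep->mult
		 * max_t(unsigned int, video->ep->maxburst, 1);

	/* cap a single request at 16K */
	req_size = min_t(unsigned int, max_size, SZ_16K);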

> 
> Can be more or less depending on how much video data needs to be
> transferred over a video frame.

Can the uvc gadget figure out the max video data over a video frame? Or
is it not available at this layer? I figure different setups require
different video data payloads.

If we have this info, then you can properly cap the request size.
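
If it is available, the cap could be derived roughly like this
(hypothetical sketch; "fps" and the 125us SuperSpeed service interval
are assumptions about the stream setup, not existing driver code):

	/* uframes available per video frame, e.g. 8000 / 30 = 266 */
	unsigned int uframes_per_vframe = 8000 / fps;

	/* payload each request must carry to move one frame in time */
	unsigned int needed = DIV_ROUND_UP(video->imagesize,
					   uframes_per_vframe);

	req_size = clamp_t(unsigned int, needed,
			   video->ep->maxpacket, max_size);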

Thanks,
Thinh
Michael Grzeschik Nov. 13, 2023, 8:43 a.m. UTC | #3
On Fri, Nov 10, 2023 at 02:16:59AM +0000, Thinh Nguyen wrote:
>On Fri, Nov 10, 2023, Michael Grzeschik wrote:
>> One misconception about queueing requests to the USB gadget controller
>> is that the amount of data one usb_request represents is the same as
>> what the hardware is able to transfer in one interval.
>>
>> This misconception applies to the uvc_video gadget, which assumes that
>> the request needs to be sized exactly as the gadget's bandwidth
>> parameters (maxpacket, multiplier and burst) are configured.
>>
>> Although queueing a big request is actually no problem for the
>> hardware, in the case of isochronous transfers it is impractical to
>> queue a big amount of data per request, especially if the necessary
>> bandwidth was increased on purpose to sustain high amounts of data.
>>
>> E.g. in the dwc3 controller this has the negative impact that the
>> endpoint FIFO will not be filled fast enough by the internal hardware
>> engine. Since each transfer request block (TRB) represents one big
>> request, the hardware will need a long time to prefill the hardware
>> FIFO. This can be avoided by queueing more but smaller requests, which
>> lead to smaller TRBs that the hardware engine can keep up with.
>
>Just want to clarify here to avoid any confusion: the hardware TX FIFO
>size is relatively small and can often be smaller than the TRB. It
>should be fine whether the TRB is larger or smaller than the FIFO size.
>The hardware does not "prefill" the FIFO; it just fills it from
>whichever TRB it's currently processing (I think you may be mixing this
>up with the TRB cache).

What I see is that with bigger TRBs the hardware is not able to keep
up with reading from memory when the system is under heavy memory
pressure. This is the main reason for this change.

We found out that increasing the FIFO size affected how high we were
able to set the hardware endpoint configuration on our gadget side
(params.param0) before we saw the issue reoccur.

So the idea here was to tweak how the hardware handles the data from
memory to the hardware FIFO, which does not seem to underrun with
smaller TRBs.

>The performance impact of this change is that it reduces the USB
>bandwidth usage. The current calculation in the uvc function can use
>48KB/uframe for each request at SuperSpeed, the maximum size for isoc
>at SuperSpeed. I know many hosts can't handle this kind of transfer
>rate in their hardware (e.g. it gets worse when scheduling transfers
>for multiple endpoints and under multiple tiers of hubs).

I think I don't fully understand here.

We change the overall buffer size of the usb_request and therefore
limit the size of the possible TRBs. This should mostly affect the TRB
size in the memcpy path, since scatter-gather requests are already
split into multiple TRBs, each capped at the maximum mappable memory
range of PAGE_SIZE (4k).

Other than that, the parameterization of the endpoint on our gadget side
is not changed by this patch. The endpoint configuration is set as follows:

params.param0 |= DWC3_DEPCFG_BURST_SIZE(burst - 1) |
                  DWC3_DEPCFG_MAX_PACKET_SIZE(usb_endpoint_maxp(desc));

So by changing the request size there should not be any other change in
the gadget-side hardware configuration.

How is the overall bandwidth usage affected by this change, then, other
than having more but smaller TRBs in the queue?

If the interval length is not coupled to the number of TRBs to be
transferred, it should not have an effect on the bandwidth.

If I am mistaken here, can you point me to some code?

>The bandwidth can be impacted by multiple factors, not just by the
>gadget side, as noted in the discussion before.

Right.

>> This patch simply drops maxburst as a multiplier in the size
>> calculation and therefore decreases the overall maximum request size.
>>
>> Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de>
>> ---
>> This patch was created as a result of the discussion in the following thread:
>>
>> https://lore.kernel.org/linux-usb/20231031231841.dbhtrgqounu2rvgn@synopsys.com/
>>
>>  drivers/usb/gadget/function/uvc_queue.c | 1 -
>>  drivers/usb/gadget/function/uvc_video.c | 1 -
>>  2 files changed, 2 deletions(-)
>>
>> diff --git a/drivers/usb/gadget/function/uvc_queue.c b/drivers/usb/gadget/function/uvc_queue.c
>> index 0aa3d7e1f3cc32..1d3c3c09ff97cb 100644
>> --- a/drivers/usb/gadget/function/uvc_queue.c
>> +++ b/drivers/usb/gadget/function/uvc_queue.c
>> @@ -55,7 +55,6 @@ static int uvc_queue_setup(struct vb2_queue *vq,
>>  	sizes[0] = video->imagesize;
>>
>>  	req_size = video->ep->maxpacket
>> -		 * max_t(unsigned int, video->ep->maxburst, 1)
>
>I think you're reducing a bit too much here? Also, take advantage of
>burst. So, perhaps keep the request size to max(16K, MPS * maxburst)?
>
>Can be more or less depending on how much video data needs to be
>transferred over a video frame.
>
>BR,
>Thinh
>
>>  		 * (video->ep->mult);
>>
>>  	/* We divide by two, to increase the chance to run
>> diff --git a/drivers/usb/gadget/function/uvc_video.c b/drivers/usb/gadget/function/uvc_video.c
>> index 91af3b1ef0d412..c6b61843bad3d7 100644
>> --- a/drivers/usb/gadget/function/uvc_video.c
>> +++ b/drivers/usb/gadget/function/uvc_video.c
>> @@ -329,7 +329,6 @@ uvc_video_alloc_requests(struct uvc_video *video)
>>  	BUG_ON(video->req_size);
>>
>>  	req_size = video->ep->maxpacket
>> -		 * max_t(unsigned int, video->ep->maxburst, 1)
>>  		 * (video->ep->mult);
>>
>>  	video->ureq = kcalloc(video->uvc_num_requests, sizeof(struct uvc_request), GFP_KERNEL);
>> --
>> 2.39.2
>>
>>
Thinh Nguyen Nov. 17, 2023, 2:39 a.m. UTC | #4
On Mon, Nov 13, 2023, Michael Grzeschik wrote:
> On Fri, Nov 10, 2023 at 02:16:59AM +0000, Thinh Nguyen wrote:
> > On Fri, Nov 10, 2023, Michael Grzeschik wrote:
> > > One misconception about queueing requests to the USB gadget controller
> > > is that the amount of data one usb_request represents is the same as
> > > what the hardware is able to transfer in one interval.
> > > 
> > > This misconception applies to the uvc_video gadget, which assumes that
> > > the request needs to be sized exactly as the gadget's bandwidth
> > > parameters (maxpacket, multiplier and burst) are configured.
> > > 
> > > Although queueing a big request is actually no problem for the
> > > hardware, in the case of isochronous transfers it is impractical to
> > > queue a big amount of data per request, especially if the necessary
> > > bandwidth was increased on purpose to sustain high amounts of data.
> > > 
> > > E.g. in the dwc3 controller this has the negative impact that the
> > > endpoint FIFO will not be filled fast enough by the internal hardware
> > > engine. Since each transfer request block (TRB) represents one big
> > > request, the hardware will need a long time to prefill the hardware
> > > FIFO. This can be avoided by queueing more but smaller requests, which
> > > lead to smaller TRBs that the hardware engine can keep up with.
> > 
> > Just want to clarify here to avoid any confusion: the hardware TX FIFO
> > size is relatively small and can often be smaller than the TRB. It
> > should be fine whether the TRB is larger or smaller than the FIFO size.
> > The hardware does not "prefill" the FIFO; it just fills it from
> > whichever TRB it's currently processing (I think you may be mixing this
> > up with the TRB cache).
> 
> What I see is that with bigger TRBs the hardware is not able to keep
> up with reading from memory when the system is under heavy memory
> pressure. This is the main reason for this change.
> 
> We found out that increasing the FIFO size affected how high we were
> able to set the hardware endpoint configuration on our gadget side
> (params.param0) before we saw the issue reoccur.
> 
> So the idea here was to tweak how the hardware handles the data from
> memory to the hardware FIFO, which does not seem to underrun with
> smaller TRBs.
> 
> > The performance impact of this change is that it reduces the USB
> > bandwidth usage. The current calculation in the uvc function can use
> > 48KB/uframe for each request at SuperSpeed, the maximum size for isoc
> > at SuperSpeed. I know many hosts can't handle this kind of transfer
> > rate in their hardware (e.g. it gets worse when scheduling transfers
> > for multiple endpoints and under multiple tiers of hubs).
> 
> I think I don't fully understand here.
> 
> We change the overall buffer size of the usb_request and therefore
> limit the size of the possible TRBs. This should mostly affect the TRB
> size in the memcpy path, since scatter-gather requests are already
> split into multiple TRBs, each capped at the maximum mappable memory
> range of PAGE_SIZE (4k).
> 
> Other than that, the parameterization of the endpoint on our gadget side
> is not changed by this patch. The endpoint configuration is set as follows:
> 
> params.param0 |= DWC3_DEPCFG_BURST_SIZE(burst - 1) |
>                  DWC3_DEPCFG_MAX_PACKET_SIZE(usb_endpoint_maxp(desc));
> 
> So by changing the request size there should not be any other change in
> the gadget-side hardware configuration.
> 
> How is the overall bandwidth usage affected by this change, then, other
> than having more but smaller TRBs in the queue?
> 
> If the interval length is not coupled to the number of TRBs to be
> transferred, it should not have an effect on the bandwidth.
> 
> If I am mistaken here, can you point me to some code?
> 

My point was to clarify that reducing the request size means more time
is available for the transfer to go out within the given usb bandwidth.

Both the host and the device side can affect when and how much data can
be sent within an interval. The host has its own calculation of how much
usb bandwidth is allocated to the device and endpoint. Some hosts give
more, some give less. The more data is transferred, the less time there
is to send out all the data. Your tweaking of the FIFO size and the bus
burst helps improve the performance on the device side.

I also want to note that the transfer size of the usb request is what
matters, since it accounts for the transfer over an interval. If you use
an SG request, then you may have many smaller TRBs. For newer
controllers, the performance should be equivalent to a request using a
large TRB.
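
For illustration, queueing with SG from a function driver looks roughly
like this (untested sketch; assumes gadget->sg_supported, a contiguous
page-aligned buffer, and linux/scatterlist.h plus linux/usb/gadget.h):

	static int queue_sg_request(struct usb_ep *ep, void *buf,
				    unsigned int len)
	{
		unsigned int nents = DIV_ROUND_UP(len, PAGE_SIZE);
		struct usb_request *req;
		struct scatterlist *sg;
		unsigned int i;

		req = usb_ep_alloc_request(ep, GFP_KERNEL);
		if (!req)
			return -ENOMEM;

		sg = kmalloc_array(nents, sizeof(*sg), GFP_KERNEL);
		if (!sg) {
			usb_ep_free_request(ep, req);
			return -ENOMEM;
		}

		/* one PAGE_SIZE entry each, so the UDC maps one TRB per page */
		sg_init_table(sg, nents);
		for (i = 0; i < nents; i++)
			sg_set_buf(&sg[i], buf + i * PAGE_SIZE,
				   min_t(unsigned int,
					 len - i * PAGE_SIZE, PAGE_SIZE));

		req->sg = sg;
		req->num_sgs = nents;
		req->length = len;

		/* the completion handler must free sg and req (omitted) */
		return usb_ep_queue(ep, req, GFP_KERNEL);
	}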

Regarding the question earlier, can you cap the size of the request
depending on how much video data is expected over a given video frame
instead?

Thanks,
Thinh

Patch

diff --git a/drivers/usb/gadget/function/uvc_queue.c b/drivers/usb/gadget/function/uvc_queue.c
index 0aa3d7e1f3cc32..1d3c3c09ff97cb 100644
--- a/drivers/usb/gadget/function/uvc_queue.c
+++ b/drivers/usb/gadget/function/uvc_queue.c
@@ -55,7 +55,6 @@  static int uvc_queue_setup(struct vb2_queue *vq,
 	sizes[0] = video->imagesize;
 
 	req_size = video->ep->maxpacket
-		 * max_t(unsigned int, video->ep->maxburst, 1)
 		 * (video->ep->mult);
 
 	/* We divide by two, to increase the chance to run
diff --git a/drivers/usb/gadget/function/uvc_video.c b/drivers/usb/gadget/function/uvc_video.c
index 91af3b1ef0d412..c6b61843bad3d7 100644
--- a/drivers/usb/gadget/function/uvc_video.c
+++ b/drivers/usb/gadget/function/uvc_video.c
@@ -329,7 +329,6 @@  uvc_video_alloc_requests(struct uvc_video *video)
 	BUG_ON(video->req_size);
 
 	req_size = video->ep->maxpacket
-		 * max_t(unsigned int, video->ep->maxburst, 1)
 		 * (video->ep->mult);
 
 	video->ureq = kcalloc(video->uvc_num_requests, sizeof(struct uvc_request), GFP_KERNEL);