Message ID | 20190802131226.123800-1-shik@chromium.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | media: uvcvideo: Use streaming DMA APIs to transfer buffers | expand |
On Fri, Aug 2, 2019 at 10:12 PM Shik Chen <shik@chromium.org> wrote: > > Similar to the commit 1161db6776bd ("media: usb: pwc: Don't use coherent > DMA buffers for ISO transfer") [1] for the pwc driver. Use streaming DMA > APIs to transfer buffers and sync them explicitly, because accessing > buffers allocated by usb_alloc_coherent() is slow on platforms without > hardware coherent DMA. > > Tested on x86-64 (Intel Celeron 4305U) and armv7l (Rockchip RK3288) with > Logitech Brio 4K camera at 1920x1080 using the WebRTC sample site [3]. > > | | URB (us) | Decode (Gbps) | CPU (%) | > |------------------|------------|---------------|---------| > | x86-64 coherent | 53 +- 20 | 50.6 | 0.24 | > | x86-64 streaming | 55 +- 19 | 50.1 | 0.25 | > | armv7l coherent | 342 +- 379 | 1.8 | 2.16 | > | armv7l streaming | 99 +- 98 | 11.0 | 0.36 | > > The performance characteristics are improved a lot on armv7l, and > remained (almost) unchanged on x86-64. The code used for measurement can > be found at [2]. > > [1] https://git.kernel.org/torvalds/c/1161db6776bd > [2] https://crrev.com/c/1729133 > [3] https://webrtc.github.io/samples/src/content/getusermedia/resolution/ > > Signed-off-by: Shik Chen <shik@chromium.org> > --- > The allocated buffer could be as large as 768K when streaming 4K video. > Ideally we should use some generic helper to allocate non-coherent > memory in a more efficient way, such as https://lwn.net/Articles/774429/ > ("make the non-consistent DMA allocator more userful"). > > But I cannot find any existing helper so a simple kzalloc() is used in > this patch. The logic to figure out the DMA addressable GFP flags is > similar to __dma_direct_alloc_pages() without the optimistic retries: > https://elixir.bootlin.com/linux/v5.2.5/source/kernel/dma/direct.c#L96 > > drivers/media/usb/uvc/uvc_video.c | 65 +++++++++++++++++++++---------- > 1 file changed, 45 insertions(+), 20 deletions(-) > > diff --git a/drivers/media/usb/uvc/uvc_video.c b/drivers/media/usb/uvc/uvc_video.c > index 8fa77a81dd7f2c..962c35478896c4 100644 > --- a/drivers/media/usb/uvc/uvc_video.c > +++ b/drivers/media/usb/uvc/uvc_video.c > @@ -1539,6 +1539,8 @@ static void uvc_video_complete(struct urb *urb) > * Process the URB headers, and optionally queue expensive memcpy tasks > * to be deferred to a work queue. > */ > + dma_sync_single_for_cpu(&urb->dev->dev, urb->transfer_dma, > + urb->transfer_buffer_length, DMA_FROM_DEVICE); > stream->decode(uvc_urb, buf, buf_meta); > > /* If no async work is needed, resubmit the URB immediately. */ > @@ -1565,18 +1567,51 @@ static void uvc_free_urb_buffers(struct uvc_streaming *stream) > if (!uvc_urb->buffer) > continue; > > -#ifndef CONFIG_DMA_NONCOHERENT > - usb_free_coherent(stream->dev->udev, stream->urb_size, > - uvc_urb->buffer, uvc_urb->dma); > -#else > + dma_unmap_single(&stream->dev->udev->dev, uvc_urb->dma, > + stream->urb_size, DMA_FROM_DEVICE); > kfree(uvc_urb->buffer); > -#endif > - uvc_urb->buffer = NULL; > } > > stream->urb_size = 0; > } > > +static gfp_t uvc_alloc_gfp_flags(struct device *dev) > +{ > + u64 mask = dma_get_mask(dev); > + > + if (dev->bus_dma_mask) > + mask &= dev->bus_dma_mask; > + > + if (mask < DMA_BIT_MASK(32) && IS_ENABLED(CONFIG_ZONE_DMA)) > + return GFP_DMA; > + > + if (mask < DMA_BIT_MASK(64)) { > + if (IS_ENABLED(CONFIG_ZONE_DMA32)) > + return GFP_DMA32; > + if (IS_ENABLED(CONFIG_ZONE_DMA)) > + return GFP_DMA; > + } > + > + return 0; > +} > + > +static char *uvc_alloc_urb_buffer(struct device *dev, size_t size, > + gfp_t gfp_flags, dma_addr_t *dma_handle) > +{ > + void *buffer = kzalloc(size, gfp_flags | uvc_alloc_gfp_flags(dev)); > + > + if (!buffer) > + return NULL; > + > + *dma_handle = dma_map_single(dev, buffer, size, DMA_FROM_DEVICE); > + if (dma_mapping_error(dev, *dma_handle)) { > + kfree(buffer); > + return NULL; > + } > + > + return buffer; > +} > + > /* > * Allocate transfer buffers. This function can be called with buffers > * already allocated when resuming from suspend, in which case it will > @@ -1607,18 +1642,14 @@ static int uvc_alloc_urb_buffers(struct uvc_streaming *stream, > > /* Retry allocations until one succeed. */ > for (; npackets > 1; npackets /= 2) { > + stream->urb_size = psize * npackets; > + > for (i = 0; i < UVC_URBS; ++i) { > struct uvc_urb *uvc_urb = &stream->uvc_urb[i]; > > - stream->urb_size = psize * npackets; > -#ifndef CONFIG_DMA_NONCOHERENT > - uvc_urb->buffer = usb_alloc_coherent( > - stream->dev->udev, stream->urb_size, > + uvc_urb->buffer = uvc_alloc_urb_buffer( > + &stream->dev->udev->dev, stream->urb_size, > gfp_flags | __GFP_NOWARN, &uvc_urb->dma); > -#else > - uvc_urb->buffer = > - kmalloc(stream->urb_size, gfp_flags | __GFP_NOWARN); > -#endif > if (!uvc_urb->buffer) { > uvc_free_urb_buffers(stream); > break; > @@ -1728,12 +1759,8 @@ static int uvc_init_video_isoc(struct uvc_streaming *stream, > urb->context = uvc_urb; > urb->pipe = usb_rcvisocpipe(stream->dev->udev, > ep->desc.bEndpointAddress); > -#ifndef CONFIG_DMA_NONCOHERENT > urb->transfer_flags = URB_ISO_ASAP | URB_NO_TRANSFER_DMA_MAP; > urb->transfer_dma = uvc_urb->dma; > -#else > - urb->transfer_flags = URB_ISO_ASAP; > -#endif > urb->interval = ep->desc.bInterval; > urb->transfer_buffer = uvc_urb->buffer; > urb->complete = uvc_video_complete; > @@ -1793,10 +1820,8 @@ static int uvc_init_video_bulk(struct uvc_streaming *stream, > > usb_fill_bulk_urb(urb, stream->dev->udev, pipe, uvc_urb->buffer, > size, uvc_video_complete, uvc_urb); > -#ifndef CONFIG_DMA_NONCOHERENT > urb->transfer_flags = URB_NO_TRANSFER_DMA_MAP; > urb->transfer_dma = uvc_urb->dma; > -#endif > > uvc_urb->urb = urb; > } > -- > 2.22.0.770.g0f2c4a37fd-goog > Gentle ping. Also, oops, looks like we forgot to add Kieran on CC. Let me fix it now. Best regards, Tomasz
On Fri, Aug 2, 2019 at 9:13 PM Shik Chen <shik@chromium.org> wrote: > > Similar to the commit 1161db6776bd ("media: usb: pwc: Don't use coherent > DMA buffers for ISO transfer") [1] for the pwc driver. Use streaming DMA > APIs to transfer buffers and sync them explicitly, because accessing > buffers allocated by usb_alloc_coherent() is slow on platforms without > hardware coherent DMA. > > Tested on x86-64 (Intel Celeron 4305U) and armv7l (Rockchip RK3288) with > Logitech Brio 4K camera at 1920x1080 using the WebRTC sample site [3]. > > | | URB (us) | Decode (Gbps) | CPU (%) | > |------------------|------------|---------------|---------| > | x86-64 coherent | 53 +- 20 | 50.6 | 0.24 | > | x86-64 streaming | 55 +- 19 | 50.1 | 0.25 | > | armv7l coherent | 342 +- 379 | 1.8 | 2.16 | > | armv7l streaming | 99 +- 98 | 11.0 | 0.36 | > > The performance characteristics are improved a lot on armv7l, and > remained (almost) unchanged on x86-64. The code used for measurement can > be found at [2]. > > [1] https://git.kernel.org/torvalds/c/1161db6776bd > [2] https://crrev.com/c/1729133 > [3] https://webrtc.github.io/samples/src/content/getusermedia/resolution/ > > Signed-off-by: Shik Chen <shik@chromium.org> > --- > The allocated buffer could be as large as 768K when streaming 4K video. > Ideally we should use some generic helper to allocate non-coherent > memory in a more efficient way, such as https://lwn.net/Articles/774429/ > ("make the non-consistent DMA allocator more userful"). > > But I cannot find any existing helper so a simple kzalloc() is used in > this patch. The logic to figure out the DMA addressable GFP flags is > similar to __dma_direct_alloc_pages() without the optimistic retries: > https://elixir.bootlin.com/linux/v5.2.5/source/kernel/dma/direct.c#L96 > > drivers/media/usb/uvc/uvc_video.c | 65 +++++++++++++++++++++---------- > 1 file changed, 45 insertions(+), 20 deletions(-) > > diff --git a/drivers/media/usb/uvc/uvc_video.c b/drivers/media/usb/uvc/uvc_video.c > index 8fa77a81dd7f2c..962c35478896c4 100644 > --- a/drivers/media/usb/uvc/uvc_video.c > +++ b/drivers/media/usb/uvc/uvc_video.c > @@ -1539,6 +1539,8 @@ static void uvc_video_complete(struct urb *urb) > * Process the URB headers, and optionally queue expensive memcpy tasks > * to be deferred to a work queue. > */ > + dma_sync_single_for_cpu(&urb->dev->dev, urb->transfer_dma, > + urb->transfer_buffer_length, DMA_FROM_DEVICE); > stream->decode(uvc_urb, buf, buf_meta); > > /* If no async work is needed, resubmit the URB immediately. */ > @@ -1565,18 +1567,51 @@ static void uvc_free_urb_buffers(struct uvc_streaming *stream) > if (!uvc_urb->buffer) > continue; > > -#ifndef CONFIG_DMA_NONCOHERENT > - usb_free_coherent(stream->dev->udev, stream->urb_size, > - uvc_urb->buffer, uvc_urb->dma); > -#else > + dma_unmap_single(&stream->dev->udev->dev, uvc_urb->dma, > + stream->urb_size, DMA_FROM_DEVICE); > kfree(uvc_urb->buffer); > -#endif > - uvc_urb->buffer = NULL; > } > > stream->urb_size = 0; > } > > +static gfp_t uvc_alloc_gfp_flags(struct device *dev) > +{ > + u64 mask = dma_get_mask(dev); > + > + if (dev->bus_dma_mask) > + mask &= dev->bus_dma_mask; > + > + if (mask < DMA_BIT_MASK(32) && IS_ENABLED(CONFIG_ZONE_DMA)) > + return GFP_DMA; > + > + if (mask < DMA_BIT_MASK(64)) { > + if (IS_ENABLED(CONFIG_ZONE_DMA32)) > + return GFP_DMA32; We're hitting issues with this on 64-bit ARM platform, where ZONE_DMA32 is enabled (default), the kmalloc allocation with GFP_DMA32 fails. There is no slab cache for GFP_DMA32, so those calls are expected to fail, AFAICT there are no other (working) kmalloc(..., .. | GFP_DMA32) users in the kernel, so I don't think we want to add a cache. If this helps, some discussion here why the cache wasn't added: https://lore.kernel.org/patchwork/patch/1009563/#1198622 This function looks out of place in a high-level driver, but from your comment above (below ---), I guess we may have to live with that until the kernel provides a better API? For the time being, looking at how __dma_direct_alloc_pages works, could you use alloc_pages_node (or dma_alloc_contiguous?) instead of kmalloc, that would let you use GFP_DMA32 as needed? > + if (IS_ENABLED(CONFIG_ZONE_DMA)) > + return GFP_DMA; > + } > + > + return 0; > +} > + > +static char *uvc_alloc_urb_buffer(struct device *dev, size_t size, > + gfp_t gfp_flags, dma_addr_t *dma_handle) > +{ > + void *buffer = kzalloc(size, gfp_flags | uvc_alloc_gfp_flags(dev)); > + > + if (!buffer) > + return NULL; > + > + *dma_handle = dma_map_single(dev, buffer, size, DMA_FROM_DEVICE); > + if (dma_mapping_error(dev, *dma_handle)) { > + kfree(buffer); > + return NULL; > + } > + > + return buffer; > +} > + > /* > * Allocate transfer buffers. This function can be called with buffers > * already allocated when resuming from suspend, in which case it will > @@ -1607,18 +1642,14 @@ static int uvc_alloc_urb_buffers(struct uvc_streaming *stream, > > /* Retry allocations until one succeed. */ > for (; npackets > 1; npackets /= 2) { > + stream->urb_size = psize * npackets; > + > for (i = 0; i < UVC_URBS; ++i) { > struct uvc_urb *uvc_urb = &stream->uvc_urb[i]; > > - stream->urb_size = psize * npackets; > -#ifndef CONFIG_DMA_NONCOHERENT > - uvc_urb->buffer = usb_alloc_coherent( > - stream->dev->udev, stream->urb_size, > + uvc_urb->buffer = uvc_alloc_urb_buffer( > + &stream->dev->udev->dev, stream->urb_size, > gfp_flags | __GFP_NOWARN, &uvc_urb->dma); > -#else > - uvc_urb->buffer = > - kmalloc(stream->urb_size, gfp_flags | __GFP_NOWARN); > -#endif > if (!uvc_urb->buffer) { > uvc_free_urb_buffers(stream); > break; > @@ -1728,12 +1759,8 @@ static int uvc_init_video_isoc(struct uvc_streaming *stream, > urb->context = uvc_urb; > urb->pipe = usb_rcvisocpipe(stream->dev->udev, > ep->desc.bEndpointAddress); > -#ifndef CONFIG_DMA_NONCOHERENT > urb->transfer_flags = URB_ISO_ASAP | URB_NO_TRANSFER_DMA_MAP; > urb->transfer_dma = uvc_urb->dma; > -#else > - urb->transfer_flags = URB_ISO_ASAP; > -#endif > urb->interval = ep->desc.bInterval; > urb->transfer_buffer = uvc_urb->buffer; > urb->complete = uvc_video_complete; > @@ -1793,10 +1820,8 @@ static int uvc_init_video_bulk(struct uvc_streaming *stream, > > usb_fill_bulk_urb(urb, stream->dev->udev, pipe, uvc_urb->buffer, > size, uvc_video_complete, uvc_urb); > -#ifndef CONFIG_DMA_NONCOHERENT > urb->transfer_flags = URB_NO_TRANSFER_DMA_MAP; > urb->transfer_dma = uvc_urb->dma; > -#endif > > uvc_urb->urb = urb; > } > -- > 2.22.0.770.g0f2c4a37fd-goog >
On Sat, Sep 28, 2019 at 11:33:16AM +0800, Nicolas Boichat wrote: > > +static gfp_t uvc_alloc_gfp_flags(struct device *dev) > > +{ > > + u64 mask = dma_get_mask(dev); > > + > > + if (dev->bus_dma_mask) > > + mask &= dev->bus_dma_mask; > > + > > + if (mask < DMA_BIT_MASK(32) && IS_ENABLED(CONFIG_ZONE_DMA)) > > + return GFP_DMA; > > + > > + if (mask < DMA_BIT_MASK(64)) { > > + if (IS_ENABLED(CONFIG_ZONE_DMA32)) > > + return GFP_DMA32; > > We're hitting issues with this on 64-bit ARM platform, where > ZONE_DMA32 is enabled (default), the kmalloc allocation with GFP_DMA32 > fails. > > There is no slab cache for GFP_DMA32, so those calls are expected to > fail, AFAICT there are no other (working) kmalloc(..., .. | GFP_DMA32) > users in the kernel, so I don't think we want to add a cache. If this > helps, some discussion here why the cache wasn't added: > https://lore.kernel.org/patchwork/patch/1009563/#1198622 And drivers really have no business looking at the dma mask. I have a plan for dma_alloc_pages API that could replace that cruft, but until then please use GFP_KERNEL and let the dma subsystem bounce buffer if needed.
On Mon, Sep 30, 2019 at 01:23:10AM -0700, Christoph Hellwig wrote: > And drivers really have no business looking at the dma mask. I have > a plan for dma_alloc_pages API that could replace that cruft, but > until then please use GFP_KERNEL and let the dma subsystem bounce > buffer if needed. Can you try this series: http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma_alloc_pages and see if it does whay you need for usb?
+Sergey Senozhatsky who's going to be looking into this. Hi Christoph, On Tue, Oct 1, 2019 at 3:37 PM Christoph Hellwig <hch@infradead.org> wrote: > > On Mon, Sep 30, 2019 at 01:23:10AM -0700, Christoph Hellwig wrote: > > And drivers really have no business looking at the dma mask. I have > > a plan for dma_alloc_pages API that could replace that cruft, but > > until then please use GFP_KERNEL and let the dma subsystem bounce > > buffer if needed. > > Can you try this series: > > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma_alloc_pages > > and see if it does whay you need for usb? Reviving this thread. Sorry for no updates for a long time. dma_alloc_pages() still wouldn't be an equivalent replacement of the existing dma_alloc_coherent() (used behind the scenes by usb_alloc_coherent()). That's because the latter can allocate non-contiguous memory if the DMA device can handle it (i.e. is behind an IOMMU), but the former can only allocate a contiguous range of pages. That said, I noticed that you also put a lot of effort into making the NONCONSISTENT attribute more usable. Perhaps that's the way to go here then? Of course we would need to make sure that the attribute is handled properly on ARM and ARM64, which are the most affected platforms. Right now neither handles them. The former doesn't use the generic DMA mapping ops, while the latter does, but doesn't enable a Kconfig option needed to allow generic inconsistent allocations. Any hints would be appreciated. Best regards, Tomasz
On Thu, Feb 27, 2020 at 7:28 AM Tomasz Figa <tfiga@chromium.org> wrote: > > +Sergey Senozhatsky who's going to be looking into this. > > Hi Christoph, > > On Tue, Oct 1, 2019 at 3:37 PM Christoph Hellwig <hch@infradead.org> wrote: > > > > On Mon, Sep 30, 2019 at 01:23:10AM -0700, Christoph Hellwig wrote: > > > And drivers really have no business looking at the dma mask. I have > > > a plan for dma_alloc_pages API that could replace that cruft, but > > > until then please use GFP_KERNEL and let the dma subsystem bounce > > > buffer if needed. > > > > Can you try this series: > > > > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma_alloc_pages > > > > and see if it does whay you need for usb? > > Reviving this thread. Sorry for no updates for a long time. > > dma_alloc_pages() still wouldn't be an equivalent replacement of the > existing dma_alloc_coherent() (used behind the scenes by > usb_alloc_coherent()). That's because the latter can allocate > non-contiguous memory if the DMA device can handle it (i.e. is behind > an IOMMU), but the former can only allocate a contiguous range of > pages. > > That said, I noticed that you also put a lot of effort into making the > NONCONSISTENT attribute more usable. Perhaps that's the way to go here > then? Of course we would need to make sure that the attribute is > handled properly on ARM and ARM64, which are the most affected > platforms. Right now neither handles them. The former doesn't use the > generic DMA mapping ops, while the latter does, but doesn't enable a > Kconfig option needed to allow generic inconsistent allocations. > > Any hints would be appreciated. Hi Christoph, would you have some time to check the above? Hi Catalin, Will, do you know why CONFIG_DMA_NONCOHERENT_CACHE_SYNC is not enabled on arm64? Thanks in advance. :) Best regards, Tomasz
On Tue, Apr 21, 2020 at 01:21:15PM +0200, Tomasz Figa wrote: > On Thu, Feb 27, 2020 at 7:28 AM Tomasz Figa <tfiga@chromium.org> wrote: > > > > +Sergey Senozhatsky who's going to be looking into this. > > > > Hi Christoph, > > > > That said, I noticed that you also put a lot of effort into making the > > NONCONSISTENT attribute more usable. Perhaps that's the way to go here > > then? Of course we would need to make sure that the attribute is > > handled properly on ARM and ARM64, which are the most affected > > platforms. Right now neither handles them. The former doesn't use the > > generic DMA mapping ops, while the latter does, but doesn't enable a > > Kconfig option needed to allow generic inconsistent allocations. > > > > Any hints would be appreciated. > > Hi Christoph, would you have some time to check the above? > > Hi Catalin, Will, do you know why CONFIG_DMA_NONCOHERENT_CACHE_SYNC is > not enabled on arm64? NONCONSISTENT is still a mess, mostly because dma_cache_sync is such a horrible API. I've been wanting to switch to the normal sync_for_device / sync_for_cpu primitives instead. Let me see if I can expedite that.
diff --git a/drivers/media/usb/uvc/uvc_video.c b/drivers/media/usb/uvc/uvc_video.c index 8fa77a81dd7f2c..962c35478896c4 100644 --- a/drivers/media/usb/uvc/uvc_video.c +++ b/drivers/media/usb/uvc/uvc_video.c @@ -1539,6 +1539,8 @@ static void uvc_video_complete(struct urb *urb) * Process the URB headers, and optionally queue expensive memcpy tasks * to be deferred to a work queue. */ + dma_sync_single_for_cpu(&urb->dev->dev, urb->transfer_dma, + urb->transfer_buffer_length, DMA_FROM_DEVICE); stream->decode(uvc_urb, buf, buf_meta); /* If no async work is needed, resubmit the URB immediately. */ @@ -1565,18 +1567,51 @@ static void uvc_free_urb_buffers(struct uvc_streaming *stream) if (!uvc_urb->buffer) continue; -#ifndef CONFIG_DMA_NONCOHERENT - usb_free_coherent(stream->dev->udev, stream->urb_size, - uvc_urb->buffer, uvc_urb->dma); -#else + dma_unmap_single(&stream->dev->udev->dev, uvc_urb->dma, + stream->urb_size, DMA_FROM_DEVICE); kfree(uvc_urb->buffer); -#endif - uvc_urb->buffer = NULL; } stream->urb_size = 0; } +static gfp_t uvc_alloc_gfp_flags(struct device *dev) +{ + u64 mask = dma_get_mask(dev); + + if (dev->bus_dma_mask) + mask &= dev->bus_dma_mask; + + if (mask < DMA_BIT_MASK(32) && IS_ENABLED(CONFIG_ZONE_DMA)) + return GFP_DMA; + + if (mask < DMA_BIT_MASK(64)) { + if (IS_ENABLED(CONFIG_ZONE_DMA32)) + return GFP_DMA32; + if (IS_ENABLED(CONFIG_ZONE_DMA)) + return GFP_DMA; + } + + return 0; +} + +static char *uvc_alloc_urb_buffer(struct device *dev, size_t size, + gfp_t gfp_flags, dma_addr_t *dma_handle) +{ + void *buffer = kzalloc(size, gfp_flags | uvc_alloc_gfp_flags(dev)); + + if (!buffer) + return NULL; + + *dma_handle = dma_map_single(dev, buffer, size, DMA_FROM_DEVICE); + if (dma_mapping_error(dev, *dma_handle)) { + kfree(buffer); + return NULL; + } + + return buffer; +} + /* * Allocate transfer buffers. This function can be called with buffers * already allocated when resuming from suspend, in which case it will @@ -1607,18 +1642,14 @@ static int uvc_alloc_urb_buffers(struct uvc_streaming *stream, /* Retry allocations until one succeed. */ for (; npackets > 1; npackets /= 2) { + stream->urb_size = psize * npackets; + for (i = 0; i < UVC_URBS; ++i) { struct uvc_urb *uvc_urb = &stream->uvc_urb[i]; - stream->urb_size = psize * npackets; -#ifndef CONFIG_DMA_NONCOHERENT - uvc_urb->buffer = usb_alloc_coherent( - stream->dev->udev, stream->urb_size, + uvc_urb->buffer = uvc_alloc_urb_buffer( + &stream->dev->udev->dev, stream->urb_size, gfp_flags | __GFP_NOWARN, &uvc_urb->dma); -#else - uvc_urb->buffer = - kmalloc(stream->urb_size, gfp_flags | __GFP_NOWARN); -#endif if (!uvc_urb->buffer) { uvc_free_urb_buffers(stream); break; @@ -1728,12 +1759,8 @@ static int uvc_init_video_isoc(struct uvc_streaming *stream, urb->context = uvc_urb; urb->pipe = usb_rcvisocpipe(stream->dev->udev, ep->desc.bEndpointAddress); -#ifndef CONFIG_DMA_NONCOHERENT urb->transfer_flags = URB_ISO_ASAP | URB_NO_TRANSFER_DMA_MAP; urb->transfer_dma = uvc_urb->dma; -#else - urb->transfer_flags = URB_ISO_ASAP; -#endif urb->interval = ep->desc.bInterval; urb->transfer_buffer = uvc_urb->buffer; urb->complete = uvc_video_complete; @@ -1793,10 +1820,8 @@ static int uvc_init_video_bulk(struct uvc_streaming *stream, usb_fill_bulk_urb(urb, stream->dev->udev, pipe, uvc_urb->buffer, size, uvc_video_complete, uvc_urb); -#ifndef CONFIG_DMA_NONCOHERENT urb->transfer_flags = URB_NO_TRANSFER_DMA_MAP; urb->transfer_dma = uvc_urb->dma; -#endif uvc_urb->urb = urb; }
Similar to the commit 1161db6776bd ("media: usb: pwc: Don't use coherent DMA buffers for ISO transfer") [1] for the pwc driver. Use streaming DMA APIs to transfer buffers and sync them explicitly, because accessing buffers allocated by usb_alloc_coherent() is slow on platforms without hardware coherent DMA. Tested on x86-64 (Intel Celeron 4305U) and armv7l (Rockchip RK3288) with Logitech Brio 4K camera at 1920x1080 using the WebRTC sample site [3]. | | URB (us) | Decode (Gbps) | CPU (%) | |------------------|------------|---------------|---------| | x86-64 coherent | 53 +- 20 | 50.6 | 0.24 | | x86-64 streaming | 55 +- 19 | 50.1 | 0.25 | | armv7l coherent | 342 +- 379 | 1.8 | 2.16 | | armv7l streaming | 99 +- 98 | 11.0 | 0.36 | The performance characteristics are improved a lot on armv7l, and remained (almost) unchanged on x86-64. The code used for measurement can be found at [2]. [1] https://git.kernel.org/torvalds/c/1161db6776bd [2] https://crrev.com/c/1729133 [3] https://webrtc.github.io/samples/src/content/getusermedia/resolution/ Signed-off-by: Shik Chen <shik@chromium.org> --- The allocated buffer could be as large as 768K when streaming 4K video. Ideally we should use some generic helper to allocate non-coherent memory in a more efficient way, such as https://lwn.net/Articles/774429/ ("make the non-consistent DMA allocator more userful"). But I cannot find any existing helper so a simple kzalloc() is used in this patch. The logic to figure out the DMA addressable GFP flags is similar to __dma_direct_alloc_pages() without the optimistic retries: https://elixir.bootlin.com/linux/v5.2.5/source/kernel/dma/direct.c#L96 drivers/media/usb/uvc/uvc_video.c | 65 +++++++++++++++++++++---------- 1 file changed, 45 insertions(+), 20 deletions(-)