[RFC] Experimental DMA-BUF Device Heaps

Message ID 20200816172246.69146-1-ezequiel@collabora.com (mailing list archive)
State New, archived
Series [RFC] Experimental DMA-BUF Device Heaps

Commit Message

Ezequiel Garcia Aug. 16, 2020, 5:22 p.m. UTC
This heap is basically a wrapper around DMA-API dma_alloc_attrs,
which will allocate memory suitable for the given device.

The implementation is mostly a port of the Contiguous Videobuf2
memory allocator (see videobuf2/videobuf2-dma-contig.c)
over to the DMA-BUF Heap interface.

The intention of this allocator is to provide applications
with a more system-agnostic API: the only thing the application
needs to know is which device to get the buffer for.

Whether the buffer is backed by CMA, IOMMU or a DMA Pool
is unknown to the application.
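
To illustrate, the application side reuses the existing dma-buf heap
uAPI unchanged; the only new bit is the per-device heap node (the node
name below is just an example of what this patch could register):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/dma-heap.h>

int main(void)
{
	/* hypothetical per-device heap node registered by this patch */
	int heap_fd = open("/dev/dma_heap/video-decoder", O_RDWR | O_CLOEXEC);
	struct dma_heap_allocation_data alloc = {
		.len = 4 * 1024 * 1024,		/* e.g. one picture buffer */
		.fd_flags = O_RDWR | O_CLOEXEC,
	};

	if (heap_fd < 0 || ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc) < 0) {
		perror("dma-heap alloc");
		return 1;
	}

	/* alloc.fd is an ordinary dma-buf, usable with V4L2, DRM, etc. */
	printf("got dma-buf fd %d\n", alloc.fd);
	close(alloc.fd);
	close(heap_fd);
	return 0;
}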

I'm not really expecting this patch to be correct or even
a good idea, but just submitting it to start a discussion on DMA-BUF
heap discovery and negotiation.

Given Plumbers is just a couple weeks from now, I've submitted
a BoF proposal to discuss this, as perhaps it would make
sense to discuss this live?

Not-signed-off-by: Ezequiel Garcia <ezequiel@collabora.com>
---
 drivers/dma-buf/heaps/Kconfig       |   9 +
 drivers/dma-buf/heaps/Makefile      |   1 +
 drivers/dma-buf/heaps/device_heap.c | 268 ++++++++++++++++++++++++++++
 include/linux/device.h              |   5 +
 include/linux/dma-heap.h            |   6 +
 5 files changed, 289 insertions(+)
 create mode 100644 drivers/dma-buf/heaps/device_heap.c
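
For reference, the core of device_heap.c boils down to a
dma_alloc_attrs() call plus a dma-buf export. A simplified sketch of
the allocation path (error handling trimmed, dma_buf_ops and the
dma_heap_ops glue omitted, structure and names approximate):

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/err.h>
#include <linux/fcntl.h>
#include <linux/mm.h>
#include <linux/slab.h>

struct device_heap_buffer {
	struct device *dev;	/* device the heap was registered for */
	void *vaddr;		/* kernel mapping from dma_alloc_attrs() */
	dma_addr_t dma_addr;	/* bus address usable by the device */
	size_t size;
	unsigned long attrs;
};

/* attach/map/mmap callbacks omitted from this sketch */
static const struct dma_buf_ops device_heap_buf_ops;

static struct dma_buf *device_heap_do_alloc(struct device *dev, size_t size)
{
	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
	struct device_heap_buffer *buf;
	struct dma_buf *dmabuf;

	buf = kzalloc(sizeof(*buf), GFP_KERNEL);
	if (!buf)
		return ERR_PTR(-ENOMEM);

	buf->dev = dev;
	buf->size = PAGE_ALIGN(size);
	buf->attrs = 0;		/* dma_attrs could be plumbed through here */
	/* CMA, IOMMU or a DMA pool: whatever the DMA API picks for @dev */
	buf->vaddr = dma_alloc_attrs(dev, buf->size, &buf->dma_addr,
				     GFP_KERNEL, buf->attrs);
	if (!buf->vaddr) {
		kfree(buf);
		return ERR_PTR(-ENOMEM);
	}

	exp_info.ops = &device_heap_buf_ops;
	exp_info.size = buf->size;
	exp_info.flags = O_RDWR | O_CLOEXEC;
	exp_info.priv = buf;

	dmabuf = dma_buf_export(&exp_info);
	if (IS_ERR(dmabuf)) {
		dma_free_attrs(dev, buf->size, buf->vaddr, buf->dma_addr,
			       buf->attrs);
		kfree(buf);
	}
	return dmabuf;
}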

Comments

Brian Starkey Aug. 17, 2020, 3:18 p.m. UTC | #1
Hi Ezequiel,

On Sun, Aug 16, 2020 at 02:22:46PM -0300, Ezequiel Garcia wrote:
> This heap is basically a wrapper around DMA-API dma_alloc_attrs,
> which will allocate memory suitable for the given device.
> 
> The implementation is mostly a port of the Contiguous Videobuf2
> memory allocator (see videobuf2/videobuf2-dma-contig.c)
> over to the DMA-BUF Heap interface.
> 
> The intention of this allocator is to provide applications
> with a more system-agnostic API: the only thing the application
> needs to know is which device to get the buffer for.
> 
> Whether the buffer is backed by CMA, IOMMU or a DMA Pool
> is unknown to the application.
> 
> I'm not really expecting this patch to be correct or even
> a good idea, but just submitting it to start a discussion on DMA-BUF
> heap discovery and negotiation.
> 

My initial reaction is that I thought dmabuf heaps are meant for use
to allocate buffers for sharing across devices, which doesn't fit very
well with having per-device heaps.

For single-device allocations, would using the buffer allocation
functionality of that device's native API be better in most
cases? (Some other possibly relevant discussion at [1])

I can see that this can save some boilerplate for devices that want
to expose private chunks of memory, but might it also lead to 100
aliases for the system's generic coherent memory pool?

I wonder if a set of helpers to allow devices to expose whatever they
want with minimal effort would be better.

Cheers,
-Brian

1. https://lore.kernel.org/dri-devel/57062477-30e7-a3de-6723-a50d03a402c4@kapsi.fi/

> Given Plumbers is just a couple weeks from now, I've submitted
> a BoF proposal to discuss this, as perhaps it would make
> sense to discuss this live?
> 
> Not-signed-off-by: Ezequiel Garcia <ezequiel@collabora.com>
James Jones Aug. 18, 2020, 3:49 a.m. UTC | #2
On 8/17/20 8:18 AM, Brian Starkey wrote:
> Hi Ezequiel,
> 
> On Sun, Aug 16, 2020 at 02:22:46PM -0300, Ezequiel Garcia wrote:
>> This heap is basically a wrapper around DMA-API dma_alloc_attrs,
>> which will allocate memory suitable for the given device.
>>
>> The implementation is mostly a port of the Contiguous Videobuf2
>> memory allocator (see videobuf2/videobuf2-dma-contig.c)
>> over to the DMA-BUF Heap interface.
>>
>> The intention of this allocator is to provide applications
>> with a more system-agnostic API: the only thing the application
>> needs to know is which device to get the buffer for.
>>
>> Whether the buffer is backed by CMA, IOMMU or a DMA Pool
>> is unknown to the application.
>>
>> I'm not really expecting this patch to be correct or even
>> a good idea, but just submitting it to start a discussion on DMA-BUF
>> heap discovery and negotiation.
>>
> 
> My initial reaction is that I thought dmabuf heaps are meant for use
> to allocate buffers for sharing across devices, which doesn't fit very
> well with having per-device heaps.
> 
> For single-device allocations, would using the buffer allocation
> functionality of that device's native API be better in most
> cases? (Some other possibly relevant discussion at [1])
> 
> I can see that this can save some boilerplate for devices that want
> to expose private chunks of memory, but might it also lead to 100
> aliases for the system's generic coherent memory pool?
> 
> I wonder if a set of helpers to allow devices to expose whatever they
> want with minimal effort would be better.

I'm rather interested on where this goes, as I was toying with using 
some sort of heap ID as a basis for a "device-local" constraint in the 
memory constraints proposals Simon and I will be discussing at XDC this 
year.  It would be rather elegant if there was one type of heap ID used 
universally throughout the kernel that could provide a unique handle for 
the shared system memory heap(s), as well as accelerator-local heaps on 
fancy NICs, GPUs, NN accelerators, capture devices, etc. so apps could 
negotiate a location among themselves.  This patch seems to be a step 
towards that in a way, but I agree it would be counterproductive if a 
bunch of devices that were using the same underlying system memory ended 
up each getting their own heap ID just because they used some SW 
framework that worked that way.

Would appreciate it if you could send along a pointer to your BoF if it 
happens!

Thanks,
-James

> Cheers,
> -Brian
> 
> 1. https://lore.kernel.org/dri-devel/57062477-30e7-a3de-6723-a50d03a402c4@kapsi.fi/
> 
>> Given Plumbers is just a couple weeks from now, I've submitted
>> a BoF proposal to discuss this, as perhaps it would make
>> sense to discuss this live?
>>
>> Not-signed-off-by: Ezequiel Garcia <ezequiel@collabora.com>
John Stultz Aug. 18, 2020, 4:13 a.m. UTC | #3
On Sun, Aug 16, 2020 at 10:23 AM Ezequiel Garcia <ezequiel@collabora.com> wrote:
>
> This heap is basically a wrapper around DMA-API dma_alloc_attrs,
> which will allocate memory suitable for the given device.
>
> The implementation is mostly a port of the Contiguous Videobuf2
> memory allocator (see videobuf2/videobuf2-dma-contig.c)
> over to the DMA-BUF Heap interface.
>
> The intention of this allocator is to provide applications
> with a more system-agnostic API: the only thing the application
> needs to know is which device to get the buffer for.
>
> Whether the buffer is backed by CMA, IOMMU or a DMA Pool
> is unknown to the application.

My hesitancy here is that the main reason we have DMA BUF Heaps, and
ION before it, was to expose different types of memory allocations to
userspace. The main premise is that often these buffers are shared with
multiple devices, which have differing constraints, and it is userspace
that best understands the path a buffer will take through a series of
devices. So userspace is best positioned to determine what type of
memory should be allocated to satisfy the various devices constraints
(there were some design attempts to allow DMA BUFs to use multiple
attach with deferred alloc at map time to handle this constraint
solving in-kernel, but it was never adopted in practice).

This however, requires some system specific policy - implemented in
the Android userspace by gralloc which maps "usage" types (device
pipeline flows) to heaps. I liken it to fstab, which helps map mount
points to partitions - it's not going to be the same on every system.
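
(As a rough illustration of that policy table - names invented, and the
real thing is gralloc configuration rather than C - the mapping is
per-system data that only userspace can own:)

static const struct {
	const char *usage;	/* pipeline the buffer will flow through */
	const char *heap;	/* /dev/dma_heap/<name> to allocate from */
} usage_to_heap[] = {
	{ "camera->encoder",	"cma"    },	/* no IOMMU on either end */
	{ "decoder->display",	"system" },	/* both ends behind IOMMUs */
	{ "cpu->gpu-texture",	"system" },
};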

What you seem to be proposing seems a bit contrary to this premise -
Userland doesn't know what type of memory it needs, but given a device
can somehow find the heap it should allocate for? This seems to assume
the buffer is only to be used with a single device?

There was at some point a discussion where folks (maybe it was
DanielV? I can't remember...) suggested having something like a sysfs
device node link from a device to a dma-buf heap chardev. This seems
like it would add a similar behavior as what you're proposing, however
without adding possibly a ton of new device specific heaps to the
/dev/dma_heap/ dir. However, we would potentially need any missing
heap types added first.
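
(Purely to illustrate that idea - no such link exists today and the
"dma_heap" link name is invented - userspace would resolve a device to
its heap along these lines:)

#include <limits.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char target[PATH_MAX];
	ssize_t n = readlink("/sys/class/video4linux/video0/device/dma_heap",
			     target, sizeof(target) - 1);
	if (n < 0) {
		perror("readlink");
		return 1;
	}
	target[n] = '\0';
	/* e.g. "../../dma_heap/system" -> open("/dev/dma_heap/system") */
	printf("device's heap: %s\n", target);
	return 0;
}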

> I'm not really expecting this patch to be correct or even
> a good idea, but just submitting it to start a discussion on DMA-BUF
> heap discovery and negotiation.
>
> Given Plumbers is just a couple weeks from now, I've submitted
> a BoF proposal to discuss this, as perhaps it would make
> sense to discuss this live?

I do think it's an interesting idea. I agree that having every driver
implement a dmabuf exporter is a bit silly, but I also think Brian's
point that maybe we need some drm helper functions that provide
similar functionality along with a more standardized device ioctl for
single device allocations might be better.

thanks
-john
Ezequiel Garcia Aug. 20, 2020, 8:07 a.m. UTC | #4
Hi Brian,

Thanks a lot for reviewing this patch.

I'm happy my humble proof-of-concept
is able to spark some interest and raise
some questions.

On Mon, 2020-08-17 at 16:18 +0100, Brian Starkey wrote:
> Hi Ezequiel,
> 
> On Sun, Aug 16, 2020 at 02:22:46PM -0300, Ezequiel Garcia wrote:
> > This heap is basically a wrapper around DMA-API dma_alloc_attrs,
> > which will allocate memory suitable for the given device.
> > 
> > The implementation is mostly a port of the Contiguous Videobuf2
> > memory allocator (see videobuf2/videobuf2-dma-contig.c)
> > over to the DMA-BUF Heap interface.
> > 
> > The intention of this allocator is to provide applications
> > with a more system-agnostic API: the only thing the application
> > needs to know is which device to get the buffer for.
> > 
> > Whether the buffer is backed by CMA, IOMMU or a DMA Pool
> > is unknown to the application.
> > 
> > I'm not really expecting this patch to be correct or even
> > a good idea, but just submitting it to start a discussion on DMA-BUF
> > heap discovery and negotiation.
> > 
> 
> My initial reaction is that I thought dmabuf heaps are meant for use
> to allocate buffers for sharing across devices, which doesn't fit very
> well with having per-device heaps.
> 

That's true.

> For single-device allocations, would using the buffer allocation
> functionality of that device's native API be better in most
> cases? (Some other possibly relevant discussion at [1])
> 

That may be true for existing subsystems.

However, how about a new subsystem/API wanting to
leverage the heap API and avoid exposing yet another
allocator API?

And then, also, if we have an allocator API, perhaps
we could imagine a future where applications would only
need to worry about that, and not about each per-framework
allocator.

> I can see that this can save some boilerplate for devices that want
> to expose private chunks of memory, but might it also lead to 100
> aliases for the system's generic coherent memory pool?
> 

The idea (even if PoC) was to let drivers decide if they are special
enough to add themselves (via dev_dma_heap_add).

Aliasing the heap would certainly be too much (although trying
it for every device was a fun experiment to watch).

Thanks,
Ezequiel

> I wonder if a set of helpers to allow devices to expose whatever they
> want with minimal effort would be better.
> 
> Cheers,
> -Brian
> 
> 1. https://lore.kernel.org/dri-devel/57062477-30e7-a3de-6723-a50d03a402c4@kapsi.fi/
> 
> > Given Plumbers is just a couple weeks from now, I've submitted
> > a BoF proposal to discuss this, as perhaps it would make
> > sense to discuss this live?
> > 
> > Not-signed-off-by: Ezequiel Garcia <ezequiel@collabora.com>
Ezequiel Garcia Aug. 20, 2020, 8:15 a.m. UTC | #5
On Mon, 2020-08-17 at 20:49 -0700, James Jones wrote:
> On 8/17/20 8:18 AM, Brian Starkey wrote:
> > Hi Ezequiel,
> > 
> > On Sun, Aug 16, 2020 at 02:22:46PM -0300, Ezequiel Garcia wrote:
> > > This heap is basically a wrapper around DMA-API dma_alloc_attrs,
> > > which will allocate memory suitable for the given device.
> > > 
> > > The implementation is mostly a port of the Contiguous Videobuf2
> > > memory allocator (see videobuf2/videobuf2-dma-contig.c)
> > > over to the DMA-BUF Heap interface.
> > > 
> > > The intention of this allocator is to provide applications
> > > with a more system-agnostic API: the only thing the application
> > > needs to know is which device to get the buffer for.
> > > 
> > > Whether the buffer is backed by CMA, IOMMU or a DMA Pool
> > > is unknown to the application.
> > > 
> > > I'm not really expecting this patch to be correct or even
> > > a good idea, but just submitting it to start a discussion on DMA-BUF
> > > heap discovery and negotiation.
> > > 
> > 
> > My initial reaction is that I thought dmabuf heaps are meant for use
> > to allocate buffers for sharing across devices, which doesn't fit very
> > well with having per-device heaps.
> > 
> > For single-device allocations, would using the buffer allocation
> > functionality of that device's native API be better in most
> > cases? (Some other possibly relevant discussion at [1])
> > 
> > I can see that this can save some boilerplate for devices that want
> > to expose private chunks of memory, but might it also lead to 100
> > aliases for the system's generic coherent memory pool?
> > 
> > I wonder if a set of helpers to allow devices to expose whatever they
> > want with minimal effort would be better.
> 
> I'm rather interested on where this goes, as I was toying with using 
> some sort of heap ID as a basis for a "device-local" constraint in the 
> memory constraints proposals Simon and I will be discussing at XDC this 
> year.  It would be rather elegant if there was one type of heap ID used 
> universally throughout the kernel that could provide a unique handle for 
> the shared system memory heap(s), as well as accelerator-local heaps on 
> fancy NICs, GPUs, NN accelerators, capture devices, etc. so apps could 
> negotiate a location among themselves.  This patch seems to be a step 
> towards that in a way, but I agree it would be counterproductive if a 
> bunch of devices that were using the same underlying system memory ended 
> up each getting their own heap ID just because they used some SW 
> framework that worked that way.
> 
> Would appreciate it if you could send along a pointer to your BoF if it 
> happens!
> 

Here it is:

https://linuxplumbersconf.org/event/7/contributions/818/

It would be great to see you there and discuss this,
given I was hoping we could talk about how to meet a
userspace allocator library expectations as well.

Thanks,
Ezequiel

> Thanks,
> -James
> 
> > Cheers,
> > -Brian
> > 
> > 1. https://lore.kernel.org/dri-devel/57062477-30e7-a3de-6723-a50d03a402c4@kapsi.fi/
> > 
> > > Given Plumbers is just a couple weeks from now, I've submitted
> > > a BoF proposal to discuss this, as perhaps it would make
> > > sense to discuss this live?
> > > 
> > > Not-signed-off-by: Ezequiel Garcia <ezequiel@collabora.com>
Ezequiel Garcia Aug. 20, 2020, 8:36 a.m. UTC | #6
Hi John,

Thanks a ton for taking the time
to go thru this.

On Mon, 2020-08-17 at 21:13 -0700, John Stultz wrote:
> On Sun, Aug 16, 2020 at 10:23 AM Ezequiel Garcia <ezequiel@collabora.com> wrote:
> > This heap is basically a wrapper around DMA-API dma_alloc_attrs,
> > which will allocate memory suitable for the given device.
> > 
> > The implementation is mostly a port of the Contiguous Videobuf2
> > memory allocator (see videobuf2/videobuf2-dma-contig.c)
> > over to the DMA-BUF Heap interface.
> > 
> > The intention of this allocator is to provide applications
> > with a more system-agnostic API: the only thing the application
> > needs to know is which device to get the buffer for.
> > 
> > Whether the buffer is backed by CMA, IOMMU or a DMA Pool
> > is unknown to the application.
> 
> My hesitancy here is that the main reason we have DMA BUF Heaps, and
> ION before it, was to expose different types of memory allocations to
> userspace. The main premise that often these buffers are shared with
> multiple devices, which have differing constraints and it is userspace
> that best understands the path a buffer will take through a series of
> devices. So userspace is best positioned to determine what type of
> memory should be allocated to satisfy the various devices constraints
> (there were some design attempts to allow DMA BUFs to use multiple
> attach with deferred alloc at map time to handle this constraint
> solving in-kernel, but it was never adopted in practice).
> 
> This however, requires some system specific policy - implemented in
> the Android userspace by gralloc which maps "usage" types (device
> pipeline flows) to heaps. I liken it to fstab, which helps map mount
> points to partitions - it's not going to be the same on every system.
> 
> What you seem to be proposing seems a bit contrary to this premise -
> Userland doesn't know what type of memory it needs, but given a device
> can somehow find the heap it should allocate for? This seems to assume
> the buffer is only to be used with a single device?
> 

Single-device usage wasn't the intention. I see now that this patch
looks too naive and it's confusing. The idea is of course to get buffers
that can be shared.

I'm thinking you need to share your picture buffer with a decoder,
a renderer, possibly something else. Each with its own set
of constraints and limitations.	

Of course, a buffer that works for device A may be unsuitable for
device B and so this per-device heap is surely way too naive.

As you rightly mention, the main intention of this RFC is to
question exactly the current premise: "userspace knows".
I fail to see how (generic and system-agnostic) applications
will know which heap to use.

Just for completeness, let me throw out a simple example: i.MX 8M
and some Rockchip platforms share the same VPU block, except the
latter have an IOMMU.

So applications would need to query for the presence of an IOMMU
to decide whether to get the buffer from CMA or not.

> There was at some point a discussion where folks (maybe it was
> DanielV? I can't remember...) suggested having something like a sysfs
> device node link from a device to a dma-buf heap chardev. This seems
> like it would add a similar behavior as what you're proposing, however
> without adding possibly a ton of new device specific heaps to the
> /dev/dma_heap/ dir. However, we would potentially need any missing
> heap types added first.
> 
> > I'm not really expecting this patch to be correct or even
> > a good idea, but just submitting it to start a discussion on DMA-BUF
> > heap discovery and negotiation.
> > 
> > Given Plumbers is just a couple weeks from now, I've submitted
> > a BoF proposal to discuss this, as perhaps it would make
> > sense to discuss this live?
> 
> I do think it's an interesting idea. I agree that having every driver
> implement a dmabuf exporter is a bit silly, but I also think Brian's
> point that maybe we need some drm helper functions that provide
> similar functionality along with a more standardized device ioctl for
> single device allocations might be better.
> 

I'm unsure we should treat the single-device case specially.

Exposing device limitations and constraints properly, and allowing
some sort of negotiation, would hopefully solve both the single-device
and the shared requirements.

Thanks,
Ezequiel

> thanks
> -john
Brian Starkey Aug. 20, 2020, 9:14 a.m. UTC | #7
Hi,

On Thu, Aug 20, 2020 at 09:07:29AM +0100, Ezequiel Garcia wrote:
> > For single-device allocations, would using the buffer allocation
> > functionality of that device's native API be better in most
> > cases? (Some other possibly relevant discussion at [1])
> > 
> 
> That may be true for existing subsystems.
> 
> However, how about a new subsystem/API wanting to
> leverage the heap API and avoid exposing yet another
> allocator API?
> 
> And then, also, if we have an allocator API, perhaps
> we could imagine a future where applications would only
> need to worry about that, and not about each per-framework
> allocator.

Yeah both are reasonable points. I was thinking in the context of the
thread I linked where allocating lots of GEM handles for process-local
use is preferable to importing loads of dma_buf fds, but in that case
the userspace graphics driver is somewhat "special" rather than a
generic application in the traditional sense.

I do think the best usage model for applications though is to use
libraries which can hide the details, instead of going at the kernel
APIs directly.

> 
> > I can see that this can save some boilerplate for devices that want
> > to expose private chunks of memory, but might it also lead to 100
> > aliases for the system's generic coherent memory pool?
> > 
> 
> The idea (even if PoC) was to let drivers decide if they are special
> enough to add themselves (via dev_dma_heap_add).

OK, that makes sense. I think it's tricky to know how many "special"
chunks of memory will already be hooked up to the DMA infrastructure
and how many would need some extra/special work.

Cheers,
-Brian
Laurent Pinchart Aug. 20, 2020, 3:54 p.m. UTC | #8
Hi Ezequiel,

On Thu, Aug 20, 2020 at 05:36:40AM -0300, Ezequiel Garcia wrote:
> Hi John,
> 
> Thanks a ton for taking the time
> to go thru this.
> 
> On Mon, 2020-08-17 at 21:13 -0700, John Stultz wrote:
> > On Sun, Aug 16, 2020 at 10:23 AM Ezequiel Garcia <ezequiel@collabora.com> wrote:
> > > This heap is basically a wrapper around DMA-API dma_alloc_attrs,
> > > which will allocate memory suitable for the given device.
> > > 
> > > The implementation is mostly a port of the Contiguous Videobuf2
> > > memory allocator (see videobuf2/videobuf2-dma-contig.c)
> > > over to the DMA-BUF Heap interface.
> > > 
> > > The intention of this allocator is to provide applications
> > > with a more system-agnostic API: the only thing the application
> > > needs to know is which device to get the buffer for.
> > > 
> > > Whether the buffer is backed by CMA, IOMMU or a DMA Pool
> > > is unknown to the application.
> > 
> > My hesitancy here is that the main reason we have DMA BUF Heaps, and
> > ION before it, was to expose different types of memory allocations to
> > userspace. The main premise that often these buffers are shared with
> > multiple devices, which have differing constraints and it is userspace
> > that best understands the path a buffer will take through a series of
> > devices. So userspace is best positioned to determine what type of
> > memory should be allocated to satisfy the various devices constraints
> > (there were some design attempts to allow DMA BUFs to use multiple
> > attach with deferred alloc at map time to handle this constraint
> > solving in-kernel, but it was never adopted in practice).
> > 
> > This however, requires some system specific policy - implemented in
> > the Android userspace by gralloc which maps "usage" types (device
> > pipeline flows) to heaps. I liken it to fstab, which helps map mount
> > points to partitions - it's not going to be the same on every system.
> > 
> > What you seem to be proposing seems a bit contrary to this premise -
> > Userland doesn't know what type of memory it needs, but given a device
> > can somehow find the heap it should allocate for? This seems to assume
> > the buffer is only to be used with a single device?
> 
> Single-device usage wasn't the intention. I see now that this patch
> looks too naive and it's confusing. The idea is of course to get buffers
> that can be shared.
> 
> I'm thinking you need to share your picture buffer with a decoder,
> a renderer, possibly something else. Each with its own set
> of constraints and limitations.	
> 
> Of course, a buffer that works for device A may be unsuitable for
> device B and so this per-device heap is surely way too naive.
> 
> As you rightly mention, the main intention of this RFC is to
> question exactly the current premise: "userspace knows".
> I fail to see how will (generic and system-agnostic) applications
> know which heap to use.
> 
> Just for completion, let me throw a simple example: i.MX 8M
> and some Rockchip platforms share the same VPU block, except the
> latter have an IOMMU.
> 
> So applications would need to query an iommu presence
> to get buffer from CMA or not.

I don't think we can do this in a system-agnostic way. What I'd like to
see is an API for the kernel to expose opaque constraints for each
device, and a userspace library to reconcile constraints, selecting a
suitable heap, and producing heap-specific parameters for the
allocation.

The constraints need to be transported in a generic way, but the
contents can be opaque for applications. Only the library would need to
interpret them. This can be done with platform-specific code inside the
library. A separate API for the library to interact with the kernel to
further query or negotiate configuration parameters could be part of
that picture.

This would offer standardized APIs to applications (to gather
constraints, pass them to the constraint resolution library, and receive
a heap ID and heap parameters), while still allowing platform-specific
code in userspace.
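
To sketch what I mean - nothing below exists, all names are invented -
the application-facing side of such a library could be as small as:

#include <stddef.h>

struct constraint_blob {
	const void *data;	/* opaque constraints read from the kernel */
	size_t size;
};

struct allocation_plan {
	char heap[64];			/* e.g. "system", "cma", ... */
	unsigned long heap_flags;	/* heap-specific parameters */
};

/*
 * Reconcile the constraints of every device (and userspace user) that
 * will touch the buffer; returns 0 on success or a negative error code
 * if no common allocation exists.
 */
int constraints_resolve(const struct constraint_blob *blobs, size_t count,
			struct allocation_plan *plan);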

> > There was at some point a discussion where folks (maybe it was
> > DanielV? I can't remember...) suggested having something like a sysfs
> > device node link from a device to a dma-buf heap chardev. This seems
> > like it would add a similar behavior as what you're proposing, however
> > without adding possibly a ton of new device specific heaps to the
> > /dev/dma_heap/ dir. However, we would potentially need any missing
> > heap types added first.
> > 
> > > I'm not really expecting this patch to be correct or even
> > > a good idea, but just submitting it to start a discussion on DMA-BUF
> > > heap discovery and negotiation.
> > > 
> > > Given Plumbers is just a couple weeks from now, I've submitted
> > > a BoF proposal to discuss this, as perhaps it would make
> > > sense to discuss this live?
> > 
> > I do think it's an interesting idea. I agree that having every driver
> > implement a dmabuf exporter is a bit silly, but I also think Brian's
> > point that maybe we need some drm helper functions that provide
> > similar functionality along with a more standardized device ioctl for
> > single device allocations might be better.
> 
> I'm unsure we should treat single device specially.
>  
> Exposing device limitations and constraints properly,
> allowing some sort of negotation would hopefully solve
> single device and shared requirements.
James Jones Aug. 23, 2020, 8:04 p.m. UTC | #9
On 8/20/20 1:15 AM, Ezequiel Garcia wrote:
> On Mon, 2020-08-17 at 20:49 -0700, James Jones wrote:
>> On 8/17/20 8:18 AM, Brian Starkey wrote:
>>> Hi Ezequiel,
>>>
>>> On Sun, Aug 16, 2020 at 02:22:46PM -0300, Ezequiel Garcia wrote:
>>>> This heap is basically a wrapper around DMA-API dma_alloc_attrs,
>>>> which will allocate memory suitable for the given device.
>>>>
>>>> The implementation is mostly a port of the Contiguous Videobuf2
>>>> memory allocator (see videobuf2/videobuf2-dma-contig.c)
>>>> over to the DMA-BUF Heap interface.
>>>>
>>>> The intention of this allocator is to provide applications
>>>> with a more system-agnostic API: the only thing the application
>>>> needs to know is which device to get the buffer for.
>>>>
>>>> Whether the buffer is backed by CMA, IOMMU or a DMA Pool
>>>> is unknown to the application.
>>>>
>>>> I'm not really expecting this patch to be correct or even
>>>> a good idea, but just submitting it to start a discussion on DMA-BUF
>>>> heap discovery and negotiation.
>>>>
>>>
>>> My initial reaction is that I thought dmabuf heaps are meant for use
>>> to allocate buffers for sharing across devices, which doesn't fit very
>>> well with having per-device heaps.
>>>
>>> For single-device allocations, would using the buffer allocation
>>> functionality of that device's native API be better in most
>>> cases? (Some other possibly relevant discussion at [1])
>>>
>>> I can see that this can save some boilerplate for devices that want
>>> to expose private chunks of memory, but might it also lead to 100
>>> aliases for the system's generic coherent memory pool?
>>>
>>> I wonder if a set of helpers to allow devices to expose whatever they
>>> want with minimal effort would be better.
>>
>> I'm rather interested on where this goes, as I was toying with using
>> some sort of heap ID as a basis for a "device-local" constraint in the
>> memory constraints proposals Simon and I will be discussing at XDC this
>> year.  It would be rather elegant if there was one type of heap ID used
>> universally throughout the kernel that could provide a unique handle for
>> the shared system memory heap(s), as well as accelerator-local heaps on
>> fancy NICs, GPUs, NN accelerators, capture devices, etc. so apps could
>> negotiate a location among themselves.  This patch seems to be a step
>> towards that in a way, but I agree it would be counterproductive if a
>> bunch of devices that were using the same underlying system memory ended
>> up each getting their own heap ID just because they used some SW
>> framework that worked that way.
>>
>> Would appreciate it if you could send along a pointer to your BoF if it
>> happens!
>>
> 
> Here is it:
> 
> https://linuxplumbersconf.org/event/7/contributions/818/
> 
> It would be great to see you there and discuss this,
> given I was hoping we could talk about how to meet a
> userspace allocator library expectations as well.

Thanks!  I hadn't registered for LPC and it looks like it's sold out, 
but I'll try to watch the live stream.

This is very interesting, in that it looks like we're both trying to 
solve roughly the same set of problems but approaching it from different 
angles.  From what I gather, your approach is that a "heap" encompasses 
all the allocation constraints a device may have.

The approach Simon Ser and I are tossing around so far is somewhat 
different, but may potentially leverage dma-buf heaps a bit as well.

Our approach looks more like what I described at XDC a few years ago, 
where memory constraints for a given device's usage of an image are 
exposed up to applications, which can then somehow perform boolean 
intersection/union operations on them to arrive at a common set of 
constraints that describe something compatible with all the devices & 
usages desired (or fail to do so, and fall back to copying things around 
presumably).  I believe this is more flexible than your initial proposal 
in that devices often support multiple usages (E.g., different formats, 
different proprietary layouts represented by format modifiers, etc.), 
and it avoids adding a combinatorial number of heaps to manage that.
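
(As a toy example of that intersection step - not our actual proposal,
and assuming power-of-two alignments - merging two devices' constraints
could look like:)

#include <stdbool.h>
#include <stddef.h>

struct image_constraints {
	size_t pitch_align;	/* required pitch alignment in bytes */
	bool   contiguous;	/* needs physically contiguous memory */
};

static struct image_constraints merge(struct image_constraints a,
				      struct image_constraints b)
{
	return (struct image_constraints){
		/* keep the stricter of each requirement */
		.pitch_align = a.pitch_align > b.pitch_align ?
			       a.pitch_align : b.pitch_align,
		.contiguous  = a.contiguous || b.contiguous,
	};
}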

In my view, heaps are more like blobs of memory that can be allocated 
from in various different ways to satisfy constraints.  I realize heaps 
mean something specific in the dma-buf heap design (specifically, 
something closer to an association between an "allocation mechanism" and 
"physical memory"), but I hope we don't have massive heap/allocator 
mechanism proliferation due to constraints alone.  Perhaps some 
constraints, such as contiguous memory or device-local memory, are 
properly expressed as a specific heap, but consider the proliferation 
implied by even that simple pair of examples: How do you express 
contiguous device-local memory?  Do you need to spawn two heaps on the 
underlying device-local memory, one for contiguous allocations and one 
for non-contiguous allocations?  Seems excessive.

Of course, our approach also has downsides and is still being worked on. 
  For example, it works best in an ideal world where all the allocators 
available understand all the constraints that exist.  Dealing with a 
reality where there are probably a handful of allocators, another 
handful of userspace libraries and APIs, and still more applications 
trying to make use of all this is one of the larger remaining challenges 
of the design.

We'll present our work at XDC 2020.  Hope you can check that out as well!

Thanks,
-James

> Thanks,
> Ezequiel
> 
>> Thanks,
>> -James
>>
>>> Cheers,
>>> -Brian
>>>
>>> 1. https://lore.kernel.org/dri-devel/57062477-30e7-a3de-6723-a50d03a402c4@kapsi.fi/
>>>
>>>> Given Plumbers is just a couple weeks from now, I've submitted
>>>> a BoF proposal to discuss this, as perhaps it would make
>>>> sense to discuss this live?
>>>>
>>>> Not-signed-off-by: Ezequiel Garcia <ezequiel@collabora.com>
> 
>
Laurent Pinchart Aug. 23, 2020, 8:46 p.m. UTC | #10
Hi James,

On Sun, Aug 23, 2020 at 01:04:43PM -0700, James Jones wrote:
> On 8/20/20 1:15 AM, Ezequiel Garcia wrote:
> > On Mon, 2020-08-17 at 20:49 -0700, James Jones wrote:
> >> On 8/17/20 8:18 AM, Brian Starkey wrote:
> >>> On Sun, Aug 16, 2020 at 02:22:46PM -0300, Ezequiel Garcia wrote:
> >>>> This heap is basically a wrapper around DMA-API dma_alloc_attrs,
> >>>> which will allocate memory suitable for the given device.
> >>>>
> >>>> The implementation is mostly a port of the Contiguous Videobuf2
> >>>> memory allocator (see videobuf2/videobuf2-dma-contig.c)
> >>>> over to the DMA-BUF Heap interface.
> >>>>
> >>>> The intention of this allocator is to provide applications
> >>>> with a more system-agnostic API: the only thing the application
> >>>> needs to know is which device to get the buffer for.
> >>>>
> >>>> Whether the buffer is backed by CMA, IOMMU or a DMA Pool
> >>>> is unknown to the application.
> >>>>
> >>>> I'm not really expecting this patch to be correct or even
> >>>> a good idea, but just submitting it to start a discussion on DMA-BUF
> >>>> heap discovery and negotiation.
> >>>>
> >>>
> >>> My initial reaction is that I thought dmabuf heaps are meant for use
> >>> to allocate buffers for sharing across devices, which doesn't fit very
> >>> well with having per-device heaps.
> >>>
> >>> For single-device allocations, would using the buffer allocation
> >>> functionality of that device's native API be better in most
> >>> cases? (Some other possibly relevant discussion at [1])
> >>>
> >>> I can see that this can save some boilerplate for devices that want
> >>> to expose private chunks of memory, but might it also lead to 100
> >>> aliases for the system's generic coherent memory pool?
> >>>
> >>> I wonder if a set of helpers to allow devices to expose whatever they
> >>> want with minimal effort would be better.
> >>
> >> I'm rather interested on where this goes, as I was toying with using
> >> some sort of heap ID as a basis for a "device-local" constraint in the
> >> memory constraints proposals Simon and I will be discussing at XDC this
> >> year.  It would be rather elegant if there was one type of heap ID used
> >> universally throughout the kernel that could provide a unique handle for
> >> the shared system memory heap(s), as well as accelerator-local heaps on
> >> fancy NICs, GPUs, NN accelerators, capture devices, etc. so apps could
> >> negotiate a location among themselves.  This patch seems to be a step
> >> towards that in a way, but I agree it would be counterproductive if a
> >> bunch of devices that were using the same underlying system memory ended
> >> up each getting their own heap ID just because they used some SW
> >> framework that worked that way.
> >>
> >> Would appreciate it if you could send along a pointer to your BoF if it
> >> happens!
> > 
> > Here is it:
> > 
> > https://linuxplumbersconf.org/event/7/contributions/818/
> > 
> > It would be great to see you there and discuss this,
> > given I was hoping we could talk about how to meet a
> > userspace allocator library expectations as well.
> 
> Thanks!  I hadn't registered for LPC and it looks like it's sold out, 
> but I'll try to watch the live stream.
> 
> This is very interesting, in that it looks like we're both trying to 
> solve roughly the same set of problems but approaching it from different 
> angles.  From what I gather, your approach is that a "heap" encompasses 
> all the allocation constraints a device may have.
> 
> The approach Simon Ser and I are tossing around so far is somewhat 
> different, but may potentially leverage dma-buf heaps a bit as well.
> 
> Our approach looks more like what I described at XDC a few years ago, 
> where memory constraints for a given device's usage of an image are 
> exposed up to applications, which can then somehow perform boolean 
> intersection/union operations on them to arrive at a common set of 
> constraints that describe something compatible with all the devices & 
> usages desired (or fail to do so, and fall back to copying things around 
> presumably).  I believe this is more flexible than your initial proposal 
> in that devices often support multiple usages (E.g., different formats, 
> different proprietary layouts represented by format modifiers, etc.), 
> and it avoids adding a combinatorial number of heaps to manage that.
> 
> In my view, heaps are more like blobs of memory that can be allocated 
> from in various different ways to satisfy constraints.  I realize heaps 
> mean something specific in the dma-buf heap design (specifically, 
> something closer to an association between an "allocation mechanism" and 
> "physical memory"), but I hope we don't have massive heap/allocator 
> mechanism proliferation due to constraints alone.  Perhaps some 
> constraints, such as contiguous memory or device-local memory, are 
> properly expressed as a specific heap, but consider the proliferation 
> implied by even that simple pair of examples: How do you express 
> contiguous device-local memory?  Do you need to spawn two heaps on the 
> underlying device-local memory, one for contiguous allocations and one 
> for non-contiguous allocations?  Seems excessive.
> 
> Of course, our approach also has downsides and is still being worked on. 
>   For example, it works best in an ideal world where all the allocators 
> available understand all the constraints that exist.

Shouldn't allocators be decoupled from constraints? In my imagination I
see devices exposing constraints, and allocators exposing parameters,
with a userspace library to reconcile the constraints and produce
allocator parameters from them.

> Dealing with a 
> reality where there are probably a handful of allocators, another 
> handful of userspace libraries and APIs, and still more applications 
> trying to make use of all this is one of the larger remaining challenges 
> of the design.
> 
> We'll present our work at XDC 2020.  Hope you can check that out as well!
> 
> >>> 1. https://lore.kernel.org/dri-devel/57062477-30e7-a3de-6723-a50d03a402c4@kapsi.fi/
> >>>
> >>>> Given Plumbers is just a couple weeks from now, I've submitted
> >>>> a BoF proposal to discuss this, as perhaps it would make
> >>>> sense to discuss this live?
> >>>>
> >>>> Not-signed-off-by: Ezequiel Garcia <ezequiel@collabora.com>
James Jones Aug. 23, 2020, 10:53 p.m. UTC | #11
On 8/23/20 1:46 PM, Laurent Pinchart wrote:
> Hi James,
> 
> On Sun, Aug 23, 2020 at 01:04:43PM -0700, James Jones wrote:
>> On 8/20/20 1:15 AM, Ezequiel Garcia wrote:
>>> On Mon, 2020-08-17 at 20:49 -0700, James Jones wrote:
>>>> On 8/17/20 8:18 AM, Brian Starkey wrote:
>>>>> On Sun, Aug 16, 2020 at 02:22:46PM -0300, Ezequiel Garcia wrote:
>>>>>> This heap is basically a wrapper around DMA-API dma_alloc_attrs,
>>>>>> which will allocate memory suitable for the given device.
>>>>>>
>>>>>> The implementation is mostly a port of the Contiguous Videobuf2
>>>>>> memory allocator (see videobuf2/videobuf2-dma-contig.c)
>>>>>> over to the DMA-BUF Heap interface.
>>>>>>
>>>>>> The intention of this allocator is to provide applications
>>>>>> with a more system-agnostic API: the only thing the application
>>>>>> needs to know is which device to get the buffer for.
>>>>>>
>>>>>> Whether the buffer is backed by CMA, IOMMU or a DMA Pool
>>>>>> is unknown to the application.
>>>>>>
>>>>>> I'm not really expecting this patch to be correct or even
>>>>>> a good idea, but just submitting it to start a discussion on DMA-BUF
>>>>>> heap discovery and negotiation.
>>>>>>
>>>>>
>>>>> My initial reaction is that I thought dmabuf heaps are meant for use
>>>>> to allocate buffers for sharing across devices, which doesn't fit very
>>>>> well with having per-device heaps.
>>>>>
>>>>> For single-device allocations, would using the buffer allocation
>>>>> functionality of that device's native API be better in most
>>>>> cases? (Some other possibly relevant discussion at [1])
>>>>>
>>>>> I can see that this can save some boilerplate for devices that want
>>>>> to expose private chunks of memory, but might it also lead to 100
>>>>> aliases for the system's generic coherent memory pool?
>>>>>
>>>>> I wonder if a set of helpers to allow devices to expose whatever they
>>>>> want with minimal effort would be better.
>>>>
>>>> I'm rather interested on where this goes, as I was toying with using
>>>> some sort of heap ID as a basis for a "device-local" constraint in the
>>>> memory constraints proposals Simon and I will be discussing at XDC this
>>>> year.  It would be rather elegant if there was one type of heap ID used
>>>> universally throughout the kernel that could provide a unique handle for
>>>> the shared system memory heap(s), as well as accelerator-local heaps on
>>>> fancy NICs, GPUs, NN accelerators, capture devices, etc. so apps could
>>>> negotiate a location among themselves.  This patch seems to be a step
>>>> towards that in a way, but I agree it would be counterproductive if a
>>>> bunch of devices that were using the same underlying system memory ended
>>>> up each getting their own heap ID just because they used some SW
>>>> framework that worked that way.
>>>>
>>>> Would appreciate it if you could send along a pointer to your BoF if it
>>>> happens!
>>>
>>> Here is it:
>>>
>>> https://linuxplumbersconf.org/event/7/contributions/818/
>>>
>>> It would be great to see you there and discuss this,
>>> given I was hoping we could talk about how to meet a
>>> userspace allocator library expectations as well.
>>
>> Thanks!  I hadn't registered for LPC and it looks like it's sold out,
>> but I'll try to watch the live stream.
>>
>> This is very interesting, in that it looks like we're both trying to
>> solve roughly the same set of problems but approaching it from different
>> angles.  From what I gather, your approach is that a "heap" encompasses
>> all the allocation constraints a device may have.
>>
>> The approach Simon Ser and I are tossing around so far is somewhat
>> different, but may potentially leverage dma-buf heaps a bit as well.
>>
>> Our approach looks more like what I described at XDC a few years ago,
>> where memory constraints for a given device's usage of an image are
>> exposed up to applications, which can then somehow perform boolean
>> intersection/union operations on them to arrive at a common set of
>> constraints that describe something compatible with all the devices &
>> usages desired (or fail to do so, and fall back to copying things around
>> presumably).  I believe this is more flexible than your initial proposal
>> in that devices often support multiple usages (E.g., different formats,
>> different proprietary layouts represented by format modifiers, etc.),
>> and it avoids adding a combinatorial number of heaps to manage that.
>>
>> In my view, heaps are more like blobs of memory that can be allocated
>> from in various different ways to satisfy constraints.  I realize heaps
>> mean something specific in the dma-buf heap design (specifically,
>> something closer to an association between an "allocation mechanism" and
>> "physical memory"), but I hope we don't have massive heap/allocator
>> mechanism proliferation due to constraints alone.  Perhaps some
>> constraints, such as contiguous memory or device-local memory, are
>> properly expressed as a specific heap, but consider the proliferation
>> implied by even that simple pair of examples: How do you express
>> contiguous device-local memory?  Do you need to spawn two heaps on the
>> underlying device-local memory, one for contiguous allocations and one
>> for non-contiguous allocations?  Seems excessive.
>>
>> Of course, our approach also has downsides and is still being worked on.
>>    For example, it works best in an ideal world where all the allocators
>> available understand all the constraints that exist.
> 
> Shouldn't allocators be decoupled of constraints ? In my imagination I
> see devices exposing constraints, and allocators exposing parameters,
> with a userspace library to reconcile the constraints and produce
> allocator parameters from them.

Perhaps another level of abstraction would help.  I'll have to think 
about that.

However, as far as I can tell, it wouldn't remove the need to 
communicate a lot of constraints from multiple engines/devices/etc. to 
the allocator (likely a single allocator.  I'd be interested to know if 
anyone has a design that effectively uses multiple allocators to satisfy 
a single allocation request, but I haven't come up with a good one) 
somehow.  Either the constraints are directly used as the parameters, or 
there's a translation/second level of abstraction, but either way much 
of the information needs to make it to the allocator, or represent the 
need to use a particular allocator.  Simple things like pitch and offset 
alignment can be done without help from a kernel-level allocator, but 
others such as cache coherency, physical memory bank placement, or 
device-local memory will need to make it all the way down to the kernel 
somehow, I believe.
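
(For instance, the userspace-only part really is as small as this,
assuming power-of-two alignments:)

#include <stddef.h>

static inline size_t align_up(size_t v, size_t a)
{
	return (v + a - 1) & ~(a - 1);
}

/* e.g. pitch = align_up(1920 * 4, 256) = 7680 bytes per row */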

Thanks,
-James

>> Dealing with a
>> reality where there are probably a handful of allocators, another
>> handful of userspace libraries and APIs, and still more applications
>> trying to make use of all this is one of the larger remaining challenges
>> of the design.
>>
>> We'll present our work at XDC 2020.  Hope you can check that out as well!
>>
>>>>> 1. https://lore.kernel.org/dri-devel/57062477-30e7-a3de-6723-a50d03a402c4@kapsi.fi/
>>>>>
>>>>>> Given Plumbers is just a couple weeks from now, I've submitted
>>>>>> a BoF proposal to discuss this, as perhaps it would make
>>>>>> sense to discuss this live?
>>>>>>
>>>>>> Not-signed-off-by: Ezequiel Garcia <ezequiel@collabora.com>
>
Nicolas Dufresne Aug. 25, 2020, 8:26 p.m. UTC | #12
Le jeudi 20 août 2020 à 18:54 +0300, Laurent Pinchart a écrit :
> Hi Ezequiel,
> 
> On Thu, Aug 20, 2020 at 05:36:40AM -0300, Ezequiel Garcia wrote:
> > Hi John,
> > 
> > Thanks a ton for taking the time
> > to go thru this.
> > 
> > On Mon, 2020-08-17 at 21:13 -0700, John Stultz wrote:
> > > On Sun, Aug 16, 2020 at 10:23 AM Ezequiel Garcia <ezequiel@collabora.com> wrote:
> > > > This heap is basically a wrapper around DMA-API dma_alloc_attrs,
> > > > which will allocate memory suitable for the given device.
> > > > 
> > > > The implementation is mostly a port of the Contiguous Videobuf2
> > > > memory allocator (see videobuf2/videobuf2-dma-contig.c)
> > > > over to the DMA-BUF Heap interface.
> > > > 
> > > > The intention of this allocator is to provide applications
> > > > with a more system-agnostic API: the only thing the application
> > > > needs to know is which device to get the buffer for.
> > > > 
> > > > Whether the buffer is backed by CMA, IOMMU or a DMA Pool
> > > > is unknown to the application.
> > > 
> > > My hesitancy here is that the main reason we have DMA BUF Heaps, and
> > > ION before it, was to expose different types of memory allocations to
> > > userspace. The main premise that often these buffers are shared with
> > > multiple devices, which have differing constraints and it is userspace
> > > that best understands the path a buffer will take through a series of
> > > devices. So userspace is best positioned to determine what type of
> > > memory should be allocated to satisfy the various devices constraints
> > > (there were some design attempts to allow DMA BUFs to use multiple
> > > attach with deferred alloc at map time to handle this constraint
> > > solving in-kernel, but it was never adopted in practice).
> > > 
> > > This however, requires some system specific policy - implemented in
> > > the Android userspace by gralloc which maps "usage" types (device
> > > pipeline flows) to heaps. I liken it to fstab, which helps map mount
> > > points to partitions - it's not going to be the same on every system.
> > > 
> > > What you seem to be proposing seems a bit contrary to this premise -
> > > Userland doesn't know what type of memory it needs, but given a device
> > > can somehow find the heap it should allocate for? This seems to assume
> > > the buffer is only to be used with a single device?
> > 
> > Single-device usage wasn't the intention. I see now that this patch
> > looks too naive and it's confusing. The idea is of course to get buffers
> > that can be shared.
> > 
> > I'm thinking you need to share your picture buffer with a decoder,
> > a renderer, possibly something else. Each with its own set
> > of constraints and limitations.	
> > 
> > Of course, a buffer that works for device A may be unsuitable for
> > device B and so this per-device heap is surely way too naive.
> > 
> > As you rightly mention, the main intention of this RFC is to
> > question exactly the current premise: "userspace knows".
> > I fail to see how will (generic and system-agnostic) applications
> > know which heap to use.
> > 
> > Just for completion, let me throw a simple example: i.MX 8M
> > and some Rockchip platforms share the same VPU block, except the
> > latter have an IOMMU.
> > 
> > So applications would need to query an iommu presence
> > to get buffer from CMA or not.
> 
> I don't think we can do this in a system-agnostic way. What I'd like to
> see is an API for the kernel to expose opaque constraints for each

Please take into consideration that constraints can also come from
userspace. These days there are things we may want to do on the CPU,
and the SIMD instructions and the associated algorithms introduce
constraints too. If these constraints become too opaque, you will also
potentially limit some valid CPU interaction with HW in terms of
buffer access. CPU constraints today are fairly small, and one should
be able to express them, I believe. Of course, these are not
media-agnostic; some constraints may depend on the media (an image
buffer, a matrix buffer or an audio buffer) and the associated
algorithm to be used.

An example would be an image buffer produced or modified on the CPU
but encoded with HW.

> device, and a userspace library to reconcile constraints, selecting a
> suitable heap, and producing heap-specific parameters for the
> allocation.
> 
> The constraints need to be transported in a generic way, but the
> contents can be opaque for applications. Only the library would need to
> interpret them. This can be done with platform-specific code inside the
> library. A separate API for the library to interect with the kernel to
> further query or negotiate configuration parameters could be part of
> that picture.
> 
> This would offer standardized APIs to applications (to gather
> constraints, pass them to the constraint resolution library, and receive
> a heap ID and heap parameters), while still allowing platform-specific
> code in userspace.
> 
> > > There was at some point a discussion where folks (maybe it was
> > > DanielV? I can't remember...) suggested having something like a sysfs
> > > device node link from a device to a dma-buf heap chardev. This seems
> > > like it would add a similar behavior as what you're proposing, however
> > > without adding possibly a ton of new device specific heaps to the
> > > /dev/dma_heap/ dir. However, we would potentially need any missing
> > > heap types added first.
> > > 
> > > > I'm not really expecting this patch to be correct or even
> > > > a good idea, but just submitting it to start a discussion on DMA-BUF
> > > > heap discovery and negotiation.
> > > > 
> > > > Given Plumbers is just a couple weeks from now, I've submitted
> > > > a BoF proposal to discuss this, as perhaps it would make
> > > > sense to discuss this live?
> > > 
> > > I do think it's an interesting idea. I agree that having every driver
> > > implement a dmabuf exporter is a bit silly, but I also think Brian's
> > > point that maybe we need some drm helper functions that provide
> > > similar functionality along with a more standardized device ioctl for
> > > single device allocations might be better.
> > 
> > I'm unsure we should treat single device specially.
> >  
> > Exposing device limitations and constraints properly,
> > allowing some sort of negotation would hopefully solve
> > single device and shared requirements.
Simon Ser Aug. 27, 2020, 10:05 a.m. UTC | #13
On Tuesday, August 25, 2020 10:26 PM, Nicolas Dufresne <nicolas.dufresne@collabora.com> wrote:

> > I don't think we can do this in a system-agnostic way. What I'd like to
> > see is an API for the kernel to expose opaque constraints for each
>
> Please, take into consideration that constraints can also come from
> userspace. These days, there exist things we may want to do using the
> CPU, but the SIMD instructions and the associated algorithm will
> introduce constraints too. If these constraints comes too opaque, but
> you will also potentially limit some valid CPU interaction with HW in
> term of buffer access. CPU constraints todays are fairly small and one
> should be able to express them I believe. Of course, these are not
> media agnostic, some constraints may depends on the media (like an
> image buffer, a matrix buffer or audio buffer) and the associated
> algorithm to be used.
>
> An example would be an image buffers produced or modified on CPU but
> encoded with HW.

Actually, in the proposal we're working on, constraints may come from
user-space too. Rendering APIs (i.e. Mesa) will need to expose
constraints for buffers they can render to or texture from.

Constraints would be opaque for users like compositors (i.e. programs
that merely pass constraints around and merge them don't need to
understand them), but not for user-space drivers.
Simon Ser Aug. 27, 2020, 2:52 p.m. UTC | #14
On Thursday, August 20, 2020 10:15 AM, Ezequiel Garcia <ezequiel@collabora.com> wrote:

> > I'm rather interested on where this goes, as I was toying with using
> > some sort of heap ID as a basis for a "device-local" constraint in the
> > memory constraints proposals Simon and I will be discussing at XDC this
> > year. It would be rather elegant if there was one type of heap ID used
> > universally throughout the kernel that could provide a unique handle for
> > the shared system memory heap(s), as well as accelerator-local heaps on
> > fancy NICs, GPUs, NN accelerators, capture devices, etc. so apps could
> > negotiate a location among themselves. This patch seems to be a step
> > towards that in a way, but I agree it would be counterproductive if a
> > bunch of devices that were using the same underlying system memory ended
> > up each getting their own heap ID just because they used some SW
> > framework that worked that way.
> > Would appreciate it if you could send along a pointer to your BoF if it
> > happens!
>
> Here is it:
>
> https://linuxplumbersconf.org/event/7/contributions/818/
>
> It would be great to see you there and discuss this,
> given I was hoping we could talk about how to meet a
> userspace allocator library expectations as well.

Unfortunately there's no livestream for BoFs. Would be very interested
in having a summary of the discussions once the BoF is over!
Ezequiel Garcia Aug. 31, 2020, 3:04 a.m. UTC | #15
Dear all,

Here are the notes we took during the BoF.

I believe the meeting was super interesting.

Although it felt a bit short for the topic,
we left with a few interesting ideas.

Thanks everyone!
Ezequiel

---
LPC 2020 BoF: Negotiating DMA-BUF Heaps

Attendees:
* Brian Starkey
* Daniel Stone
* Ezequiel Garcia
* James Jones
* John Reitan
* Laura Abbott
* Laurent Pinchart
* Sumit Semwal
* Robert Beckett

# Replacing subsystem memory allocators with dma-buf heaps

* Laurent: we should not have subsystems implement their own allocators.
  Using heaps could be a good idea.

* Laura: Wary of adding too much support to heaps,
  based on the ION experience.

* Laurent: V4L2: most drivers use videobuf2,
  which is a fairly complex piece of code.
  Its three allocators (sg, contig and vmalloc)
  are fairly generic and not video-specific, so why can't they just use heaps?

* Brian: The in-kernel API will most likely just care about dma-bufs,
  not necessarily FDs. This was discussed recently, see "Role of DMA Heaps
  vs GEM in allocation", https://www.spinics.net/lists/dri-devel/msg268103.html

* Do we expect to get a non-file descriptor identifier for a dma-buf? No.

* Laurent proposes a two-step API (similar to the existing GEM API),
  where we have one interface to allocate a buffer, with an identifier
  local to a process, and then another interface to wrap the buffer
  in a dma-buf (fd). See the sketch after these notes.

* If buffers are not meant to be shared, then we might want to avoid
  the DMA-BUF design entirely. As Sumit mentioned, the fundamental
  idea behind DMA-BUF is that buffers are expected to be shared.
  OTOH, it was mentioned that sometimes we don't know if a buffer
  will be shared or not, so that's why the ability to wrap a buffer
  in a dma-buf is useful.

* New subsystems would possibly want to avoid implementing
  their own allocator interface. But unfortunately, we don't want
  to produce an fd per buffer, so that will mean a new subsystem
  will eventually require its own API (GEM-like).
  If a subsystem doesn't need many buffers, and the FD semantics are fine,
  then it would be acceptable to avoid a subsystem-specific API.

* It would be interesting to experiment with replacing videobuf2-dma-contig
  with plain dma-buf heap usage, and see how much code we'd save.

* John Stultz has ideas around providing in-kernel accessors
  for the heaps - the idea is for drivers to not have to implement
  full exporter functionality for an already existing dma-buf heap type.

* A drawback of this idea of reusing dma-buf heaps to allocate buffers
  is that it means marking every buffer that gets exported as shareable.

* The benefit of having a centralized implementation would be unified
  semantics and reusable concepts that future APIs can be built around,
  rather than trying to create unified APIs around disparate kernel allocation
  APIs at the userspace level only.

* Robert: Is there an in-kernel user for the in-kernel dma-buf request?
  A possible answer would be for scratch buffers. The idea would be
  to avoid getting details wrong. However, doing this would allow every
  buffer to be exportable. Also, it sounds like this means
  re-implementing DMA-API?

* DMA-BUFs are designed for sharing; they are not necessarily an allocator.

* Want something to expose device-local memory to the rest of the kernel.
  Could this be a dma-buf heap?
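
A minimal sketch of the two-step pattern referenced in the notes above,
shown with the existing DRM dumb-buffer and PRIME ioctls (libdrm); a
heap-based equivalent would follow the same shape but does not exist today:

/*
 * Step 1 allocates a buffer and returns a handle local to this DRM file;
 * step 2 only creates a dma-buf file descriptor when sharing is needed.
 */
#include <stdint.h>
#include <xf86drm.h>

static int alloc_and_maybe_share(int drm_fd, uint32_t width, uint32_t height,
                                 int need_sharing, int *out_prime_fd)
{
        struct drm_mode_create_dumb create = {
                .width = width,
                .height = height,
                .bpp = 32,
        };
        int ret;

        /* Step 1: process-local handle, no file descriptor involved. */
        ret = drmIoctl(drm_fd, DRM_IOCTL_MODE_CREATE_DUMB, &create);
        if (ret)
                return ret;

        /* Step 2: wrap the handle in a dma-buf fd only if it will be shared. */
        if (need_sharing)
                ret = drmPrimeHandleToFD(drm_fd, create.handle, DRM_CLOEXEC,
                                         out_prime_fd);

        return ret;
}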
Laurent Pinchart Aug. 31, 2020, 3:08 p.m. UTC | #16
Hi James,

On Sun, Aug 23, 2020 at 03:53:50PM -0700, James Jones wrote:
> On 8/23/20 1:46 PM, Laurent Pinchart wrote:
> > On Sun, Aug 23, 2020 at 01:04:43PM -0700, James Jones wrote:
> >> On 8/20/20 1:15 AM, Ezequiel Garcia wrote:
> >>> On Mon, 2020-08-17 at 20:49 -0700, James Jones wrote:
> >>>> On 8/17/20 8:18 AM, Brian Starkey wrote:
> >>>>> On Sun, Aug 16, 2020 at 02:22:46PM -0300, Ezequiel Garcia wrote:
> >>>>>> This heap is basically a wrapper around DMA-API dma_alloc_attrs,
> >>>>>> which will allocate memory suitable for the given device.
> >>>>>>
> >>>>>> The implementation is mostly a port of the Contiguous Videobuf2
> >>>>>> memory allocator (see videobuf2/videobuf2-dma-contig.c)
> >>>>>> over to the DMA-BUF Heap interface.
> >>>>>>
> >>>>>> The intention of this allocator is to provide applications
> >>>>>> with a more system-agnostic API: the only thing the application
> >>>>>> needs to know is which device to get the buffer for.
> >>>>>>
> >>>>>> Whether the buffer is backed by CMA, IOMMU or a DMA Pool
> >>>>>> is unknown to the application.
> >>>>>>
> >>>>>> I'm not really expecting this patch to be correct or even
> >>>>>> a good idea, but just submitting it to start a discussion on DMA-BUF
> >>>>>> heap discovery and negotiation.
> >>>>>>
> >>>>>
> >>>>> My initial reaction is that I thought dmabuf heaps are meant for use
> >>>>> to allocate buffers for sharing across devices, which doesn't fit very
> >>>>> well with having per-device heaps.
> >>>>>
> >>>>> For single-device allocations, would using the buffer allocation
> >>>>> functionality of that device's native API be better in most
> >>>>> cases? (Some other possibly relevant discussion at [1])
> >>>>>
> >>>>> I can see that this can save some boilerplate for devices that want
> >>>>> to expose private chunks of memory, but might it also lead to 100
> >>>>> aliases for the system's generic coherent memory pool?
> >>>>>
> >>>>> I wonder if a set of helpers to allow devices to expose whatever they
> >>>>> want with minimal effort would be better.
> >>>>
> >>>> I'm rather interested on where this goes, as I was toying with using
> >>>> some sort of heap ID as a basis for a "device-local" constraint in the
> >>>> memory constraints proposals Simon and I will be discussing at XDC this
> >>>> year.  It would be rather elegant if there was one type of heap ID used
> >>>> universally throughout the kernel that could provide a unique handle for
> >>>> the shared system memory heap(s), as well as accelerator-local heaps on
> >>>> fancy NICs, GPUs, NN accelerators, capture devices, etc. so apps could
> >>>> negotiate a location among themselves.  This patch seems to be a step
> >>>> towards that in a way, but I agree it would be counterproductive if a
> >>>> bunch of devices that were using the same underlying system memory ended
> >>>> up each getting their own heap ID just because they used some SW
> >>>> framework that worked that way.
> >>>>
> >>>> Would appreciate it if you could send along a pointer to your BoF if it
> >>>> happens!
> >>>
> >>> Here is it:
> >>>
> >>> https://linuxplumbersconf.org/event/7/contributions/818/
> >>>
> >>> It would be great to see you there and discuss this,
> >>> given I was hoping we could talk about how to meet a
> >>> userspace allocator library expectations as well.
> >>
> >> Thanks!  I hadn't registered for LPC and it looks like it's sold out,
> >> but I'll try to watch the live stream.
> >>
> >> This is very interesting, in that it looks like we're both trying to
> >> solve roughly the same set of problems but approaching it from different
> >> angles.  From what I gather, your approach is that a "heap" encompasses
> >> all the allocation constraints a device may have.
> >>
> >> The approach Simon Ser and I are tossing around so far is somewhat
> >> different, but may potentially leverage dma-buf heaps a bit as well.
> >>
> >> Our approach looks more like what I described at XDC a few years ago,
> >> where memory constraints for a given device's usage of an image are
> >> exposed up to applications, which can then somehow perform boolean
> >> intersection/union operations on them to arrive at a common set of
> >> constraints that describe something compatible with all the devices &
> >> usages desired (or fail to do so, and fall back to copying things around
> >> presumably).  I believe this is more flexible than your initial proposal
> >> in that devices often support multiple usages (E.g., different formats,
> >> different proprietary layouts represented by format modifiers, etc.),
> >> and it avoids adding a combinatorial number of heaps to manage that.
> >>
> >> In my view, heaps are more like blobs of memory that can be allocated
> >> from in various different ways to satisfy constraints.  I realize heaps
> >> mean something specific in the dma-buf heap design (specifically,
> >> something closer to an association between an "allocation mechanism" and
> >> "physical memory"), but I hope we don't have massive heap/allocator
> >> mechanism proliferation due to constraints alone.  Perhaps some
> >> constraints, such as contiguous memory or device-local memory, are
> >> properly expressed as a specific heap, but consider the proliferation
> >> implied by even that simple pair of examples: How do you express
> >> contiguous device-local memory?  Do you need to spawn two heaps on the
> >> underlying device-local memory, one for contiguous allocations and one
> >> for non-contiguous allocations?  Seems excessive.
> >>
> >> Of course, our approach also has downsides and is still being worked on.
> >>    For example, it works best in an ideal world where all the allocators
> >> available understand all the constraints that exist.
> > 
> > Shouldn't allocators be decoupled of constraints ? In my imagination I
> > see devices exposing constraints, and allocators exposing parameters,
> > with a userspace library to reconcile the constraints and produce
> > allocator parameters from them.
> 
> Perhaps another level of abstraction would help.  I'll have to think 
> about that.
> 
> However, as far as I can tell, it wouldn't remove the need to 
> communicate a lot of constraints from multiple engines/devices/etc. to 
> the allocator (likely a single allocator.  I'd be interested to know if 
> anyone has a design that effectively uses multiple allocators to satisfy 
> a single allocation request, but I haven't come up with a good one) 
> somehow.  Either the constraints are directly used as the parameters, or 
> there's a translation/second level of abstraction, but either way much 
> of the information needs to make it to the allocator, or represent the 
> need to use a particular allocator.  Simple things like pitch and offset 
> alignment can be done without help from a kernel-level allocator, but 
> others such as cache coherency, physical memory bank placement, or 
> device-local memory will need to make it all the way down to the kernel 
> somehow, I believe.

I fully agree that we'll need kernel support, but I don't think the
constraints reporting API and the allocator API need to speak the same
language. For instance, drivers will report alignment constraints, the
userspace allocator library will translate that to a pitch value, and
pass it to the allocator as an allocation parameter. The allocator won't
know about alignment constraints. That's a simple example, let's see how
it turns out with more complex constraints. With a centralized userspace
library we have the ability to decouple the two sides, which I believe
can be useful to keep the complexity of constraints and allocation
parameters (as) low (as possible).
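
To make the example above concrete, a trivial sketch of that translation
(the structures and helper are hypothetical, only the arithmetic matters):

#include <stdint.h>

struct device_constraints {
        uint32_t pitch_alignment;       /* reported by the driver, in bytes */
};

struct alloc_params {
        uint32_t pitch;                 /* what the allocator actually consumes */
        uint64_t size;
};

static void constraints_to_params(const struct device_constraints *c,
                                  uint32_t width, uint32_t height,
                                  uint32_t bytes_per_pixel,
                                  struct alloc_params *p)
{
        uint32_t pitch = width * bytes_per_pixel;

        /* Round up to the alignment constraint (assumed power of two). */
        pitch = (pitch + c->pitch_alignment - 1) & ~(c->pitch_alignment - 1);

        p->pitch = pitch;
        p->size = (uint64_t)pitch * height;
}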

> >> Dealing with a
> >> reality where there are probably a handful of allocators, another
> >> handful of userspace libraries and APIs, and still more applications
> >> trying to make use of all this is one of the larger remaining challenges
> >> of the design.
> >>
> >> We'll present our work at XDC 2020.  Hope you can check that out as well!
> >>
> >>>>> 1. https://lore.kernel.org/dri-devel/57062477-30e7-a3de-6723-a50d03a402c4@kapsi.fi/
> >>>>>
> >>>>>> Given Plumbers is just a couple weeks from now, I've submitted
> >>>>>> a BoF proposal to discuss this, as perhaps it would make
> >>>>>> sense to discuss this live?
> >>>>>>
> >>>>>> Not-signed-off-by: Ezequiel Garcia <ezequiel@collabora.com>
> >
Daniel Vetter Sept. 1, 2020, 7:32 a.m. UTC | #17
On Sun, Aug 16, 2020 at 02:22:46PM -0300, Ezequiel Garcia wrote:
> This heap is basically a wrapper around DMA-API dma_alloc_attrs,
> which will allocate memory suitable for the given device.
> 
> The implementation is mostly a port of the Contiguous Videobuf2
> memory allocator (see videobuf2/videobuf2-dma-contig.c)
> over to the DMA-BUF Heap interface.
> 
> The intention of this allocator is to provide applications
> with a more system-agnostic API: the only thing the application
> needs to know is which device to get the buffer for.
> 
> Whether the buffer is backed by CMA, IOMMU or a DMA Pool
> is unknown to the application.
> 
> I'm not really expecting this patch to be correct or even
> a good idea, but just submitting it to start a discussion on DMA-BUF
> heap discovery and negotiation.
> 
> Given Plumbers is just a couple weeks from now, I've submitted
> a BoF proposal to discuss this, as perhaps it would make
> sense to discuss this live?
> 
> Not-signed-off-by: Ezequiel Garcia <ezequiel@collabora.com>

I think on the uapi/constraint solving side there's been already tons of
discussions while I enjoyed vacations, so different concern from me
entirely on the implementation side:

In the past the only thing we had in upstream was subsystem/driver
specific allocators, and dma-buf for sharing. With dma-buf heaps we kinda
get a central allocator, which duplicates large chunks of all these
allocators. And since it's a central allocator by design, the reason for
having per-subsystem allocators is kinda gone.

I think there's two approaches here:
- we convert e.g. drm allocator helpers to internally use the right heap
  implementation. That would also give us some fairly direct way to expose
  these constraints in sysfs so a userspace allocator knows which dma-buf
  heap to pick for shared stuff.

- we require that any heap is just a different uapi for an existing driver
  allocator, e.g. by having a dma-buf wrapper for all gem drivers.

Otherwise I think what we end up with is a pile of dma-buf heaps for
android's blob gpu driver world, and not used anywhere else. Not something
even remotely interesting for upstream :-)

tldr; I'd like to see how dma-buf heaps closely integrate with all the
existing buffer management code we have, both in the kernel and (through
some allocator library effort) in userspace.
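
For reference, this is the dma-buf heap uapi that such a userspace
allocator library would sit on top of; how the heap name gets picked
(sysfs hints, subsystem queries, ...) is exactly the open question here,
the ioctl itself already exists:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/dma-heap.h>

/* Returns a dma-buf fd allocated from /dev/dma_heap/<heap_name>, or -1. */
static int heap_alloc(const char *heap_name, size_t len)
{
        struct dma_heap_allocation_data data = {
                .len = len,
                .fd_flags = O_RDWR | O_CLOEXEC,
        };
        char path[128];
        int heap_fd, ret;

        snprintf(path, sizeof(path), "/dev/dma_heap/%s", heap_name);
        heap_fd = open(path, O_RDONLY | O_CLOEXEC);
        if (heap_fd < 0)
                return -1;

        ret = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);
        close(heap_fd);

        return ret < 0 ? -1 : (int)data.fd;
}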

Cheers, Daniel

> ---
>  drivers/dma-buf/heaps/Kconfig       |   9 +
>  drivers/dma-buf/heaps/Makefile      |   1 +
>  drivers/dma-buf/heaps/device_heap.c | 268 ++++++++++++++++++++++++++++
>  include/linux/device.h              |   5 +
>  include/linux/dma-heap.h            |   6 +
>  5 files changed, 289 insertions(+)
>  create mode 100644 drivers/dma-buf/heaps/device_heap.c
> 
> diff --git a/drivers/dma-buf/heaps/Kconfig b/drivers/dma-buf/heaps/Kconfig
> index a5eef06c4226..2bb3604184bd 100644
> --- a/drivers/dma-buf/heaps/Kconfig
> +++ b/drivers/dma-buf/heaps/Kconfig
> @@ -12,3 +12,12 @@ config DMABUF_HEAPS_CMA
>  	  Choose this option to enable dma-buf CMA heap. This heap is backed
>  	  by the Contiguous Memory Allocator (CMA). If your system has these
>  	  regions, you should say Y here.
> +
> +config DMABUF_HEAPS_DEVICES
> +	bool "DMA-BUF Device DMA Heap (Experimental)"
> +	depends on DMABUF_HEAPS
> +	help
> +	  Choose this option to enable dma-buf per-device heap. This heap is backed
> +	  by the DMA-API and it's an Experimental feature, meant mostly for testing
> +	  and experimentation.
> +	  Just say N here.
> diff --git a/drivers/dma-buf/heaps/Makefile b/drivers/dma-buf/heaps/Makefile
> index 6e54cdec3da0..c691d85b3044 100644
> --- a/drivers/dma-buf/heaps/Makefile
> +++ b/drivers/dma-buf/heaps/Makefile
> @@ -2,3 +2,4 @@
>  obj-y					+= heap-helpers.o
>  obj-$(CONFIG_DMABUF_HEAPS_SYSTEM)	+= system_heap.o
>  obj-$(CONFIG_DMABUF_HEAPS_CMA)		+= cma_heap.o
> +obj-$(CONFIG_DMABUF_HEAPS_DEVICES)	+= device_heap.o
> diff --git a/drivers/dma-buf/heaps/device_heap.c b/drivers/dma-buf/heaps/device_heap.c
> new file mode 100644
> index 000000000000..1803dc622dd8
> --- /dev/null
> +++ b/drivers/dma-buf/heaps/device_heap.c
> @@ -0,0 +1,268 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * DMABUF Device DMA heap exporter
> + *
> + * Copyright (C) 2020, Collabora Ltd.
> + *
> + * Based on:
> + *   videobuf2-dma-contig.c - DMA contig memory allocator for videobuf2
> + *   Copyright (C) 2010 Samsung Electronics
> + */
> +
> +#include <linux/device.h>
> +#include <linux/dma-buf.h>
> +#include <linux/dma-heap.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/scatterlist.h>
> +#include <linux/slab.h>
> +#include <linux/module.h>
> +
> +struct dev_dmabuf_attachment {
> +	struct sg_table sgt;
> +	enum dma_data_direction dma_dir;
> +};
> +
> +struct dev_dmabuf {
> +	struct dma_heap *heap;
> +	struct dma_buf *dmabuf;
> +	struct device *dev;
> +	size_t size;
> +	void *vaddr;
> +	dma_addr_t dma_addr;
> +	unsigned long attrs;
> +
> +	struct sg_table sgt;
> +};
> +
> +static struct sg_table *dev_dmabuf_ops_map(struct dma_buf_attachment *db_attach,
> +					   enum dma_data_direction dma_dir)
> +{
> +	struct dev_dmabuf_attachment *attach = db_attach->priv;
> +	/* stealing dmabuf mutex to serialize map/unmap operations */
> +	struct mutex *lock = &db_attach->dmabuf->lock;
> +	struct sg_table *sgt;
> +
> +	mutex_lock(lock);
> +
> +	sgt = &attach->sgt;
> +	/* return previously mapped sg table */
> +	if (attach->dma_dir == dma_dir) {
> +		mutex_unlock(lock);
> +		return sgt;
> +	}
> +
> +	/* release any previous cache */
> +	if (attach->dma_dir != DMA_NONE) {
> +		dma_unmap_sg_attrs(db_attach->dev, sgt->sgl, sgt->orig_nents,
> +				   attach->dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
> +		attach->dma_dir = DMA_NONE;
> +	}
> +
> +	/*
> +	 * mapping to the client with new direction, no cache sync
> +	 * required see comment in .dmabuf_ops_detach()
> +	 */
> +	sgt->nents = dma_map_sg_attrs(db_attach->dev, sgt->sgl, sgt->orig_nents,
> +				      dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
> +	if (!sgt->nents) {
> +		dev_err(db_attach->dev, "failed to map scatterlist\n");
> +		mutex_unlock(lock);
> +		return ERR_PTR(-EIO);
> +	}
> +
> +	attach->dma_dir = dma_dir;
> +
> +	mutex_unlock(lock);
> +
> +	return sgt;
> +}
> +
> +static void dev_dmabuf_ops_unmap(struct dma_buf_attachment *db_attach,
> +				 struct sg_table *sgt,
> +				 enum dma_data_direction dma_dir)
> +{
> +	/* nothing to be done here */
> +}
> +
> +static int dev_dmabuf_ops_attach(struct dma_buf *dmabuf,
> +				 struct dma_buf_attachment *dbuf_attach)
> +{
> +	struct dev_dmabuf_attachment *attach;
> +	unsigned int i;
> +	struct scatterlist *rd, *wr;
> +	struct sg_table *sgt;
> +	struct dev_dmabuf *buf = dmabuf->priv;
> +	int ret;
> +
> +	attach = kzalloc(sizeof(*attach), GFP_KERNEL);
> +	if (!attach)
> +		return -ENOMEM;
> +	sgt = &attach->sgt;
> +
> +	/*
> +	 * Copy the buf->sgt scatter list to the attachment, as we can't
> +	 * map the same scatter list to multiple attachments at the same time.
> +	 */
> +	ret = sg_alloc_table(sgt, buf->sgt.orig_nents, GFP_KERNEL);
> +	if (ret) {
> +		kfree(attach);
> +		return -ENOMEM;
> +	}
> +
> +	rd = buf->sgt.sgl;
> +	wr = sgt->sgl;
> +	for (i = 0; i < sgt->orig_nents; ++i) {
> +		sg_set_page(wr, sg_page(rd), rd->length, rd->offset);
> +		rd = sg_next(rd);
> +		wr = sg_next(wr);
> +	}
> +
> +	attach->dma_dir = DMA_NONE;
> +	dbuf_attach->priv = attach;
> +
> +	return 0;
> +}
> +
> +static void dev_dmabuf_ops_detach(struct dma_buf *dmabuf,
> +				  struct dma_buf_attachment *db_attach)
> +{
> +	struct dev_dmabuf_attachment *attach = db_attach->priv;
> +	struct sg_table *sgt;
> +
> +	if (!attach)
> +		return;
> +	sgt = &attach->sgt;
> +
> +	/* release the scatterlist cache */
> +	if (attach->dma_dir != DMA_NONE)
> +		/*
> +		 * Cache sync can be skipped here, as the memory is
> +		 * allocated from device coherent memory, which means the
> +		 * memory locations do not require any explicit cache
> +		 * maintenance prior or after being used by the device.
> +		 *
> +		 * XXX: This needs a revisit.
> +		 */
> +		dma_unmap_sg_attrs(db_attach->dev, sgt->sgl, sgt->orig_nents,
> +				   attach->dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
> +	sg_free_table(sgt);
> +	kfree(attach);
> +	db_attach->priv = NULL;
> +}
> +
> +
> +static void *dev_dmabuf_ops_vmap(struct dma_buf *dmabuf)
> +{
> +	struct dev_dmabuf *buf = dmabuf->priv;
> +
> +	return buf->vaddr;
> +}
> +
> +static void dev_dmabuf_ops_release(struct dma_buf *dmabuf)
> +{
> +	struct dev_dmabuf *buf = dmabuf->priv;
> +
> +	sg_free_table(&buf->sgt);
> +	dma_free_attrs(buf->dev, buf->size, buf->vaddr,
> +		       buf->dma_addr, buf->attrs);
> +	put_device(buf->dev);
> +	kfree(buf);
> +}
> +
> +static int dev_dmabuf_ops_mmap(struct dma_buf *dmabuf,
> +			       struct vm_area_struct *vma)
> +{
> +	struct dev_dmabuf *buf = dmabuf->priv;
> +	int ret;
> +
> +	ret = dma_mmap_attrs(buf->dev, vma, buf->vaddr,
> +			     buf->dma_addr, buf->size,
> +			     buf->attrs);
> +	if (ret) {
> +		dev_err(buf->dev, "remapping memory failed, error: %d\n", ret);
> +		return ret;
> +	}
> +	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
> +
> +	return 0;
> +}
> +
> +static const struct dma_buf_ops dev_dmabuf_ops = {
> +	.attach = dev_dmabuf_ops_attach,
> +	.detach = dev_dmabuf_ops_detach,
> +	.map_dma_buf = dev_dmabuf_ops_map,
> +	.unmap_dma_buf = dev_dmabuf_ops_unmap,
> +	.vmap = dev_dmabuf_ops_vmap,
> +	.mmap = dev_dmabuf_ops_mmap,
> +	.release = dev_dmabuf_ops_release,
> +};
> +
> +static int dev_heap_allocate(struct dma_heap *heap,
> +			unsigned long size,
> +			unsigned long fd_flags,
> +			unsigned long heap_flags)
> +{
> +	struct device *dev = dma_heap_get_drvdata(heap);
> +	struct dev_dmabuf *buf;
> +	struct dma_buf_export_info exp_info = {};
> +	unsigned long attrs = 0;
> +	int ret = -ENOMEM;
> +
> +	buf = kzalloc(sizeof(*buf), GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	buf->vaddr = dma_alloc_attrs(dev, size, &buf->dma_addr,
> +				     GFP_KERNEL, attrs);
> +	/* Prevent the device from being released while the buffer is used */
> +	buf->dev = get_device(dev);
> +	buf->heap = heap;
> +	buf->size = size;
> +	buf->attrs = attrs;
> +
> +	/* XXX: This call is documented as unsafe. See dma_get_sgtable_attrs(). */
> +	ret = dma_get_sgtable_attrs(buf->dev, &buf->sgt,
> +				    buf->vaddr, buf->dma_addr,
> +				    buf->size, buf->attrs);
> +	if (ret < 0) {
> +		dev_err(buf->dev, "failed to get scatterlist from DMA API\n");
> +		return ret;
> +	}
> +
> +	exp_info.exp_name = dev_name(dev);
> +	exp_info.owner = THIS_MODULE;
> +	exp_info.ops = &dev_dmabuf_ops;
> +	exp_info.size = size;
> +	exp_info.flags = fd_flags;
> +	exp_info.priv = buf;
> +
> +	buf->dmabuf = dma_buf_export(&exp_info);
> +	if (IS_ERR(buf->dmabuf)) {
> +		dev_err(buf->dev, "failed to export dmabuf\n");
> +		return PTR_ERR(buf->dmabuf);
> +	}
> +
> +	ret = dma_buf_fd(buf->dmabuf, fd_flags);
> +	if (ret < 0) {
> +		dev_err(buf->dev, "failed to get dmabuf fd: %d\n", ret);
> +		return ret;
> +	}
> +
> +	return ret;
> +}
> +
> +static const struct dma_heap_ops dev_heap_ops = {
> +	.allocate = dev_heap_allocate,
> +};
> +
> +void dev_dma_heap_add(struct device *dev)
> +{
> +	struct dma_heap_export_info exp_info;
> +
> +	exp_info.name = dev_name(dev);
> +	exp_info.ops = &dev_heap_ops;
> +	exp_info.priv = dev;
> +
> +	dev->heap = dma_heap_add(&exp_info);
> +}
> +EXPORT_SYMBOL(dev_dma_heap_add);
> diff --git a/include/linux/device.h b/include/linux/device.h
> index ca18da4768e3..1fae95d55ea1 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -45,6 +45,7 @@ struct iommu_ops;
>  struct iommu_group;
>  struct dev_pin_info;
>  struct dev_iommu;
> +struct dma_heap;
>  
>  /**
>   * struct subsys_interface - interfaces to device functions
> @@ -597,6 +598,10 @@ struct device {
>  	struct iommu_group	*iommu_group;
>  	struct dev_iommu	*iommu;
>  
> +#ifdef CONFIG_DMABUF_HEAPS_DEVICES
> +	struct dma_heap		*heap;
> +#endif
> +
>  	bool			offline_disabled:1;
>  	bool			offline:1;
>  	bool			of_node_reused:1;
> diff --git a/include/linux/dma-heap.h b/include/linux/dma-heap.h
> index 454e354d1ffb..dcf7cca2f487 100644
> --- a/include/linux/dma-heap.h
> +++ b/include/linux/dma-heap.h
> @@ -56,4 +56,10 @@ void *dma_heap_get_drvdata(struct dma_heap *heap);
>   */
>  struct dma_heap *dma_heap_add(const struct dma_heap_export_info *exp_info);
>  
> +#ifdef CONFIG_DMABUF_HEAPS_DEVICES
> +void dev_dma_heap_add(struct device *dev);
> +#else
> +static inline void dev_dma_heap_add(struct device *dev) {}
> +#endif
> +
>  #endif /* _DMA_HEAPS_H */
> -- 
> 2.27.0
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
Laurent Pinchart Sept. 8, 2020, 5:43 a.m. UTC | #18
Hi Daniel,

On Tue, Sep 01, 2020 at 09:32:22AM +0200, Daniel Vetter wrote:
> On Sun, Aug 16, 2020 at 02:22:46PM -0300, Ezequiel Garcia wrote:
> > This heap is basically a wrapper around DMA-API dma_alloc_attrs,
> > which will allocate memory suitable for the given device.
> > 
> > The implementation is mostly a port of the Contiguous Videobuf2
> > memory allocator (see videobuf2/videobuf2-dma-contig.c)
> > over to the DMA-BUF Heap interface.
> > 
> > The intention of this allocator is to provide applications
> > with a more system-agnostic API: the only thing the application
> > needs to know is which device to get the buffer for.
> > 
> > Whether the buffer is backed by CMA, IOMMU or a DMA Pool
> > is unknown to the application.
> > 
> > I'm not really expecting this patch to be correct or even
> > a good idea, but just submitting it to start a discussion on DMA-BUF
> > heap discovery and negotiation.
> > 
> > Given Plumbers is just a couple weeks from now, I've submitted
> > a BoF proposal to discuss this, as perhaps it would make
> > sense to discuss this live?
> > 
> > Not-signed-off-by: Ezequiel Garcia <ezequiel@collabora.com>
> 
> I think on the uapi/constraint solving side there's been already tons of
> discussions while I enjoyed vacations, so different concern from me
> entirely on the implementation side:
> 
> In the past the only thing we had in upstream was subsystem/driver
> specific allocators, and dma-buf for sharing. With dma-buf heaps we kinda
> get a central allocator, which duplicates large chunks of of all these
> allocators. And since it's a central allocator by design, the reason for
> having per-subsystem allocators is kinda gone.
> 
> I think there's two approaches here:
> - we convert e.g. drm allocator helpers to internally use the right heap
>   implementation.

We however don't always have a need for a dmabuf, as not all buffers are
meant to be shared (and we often can't tell beforehand, at allocation
time, if a given buffer will be shared or not). While the overhead of
allocating a dmabuf should be fine, assigning a file descriptor to each
buffer would be prohibitively expensive.

We would need to decouple the dma heaps from file descriptors. I think
that's doable, but it changes the philosophy of dma heaps and needs
careful consideration. In particular, one may wonder what that would
bring us, compared to the DMA mapping API for instance.
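
A rough sketch of what that decoupling could look like on the kernel side;
the heap-side helper below is hypothetical (no such in-kernel interface
exists), while dma_buf_fd() is the existing call that would simply be
deferred until sharing is actually requested:

#include <linux/dma-buf.h>
#include <linux/dma-heap.h>
#include <linux/err.h>
#include <linux/fcntl.h>

/* dma_heap_buffer_alloc() as used here is a hypothetical in-kernel API. */
static int driver_alloc_buffer(struct dma_heap *heap, size_t size,
                               bool share, struct dma_buf **out)
{
        struct dma_buf *buf;
        int fd = 0;

        buf = dma_heap_buffer_alloc(heap, size);        /* no fd created here */
        if (IS_ERR(buf))
                return PTR_ERR(buf);

        /* Only pay for a file descriptor when the buffer is shared. */
        if (share) {
                fd = dma_buf_fd(buf, O_CLOEXEC);        /* existing dma-buf call */
                if (fd < 0) {
                        dma_buf_put(buf);
                        return fd;
                }
        }

        *out = buf;
        return fd;      /* 0, or the installed fd when shared */
}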

> That would also give us some fairly direct way to expose
>   these constraints in sysfs so a userspace allocator knows which dma-buf
>   heap to pick for shared stuff.

sysfs won't work I'm afraid, as constraints may depend on the format, for
instance. We need subsystem-specific APIs to expose constraints.

> - we require that any heap is just a different uapi for an existing driver
>   allocator, e.g. by having a dma-buf wrapper for all gem drivers.
> 
> Otherwise I think what we end up with is a pile of dma-buf heaps for
> android's blob gpu driver world, and not used anywhere else. Not something
> even remotely interesting for upstream :-)
> 
> tldr; I'd like to see how dma-buf heaps closely integrate with all the
> existing buffer management code we have. Both kernel (and throuhg some
> allocator library effort) in userspace.
> 
> > ---
> >  drivers/dma-buf/heaps/Kconfig       |   9 +
> >  drivers/dma-buf/heaps/Makefile      |   1 +
> >  drivers/dma-buf/heaps/device_heap.c | 268 ++++++++++++++++++++++++++++
> >  include/linux/device.h              |   5 +
> >  include/linux/dma-heap.h            |   6 +
> >  5 files changed, 289 insertions(+)
> >  create mode 100644 drivers/dma-buf/heaps/device_heap.c
> > 
> > diff --git a/drivers/dma-buf/heaps/Kconfig b/drivers/dma-buf/heaps/Kconfig
> > index a5eef06c4226..2bb3604184bd 100644
> > --- a/drivers/dma-buf/heaps/Kconfig
> > +++ b/drivers/dma-buf/heaps/Kconfig
> > @@ -12,3 +12,12 @@ config DMABUF_HEAPS_CMA
> >  	  Choose this option to enable dma-buf CMA heap. This heap is backed
> >  	  by the Contiguous Memory Allocator (CMA). If your system has these
> >  	  regions, you should say Y here.
> > +
> > +config DMABUF_HEAPS_DEVICES
> > +	bool "DMA-BUF Device DMA Heap (Experimental)"
> > +	depends on DMABUF_HEAPS
> > +	help
> > +	  Choose this option to enable dma-buf per-device heap. This heap is backed
> > +	  by the DMA-API and it's an Experimental feature, meant mostly for testing
> > +	  and experimentation.
> > +	  Just say N here.
> > diff --git a/drivers/dma-buf/heaps/Makefile b/drivers/dma-buf/heaps/Makefile
> > index 6e54cdec3da0..c691d85b3044 100644
> > --- a/drivers/dma-buf/heaps/Makefile
> > +++ b/drivers/dma-buf/heaps/Makefile
> > @@ -2,3 +2,4 @@
> >  obj-y					+= heap-helpers.o
> >  obj-$(CONFIG_DMABUF_HEAPS_SYSTEM)	+= system_heap.o
> >  obj-$(CONFIG_DMABUF_HEAPS_CMA)		+= cma_heap.o
> > +obj-$(CONFIG_DMABUF_HEAPS_DEVICES)	+= device_heap.o
> > diff --git a/drivers/dma-buf/heaps/device_heap.c b/drivers/dma-buf/heaps/device_heap.c
> > new file mode 100644
> > index 000000000000..1803dc622dd8
> > --- /dev/null
> > +++ b/drivers/dma-buf/heaps/device_heap.c
> > @@ -0,0 +1,268 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * DMABUF Device DMA heap exporter
> > + *
> > + * Copyright (C) 2020, Collabora Ltd.
> > + *
> > + * Based on:
> > + *   videobuf2-dma-contig.c - DMA contig memory allocator for videobuf2
> > + *   Copyright (C) 2010 Samsung Electronics
> > + */
> > +
> > +#include <linux/device.h>
> > +#include <linux/dma-buf.h>
> > +#include <linux/dma-heap.h>
> > +#include <linux/dma-mapping.h>
> > +#include <linux/scatterlist.h>
> > +#include <linux/slab.h>
> > +#include <linux/module.h>
> > +
> > +struct dev_dmabuf_attachment {
> > +	struct sg_table sgt;
> > +	enum dma_data_direction dma_dir;
> > +};
> > +
> > +struct dev_dmabuf {
> > +	struct dma_heap *heap;
> > +	struct dma_buf *dmabuf;
> > +	struct device *dev;
> > +	size_t size;
> > +	void *vaddr;
> > +	dma_addr_t dma_addr;
> > +	unsigned long attrs;
> > +
> > +	struct sg_table sgt;
> > +};
> > +
> > +static struct sg_table *dev_dmabuf_ops_map(struct dma_buf_attachment *db_attach,
> > +					   enum dma_data_direction dma_dir)
> > +{
> > +	struct dev_dmabuf_attachment *attach = db_attach->priv;
> > +	/* stealing dmabuf mutex to serialize map/unmap operations */
> > +	struct mutex *lock = &db_attach->dmabuf->lock;
> > +	struct sg_table *sgt;
> > +
> > +	mutex_lock(lock);
> > +
> > +	sgt = &attach->sgt;
> > +	/* return previously mapped sg table */
> > +	if (attach->dma_dir == dma_dir) {
> > +		mutex_unlock(lock);
> > +		return sgt;
> > +	}
> > +
> > +	/* release any previous cache */
> > +	if (attach->dma_dir != DMA_NONE) {
> > +		dma_unmap_sg_attrs(db_attach->dev, sgt->sgl, sgt->orig_nents,
> > +				   attach->dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
> > +		attach->dma_dir = DMA_NONE;
> > +	}
> > +
> > +	/*
> > +	 * mapping to the client with new direction, no cache sync
> > +	 * required see comment in .dmabuf_ops_detach()
> > +	 */
> > +	sgt->nents = dma_map_sg_attrs(db_attach->dev, sgt->sgl, sgt->orig_nents,
> > +				      dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
> > +	if (!sgt->nents) {
> > +		dev_err(db_attach->dev, "failed to map scatterlist\n");
> > +		mutex_unlock(lock);
> > +		return ERR_PTR(-EIO);
> > +	}
> > +
> > +	attach->dma_dir = dma_dir;
> > +
> > +	mutex_unlock(lock);
> > +
> > +	return sgt;
> > +}
> > +
> > +static void dev_dmabuf_ops_unmap(struct dma_buf_attachment *db_attach,
> > +				 struct sg_table *sgt,
> > +				 enum dma_data_direction dma_dir)
> > +{
> > +	/* nothing to be done here */
> > +}
> > +
> > +static int dev_dmabuf_ops_attach(struct dma_buf *dmabuf,
> > +				 struct dma_buf_attachment *dbuf_attach)
> > +{
> > +	struct dev_dmabuf_attachment *attach;
> > +	unsigned int i;
> > +	struct scatterlist *rd, *wr;
> > +	struct sg_table *sgt;
> > +	struct dev_dmabuf *buf = dmabuf->priv;
> > +	int ret;
> > +
> > +	attach = kzalloc(sizeof(*attach), GFP_KERNEL);
> > +	if (!attach)
> > +		return -ENOMEM;
> > +	sgt = &attach->sgt;
> > +
> > +	/*
> > +	 * Copy the buf->sgt scatter list to the attachment, as we can't
> > +	 * map the same scatter list to multiple attachments at the same time.
> > +	 */
> > +	ret = sg_alloc_table(sgt, buf->sgt.orig_nents, GFP_KERNEL);
> > +	if (ret) {
> > +		kfree(attach);
> > +		return -ENOMEM;
> > +	}
> > +
> > +	rd = buf->sgt.sgl;
> > +	wr = sgt->sgl;
> > +	for (i = 0; i < sgt->orig_nents; ++i) {
> > +		sg_set_page(wr, sg_page(rd), rd->length, rd->offset);
> > +		rd = sg_next(rd);
> > +		wr = sg_next(wr);
> > +	}
> > +
> > +	attach->dma_dir = DMA_NONE;
> > +	dbuf_attach->priv = attach;
> > +
> > +	return 0;
> > +}
> > +
> > +static void dev_dmabuf_ops_detach(struct dma_buf *dmabuf,
> > +				  struct dma_buf_attachment *db_attach)
> > +{
> > +	struct dev_dmabuf_attachment *attach = db_attach->priv;
> > +	struct sg_table *sgt;
> > +
> > +	if (!attach)
> > +		return;
> > +	sgt = &attach->sgt;
> > +
> > +	/* release the scatterlist cache */
> > +	if (attach->dma_dir != DMA_NONE)
> > +		/*
> > +		 * Cache sync can be skipped here, as the memory is
> > +		 * allocated from device coherent memory, which means the
> > +		 * memory locations do not require any explicit cache
> > +		 * maintenance prior or after being used by the device.
> > +		 *
> > +		 * XXX: This needs a revisit.
> > +		 */
> > +		dma_unmap_sg_attrs(db_attach->dev, sgt->sgl, sgt->orig_nents,
> > +				   attach->dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
> > +	sg_free_table(sgt);
> > +	kfree(attach);
> > +	db_attach->priv = NULL;
> > +}
> > +
> > +
> > +static void *dev_dmabuf_ops_vmap(struct dma_buf *dmabuf)
> > +{
> > +	struct dev_dmabuf *buf = dmabuf->priv;
> > +
> > +	return buf->vaddr;
> > +}
> > +
> > +static void dev_dmabuf_ops_release(struct dma_buf *dmabuf)
> > +{
> > +	struct dev_dmabuf *buf = dmabuf->priv;
> > +
> > +	sg_free_table(&buf->sgt);
> > +	dma_free_attrs(buf->dev, buf->size, buf->vaddr,
> > +		       buf->dma_addr, buf->attrs);
> > +	put_device(buf->dev);
> > +	kfree(buf);
> > +}
> > +
> > +static int dev_dmabuf_ops_mmap(struct dma_buf *dmabuf,
> > +			       struct vm_area_struct *vma)
> > +{
> > +	struct dev_dmabuf *buf = dmabuf->priv;
> > +	int ret;
> > +
> > +	ret = dma_mmap_attrs(buf->dev, vma, buf->vaddr,
> > +			     buf->dma_addr, buf->size,
> > +			     buf->attrs);
> > +	if (ret) {
> > +		dev_err(buf->dev, "remapping memory failed, error: %d\n", ret);
> > +		return ret;
> > +	}
> > +	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
> > +
> > +	return 0;
> > +}
> > +
> > +static const struct dma_buf_ops dev_dmabuf_ops = {
> > +	.attach = dev_dmabuf_ops_attach,
> > +	.detach = dev_dmabuf_ops_detach,
> > +	.map_dma_buf = dev_dmabuf_ops_map,
> > +	.unmap_dma_buf = dev_dmabuf_ops_unmap,
> > +	.vmap = dev_dmabuf_ops_vmap,
> > +	.mmap = dev_dmabuf_ops_mmap,
> > +	.release = dev_dmabuf_ops_release,
> > +};
> > +
> > +static int dev_heap_allocate(struct dma_heap *heap,
> > +			unsigned long size,
> > +			unsigned long fd_flags,
> > +			unsigned long heap_flags)
> > +{
> > +	struct device *dev = dma_heap_get_drvdata(heap);
> > +	struct dev_dmabuf *buf;
> > +	struct dma_buf_export_info exp_info = {};
> > +	unsigned long attrs = 0;
> > +	int ret = -ENOMEM;
> > +
> > +	buf = kzalloc(sizeof(*buf), GFP_KERNEL);
> > +	if (!buf)
> > +		return -ENOMEM;
> > +
> > +	buf->vaddr = dma_alloc_attrs(dev, size, &buf->dma_addr,
> > +				     GFP_KERNEL, attrs);
> > +	/* Prevent the device from being released while the buffer is used */
> > +	buf->dev = get_device(dev);
> > +	buf->heap = heap;
> > +	buf->size = size;
> > +	buf->attrs = attrs;
> > +
> > +	/* XXX: This call is documented as unsafe. See dma_get_sgtable_attrs(). */
> > +	ret = dma_get_sgtable_attrs(buf->dev, &buf->sgt,
> > +				    buf->vaddr, buf->dma_addr,
> > +				    buf->size, buf->attrs);
> > +	if (ret < 0) {
> > +		dev_err(buf->dev, "failed to get scatterlist from DMA API\n");
> > +		return ret;
> > +	}
> > +
> > +	exp_info.exp_name = dev_name(dev);
> > +	exp_info.owner = THIS_MODULE;
> > +	exp_info.ops = &dev_dmabuf_ops;
> > +	exp_info.size = size;
> > +	exp_info.flags = fd_flags;
> > +	exp_info.priv = buf;
> > +
> > +	buf->dmabuf = dma_buf_export(&exp_info);
> > +	if (IS_ERR(buf->dmabuf)) {
> > +		dev_err(buf->dev, "failed to export dmabuf\n");
> > +		return PTR_ERR(buf->dmabuf);
> > +	}
> > +
> > +	ret = dma_buf_fd(buf->dmabuf, fd_flags);
> > +	if (ret < 0) {
> > +		dev_err(buf->dev, "failed to get dmabuf fd: %d\n", ret);
> > +		return ret;
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +static const struct dma_heap_ops dev_heap_ops = {
> > +	.allocate = dev_heap_allocate,
> > +};
> > +
> > +void dev_dma_heap_add(struct device *dev)
> > +{
> > +	struct dma_heap_export_info exp_info;
> > +
> > +	exp_info.name = dev_name(dev);
> > +	exp_info.ops = &dev_heap_ops;
> > +	exp_info.priv = dev;
> > +
> > +	dev->heap = dma_heap_add(&exp_info);
> > +}
> > +EXPORT_SYMBOL(dev_dma_heap_add);
> > diff --git a/include/linux/device.h b/include/linux/device.h
> > index ca18da4768e3..1fae95d55ea1 100644
> > --- a/include/linux/device.h
> > +++ b/include/linux/device.h
> > @@ -45,6 +45,7 @@ struct iommu_ops;
> >  struct iommu_group;
> >  struct dev_pin_info;
> >  struct dev_iommu;
> > +struct dma_heap;
> >  
> >  /**
> >   * struct subsys_interface - interfaces to device functions
> > @@ -597,6 +598,10 @@ struct device {
> >  	struct iommu_group	*iommu_group;
> >  	struct dev_iommu	*iommu;
> >  
> > +#ifdef CONFIG_DMABUF_HEAPS_DEVICES
> > +	struct dma_heap		*heap;
> > +#endif
> > +
> >  	bool			offline_disabled:1;
> >  	bool			offline:1;
> >  	bool			of_node_reused:1;
> > diff --git a/include/linux/dma-heap.h b/include/linux/dma-heap.h
> > index 454e354d1ffb..dcf7cca2f487 100644
> > --- a/include/linux/dma-heap.h
> > +++ b/include/linux/dma-heap.h
> > @@ -56,4 +56,10 @@ void *dma_heap_get_drvdata(struct dma_heap *heap);
> >   */
> >  struct dma_heap *dma_heap_add(const struct dma_heap_export_info *exp_info);
> >  
> > +#ifdef CONFIG_DMABUF_HEAPS_DEVICES
> > +void dev_dma_heap_add(struct device *dev);
> > +#else
> > +static inline void dev_dma_heap_add(struct device *dev) {}
> > +#endif
> > +
> >  #endif /* _DMA_HEAPS_H */
Daniel Vetter Sept. 8, 2020, 8:36 a.m. UTC | #19
On Tue, Sep 08, 2020 at 08:43:59AM +0300, Laurent Pinchart wrote:
> Hi Daniel,
> 
> On Tue, Sep 01, 2020 at 09:32:22AM +0200, Daniel Vetter wrote:
> > On Sun, Aug 16, 2020 at 02:22:46PM -0300, Ezequiel Garcia wrote:
> > > This heap is basically a wrapper around DMA-API dma_alloc_attrs,
> > > which will allocate memory suitable for the given device.
> > > 
> > > The implementation is mostly a port of the Contiguous Videobuf2
> > > memory allocator (see videobuf2/videobuf2-dma-contig.c)
> > > over to the DMA-BUF Heap interface.
> > > 
> > > The intention of this allocator is to provide applications
> > > with a more system-agnostic API: the only thing the application
> > > needs to know is which device to get the buffer for.
> > > 
> > > Whether the buffer is backed by CMA, IOMMU or a DMA Pool
> > > is unknown to the application.
> > > 
> > > I'm not really expecting this patch to be correct or even
> > > a good idea, but just submitting it to start a discussion on DMA-BUF
> > > heap discovery and negotiation.
> > > 
> > > Given Plumbers is just a couple weeks from now, I've submitted
> > > a BoF proposal to discuss this, as perhaps it would make
> > > sense to discuss this live?
> > > 
> > > Not-signed-off-by: Ezequiel Garcia <ezequiel@collabora.com>
> > 
> > I think on the uapi/constraint solving side there's been already tons of
> > discussions while I enjoyed vacations, so different concern from me
> > entirely on the implementation side:
> > 
> > In the past the only thing we had in upstream was subsystem/driver
> > specific allocators, and dma-buf for sharing. With dma-buf heaps we kinda
> > get a central allocator, which duplicates large chunks of of all these
> > allocators. And since it's a central allocator by design, the reason for
> > having per-subsystem allocators is kinda gone.
> > 
> > I think there's two approaches here:
> > - we convert e.g. drm allocator helpers to internally use the right heap
> >   implementation.
> 
> We however don't always have a need for a dmabuf, as not all buffers are
> meant to be shared (and we often can't tell beforehand, at allocation
> time, if a given buffer will be shared or not). While the overhead of
> allocating a dmabuf should be file, assigning a file descriptor to each
> buffer would be prohibitely expensive.
> 
> We would need to decouple the dma heaps from file descriptors. I think
> that's doable, but it changes the philosophy of dma heaps and needs
> careful consideration. In particular, one may wonder what that would
> bring us, compared to the DMA mapping API for instance.

Oh, I don't mean that we use the dma-buf heap interfaces, but the backend.
Underneath it's all a struct dma_buf; that's where I figured we should
intercept the code sharing. Definitely not file descriptor handles.

> > That would also give us some fairly direct way to expose
> >   these constraints in sysfs so a userspace allocator knows which dma-buf
> >   heap to pick for shared stuff.
> 
> sysfs won't work I'm afraid, as constraints may depend on the format for
> instace. We need subsystem-specific APIs to expose constraints.

Yeah, sysfs would be the default, plus maybe a list of usable heaps, with
the subsystem API then telling you which one to use when.
-Daniel

> 
> > - we require that any heap is just a different uapi for an existing driver
> >   allocator, e.g. by having a dma-buf wrapper for all gem drivers.
> > 
> > Otherwise I think what we end up with is a pile of dma-buf heaps for
> > android's blob gpu driver world, and not used anywhere else. Not something
> > even remotely interesting for upstream :-)
> > 
> > tldr; I'd like to see how dma-buf heaps closely integrate with all the
> > existing buffer management code we have. Both kernel (and throuhg some
> > allocator library effort) in userspace.
> > 
> > > ---
> > >  drivers/dma-buf/heaps/Kconfig       |   9 +
> > >  drivers/dma-buf/heaps/Makefile      |   1 +
> > >  drivers/dma-buf/heaps/device_heap.c | 268 ++++++++++++++++++++++++++++
> > >  include/linux/device.h              |   5 +
> > >  include/linux/dma-heap.h            |   6 +
> > >  5 files changed, 289 insertions(+)
> > >  create mode 100644 drivers/dma-buf/heaps/device_heap.c
> > > 
> > > diff --git a/drivers/dma-buf/heaps/Kconfig b/drivers/dma-buf/heaps/Kconfig
> > > index a5eef06c4226..2bb3604184bd 100644
> > > --- a/drivers/dma-buf/heaps/Kconfig
> > > +++ b/drivers/dma-buf/heaps/Kconfig
> > > @@ -12,3 +12,12 @@ config DMABUF_HEAPS_CMA
> > >  	  Choose this option to enable dma-buf CMA heap. This heap is backed
> > >  	  by the Contiguous Memory Allocator (CMA). If your system has these
> > >  	  regions, you should say Y here.
> > > +
> > > +config DMABUF_HEAPS_DEVICES
> > > +	bool "DMA-BUF Device DMA Heap (Experimental)"
> > > +	depends on DMABUF_HEAPS
> > > +	help
> > > +	  Choose this option to enable dma-buf per-device heap. This heap is backed
> > > +	  by the DMA-API and it's an Experimental feature, meant mostly for testing
> > > +	  and experimentation.
> > > +	  Just say N here.
> > > diff --git a/drivers/dma-buf/heaps/Makefile b/drivers/dma-buf/heaps/Makefile
> > > index 6e54cdec3da0..c691d85b3044 100644
> > > --- a/drivers/dma-buf/heaps/Makefile
> > > +++ b/drivers/dma-buf/heaps/Makefile
> > > @@ -2,3 +2,4 @@
> > >  obj-y					+= heap-helpers.o
> > >  obj-$(CONFIG_DMABUF_HEAPS_SYSTEM)	+= system_heap.o
> > >  obj-$(CONFIG_DMABUF_HEAPS_CMA)		+= cma_heap.o
> > > +obj-$(CONFIG_DMABUF_HEAPS_DEVICES)	+= device_heap.o
> > > diff --git a/drivers/dma-buf/heaps/device_heap.c b/drivers/dma-buf/heaps/device_heap.c
> > > new file mode 100644
> > > index 000000000000..1803dc622dd8
> > > --- /dev/null
> > > +++ b/drivers/dma-buf/heaps/device_heap.c
> > > @@ -0,0 +1,268 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * DMABUF Device DMA heap exporter
> > > + *
> > > + * Copyright (C) 2020, Collabora Ltd.
> > > + *
> > > + * Based on:
> > > + *   videobuf2-dma-contig.c - DMA contig memory allocator for videobuf2
> > > + *   Copyright (C) 2010 Samsung Electronics
> > > + */
> > > +
> > > +#include <linux/device.h>
> > > +#include <linux/dma-buf.h>
> > > +#include <linux/dma-heap.h>
> > > +#include <linux/dma-mapping.h>
> > > +#include <linux/scatterlist.h>
> > > +#include <linux/slab.h>
> > > +#include <linux/module.h>
> > > +
> > > +struct dev_dmabuf_attachment {
> > > +	struct sg_table sgt;
> > > +	enum dma_data_direction dma_dir;
> > > +};
> > > +
> > > +struct dev_dmabuf {
> > > +	struct dma_heap *heap;
> > > +	struct dma_buf *dmabuf;
> > > +	struct device *dev;
> > > +	size_t size;
> > > +	void *vaddr;
> > > +	dma_addr_t dma_addr;
> > > +	unsigned long attrs;
> > > +
> > > +	struct sg_table sgt;
> > > +};
> > > +
> > > +static struct sg_table *dev_dmabuf_ops_map(struct dma_buf_attachment *db_attach,
> > > +					   enum dma_data_direction dma_dir)
> > > +{
> > > +	struct dev_dmabuf_attachment *attach = db_attach->priv;
> > > +	/* stealing dmabuf mutex to serialize map/unmap operations */
> > > +	struct mutex *lock = &db_attach->dmabuf->lock;
> > > +	struct sg_table *sgt;
> > > +
> > > +	mutex_lock(lock);
> > > +
> > > +	sgt = &attach->sgt;
> > > +	/* return previously mapped sg table */
> > > +	if (attach->dma_dir == dma_dir) {
> > > +		mutex_unlock(lock);
> > > +		return sgt;
> > > +	}
> > > +
> > > +	/* release any previous cache */
> > > +	if (attach->dma_dir != DMA_NONE) {
> > > +		dma_unmap_sg_attrs(db_attach->dev, sgt->sgl, sgt->orig_nents,
> > > +				   attach->dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
> > > +		attach->dma_dir = DMA_NONE;
> > > +	}
> > > +
> > > +	/*
> > > +	 * mapping to the client with new direction, no cache sync
> > > +	 * required see comment in .dmabuf_ops_detach()
> > > +	 */
> > > +	sgt->nents = dma_map_sg_attrs(db_attach->dev, sgt->sgl, sgt->orig_nents,
> > > +				      dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
> > > +	if (!sgt->nents) {
> > > +		dev_err(db_attach->dev, "failed to map scatterlist\n");
> > > +		mutex_unlock(lock);
> > > +		return ERR_PTR(-EIO);
> > > +	}
> > > +
> > > +	attach->dma_dir = dma_dir;
> > > +
> > > +	mutex_unlock(lock);
> > > +
> > > +	return sgt;
> > > +}
> > > +
> > > +static void dev_dmabuf_ops_unmap(struct dma_buf_attachment *db_attach,
> > > +				 struct sg_table *sgt,
> > > +				 enum dma_data_direction dma_dir)
> > > +{
> > > +	/* nothing to be done here */
> > > +}
> > > +
> > > +static int dev_dmabuf_ops_attach(struct dma_buf *dmabuf,
> > > +				 struct dma_buf_attachment *dbuf_attach)
> > > +{
> > > +	struct dev_dmabuf_attachment *attach;
> > > +	unsigned int i;
> > > +	struct scatterlist *rd, *wr;
> > > +	struct sg_table *sgt;
> > > +	struct dev_dmabuf *buf = dmabuf->priv;
> > > +	int ret;
> > > +
> > > +	attach = kzalloc(sizeof(*attach), GFP_KERNEL);
> > > +	if (!attach)
> > > +		return -ENOMEM;
> > > +	sgt = &attach->sgt;
> > > +
> > > +	/*
> > > +	 * Copy the buf->sgt scatter list to the attachment, as we can't
> > > +	 * map the same scatter list to multiple attachments at the same time.
> > > +	 */
> > > +	ret = sg_alloc_table(sgt, buf->sgt.orig_nents, GFP_KERNEL);
> > > +	if (ret) {
> > > +		kfree(attach);
> > > +		return -ENOMEM;
> > > +	}
> > > +
> > > +	rd = buf->sgt.sgl;
> > > +	wr = sgt->sgl;
> > > +	for (i = 0; i < sgt->orig_nents; ++i) {
> > > +		sg_set_page(wr, sg_page(rd), rd->length, rd->offset);
> > > +		rd = sg_next(rd);
> > > +		wr = sg_next(wr);
> > > +	}
> > > +
> > > +	attach->dma_dir = DMA_NONE;
> > > +	dbuf_attach->priv = attach;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static void dev_dmabuf_ops_detach(struct dma_buf *dmabuf,
> > > +				  struct dma_buf_attachment *db_attach)
> > > +{
> > > +	struct dev_dmabuf_attachment *attach = db_attach->priv;
> > > +	struct sg_table *sgt;
> > > +
> > > +	if (!attach)
> > > +		return;
> > > +	sgt = &attach->sgt;
> > > +
> > > +	/* release the scatterlist cache */
> > > +	if (attach->dma_dir != DMA_NONE)
> > > +		/*
> > > +		 * Cache sync can be skipped here, as the memory is
> > > +		 * allocated from device coherent memory, which means the
> > > +		 * memory locations do not require any explicit cache
> > > +		 * maintenance prior or after being used by the device.
> > > +		 *
> > > +		 * XXX: This needs a revisit.
> > > +		 */
> > > +		dma_unmap_sg_attrs(db_attach->dev, sgt->sgl, sgt->orig_nents,
> > > +				   attach->dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
> > > +	sg_free_table(sgt);
> > > +	kfree(attach);
> > > +	db_attach->priv = NULL;
> > > +}
> > > +
> > > +
> > > +static void *dev_dmabuf_ops_vmap(struct dma_buf *dmabuf)
> > > +{
> > > +	struct dev_dmabuf *buf = dmabuf->priv;
> > > +
> > > +	return buf->vaddr;
> > > +}
> > > +
> > > +static void dev_dmabuf_ops_release(struct dma_buf *dmabuf)
> > > +{
> > > +	struct dev_dmabuf *buf = dmabuf->priv;
> > > +
> > > +	sg_free_table(&buf->sgt);
> > > +	dma_free_attrs(buf->dev, buf->size, buf->vaddr,
> > > +		       buf->dma_addr, buf->attrs);
> > > +	put_device(buf->dev);
> > > +	kfree(buf);
> > > +}
> > > +
> > > +static int dev_dmabuf_ops_mmap(struct dma_buf *dmabuf,
> > > +			       struct vm_area_struct *vma)
> > > +{
> > > +	struct dev_dmabuf *buf = dmabuf->priv;
> > > +	int ret;
> > > +
> > > +	ret = dma_mmap_attrs(buf->dev, vma, buf->vaddr,
> > > +			     buf->dma_addr, buf->size,
> > > +			     buf->attrs);
> > > +	if (ret) {
> > > +		dev_err(buf->dev, "remapping memory failed, error: %d\n", ret);
> > > +		return ret;
> > > +	}
> > > +	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static const struct dma_buf_ops dev_dmabuf_ops = {
> > > +	.attach = dev_dmabuf_ops_attach,
> > > +	.detach = dev_dmabuf_ops_detach,
> > > +	.map_dma_buf = dev_dmabuf_ops_map,
> > > +	.unmap_dma_buf = dev_dmabuf_ops_unmap,
> > > +	.vmap = dev_dmabuf_ops_vmap,
> > > +	.mmap = dev_dmabuf_ops_mmap,
> > > +	.release = dev_dmabuf_ops_release,
> > > +};
> > > +
> > > +static int dev_heap_allocate(struct dma_heap *heap,
> > > +			unsigned long size,
> > > +			unsigned long fd_flags,
> > > +			unsigned long heap_flags)
> > > +{
> > > +	struct device *dev = dma_heap_get_drvdata(heap);
> > > +	struct dev_dmabuf *buf;
> > > +	struct dma_buf_export_info exp_info = {};
> > > +	unsigned long attrs = 0;
> > > +	int ret = -ENOMEM;
> > > +
> > > +	buf = kzalloc(sizeof(*buf), GFP_KERNEL);
> > > +	if (!buf)
> > > +		return -ENOMEM;
> > > +
> > > +	buf->vaddr = dma_alloc_attrs(dev, size, &buf->dma_addr,
> > > +				     GFP_KERNEL, attrs);
> > > +	/* Prevent the device from being released while the buffer is used */
> > > +	buf->dev = get_device(dev);
> > > +	buf->heap = heap;
> > > +	buf->size = size;
> > > +	buf->attrs = attrs;
> > > +
> > > +	/* XXX: This call is documented as unsafe. See dma_get_sgtable_attrs(). */
> > > +	ret = dma_get_sgtable_attrs(buf->dev, &buf->sgt,
> > > +				    buf->vaddr, buf->dma_addr,
> > > +				    buf->size, buf->attrs);
> > > +	if (ret < 0) {
> > > +		dev_err(buf->dev, "failed to get scatterlist from DMA API\n");
> > > +		return ret;
> > > +	}
> > > +
> > > +	exp_info.exp_name = dev_name(dev);
> > > +	exp_info.owner = THIS_MODULE;
> > > +	exp_info.ops = &dev_dmabuf_ops;
> > > +	exp_info.size = size;
> > > +	exp_info.flags = fd_flags;
> > > +	exp_info.priv = buf;
> > > +
> > > +	buf->dmabuf = dma_buf_export(&exp_info);
> > > +	if (IS_ERR(buf->dmabuf)) {
> > > +		dev_err(buf->dev, "failed to export dmabuf\n");
> > > +		return PTR_ERR(buf->dmabuf);
> > > +	}
> > > +
> > > +	ret = dma_buf_fd(buf->dmabuf, fd_flags);
> > > +	if (ret < 0) {
> > > +		dev_err(buf->dev, "failed to get dmabuf fd: %d\n", ret);
> > > +		return ret;
> > > +	}
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +static const struct dma_heap_ops dev_heap_ops = {
> > > +	.allocate = dev_heap_allocate,
> > > +};
> > > +
> > > +void dev_dma_heap_add(struct device *dev)
> > > +{
> > > +	struct dma_heap_export_info exp_info;
> > > +
> > > +	exp_info.name = dev_name(dev);
> > > +	exp_info.ops = &dev_heap_ops;
> > > +	exp_info.priv = dev;
> > > +
> > > +	dev->heap = dma_heap_add(&exp_info);
> > > +}
> > > +EXPORT_SYMBOL(dev_dma_heap_add);
> > > diff --git a/include/linux/device.h b/include/linux/device.h
> > > index ca18da4768e3..1fae95d55ea1 100644
> > > --- a/include/linux/device.h
> > > +++ b/include/linux/device.h
> > > @@ -45,6 +45,7 @@ struct iommu_ops;
> > >  struct iommu_group;
> > >  struct dev_pin_info;
> > >  struct dev_iommu;
> > > +struct dma_heap;
> > >  
> > >  /**
> > >   * struct subsys_interface - interfaces to device functions
> > > @@ -597,6 +598,10 @@ struct device {
> > >  	struct iommu_group	*iommu_group;
> > >  	struct dev_iommu	*iommu;
> > >  
> > > +#ifdef CONFIG_DMABUF_HEAPS_DEVICES
> > > +	struct dma_heap		*heap;
> > > +#endif
> > > +
> > >  	bool			offline_disabled:1;
> > >  	bool			offline:1;
> > >  	bool			of_node_reused:1;
> > > diff --git a/include/linux/dma-heap.h b/include/linux/dma-heap.h
> > > index 454e354d1ffb..dcf7cca2f487 100644
> > > --- a/include/linux/dma-heap.h
> > > +++ b/include/linux/dma-heap.h
> > > @@ -56,4 +56,10 @@ void *dma_heap_get_drvdata(struct dma_heap *heap);
> > >   */
> > >  struct dma_heap *dma_heap_add(const struct dma_heap_export_info *exp_info);
> > >  
> > > +#ifdef CONFIG_DMABUF_HEAPS_DEVICES
> > > +void dev_dma_heap_add(struct device *dev);
> > > +#else
> > > +static inline void dev_dma_heap_add(struct device *dev) {}
> > > +#endif
> > > +
> > >  #endif /* _DMA_HEAPS_H */
> 
> -- 
> Regards,
> 
> Laurent Pinchart
Daniel Vetter Sept. 16, 2020, 5:01 p.m. UTC | #20
On Mon, Aug 17, 2020 at 9:09 AM Ezequiel Garcia <ezequiel@collabora.com> wrote:
>
> This heap is basically a wrapper around DMA-API dma_alloc_attrs,
> which will allocate memory suitable for the given device.
>
> The implementation is mostly a port of the Contiguous Videobuf2
> memory allocator (see videobuf2/videobuf2-dma-contig.c)
> over to the DMA-BUF Heap interface.
>
> The intention of this allocator is to provide applications
> with a more system-agnostic API: the only thing the application
> needs to know is which device to get the buffer for.
>
> Whether the buffer is backed by CMA, IOMMU or a DMA Pool
> is unknown to the application.
>
> I'm not really expecting this patch to be correct or even
> a good idea, but just submitting it to start a discussion on DMA-BUF
> heap discovery and negotiation.
>
> Given Plumbers is just a couple weeks from now, I've submitted
> a BoF proposal to discuss this, as perhaps it would make
> sense to discuss this live?
>
> Not-signed-off-by: Ezequiel Garcia <ezequiel@collabora.com>

Adding Simon and James, who just gave a presentation at XDC about
allocation constraint solving, in which heap ids came up. There's
going to be a discussion tomorrow in the workshop track; it might be
worth trying to join if you can make it happen (XDC is virtual and
free).

Cheers, Daniel

diff mbox series

Patch

diff --git a/drivers/dma-buf/heaps/Kconfig b/drivers/dma-buf/heaps/Kconfig
index a5eef06c4226..2bb3604184bd 100644
--- a/drivers/dma-buf/heaps/Kconfig
+++ b/drivers/dma-buf/heaps/Kconfig
@@ -12,3 +12,12 @@  config DMABUF_HEAPS_CMA
 	  Choose this option to enable dma-buf CMA heap. This heap is backed
 	  by the Contiguous Memory Allocator (CMA). If your system has these
 	  regions, you should say Y here.
+
+config DMABUF_HEAPS_DEVICES
+	bool "DMA-BUF Device DMA Heap (Experimental)"
+	depends on DMABUF_HEAPS
+	help
+	  Choose this option to enable dma-buf per-device heap. This heap is backed
+	  by the DMA-API and it's an Experimental feature, meant mostly for testing
+	  and experimentation.
+	  Just say N here.
diff --git a/drivers/dma-buf/heaps/Makefile b/drivers/dma-buf/heaps/Makefile
index 6e54cdec3da0..c691d85b3044 100644
--- a/drivers/dma-buf/heaps/Makefile
+++ b/drivers/dma-buf/heaps/Makefile
@@ -2,3 +2,4 @@ 
 obj-y					+= heap-helpers.o
 obj-$(CONFIG_DMABUF_HEAPS_SYSTEM)	+= system_heap.o
 obj-$(CONFIG_DMABUF_HEAPS_CMA)		+= cma_heap.o
+obj-$(CONFIG_DMABUF_HEAPS_DEVICES)	+= device_heap.o
diff --git a/drivers/dma-buf/heaps/device_heap.c b/drivers/dma-buf/heaps/device_heap.c
new file mode 100644
index 000000000000..1803dc622dd8
--- /dev/null
+++ b/drivers/dma-buf/heaps/device_heap.c
@@ -0,0 +1,268 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * DMABUF Device DMA heap exporter
+ *
+ * Copyright (C) 2020, Collabora Ltd.
+ *
+ * Based on:
+ *   videobuf2-dma-contig.c - DMA contig memory allocator for videobuf2
+ *   Copyright (C) 2010 Samsung Electronics
+ */
+
+#include <linux/device.h>
+#include <linux/dma-buf.h>
+#include <linux/dma-heap.h>
+#include <linux/dma-mapping.h>
+#include <linux/scatterlist.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+
+struct dev_dmabuf_attachment {
+	struct sg_table sgt;
+	enum dma_data_direction dma_dir;
+};
+
+struct dev_dmabuf {
+	struct dma_heap *heap;
+	struct dma_buf *dmabuf;
+	struct device *dev;
+	size_t size;
+	void *vaddr;
+	dma_addr_t dma_addr;
+	unsigned long attrs;
+
+	struct sg_table sgt;
+};
+
+static struct sg_table *dev_dmabuf_ops_map(struct dma_buf_attachment *db_attach,
+					   enum dma_data_direction dma_dir)
+{
+	struct dev_dmabuf_attachment *attach = db_attach->priv;
+	/* stealing dmabuf mutex to serialize map/unmap operations */
+	struct mutex *lock = &db_attach->dmabuf->lock;
+	struct sg_table *sgt;
+
+	mutex_lock(lock);
+
+	sgt = &attach->sgt;
+	/* return previously mapped sg table */
+	if (attach->dma_dir == dma_dir) {
+		mutex_unlock(lock);
+		return sgt;
+	}
+
+	/* release any previous cache */
+	if (attach->dma_dir != DMA_NONE) {
+		dma_unmap_sg_attrs(db_attach->dev, sgt->sgl, sgt->orig_nents,
+				   attach->dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
+		attach->dma_dir = DMA_NONE;
+	}
+
+	/*
+	 * mapping to the client with new direction, no cache sync
+	 * required see comment in .dmabuf_ops_detach()
+	 */
+	sgt->nents = dma_map_sg_attrs(db_attach->dev, sgt->sgl, sgt->orig_nents,
+				      dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
+	if (!sgt->nents) {
+		dev_err(db_attach->dev, "failed to map scatterlist\n");
+		mutex_unlock(lock);
+		return ERR_PTR(-EIO);
+	}
+
+	attach->dma_dir = dma_dir;
+
+	mutex_unlock(lock);
+
+	return sgt;
+}
+
+static void dev_dmabuf_ops_unmap(struct dma_buf_attachment *db_attach,
+				 struct sg_table *sgt,
+				 enum dma_data_direction dma_dir)
+{
+	/* nothing to be done here */
+}
+
+static int dev_dmabuf_ops_attach(struct dma_buf *dmabuf,
+				 struct dma_buf_attachment *dbuf_attach)
+{
+	struct dev_dmabuf_attachment *attach;
+	unsigned int i;
+	struct scatterlist *rd, *wr;
+	struct sg_table *sgt;
+	struct dev_dmabuf *buf = dmabuf->priv;
+	int ret;
+
+	attach = kzalloc(sizeof(*attach), GFP_KERNEL);
+	if (!attach)
+		return -ENOMEM;
+	sgt = &attach->sgt;
+
+	/*
+	 * Copy the buf->sgt scatter list to the attachment, as we can't
+	 * map the same scatter list to multiple attachments at the same time.
+	 */
+	ret = sg_alloc_table(sgt, buf->sgt.orig_nents, GFP_KERNEL);
+	if (ret) {
+		kfree(attach);
+		return -ENOMEM;
+	}
+
+	rd = buf->sgt.sgl;
+	wr = sgt->sgl;
+	for (i = 0; i < sgt->orig_nents; ++i) {
+		sg_set_page(wr, sg_page(rd), rd->length, rd->offset);
+		rd = sg_next(rd);
+		wr = sg_next(wr);
+	}
+
+	attach->dma_dir = DMA_NONE;
+	dbuf_attach->priv = attach;
+
+	return 0;
+}
+
+static void dev_dmabuf_ops_detach(struct dma_buf *dmabuf,
+				  struct dma_buf_attachment *db_attach)
+{
+	struct dev_dmabuf_attachment *attach = db_attach->priv;
+	struct sg_table *sgt;
+
+	if (!attach)
+		return;
+	sgt = &attach->sgt;
+
+	/* release the scatterlist cache */
+	if (attach->dma_dir != DMA_NONE)
+		/*
+		 * Cache sync can be skipped here, as the memory is
+		 * allocated from device coherent memory, which means the
+		 * memory locations do not require any explicit cache
+		 * maintenance prior or after being used by the device.
+		 *
+		 * XXX: This needs a revisit.
+		 */
+		dma_unmap_sg_attrs(db_attach->dev, sgt->sgl, sgt->orig_nents,
+				   attach->dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
+	sg_free_table(sgt);
+	kfree(attach);
+	db_attach->priv = NULL;
+}
+
+
+static void *dev_dmabuf_ops_vmap(struct dma_buf *dmabuf)
+{
+	struct dev_dmabuf *buf = dmabuf->priv;
+
+	return buf->vaddr;
+}
+
+static void dev_dmabuf_ops_release(struct dma_buf *dmabuf)
+{
+	struct dev_dmabuf *buf = dmabuf->priv;
+
+	sg_free_table(&buf->sgt);
+	dma_free_attrs(buf->dev, buf->size, buf->vaddr,
+		       buf->dma_addr, buf->attrs);
+	put_device(buf->dev);
+	kfree(buf);
+}
+
+static int dev_dmabuf_ops_mmap(struct dma_buf *dmabuf,
+			       struct vm_area_struct *vma)
+{
+	struct dev_dmabuf *buf = dmabuf->priv;
+	int ret;
+
+	ret = dma_mmap_attrs(buf->dev, vma, buf->vaddr,
+			     buf->dma_addr, buf->size,
+			     buf->attrs);
+	if (ret) {
+		dev_err(buf->dev, "remapping memory failed, error: %d\n", ret);
+		return ret;
+	}
+	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+
+	return 0;
+}
+
+static const struct dma_buf_ops dev_dmabuf_ops = {
+	.attach = dev_dmabuf_ops_attach,
+	.detach = dev_dmabuf_ops_detach,
+	.map_dma_buf = dev_dmabuf_ops_map,
+	.unmap_dma_buf = dev_dmabuf_ops_unmap,
+	.vmap = dev_dmabuf_ops_vmap,
+	.mmap = dev_dmabuf_ops_mmap,
+	.release = dev_dmabuf_ops_release,
+};
+
+static int dev_heap_allocate(struct dma_heap *heap,
+			unsigned long size,
+			unsigned long fd_flags,
+			unsigned long heap_flags)
+{
+	struct device *dev = dma_heap_get_drvdata(heap);
+	struct dev_dmabuf *buf;
+	struct dma_buf_export_info exp_info = {};
+	unsigned long attrs = 0;
+	int ret = -ENOMEM;
+
+	buf = kzalloc(sizeof(*buf), GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	buf->vaddr = dma_alloc_attrs(dev, size, &buf->dma_addr,
+				     GFP_KERNEL, attrs);
+	/* Prevent the device from being released while the buffer is used */
+	buf->dev = get_device(dev);
+	buf->heap = heap;
+	buf->size = size;
+	buf->attrs = attrs;
+
+	/* XXX: This call is documented as unsafe. See dma_get_sgtable_attrs(). */
+	ret = dma_get_sgtable_attrs(buf->dev, &buf->sgt,
+				    buf->vaddr, buf->dma_addr,
+				    buf->size, buf->attrs);
+	if (ret < 0) {
+		dev_err(buf->dev, "failed to get scatterlist from DMA API\n");
+		return ret;
+	}
+
+	exp_info.exp_name = dev_name(dev);
+	exp_info.owner = THIS_MODULE;
+	exp_info.ops = &dev_dmabuf_ops;
+	exp_info.size = size;
+	exp_info.flags = fd_flags;
+	exp_info.priv = buf;
+
+	buf->dmabuf = dma_buf_export(&exp_info);
+	if (IS_ERR(buf->dmabuf)) {
+		dev_err(buf->dev, "failed to export dmabuf\n");
+		return PTR_ERR(buf->dmabuf);
+	}
+
+	ret = dma_buf_fd(buf->dmabuf, fd_flags);
+	if (ret < 0) {
+		dev_err(buf->dev, "failed to get dmabuf fd: %d\n", ret);
+		return ret;
+	}
+
+	return ret;
+}
+
+static const struct dma_heap_ops dev_heap_ops = {
+	.allocate = dev_heap_allocate,
+};
+
+void dev_dma_heap_add(struct device *dev)
+{
+	struct dma_heap_export_info exp_info;
+
+	exp_info.name = dev_name(dev);
+	exp_info.ops = &dev_heap_ops;
+	exp_info.priv = dev;
+
+	dev->heap = dma_heap_add(&exp_info);
+}
+EXPORT_SYMBOL(dev_dma_heap_add);
diff --git a/include/linux/device.h b/include/linux/device.h
index ca18da4768e3..1fae95d55ea1 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -45,6 +45,7 @@  struct iommu_ops;
 struct iommu_group;
 struct dev_pin_info;
 struct dev_iommu;
+struct dma_heap;
 
 /**
  * struct subsys_interface - interfaces to device functions
@@ -597,6 +598,10 @@  struct device {
 	struct iommu_group	*iommu_group;
 	struct dev_iommu	*iommu;
 
+#ifdef CONFIG_DMABUF_HEAPS_DEVICES
+	struct dma_heap		*heap;
+#endif
+
 	bool			offline_disabled:1;
 	bool			offline:1;
 	bool			of_node_reused:1;
diff --git a/include/linux/dma-heap.h b/include/linux/dma-heap.h
index 454e354d1ffb..dcf7cca2f487 100644
--- a/include/linux/dma-heap.h
+++ b/include/linux/dma-heap.h
@@ -56,4 +56,10 @@  void *dma_heap_get_drvdata(struct dma_heap *heap);
  */
 struct dma_heap *dma_heap_add(const struct dma_heap_export_info *exp_info);
 
+#ifdef CONFIG_DMABUF_HEAPS_DEVICES
+void dev_dma_heap_add(struct device *dev);
+#else
+static inline void dev_dma_heap_add(struct device *dev) {}
+#endif
+
 #endif /* _DMA_HEAPS_H */
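
On the driver side, the only hook the patch adds is dev_dma_heap_add(), so
opting a device in is a one-line call once its DMA setup is done. A sketch
against a hypothetical platform driver follows; the driver name and
structure are illustrative, and only dev_dma_heap_add() comes from the
patch itself.

/*
 * Hypothetical platform driver registering its device as a DMA-BUF heap
 * via the helper added by this patch.
 */
#include <linux/dma-heap.h>
#include <linux/module.h>
#include <linux/platform_device.h>

static int demo_codec_probe(struct platform_device *pdev)
{
	/*
	 * Expose a heap named after dev_name(&pdev->dev), backed by
	 * whatever the DMA API picks for this device (CMA, IOMMU, pool).
	 * In a real driver this would come after dma_set_mask() and any
	 * other DMA configuration.
	 */
	dev_dma_heap_add(&pdev->dev);
	return 0;
}

static struct platform_driver demo_codec_driver = {
	.probe	= demo_codec_probe,
	.driver	= {
		.name = "demo-codec",
	},
};
module_platform_driver(demo_codec_driver);

MODULE_LICENSE("GPL v2");

Note that the RFC adds no corresponding removal call, so teardown on driver
unbind is one of the open questions such a registration would have to
address.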