diff mbox

[RFC,v2] dmabuf-sync: Introduce buffer synchronization framework

Message ID 1371467722-665-1-git-send-email-inki.dae@samsung.com (mailing list archive)
State New, archived
Headers show

Commit Message

Inki Dae June 17, 2013, 11:15 a.m. UTC
This patch adds a buffer synchronization framework based on DMA BUF[1]
and reservation[2] to use dma-buf resource, and based on ww-mutexes[3]
for lock mechanism.

The purpose of this framework is not only to couple cache operations,
and buffer access control to CPU and DMA but also to provide easy-to-use
interfaces for device drivers and potentially user application
(not implemented for user applications, yet). And this framework can be
used for all dma devices using system memory as dma buffer, especially
for most ARM based SoCs.

Changelog v2:
- use atomic_add_unless to avoid potential bug.
- add a macro for checking valid access type.
- code clean.

The mechanism of this framework has the following steps,
    1. Register dmabufs to a sync object - A task gets a new sync object and
    can add one or more dmabufs that the task wants to access.
    This registering should be performed when a device context or an event
    context such as a page flip event is created or before CPU accesses a shared
    buffer.

	dma_buf_sync_get(a sync object, a dmabuf);

    2. Lock a sync object - A task tries to lock all dmabufs added in its own
    sync object. Basically, the lock mechanism uses ww-mutex[1] to avoid dead
    lock issue and for race condition between CPU and CPU, CPU and DMA, and DMA
    and DMA. Taking a lock means that others cannot access all locked dmabufs
    until the task that locked the corresponding dmabufs, unlocks all the locked
    dmabufs.
    This locking should be performed before DMA or CPU accesses these dmabufs.

	dma_buf_sync_lock(a sync object);

    3. Unlock a sync object - The task unlocks all dmabufs added in its own sync
    object. The unlock means that the DMA or CPU accesses to the dmabufs have
    been completed so that others may access them.
    This unlocking should be performed after DMA or CPU has completed accesses
    to the dmabufs.

	dma_buf_sync_unlock(a sync object);

    4. Unregister one or all dmabufs from a sync object - A task unregisters
    the given dmabufs from the sync object. This means that the task dosen't
    want to lock the dmabufs.
    The unregistering should be performed after DMA or CPU has completed
    accesses to the dmabufs or when dma_buf_sync_lock() is failed.

	dma_buf_sync_put(a sync object, a dmabuf);
	dma_buf_sync_put_all(a sync object);

    The described steps may be summarized as:
	get -> lock -> CPU or DMA access to a buffer/s -> unlock -> put

This framework includes the following two features.
    1. read (shared) and write (exclusive) locks - A task is required to declare
    the access type when the task tries to register a dmabuf;
    READ, WRITE, READ DMA, or WRITE DMA.

    The below is example codes,
	struct dmabuf_sync *sync;

	sync = dmabuf_sync_init(NULL, "test sync");

	dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
	...

	And the below can be used as access types:
		DMA_BUF_ACCESS_READ,
		- CPU will access a buffer for read.
		DMA_BUF_ACCESS_WRITE,
		- CPU will access a buffer for read or write.
		DMA_BUF_ACCESS_READ | DMA_BUF_ACCESS_DMA,
		- DMA will access a buffer for read
		DMA_BUF_ACCESS_WRITE | DMA_BUF_ACCESS_DMA,
		- DMA will access a buffer for read or write.

    2. Mandatory resource releasing - a task cannot hold a lock indefinitely.
    A task may never try to unlock a buffer after taking a lock to the buffer.
    In this case, a timer handler to the corresponding sync object is called
    in five (default) seconds and then the timed-out buffer is unlocked by work
    queue handler to avoid lockups and to enforce resources of the buffer.

The below is how to use:
	1. Allocate and Initialize a sync object:
		struct dmabuf_sync *sync;

		sync = dmabuf_sync_init(NULL, "test sync");
		...

	2. Add a dmabuf to the sync object when setting up dma buffer relevant
	   registers:
		dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
		...

	3. Lock all dmabufs of the sync object before DMA or CPU accesses
	   the dmabufs:
		dmabuf_sync_lock(sync);
		...

	4. Now CPU or DMA can access all dmabufs locked in step 3.

	5. Unlock all dmabufs added in a sync object after DMA or CPU access
	   to these dmabufs is completed:
		dmabuf_sync_unlock(sync);

	   And call the following functions to release all resources,
		dmabuf_sync_put_all(sync);
		dmabuf_sync_fini(sync);

	You can refer to actual example codes:
		https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/
		commit/?h=dmabuf-sync&id=4030bdee9bab5841ad32faade528d04cc0c5fc94

		https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/
		commit/?h=dmabuf-sync&id=6ca548e9ea9e865592719ef6b1cde58366af9f5c

The framework performs cache operation based on the previous and current access
types to the dmabufs after the locks to all dmabufs are taken:
	Call dma_buf_begin_cpu_access() to invalidate cache if,
		previous access type is DMA_BUF_ACCESS_WRITE | DMA and
		current access type is DMA_BUF_ACCESS_READ

	Call dma_buf_end_cpu_access() to clean cache if,
		previous access type is DMA_BUF_ACCESS_WRITE and
		current access type is DMA_BUF_ACCESS_READ | DMA

Such cache operations are invoked via dma-buf interfaces so the dma buf exporter
should implement dmabuf->ops->begin_cpu_access/end_cpu_access callbacks.

[1] http://lwn.net/Articles/470339/
[2] http://lwn.net/Articles/532616/
[3] https://patchwork-mail1.kernel.org/patch/2625321/

Signed-off-by: Inki Dae <inki.dae@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
---
 Documentation/dma-buf-sync.txt |  246 ++++++++++++++++++
 drivers/base/Kconfig           |    7 +
 drivers/base/Makefile          |    1 +
 drivers/base/dmabuf-sync.c     |  545 ++++++++++++++++++++++++++++++++++++++++
 include/linux/dma-buf.h        |   14 +
 include/linux/dmabuf-sync.h    |  115 +++++++++
 include/linux/reservation.h    |    7 +
 7 files changed, 935 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/dma-buf-sync.txt
 create mode 100644 drivers/base/dmabuf-sync.c
 create mode 100644 include/linux/dmabuf-sync.h

Comments

Maarten Lankhorst June 17, 2013, 11:34 a.m. UTC | #1
Op 17-06-13 13:15, Inki Dae schreef:
> This patch adds a buffer synchronization framework based on DMA BUF[1]
> and reservation[2] to use dma-buf resource, and based on ww-mutexes[3]
> for lock mechanism.
>
> The purpose of this framework is not only to couple cache operations,
> and buffer access control to CPU and DMA but also to provide easy-to-use
> interfaces for device drivers and potentially user application
> (not implemented for user applications, yet). And this framework can be
> used for all dma devices using system memory as dma buffer, especially
> for most ARM based SoCs.
>
> Changelog v2:
> - use atomic_add_unless to avoid potential bug.
> - add a macro for checking valid access type.
> - code clean.
>
> The mechanism of this framework has the following steps,
>     1. Register dmabufs to a sync object - A task gets a new sync object and
>     can add one or more dmabufs that the task wants to access.
>     This registering should be performed when a device context or an event
>     context such as a page flip event is created or before CPU accesses a shared
>     buffer.
>
> 	dma_buf_sync_get(a sync object, a dmabuf);
>
>     2. Lock a sync object - A task tries to lock all dmabufs added in its own
>     sync object. Basically, the lock mechanism uses ww-mutex[1] to avoid dead
>     lock issue and for race condition between CPU and CPU, CPU and DMA, and DMA
>     and DMA. Taking a lock means that others cannot access all locked dmabufs
>     until the task that locked the corresponding dmabufs, unlocks all the locked
>     dmabufs.
>     This locking should be performed before DMA or CPU accesses these dmabufs.
>
> 	dma_buf_sync_lock(a sync object);
>
>     3. Unlock a sync object - The task unlocks all dmabufs added in its own sync
>     object. The unlock means that the DMA or CPU accesses to the dmabufs have
>     been completed so that others may access them.
>     This unlocking should be performed after DMA or CPU has completed accesses
>     to the dmabufs.
>
> 	dma_buf_sync_unlock(a sync object);
>
>     4. Unregister one or all dmabufs from a sync object - A task unregisters
>     the given dmabufs from the sync object. This means that the task dosen't
>     want to lock the dmabufs.
>     The unregistering should be performed after DMA or CPU has completed
>     accesses to the dmabufs or when dma_buf_sync_lock() is failed.
>
> 	dma_buf_sync_put(a sync object, a dmabuf);
> 	dma_buf_sync_put_all(a sync object);
>
>     The described steps may be summarized as:
> 	get -> lock -> CPU or DMA access to a buffer/s -> unlock -> put
>
> This framework includes the following two features.
>     1. read (shared) and write (exclusive) locks - A task is required to declare
>     the access type when the task tries to register a dmabuf;
>     READ, WRITE, READ DMA, or WRITE DMA.
>
>     The below is example codes,
> 	struct dmabuf_sync *sync;
>
> 	sync = dmabuf_sync_init(NULL, "test sync");
>
> 	dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
> 	...
>
> 	And the below can be used as access types:
> 		DMA_BUF_ACCESS_READ,
> 		- CPU will access a buffer for read.
> 		DMA_BUF_ACCESS_WRITE,
> 		- CPU will access a buffer for read or write.
> 		DMA_BUF_ACCESS_READ | DMA_BUF_ACCESS_DMA,
> 		- DMA will access a buffer for read
> 		DMA_BUF_ACCESS_WRITE | DMA_BUF_ACCESS_DMA,
> 		- DMA will access a buffer for read or write.
>
>     2. Mandatory resource releasing - a task cannot hold a lock indefinitely.
>     A task may never try to unlock a buffer after taking a lock to the buffer.
>     In this case, a timer handler to the corresponding sync object is called
>     in five (default) seconds and then the timed-out buffer is unlocked by work
>     queue handler to avoid lockups and to enforce resources of the buffer.
>
> The below is how to use:
> 	1. Allocate and Initialize a sync object:
> 		struct dmabuf_sync *sync;
>
> 		sync = dmabuf_sync_init(NULL, "test sync");
> 		...
>
> 	2. Add a dmabuf to the sync object when setting up dma buffer relevant
> 	   registers:
> 		dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
> 		...
>
> 	3. Lock all dmabufs of the sync object before DMA or CPU accesses
> 	   the dmabufs:
> 		dmabuf_sync_lock(sync);
> 		...
>
> 	4. Now CPU or DMA can access all dmabufs locked in step 3.
>
> 	5. Unlock all dmabufs added in a sync object after DMA or CPU access
> 	   to these dmabufs is completed:
> 		dmabuf_sync_unlock(sync);
>
> 	   And call the following functions to release all resources,
> 		dmabuf_sync_put_all(sync);
> 		dmabuf_sync_fini(sync);
>
> 	You can refer to actual example codes:
> 		https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/
> 		commit/?h=dmabuf-sync&id=4030bdee9bab5841ad32faade528d04cc0c5fc94
>
> 		https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/
> 		commit/?h=dmabuf-sync&id=6ca548e9ea9e865592719ef6b1cde58366af9f5c
>
> The framework performs cache operation based on the previous and current access
> types to the dmabufs after the locks to all dmabufs are taken:
> 	Call dma_buf_begin_cpu_access() to invalidate cache if,
> 		previous access type is DMA_BUF_ACCESS_WRITE | DMA and
> 		current access type is DMA_BUF_ACCESS_READ
>
> 	Call dma_buf_end_cpu_access() to clean cache if,
> 		previous access type is DMA_BUF_ACCESS_WRITE and
> 		current access type is DMA_BUF_ACCESS_READ | DMA
>
> Such cache operations are invoked via dma-buf interfaces so the dma buf exporter
> should implement dmabuf->ops->begin_cpu_access/end_cpu_access callbacks.
>
> [1] http://lwn.net/Articles/470339/
> [2] http://lwn.net/Articles/532616/
> [3] https://patchwork-mail1.kernel.org/patch/2625321/
>
Looks to me like you're just writing an api similar to the android syncpoint for this.
Is there any reason you had to reimplement the android syncpoint api?
I'm not going into a full review, you may wish to rethink the design first.
All the criticisms I had with the original design approach still apply.



A few things that stand out from a casual glance:

bool is_dmabuf_sync_supported(void)
-> any code that needs CONFIG_DMABUF_SYNC should select it in their kconfig
runtime enabling/disabling of synchronization is a recipe for disaster, remove the !CONFIG_DMABUF_SYNC fallbacks.
NEVER add a runtime way to influence locking behavior.

Considering you're also holding dmaobj->lock for the entire duration, is there any point to not simply call begin_cpu_access/end_cpu_access, and forget this ugly code ever existed?
I still don't see the problem you're trying to solve..

~Maarten
Inki Dae June 17, 2013, 1:04 p.m. UTC | #2
> -----Original Message-----
> From: Maarten Lankhorst [mailto:maarten.lankhorst@canonical.com]
> Sent: Monday, June 17, 2013 8:35 PM
> To: Inki Dae
> Cc: dri-devel@lists.freedesktop.org; linux-fbdev@vger.kernel.org; linux-
> arm-kernel@lists.infradead.org; linux-media@vger.kernel.org;
> daniel@ffwll.ch; robdclark@gmail.com; kyungmin.park@samsung.com;
> myungjoo.ham@samsung.com; yj44.cho@samsung.com
> Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
> framework
> 
> Op 17-06-13 13:15, Inki Dae schreef:
> > This patch adds a buffer synchronization framework based on DMA BUF[1]
> > and reservation[2] to use dma-buf resource, and based on ww-mutexes[3]
> > for lock mechanism.
> >
> > The purpose of this framework is not only to couple cache operations,
> > and buffer access control to CPU and DMA but also to provide easy-to-use
> > interfaces for device drivers and potentially user application
> > (not implemented for user applications, yet). And this framework can be
> > used for all dma devices using system memory as dma buffer, especially
> > for most ARM based SoCs.
> >
> > Changelog v2:
> > - use atomic_add_unless to avoid potential bug.
> > - add a macro for checking valid access type.
> > - code clean.
> >
> > The mechanism of this framework has the following steps,
> >     1. Register dmabufs to a sync object - A task gets a new sync object
> and
> >     can add one or more dmabufs that the task wants to access.
> >     This registering should be performed when a device context or an
> event
> >     context such as a page flip event is created or before CPU accesses
a
> shared
> >     buffer.
> >
> > 	dma_buf_sync_get(a sync object, a dmabuf);
> >
> >     2. Lock a sync object - A task tries to lock all dmabufs added in
its
> own
> >     sync object. Basically, the lock mechanism uses ww-mutex[1] to avoid
> dead
> >     lock issue and for race condition between CPU and CPU, CPU and DMA,
> and DMA
> >     and DMA. Taking a lock means that others cannot access all locked
> dmabufs
> >     until the task that locked the corresponding dmabufs, unlocks all
the
> locked
> >     dmabufs.
> >     This locking should be performed before DMA or CPU accesses these
> dmabufs.
> >
> > 	dma_buf_sync_lock(a sync object);
> >
> >     3. Unlock a sync object - The task unlocks all dmabufs added in its
> own sync
> >     object. The unlock means that the DMA or CPU accesses to the dmabufs
> have
> >     been completed so that others may access them.
> >     This unlocking should be performed after DMA or CPU has completed
> accesses
> >     to the dmabufs.
> >
> > 	dma_buf_sync_unlock(a sync object);
> >
> >     4. Unregister one or all dmabufs from a sync object - A task
> unregisters
> >     the given dmabufs from the sync object. This means that the task
> dosen't
> >     want to lock the dmabufs.
> >     The unregistering should be performed after DMA or CPU has completed
> >     accesses to the dmabufs or when dma_buf_sync_lock() is failed.
> >
> > 	dma_buf_sync_put(a sync object, a dmabuf);
> > 	dma_buf_sync_put_all(a sync object);
> >
> >     The described steps may be summarized as:
> > 	get -> lock -> CPU or DMA access to a buffer/s -> unlock -> put
> >
> > This framework includes the following two features.
> >     1. read (shared) and write (exclusive) locks - A task is required to
> declare
> >     the access type when the task tries to register a dmabuf;
> >     READ, WRITE, READ DMA, or WRITE DMA.
> >
> >     The below is example codes,
> > 	struct dmabuf_sync *sync;
> >
> > 	sync = dmabuf_sync_init(NULL, "test sync");
> >
> > 	dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
> > 	...
> >
> > 	And the below can be used as access types:
> > 		DMA_BUF_ACCESS_READ,
> > 		- CPU will access a buffer for read.
> > 		DMA_BUF_ACCESS_WRITE,
> > 		- CPU will access a buffer for read or write.
> > 		DMA_BUF_ACCESS_READ | DMA_BUF_ACCESS_DMA,
> > 		- DMA will access a buffer for read
> > 		DMA_BUF_ACCESS_WRITE | DMA_BUF_ACCESS_DMA,
> > 		- DMA will access a buffer for read or write.
> >
> >     2. Mandatory resource releasing - a task cannot hold a lock
> indefinitely.
> >     A task may never try to unlock a buffer after taking a lock to the
> buffer.
> >     In this case, a timer handler to the corresponding sync object is
> called
> >     in five (default) seconds and then the timed-out buffer is unlocked
> by work
> >     queue handler to avoid lockups and to enforce resources of the
buffer.
> >
> > The below is how to use:
> > 	1. Allocate and Initialize a sync object:
> > 		struct dmabuf_sync *sync;
> >
> > 		sync = dmabuf_sync_init(NULL, "test sync");
> > 		...
> >
> > 	2. Add a dmabuf to the sync object when setting up dma buffer
> relevant
> > 	   registers:
> > 		dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
> > 		...
> >
> > 	3. Lock all dmabufs of the sync object before DMA or CPU accesses
> > 	   the dmabufs:
> > 		dmabuf_sync_lock(sync);
> > 		...
> >
> > 	4. Now CPU or DMA can access all dmabufs locked in step 3.
> >
> > 	5. Unlock all dmabufs added in a sync object after DMA or CPU
> access
> > 	   to these dmabufs is completed:
> > 		dmabuf_sync_unlock(sync);
> >
> > 	   And call the following functions to release all resources,
> > 		dmabuf_sync_put_all(sync);
> > 		dmabuf_sync_fini(sync);
> >
> > 	You can refer to actual example codes:
> > 		https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
> exynos.git/
> > 		commit/?h=dmabuf-
> sync&id=4030bdee9bab5841ad32faade528d04cc0c5fc94
> >
> > 		https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
> exynos.git/
> > 		commit/?h=dmabuf-
> sync&id=6ca548e9ea9e865592719ef6b1cde58366af9f5c
> >
> > The framework performs cache operation based on the previous and current
> access
> > types to the dmabufs after the locks to all dmabufs are taken:
> > 	Call dma_buf_begin_cpu_access() to invalidate cache if,
> > 		previous access type is DMA_BUF_ACCESS_WRITE | DMA and
> > 		current access type is DMA_BUF_ACCESS_READ
> >
> > 	Call dma_buf_end_cpu_access() to clean cache if,
> > 		previous access type is DMA_BUF_ACCESS_WRITE and
> > 		current access type is DMA_BUF_ACCESS_READ | DMA
> >
> > Such cache operations are invoked via dma-buf interfaces so the dma buf
> exporter
> > should implement dmabuf->ops->begin_cpu_access/end_cpu_access callbacks.
> >
> > [1] http://lwn.net/Articles/470339/
> > [2] http://lwn.net/Articles/532616/
> > [3] https://patchwork-mail1.kernel.org/patch/2625321/
> >
> Looks to me like you're just writing an api similar to the android
> syncpoint for this.
> Is there any reason you had to reimplement the android syncpoint api?

Right, only difference is that maybe android sync driver, you mentioned as
syncpoint, doesn't use dma-buf resource. What I try to do is familiar to
android's one and also ARM's KDS (Kernel Dependency System). I think I
already mentioned enough through a document file about why I try to
implement this approach based on dma-buf; coupling cache operation and
buffer synchronization between CPU and DMA.

> I'm not going into a full review, you may wish to rethink the design
first.
> All the criticisms I had with the original design approach still apply.
> 

Isn't that enough if what I try to do is similar to android sync driver?
It's very simple and that's all I try to do.:)

> 
> 
> A few things that stand out from a casual glance:
> 
> bool is_dmabuf_sync_supported(void)
> -> any code that needs CONFIG_DMABUF_SYNC should select it in their
> kconfig
> runtime enabling/disabling of synchronization is a recipe for disaster,
> remove the !CONFIG_DMABUF_SYNC fallbacks.
> NEVER add a runtime way to influence locking behavior.
> 

Not enabling/disabling synchronization feature in runtime. That is
determined at build time.

> Considering you're also holding dmaobj->lock for the entire duration, is
> there any point to not simply call begin_cpu_access/end_cpu_access, and
> forget this ugly code ever existed?

You mean mutex_lock(&sync->lock)? Yeah, it seems unnecessary in that case.

> I still don't see the problem you're trying to solve..
> 

It's just to implement a thin sync framework coupling cache operation. This
approach is based on dma-buf for more generic implementation against android
sync driver or KDS.

The described steps may be summarized as:
	lock -> cache operation -> CPU or DMA access to a buffer/s -> unlock

I think that there is no need to get complicated for such approach at least
for most devices sharing system memory. Simple is best.


Thanks,
Inki Dae

> ~Maarten
Russell King - ARM Linux June 17, 2013, 1:31 p.m. UTC | #3
On Mon, Jun 17, 2013 at 10:04:45PM +0900, Inki Dae wrote:
> It's just to implement a thin sync framework coupling cache operation. This
> approach is based on dma-buf for more generic implementation against android
> sync driver or KDS.
> 
> The described steps may be summarized as:
> 	lock -> cache operation -> CPU or DMA access to a buffer/s -> unlock
> 
> I think that there is no need to get complicated for such approach at least
> for most devices sharing system memory. Simple is best.

But hang on, doesn't the dmabuf API already provide that?

The dmabuf API already uses dma_map_sg() and dma_unmap_sg() by providers,
and the rules around the DMA API are that:

	dma_map_sg()
	/* DMA _ONLY_ has access, CPU should not access */
	dma_unmap_sg()
	/* DMA may not access, CPU can access */

It's a little more than that if you include the sync_sg_for_cpu and
sync_sg_for_device APIs too - but the above is the general idea.  What
this means from the dmabuf API point of view is that once you attach to
a dma_buf, and call dma_buf_map_attachment() to get the SG list, the CPU
doesn't have ownership of the buffer and _must_ _not_ access it via any
other means - including using the other dma_buf methods, until either
the appropriate dma_sync_sg_for_cpu() call has been made or the DMA
mapping has been removed via dma_buf_unmap_attachment().

So, the sequence should be:

	dma_buf_map_attachment()
	/* do DMA */
	dma_buf_unmap_attachment()
	/* CPU can now access the buffer */
Maarten Lankhorst June 17, 2013, 1:31 p.m. UTC | #4
Op 17-06-13 15:04, Inki Dae schreef:
>
>> -----Original Message-----
>> From: Maarten Lankhorst [mailto:maarten.lankhorst@canonical.com]
>> Sent: Monday, June 17, 2013 8:35 PM
>> To: Inki Dae
>> Cc: dri-devel@lists.freedesktop.org; linux-fbdev@vger.kernel.org; linux-
>> arm-kernel@lists.infradead.org; linux-media@vger.kernel.org;
>> daniel@ffwll.ch; robdclark@gmail.com; kyungmin.park@samsung.com;
>> myungjoo.ham@samsung.com; yj44.cho@samsung.com
>> Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
>> framework
>>
>> Op 17-06-13 13:15, Inki Dae schreef:
>>> This patch adds a buffer synchronization framework based on DMA BUF[1]
>>> and reservation[2] to use dma-buf resource, and based on ww-mutexes[3]
>>> for lock mechanism.
>>>
>>> The purpose of this framework is not only to couple cache operations,
>>> and buffer access control to CPU and DMA but also to provide easy-to-use
>>> interfaces for device drivers and potentially user application
>>> (not implemented for user applications, yet). And this framework can be
>>> used for all dma devices using system memory as dma buffer, especially
>>> for most ARM based SoCs.
>>>
>>> Changelog v2:
>>> - use atomic_add_unless to avoid potential bug.
>>> - add a macro for checking valid access type.
>>> - code clean.
>>>
>>> The mechanism of this framework has the following steps,
>>>     1. Register dmabufs to a sync object - A task gets a new sync object
>> and
>>>     can add one or more dmabufs that the task wants to access.
>>>     This registering should be performed when a device context or an
>> event
>>>     context such as a page flip event is created or before CPU accesses
> a
>> shared
>>>     buffer.
>>>
>>> 	dma_buf_sync_get(a sync object, a dmabuf);
>>>
>>>     2. Lock a sync object - A task tries to lock all dmabufs added in
> its
>> own
>>>     sync object. Basically, the lock mechanism uses ww-mutex[1] to avoid
>> dead
>>>     lock issue and for race condition between CPU and CPU, CPU and DMA,
>> and DMA
>>>     and DMA. Taking a lock means that others cannot access all locked
>> dmabufs
>>>     until the task that locked the corresponding dmabufs, unlocks all
> the
>> locked
>>>     dmabufs.
>>>     This locking should be performed before DMA or CPU accesses these
>> dmabufs.
>>> 	dma_buf_sync_lock(a sync object);
>>>
>>>     3. Unlock a sync object - The task unlocks all dmabufs added in its
>> own sync
>>>     object. The unlock means that the DMA or CPU accesses to the dmabufs
>> have
>>>     been completed so that others may access them.
>>>     This unlocking should be performed after DMA or CPU has completed
>> accesses
>>>     to the dmabufs.
>>>
>>> 	dma_buf_sync_unlock(a sync object);
>>>
>>>     4. Unregister one or all dmabufs from a sync object - A task
>> unregisters
>>>     the given dmabufs from the sync object. This means that the task
>> dosen't
>>>     want to lock the dmabufs.
>>>     The unregistering should be performed after DMA or CPU has completed
>>>     accesses to the dmabufs or when dma_buf_sync_lock() is failed.
>>>
>>> 	dma_buf_sync_put(a sync object, a dmabuf);
>>> 	dma_buf_sync_put_all(a sync object);
>>>
>>>     The described steps may be summarized as:
>>> 	get -> lock -> CPU or DMA access to a buffer/s -> unlock -> put
>>>
>>> This framework includes the following two features.
>>>     1. read (shared) and write (exclusive) locks - A task is required to
>> declare
>>>     the access type when the task tries to register a dmabuf;
>>>     READ, WRITE, READ DMA, or WRITE DMA.
>>>
>>>     The below is example codes,
>>> 	struct dmabuf_sync *sync;
>>>
>>> 	sync = dmabuf_sync_init(NULL, "test sync");
>>>
>>> 	dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
>>> 	...
>>>
>>> 	And the below can be used as access types:
>>> 		DMA_BUF_ACCESS_READ,
>>> 		- CPU will access a buffer for read.
>>> 		DMA_BUF_ACCESS_WRITE,
>>> 		- CPU will access a buffer for read or write.
>>> 		DMA_BUF_ACCESS_READ | DMA_BUF_ACCESS_DMA,
>>> 		- DMA will access a buffer for read
>>> 		DMA_BUF_ACCESS_WRITE | DMA_BUF_ACCESS_DMA,
>>> 		- DMA will access a buffer for read or write.
>>>
>>>     2. Mandatory resource releasing - a task cannot hold a lock
>> indefinitely.
>>>     A task may never try to unlock a buffer after taking a lock to the
>> buffer.
>>>     In this case, a timer handler to the corresponding sync object is
>> called
>>>     in five (default) seconds and then the timed-out buffer is unlocked
>> by work
>>>     queue handler to avoid lockups and to enforce resources of the
> buffer.
>>> The below is how to use:
>>> 	1. Allocate and Initialize a sync object:
>>> 		struct dmabuf_sync *sync;
>>>
>>> 		sync = dmabuf_sync_init(NULL, "test sync");
>>> 		...
>>>
>>> 	2. Add a dmabuf to the sync object when setting up dma buffer
>> relevant
>>> 	   registers:
>>> 		dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
>>> 		...
>>>
>>> 	3. Lock all dmabufs of the sync object before DMA or CPU accesses
>>> 	   the dmabufs:
>>> 		dmabuf_sync_lock(sync);
>>> 		...
>>>
>>> 	4. Now CPU or DMA can access all dmabufs locked in step 3.
>>>
>>> 	5. Unlock all dmabufs added in a sync object after DMA or CPU
>> access
>>> 	   to these dmabufs is completed:
>>> 		dmabuf_sync_unlock(sync);
>>>
>>> 	   And call the following functions to release all resources,
>>> 		dmabuf_sync_put_all(sync);
>>> 		dmabuf_sync_fini(sync);
>>>
>>> 	You can refer to actual example codes:
>>> 		https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
>> exynos.git/
>>> 		commit/?h=dmabuf-
>> sync&id=4030bdee9bab5841ad32faade528d04cc0c5fc94
>>> 		https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
>> exynos.git/
>>> 		commit/?h=dmabuf-
>> sync&id=6ca548e9ea9e865592719ef6b1cde58366af9f5c
>>> The framework performs cache operation based on the previous and current
>> access
>>> types to the dmabufs after the locks to all dmabufs are taken:
>>> 	Call dma_buf_begin_cpu_access() to invalidate cache if,
>>> 		previous access type is DMA_BUF_ACCESS_WRITE | DMA and
>>> 		current access type is DMA_BUF_ACCESS_READ
>>>
>>> 	Call dma_buf_end_cpu_access() to clean cache if,
>>> 		previous access type is DMA_BUF_ACCESS_WRITE and
>>> 		current access type is DMA_BUF_ACCESS_READ | DMA
>>>
>>> Such cache operations are invoked via dma-buf interfaces so the dma buf
>> exporter
>>> should implement dmabuf->ops->begin_cpu_access/end_cpu_access callbacks.
>>>
>>> [1] http://lwn.net/Articles/470339/
>>> [2] http://lwn.net/Articles/532616/
>>> [3] https://patchwork-mail1.kernel.org/patch/2625321/
>>>
>> Looks to me like you're just writing an api similar to the android
>> syncpoint for this.
>> Is there any reason you had to reimplement the android syncpoint api?
> Right, only difference is that maybe android sync driver, you mentioned as
> syncpoint, doesn't use dma-buf resource. What I try to do is familiar to
> android's one and also ARM's KDS (Kernel Dependency System). I think I
> already mentioned enough through a document file about why I try to
> implement this approach based on dma-buf; coupling cache operation and
> buffer synchronization between CPU and DMA.
Making sure caching works correctly can be handled entirely in the begin_cpu_access/end_cpu_access callbacks, without requiring such.. framework

>> I'm not going into a full review, you may wish to rethink the design
> first.
>> All the criticisms I had with the original design approach still apply.
>>
> Isn't that enough if what I try to do is similar to android sync driver?
> It's very simple and that's all I try to do.:)
From what I can tell you should be able to make it work with just the use of fences, you don't need to keep the buffers locked for the entire duration,
just plug in the last fence in the correct slots and  you're done..

I'm not sure if the android sync api is compatible enough, but you shouldn't need to write code in this way to make it work..
>>
>> A few things that stand out from a casual glance:
>>
>> bool is_dmabuf_sync_supported(void)
>> -> any code that needs CONFIG_DMABUF_SYNC should select it in their
>> kconfig
>> runtime enabling/disabling of synchronization is a recipe for disaster,
>> remove the !CONFIG_DMABUF_SYNC fallbacks.
>> NEVER add a runtime way to influence locking behavior.
>>
> Not enabling/disabling synchronization feature in runtime. That is
> determined at build time.
>
>> Considering you're also holding dmaobj->lock for the entire duration, is
>> there any point to not simply call begin_cpu_access/end_cpu_access, and
>> forget this ugly code ever existed?
> You mean mutex_lock(&sync->lock)? Yeah, it seems unnecessary in that case.
>
>> I still don't see the problem you're trying to solve..
>>
> It's just to implement a thin sync framework coupling cache operation. This
> approach is based on dma-buf for more generic implementation against android
> sync driver or KDS.
>
> The described steps may be summarized as:
> 	lock -> cache operation -> CPU or DMA access to a buffer/s -> unlock
>
> I think that there is no need to get complicated for such approach at least
> for most devices sharing system memory. Simple is best.
>
>
> Thanks,
> Inki Dae
>
>> ~Maarten
Inki Dae June 17, 2013, 3:03 p.m. UTC | #5
2013/6/17 Russell King - ARM Linux <linux@arm.linux.org.uk>

> On Mon, Jun 17, 2013 at 10:04:45PM +0900, Inki Dae wrote:
> > It's just to implement a thin sync framework coupling cache operation.
> This
> > approach is based on dma-buf for more generic implementation against
> android
> > sync driver or KDS.
> >
> > The described steps may be summarized as:
> >       lock -> cache operation -> CPU or DMA access to a buffer/s ->
> unlock
> >
> > I think that there is no need to get complicated for such approach at
> least
> > for most devices sharing system memory. Simple is best.
>
> But hang on, doesn't the dmabuf API already provide that?
>
> The dmabuf API already uses dma_map_sg() and dma_unmap_sg() by providers,
> and the rules around the DMA API are that:
>
>         dma_map_sg()
>         /* DMA _ONLY_ has access, CPU should not access */
>         dma_unmap_sg()
>         /* DMA may not access, CPU can access */
>
> It's a little more than that if you include the sync_sg_for_cpu and
> sync_sg_for_device APIs too - but the above is the general idea.  What
> this means from the dmabuf API point of view is that once you attach to
> a dma_buf, and call dma_buf_map_attachment() to get the SG list, the CPU
> doesn't have ownership of the buffer and _must_ _not_ access it via any
> other means - including using the other dma_buf methods, until either
> the appropriate dma_sync_sg_for_cpu() call has been made or the DMA
> mapping has been removed via dma_buf_unmap_attachment().
>
> So, the sequence should be:
>
>         dma_buf_map_attachment()
>         /* do DMA */
>         dma_buf_unmap_attachment()
>         /* CPU can now access the buffer */
>

Exactly right. But that is not definitely my point. Could you please see
the below simple example?:
(Presume that CPU and DMA share a buffer and the buffer is mapped with user
space as cachable)

        handle1 = drm_gem_fd_to_handle(a dmabuf fd);  ----> 1
                 ...
        va1 = drm_gem_mmap(handle1);
        va2 = drm_gem_mmap(handle2);
        va3 = malloc(size);
                 ...

        while (conditions) {
                 memcpy(va1, some data, size);
                         ...
                 drm_xxx_set_src_dma_buffer(handle1, ...);
                         ...
                 drm_xxx_set_dst_dma_buffer(handle2, ...);
                         ...
                 /* user need to request cache clean at here. */   ----> 2
                         ...
                 /* blocked until dma operation is completed. */
                drm_xxx_start_dma(...);
                         ...
                 /* user need to request cache invalidate at here. */ ---->
3
                memcpy(va3, va2, size);
        }

        gem_close(handle1); -----> 4



dma_buf_map_attachment() is called by number 1 and
dma_buf_unmap_attachement() is called by number 4. However, how about
number 2 and number 3? My point is those. As I already mentioned through a
document file, user prcesses would need to request repeatedly cache
operation, number 2 and 3 as long as the conditions is true because kernel
side cannot be aware of when cpu accesses a shard buffer in this case. And
such codes could be used by user app here and there. However, we cannot
prevent such app from requesting excessive cache operations although such
app are overusing cache operations unnecessarily. So my approach is just
initial idea to prevent the overuse of cache operation: with this approach,
all cache operations are done in kernel side and that is why I try to
coulpe cache operation and buffer synchronization between CPU and DMA.

Thanks,
Inki Dae


> --
> To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
Inki Dae June 17, 2013, 3:20 p.m. UTC | #6
2013/6/17 Maarten Lankhorst <maarten.lankhorst@canonical.com>

> Op 17-06-13 15:04, Inki Dae schreef:
> >
> >> -----Original Message-----
> >> From: Maarten Lankhorst [mailto:maarten.lankhorst@canonical.com]
> >> Sent: Monday, June 17, 2013 8:35 PM
> >> To: Inki Dae
> >> Cc: dri-devel@lists.freedesktop.org; linux-fbdev@vger.kernel.org;
> linux-
> >> arm-kernel@lists.infradead.org; linux-media@vger.kernel.org;
> >> daniel@ffwll.ch; robdclark@gmail.com; kyungmin.park@samsung.com;
> >> myungjoo.ham@samsung.com; yj44.cho@samsung.com
> >> Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer
> synchronization
> >> framework
> >>
> >> Op 17-06-13 13:15, Inki Dae schreef:
> >>> This patch adds a buffer synchronization framework based on DMA BUF[1]
> >>> and reservation[2] to use dma-buf resource, and based on ww-mutexes[3]
> >>> for lock mechanism.
> >>>
> >>> The purpose of this framework is not only to couple cache operations,
> >>> and buffer access control to CPU and DMA but also to provide
> easy-to-use
> >>> interfaces for device drivers and potentially user application
> >>> (not implemented for user applications, yet). And this framework can be
> >>> used for all dma devices using system memory as dma buffer, especially
> >>> for most ARM based SoCs.
> >>>
> >>> Changelog v2:
> >>> - use atomic_add_unless to avoid potential bug.
> >>> - add a macro for checking valid access type.
> >>> - code clean.
> >>>
> >>> The mechanism of this framework has the following steps,
> >>>     1. Register dmabufs to a sync object - A task gets a new sync
> object
> >> and
> >>>     can add one or more dmabufs that the task wants to access.
> >>>     This registering should be performed when a device context or an
> >> event
> >>>     context such as a page flip event is created or before CPU accesses
> > a
> >> shared
> >>>     buffer.
> >>>
> >>>     dma_buf_sync_get(a sync object, a dmabuf);
> >>>
> >>>     2. Lock a sync object - A task tries to lock all dmabufs added in
> > its
> >> own
> >>>     sync object. Basically, the lock mechanism uses ww-mutex[1] to
> avoid
> >> dead
> >>>     lock issue and for race condition between CPU and CPU, CPU and DMA,
> >> and DMA
> >>>     and DMA. Taking a lock means that others cannot access all locked
> >> dmabufs
> >>>     until the task that locked the corresponding dmabufs, unlocks all
> > the
> >> locked
> >>>     dmabufs.
> >>>     This locking should be performed before DMA or CPU accesses these
> >> dmabufs.
> >>>     dma_buf_sync_lock(a sync object);
> >>>
> >>>     3. Unlock a sync object - The task unlocks all dmabufs added in its
> >> own sync
> >>>     object. The unlock means that the DMA or CPU accesses to the
> dmabufs
> >> have
> >>>     been completed so that others may access them.
> >>>     This unlocking should be performed after DMA or CPU has completed
> >> accesses
> >>>     to the dmabufs.
> >>>
> >>>     dma_buf_sync_unlock(a sync object);
> >>>
> >>>     4. Unregister one or all dmabufs from a sync object - A task
> >> unregisters
> >>>     the given dmabufs from the sync object. This means that the task
> >> dosen't
> >>>     want to lock the dmabufs.
> >>>     The unregistering should be performed after DMA or CPU has
> completed
> >>>     accesses to the dmabufs or when dma_buf_sync_lock() is failed.
> >>>
> >>>     dma_buf_sync_put(a sync object, a dmabuf);
> >>>     dma_buf_sync_put_all(a sync object);
> >>>
> >>>     The described steps may be summarized as:
> >>>     get -> lock -> CPU or DMA access to a buffer/s -> unlock -> put
> >>>
> >>> This framework includes the following two features.
> >>>     1. read (shared) and write (exclusive) locks - A task is required
> to
> >> declare
> >>>     the access type when the task tries to register a dmabuf;
> >>>     READ, WRITE, READ DMA, or WRITE DMA.
> >>>
> >>>     The below is example codes,
> >>>     struct dmabuf_sync *sync;
> >>>
> >>>     sync = dmabuf_sync_init(NULL, "test sync");
> >>>
> >>>     dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
> >>>     ...
> >>>
> >>>     And the below can be used as access types:
> >>>             DMA_BUF_ACCESS_READ,
> >>>             - CPU will access a buffer for read.
> >>>             DMA_BUF_ACCESS_WRITE,
> >>>             - CPU will access a buffer for read or write.
> >>>             DMA_BUF_ACCESS_READ | DMA_BUF_ACCESS_DMA,
> >>>             - DMA will access a buffer for read
> >>>             DMA_BUF_ACCESS_WRITE | DMA_BUF_ACCESS_DMA,
> >>>             - DMA will access a buffer for read or write.
> >>>
> >>>     2. Mandatory resource releasing - a task cannot hold a lock
> >> indefinitely.
> >>>     A task may never try to unlock a buffer after taking a lock to the
> >> buffer.
> >>>     In this case, a timer handler to the corresponding sync object is
> >> called
> >>>     in five (default) seconds and then the timed-out buffer is unlocked
> >> by work
> >>>     queue handler to avoid lockups and to enforce resources of the
> > buffer.
> >>> The below is how to use:
> >>>     1. Allocate and Initialize a sync object:
> >>>             struct dmabuf_sync *sync;
> >>>
> >>>             sync = dmabuf_sync_init(NULL, "test sync");
> >>>             ...
> >>>
> >>>     2. Add a dmabuf to the sync object when setting up dma buffer
> >> relevant
> >>>        registers:
> >>>             dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
> >>>             ...
> >>>
> >>>     3. Lock all dmabufs of the sync object before DMA or CPU accesses
> >>>        the dmabufs:
> >>>             dmabuf_sync_lock(sync);
> >>>             ...
> >>>
> >>>     4. Now CPU or DMA can access all dmabufs locked in step 3.
> >>>
> >>>     5. Unlock all dmabufs added in a sync object after DMA or CPU
> >> access
> >>>        to these dmabufs is completed:
> >>>             dmabuf_sync_unlock(sync);
> >>>
> >>>        And call the following functions to release all resources,
> >>>             dmabuf_sync_put_all(sync);
> >>>             dmabuf_sync_fini(sync);
> >>>
> >>>     You can refer to actual example codes:
> >>>             https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
> >> exynos.git/
> >>>             commit/?h=dmabuf-
> >> sync&id=4030bdee9bab5841ad32faade528d04cc0c5fc94
> >>>             https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
> >> exynos.git/
> >>>             commit/?h=dmabuf-
> >> sync&id=6ca548e9ea9e865592719ef6b1cde58366af9f5c
> >>> The framework performs cache operation based on the previous and
> current
> >> access
> >>> types to the dmabufs after the locks to all dmabufs are taken:
> >>>     Call dma_buf_begin_cpu_access() to invalidate cache if,
> >>>             previous access type is DMA_BUF_ACCESS_WRITE | DMA and
> >>>             current access type is DMA_BUF_ACCESS_READ
> >>>
> >>>     Call dma_buf_end_cpu_access() to clean cache if,
> >>>             previous access type is DMA_BUF_ACCESS_WRITE and
> >>>             current access type is DMA_BUF_ACCESS_READ | DMA
> >>>
> >>> Such cache operations are invoked via dma-buf interfaces so the dma buf
> >> exporter
> >>> should implement dmabuf->ops->begin_cpu_access/end_cpu_access
> callbacks.
> >>>
> >>> [1] http://lwn.net/Articles/470339/
> >>> [2] http://lwn.net/Articles/532616/
> >>> [3] https://patchwork-mail1.kernel.org/patch/2625321/
> >>>
> >> Looks to me like you're just writing an api similar to the android
> >> syncpoint for this.
> >> Is there any reason you had to reimplement the android syncpoint api?
> > Right, only difference is that maybe android sync driver, you mentioned
> as
> > syncpoint, doesn't use dma-buf resource. What I try to do is familiar to
> > android's one and also ARM's KDS (Kernel Dependency System). I think I
> > already mentioned enough through a document file about why I try to
> > implement this approach based on dma-buf; coupling cache operation and
> > buffer synchronization between CPU and DMA.
> Making sure caching works correctly can be handled entirely in the
> begin_cpu_access/end_cpu_access callbacks, without requiring such..
> framework


Cache work will be handled by the begin_cpu_access/end_cpu_access
callbacks, actually by dmabuf exporter side. Could you give me your
comments again after reading my answer to Ressell? I'd like to make you
understand what I try to do.


> >> I'm not going into a full review, you may wish to rethink the design
> > first.
> >> All the criticisms I had with the original design approach still apply.
> >>
> > Isn't that enough if what I try to do is similar to android sync driver?
> > It's very simple and that's all I try to do.:)
> From what I can tell you should be able to make it work with just the use
> of fences, you don't need to keep the buffers locked for the entire
> duration,
> just plug in the last fence in the correct slots and  you're done..
>

I'm afraid that I don't understand what you say. My approach doesn't use
dma fence stuff anymore. Maybe it seems missing something. Could you show
me simple example? A little summary is ok. Otherwise, where I can refer
to example codes?

Thanks,
Inki Dae


>



I'm not sure if the android sync api is compatible enough, but you
> shouldn't need to write code in this way to make it work..
> >>
> >> A few things that stand out from a casual glance:
> >>
> >> bool is_dmabuf_sync_supported(void)
> >> -> any code that needs CONFIG_DMABUF_SYNC should select it in their
> >> kconfig
> >> runtime enabling/disabling of synchronization is a recipe for disaster,
> >> remove the !CONFIG_DMABUF_SYNC fallbacks.
> >> NEVER add a runtime way to influence locking behavior.
> >>
> > Not enabling/disabling synchronization feature in runtime. That is
> > determined at build time.
> >
> >> Considering you're also holding dmaobj->lock for the entire duration, is
> >> there any point to not simply call begin_cpu_access/end_cpu_access, and
> >> forget this ugly code ever existed?
> > You mean mutex_lock(&sync->lock)? Yeah, it seems unnecessary in that
> case.
> >
> >> I still don't see the problem you're trying to solve..
> >>
> > It's just to implement a thin sync framework coupling cache operation.
> This
> > approach is based on dma-buf for more generic implementation against
> android
> > sync driver or KDS.
> >
> > The described steps may be summarized as:
> >       lock -> cache operation -> CPU or DMA access to a buffer/s ->
> unlock
> >
> > I think that there is no need to get complicated for such approach at
> least
> > for most devices sharing system memory. Simple is best.
> >
> >
> > Thanks,
> > Inki Dae
> >
> >> ~Maarten
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
Russell King - ARM Linux June 17, 2013, 3:42 p.m. UTC | #7
On Tue, Jun 18, 2013 at 12:03:31AM +0900, Inki Dae wrote:
> 2013/6/17 Russell King - ARM Linux <linux@arm.linux.org.uk>
> Exactly right. But that is not definitely my point. Could you please see
> the below simple example?:
> (Presume that CPU and DMA share a buffer and the buffer is mapped with user
> space as cachable)
> 
>         handle1 = drm_gem_fd_to_handle(a dmabuf fd);  ----> 1
>                  ...
>         va1 = drm_gem_mmap(handle1);
>         va2 = drm_gem_mmap(handle2);
>         va3 = malloc(size);
>                  ...
> 
>         while (conditions) {
>                  memcpy(va1, some data, size);

Nooooooooooooooooooooooooooooooooooooooooooooo!

Well, the first thing to say here is that under the requirements of the
DMA API, the above is immediately invalid, because you're writing to a
buffer which under the terms of the DMA API is currently owned by the
DMA agent, *not* by the CPU.  You're supposed to call dma_sync_sg_for_cpu()
before you do that - but how is userspace supposed to know that requirement?
Why should userspace even _have_ to know these requirements of the DMA
API?

It's also entirely possible that drm_gem_fd_to_handle() (which indirectly
causes dma_map_sg() on the buffers scatterlist) followed by mmap'ing it
into userspace is a bug too, as it has the potential to touch caches or
stuff in ways that maybe the DMA or IOMMU may not expect - but I'm not
going to make too big a deal about that, because I don't think we have
anything that picky.

However, the first point above is the most important one, and exposing
the quirks of the DMA API to userland is certainly not a nice thing to be
doing.  This needs to be fixed - we can't go and enforce an API which is
deeply embedded within the kernel all the way out to userland.

What we need is something along the lines of:
(a) dma_buf_map_attachment() _not_ to map the scatterlist for DMA.
or
(b) drm_gem_prime_import() not to call dma_buf_map_attachment() at all.

and for the scatterlist to be mapped for DMA at the point where the DMA
operation is initiated, and unmapped at the point where the DMA operation
is complete.

So no, the problem is not that we need more APIs and code - we need the
existing kernel API fixed so that we don't go exposing userspace to the
requirements of the DMA API.  Unless we do that, we're going to end
up with a huge world of pain, where kernel architecture people need to
audit every damned DRM userspace implementation that happens to be run
on their platform, and that's not something arch people really can
afford to do.

Basically, I think the dma_buf stuff needs to be rewritten with the
requirements of the DMA API in the forefront of whosever mind is doing
the rewriting.

Note: the existing stuff does have the nice side effect of being able
to pass buffers which do not have a struct page * associated with them
through the dma_buf API - I think we can still preserve that by having
dma_buf provide a couple of new APIs to do the SG list map/sync/unmap,
but in any case we need to fix the existing API so that:

	dma_buf_map_attachment() becomes dma_buf_get_sg()
	dma_buf_unmap_attachment() becomes dma_buf_put_sg()

both getting rid of the DMA direction argument, and then we have four
new dma_buf calls:

	dma_buf_map_sg()
	dma_buf_unmap_sg()
	dma_buf_sync_sg_for_cpu()
	dma_buf_sync_sg_for_device()

which do the actual sg map/unmap via the DMA API *at the appropriate
time for DMA*.

So, the summary of this is - at the moment, I regard DRM Prime and dmabuf
to be utterly broken in design for architectures such as ARM where the
requirements of the DMA API have to be followed if you're going to have
a happy life.
Russell King - ARM Linux June 17, 2013, 4:01 p.m. UTC | #8
On Mon, Jun 17, 2013 at 04:42:37PM +0100, Russell King - ARM Linux wrote:
> On Tue, Jun 18, 2013 at 12:03:31AM +0900, Inki Dae wrote:
> > 2013/6/17 Russell King - ARM Linux <linux@arm.linux.org.uk>
> > Exactly right. But that is not definitely my point. Could you please see
> > the below simple example?:
> > (Presume that CPU and DMA share a buffer and the buffer is mapped with user
> > space as cachable)
> > 
> >         handle1 = drm_gem_fd_to_handle(a dmabuf fd);  ----> 1
> >                  ...
> >         va1 = drm_gem_mmap(handle1);
> >         va2 = drm_gem_mmap(handle2);
> >         va3 = malloc(size);
> >                  ...
> > 
> >         while (conditions) {
> >                  memcpy(va1, some data, size);
> 
> Nooooooooooooooooooooooooooooooooooooooooooooo!
> 
> Well, the first thing to say here is that under the requirements of the
> DMA API, the above is immediately invalid, because you're writing to a
> buffer which under the terms of the DMA API is currently owned by the
> DMA agent, *not* by the CPU.  You're supposed to call dma_sync_sg_for_cpu()
> before you do that - but how is userspace supposed to know that requirement?
> Why should userspace even _have_ to know these requirements of the DMA
> API?
> 
> It's also entirely possible that drm_gem_fd_to_handle() (which indirectly
> causes dma_map_sg() on the buffers scatterlist) followed by mmap'ing it
> into userspace is a bug too, as it has the potential to touch caches or
> stuff in ways that maybe the DMA or IOMMU may not expect - but I'm not
> going to make too big a deal about that, because I don't think we have
> anything that picky.
> 
> However, the first point above is the most important one, and exposing
> the quirks of the DMA API to userland is certainly not a nice thing to be
> doing.  This needs to be fixed - we can't go and enforce an API which is
> deeply embedded within the kernel all the way out to userland.
> 
> What we need is something along the lines of:
> (a) dma_buf_map_attachment() _not_ to map the scatterlist for DMA.
> or
> (b) drm_gem_prime_import() not to call dma_buf_map_attachment() at all.
> 
> and for the scatterlist to be mapped for DMA at the point where the DMA
> operation is initiated, and unmapped at the point where the DMA operation
> is complete.
> 
> So no, the problem is not that we need more APIs and code - we need the
> existing kernel API fixed so that we don't go exposing userspace to the
> requirements of the DMA API.  Unless we do that, we're going to end
> up with a huge world of pain, where kernel architecture people need to
> audit every damned DRM userspace implementation that happens to be run
> on their platform, and that's not something arch people really can
> afford to do.
> 
> Basically, I think the dma_buf stuff needs to be rewritten with the
> requirements of the DMA API in the forefront of whosever mind is doing
> the rewriting.
> 
> Note: the existing stuff does have the nice side effect of being able
> to pass buffers which do not have a struct page * associated with them
> through the dma_buf API - I think we can still preserve that by having
> dma_buf provide a couple of new APIs to do the SG list map/sync/unmap,
> but in any case we need to fix the existing API so that:
> 
> 	dma_buf_map_attachment() becomes dma_buf_get_sg()
> 	dma_buf_unmap_attachment() becomes dma_buf_put_sg()
> 
> both getting rid of the DMA direction argument, and then we have four
> new dma_buf calls:
> 
> 	dma_buf_map_sg()
> 	dma_buf_unmap_sg()
> 	dma_buf_sync_sg_for_cpu()
> 	dma_buf_sync_sg_for_device()
> 
> which do the actual sg map/unmap via the DMA API *at the appropriate
> time for DMA*.
> 
> So, the summary of this is - at the moment, I regard DRM Prime and dmabuf
> to be utterly broken in design for architectures such as ARM where the
> requirements of the DMA API have to be followed if you're going to have
> a happy life.

An addendum to the above:

I'll also point out that the whole thing of having random buffers mapped
into userspace, and doing DMA on them is _highly_ architecture specific.
I'm told that PARISC is one architecture where this does not work (because
DMA needs to have some kind of congruence factor programmed into it which
can be different between kernel and userspace mappings of the same
physical mappings.

I ran into this when trying to come up with a cross-arch DMA-API mmap API,
and PARISC ended up killing the idea off (the remains of that attempt is
the dma_mmap_* stuff in ARM's asm/dma-mapping.h.)  However, this was for
the DMA coherent stuff, not the streaming API, which is what _this_
DMA prime/dma_buf stuff is about.

What I'm saying is think _very_ _carefully_ about userspace having
interleaved access to random DMA buffers.  Arguably, DRM should _not_
allow this.
Inki Dae June 17, 2013, 5:19 p.m. UTC | #9
2013/6/18 Russell King - ARM Linux <linux@arm.linux.org.uk>

> On Tue, Jun 18, 2013 at 12:03:31AM +0900, Inki Dae wrote:
> > 2013/6/17 Russell King - ARM Linux <linux@arm.linux.org.uk>
> > Exactly right. But that is not definitely my point. Could you please see
> > the below simple example?:
> > (Presume that CPU and DMA share a buffer and the buffer is mapped with
> user
> > space as cachable)
> >
> >         handle1 = drm_gem_fd_to_handle(a dmabuf fd);  ----> 1
> >                  ...
> >         va1 = drm_gem_mmap(handle1);
> >         va2 = drm_gem_mmap(handle2);
> >         va3 = malloc(size);
> >                  ...
> >
> >         while (conditions) {
> >                  memcpy(va1, some data, size);
>
> Nooooooooooooooooooooooooooooooooooooooooooooo!
>
> Well, the first thing to say here is that under the requirements of the
> DMA API, the above is immediately invalid, because you're writing to a
> buffer which under the terms of the DMA API is currently owned by the
> DMA agent, *not* by the CPU.  You're supposed to call dma_sync_sg_for_cpu()
> before you do that - but how is userspace supposed to know that
> requirement?
> Why should userspace even _have_ to know these requirements of the DMA
> API?
>
> It's also entirely possible that drm_gem_fd_to_handle() (which indirectly
> causes dma_map_sg() on the buffers scatterlist) followed by mmap'ing it
> into userspace is a bug too, as it has the potential to touch caches or
> stuff in ways that maybe the DMA or IOMMU may not expect - but I'm not
> going to make too big a deal about that, because I don't think we have
> anything that picky.
>
> However, the first point above is the most important one, and exposing
> the quirks of the DMA API to userland is certainly not a nice thing to be
> doing.  This needs to be fixed - we can't go and enforce an API which is
> deeply embedded within the kernel all the way out to userland.
>
> What we need is something along the lines of:
> (a) dma_buf_map_attachment() _not_ to map the scatterlist for DMA.
> or
> (b) drm_gem_prime_import() not to call dma_buf_map_attachment() at all.
>
> and for the scatterlist to be mapped for DMA at the point where the DMA
> operation is initiated, and unmapped at the point where the DMA operation
> is complete.
>

It seems like that all pages of the scatterlist should be mapped with
DMA every time DMA operation  is started (or drm_xxx_set_src_dma_buffer
function call), and the pages should be unmapped from DMA again every
time DMA operation is completed: internally, including each cache
operation. Isn't that big overhead? And If there is no problem, we should
accept such overhead? Actually, drm_gem_fd_to_handle() includes to map a
given dmabuf with iommu table (just logical data) of the DMA. And then, the
device address (or iova) already mapped will be set to buffer register of
the DMA with drm_xxx_set_src/dst_dma_buffer(handle1, ...) call.

Thanks,
Inki Dae


>
> So no, the problem is not that we need more APIs and code - we need the
> existing kernel API fixed so that we don't go exposing userspace to the
> requirements of the DMA API.  Unless we do that, we're going to end
> up with a huge world of pain, where kernel architecture people need to
> audit every damned DRM userspace implementation that happens to be run
> on their platform, and that's not something arch people really can
> afford to do.
>
> Basically, I think the dma_buf stuff needs to be rewritten with the
> requirements of the DMA API in the forefront of whosever mind is doing
> the rewriting.
>
> Note: the existing stuff does have the nice side effect of being able
> to pass buffers which do not have a struct page * associated with them
> through the dma_buf API - I think we can still preserve that by having
> dma_buf provide a couple of new APIs to do the SG list map/sync/unmap,
> but in any case we need to fix the existing API so that:
>
>         dma_buf_map_attachment() becomes dma_buf_get_sg()
>         dma_buf_unmap_attachment() becomes dma_buf_put_sg()
>
> both getting rid of the DMA direction argument, and then we have four
> new dma_buf calls:
>
>         dma_buf_map_sg()
>         dma_buf_unmap_sg()
>         dma_buf_sync_sg_for_cpu()
>         dma_buf_sync_sg_for_device()
>
> which do the actual sg map/unmap via the DMA API *at the appropriate
> time for DMA*.
>
> So, the summary of this is - at the moment, I regard DRM Prime and dmabuf
> to be utterly broken in design for architectures such as ARM where the
> requirements of the DMA API have to be followed if you're going to have
> a happy life.
>
Russell King - ARM Linux June 17, 2013, 6:21 p.m. UTC | #10
On Tue, Jun 18, 2013 at 02:19:04AM +0900, Inki Dae wrote:
> It seems like that all pages of the scatterlist should be mapped with
> DMA every time DMA operation  is started (or drm_xxx_set_src_dma_buffer
> function call), and the pages should be unmapped from DMA again every
> time DMA operation is completed: internally, including each cache
> operation.

Correct.

> Isn't that big overhead?

Yes, if you _have_ to do a cache operation to ensure that the DMA agent
can see the data the CPU has written.

> And If there is no problem, we should accept such overhead?

If there is no problem then why are we discussing this and why do we need
this API extension? :)

> Actually, drm_gem_fd_to_handle() includes to map a
> given dmabuf with iommu table (just logical data) of the DMA. And then, the
> device address (or iova) already mapped will be set to buffer register of
> the DMA with drm_xxx_set_src/dst_dma_buffer(handle1, ...) call.

Consider this with a PIPT cache:

	dma_map_sg()	- at this point, the kernel addresses of these
			buffers are cleaned and invalidated for the DMA

	userspace writes to the buffer, the data sits in the CPU cache
	Because the cache is PIPT, this data becomes visible to the
	kernel as well.

	DMA is started, and it writes to the buffer

Now, at this point, which is the correct data?  The data physically in the
RAM which the DMA has written, or the data in the CPU cache.  It may
the answer is - they both are, and the loss of either can be a potential
data corruption issue - there is no way to tell which data should be
kept but the system is in an inconsistent state and _one_ of them will
have to be discarded.

	dma_unmap_sg()	- at this point, the kernel addresses of the
			buffers are _invalidated_ and any data in those
			cache lines is discarded

Which also means that the data in userspace will also be discarded with
PIPT caches.

This is precisely why we have buffer rules associated with the DMA API,
which are these:

	dma_map_sg()
	- the buffer transfers ownership from the CPU to the DMA agent.
	- the CPU may not alter the buffer in any way.
	while (cpu_wants_access) {
		dma_sync_sg_for_cpu()
		- the buffer transfers ownership from the DMA to the CPU.
		- the DMA may not alter the buffer in any way.
		dma_sync_sg_for_device()
		- the buffer transfers ownership from the CPU to the DMA
		- the CPU may not alter the buffer in any way.
	}
	dma_unmap_sg()
	- the buffer transfers ownership from the DMA to the CPU.
	- the DMA may not alter the buffer in any way.

Any violation of that is likely to lead to data corruption.  Now, some
may say that the DMA API is only about the kernel mapping.  Yes it is,
because it takes no regard what so ever about what happens with the
userspace mappings.  This is absolutely true when you have VIVT or
aliasing VIPT caches.

Now consider that with a PIPT cache, or a non-aliasing VIPT cache (which
is exactly the same behaviourally from this aspect) any modification of
a userspace mapping is precisely the same as modifying the kernel space
mapping - and what you find is that the above rules end up _inherently_
applying to the userspace mappings as well.

In other words, userspace applications which have mapped the buffers
must _also_ respect these rules.

And there's no way in hell that I'd ever trust userspace to get this
anywhere near right, and I *really* do not like these kinds of internal
kernel API rules being exposed to userspace.

And so the idea that userspace should be allowed to setup DMA transfers
via "set source buffer", "set destination buffer" calls into an API is
just plain rediculous.  No, userspace should be allowed to ask for
"please copy from buffer X to buffer Y using this transform".

Also remember that drm_xxx_set_src/dst_dma_buffer() would have to
deal with the DRM object IDs for the buffer, and not the actual buffer
information themselves - that is kept within the kernel, so the kernel
knows what's happening.
Inki Dae June 18, 2013, 5:27 a.m. UTC | #11
> -----Original Message-----
> From: Russell King - ARM Linux [mailto:linux@arm.linux.org.uk]
> Sent: Tuesday, June 18, 2013 3:21 AM
> To: Inki Dae
> Cc: Maarten Lankhorst; linux-fbdev; Kyungmin Park; DRI mailing list; Rob
> Clark; myungjoo.ham; YoungJun Cho; Daniel Vetter; linux-arm-
> kernel@lists.infradead.org; linux-media@vger.kernel.org
> Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
> framework
> 
> On Tue, Jun 18, 2013 at 02:19:04AM +0900, Inki Dae wrote:
> > It seems like that all pages of the scatterlist should be mapped with
> > DMA every time DMA operation  is started (or drm_xxx_set_src_dma_buffer
> > function call), and the pages should be unmapped from DMA again every
> > time DMA operation is completed: internally, including each cache
> > operation.
> 
> Correct.
> 
> > Isn't that big overhead?
> 
> Yes, if you _have_ to do a cache operation to ensure that the DMA agent
> can see the data the CPU has written.
> 
> > And If there is no problem, we should accept such overhead?
> 
> If there is no problem then why are we discussing this and why do we need
> this API extension? :)

Ok, another issue regardless of dmabuf-sync. Reasonable to me even though
big overhead. Besides, it seems like that most DRM drivers have same issue.
Therefore, we may need to solve this issue like below:
	- do not map a dmabuf with DMA. And just create/update buffer object
of importer.
	- map the buffer with DMA every time DMA start or iommu page fault
occurs.
	- unmap the buffer from DMA every time DMA operation is completed

With the above approach, cache operation portion of my approach,
dmabuf-sync, can be removed. However, I'm not sure that we really have to
use the above approach with a big overhead. Of course, if we don't use the
above approach then user processes would need to do each cache operation
before DMA operation is started and also after DMA operation is completed in
some cases; user space mapped with physical memory as cachable, and CPU and
DMA share the same buffer.

So I'd like to ask for other DRM maintainers. How do you think about it? it
seems like that Intel DRM (maintained by Daniel), OMAP DRM (maintained by
Rob) and GEM CMA helper also have same issue Russell pointed out. I think
not only the above approach but also the performance is very important.

Thanks,
Inki Dae

> 
> > Actually, drm_gem_fd_to_handle() includes to map a
> > given dmabuf with iommu table (just logical data) of the DMA. And then,
> the
> > device address (or iova) already mapped will be set to buffer register
> of
> > the DMA with drm_xxx_set_src/dst_dma_buffer(handle1, ...) call.
> 
> Consider this with a PIPT cache:
> 
> 	dma_map_sg()	- at this point, the kernel addresses of these
> 			buffers are cleaned and invalidated for the DMA
> 
> 	userspace writes to the buffer, the data sits in the CPU cache
> 	Because the cache is PIPT, this data becomes visible to the
> 	kernel as well.
> 
> 	DMA is started, and it writes to the buffer
> 
> Now, at this point, which is the correct data?  The data physically in the
> RAM which the DMA has written, or the data in the CPU cache.  It may
> the answer is - they both are, and the loss of either can be a potential
> data corruption issue - there is no way to tell which data should be
> kept but the system is in an inconsistent state and _one_ of them will
> have to be discarded.
> 
> 	dma_unmap_sg()	- at this point, the kernel addresses of the
> 			buffers are _invalidated_ and any data in those
> 			cache lines is discarded
> 
> Which also means that the data in userspace will also be discarded with
> PIPT caches.
> 
> This is precisely why we have buffer rules associated with the DMA API,
> which are these:
> 
> 	dma_map_sg()
> 	- the buffer transfers ownership from the CPU to the DMA agent.
> 	- the CPU may not alter the buffer in any way.
> 	while (cpu_wants_access) {
> 		dma_sync_sg_for_cpu()
> 		- the buffer transfers ownership from the DMA to the CPU.
> 		- the DMA may not alter the buffer in any way.
> 		dma_sync_sg_for_device()
> 		- the buffer transfers ownership from the CPU to the DMA
> 		- the CPU may not alter the buffer in any way.
> 	}
> 	dma_unmap_sg()
> 	- the buffer transfers ownership from the DMA to the CPU.
> 	- the DMA may not alter the buffer in any way.
> 
> Any violation of that is likely to lead to data corruption.  Now, some
> may say that the DMA API is only about the kernel mapping.  Yes it is,
> because it takes no regard what so ever about what happens with the
> userspace mappings.  This is absolutely true when you have VIVT or
> aliasing VIPT caches.
> 
> Now consider that with a PIPT cache, or a non-aliasing VIPT cache (which
> is exactly the same behaviourally from this aspect) any modification of
> a userspace mapping is precisely the same as modifying the kernel space
> mapping - and what you find is that the above rules end up _inherently_
> applying to the userspace mappings as well.
> 
> In other words, userspace applications which have mapped the buffers
> must _also_ respect these rules.
> 
> And there's no way in hell that I'd ever trust userspace to get this
> anywhere near right, and I *really* do not like these kinds of internal
> kernel API rules being exposed to userspace.
> 
> And so the idea that userspace should be allowed to setup DMA transfers
> via "set source buffer", "set destination buffer" calls into an API is
> just plain rediculous.  No, userspace should be allowed to ask for
> "please copy from buffer X to buffer Y using this transform".
> 
> Also remember that drm_xxx_set_src/dst_dma_buffer() would have to
> deal with the DRM object IDs for the buffer, and not the actual buffer
> information themselves - that is kept within the kernel, so the kernel
> knows what's happening.
Daniel Vetter June 18, 2013, 7 a.m. UTC | #12
On Mon, Jun 17, 2013 at 04:42:37PM +0100, Russell King - ARM Linux wrote:
> On Tue, Jun 18, 2013 at 12:03:31AM +0900, Inki Dae wrote:
> > 2013/6/17 Russell King - ARM Linux <linux@arm.linux.org.uk>
> > Exactly right. But that is not definitely my point. Could you please see
> > the below simple example?:
> > (Presume that CPU and DMA share a buffer and the buffer is mapped with user
> > space as cachable)
> >
> >         handle1 = drm_gem_fd_to_handle(a dmabuf fd);  ----> 1
> >                  ...
> >         va1 = drm_gem_mmap(handle1);
> >         va2 = drm_gem_mmap(handle2);
> >         va3 = malloc(size);
> >                  ...
> >
> >         while (conditions) {
> >                  memcpy(va1, some data, size);
>
> Nooooooooooooooooooooooooooooooooooooooooooooo!
>
> Well, the first thing to say here is that under the requirements of the
> DMA API, the above is immediately invalid, because you're writing to a
> buffer which under the terms of the DMA API is currently owned by the
> DMA agent, *not* by the CPU.  You're supposed to call dma_sync_sg_for_cpu()
> before you do that - but how is userspace supposed to know that requirement?
> Why should userspace even _have_ to know these requirements of the DMA
> API?
>
> It's also entirely possible that drm_gem_fd_to_handle() (which indirectly
> causes dma_map_sg() on the buffers scatterlist) followed by mmap'ing it
> into userspace is a bug too, as it has the potential to touch caches or
> stuff in ways that maybe the DMA or IOMMU may not expect - but I'm not
> going to make too big a deal about that, because I don't think we have
> anything that picky.
>
> However, the first point above is the most important one, and exposing
> the quirks of the DMA API to userland is certainly not a nice thing to be
> doing.  This needs to be fixed - we can't go and enforce an API which is
> deeply embedded within the kernel all the way out to userland.
>
> What we need is something along the lines of:
> (a) dma_buf_map_attachment() _not_ to map the scatterlist for DMA.
> or
> (b) drm_gem_prime_import() not to call dma_buf_map_attachment() at all.

I strongly vote for (b) (plus adding a dma_buf_map_sync interface to allow
syncing to other devices/cpu whithout dropping the dma mappings). At least
that's been the idea behind things, but currently all (x86-based) drm
drivers cut corners here massively.

Aside: The entire reason behind hiding the dma map step in the dma-buf
attachment is to allow drivers to expose crazy iommus (e.g. the tiler unit
on omap) to other devices. So imo (a) isn't the right choice.
>
> and for the scatterlist to be mapped for DMA at the point where the DMA
> operation is initiated, and unmapped at the point where the DMA operation
> is complete.
>
> So no, the problem is not that we need more APIs and code - we need the
> existing kernel API fixed so that we don't go exposing userspace to the
> requirements of the DMA API.  Unless we do that, we're going to end
> up with a huge world of pain, where kernel architecture people need to
> audit every damned DRM userspace implementation that happens to be run
> on their platform, and that's not something arch people really can
> afford to do.
>
> Basically, I think the dma_buf stuff needs to be rewritten with the
> requirements of the DMA API in the forefront of whosever mind is doing
> the rewriting.
>
> Note: the existing stuff does have the nice side effect of being able
> to pass buffers which do not have a struct page * associated with them
> through the dma_buf API - I think we can still preserve that by having
> dma_buf provide a couple of new APIs to do the SG list map/sync/unmap,
> but in any case we need to fix the existing API so that:
>
> dma_buf_map_attachment() becomes dma_buf_get_sg()
> dma_buf_unmap_attachment() becomes dma_buf_put_sg()
>
> both getting rid of the DMA direction argument, and then we have four
> new dma_buf calls:
>
> dma_buf_map_sg()
> dma_buf_unmap_sg()
> dma_buf_sync_sg_for_cpu()
> dma_buf_sync_sg_for_device()
>
> which do the actual sg map/unmap via the DMA API *at the appropriate
> time for DMA*.

Hm, my idea was to just add a dma_buf_sync_attchment for the device side
syncing, since the cpu access stuff is already bracketed with the
begin/end cpu access stuff. We might need a sync_for_cpu or so for mmap,
but imo mmap support for dma_buf is a bit insane anyway, so I don't care
too much about it.

Since such dma mappings would be really longstanding in most cases anyway
drivers could just map with BIDIRECTIONAL and do all the real flushing
with the new sync stuff.

Another piece of fun will be integration with the fencing stuff Maarten is
doing. I'm not sure whether we should integrate that into the dma-buf
interface (for simple users who don't want to deal with the full
complexity at least).

>
> So, the summary of this is - at the moment, I regard DRM Prime and dmabuf
> to be utterly broken in design for architectures such as ARM where the
> requirements of the DMA API have to be followed if you're going to have
> a happy life.

Agreed. Unfortunately there are not many real drivers shipping in
upstream, and x86 is _very_ forgiving about all this stuff.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
Russell King - ARM Linux June 18, 2013, 8:43 a.m. UTC | #13
On Tue, Jun 18, 2013 at 02:27:40PM +0900, Inki Dae wrote:
> So I'd like to ask for other DRM maintainers. How do you think about it? it
> seems like that Intel DRM (maintained by Daniel), OMAP DRM (maintained by
> Rob) and GEM CMA helper also have same issue Russell pointed out. I think
> not only the above approach but also the performance is very important.

CMA uses coherent memory to back their buffers, though that might not be
true of memory obtained from other drivers via dma_buf.  Plus, there is
no support in the CMA helper for exporting or importng these buffers.

I guess Intel i915 is only used on x86, which is a coherent platform and
requires no cache maintanence for DMA.

OMAP DRM does not support importing non-DRM buffers buffers back into
DRM.  Moreover, it will suffer from the problems I described if any
attempt is made to write to the buffer after it has been re-imported.

Lastly, I should point out that the dma_buf stuff is really only useful
when you need to export a dma buffer from one driver and import it into
another driver - for example to pass data from a camera device driver to
a display device driver.  It shouldn't be used within a single driver
as a means of passing buffers between userspace and kernel space.
Inki Dae June 18, 2013, 9:04 a.m. UTC | #14
> -----Original Message-----
> From: Russell King - ARM Linux [mailto:linux@arm.linux.org.uk]
> Sent: Tuesday, June 18, 2013 5:43 PM
> To: Inki Dae
> Cc: 'Maarten Lankhorst'; 'linux-fbdev'; 'Kyungmin Park'; 'DRI mailing
> list'; 'Rob Clark'; 'myungjoo.ham'; 'YoungJun Cho'; 'Daniel Vetter';
> linux-arm-kernel@lists.infradead.org; linux-media@vger.kernel.org
> Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
> framework
> 
> On Tue, Jun 18, 2013 at 02:27:40PM +0900, Inki Dae wrote:
> > So I'd like to ask for other DRM maintainers. How do you think about it?
> it
> > seems like that Intel DRM (maintained by Daniel), OMAP DRM (maintained
> by
> > Rob) and GEM CMA helper also have same issue Russell pointed out. I
> think
> > not only the above approach but also the performance is very important.
> 
> CMA uses coherent memory to back their buffers, though that might not be
> true of memory obtained from other drivers via dma_buf.  Plus, there is
> no support in the CMA helper for exporting or importng these buffers.
> 

It's not so. Please see Dave's drm next. recently dmabuf support for the CMA
helper has been merged to there.

> I guess Intel i915 is only used on x86, which is a coherent platform and
> requires no cache maintanence for DMA.
> 
> OMAP DRM does not support importing non-DRM buffers buffers back into

Correct. TODO yet.

> DRM.  Moreover, it will suffer from the problems I described if any
> attempt is made to write to the buffer after it has been re-imported.
> 
> Lastly, I should point out that the dma_buf stuff is really only useful
> when you need to export a dma buffer from one driver and import it into
> another driver - for example to pass data from a camera device driver to

Most people know that.

> a display device driver.  It shouldn't be used within a single driver
> as a means of passing buffers between userspace and kernel space.

What I try to do is not really such ugly thing. What I try to do is to
notify that, when CPU tries to access a buffer , to kernel side through
dmabuf interface. So it's not really to send the buffer to kernel.

Thanks,
Inki Dae
Russell King - ARM Linux June 18, 2013, 9:38 a.m. UTC | #15
On Tue, Jun 18, 2013 at 06:04:44PM +0900, Inki Dae wrote:
> 
> 
> > -----Original Message-----
> > From: Russell King - ARM Linux [mailto:linux@arm.linux.org.uk]
> > Sent: Tuesday, June 18, 2013 5:43 PM
> > To: Inki Dae
> > Cc: 'Maarten Lankhorst'; 'linux-fbdev'; 'Kyungmin Park'; 'DRI mailing
> > list'; 'Rob Clark'; 'myungjoo.ham'; 'YoungJun Cho'; 'Daniel Vetter';
> > linux-arm-kernel@lists.infradead.org; linux-media@vger.kernel.org
> > Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
> > framework
> > 
> > On Tue, Jun 18, 2013 at 02:27:40PM +0900, Inki Dae wrote:
> > > So I'd like to ask for other DRM maintainers. How do you think about it?
> > it
> > > seems like that Intel DRM (maintained by Daniel), OMAP DRM (maintained
> > by
> > > Rob) and GEM CMA helper also have same issue Russell pointed out. I
> > think
> > > not only the above approach but also the performance is very important.
> > 
> > CMA uses coherent memory to back their buffers, though that might not be
> > true of memory obtained from other drivers via dma_buf.  Plus, there is
> > no support in the CMA helper for exporting or importng these buffers.
> > 
> 
> It's not so. Please see Dave's drm next. recently dmabuf support for the CMA
> helper has been merged to there.

The point stands: CMA is DMA coherent memory.  It doesn't need and must
never be dma-map-sg'd or dma-sync'd or dma-unmap'd.
Lucas Stach June 18, 2013, 9:47 a.m. UTC | #16
Am Dienstag, den 18.06.2013, 18:04 +0900 schrieb Inki Dae:
[...]
> 
> > a display device driver.  It shouldn't be used within a single driver
> > as a means of passing buffers between userspace and kernel space.
> 
> What I try to do is not really such ugly thing. What I try to do is to
> notify that, when CPU tries to access a buffer , to kernel side through
> dmabuf interface. So it's not really to send the buffer to kernel.
> 
> Thanks,
> Inki Dae
> 
The most basic question about why you are trying to implement this sort
of thing in the dma_buf framework still stands.

Once you imported a dma_buf into your DRM driver it's a GEM object and
you can and should use the native DRM ioctls to prepare/end a CPU access
to this BO. Then internally to your driver you can use the dma_buf
reservation/fence stuff to provide the necessary cross-device sync.

Regards,
Lucas
Russell King - ARM Linux June 18, 2013, 10:46 a.m. UTC | #17
On Tue, Jun 18, 2013 at 09:00:16AM +0200, Daniel Vetter wrote:
> On Mon, Jun 17, 2013 at 04:42:37PM +0100, Russell King - ARM Linux wrote:
> > What we need is something along the lines of:
> > (a) dma_buf_map_attachment() _not_ to map the scatterlist for DMA.
> > or
> > (b) drm_gem_prime_import() not to call dma_buf_map_attachment() at all.
> 
> I strongly vote for (b) (plus adding a dma_buf_map_sync interface to allow
> syncing to other devices/cpu whithout dropping the dma mappings). At least
> that's been the idea behind things, but currently all (x86-based) drm
> drivers cut corners here massively.
> 
> Aside: The entire reason behind hiding the dma map step in the dma-buf
> attachment is to allow drivers to expose crazy iommus (e.g. the tiler unit
> on omap) to other devices. So imo (a) isn't the right choice.

That's why I also talk below about adding the dma_buf_map/sync callbacks
below, so that a dma_buf implementation can do whatever is necessary with
the sg at the point of use.

However, you are correct that this should be unnecessary, as DRM should
only be calling dma_buf_map_attachment() when it needs to know about the
memory behind the object.  The problem is that people will cache that
information - it also may be expensive to setup for the dma_buf - as it
involves counting the number of pages, memory allocation, and maybe
obtaining the set of pages from shmem.

With (a) plus the callbacks below, it means that step is only performed
once, and then the DMA API part is entirely separate - we can have our
cake and eat it in that regard.

> > Note: the existing stuff does have the nice side effect of being able
> > to pass buffers which do not have a struct page * associated with them
> > through the dma_buf API - I think we can still preserve that by having
> > dma_buf provide a couple of new APIs to do the SG list map/sync/unmap,
> > but in any case we need to fix the existing API so that:
> >
> > dma_buf_map_attachment() becomes dma_buf_get_sg()
> > dma_buf_unmap_attachment() becomes dma_buf_put_sg()
> >
> > both getting rid of the DMA direction argument, and then we have four
> > new dma_buf calls:
> >
> > dma_buf_map_sg()
> > dma_buf_unmap_sg()
> > dma_buf_sync_sg_for_cpu()
> > dma_buf_sync_sg_for_device()
> >
> > which do the actual sg map/unmap via the DMA API *at the appropriate
> > time for DMA*.
> 
> Hm, my idea was to just add a dma_buf_sync_attchment for the device side
> syncing, since the cpu access stuff is already bracketed with the
> begin/end cpu access stuff. We might need a sync_for_cpu or so for mmap,
> but imo mmap support for dma_buf is a bit insane anyway, so I don't care
> too much about it.
> 
> Since such dma mappings would be really longstanding in most cases anyway
> drivers could just map with BIDIRECTIONAL and do all the real flushing
> with the new sync stuff.

Note that the DMA API debug doesn't allow you to change the direction
argument on an existing mapping (neither should it, again this is
documented in the DMA API stuff in Documentation/).  This is where you
would need the complete set of four functions I mention above which
reflect the functionality of the DMA API.

The dma_buf implementation doesn't _have_ to implement them if they
are no-ops - for example, your dma_buf sg array contains DMA pointers,
and the memory is already coherent in some way (for example, it's a
separate chunk of memory which isn't part of system RAM.)
Inki Dae June 19, 2013, 5:45 a.m. UTC | #18
> -----Original Message-----
> From: Lucas Stach [mailto:l.stach@pengutronix.de]
> Sent: Tuesday, June 18, 2013 6:47 PM
> To: Inki Dae
> Cc: 'Russell King - ARM Linux'; 'linux-fbdev'; 'Kyungmin Park'; 'DRI
> mailing list'; 'myungjoo.ham'; 'YoungJun Cho'; linux-arm-
> kernel@lists.infradead.org; linux-media@vger.kernel.org
> Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
> framework
> 
> Am Dienstag, den 18.06.2013, 18:04 +0900 schrieb Inki Dae:
> [...]
> >
> > > a display device driver.  It shouldn't be used within a single driver
> > > as a means of passing buffers between userspace and kernel space.
> >
> > What I try to do is not really such ugly thing. What I try to do is to
> > notify that, when CPU tries to access a buffer , to kernel side through
> > dmabuf interface. So it's not really to send the buffer to kernel.
> >
> > Thanks,
> > Inki Dae
> >
> The most basic question about why you are trying to implement this sort
> of thing in the dma_buf framework still stands.
> 
> Once you imported a dma_buf into your DRM driver it's a GEM object and
> you can and should use the native DRM ioctls to prepare/end a CPU access
> to this BO. Then internally to your driver you can use the dma_buf
> reservation/fence stuff to provide the necessary cross-device sync.
> 

I don't really want that is used only for DRM drivers. We really need it for all other DMA devices; i.e., v4l2 based drivers. That is what I try to do. And my approach uses reservation to use dma-buf resources but not dma fence stuff anymore. However, I'm looking into Radeon DRM driver for why we need dma fence stuff, and how we can use it if needed.

Thanks,
Inki Dae

> Regards,
> Lucas
> --
> Pengutronix e.K.                           | Lucas Stach                 |
> Industrial Linux Solutions                 | http://www.pengutronix.de/  |
> Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 |
> Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |
Lucas Stach June 19, 2013, 10:22 a.m. UTC | #19
Am Mittwoch, den 19.06.2013, 14:45 +0900 schrieb Inki Dae:
> 
> > -----Original Message-----
> > From: Lucas Stach [mailto:l.stach@pengutronix.de]
> > Sent: Tuesday, June 18, 2013 6:47 PM
> > To: Inki Dae
> > Cc: 'Russell King - ARM Linux'; 'linux-fbdev'; 'Kyungmin Park'; 'DRI
> > mailing list'; 'myungjoo.ham'; 'YoungJun Cho'; linux-arm-
> > kernel@lists.infradead.org; linux-media@vger.kernel.org
> > Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
> > framework
> > 
> > Am Dienstag, den 18.06.2013, 18:04 +0900 schrieb Inki Dae:
> > [...]
> > >
> > > > a display device driver.  It shouldn't be used within a single driver
> > > > as a means of passing buffers between userspace and kernel space.
> > >
> > > What I try to do is not really such ugly thing. What I try to do is to
> > > notify that, when CPU tries to access a buffer , to kernel side through
> > > dmabuf interface. So it's not really to send the buffer to kernel.
> > >
> > > Thanks,
> > > Inki Dae
> > >
> > The most basic question about why you are trying to implement this sort
> > of thing in the dma_buf framework still stands.
> > 
> > Once you imported a dma_buf into your DRM driver it's a GEM object and
> > you can and should use the native DRM ioctls to prepare/end a CPU access
> > to this BO. Then internally to your driver you can use the dma_buf
> > reservation/fence stuff to provide the necessary cross-device sync.
> > 
> 
> I don't really want that is used only for DRM drivers. We really need
> it for all other DMA devices; i.e., v4l2 based drivers. That is what I
> try to do. And my approach uses reservation to use dma-buf resources
> but not dma fence stuff anymore. However, I'm looking into Radeon DRM
> driver for why we need dma fence stuff, and how we can use it if
> needed.
> 

Still I don't see the point why you need syncpoints above dma-buf. In
both the DRM and the V4L2 world we have defined points in the API where
a buffer is allowed to change domain from device to CPU and vice versa.

In DRM if you want to access a buffer with the CPU you do a cpu_prepare.
The buffer changes back to GPU domain once you do the execbuf
validation, queue a pageflip to the buffer or similar things.

In V4L2 the syncpoints for cache operations are the queue/dequeue API
entry points. Those are also the exact points to synchronize with other
hardware thus using dma-buf reserve/fence.

In all this I can't see any need for a new syncpoint primitive slapped
on top of dma-buf.

Regards,
Lucas
Inki Dae June 19, 2013, 10:44 a.m. UTC | #20
> -----Original Message-----
> From: Lucas Stach [mailto:l.stach@pengutronix.de]
> Sent: Wednesday, June 19, 2013 7:22 PM
> To: Inki Dae
> Cc: 'Russell King - ARM Linux'; 'linux-fbdev'; 'Kyungmin Park'; 'DRI
> mailing list'; 'myungjoo.ham'; 'YoungJun Cho'; linux-arm-
> kernel@lists.infradead.org; linux-media@vger.kernel.org
> Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
> framework
> 
> Am Mittwoch, den 19.06.2013, 14:45 +0900 schrieb Inki Dae:
> >
> > > -----Original Message-----
> > > From: Lucas Stach [mailto:l.stach@pengutronix.de]
> > > Sent: Tuesday, June 18, 2013 6:47 PM
> > > To: Inki Dae
> > > Cc: 'Russell King - ARM Linux'; 'linux-fbdev'; 'Kyungmin Park'; 'DRI
> > > mailing list'; 'myungjoo.ham'; 'YoungJun Cho'; linux-arm-
> > > kernel@lists.infradead.org; linux-media@vger.kernel.org
> > > Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer
> synchronization
> > > framework
> > >
> > > Am Dienstag, den 18.06.2013, 18:04 +0900 schrieb Inki Dae:
> > > [...]
> > > >
> > > > > a display device driver.  It shouldn't be used within a single
> driver
> > > > > as a means of passing buffers between userspace and kernel space.
> > > >
> > > > What I try to do is not really such ugly thing. What I try to do is
> to
> > > > notify that, when CPU tries to access a buffer , to kernel side
> through
> > > > dmabuf interface. So it's not really to send the buffer to kernel.
> > > >
> > > > Thanks,
> > > > Inki Dae
> > > >
> > > The most basic question about why you are trying to implement this
> sort
> > > of thing in the dma_buf framework still stands.
> > >
> > > Once you imported a dma_buf into your DRM driver it's a GEM object and
> > > you can and should use the native DRM ioctls to prepare/end a CPU
> access
> > > to this BO. Then internally to your driver you can use the dma_buf
> > > reservation/fence stuff to provide the necessary cross-device sync.
> > >
> >
> > I don't really want that is used only for DRM drivers. We really need
> > it for all other DMA devices; i.e., v4l2 based drivers. That is what I
> > try to do. And my approach uses reservation to use dma-buf resources
> > but not dma fence stuff anymore. However, I'm looking into Radeon DRM
> > driver for why we need dma fence stuff, and how we can use it if
> > needed.
> >
> 
> Still I don't see the point why you need syncpoints above dma-buf. In
> both the DRM and the V4L2 world we have defined points in the API where
> a buffer is allowed to change domain from device to CPU and vice versa.
> 
> In DRM if you want to access a buffer with the CPU you do a cpu_prepare.
> The buffer changes back to GPU domain once you do the execbuf
> validation, queue a pageflip to the buffer or similar things.
> 
> In V4L2 the syncpoints for cache operations are the queue/dequeue API
> entry points. Those are also the exact points to synchronize with other
> hardware thus using dma-buf reserve/fence.


If so, what if we want to access a buffer with the CPU _in V4L2_? We should open a drm device node, and then do a cpu_prepare? 

Thanks,
Inki Dae

> 
> In all this I can't see any need for a new syncpoint primitive slapped
> on top of dma-buf.
> 
> Regards,
> Lucas
> --
> Pengutronix e.K.                           | Lucas Stach                 |
> Industrial Linux Solutions                 | http://www.pengutronix.de/  |
> Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 |
> Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |
Lucas Stach June 19, 2013, 12:34 p.m. UTC | #21
Am Mittwoch, den 19.06.2013, 19:44 +0900 schrieb Inki Dae:
> 
> > -----Original Message-----
> > From: Lucas Stach [mailto:l.stach@pengutronix.de]
> > Sent: Wednesday, June 19, 2013 7:22 PM
> > To: Inki Dae
> > Cc: 'Russell King - ARM Linux'; 'linux-fbdev'; 'Kyungmin Park'; 'DRI
> > mailing list'; 'myungjoo.ham'; 'YoungJun Cho'; linux-arm-
> > kernel@lists.infradead.org; linux-media@vger.kernel.org
> > Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
> > framework
> > 
> > Am Mittwoch, den 19.06.2013, 14:45 +0900 schrieb Inki Dae:
> > >
> > > > -----Original Message-----
> > > > From: Lucas Stach [mailto:l.stach@pengutronix.de]
> > > > Sent: Tuesday, June 18, 2013 6:47 PM
> > > > To: Inki Dae
> > > > Cc: 'Russell King - ARM Linux'; 'linux-fbdev'; 'Kyungmin Park'; 'DRI
> > > > mailing list'; 'myungjoo.ham'; 'YoungJun Cho'; linux-arm-
> > > > kernel@lists.infradead.org; linux-media@vger.kernel.org
> > > > Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer
> > synchronization
> > > > framework
> > > >
> > > > Am Dienstag, den 18.06.2013, 18:04 +0900 schrieb Inki Dae:
> > > > [...]
> > > > >
> > > > > > a display device driver.  It shouldn't be used within a single
> > driver
> > > > > > as a means of passing buffers between userspace and kernel space.
> > > > >
> > > > > What I try to do is not really such ugly thing. What I try to do is
> > to
> > > > > notify that, when CPU tries to access a buffer , to kernel side
> > through
> > > > > dmabuf interface. So it's not really to send the buffer to kernel.
> > > > >
> > > > > Thanks,
> > > > > Inki Dae
> > > > >
> > > > The most basic question about why you are trying to implement this
> > sort
> > > > of thing in the dma_buf framework still stands.
> > > >
> > > > Once you imported a dma_buf into your DRM driver it's a GEM object and
> > > > you can and should use the native DRM ioctls to prepare/end a CPU
> > access
> > > > to this BO. Then internally to your driver you can use the dma_buf
> > > > reservation/fence stuff to provide the necessary cross-device sync.
> > > >
> > >
> > > I don't really want that is used only for DRM drivers. We really need
> > > it for all other DMA devices; i.e., v4l2 based drivers. That is what I
> > > try to do. And my approach uses reservation to use dma-buf resources
> > > but not dma fence stuff anymore. However, I'm looking into Radeon DRM
> > > driver for why we need dma fence stuff, and how we can use it if
> > > needed.
> > >
> > 
> > Still I don't see the point why you need syncpoints above dma-buf. In
> > both the DRM and the V4L2 world we have defined points in the API where
> > a buffer is allowed to change domain from device to CPU and vice versa.
> > 
> > In DRM if you want to access a buffer with the CPU you do a cpu_prepare.
> > The buffer changes back to GPU domain once you do the execbuf
> > validation, queue a pageflip to the buffer or similar things.
> > 
> > In V4L2 the syncpoints for cache operations are the queue/dequeue API
> > entry points. Those are also the exact points to synchronize with other
> > hardware thus using dma-buf reserve/fence.
> 
> 
> If so, what if we want to access a buffer with the CPU _in V4L2_? We
> should open a drm device node, and then do a cpu_prepare? 
> 
Not at all. As I said the syncpoints are the queue/dequeue operations.
When dequeueing a buffer you are explicitly dragging the buffer domain
back from device into userspace and thus CPU domain.

If you are operating on an mmap of a V4L2 processed buffer it's either
before or after it got processed by the hardware and therefore all DMA
operations on the buffer are bracketed by the V4L2 qbug/dqbuf ioctls.
That is where cache operations and synchronization should happen. The
V4L2 driver shouldn't allow you to dequeue a buffer and thus dragging it
back into CPU domain while there is still DMA ongoing. Equally the queue
ioctrl should make sure caches are properly written back to memory. The
results of reading/writing to the mmap of a V4L2 buffer while it is
enqueued to the hardware is simply undefined and there is nothing
suggesting that this is a valid usecase.

Regards,
Lucas
Inki Dae June 19, 2013, 3:10 p.m. UTC | #22
2013/6/19 Lucas Stach <l.stach@pengutronix.de>

> Am Mittwoch, den 19.06.2013, 19:44 +0900 schrieb Inki Dae:
> >
> > > -----Original Message-----
> > > From: Lucas Stach [mailto:l.stach@pengutronix.de]
> > > Sent: Wednesday, June 19, 2013 7:22 PM
> > > To: Inki Dae
> > > Cc: 'Russell King - ARM Linux'; 'linux-fbdev'; 'Kyungmin Park'; 'DRI
> > > mailing list'; 'myungjoo.ham'; 'YoungJun Cho'; linux-arm-
> > > kernel@lists.infradead.org; linux-media@vger.kernel.org
> > > Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer
> synchronization
> > > framework
> > >
> > > Am Mittwoch, den 19.06.2013, 14:45 +0900 schrieb Inki Dae:
> > > >
> > > > > -----Original Message-----
> > > > > From: Lucas Stach [mailto:l.stach@pengutronix.de]
> > > > > Sent: Tuesday, June 18, 2013 6:47 PM
> > > > > To: Inki Dae
> > > > > Cc: 'Russell King - ARM Linux'; 'linux-fbdev'; 'Kyungmin Park';
> 'DRI
> > > > > mailing list'; 'myungjoo.ham'; 'YoungJun Cho'; linux-arm-
> > > > > kernel@lists.infradead.org; linux-media@vger.kernel.org
> > > > > Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer
> > > synchronization
> > > > > framework
> > > > >
> > > > > Am Dienstag, den 18.06.2013, 18:04 +0900 schrieb Inki Dae:
> > > > > [...]
> > > > > >
> > > > > > > a display device driver.  It shouldn't be used within a single
> > > driver
> > > > > > > as a means of passing buffers between userspace and kernel
> space.
> > > > > >
> > > > > > What I try to do is not really such ugly thing. What I try to do
> is
> > > to
> > > > > > notify that, when CPU tries to access a buffer , to kernel side
> > > through
> > > > > > dmabuf interface. So it's not really to send the buffer to
> kernel.
> > > > > >
> > > > > > Thanks,
> > > > > > Inki Dae
> > > > > >
> > > > > The most basic question about why you are trying to implement this
> > > sort
> > > > > of thing in the dma_buf framework still stands.
> > > > >
> > > > > Once you imported a dma_buf into your DRM driver it's a GEM object
> and
> > > > > you can and should use the native DRM ioctls to prepare/end a CPU
> > > access
> > > > > to this BO. Then internally to your driver you can use the dma_buf
> > > > > reservation/fence stuff to provide the necessary cross-device sync.
> > > > >
> > > >
> > > > I don't really want that is used only for DRM drivers. We really need
> > > > it for all other DMA devices; i.e., v4l2 based drivers. That is what
> I
> > > > try to do. And my approach uses reservation to use dma-buf resources
> > > > but not dma fence stuff anymore. However, I'm looking into Radeon DRM
> > > > driver for why we need dma fence stuff, and how we can use it if
> > > > needed.
> > > >
> > >
> > > Still I don't see the point why you need syncpoints above dma-buf. In
> > > both the DRM and the V4L2 world we have defined points in the API where
> > > a buffer is allowed to change domain from device to CPU and vice versa.
> > >
> > > In DRM if you want to access a buffer with the CPU you do a
> cpu_prepare.
> > > The buffer changes back to GPU domain once you do the execbuf
> > > validation, queue a pageflip to the buffer or similar things.
> > >
> > > In V4L2 the syncpoints for cache operations are the queue/dequeue API
> > > entry points. Those are also the exact points to synchronize with other
> > > hardware thus using dma-buf reserve/fence.
> >
> >
> > If so, what if we want to access a buffer with the CPU _in V4L2_? We
> > should open a drm device node, and then do a cpu_prepare?
> >
> Not at all. As I said the syncpoints are the queue/dequeue operations.
> When dequeueing a buffer you are explicitly dragging the buffer domain
> back from device into userspace and thus CPU domain.
>
> If you are operating on an mmap of a V4L2 processed buffer it's either
> before or after it got processed by the hardware and therefore all DMA
> operations on the buffer are bracketed by the V4L2 qbug/dqbuf ioctls.
> That is where cache operations and synchronization should happen. The
> V4L2 driver shouldn't allow you to dequeue a buffer and thus dragging it
> back into CPU domain while there is still DMA ongoing. Equally the queue
> ioctrl should make sure caches are properly written back to memory. The
> results of reading/writing to the mmap of a V4L2 buffer while it is
> enqueued to the hardware is simply undefined and there is nothing
> suggesting that this is a valid usecase.


Thanks to comments. However, that's not definitely my point, and you
just say the conventional way. My point is to more enhance the conventional
way.

The conventional way is (sorry but I'm not really a good painter) :

CPU -> DMA,
        ioctl(qbuf command)      ioctl(streamon)
               |                                      |
               |                                      |
        qbuf  <- syncpoint       start streaming <- dma access

DMA accesses a queued buffer with start streaming if source and
destination queues are ready.

And DMA -> CPU,
        ioctl(dqbuf command)
              |
              |
        dqbuf <- syncpoint

Internally, dqbuf waits for until DMA opertion is completed. And if
completed then user process can access a dequeued buffer.


On the other hand, the below shows how we could enhance the conventional
way with my approach (just example):

CPU -> DMA,
        ioctl(qbuf command)              ioctl(streamon)
              |                                               |
              |                                               |
        qbuf  <- dma_buf_sync_get   start streaming <- syncpoint

dma_buf_sync_get just registers a sync buffer(dmabuf) to sync object. And
the syncpoint is performed by calling dma_buf_sync_lock(), and then DMA
accesses the sync buffer.

And DMA -> CPU,
        ioctl(dqbuf command)
              |
              |
        dqbuf <- nothing to do

Actual syncpoint is when DMA operation is completed (in interrupt handler):
the syncpoint is performed by calling dma_buf_sync_unlock().
Hence,  my approach is to move the syncpoints into just before dma access
as long as possible. Of course, user process need to call dma-buf
interfaces or simliar things for buffer synchronization with DMA side.
However, as of now, there is no good idea: I had implemented the user
interfaces in dma-buf framework but that was just to show you so ugly. The
eventual purpose of my approach is to integrate sync interfaces with dmabuf
sync, and this approach can be used commonly for v4l2, drm drivers, user
processes, and so on: as I already mentioned in document file, this
approach is for DMA devices using system memory as DMA buffer, especially
most ARM SoCs.

Thanks,
Inki Dae


> Regards,
> Lucas
>
> --
> Pengutronix e.K.                           | Lucas Stach                 |
> Industrial Linux Solutions                 | http://www.pengutronix.de/  |
> Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 |
> Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |
>
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/dri-devel
>
Russell King - ARM Linux June 19, 2013, 6:29 p.m. UTC | #23
On Thu, Jun 20, 2013 at 12:10:04AM +0900, Inki Dae wrote:
> On the other hand, the below shows how we could enhance the conventional
> way with my approach (just example):
> 
> CPU -> DMA,
>         ioctl(qbuf command)              ioctl(streamon)
>               |                                               |
>               |                                               |
>         qbuf  <- dma_buf_sync_get   start streaming <- syncpoint
> 
> dma_buf_sync_get just registers a sync buffer(dmabuf) to sync object. And
> the syncpoint is performed by calling dma_buf_sync_lock(), and then DMA
> accesses the sync buffer.
> 
> And DMA -> CPU,
>         ioctl(dqbuf command)
>               |
>               |
>         dqbuf <- nothing to do
> 
> Actual syncpoint is when DMA operation is completed (in interrupt handler):
> the syncpoint is performed by calling dma_buf_sync_unlock().
> Hence,  my approach is to move the syncpoints into just before dma access
> as long as possible.

What you've just described does *not* work on architectures such as
ARMv7 which do speculative cache fetches from memory at any time that
that memory is mapped with a cacheable status, and will lead to data
corruption.
Inki Dae June 20, 2013, 6:43 a.m. UTC | #24
> -----Original Message-----
> From: dri-devel-bounces+inki.dae=samsung.com@lists.freedesktop.org
> [mailto:dri-devel-bounces+inki.dae=samsung.com@lists.freedesktop.org] On
> Behalf Of Russell King - ARM Linux
> Sent: Thursday, June 20, 2013 3:29 AM
> To: Inki Dae
> Cc: linux-fbdev; DRI mailing list; Kyungmin Park; myungjoo.ham; YoungJun
> Cho; linux-media@vger.kernel.org; linux-arm-kernel@lists.infradead.org
> Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
> framework
> 
> On Thu, Jun 20, 2013 at 12:10:04AM +0900, Inki Dae wrote:
> > On the other hand, the below shows how we could enhance the conventional
> > way with my approach (just example):
> >
> > CPU -> DMA,
> >         ioctl(qbuf command)              ioctl(streamon)
> >               |                                               |
> >               |                                               |
> >         qbuf  <- dma_buf_sync_get   start streaming <- syncpoint
> >
> > dma_buf_sync_get just registers a sync buffer(dmabuf) to sync object.
> And
> > the syncpoint is performed by calling dma_buf_sync_lock(), and then DMA
> > accesses the sync buffer.
> >
> > And DMA -> CPU,
> >         ioctl(dqbuf command)
> >               |
> >               |
> >         dqbuf <- nothing to do
> >
> > Actual syncpoint is when DMA operation is completed (in interrupt
> handler):
> > the syncpoint is performed by calling dma_buf_sync_unlock().
> > Hence,  my approach is to move the syncpoints into just before dma
> access
> > as long as possible.
> 
> What you've just described does *not* work on architectures such as
> ARMv7 which do speculative cache fetches from memory at any time that
> that memory is mapped with a cacheable status, and will lead to data
> corruption.

I didn't explain that enough. Sorry about that. 'nothing to do' means that a
dmabuf sync interface isn't called but existing functions are called. So
this may be explained again:
        ioctl(dqbuf command)
            |
            |
        dqbuf <- 1. dma_unmap_sg
                    2. dma_buf_sync_unlock (syncpoint)

The syncpoint I mentioned means lock mechanism; not doing cache operation.

In addition, please see the below more detail examples.

The conventional way (without dmabuf-sync) is:
Task A                             
----------------------------
 1. CPU accesses buf          
 2. Send the buf to Task B  
 3. Wait for the buf from Task B
 4. go to 1

Task B
---------------------------
1. Wait for the buf from Task A
2. qbuf the buf                 
    2.1 insert the buf to incoming queue
3. stream on
    3.1 dma_map_sg if ready, and move the buf to ready queue
    3.2 get the buf from ready queue, and dma start.
4. dqbuf
    4.1 dma_unmap_sg after dma operation completion
    4.2 move the buf to outgoing queue
5. back the buf to Task A
6. go to 1

In case that two tasks share buffers, and data flow goes from Task A to Task
B, we would need IPC operation to send and receive buffers properly between
those two tasks every time CPU or DMA access to buffers is started or
completed.


With dmabuf-sync:

Task A                             
----------------------------
 1. dma_buf_sync_lock <- synpoint (call by user side)
 2. CPU accesses buf          
 3. dma_buf_sync_unlock <- syncpoint (call by user side)
 4. Send the buf to Task B (just one time)
 5. go to 1


Task B
---------------------------
1. Wait for the buf from Task A (just one time)
2. qbuf the buf                 
    1.1 insert the buf to incoming queue
3. stream on
    3.1 dma_buf_sync_lock <- syncpoint (call by kernel side)
    3.2 dma_map_sg if ready, and move the buf to ready queue
    3.3 get the buf from ready queue, and dma start.
4. dqbuf
    4.1 dma_buf_sync_unlock <- syncpoint (call by kernel side)
    4.2 dma_unmap_sg after dma operation completion
    4.3 move the buf to outgoing queue
5. go to 1

On the other hand, in case of using dmabuf-sync, as you can see the above
example, we would need IPC operation just one time. That way, I think we
could not only reduce performance overhead but also make user application
simplified. Of course, this approach can be used for all DMA device drivers
such as DRM. I'm not a specialist in v4l2 world so there may be missing
point.

Thanks,
Inki Dae

> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/dri-devel
Lucas Stach June 20, 2013, 7:47 a.m. UTC | #25
Am Donnerstag, den 20.06.2013, 15:43 +0900 schrieb Inki Dae:
> 
> > -----Original Message-----
> > From: dri-devel-bounces+inki.dae=samsung.com@lists.freedesktop.org
> > [mailto:dri-devel-bounces+inki.dae=samsung.com@lists.freedesktop.org] On
> > Behalf Of Russell King - ARM Linux
> > Sent: Thursday, June 20, 2013 3:29 AM
> > To: Inki Dae
> > Cc: linux-fbdev; DRI mailing list; Kyungmin Park; myungjoo.ham; YoungJun
> > Cho; linux-media@vger.kernel.org; linux-arm-kernel@lists.infradead.org
> > Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
> > framework
> > 
> > On Thu, Jun 20, 2013 at 12:10:04AM +0900, Inki Dae wrote:
> > > On the other hand, the below shows how we could enhance the conventional
> > > way with my approach (just example):
> > >
> > > CPU -> DMA,
> > >         ioctl(qbuf command)              ioctl(streamon)
> > >               |                                               |
> > >               |                                               |
> > >         qbuf  <- dma_buf_sync_get   start streaming <- syncpoint
> > >
> > > dma_buf_sync_get just registers a sync buffer(dmabuf) to sync object.
> > And
> > > the syncpoint is performed by calling dma_buf_sync_lock(), and then DMA
> > > accesses the sync buffer.
> > >
> > > And DMA -> CPU,
> > >         ioctl(dqbuf command)
> > >               |
> > >               |
> > >         dqbuf <- nothing to do
> > >
> > > Actual syncpoint is when DMA operation is completed (in interrupt
> > handler):
> > > the syncpoint is performed by calling dma_buf_sync_unlock().
> > > Hence,  my approach is to move the syncpoints into just before dma
> > access
> > > as long as possible.
> > 
> > What you've just described does *not* work on architectures such as
> > ARMv7 which do speculative cache fetches from memory at any time that
> > that memory is mapped with a cacheable status, and will lead to data
> > corruption.
> 
> I didn't explain that enough. Sorry about that. 'nothing to do' means that a
> dmabuf sync interface isn't called but existing functions are called. So
> this may be explained again:
>         ioctl(dqbuf command)
>             |
>             |
>         dqbuf <- 1. dma_unmap_sg
>                     2. dma_buf_sync_unlock (syncpoint)
> 
> The syncpoint I mentioned means lock mechanism; not doing cache operation.
> 
> In addition, please see the below more detail examples.
> 
> The conventional way (without dmabuf-sync) is:
> Task A                             
> ----------------------------
>  1. CPU accesses buf          
>  2. Send the buf to Task B  
>  3. Wait for the buf from Task B
>  4. go to 1
> 
> Task B
> ---------------------------
> 1. Wait for the buf from Task A
> 2. qbuf the buf                 
>     2.1 insert the buf to incoming queue
> 3. stream on
>     3.1 dma_map_sg if ready, and move the buf to ready queue
>     3.2 get the buf from ready queue, and dma start.
> 4. dqbuf
>     4.1 dma_unmap_sg after dma operation completion
>     4.2 move the buf to outgoing queue
> 5. back the buf to Task A
> 6. go to 1
> 
> In case that two tasks share buffers, and data flow goes from Task A to Task
> B, we would need IPC operation to send and receive buffers properly between
> those two tasks every time CPU or DMA access to buffers is started or
> completed.
> 
> 
> With dmabuf-sync:
> 
> Task A                             
> ----------------------------
>  1. dma_buf_sync_lock <- synpoint (call by user side)
>  2. CPU accesses buf          
>  3. dma_buf_sync_unlock <- syncpoint (call by user side)
>  4. Send the buf to Task B (just one time)
>  5. go to 1
> 
> 
> Task B
> ---------------------------
> 1. Wait for the buf from Task A (just one time)
> 2. qbuf the buf                 
>     1.1 insert the buf to incoming queue
> 3. stream on
>     3.1 dma_buf_sync_lock <- syncpoint (call by kernel side)
>     3.2 dma_map_sg if ready, and move the buf to ready queue
>     3.3 get the buf from ready queue, and dma start.
> 4. dqbuf
>     4.1 dma_buf_sync_unlock <- syncpoint (call by kernel side)
>     4.2 dma_unmap_sg after dma operation completion
>     4.3 move the buf to outgoing queue
> 5. go to 1
> 
> On the other hand, in case of using dmabuf-sync, as you can see the above
> example, we would need IPC operation just one time. That way, I think we
> could not only reduce performance overhead but also make user application
> simplified. Of course, this approach can be used for all DMA device drivers
> such as DRM. I'm not a specialist in v4l2 world so there may be missing
> point.
> 

You already need some kind of IPC between the two tasks, as I suspect
even in your example it wouldn't make much sense to queue the buffer
over and over again in task B without task A writing anything to it. So
task A has to signal task B there is new data in the buffer to be
processed.

There is no need to share the buffer over and over again just to get the
two processes to work together on the same thing. Just share the fd
between both and then do out-of-band completion signaling, as you need
this anyway. Without this you'll end up with unpredictable behavior.
Just because sync allows you to access the buffer doesn't mean it's
valid for your use-case. Without completion signaling you could easily
end up overwriting your data from task A multiple times before task B
even tries to lock the buffer for processing.

So the valid flow is (and this already works with the current APIs):
Task A                                    Task B
------                                    ------
CPU access buffer
         ----------completion signal--------->
                                          qbuf (dragging buffer into 
                                          device domain, flush caches,
                                          reserve buffer etc.)
                                                    |
                                          wait for device operation to
                                          complete
                                                    |
                                          dqbuf (dragging buffer back
                                          into CPU domain, invalidate
                                          caches, unreserve)
        <---------completion signal------------
CPU access buffer


Regards,
Lucas
Russell King - ARM Linux June 20, 2013, 8:17 a.m. UTC | #26
On Thu, Jun 20, 2013 at 09:47:07AM +0200, Lucas Stach wrote:
> Am Donnerstag, den 20.06.2013, 15:43 +0900 schrieb Inki Dae:
> > 
> > > -----Original Message-----
> > > From: dri-devel-bounces+inki.dae=samsung.com@lists.freedesktop.org
> > > [mailto:dri-devel-bounces+inki.dae=samsung.com@lists.freedesktop.org] On
> > > Behalf Of Russell King - ARM Linux
> > > Sent: Thursday, June 20, 2013 3:29 AM
> > > To: Inki Dae
> > > Cc: linux-fbdev; DRI mailing list; Kyungmin Park; myungjoo.ham; YoungJun
> > > Cho; linux-media@vger.kernel.org; linux-arm-kernel@lists.infradead.org
> > > Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
> > > framework
> > > 
> > > On Thu, Jun 20, 2013 at 12:10:04AM +0900, Inki Dae wrote:
> > > > On the other hand, the below shows how we could enhance the conventional
> > > > way with my approach (just example):
> > > >
> > > > CPU -> DMA,
> > > >         ioctl(qbuf command)              ioctl(streamon)
> > > >               |                                               |
> > > >               |                                               |
> > > >         qbuf  <- dma_buf_sync_get   start streaming <- syncpoint
> > > >
> > > > dma_buf_sync_get just registers a sync buffer(dmabuf) to sync object.
> > > And
> > > > the syncpoint is performed by calling dma_buf_sync_lock(), and then DMA
> > > > accesses the sync buffer.
> > > >
> > > > And DMA -> CPU,
> > > >         ioctl(dqbuf command)
> > > >               |
> > > >               |
> > > >         dqbuf <- nothing to do
> > > >
> > > > Actual syncpoint is when DMA operation is completed (in interrupt
> > > handler):
> > > > the syncpoint is performed by calling dma_buf_sync_unlock().
> > > > Hence,  my approach is to move the syncpoints into just before dma
> > > access
> > > > as long as possible.
> > > 
> > > What you've just described does *not* work on architectures such as
> > > ARMv7 which do speculative cache fetches from memory at any time that
> > > that memory is mapped with a cacheable status, and will lead to data
> > > corruption.
> > 
> > I didn't explain that enough. Sorry about that. 'nothing to do' means that a
> > dmabuf sync interface isn't called but existing functions are called. So
> > this may be explained again:
> >         ioctl(dqbuf command)
> >             |
> >             |
> >         dqbuf <- 1. dma_unmap_sg
> >                     2. dma_buf_sync_unlock (syncpoint)
> > 
> > The syncpoint I mentioned means lock mechanism; not doing cache operation.
> > 
> > In addition, please see the below more detail examples.
> > 
> > The conventional way (without dmabuf-sync) is:
> > Task A                             
> > ----------------------------
> >  1. CPU accesses buf          
> >  2. Send the buf to Task B  
> >  3. Wait for the buf from Task B
> >  4. go to 1
> > 
> > Task B
> > ---------------------------
> > 1. Wait for the buf from Task A
> > 2. qbuf the buf                 
> >     2.1 insert the buf to incoming queue
> > 3. stream on
> >     3.1 dma_map_sg if ready, and move the buf to ready queue
> >     3.2 get the buf from ready queue, and dma start.
> > 4. dqbuf
> >     4.1 dma_unmap_sg after dma operation completion
> >     4.2 move the buf to outgoing queue
> > 5. back the buf to Task A
> > 6. go to 1
> > 
> > In case that two tasks share buffers, and data flow goes from Task A to Task
> > B, we would need IPC operation to send and receive buffers properly between
> > those two tasks every time CPU or DMA access to buffers is started or
> > completed.
> > 
> > 
> > With dmabuf-sync:
> > 
> > Task A                             
> > ----------------------------
> >  1. dma_buf_sync_lock <- synpoint (call by user side)
> >  2. CPU accesses buf          
> >  3. dma_buf_sync_unlock <- syncpoint (call by user side)
> >  4. Send the buf to Task B (just one time)
> >  5. go to 1
> > 
> > 
> > Task B
> > ---------------------------
> > 1. Wait for the buf from Task A (just one time)
> > 2. qbuf the buf                 
> >     1.1 insert the buf to incoming queue
> > 3. stream on
> >     3.1 dma_buf_sync_lock <- syncpoint (call by kernel side)
> >     3.2 dma_map_sg if ready, and move the buf to ready queue
> >     3.3 get the buf from ready queue, and dma start.
> > 4. dqbuf
> >     4.1 dma_buf_sync_unlock <- syncpoint (call by kernel side)
> >     4.2 dma_unmap_sg after dma operation completion
> >     4.3 move the buf to outgoing queue
> > 5. go to 1
> > 
> > On the other hand, in case of using dmabuf-sync, as you can see the above
> > example, we would need IPC operation just one time. That way, I think we
> > could not only reduce performance overhead but also make user application
> > simplified. Of course, this approach can be used for all DMA device drivers
> > such as DRM. I'm not a specialist in v4l2 world so there may be missing
> > point.
> > 
> 
> You already need some kind of IPC between the two tasks, as I suspect
> even in your example it wouldn't make much sense to queue the buffer
> over and over again in task B without task A writing anything to it. So
> task A has to signal task B there is new data in the buffer to be
> processed.

Hang on.  Since when did dma_buf become another inter-process IPC
mechanism?  That's *not* it's design goal, and there's other much
better mechanisms already provided.

What dma_buf is about is passing a DMA buffer from subsystem to another
subsystem via userspace without exposing precise details of the buffer
to userspace.

It is also not about passing a DMA buffer from one subsystem to another
to gain access to the buffer via the other subsystem and bypassing the
access arrangements of the source subsystem.

I'm beginning to think that _that_ should be enforced by subsystems, by
having any buffer imported into a subsystem either not able to be mapped
into userspace, or if it is mappable into userspace, it must be a
read-only mapping.
Inki Dae June 20, 2013, 8:24 a.m. UTC | #27
> -----Original Message-----
> From: Lucas Stach [mailto:l.stach@pengutronix.de]
> Sent: Thursday, June 20, 2013 4:47 PM
> To: Inki Dae
> Cc: 'Russell King - ARM Linux'; 'Inki Dae'; 'linux-fbdev'; 'YoungJun Cho';
> 'Kyungmin Park'; 'myungjoo.ham'; 'DRI mailing list'; linux-arm-
> kernel@lists.infradead.org; linux-media@vger.kernel.org
> Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
> framework
> 
> Am Donnerstag, den 20.06.2013, 15:43 +0900 schrieb Inki Dae:
> >
> > > -----Original Message-----
> > > From: dri-devel-bounces+inki.dae=samsung.com@lists.freedesktop.org
> > > [mailto:dri-devel-bounces+inki.dae=samsung.com@lists.freedesktop.org]
> On
> > > Behalf Of Russell King - ARM Linux
> > > Sent: Thursday, June 20, 2013 3:29 AM
> > > To: Inki Dae
> > > Cc: linux-fbdev; DRI mailing list; Kyungmin Park; myungjoo.ham;
> YoungJun
> > > Cho; linux-media@vger.kernel.org; linux-arm-kernel@lists.infradead.org
> > > Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer
> synchronization
> > > framework
> > >
> > > On Thu, Jun 20, 2013 at 12:10:04AM +0900, Inki Dae wrote:
> > > > On the other hand, the below shows how we could enhance the
> conventional
> > > > way with my approach (just example):
> > > >
> > > > CPU -> DMA,
> > > >         ioctl(qbuf command)              ioctl(streamon)
> > > >               |                                               |
> > > >               |                                               |
> > > >         qbuf  <- dma_buf_sync_get   start streaming <- syncpoint
> > > >
> > > > dma_buf_sync_get just registers a sync buffer(dmabuf) to sync object.
> > > And
> > > > the syncpoint is performed by calling dma_buf_sync_lock(), and then
> DMA
> > > > accesses the sync buffer.
> > > >
> > > > And DMA -> CPU,
> > > >         ioctl(dqbuf command)
> > > >               |
> > > >               |
> > > >         dqbuf <- nothing to do
> > > >
> > > > Actual syncpoint is when DMA operation is completed (in interrupt
> > > handler):
> > > > the syncpoint is performed by calling dma_buf_sync_unlock().
> > > > Hence,  my approach is to move the syncpoints into just before dma
> > > access
> > > > as long as possible.
> > >
> > > What you've just described does *not* work on architectures such as
> > > ARMv7 which do speculative cache fetches from memory at any time that
> > > that memory is mapped with a cacheable status, and will lead to data
> > > corruption.
> >
> > I didn't explain that enough. Sorry about that. 'nothing to do' means
> that a
> > dmabuf sync interface isn't called but existing functions are called. So
> > this may be explained again:
> >         ioctl(dqbuf command)
> >             |
> >             |
> >         dqbuf <- 1. dma_unmap_sg
> >                     2. dma_buf_sync_unlock (syncpoint)
> >
> > The syncpoint I mentioned means lock mechanism; not doing cache
> operation.
> >
> > In addition, please see the below more detail examples.
> >
> > The conventional way (without dmabuf-sync) is:
> > Task A
> > ----------------------------
> >  1. CPU accesses buf
> >  2. Send the buf to Task B
> >  3. Wait for the buf from Task B
> >  4. go to 1
> >
> > Task B
> > ---------------------------
> > 1. Wait for the buf from Task A
> > 2. qbuf the buf
> >     2.1 insert the buf to incoming queue
> > 3. stream on
> >     3.1 dma_map_sg if ready, and move the buf to ready queue
> >     3.2 get the buf from ready queue, and dma start.
> > 4. dqbuf
> >     4.1 dma_unmap_sg after dma operation completion
> >     4.2 move the buf to outgoing queue
> > 5. back the buf to Task A
> > 6. go to 1
> >
> > In case that two tasks share buffers, and data flow goes from Task A to
> Task
> > B, we would need IPC operation to send and receive buffers properly
> between
> > those two tasks every time CPU or DMA access to buffers is started or
> > completed.
> >
> >
> > With dmabuf-sync:
> >
> > Task A
> > ----------------------------
> >  1. dma_buf_sync_lock <- synpoint (call by user side)
> >  2. CPU accesses buf
> >  3. dma_buf_sync_unlock <- syncpoint (call by user side)
> >  4. Send the buf to Task B (just one time)
> >  5. go to 1
> >
> >
> > Task B
> > ---------------------------
> > 1. Wait for the buf from Task A (just one time)
> > 2. qbuf the buf
> >     1.1 insert the buf to incoming queue
> > 3. stream on
> >     3.1 dma_buf_sync_lock <- syncpoint (call by kernel side)
> >     3.2 dma_map_sg if ready, and move the buf to ready queue
> >     3.3 get the buf from ready queue, and dma start.
> > 4. dqbuf
> >     4.1 dma_buf_sync_unlock <- syncpoint (call by kernel side)
> >     4.2 dma_unmap_sg after dma operation completion
> >     4.3 move the buf to outgoing queue
> > 5. go to 1
> >
> > On the other hand, in case of using dmabuf-sync, as you can see the
> above
> > example, we would need IPC operation just one time. That way, I think we
> > could not only reduce performance overhead but also make user
> application
> > simplified. Of course, this approach can be used for all DMA device
> drivers
> > such as DRM. I'm not a specialist in v4l2 world so there may be missing
> > point.
> >
> 
> You already need some kind of IPC between the two tasks, as I suspect
> even in your example it wouldn't make much sense to queue the buffer
> over and over again in task B without task A writing anything to it. So
> task A has to signal task B there is new data in the buffer to be
> processed.
> 
> There is no need to share the buffer over and over again just to get the
> two processes to work together on the same thing. Just share the fd
> between both and then do out-of-band completion signaling, as you need
> this anyway. Without this you'll end up with unpredictable behavior.
> Just because sync allows you to access the buffer doesn't mean it's
> valid for your use-case. Without completion signaling you could easily
> end up overwriting your data from task A multiple times before task B
> even tries to lock the buffer for processing.
> 
> So the valid flow is (and this already works with the current APIs):
> Task A                                    Task B
> ------                                    ------
> CPU access buffer
>          ----------completion signal--------->
>                                           qbuf (dragging buffer into
>                                           device domain, flush caches,
>                                           reserve buffer etc.)
>                                                     |
>                                           wait for device operation to
>                                           complete
>                                                     |
>                                           dqbuf (dragging buffer back
>                                           into CPU domain, invalidate
>                                           caches, unreserve)
>         <---------completion signal------------
> CPU access buffer
> 

Correct. In case that data flow goes from A to B, it needs some kind of IPC between the two tasks every time as you said. Then, without dmabuf-sync, how do think about the case that two tasks share the same buffer but these tasks access the buffer(buf1) as write, and data of the buffer(buf1) isn't needed to be shared?


With dmabuf-sync is:

 Task A
 ----------------------------
 1. dma_buf_sync_lock <- synpoint (call by user side)
 2. CPU writes something to buf1
 3. dma_buf_sync_unlock <- syncpoint (call by user side)
 4. copy buf1 to buf2
 5. go to 1


 Task B
 ---------------------------
 1. dma_buf_sync_lock
 2. CPU writes something to buf3
 3. dma_buf_sync_unlock
 4. qbuf the buf3(src) and buf1(dst)
     4.1 insert buf3,1 to incoming queue
     4.2 dma_buf_sync_lock <- syncpoint (call by kernel side)
 5. stream on
     5.1 dma_map_sg if ready, and move the buf to ready queue
     5.2 get the buf from ready queue, and dma start.
 6. dqbuf
     6.1 dma_buf_sync_unlock <- syncpoint (call by kernel side)
     6.2 dma_unmap_sg after dma operation completion
     6.3 move the buf3,1 to outgoing queue
7. go to 1


Thanks,
Inki Dae

> 
> Regards,
> Lucas
> --
> Pengutronix e.K.                           | Lucas Stach                 |
> Industrial Linux Solutions                 | http://www.pengutronix.de/  |
> Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 |
> Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |
Lucas Stach June 20, 2013, 8:26 a.m. UTC | #28
Am Donnerstag, den 20.06.2013, 09:17 +0100 schrieb Russell King - ARM
Linux:
> On Thu, Jun 20, 2013 at 09:47:07AM +0200, Lucas Stach wrote:
> > Am Donnerstag, den 20.06.2013, 15:43 +0900 schrieb Inki Dae:
> > > 
> > > > -----Original Message-----
> > > > From: dri-devel-bounces+inki.dae=samsung.com@lists.freedesktop.org
> > > > [mailto:dri-devel-bounces+inki.dae=samsung.com@lists.freedesktop.org] On
> > > > Behalf Of Russell King - ARM Linux
> > > > Sent: Thursday, June 20, 2013 3:29 AM
> > > > To: Inki Dae
> > > > Cc: linux-fbdev; DRI mailing list; Kyungmin Park; myungjoo.ham; YoungJun
> > > > Cho; linux-media@vger.kernel.org; linux-arm-kernel@lists.infradead.org
> > > > Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
> > > > framework
> > > > 
> > > > On Thu, Jun 20, 2013 at 12:10:04AM +0900, Inki Dae wrote:
> > > > > On the other hand, the below shows how we could enhance the conventional
> > > > > way with my approach (just example):
> > > > >
> > > > > CPU -> DMA,
> > > > >         ioctl(qbuf command)              ioctl(streamon)
> > > > >               |                                               |
> > > > >               |                                               |
> > > > >         qbuf  <- dma_buf_sync_get   start streaming <- syncpoint
> > > > >
> > > > > dma_buf_sync_get just registers a sync buffer(dmabuf) to sync object.
> > > > And
> > > > > the syncpoint is performed by calling dma_buf_sync_lock(), and then DMA
> > > > > accesses the sync buffer.
> > > > >
> > > > > And DMA -> CPU,
> > > > >         ioctl(dqbuf command)
> > > > >               |
> > > > >               |
> > > > >         dqbuf <- nothing to do
> > > > >
> > > > > Actual syncpoint is when DMA operation is completed (in interrupt
> > > > handler):
> > > > > the syncpoint is performed by calling dma_buf_sync_unlock().
> > > > > Hence,  my approach is to move the syncpoints into just before dma
> > > > access
> > > > > as long as possible.
> > > > 
> > > > What you've just described does *not* work on architectures such as
> > > > ARMv7 which do speculative cache fetches from memory at any time that
> > > > that memory is mapped with a cacheable status, and will lead to data
> > > > corruption.
> > > 
> > > I didn't explain that enough. Sorry about that. 'nothing to do' means that a
> > > dmabuf sync interface isn't called but existing functions are called. So
> > > this may be explained again:
> > >         ioctl(dqbuf command)
> > >             |
> > >             |
> > >         dqbuf <- 1. dma_unmap_sg
> > >                     2. dma_buf_sync_unlock (syncpoint)
> > > 
> > > The syncpoint I mentioned means lock mechanism; not doing cache operation.
> > > 
> > > In addition, please see the below more detail examples.
> > > 
> > > The conventional way (without dmabuf-sync) is:
> > > Task A                             
> > > ----------------------------
> > >  1. CPU accesses buf          
> > >  2. Send the buf to Task B  
> > >  3. Wait for the buf from Task B
> > >  4. go to 1
> > > 
> > > Task B
> > > ---------------------------
> > > 1. Wait for the buf from Task A
> > > 2. qbuf the buf                 
> > >     2.1 insert the buf to incoming queue
> > > 3. stream on
> > >     3.1 dma_map_sg if ready, and move the buf to ready queue
> > >     3.2 get the buf from ready queue, and dma start.
> > > 4. dqbuf
> > >     4.1 dma_unmap_sg after dma operation completion
> > >     4.2 move the buf to outgoing queue
> > > 5. back the buf to Task A
> > > 6. go to 1
> > > 
> > > In case that two tasks share buffers, and data flow goes from Task A to Task
> > > B, we would need IPC operation to send and receive buffers properly between
> > > those two tasks every time CPU or DMA access to buffers is started or
> > > completed.
> > > 
> > > 
> > > With dmabuf-sync:
> > > 
> > > Task A                             
> > > ----------------------------
> > >  1. dma_buf_sync_lock <- synpoint (call by user side)
> > >  2. CPU accesses buf          
> > >  3. dma_buf_sync_unlock <- syncpoint (call by user side)
> > >  4. Send the buf to Task B (just one time)
> > >  5. go to 1
> > > 
> > > 
> > > Task B
> > > ---------------------------
> > > 1. Wait for the buf from Task A (just one time)
> > > 2. qbuf the buf                 
> > >     1.1 insert the buf to incoming queue
> > > 3. stream on
> > >     3.1 dma_buf_sync_lock <- syncpoint (call by kernel side)
> > >     3.2 dma_map_sg if ready, and move the buf to ready queue
> > >     3.3 get the buf from ready queue, and dma start.
> > > 4. dqbuf
> > >     4.1 dma_buf_sync_unlock <- syncpoint (call by kernel side)
> > >     4.2 dma_unmap_sg after dma operation completion
> > >     4.3 move the buf to outgoing queue
> > > 5. go to 1
> > > 
> > > On the other hand, in case of using dmabuf-sync, as you can see the above
> > > example, we would need IPC operation just one time. That way, I think we
> > > could not only reduce performance overhead but also make user application
> > > simplified. Of course, this approach can be used for all DMA device drivers
> > > such as DRM. I'm not a specialist in v4l2 world so there may be missing
> > > point.
> > > 
> > 
> > You already need some kind of IPC between the two tasks, as I suspect
> > even in your example it wouldn't make much sense to queue the buffer
> > over and over again in task B without task A writing anything to it. So
> > task A has to signal task B there is new data in the buffer to be
> > processed.
> 
> Hang on.  Since when did dma_buf become another inter-process IPC
> mechanism?  That's *not* it's design goal, and there's other much
> better mechanisms already provided.
> 
That's why I said out-of-band completion signaling, particularly not
bound to the dma-buf itself.

My thinking was more along the lines of the wayland protocol, where one
process tells the compositor to use a buf as the pixel data for the next
frame and promises not to access it while the compositor uses it. When
the compositor finishes it tells the client that it's now free to reuse
the buffer. None of this is bound to the dma-buf.

Regards,
Lucas
Lucas Stach June 20, 2013, 10:11 a.m. UTC | #29
Am Donnerstag, den 20.06.2013, 17:24 +0900 schrieb Inki Dae:
[...]
> > > In addition, please see the below more detail examples.
> > >
> > > The conventional way (without dmabuf-sync) is:
> > > Task A
> > > ----------------------------
> > >  1. CPU accesses buf
> > >  2. Send the buf to Task B
> > >  3. Wait for the buf from Task B
> > >  4. go to 1
> > >
> > > Task B
> > > ---------------------------
> > > 1. Wait for the buf from Task A
> > > 2. qbuf the buf
> > >     2.1 insert the buf to incoming queue
> > > 3. stream on
> > >     3.1 dma_map_sg if ready, and move the buf to ready queue
> > >     3.2 get the buf from ready queue, and dma start.
> > > 4. dqbuf
> > >     4.1 dma_unmap_sg after dma operation completion
> > >     4.2 move the buf to outgoing queue
> > > 5. back the buf to Task A
> > > 6. go to 1
> > >
> > > In case that two tasks share buffers, and data flow goes from Task A to
> > Task
> > > B, we would need IPC operation to send and receive buffers properly
> > between
> > > those two tasks every time CPU or DMA access to buffers is started or
> > > completed.
> > >
> > >
> > > With dmabuf-sync:
> > >
> > > Task A
> > > ----------------------------
> > >  1. dma_buf_sync_lock <- synpoint (call by user side)
> > >  2. CPU accesses buf
> > >  3. dma_buf_sync_unlock <- syncpoint (call by user side)
> > >  4. Send the buf to Task B (just one time)
> > >  5. go to 1
> > >
> > >
> > > Task B
> > > ---------------------------
> > > 1. Wait for the buf from Task A (just one time)
> > > 2. qbuf the buf
> > >     1.1 insert the buf to incoming queue
> > > 3. stream on
> > >     3.1 dma_buf_sync_lock <- syncpoint (call by kernel side)
> > >     3.2 dma_map_sg if ready, and move the buf to ready queue
> > >     3.3 get the buf from ready queue, and dma start.
> > > 4. dqbuf
> > >     4.1 dma_buf_sync_unlock <- syncpoint (call by kernel side)
> > >     4.2 dma_unmap_sg after dma operation completion
> > >     4.3 move the buf to outgoing queue
> > > 5. go to 1
> > >
> > > On the other hand, in case of using dmabuf-sync, as you can see the
> > above
> > > example, we would need IPC operation just one time. That way, I think we
> > > could not only reduce performance overhead but also make user
> > application
> > > simplified. Of course, this approach can be used for all DMA device
> > drivers
> > > such as DRM. I'm not a specialist in v4l2 world so there may be missing
> > > point.
> > >
> > 
> > You already need some kind of IPC between the two tasks, as I suspect
> > even in your example it wouldn't make much sense to queue the buffer
> > over and over again in task B without task A writing anything to it. So
> > task A has to signal task B there is new data in the buffer to be
> > processed.
> > 
> > There is no need to share the buffer over and over again just to get the
> > two processes to work together on the same thing. Just share the fd
> > between both and then do out-of-band completion signaling, as you need
> > this anyway. Without this you'll end up with unpredictable behavior.
> > Just because sync allows you to access the buffer doesn't mean it's
> > valid for your use-case. Without completion signaling you could easily
> > end up overwriting your data from task A multiple times before task B
> > even tries to lock the buffer for processing.
> > 
> > So the valid flow is (and this already works with the current APIs):
> > Task A                                    Task B
> > ------                                    ------
> > CPU access buffer
> >          ----------completion signal--------->
> >                                           qbuf (dragging buffer into
> >                                           device domain, flush caches,
> >                                           reserve buffer etc.)
> >                                                     |
> >                                           wait for device operation to
> >                                           complete
> >                                                     |
> >                                           dqbuf (dragging buffer back
> >                                           into CPU domain, invalidate
> >                                           caches, unreserve)
> >         <---------completion signal------------
> > CPU access buffer
> > 
> 
> Correct. In case that data flow goes from A to B, it needs some kind
> of IPC between the two tasks every time as you said. Then, without
> dmabuf-sync, how do think about the case that two tasks share the same
> buffer but these tasks access the buffer(buf1) as write, and data of
> the buffer(buf1) isn't needed to be shared?
> 
Sorry, I don't see the point you are trying to solve here. If you share
a buffer and want its content to be clearly defined at every point in
time you have to synchronize the tasks working with the buffer, not just
the buffer accesses itself.

Easiest way to do so is doing sync through userspace with out-of-band
IPC, like in the example above. A more advanced way to achieve this
would be using cross-device fences to avoid going through userspace for
every syncpoint.

> 
> With dmabuf-sync is:
> 
>  Task A
>  ----------------------------
>  1. dma_buf_sync_lock <- synpoint (call by user side)
>  2. CPU writes something to buf1
>  3. dma_buf_sync_unlock <- syncpoint (call by user side)
>  4. copy buf1 to buf2
Random contents here? What's in the buffer, content from the CPU write,
or from V4L2 device write?

>  5. go to 1
> 
> 
>  Task B
>  ---------------------------
>  1. dma_buf_sync_lock
>  2. CPU writes something to buf3
>  3. dma_buf_sync_unlock
>  4. qbuf the buf3(src) and buf1(dst)
>      4.1 insert buf3,1 to incoming queue
>      4.2 dma_buf_sync_lock <- syncpoint (call by kernel side)
>  5. stream on
>      5.1 dma_map_sg if ready, and move the buf to ready queue
>      5.2 get the buf from ready queue, and dma start.
>  6. dqbuf
>      6.1 dma_buf_sync_unlock <- syncpoint (call by kernel side)
>      6.2 dma_unmap_sg after dma operation completion
>      6.3 move the buf3,1 to outgoing queue
> 7. go to 1
> 

Regards,
Lucas
Inki Dae June 20, 2013, 11:15 a.m. UTC | #30
> -----Original Message-----
> From: Lucas Stach [mailto:l.stach@pengutronix.de]
> Sent: Thursday, June 20, 2013 7:11 PM
> To: Inki Dae
> Cc: 'Russell King - ARM Linux'; 'Inki Dae'; 'linux-fbdev'; 'YoungJun Cho';
> 'Kyungmin Park'; 'myungjoo.ham'; 'DRI mailing list'; linux-arm-
> kernel@lists.infradead.org; linux-media@vger.kernel.org
> Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
> framework
> 
> Am Donnerstag, den 20.06.2013, 17:24 +0900 schrieb Inki Dae:
> [...]
> > > > In addition, please see the below more detail examples.
> > > >
> > > > The conventional way (without dmabuf-sync) is:
> > > > Task A
> > > > ----------------------------
> > > >  1. CPU accesses buf
> > > >  2. Send the buf to Task B
> > > >  3. Wait for the buf from Task B
> > > >  4. go to 1
> > > >
> > > > Task B
> > > > ---------------------------
> > > > 1. Wait for the buf from Task A
> > > > 2. qbuf the buf
> > > >     2.1 insert the buf to incoming queue
> > > > 3. stream on
> > > >     3.1 dma_map_sg if ready, and move the buf to ready queue
> > > >     3.2 get the buf from ready queue, and dma start.
> > > > 4. dqbuf
> > > >     4.1 dma_unmap_sg after dma operation completion
> > > >     4.2 move the buf to outgoing queue
> > > > 5. back the buf to Task A
> > > > 6. go to 1
> > > >
> > > > In case that two tasks share buffers, and data flow goes from Task A
> to
> > > Task
> > > > B, we would need IPC operation to send and receive buffers properly
> > > between
> > > > those two tasks every time CPU or DMA access to buffers is started
> or
> > > > completed.
> > > >
> > > >
> > > > With dmabuf-sync:
> > > >
> > > > Task A
> > > > ----------------------------
> > > >  1. dma_buf_sync_lock <- synpoint (call by user side)
> > > >  2. CPU accesses buf
> > > >  3. dma_buf_sync_unlock <- syncpoint (call by user side)
> > > >  4. Send the buf to Task B (just one time)
> > > >  5. go to 1
> > > >
> > > >
> > > > Task B
> > > > ---------------------------
> > > > 1. Wait for the buf from Task A (just one time)
> > > > 2. qbuf the buf
> > > >     1.1 insert the buf to incoming queue
> > > > 3. stream on
> > > >     3.1 dma_buf_sync_lock <- syncpoint (call by kernel side)
> > > >     3.2 dma_map_sg if ready, and move the buf to ready queue
> > > >     3.3 get the buf from ready queue, and dma start.
> > > > 4. dqbuf
> > > >     4.1 dma_buf_sync_unlock <- syncpoint (call by kernel side)
> > > >     4.2 dma_unmap_sg after dma operation completion
> > > >     4.3 move the buf to outgoing queue
> > > > 5. go to 1
> > > >
> > > > On the other hand, in case of using dmabuf-sync, as you can see the
> > > above
> > > > example, we would need IPC operation just one time. That way, I
> think we
> > > > could not only reduce performance overhead but also make user
> > > application
> > > > simplified. Of course, this approach can be used for all DMA device
> > > drivers
> > > > such as DRM. I'm not a specialist in v4l2 world so there may be
> missing
> > > > point.
> > > >
> > >
> > > You already need some kind of IPC between the two tasks, as I suspect
> > > even in your example it wouldn't make much sense to queue the buffer
> > > over and over again in task B without task A writing anything to it.
> So
> > > task A has to signal task B there is new data in the buffer to be
> > > processed.
> > >
> > > There is no need to share the buffer over and over again just to get
> the
> > > two processes to work together on the same thing. Just share the fd
> > > between both and then do out-of-band completion signaling, as you need
> > > this anyway. Without this you'll end up with unpredictable behavior.
> > > Just because sync allows you to access the buffer doesn't mean it's
> > > valid for your use-case. Without completion signaling you could easily
> > > end up overwriting your data from task A multiple times before task B
> > > even tries to lock the buffer for processing.
> > >
> > > So the valid flow is (and this already works with the current APIs):
> > > Task A                                    Task B
> > > ------                                    ------
> > > CPU access buffer
> > >          ----------completion signal--------->
> > >                                           qbuf (dragging buffer into
> > >                                           device domain, flush caches,
> > >                                           reserve buffer etc.)
> > >                                                     |
> > >                                           wait for device operation to
> > >                                           complete
> > >                                                     |
> > >                                           dqbuf (dragging buffer back
> > >                                           into CPU domain, invalidate
> > >                                           caches, unreserve)
> > >         <---------completion signal------------
> > > CPU access buffer
> > >
> >
> > Correct. In case that data flow goes from A to B, it needs some kind
> > of IPC between the two tasks every time as you said. Then, without
> > dmabuf-sync, how do think about the case that two tasks share the same
> > buffer but these tasks access the buffer(buf1) as write, and data of
> > the buffer(buf1) isn't needed to be shared?
> >
> Sorry, I don't see the point you are trying to solve here. If you share
> a buffer and want its content to be clearly defined at every point in
> time you have to synchronize the tasks working with the buffer, not just
> the buffer accesses itself.
> 
> Easiest way to do so is doing sync through userspace with out-of-band
> IPC, like in the example above.

In my opinion, that's not definitely easiest way. What I try to do is to avoid using *the out-of-band IPC*. As I mentioned in document file, the conventional mechanism not only makes user application complicated-user process needs to understand how the device driver is worked-but also may incur performance overhead by using the out-of-band IPC. The above my example may not be enough to you but there would be other cases able to use my approach efficiently.

> A more advanced way to achieve this
> would be using cross-device fences to avoid going through userspace for
> every syncpoint.
> 

Ok, maybe there is something I missed. So question. What is the cross-device fences? dma fence?. And how we can achieve the synchronization mechanism without going through user space for every syncpoint; CPU and DMA share a same buffer?. And could you explain it in detail as long as possible like I did?

> >
> > With dmabuf-sync is:
> >
> >  Task A
> >  ----------------------------
> >  1. dma_buf_sync_lock <- synpoint (call by user side)
> >  2. CPU writes something to buf1
> >  3. dma_buf_sync_unlock <- syncpoint (call by user side)
> >  4. copy buf1 to buf2
> Random contents here? What's in the buffer, content from the CPU write,
> or from V4L2 device write?
> 

Please presume that buf1 is physically non contiguous memory, and buf2 is physically contiguous memory; device A without IOMMU is seeing buf2. We would need to copy buf1 to buf2 to send the contents of the buf1 to device A because DMA of the device A cannot access the buf1 directly. And CPU and V4L2 device don't share the contents of the buf1 but share the buf1 as storage.

Thanks,
Inki Dae

> >  5. go to 1
> >
> >
> >  Task B
> >  ---------------------------
> >  1. dma_buf_sync_lock
> >  2. CPU writes something to buf3
> >  3. dma_buf_sync_unlock
> >  4. qbuf the buf3(src) and buf1(dst)
> >      4.1 insert buf3,1 to incoming queue
> >      4.2 dma_buf_sync_lock <- syncpoint (call by kernel side)
> >  5. stream on
> >      5.1 dma_map_sg if ready, and move the buf to ready queue
> >      5.2 get the buf from ready queue, and dma start.
> >  6. dqbuf
> >      6.1 dma_buf_sync_unlock <- syncpoint (call by kernel side)
> >      6.2 dma_unmap_sg after dma operation completion
> >      6.3 move the buf3,1 to outgoing queue
> > 7. go to 1
> >
> 
> Regards,
> Lucas
> --
> Pengutronix e.K.                           | Lucas Stach                 |
> Industrial Linux Solutions                 | http://www.pengutronix.de/  |
> Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 |
> Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |
Lucas Stach June 21, 2013, 8:54 a.m. UTC | #31
Am Donnerstag, den 20.06.2013, 20:15 +0900 schrieb Inki Dae:
[...]
> > > > You already need some kind of IPC between the two tasks, as I suspect
> > > > even in your example it wouldn't make much sense to queue the buffer
> > > > over and over again in task B without task A writing anything to it.
> > So
> > > > task A has to signal task B there is new data in the buffer to be
> > > > processed.
> > > >
> > > > There is no need to share the buffer over and over again just to get
> > the
> > > > two processes to work together on the same thing. Just share the fd
> > > > between both and then do out-of-band completion signaling, as you need
> > > > this anyway. Without this you'll end up with unpredictable behavior.
> > > > Just because sync allows you to access the buffer doesn't mean it's
> > > > valid for your use-case. Without completion signaling you could easily
> > > > end up overwriting your data from task A multiple times before task B
> > > > even tries to lock the buffer for processing.
> > > >
> > > > So the valid flow is (and this already works with the current APIs):
> > > > Task A                                    Task B
> > > > ------                                    ------
> > > > CPU access buffer
> > > >          ----------completion signal--------->
> > > >                                           qbuf (dragging buffer into
> > > >                                           device domain, flush caches,
> > > >                                           reserve buffer etc.)
> > > >                                                     |
> > > >                                           wait for device operation to
> > > >                                           complete
> > > >                                                     |
> > > >                                           dqbuf (dragging buffer back
> > > >                                           into CPU domain, invalidate
> > > >                                           caches, unreserve)
> > > >         <---------completion signal------------
> > > > CPU access buffer
> > > >
> > >
> > > Correct. In case that data flow goes from A to B, it needs some kind
> > > of IPC between the two tasks every time as you said. Then, without
> > > dmabuf-sync, how do think about the case that two tasks share the same
> > > buffer but these tasks access the buffer(buf1) as write, and data of
> > > the buffer(buf1) isn't needed to be shared?
> > >
> > Sorry, I don't see the point you are trying to solve here. If you share
> > a buffer and want its content to be clearly defined at every point in
> > time you have to synchronize the tasks working with the buffer, not just
> > the buffer accesses itself.
> > 
> > Easiest way to do so is doing sync through userspace with out-of-band
> > IPC, like in the example above.
> 
> In my opinion, that's not definitely easiest way. What I try to do is
> to avoid using *the out-of-band IPC*. As I mentioned in document file,
> the conventional mechanism not only makes user application
> complicated-user process needs to understand how the device driver is
> worked-but also may incur performance overhead by using the
> out-of-band IPC. The above my example may not be enough to you but
> there would be other cases able to use my approach efficiently.
> 

Yeah, you'll some knowledge and understanding about the API you are
working with to get things right. But I think it's not an unreasonable
thing to expect the programmer working directly with kernel interfaces
to read up on how things work.

Second thing: I'll rather have *one* consistent API for every subsystem,
even if they differ from each other than having to implement this
syncpoint thing in every subsystem. Remember: a single execbuf in DRM
might reference both GEM objects backed by dma-buf as well native SHM or
CMA backed objects. The dma-buf-mgr proposal already allows you to
handle dma-bufs much the same way during validation than native GEM
objects.

And to get back to my original point: if you have more than one task
operating together on a buffer you absolutely need some kind of real IPC
to sync them up and do something useful. Both you syncpoints and the
proposed dma-fences only protect the buffer accesses to make sure
different task don't stomp on each other. There is nothing in there to
make sure that the output of your pipeline is valid. You have to take
care of that yourself in userspace. I'll reuse your example to make it
clear what I mean:

Task A                                         Task B
------                                         -------
dma_buf_sync_lock(buf1)
CPU write buf1
dma_buf_sync_unlock(buf1)
          ---------schedule Task A again-------
dma_buf_sync_lock(buf1)
CPU write buf1
dma_buf_sync_unlock(buf1)
            ---------schedule Task B---------
                                               qbuf(buf1)
                                                  dma_buf_sync_lock(buf1)
                                               ....

This is what can happen if you don't take care of proper syncing. Task A
writes something to the buffer in expectation that Task B will take care
of it, but before Task B even gets scheduled Task A overwrites the
buffer again. Not what you wanted, isn't it?

So to make sure the output of a pipeline of some kind is what you expect
you have to do syncing with IPC. And once you do CPU access it is a
synchronous thing in the stream of events. I see that you might want to
have some kind of bracketed CPU access even for the fallback mmap case
for things like V4L2 that don't provide explicit sync by their own, but
in no way I can see why we would need a user/kernel shared syncpoint for
this.

> > A more advanced way to achieve this
> > would be using cross-device fences to avoid going through userspace for
> > every syncpoint.
> > 
> 
> Ok, maybe there is something I missed. So question. What is the
> cross-device fences? dma fence?. And how we can achieve the
> synchronization mechanism without going through user space for every
> syncpoint; CPU and DMA share a same buffer?. And could you explain it
> in detail as long as possible like I did?
> 
Yeah I'm talking about the proposed dma-fences. They would allow you to
just queue things into the kernel without waiting for a device operation
to finish. But you still have to make sure that your commands have the
right order and don't go wild. So for example you could do something
like this:

Userspace                                   Kernel
---------                                   ------
1. build DRM command stream
rendering into buf1
2. queue command stream with execbuf
                                            1. validate command stream
                                             1.1 reference buf1 for writing
                                                 through dma-buf-mgr
                                            2. kick off GPU processing
3. qbuf buf1 into V4L2
                                            3. reference buf1 for reading
                                             3.1 wait for fence from GPU to
                                                 signal
                                            4. kick off V4L2 processing

So you don't need to wait in userspace and potentially avoid some
context switches, but you still have to make sure that GPU commands are
queued before you queue the V4L2 operation to make sure things get
operated on in the right order.

Regards,
Lucas
Inki Dae June 21, 2013, 11:01 a.m. UTC | #32
2013/6/21 Lucas Stach <l.stach@pengutronix.de>

> Am Donnerstag, den 20.06.2013, 20:15 +0900 schrieb Inki Dae:
> [...]
> > > > > You already need some kind of IPC between the two tasks, as I
> suspect
> > > > > even in your example it wouldn't make much sense to queue the
> buffer
> > > > > over and over again in task B without task A writing anything to
> it.
> > > So
> > > > > task A has to signal task B there is new data in the buffer to be
> > > > > processed.
> > > > >
> > > > > There is no need to share the buffer over and over again just to
> get
> > > the
> > > > > two processes to work together on the same thing. Just share the fd
> > > > > between both and then do out-of-band completion signaling, as you
> need
> > > > > this anyway. Without this you'll end up with unpredictable
> behavior.
> > > > > Just because sync allows you to access the buffer doesn't mean it's
> > > > > valid for your use-case. Without completion signaling you could
> easily
> > > > > end up overwriting your data from task A multiple times before
> task B
> > > > > even tries to lock the buffer for processing.
> > > > >
> > > > > So the valid flow is (and this already works with the current
> APIs):
> > > > > Task A                                    Task B
> > > > > ------                                    ------
> > > > > CPU access buffer
> > > > >          ----------completion signal--------->
> > > > >                                           qbuf (dragging buffer
> into
> > > > >                                           device domain, flush
> caches,
> > > > >                                           reserve buffer etc.)
> > > > >                                                     |
> > > > >                                           wait for device
> operation to
> > > > >                                           complete
> > > > >                                                     |
> > > > >                                           dqbuf (dragging buffer
> back
> > > > >                                           into CPU domain,
> invalidate
> > > > >                                           caches, unreserve)
> > > > >         <---------completion signal------------
> > > > > CPU access buffer
> > > > >
> > > >
> > > > Correct. In case that data flow goes from A to B, it needs some kind
> > > > of IPC between the two tasks every time as you said. Then, without
> > > > dmabuf-sync, how do think about the case that two tasks share the
> same
> > > > buffer but these tasks access the buffer(buf1) as write, and data of
> > > > the buffer(buf1) isn't needed to be shared?
> > > >
> > > Sorry, I don't see the point you are trying to solve here. If you share
> > > a buffer and want its content to be clearly defined at every point in
> > > time you have to synchronize the tasks working with the buffer, not
> just
> > > the buffer accesses itself.
> > >
> > > Easiest way to do so is doing sync through userspace with out-of-band
> > > IPC, like in the example above.
> >
> > In my opinion, that's not definitely easiest way. What I try to do is
> > to avoid using *the out-of-band IPC*. As I mentioned in document file,
> > the conventional mechanism not only makes user application
> > complicated-user process needs to understand how the device driver is
> > worked-but also may incur performance overhead by using the
> > out-of-band IPC. The above my example may not be enough to you but
> > there would be other cases able to use my approach efficiently.
> >
>
> Yeah, you'll some knowledge and understanding about the API you are
> working with to get things right. But I think it's not an unreasonable
> thing to expect the programmer working directly with kernel interfaces
> to read up on how things work.
>
> Second thing: I'll rather have *one* consistent API for every subsystem,
> even if they differ from each other than having to implement this
> syncpoint thing in every subsystem. Remember: a single execbuf in DRM
> might reference both GEM objects backed by dma-buf as well native SHM or
> CMA backed objects. The dma-buf-mgr proposal already allows you to
> handle dma-bufs much the same way during validation than native GEM
> objects.
>

Actually, at first I had implemented a fence helper framework based on
reservation and dma fence to provide easy-use-interface for device
drivers. However, that was wrong implemention: I had not only customized
the dma fence but also not considered dead lock issue. After that, I have
reimplemented it as dmabuf sync to solve dead issue, and at that time, I
realized that we first need to concentrate on the most basic thing: the
fact CPU and CPU, CPU and DMA, or DMA and DMA can access a same buffer, And
the fact simple is the best, and the fact we need not only kernel side but
also user side interfaces. After that, I collected what is the common part
for all subsystems, and I have devised this dmabuf sync framework for it.
I'm not really specialist in Desktop world. So question. isn't the execbuf
used only for the GPU? the gpu has dedicated video memory(VRAM) so it needs
migration mechanism between system memory and the dedicated video memory,
and also to consider ordering issue while be migrated.


>
> And to get back to my original point: if you have more than one task
> operating together on a buffer you absolutely need some kind of real IPC
> to sync them up and do something useful. Both you syncpoints and the
> proposed dma-fences only protect the buffer accesses to make sure
> different task don't stomp on each other. There is nothing in there to
> make sure that the output of your pipeline is valid. You have to take
> care of that yourself in userspace. I'll reuse your example to make it
> clear what I mean:
>
> Task A                                         Task B
> ------                                         -------
> dma_buf_sync_lock(buf1)
> CPU write buf1
> dma_buf_sync_unlock(buf1)
>           ---------schedule Task A again-------
> dma_buf_sync_lock(buf1)
> CPU write buf1
> dma_buf_sync_unlock(buf1)
>             ---------schedule Task B---------
>                                                qbuf(buf1)
>                                                   dma_buf_sync_lock(buf1)
>                                                ....
>
> This is what can happen if you don't take care of proper syncing. Task A
> writes something to the buffer in expectation that Task B will take care
> of it, but before Task B even gets scheduled Task A overwrites the
> buffer again. Not what you wanted, isn't it?
>

Exactly wrong example. I had already mentioned about that. "In case that
data flow goes from A to B, it needs some kind of IPC between the two tasks
every time"  So again, your example would have no any problem in case that
*two tasks share the same buffer but these tasks access the buffer(buf1) as
write, and data of the buffer(buf1) isn't needed to be shared*.  They just
need to use the buffer as *storage*. So All they want is to avoid stomping
on the buffer in this case.


>
> So to make sure the output of a pipeline of some kind is what you expect
> you have to do syncing with IPC


So not true.


> . And once you do CPU access it is a
> synchronous thing in the stream of events. I see that you might want to
> have some kind of bracketed CPU access even for the fallback mmap case
> for things like V4L2 that don't provide explicit sync by their own, but
> in no way I can see why we would need a user/kernel shared syncpoint for
> this.
>
> > > A more advanced way to achieve this
> > > would be using cross-device fences to avoid going through userspace for
> > > every syncpoint.
> > >
> >
> > Ok, maybe there is something I missed. So question. What is the
> > cross-device fences? dma fence?. And how we can achieve the
> > synchronization mechanism without going through user space for every
> > syncpoint; CPU and DMA share a same buffer?. And could you explain it
> > in detail as long as possible like I did?
> >
> Yeah I'm talking about the proposed dma-fences. They would allow you to
> just queue things into the kernel without waiting for a device operation
> to finish. But you still have to make sure that your commands have the
> right order and don't go wild. So for example you could do something
> like this:
>
> Userspace                                   Kernel
> ---------                                   ------
> 1. build DRM command stream
> rendering into buf1
> 2. queue command stream with execbuf
>                                             1. validate command stream
>                                              1.1 reference buf1 for writing
>                                                  through dma-buf-mgr
>                                             2. kick off GPU processing
> 3. qbuf buf1 into V4L2
>                                             3. reference buf1 for reading
>                                              3.1 wait for fence from GPU to
>                                                  signal
>                                             4. kick off V4L2 processing
>
>
That seems like very specific to Desktop GPU. isn't it?


> So you don't need to wait in userspace and potentially avoid some
> context switches,


Also not true.


> but you still have to make sure that GPU commands are
> queued before you queue the V4L2 operation to make sure things get
> operated on in the right order.
>
>
>


I'd like to say you that my approach is not perfact so may definietly
have many problems and addition works - actually, I found some problems and
are solving them, in addition, the implememtation to generic user side
interfacing mechanism is in progress for the destination - , and thanks to
your comments. However, I think we can try to do for more better something.
Lastly, I'll look forward to keeping up your good advices.

Thanks,
Inki Dae



> Regards,
> Lucas
>
> --
> Pengutronix e.K.                           | Lucas Stach                 |
> Industrial Linux Solutions                 | http://www.pengutronix.de/  |
> Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 |
> Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |
>
>
Lucas Stach June 21, 2013, 12:27 p.m. UTC | #33
Hi Inki,

please refrain from sending HTML Mails, it makes proper quoting without
messing up the layout everywhere pretty hard.

Am Freitag, den 21.06.2013, 20:01 +0900 schrieb Inki Dae:
[...]

>         Yeah, you'll some knowledge and understanding about the API
>         you are
>         working with to get things right. But I think it's not an
>         unreasonable
>         thing to expect the programmer working directly with kernel
>         interfaces
>         to read up on how things work.
>         
>         Second thing: I'll rather have *one* consistent API for every
>         subsystem,
>         even if they differ from each other than having to implement
>         this
>         syncpoint thing in every subsystem. Remember: a single execbuf
>         in DRM
>         might reference both GEM objects backed by dma-buf as well
>         native SHM or
>         CMA backed objects. The dma-buf-mgr proposal already allows
>         you to
>         handle dma-bufs much the same way during validation than
>         native GEM
>         objects.
>  
> Actually, at first I had implemented a fence helper framework based on
> reservation and dma fence to provide easy-use-interface for device
> drivers. However, that was wrong implemention: I had not only
> customized the dma fence but also not considered dead lock issue.
> After that, I have reimplemented it as dmabuf sync to solve dead
> issue, and at that time, I realized that we first need to concentrate
> on the most basic thing: the fact CPU and CPU, CPU and DMA, or DMA and
> DMA can access a same buffer, And the fact simple is the best, and the
> fact we need not only kernel side but also user side interfaces. After
> that, I collected what is the common part for all subsystems, and I
> have devised this dmabuf sync framework for it. I'm not really
> specialist in Desktop world. So question. isn't the execbuf used only
> for the GPU? the gpu has dedicated video memory(VRAM) so it needs
> migration mechanism between system memory and the dedicated video
> memory, and also to consider ordering issue while be migrated.
>  

Yeah, execbuf is pretty GPU specific, but I don't see how this matters
for this discussion. Also I don't see a big difference between embedded
and desktop GPUs. Buffer migration is more of a detail here. Both take
command stream that potentially reference other buffers, which might be
native GEM or dma-buf backed objects. Both have to make sure the buffers
are in the right domain (caches cleaned and address mappings set up) and
are available for the desired operation, meaning you have to sync with
other DMA engines and maybe also with CPU.

The only case where sync isn't clearly defined right now by the current
API entrypoints is when you access memory through the dma-buf fallback
mmap support, which might happen with some software processing element
in a video pipeline or something. I agree that we will need a userspace
interface here, but I think this shouldn't be yet another sync object,
but rather more a prepare/fini_cpu_access ioctl on the dma-buf which
hooks into the existing dma-fence and reservation stuff.

>         
>         And to get back to my original point: if you have more than
>         one task
>         operating together on a buffer you absolutely need some kind
>         of real IPC
>         to sync them up and do something useful. Both you syncpoints
>         and the
>         proposed dma-fences only protect the buffer accesses to make
>         sure
>         different task don't stomp on each other. There is nothing in
>         there to
>         make sure that the output of your pipeline is valid. You have
>         to take
>         care of that yourself in userspace. I'll reuse your example to
>         make it
>         clear what I mean:
>         
>         Task A                                         Task B
>         ------                                         -------
>         dma_buf_sync_lock(buf1)
>         CPU write buf1
>         dma_buf_sync_unlock(buf1)
>                   ---------schedule Task A again-------
>         dma_buf_sync_lock(buf1)
>         CPU write buf1
>         dma_buf_sync_unlock(buf1)
>                     ---------schedule Task B---------
>                                                        qbuf(buf1)
>         
>         dma_buf_sync_lock(buf1)
>                                                        ....
>         
>         This is what can happen if you don't take care of proper
>         syncing. Task A
>         writes something to the buffer in expectation that Task B will
>         take care
>         of it, but before Task B even gets scheduled Task A overwrites
>         the
>         buffer again. Not what you wanted, isn't it?
>  
> Exactly wrong example. I had already mentioned about that. "In case
> that data flow goes from A to B, it needs some kind of IPC between the
> two tasks every time"  So again, your example would have no any
> problem in case that *two tasks share the same buffer but these tasks
> access the buffer(buf1) as write, and data of the buffer(buf1) isn't
> needed to be shared*.  They just need to use the buffer as *storage*.
> So All they want is to avoid stomping on the buffer in this case.
>  
Sorry, but I don't see the point. If no one is interested in the data of
the buffer, why are you sharing it in the first place?

>         
>         So to make sure the output of a pipeline of some kind is what
>         you expect
>         you have to do syncing with IPC
>  
> So not true.
>  
>         . And once you do CPU access it is a
>         synchronous thing in the stream of events. I see that you
>         might want to
>         have some kind of bracketed CPU access even for the fallback
>         mmap case
>         for things like V4L2 that don't provide explicit sync by their
>         own, but
>         in no way I can see why we would need a user/kernel shared
>         syncpoint for
>         this.
>         
>         > > A more advanced way to achieve this
>         > > would be using cross-device fences to avoid going through
>         userspace for
>         > > every syncpoint.
>         > >
>         >
>         > Ok, maybe there is something I missed. So question. What is
>         the
>         > cross-device fences? dma fence?. And how we can achieve the
>         > synchronization mechanism without going through user space
>         for every
>         > syncpoint; CPU and DMA share a same buffer?. And could you
>         explain it
>         > in detail as long as possible like I did?
>         >
>         
>         Yeah I'm talking about the proposed dma-fences. They would
>         allow you to
>         just queue things into the kernel without waiting for a device
>         operation
>         to finish. But you still have to make sure that your commands
>         have the
>         right order and don't go wild. So for example you could do
>         something
>         like this:
>         
>         Userspace                                   Kernel
>         ---------                                   ------
>         1. build DRM command stream
>         rendering into buf1
>         2. queue command stream with execbuf
>                                                     1. validate
>         command stream
>                                                      1.1 reference
>         buf1 for writing
>                                                          through
>         dma-buf-mgr
>                                                     2. kick off GPU
>         processing
>         3. qbuf buf1 into V4L2
>                                                     3. reference buf1
>         for reading
>                                                      3.1 wait for
>         fence from GPU to
>                                                          signal
>                                                     4. kick off V4L2
>         processing
>         
>  
> That seems like very specific to Desktop GPU. isn't it?
>  
Would you mind explaining what you think is desktop specific about that?

Regards,
Lucas
Inki Dae June 21, 2013, 4:55 p.m. UTC | #34
2013/6/21 Lucas Stach <l.stach@pengutronix.de>:
> Hi Inki,
>
> please refrain from sending HTML Mails, it makes proper quoting without
> messing up the layout everywhere pretty hard.
>

Sorry about that. I should have used text mode.

> Am Freitag, den 21.06.2013, 20:01 +0900 schrieb Inki Dae:
> [...]
>
>>         Yeah, you'll some knowledge and understanding about the API
>>         you are
>>         working with to get things right. But I think it's not an
>>         unreasonable
>>         thing to expect the programmer working directly with kernel
>>         interfaces
>>         to read up on how things work.
>>
>>         Second thing: I'll rather have *one* consistent API for every
>>         subsystem,
>>         even if they differ from each other than having to implement
>>         this
>>         syncpoint thing in every subsystem. Remember: a single execbuf
>>         in DRM
>>         might reference both GEM objects backed by dma-buf as well
>>         native SHM or
>>         CMA backed objects. The dma-buf-mgr proposal already allows
>>         you to
>>         handle dma-bufs much the same way during validation than
>>         native GEM
>>         objects.
>>
>> Actually, at first I had implemented a fence helper framework based on
>> reservation and dma fence to provide easy-use-interface for device
>> drivers. However, that was wrong implemention: I had not only
>> customized the dma fence but also not considered dead lock issue.
>> After that, I have reimplemented it as dmabuf sync to solve dead
>> issue, and at that time, I realized that we first need to concentrate
>> on the most basic thing: the fact CPU and CPU, CPU and DMA, or DMA and
>> DMA can access a same buffer, And the fact simple is the best, and the
>> fact we need not only kernel side but also user side interfaces. After
>> that, I collected what is the common part for all subsystems, and I
>> have devised this dmabuf sync framework for it. I'm not really
>> specialist in Desktop world. So question. isn't the execbuf used only
>> for the GPU? the gpu has dedicated video memory(VRAM) so it needs
>> migration mechanism between system memory and the dedicated video
>> memory, and also to consider ordering issue while be migrated.
>>
>
> Yeah, execbuf is pretty GPU specific, but I don't see how this matters
> for this discussion. Also I don't see a big difference between embedded
> and desktop GPUs. Buffer migration is more of a detail here. Both take
> command stream that potentially reference other buffers, which might be
> native GEM or dma-buf backed objects. Both have to make sure the buffers
> are in the right domain (caches cleaned and address mappings set up) and
> are available for the desired operation, meaning you have to sync with
> other DMA engines and maybe also with CPU.

Yeah, right. Then, in case of desktop gpu, does't it need additional
something to do when a buffer/s is/are migrated from system memory to
video memory domain, or from video memory to system memory domain? I
guess the below members does similar thing, and all other DMA devices
would not need them:
        struct fence {
                  ...
                  unsigned int context, seqno;
                  ...
        };

And,
       struct seqno_fence {
                 ...
                 uint32_t seqno_ofs;
                 ...
       };

>
> The only case where sync isn't clearly defined right now by the current
> API entrypoints is when you access memory through the dma-buf fallback
> mmap support, which might happen with some software processing element
> in a video pipeline or something. I agree that we will need a userspace
> interface here, but I think this shouldn't be yet another sync object,
> but rather more a prepare/fini_cpu_access ioctl on the dma-buf which
> hooks into the existing dma-fence and reservation stuff.

I think we don't need addition ioctl commands for that. I am thinking
of using existing resources as possible. My idea also is similar in
using the reservation stuff to your idea because my approach also
should use the dma-buf resource. However, My idea is that a user
process, that wants buffer synchronization with the other, sees a sync
object as a file descriptor like dma-buf does. The below shows simple
my idea about it:

ioctl(dmabuf_fd, DMA_BUF_IOC_OPEN_SYNC, &sync);

flock(sync->fd, LOCK_SH); <- LOCK_SH means a shared lock.
CPU access for read
flock(sync->fd, LOCK_UN);

Or

flock(sync->fd, LOCK_EX); <- LOCK_EX means an exclusive lock
CPU access for write
flock(sync->fd, LOCK_UN);

close(sync->fd);

As you know, that's similar to dmabuf export feature.

In addition, a more simple idea,
flock(dmabuf_fd, LOCK_SH/EX);
CPU access for read/write
flock(dmabuf_fd, LOCK_UN);

However, I'm not sure that the above examples could be worked well,
and there are no problems yet: actually, I don't fully understand
flock mechanism, so looking into it.

>
>>
>>         And to get back to my original point: if you have more than
>>         one task
>>         operating together on a buffer you absolutely need some kind
>>         of real IPC
>>         to sync them up and do something useful. Both you syncpoints
>>         and the
>>         proposed dma-fences only protect the buffer accesses to make
>>         sure
>>         different task don't stomp on each other. There is nothing in
>>         there to
>>         make sure that the output of your pipeline is valid. You have
>>         to take
>>         care of that yourself in userspace. I'll reuse your example to
>>         make it
>>         clear what I mean:
>>
>>         Task A                                         Task B
>>         ------                                         -------
>>         dma_buf_sync_lock(buf1)
>>         CPU write buf1
>>         dma_buf_sync_unlock(buf1)
>>                   ---------schedule Task A again-------
>>         dma_buf_sync_lock(buf1)
>>         CPU write buf1
>>         dma_buf_sync_unlock(buf1)
>>                     ---------schedule Task B---------
>>                                                        qbuf(buf1)
>>
>>         dma_buf_sync_lock(buf1)
>>                                                        ....
>>
>>         This is what can happen if you don't take care of proper
>>         syncing. Task A
>>         writes something to the buffer in expectation that Task B will
>>         take care
>>         of it, but before Task B even gets scheduled Task A overwrites
>>         the
>>         buffer again. Not what you wanted, isn't it?
>>
>> Exactly wrong example. I had already mentioned about that. "In case
>> that data flow goes from A to B, it needs some kind of IPC between the
>> two tasks every time"  So again, your example would have no any
>> problem in case that *two tasks share the same buffer but these tasks
>> access the buffer(buf1) as write, and data of the buffer(buf1) isn't
>> needed to be shared*.  They just need to use the buffer as *storage*.
>> So All they want is to avoid stomping on the buffer in this case.
>>
> Sorry, but I don't see the point. If no one is interested in the data of
> the buffer, why are you sharing it in the first place?
>

Just used as a storage. i.e., Task A fills the buffer with "AAAAAA"
using CPU, And Task B fills the buffer with "BBBBBB" using DMA. They
don't share data of the buffer, but they share *memory region* of the
buffer. That would be very useful for the embedded systems with very
small size system memory.

>>
>>         So to make sure the output of a pipeline of some kind is what
>>         you expect
>>         you have to do syncing with IPC
>>
>> So not true.
>>
>>         . And once you do CPU access it is a
>>         synchronous thing in the stream of events. I see that you
>>         might want to
>>         have some kind of bracketed CPU access even for the fallback
>>         mmap case
>>         for things like V4L2 that don't provide explicit sync by their
>>         own, but
>>         in no way I can see why we would need a user/kernel shared
>>         syncpoint for
>>         this.
>>
>>         > > A more advanced way to achieve this
>>         > > would be using cross-device fences to avoid going through
>>         userspace for
>>         > > every syncpoint.
>>         > >
>>         >
>>         > Ok, maybe there is something I missed. So question. What is
>>         the
>>         > cross-device fences? dma fence?. And how we can achieve the
>>         > synchronization mechanism without going through user space
>>         for every
>>         > syncpoint; CPU and DMA share a same buffer?. And could you
>>         explain it
>>         > in detail as long as possible like I did?
>>         >
>>
>>         Yeah I'm talking about the proposed dma-fences. They would
>>         allow you to
>>         just queue things into the kernel without waiting for a device
>>         operation
>>         to finish. But you still have to make sure that your commands
>>         have the
>>         right order and don't go wild. So for example you could do
>>         something
>>         like this:
>>
>>         Userspace                                   Kernel
>>         ---------                                   ------
>>         1. build DRM command stream
>>         rendering into buf1
>>         2. queue command stream with execbuf
>>                                                     1. validate
>>         command stream
>>                                                      1.1 reference
>>         buf1 for writing
>>                                                          through
>>         dma-buf-mgr
>>                                                     2. kick off GPU
>>         processing
>>         3. qbuf buf1 into V4L2
>>                                                     3. reference buf1
>>         for reading
>>                                                      3.1 wait for
>>         fence from GPU to
>>                                                          signal
>>                                                     4. kick off V4L2
>>         processing
>>
>>
>> That seems like very specific to Desktop GPU. isn't it?
>>
> Would you mind explaining what you think is desktop specific about that?
>

Sorry. there was my misunderstanding. That's not specfic to desktop
gpu. I guess the V4L2 device and GPU device are in a same graphic card
in case of desktop pc, and also they would use same memory domain.
However, If we use other device, that uses system memory as dma
buffer, instead of V4L2 device then doesn't it need specific something
to handle memory domain migration issue? My point is that such thing
is not common portion we want to do. I think things common to entire
archtectures such as ARM, x86, and so on should be placed in
drivers/base/here.

That's just my opinion.

Thanks,
Inki Dae

> Regards,
> Lucas
>
> --
> Pengutronix e.K.                           | Lucas Stach                 |
> Industrial Linux Solutions                 | http://www.pengutronix.de/  |
> Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 |
> Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |
>
Jerome Glisse June 21, 2013, 7:02 p.m. UTC | #35
On Fri, Jun 21, 2013 at 12:55 PM, Inki Dae <daeinki@gmail.com> wrote:
> 2013/6/21 Lucas Stach <l.stach@pengutronix.de>:
>> Hi Inki,
>>
>> please refrain from sending HTML Mails, it makes proper quoting without
>> messing up the layout everywhere pretty hard.
>>
>
> Sorry about that. I should have used text mode.
>
>> Am Freitag, den 21.06.2013, 20:01 +0900 schrieb Inki Dae:
>> [...]
>>
>>>         Yeah, you'll some knowledge and understanding about the API
>>>         you are
>>>         working with to get things right. But I think it's not an
>>>         unreasonable
>>>         thing to expect the programmer working directly with kernel
>>>         interfaces
>>>         to read up on how things work.
>>>
>>>         Second thing: I'll rather have *one* consistent API for every
>>>         subsystem,
>>>         even if they differ from each other than having to implement
>>>         this
>>>         syncpoint thing in every subsystem. Remember: a single execbuf
>>>         in DRM
>>>         might reference both GEM objects backed by dma-buf as well
>>>         native SHM or
>>>         CMA backed objects. The dma-buf-mgr proposal already allows
>>>         you to
>>>         handle dma-bufs much the same way during validation than
>>>         native GEM
>>>         objects.
>>>
>>> Actually, at first I had implemented a fence helper framework based on
>>> reservation and dma fence to provide easy-use-interface for device
>>> drivers. However, that was wrong implemention: I had not only
>>> customized the dma fence but also not considered dead lock issue.
>>> After that, I have reimplemented it as dmabuf sync to solve dead
>>> issue, and at that time, I realized that we first need to concentrate
>>> on the most basic thing: the fact CPU and CPU, CPU and DMA, or DMA and
>>> DMA can access a same buffer, And the fact simple is the best, and the
>>> fact we need not only kernel side but also user side interfaces. After
>>> that, I collected what is the common part for all subsystems, and I
>>> have devised this dmabuf sync framework for it. I'm not really
>>> specialist in Desktop world. So question. isn't the execbuf used only
>>> for the GPU? the gpu has dedicated video memory(VRAM) so it needs
>>> migration mechanism between system memory and the dedicated video
>>> memory, and also to consider ordering issue while be migrated.
>>>
>>
>> Yeah, execbuf is pretty GPU specific, but I don't see how this matters
>> for this discussion. Also I don't see a big difference between embedded
>> and desktop GPUs. Buffer migration is more of a detail here. Both take
>> command stream that potentially reference other buffers, which might be
>> native GEM or dma-buf backed objects. Both have to make sure the buffers
>> are in the right domain (caches cleaned and address mappings set up) and
>> are available for the desired operation, meaning you have to sync with
>> other DMA engines and maybe also with CPU.
>
> Yeah, right. Then, in case of desktop gpu, does't it need additional
> something to do when a buffer/s is/are migrated from system memory to
> video memory domain, or from video memory to system memory domain? I
> guess the below members does similar thing, and all other DMA devices
> would not need them:
>         struct fence {
>                   ...
>                   unsigned int context, seqno;
>                   ...
>         };
>
> And,
>        struct seqno_fence {
>                  ...
>                  uint32_t seqno_ofs;
>                  ...
>        };
>
>>
>> The only case where sync isn't clearly defined right now by the current
>> API entrypoints is when you access memory through the dma-buf fallback
>> mmap support, which might happen with some software processing element
>> in a video pipeline or something. I agree that we will need a userspace
>> interface here, but I think this shouldn't be yet another sync object,
>> but rather more a prepare/fini_cpu_access ioctl on the dma-buf which
>> hooks into the existing dma-fence and reservation stuff.
>
> I think we don't need addition ioctl commands for that. I am thinking
> of using existing resources as possible. My idea also is similar in
> using the reservation stuff to your idea because my approach also
> should use the dma-buf resource. However, My idea is that a user
> process, that wants buffer synchronization with the other, sees a sync
> object as a file descriptor like dma-buf does. The below shows simple
> my idea about it:
>
> ioctl(dmabuf_fd, DMA_BUF_IOC_OPEN_SYNC, &sync);
>
> flock(sync->fd, LOCK_SH); <- LOCK_SH means a shared lock.
> CPU access for read
> flock(sync->fd, LOCK_UN);
>
> Or
>
> flock(sync->fd, LOCK_EX); <- LOCK_EX means an exclusive lock
> CPU access for write
> flock(sync->fd, LOCK_UN);
>
> close(sync->fd);
>
> As you know, that's similar to dmabuf export feature.
>
> In addition, a more simple idea,
> flock(dmabuf_fd, LOCK_SH/EX);
> CPU access for read/write
> flock(dmabuf_fd, LOCK_UN);
>
> However, I'm not sure that the above examples could be worked well,
> and there are no problems yet: actually, I don't fully understand
> flock mechanism, so looking into it.
>
>>
>>>
>>>         And to get back to my original point: if you have more than
>>>         one task
>>>         operating together on a buffer you absolutely need some kind
>>>         of real IPC
>>>         to sync them up and do something useful. Both you syncpoints
>>>         and the
>>>         proposed dma-fences only protect the buffer accesses to make
>>>         sure
>>>         different task don't stomp on each other. There is nothing in
>>>         there to
>>>         make sure that the output of your pipeline is valid. You have
>>>         to take
>>>         care of that yourself in userspace. I'll reuse your example to
>>>         make it
>>>         clear what I mean:
>>>
>>>         Task A                                         Task B
>>>         ------                                         -------
>>>         dma_buf_sync_lock(buf1)
>>>         CPU write buf1
>>>         dma_buf_sync_unlock(buf1)
>>>                   ---------schedule Task A again-------
>>>         dma_buf_sync_lock(buf1)
>>>         CPU write buf1
>>>         dma_buf_sync_unlock(buf1)
>>>                     ---------schedule Task B---------
>>>                                                        qbuf(buf1)
>>>
>>>         dma_buf_sync_lock(buf1)
>>>                                                        ....
>>>
>>>         This is what can happen if you don't take care of proper
>>>         syncing. Task A
>>>         writes something to the buffer in expectation that Task B will
>>>         take care
>>>         of it, but before Task B even gets scheduled Task A overwrites
>>>         the
>>>         buffer again. Not what you wanted, isn't it?
>>>
>>> Exactly wrong example. I had already mentioned about that. "In case
>>> that data flow goes from A to B, it needs some kind of IPC between the
>>> two tasks every time"  So again, your example would have no any
>>> problem in case that *two tasks share the same buffer but these tasks
>>> access the buffer(buf1) as write, and data of the buffer(buf1) isn't
>>> needed to be shared*.  They just need to use the buffer as *storage*.
>>> So All they want is to avoid stomping on the buffer in this case.
>>>
>> Sorry, but I don't see the point. If no one is interested in the data of
>> the buffer, why are you sharing it in the first place?
>>
>
> Just used as a storage. i.e., Task A fills the buffer with "AAAAAA"
> using CPU, And Task B fills the buffer with "BBBBBB" using DMA. They
> don't share data of the buffer, but they share *memory region* of the
> buffer. That would be very useful for the embedded systems with very
> small size system memory.

Just so i understand. You want to share backing memory, you don't want
to share content ie you want to do memory management in userspace.
This sounds wrong on so many level (not even considering the security
implication).

If Task A need memory and then can release it for Task B usage that
should be the role of kernel memory management which of course needs
synchronization btw A and B. But in no case this should be done using
dma-buf. dma-buf is for sharing content btw different devices not
sharing resources.


Also don't over complicate the vram case, just consider desktop gpu as
using system memory directly. They can do it and they do it. Migration
to vram is orthogonal to all this, it's an optimization so to speak.

Cheers,
Jerome
Inki Dae June 22, 2013, 1:36 a.m. UTC | #36
2013/6/22 Jerome Glisse <j.glisse@gmail.com>:
> On Fri, Jun 21, 2013 at 12:55 PM, Inki Dae <daeinki@gmail.com> wrote:
>> 2013/6/21 Lucas Stach <l.stach@pengutronix.de>:
>>> Hi Inki,
>>>
>>> please refrain from sending HTML Mails, it makes proper quoting without
>>> messing up the layout everywhere pretty hard.
>>>
>>
>> Sorry about that. I should have used text mode.
>>
>>> Am Freitag, den 21.06.2013, 20:01 +0900 schrieb Inki Dae:
>>> [...]
>>>
>>>> Yeah, you'll some knowledge and understanding about the API
>>>> you are
>>>> working with to get things right. But I think it's not an
>>>> unreasonable
>>>> thing to expect the programmer working directly with kernel
>>>> interfaces
>>>> to read up on how things work.
>>>>
>>>> Second thing: I'll rather have *one* consistent API for every
>>>> subsystem,
>>>> even if they differ from each other than having to implement
>>>> this
>>>> syncpoint thing in every subsystem. Remember: a single execbuf
>>>> in DRM
>>>> might reference both GEM objects backed by dma-buf as well
>>>> native SHM or
>>>> CMA backed objects. The dma-buf-mgr proposal already allows
>>>> you to
>>>> handle dma-bufs much the same way during validation than
>>>> native GEM
>>>> objects.
>>>>
>>>> Actually, at first I had implemented a fence helper framework based on
>>>> reservation and dma fence to provide easy-use-interface for device
>>>> drivers. However, that was wrong implemention: I had not only
>>>> customized the dma fence but also not considered dead lock issue.
>>>> After that, I have reimplemented it as dmabuf sync to solve dead
>>>> issue, and at that time, I realized that we first need to concentrate
>>>> on the most basic thing: the fact CPU and CPU, CPU and DMA, or DMA and
>>>> DMA can access a same buffer, And the fact simple is the best, and the
>>>> fact we need not only kernel side but also user side interfaces. After
>>>> that, I collected what is the common part for all subsystems, and I
>>>> have devised this dmabuf sync framework for it. I'm not really
>>>> specialist in Desktop world. So question. isn't the execbuf used only
>>>> for the GPU? the gpu has dedicated video memory(VRAM) so it needs
>>>> migration mechanism between system memory and the dedicated video
>>>> memory, and also to consider ordering issue while be migrated.
>>>>
>>>
>>> Yeah, execbuf is pretty GPU specific, but I don't see how this matters
>>> for this discussion. Also I don't see a big difference between embedded
>>> and desktop GPUs. Buffer migration is more of a detail here. Both take
>>> command stream that potentially reference other buffers, which might be
>>> native GEM or dma-buf backed objects. Both have to make sure the buffers
>>> are in the right domain (caches cleaned and address mappings set up) and
>>> are available for the desired operation, meaning you have to sync with
>>> other DMA engines and maybe also with CPU.
>>
>> Yeah, right. Then, in case of desktop gpu, does't it need additional
>> something to do when a buffer/s is/are migrated from system memory to
>> video memory domain, or from video memory to system memory domain? I
>> guess the below members does similar thing, and all other DMA devices
>> would not need them:
>> struct fence {
>> ...
>> unsigned int context, seqno;
>> ...
>> };
>>
>> And,
>> struct seqno_fence {
>> ...
>> uint32_t seqno_ofs;
>> ...
>> };
>>
>>>
>>> The only case where sync isn't clearly defined right now by the current
>>> API entrypoints is when you access memory through the dma-buf fallback
>>> mmap support, which might happen with some software processing element
>>> in a video pipeline or something. I agree that we will need a userspace
>>> interface here, but I think this shouldn't be yet another sync object,
>>> but rather more a prepare/fini_cpu_access ioctl on the dma-buf which
>>> hooks into the existing dma-fence and reservation stuff.
>>
>> I think we don't need addition ioctl commands for that. I am thinking
>> of using existing resources as possible. My idea also is similar in
>> using the reservation stuff to your idea because my approach also
>> should use the dma-buf resource. However, My idea is that a user
>> process, that wants buffer synchronization with the other, sees a sync
>> object as a file descriptor like dma-buf does. The below shows simple
>> my idea about it:
>>
>> ioctl(dmabuf_fd, DMA_BUF_IOC_OPEN_SYNC, &sync);
>>
>> flock(sync->fd, LOCK_SH); <- LOCK_SH means a shared lock.
>> CPU access for read
>> flock(sync->fd, LOCK_UN);
>>
>> Or
>>
>> flock(sync->fd, LOCK_EX); <- LOCK_EX means an exclusive lock
>> CPU access for write
>> flock(sync->fd, LOCK_UN);
>>
>> close(sync->fd);
>>
>> As you know, that's similar to dmabuf export feature.
>>
>> In addition, a more simple idea,
>> flock(dmabuf_fd, LOCK_SH/EX);
>> CPU access for read/write
>> flock(dmabuf_fd, LOCK_UN);
>>
>> However, I'm not sure that the above examples could be worked well,
>> and there are no problems yet: actually, I don't fully understand
>> flock mechanism, so looking into it.
>>
>>>
>>>>
>>>> And to get back to my original point: if you have more than
>>>> one task
>>>> operating together on a buffer you absolutely need some kind
>>>> of real IPC
>>>> to sync them up and do something useful. Both you syncpoints
>>>> and the
>>>> proposed dma-fences only protect the buffer accesses to make
>>>> sure
>>>> different task don't stomp on each other. There is nothing in
>>>> there to
>>>> make sure that the output of your pipeline is valid. You have
>>>> to take
>>>> care of that yourself in userspace. I'll reuse your example to
>>>> make it
>>>> clear what I mean:
>>>>
>>>> Task A Task B
>>>> ------ -------
>>>> dma_buf_sync_lock(buf1)
>>>> CPU write buf1
>>>> dma_buf_sync_unlock(buf1)
>>>> ---------schedule Task A again-------
>>>> dma_buf_sync_lock(buf1)
>>>> CPU write buf1
>>>> dma_buf_sync_unlock(buf1)
>>>> ---------schedule Task B---------
>>>> qbuf(buf1)
>>>>
>>>> dma_buf_sync_lock(buf1)
>>>> ....
>>>>
>>>> This is what can happen if you don't take care of proper
>>>> syncing. Task A
>>>> writes something to the buffer in expectation that Task B will
>>>> take care
>>>> of it, but before Task B even gets scheduled Task A overwrites
>>>> the
>>>> buffer again. Not what you wanted, isn't it?
>>>>
>>>> Exactly wrong example. I had already mentioned about that. "In case
>>>> that data flow goes from A to B, it needs some kind of IPC between the
>>>> two tasks every time" So again, your example would have no any
>>>> problem in case that *two tasks share the same buffer but these tasks
>>>> access the buffer(buf1) as write, and data of the buffer(buf1) isn't
>>>> needed to be shared*. They just need to use the buffer as *storage*.
>>>> So All they want is to avoid stomping on the buffer in this case.
>>>>
>>> Sorry, but I don't see the point. If no one is interested in the data of
>>> the buffer, why are you sharing it in the first place?
>>>
>>
>> Just used as a storage. i.e., Task A fills the buffer with "AAAAAA"
>> using CPU, And Task B fills the buffer with "BBBBBB" using DMA. They
>> don't share data of the buffer, but they share *memory region* of the
>> buffer. That would be very useful for the embedded systems with very
>> small size system memory.
>
> Just so i understand. You want to share backing memory, you don't want
> to share content ie you want to do memory management in userspace.
> This sounds wrong on so many level (not even considering the security
> implication).
>
> If Task A need memory and then can release it for Task B usage that
> should be the role of kernel memory management which of course needs
> synchronization btw A and B. But in no case this should be done using
> dma-buf. dma-buf is for sharing content btw different devices not
> sharing resources.
>

Just simply let's think of the case that a user process doesn't want for
anyone, other CPU or DMA, to access a shared buffer while he is accessing
the shared buffer. In this case, useful.

Thanks,
Inki Dae

>
> Also don't over complicate the vram case, just consider desktop gpu as
> using system memory directly. They can do it and they do it. Migration
> to vram is orthogonal to all this, it's an optimization so to speak.
>
> Cheers,
> Jerome.
Inki Dae June 25, 2013, 9:09 a.m. UTC | #37
2013/6/22 Jerome Glisse <j.glisse@gmail.com>:
> On Fri, Jun 21, 2013 at 12:55 PM, Inki Dae <daeinki@gmail.com> wrote:
>> 2013/6/21 Lucas Stach <l.stach@pengutronix.de>:
>>> Hi Inki,
>>>
>>> please refrain from sending HTML Mails, it makes proper quoting without
>>> messing up the layout everywhere pretty hard.
>>>
>>
>> Sorry about that. I should have used text mode.
>>
>>> Am Freitag, den 21.06.2013, 20:01 +0900 schrieb Inki Dae:
>>> [...]
>>>
>>>>         Yeah, you'll some knowledge and understanding about the API
>>>>         you are
>>>>         working with to get things right. But I think it's not an
>>>>         unreasonable
>>>>         thing to expect the programmer working directly with kernel
>>>>         interfaces
>>>>         to read up on how things work.
>>>>
>>>>         Second thing: I'll rather have *one* consistent API for every
>>>>         subsystem,
>>>>         even if they differ from each other than having to implement
>>>>         this
>>>>         syncpoint thing in every subsystem. Remember: a single execbuf
>>>>         in DRM
>>>>         might reference both GEM objects backed by dma-buf as well
>>>>         native SHM or
>>>>         CMA backed objects. The dma-buf-mgr proposal already allows
>>>>         you to
>>>>         handle dma-bufs much the same way during validation than
>>>>         native GEM
>>>>         objects.
>>>>
>>>> Actually, at first I had implemented a fence helper framework based on
>>>> reservation and dma fence to provide easy-use-interface for device
>>>> drivers. However, that was wrong implemention: I had not only
>>>> customized the dma fence but also not considered dead lock issue.
>>>> After that, I have reimplemented it as dmabuf sync to solve dead
>>>> issue, and at that time, I realized that we first need to concentrate
>>>> on the most basic thing: the fact CPU and CPU, CPU and DMA, or DMA and
>>>> DMA can access a same buffer, And the fact simple is the best, and the
>>>> fact we need not only kernel side but also user side interfaces. After
>>>> that, I collected what is the common part for all subsystems, and I
>>>> have devised this dmabuf sync framework for it. I'm not really
>>>> specialist in Desktop world. So question. isn't the execbuf used only
>>>> for the GPU? the gpu has dedicated video memory(VRAM) so it needs
>>>> migration mechanism between system memory and the dedicated video
>>>> memory, and also to consider ordering issue while be migrated.
>>>>
>>>
>>> Yeah, execbuf is pretty GPU specific, but I don't see how this matters
>>> for this discussion. Also I don't see a big difference between embedded
>>> and desktop GPUs. Buffer migration is more of a detail here. Both take
>>> command stream that potentially reference other buffers, which might be
>>> native GEM or dma-buf backed objects. Both have to make sure the buffers
>>> are in the right domain (caches cleaned and address mappings set up) and
>>> are available for the desired operation, meaning you have to sync with
>>> other DMA engines and maybe also with CPU.
>>
>> Yeah, right. Then, in case of desktop gpu, does't it need additional
>> something to do when a buffer/s is/are migrated from system memory to
>> video memory domain, or from video memory to system memory domain? I
>> guess the below members does similar thing, and all other DMA devices
>> would not need them:
>>         struct fence {
>>                   ...
>>                   unsigned int context, seqno;
>>                   ...
>>         };
>>
>> And,
>>        struct seqno_fence {
>>                  ...
>>                  uint32_t seqno_ofs;
>>                  ...
>>        };
>>
>>>
>>> The only case where sync isn't clearly defined right now by the current
>>> API entrypoints is when you access memory through the dma-buf fallback
>>> mmap support, which might happen with some software processing element
>>> in a video pipeline or something. I agree that we will need a userspace
>>> interface here, but I think this shouldn't be yet another sync object,
>>> but rather more a prepare/fini_cpu_access ioctl on the dma-buf which
>>> hooks into the existing dma-fence and reservation stuff.
>>
>> I think we don't need addition ioctl commands for that. I am thinking
>> of using existing resources as possible. My idea also is similar in
>> using the reservation stuff to your idea because my approach also
>> should use the dma-buf resource. However, My idea is that a user
>> process, that wants buffer synchronization with the other, sees a sync
>> object as a file descriptor like dma-buf does. The below shows simple
>> my idea about it:
>>
>> ioctl(dmabuf_fd, DMA_BUF_IOC_OPEN_SYNC, &sync);
>>
>> flock(sync->fd, LOCK_SH); <- LOCK_SH means a shared lock.
>> CPU access for read
>> flock(sync->fd, LOCK_UN);
>>
>> Or
>>
>> flock(sync->fd, LOCK_EX); <- LOCK_EX means an exclusive lock
>> CPU access for write
>> flock(sync->fd, LOCK_UN);
>>
>> close(sync->fd);
>>
>> As you know, that's similar to dmabuf export feature.
>>
>> In addition, a more simple idea,
>> flock(dmabuf_fd, LOCK_SH/EX);
>> CPU access for read/write
>> flock(dmabuf_fd, LOCK_UN);
>>
>> However, I'm not sure that the above examples could be worked well,
>> and there are no problems yet: actually, I don't fully understand
>> flock mechanism, so looking into it.
>>
>>>
>>>>
>>>>         And to get back to my original point: if you have more than
>>>>         one task
>>>>         operating together on a buffer you absolutely need some kind
>>>>         of real IPC
>>>>         to sync them up and do something useful. Both you syncpoints
>>>>         and the
>>>>         proposed dma-fences only protect the buffer accesses to make
>>>>         sure
>>>>         different task don't stomp on each other. There is nothing in
>>>>         there to
>>>>         make sure that the output of your pipeline is valid. You have
>>>>         to take
>>>>         care of that yourself in userspace. I'll reuse your example to
>>>>         make it
>>>>         clear what I mean:
>>>>
>>>>         Task A                                         Task B
>>>>         ------                                         -------
>>>>         dma_buf_sync_lock(buf1)
>>>>         CPU write buf1
>>>>         dma_buf_sync_unlock(buf1)
>>>>                   ---------schedule Task A again-------
>>>>         dma_buf_sync_lock(buf1)
>>>>         CPU write buf1
>>>>         dma_buf_sync_unlock(buf1)
>>>>                     ---------schedule Task B---------
>>>>                                                        qbuf(buf1)
>>>>
>>>>         dma_buf_sync_lock(buf1)
>>>>                                                        ....
>>>>
>>>>         This is what can happen if you don't take care of proper
>>>>         syncing. Task A
>>>>         writes something to the buffer in expectation that Task B will
>>>>         take care
>>>>         of it, but before Task B even gets scheduled Task A overwrites
>>>>         the
>>>>         buffer again. Not what you wanted, isn't it?
>>>>
>>>> Exactly wrong example. I had already mentioned about that. "In case
>>>> that data flow goes from A to B, it needs some kind of IPC between the
>>>> two tasks every time"  So again, your example would have no any
>>>> problem in case that *two tasks share the same buffer but these tasks
>>>> access the buffer(buf1) as write, and data of the buffer(buf1) isn't
>>>> needed to be shared*.  They just need to use the buffer as *storage*.
>>>> So All they want is to avoid stomping on the buffer in this case.
>>>>
>>> Sorry, but I don't see the point. If no one is interested in the data of
>>> the buffer, why are you sharing it in the first place?
>>>
>>
>> Just used as a storage. i.e., Task A fills the buffer with "AAAAAA"
>> using CPU, And Task B fills the buffer with "BBBBBB" using DMA. They
>> don't share data of the buffer, but they share *memory region* of the
>> buffer. That would be very useful for the embedded systems with very
>> small size system memory.
>
> Just so i understand. You want to share backing memory, you don't want
> to share content ie you want to do memory management in userspace.
> This sounds wrong on so many level (not even considering the security
> implication).
>
> If Task A need memory and then can release it for Task B usage

Not true. Task A can never release memory because All task A can do is to
unreference dma buf object of sync object. And please know that user
interfaces hasn't been implemented yet, and we just have a plan for it as I
already mentioned.

> that
> should be the role of kernel memory management which of course needs
> synchronization btw A and B. But in no case this should be done using
> dma-buf. dma-buf is for sharing content btw different devices not
> sharing resources.
>

hmm, is that true? And are you sure? Then how do you think about
reservation? the reservation also uses dma-buf with same reason as long as
I know: actually, we use reservation to use dma-buf. As you may know, a
reservation object is allocated and initialized when a buffer object is
exported to a dma buf.

Thanks,
Inki Dae

>
> Also don't over complicate the vram case, just consider desktop gpu as
> using system memory directly. They can do it and they do it. Migration
> to vram is orthogonal to all this, it's an optimization so to speak.
>
> Cheers,
> Jerome
Daniel Vetter June 25, 2013, 9:23 a.m. UTC | #38
On Tue, Jun 18, 2013 at 12:46 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
>> > Note: the existing stuff does have the nice side effect of being able
>> > to pass buffers which do not have a struct page * associated with them
>> > through the dma_buf API - I think we can still preserve that by having
>> > dma_buf provide a couple of new APIs to do the SG list map/sync/unmap,
>> > but in any case we need to fix the existing API so that:
>> >
>> > dma_buf_map_attachment() becomes dma_buf_get_sg()
>> > dma_buf_unmap_attachment() becomes dma_buf_put_sg()
>> >
>> > both getting rid of the DMA direction argument, and then we have four
>> > new dma_buf calls:
>> >
>> > dma_buf_map_sg()
>> > dma_buf_unmap_sg()
>> > dma_buf_sync_sg_for_cpu()
>> > dma_buf_sync_sg_for_device()
>> >
>> > which do the actual sg map/unmap via the DMA API *at the appropriate
>> > time for DMA*.
>>
>> Hm, my idea was to just add a dma_buf_sync_attchment for the device side
>> syncing, since the cpu access stuff is already bracketed with the
>> begin/end cpu access stuff. We might need a sync_for_cpu or so for mmap,
>> but imo mmap support for dma_buf is a bit insane anyway, so I don't care
>> too much about it.
>>
>> Since such dma mappings would be really longstanding in most cases anyway
>> drivers could just map with BIDIRECTIONAL and do all the real flushing
>> with the new sync stuff.
>
> Note that the DMA API debug doesn't allow you to change the direction
> argument on an existing mapping (neither should it, again this is
> documented in the DMA API stuff in Documentation/).  This is where you
> would need the complete set of four functions I mention above which
> reflect the functionality of the DMA API.

[Been travelling a bit, hence the delay.]

Just a quick question on your assertion that we need all four
functions: Since we already have begin/end_cpu_access functions
(intention here was to allow the dma_buf exporter to ensure the memory
is pinned, e.g. for swapable gem objects, but also allow cpu cache
flushing if required) do we still need the sync_sg_for_cpu? At least
with i915 as the exporter we currently hide the cflushing behind our
begin_cpu_access callback. For device dma we currently punt on it due
to lack of the dma_buf_sync_sg_for_device interface.

Aside: I know that i915 doing the clflushing dance itself is a bit
ugly, but thus far we've been the only guys on the x86 block with
non-coherent dma. But it sounds like a bunch of other blocks on Atom
SoCs have similar needs, so I guess it would make sense to move that
into the dma layer.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
Rob Clark June 25, 2013, 11:32 a.m. UTC | #39
On Tue, Jun 25, 2013 at 5:09 AM, Inki Dae <daeinki@gmail.com> wrote:
>> that
>> should be the role of kernel memory management which of course needs
>> synchronization btw A and B. But in no case this should be done using
>> dma-buf. dma-buf is for sharing content btw different devices not
>> sharing resources.
>>
>
> hmm, is that true? And are you sure? Then how do you think about
> reservation? the reservation also uses dma-buf with same reason as long as I
> know: actually, we use reservation to use dma-buf. As you may know, a
> reservation object is allocated and initialized when a buffer object is
> exported to a dma buf.

no, this is why the reservation object can be passed in when you
construction the dmabuf.  The fallback is for dmabuf to create it's
own, for compatibility and to make life easier for simple devices with
few buffers... but I think pretty much all drm drivers would embed the
reservation object in the gem buffer and pass it in when the dmabuf is
created.

It is pretty much imperative that synchronization works independently
of dmabuf, you really don't want to have two different cases to deal
with in your driver, one for synchronizing non-exported objects, and
one for synchronizing dmabuf objects.

BR,
-R
Inki Dae June 25, 2013, 2:17 p.m. UTC | #40
2013/6/25 Rob Clark <robdclark@gmail.com>:
> On Tue, Jun 25, 2013 at 5:09 AM, Inki Dae <daeinki@gmail.com> wrote:
>>> that
>>> should be the role of kernel memory management which of course needs
>>> synchronization btw A and B. But in no case this should be done using
>>> dma-buf. dma-buf is for sharing content btw different devices not
>>> sharing resources.
>>>
>>
>> hmm, is that true? And are you sure? Then how do you think about
>> reservation? the reservation also uses dma-buf with same reason as long as I
>> know: actually, we use reservation to use dma-buf. As you may know, a
>> reservation object is allocated and initialized when a buffer object is
>> exported to a dma buf.
>
> no, this is why the reservation object can be passed in when you
> construction the dmabuf.

Right, that way, we could use dma buf for buffer synchronization. I
just wanted to ask for why Jerome said that "dma-buf is for sharing
content btw different devices not sharing resources".

> The fallback is for dmabuf to create it's
> own, for compatibility and to make life easier for simple devices with
> few buffers... but I think pretty much all drm drivers would embed the
> reservation object in the gem buffer and pass it in when the dmabuf is
> created.
>
> It is pretty much imperative that synchronization works independently
> of dmabuf, you really don't want to have two different cases to deal
> with in your driver, one for synchronizing non-exported objects, and
> one for synchronizing dmabuf objects.
>

Now my approach is concentrating on the most basic implementation,
buffer synchronization mechanism between CPU and CPU, CPU and DMA, and
DMA and DMA.  But I think reserveration could be used for other
purposes such as pipe line synchronization independently of dmabuf as
you said. Actually, I had already implemented pipe line
synchronization mechanism using the reservation: in case of MALI-400
DDK, there was pipe line issue between gp and pp jobs, and we had
solved the issue using the pipe line synchronization mechanism with
the reservation. So, we could add more features anytime; those two
different cases, dmabuf objects and non-exported objects, if needed
because we are using the reservation object.

Thanks,
Inki Dae

> BR,
> -R
Jerome Glisse June 25, 2013, 2:49 p.m. UTC | #41
On Tue, Jun 25, 2013 at 10:17 AM, Inki Dae <daeinki@gmail.com> wrote:
> 2013/6/25 Rob Clark <robdclark@gmail.com>:
>> On Tue, Jun 25, 2013 at 5:09 AM, Inki Dae <daeinki@gmail.com> wrote:
>>>> that
>>>> should be the role of kernel memory management which of course needs
>>>> synchronization btw A and B. But in no case this should be done using
>>>> dma-buf. dma-buf is for sharing content btw different devices not
>>>> sharing resources.
>>>>
>>>
>>> hmm, is that true? And are you sure? Then how do you think about
>>> reservation? the reservation also uses dma-buf with same reason as long as I
>>> know: actually, we use reservation to use dma-buf. As you may know, a
>>> reservation object is allocated and initialized when a buffer object is
>>> exported to a dma buf.
>>
>> no, this is why the reservation object can be passed in when you
>> construction the dmabuf.
>
> Right, that way, we could use dma buf for buffer synchronization. I
> just wanted to ask for why Jerome said that "dma-buf is for sharing
> content btw different devices not sharing resources".

From memory, the motivation of dma-buf was to done for few use case,
among them webcam capturing frame into a buffer and having gpu using
it directly without memcpy, or one big gpu rendering a scene into a
buffer that is then use by low power gpu for display ie it was done to
allow different device to operate on same data using same backing
memory.

AFAICT you seem to want to use dma-buf to create scratch buffer, ie a
process needs to use X amount of memory for an operation, it can
release|free this memory once its done and a process B can the use
this X memory for its own operation discarding content of process A. I
presume that next frame would have the sequence repeat, process A do
something, then process B does its thing. So to me it sounds like you
want to implement global scratch buffer using the dmabuf API and that
sounds bad to me.

I know most closed driver have several pool of memory, long lived
object, short lived object and scratch space, then user space allocate
from one of this pool and there is synchronization done by driver
using driver specific API to reclaim memory. Of course this work
nicely if you only talking about one logic block or at very least hw
that have one memory controller.

Now if you are thinking of doing scratch buffer for several different
device and share the memory among then you need to be aware of
security implication, most obvious being that you don't want process B
being able to read process A scratch memory. I know the argument about
it being graphic but one day this might become gpu code and it might
be able to insert jump to malicious gpu code.

Cheers,
Jerome
Inki Dae June 26, 2013, 4:06 p.m. UTC | #42
2013/6/25 Jerome Glisse <j.glisse@gmail.com>:
> On Tue, Jun 25, 2013 at 10:17 AM, Inki Dae <daeinki@gmail.com> wrote:
>> 2013/6/25 Rob Clark <robdclark@gmail.com>:
>>> On Tue, Jun 25, 2013 at 5:09 AM, Inki Dae <daeinki@gmail.com> wrote:
>>>>> that
>>>>> should be the role of kernel memory management which of course needs
>>>>> synchronization btw A and B. But in no case this should be done using
>>>>> dma-buf. dma-buf is for sharing content btw different devices not
>>>>> sharing resources.
>>>>>
>>>>
>>>> hmm, is that true? And are you sure? Then how do you think about
>>>> reservation? the reservation also uses dma-buf with same reason as long
>>>> as I
>>>> know: actually, we use reservation to use dma-buf. As you may know, a
>>>> reservation object is allocated and initialized when a buffer object is
>>>> exported to a dma buf.
>>>
>>> no, this is why the reservation object can be passed in when you
>>> construction the dmabuf.
>>
>> Right, that way, we could use dma buf for buffer synchronization. I
>> just wanted to ask for why Jerome said that "dma-buf is for sharing
>> content btw different devices not sharing resources".
>
> From memory, the motivation of dma-buf was to done for few use case,
> among them webcam capturing frame into a buffer and having gpu using
> it directly without memcpy, or one big gpu rendering a scene into a
> buffer that is then use by low power gpu for display ie it was done to
> allow different device to operate on same data using same backing
> memory.
>
> AFAICT you seem to want to use dma-buf to create scratch buffer, ie a
> process needs to use X amount of memory for an operation, it can
> release|free this memory once its done
> and a process B can the use
> this X memory for its own operation discarding content of process A.
> presume that next frame would have the sequence repeat, process A do
> something, then process B does its thing.
> So to me it sounds like you
> want to implement global scratch buffer using the dmabuf API and that
> sounds bad to me.
>
> I know most closed driver have several pool of memory, long lived
> object, short lived object and scratch space, then user space allocate
> from one of this pool and there is synchronization done by driver
> using driver specific API to reclaim memory.
> Of course this work
> nicely if you only talking about one logic block or at very least hw
> that have one memory controller.
>
> Now if you are thinking of doing scratch buffer for several different
> device and share the memory among then you need to be aware of
> security implication, most obvious being that you don't want process B
> being able to read process A scratch memory.
> I know the argument about
> it being graphic but one day this might become gpu code and it might
> be able to insert jump to malicious gpu code.
>

If you think so, it seems like that there is *definitely* your
misunderstanding. My approach is similar to dma fence: it guarantees
that a DMA cannot access a buffer while other DMA is accessing the
buffer. I guess now some gpu drivers in mainline have been using
specific mechanism for it. And when it comes to the portion you
commented, please know that I just introduced user side mechanism for
buffer sychronization between CPU and CPU, and CPU and DMA in
addition; not implemented but just planned.

Thanks,
Inki Dae

> Cheers,
> Jerome
Russell King - ARM Linux June 26, 2013, 5:18 p.m. UTC | #43
On Tue, Jun 25, 2013 at 11:23:21AM +0200, Daniel Vetter wrote:
> Just a quick question on your assertion that we need all four
> functions: Since we already have begin/end_cpu_access functions
> (intention here was to allow the dma_buf exporter to ensure the memory
> is pinned, e.g. for swapable gem objects, but also allow cpu cache
> flushing if required) do we still need the sync_sg_for_cpu?

Possibly not, but let's first look at that kind of scenario:

- It attaches to a dma_buf using dma_buf_attach()
- it maps the DMA buffer using dma_buf_map_attachment().  This does
  dma_map_sg().  This possibly incurs a cache flush depending on the
  coherence properties of the architecture.
- it then calls dma_buf_begin_cpu_access().  This presumably does a
  dma_sync_sg_for_cpu().  This possibly incurs another cache flush
  depending on the coherence properties of the architecture.
- then, presumably, it calls dma_buf_kmap_atomic() or dma_buf_kmap()
  to gain a pointer to the buffer.
- at some point, dma_buf_kunmap_atomic() or dma_buf_kunmap() gets
  called.  On certain cache architectures, kunmap_atomic() results in
  a cache flush.
- dma_buf_end_cpu_access() gets called, calling through presumably to
  dma_sync_sg_for_device().  Possibly incurs a cache flush.
- dma_buf_unmap_attachment()... another cache flush.

Out of those all of those five cache flushes, the only cache flush which is
really necessary out of all those would be the one in kunmap_atomic().  The
rest of them are completely irrelevant to the CPU access provided that there
is no DMA access by the device.  What's more is that you can't say "well,
the architecture should optimize them!" to which I'd respond "how does the
architecture know at dma_map_sg() time that there isn't going to be a
DMA access?"

Now, if we say that it's not necessary to call dma_buf_map_attachment()..
dma_buf_unmap_attachment() around the calls to dma_buf_begin_cpu_access()..
dma_buf_end_cpu_access(), then how does dma_buf_begin_cpu_access() know
whether the DMA buffer has been dma_map_sg()'d?  It's invalid to call
dma_sync_sg_for_cpu() on a buffer which has not been dma_map_sg()'d.
Bear in mind that dma_buf_begin_cpu_access() is passed only the
struct dma_buf and not the struct dma_buf_attachment *attach, there's
no hope for the begin_cpu_access() method to know which user of the
dma_buf is asking for this call, so you can't track any state there.

Thank you for pointing this out, I'm less impressed with this dma_buf
API now that I've looked deeper at those two interfaces, and even more
convinced it needs to be redesigned! :P
Inki Dae Aug. 12, 2013, 9:55 a.m. UTC | #44
Hello all,


The purpose of this email is to get other opinions and advices to buffer synchronization mechanism, and coupling cache operation feature with the buffer synchronization mechanism. First of all, I am not a native English speaker so I'm not sure that I can convey my intention to you. And I'm not a specialist in Linux than other people so also there might be my missing points. 

I had posted the buffer synchronization mechanism called dmabuf sync framework like below,
http://lists.infradead.org/pipermail/linux-arm-kernel/2013-June/177045.html

And I'm sending this email before posting next version with more stable patch set and features. The purpose of this framework is to provide not only buffer access control to CPU and DMA but also easy-to-use interfaces for device drivers and user application. This framework can be used for all DMA devices using system memory as DMA buffer, especially for most ARM based SoCs.
    
There are two cases we are using this buffer synchronization framework. One is to primarily enhance GPU rendering performance on Tizen platform in case of 3d app with compositing mode that the 3d app draws something in off-screen buffer.
    
And other is to couple buffer access control and cache operation between CPU and DMA; the cache operation is done by the dmabuf sync framework in kernel side.

    
Why do we need buffer access control between CPU and DMA?
---------------------------------------------------------
    
The below shows simple 3D software layers,
    
                    3D Application
      -----------------------------------------
        Standard OpenGL ES and EGL Interfaces  ------- [A]
      -----------------------------------------
      MALI OpenGL ES and EGL Native modules --------- [B]
      -----------------------------------------
         Exynos DRM Driver    |    GPU Driver ---------- [C]
    
3d application requests 3d rendering through A. And then B fills a 3d command buffer with the requests from A. And then 3D application calls glFlush so that the 3d command buffer can be submitted to C, and rendered by GPU hardware. Internally, glFlush(also glFinish) will call a function of native module[B] glFinish blocks caller's task until all GL execution is complete. On the other hand, glFlush forces execution of GL commands but doesn't block the caller's task until the completion.
    
In composting mode, in case of using glFinish, the 3d rendering performance is quite lower than using glFlush because glFinish makes CPU blocked until the execution of the 3d commands is completed. However, the use of glFlush has one issue that the shared buffer with GPU could be broken when CPU accesses the shared buffer at once after glFlush because CPU cannot be aware of the completion of GPU access to the shared buffer: actually, Tizen platform internally calls only eglSwapBuffer instead of glFlush and glFinish, and whether flushing or finishing is decided according to compositing mode or not. So in such case, we would need buffer access control between CPU and DMA for more performance.

    
About cache operation
---------------------
    
The dmabuf sync framework can include cache operation feature, and the below shows how the cache operation based on dmabuf sync framework is performed,
   device driver in kernel or fctrl in user land
          dmabuf_sync_lock or dmabuf_sync_single_lock
               check before and after buffer access
                  dma_buf_begin_cpu_access or dma_buf_end_cpu_access
                         begin_cpu_access or end_cpu_access of exporter
                                dma_sync_sg_for_device or dma_sync_sg_for_cpu
    
In case that using dmabuf sync framework, kernel can be aware of when CPU and DMA access to a shared buffer is completed so we can do cache operation in kernel so that way, we can couple buffer access control and cache operation. So with this, we can avoid that user land overuses cache operation.
    
I guess most Linux based platforms are using cachable mapped buffer for more performance: in case that CPU frequently accesses the shared buffer which is shared with DMA, the use of cachable mapped buffer is more fast than the use of non-cachable. However, this way could make cache operation overused by user land because only user land can be aware of the completion of CPU or DMA access to the shared buffer so user land could request cache operations every time it wants even the cache operation is unnecessary. That is how user land could overuse cache operations.
    
To Android, Chrome OS, and other forks,

Are there other cases that buffer access control between CPU and DMA is needed? I know Android sync driver and KDS are already being used for Android, Chrome OS, and so on.
How does your platform do cache operation? And How do you think about coupling buffer access control and cache operation between CPU and DMA?.

Lastly, I think we may need Linux generic buffer synchronization mechanism that uses only Linux standard interfaces (dmabuf) including user land interfaces (fcntl and select system calls), and the dmabuf sync framework could meet it.


Thanks,
Inki Dae
diff mbox

Patch

diff --git a/Documentation/dma-buf-sync.txt b/Documentation/dma-buf-sync.txt
new file mode 100644
index 0000000..e71b6f4
--- /dev/null
+++ b/Documentation/dma-buf-sync.txt
@@ -0,0 +1,246 @@ 
+                    DMA Buffer Synchronization Framework
+                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+                                  Inki Dae
+                      <inki dot dae at samsung dot com>
+                          <daeinki at gmail dot com>
+
+This document is a guide for device-driver writers describing the DMA buffer
+synchronization API. This document also describes how to use the API to
+use buffer synchronization between CPU and CPU, CPU and DMA, and DMA and DMA.
+
+The DMA Buffer synchronization API provides buffer synchronization mechanism;
+i.e., buffer access control to CPU and DMA, cache operations, and easy-to-use
+interfaces for device drivers and potentially user application
+(not implemented for user applications, yet). And this API can be used for all
+dma devices using system memory as dma buffer, especially for most ARM based
+SoCs.
+
+
+Motivation
+----------
+
+Sharing a buffer, a device cannot be aware of when the other device will access
+the shared buffer: a device may access a buffer containing wrong data if
+the device accesses the shared buffer while another device is still accessing
+the shared buffer. Therefore, a user process should have waited for
+the completion of DMA access by another device before a device tries to access
+the shared buffer.
+
+Besides, there is the same issue when CPU and DMA are sharing a buffer; i.e.,
+a user process should consider that when the user process have to send a buffer
+to a device driver for the device driver to access the buffer as input.
+This means that a user process needs to understand how the device driver is
+worked. Hence, the conventional mechanism not only makes user application
+complicated but also incurs performance overhead because the conventional
+mechanism cannot control devices precisely without additional and complex
+implemantations.
+
+In addition, in case of ARM based SoCs, most devices have no hardware cache
+consistency mechanisms between CPU and DMA devices because they do not use ACP
+(Accelerator Coherency Port). ACP can be connected to DMA engine or similar
+devices in order to keep cache coherency between CPU cache and DMA device.
+Thus, we need additional cache operations to have the devices operate properly;
+i.e., user applications should request cache operations to kernel before DMA
+accesses the buffer and after the completion of buffer access by CPU, or vise
+versa.
+
+	buffer access by CPU -> cache clean -> buffer access by DMA
+
+Or,
+	buffer access by DMA -> cache invalidate -> buffer access by CPU
+
+The below shows why cache operations should be requested by user
+process,
+    (Presume that CPU and DMA share a buffer and the buffer is mapped
+     with user space as cachable)
+
+	handle = drm_gem_alloc(size);
+	...
+	va1 = drm_gem_mmap(handle1);
+	va2 = malloc(size);
+	...
+
+	while(conditions) {
+		memcpy(va1, some data, size);
+		...
+		drm_xxx_set_dma_buffer(handle, ...);
+		...
+
+		/* user need to request cache clean at here. */
+
+		/* blocked until dma operation is completed. */
+		drm_xxx_start_dma(...);
+		...
+
+		/* user need to request cache invalidate at here. */
+
+		memcpy(va2, va1, size);
+	}
+
+The issue arises: user processes may incur cache operations: user processes may
+request unnecessary cache operations to kernel. Besides, kernel cannot prevent
+user processes from requesting such cache operations. Therefore, we need to
+prevent such excessive and unnecessary cache operations from user processes.
+
+
+Basic concept
+-------------
+
+The mechanism of this framework has the following steps,
+    1. Register dmabufs to a sync object - A task gets a new sync object and
+    can add one or more dmabufs that the task wants to access.
+    This registering should be performed when a device context or an event
+    context such as a page flip event is created or before CPU accesses a shared
+    buffer.
+
+	dma_buf_sync_get(a sync object, a dmabuf);
+
+    2. Lock a sync object - A task tries to lock all dmabufs added in its own
+    sync object. Basically, the lock mechanism uses ww-mutex[1] to avoid dead
+    lock issue and for race condition between CPU and CPU, CPU and DMA, and DMA
+    and DMA. Taking a lock means that others cannot access all locked dmabufs
+    until the task that locked the corresponding dmabufs, unlocks all the locked
+    dmabufs.
+    This locking should be performed before DMA or CPU accesses these dmabufs.
+
+	dma_buf_sync_lock(a sync object);
+
+    3. Unlock a sync object - The task unlocks all dmabufs added in its own sync
+    object. The unlock means that the DMA or CPU accesses to the dmabufs have
+    been completed so that others may access them.
+    This unlocking should be performed after DMA or CPU has completed accesses
+    to the dmabufs.
+
+	dma_buf_sync_unlock(a sync object);
+
+    4. Unregister one or all dmabufs from a sync object - A task unregisters
+    the given dmabufs from the sync object. This means that the task dosen't
+    want to lock the dmabufs.
+    The unregistering should be performed after DMA or CPU has completed
+    accesses to the dmabufs or when dma_buf_sync_lock() is failed.
+
+	dma_buf_sync_put(a sync object, a dmabuf);
+	dma_buf_sync_put_all(a sync object);
+
+    The described steps may be summarized as:
+	get -> lock -> CPU or DMA access to a buffer/s -> unlock -> put
+
+This framework includes the following two features.
+    1. read (shared) and write (exclusive) locks - A task is required to declare
+    the access type when the task tries to register a dmabuf;
+    READ, WRITE, READ DMA, or WRITE DMA.
+
+    The below is example codes,
+	struct dmabuf_sync *sync;
+
+	sync = dmabuf_sync_init(NULL, "test sync");
+
+	dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
+	...
+
+    2. Mandatory resource releasing - a task cannot hold a lock indefinitely.
+    A task may never try to unlock a buffer after taking a lock to the buffer.
+    In this case, a timer handler to the corresponding sync object is called
+    in five (default) seconds and then the timed-out buffer is unlocked by work
+    queue handler to avoid lockups and to enforce resources of the buffer.
+
+
+Access types
+------------
+
+DMA_BUF_ACCESS_READ - CPU will access a buffer for read.
+DMA_BUF_ACCESS_WRITE - CPU will access a buffer for read or write.
+DMA_BUF_ACCESS_READ | DMA_BUF_ACCESS_DMA - DMA will access a buffer for read
+DMA_BUF_ACCESS_WRITE | DMA_BUF_ACCESS_DMA - DMA will access a buffer for read
+					    or write.
+
+
+API set
+-------
+
+bool is_dmabuf_sync_supported(void)
+	- Check if dmabuf sync is supported or not.
+
+struct dmabuf_sync *dmabuf_sync_init(void *priv, const char *name)
+	- Allocate and initialize a new sync object. The caller can get a new
+	sync object for buffer synchronization. priv is used to set caller's
+	private data and name is the name of sync object.
+
+void dmabuf_sync_fini(struct dmabuf_sync *sync)
+	- Release all resources to the sync object.
+
+int dmabuf_sync_get(struct dmabuf_sync *sync, void *sync_buf,
+			unsigned int type)
+	- Add a dmabuf to a sync object. The caller can group multiple dmabufs
+	by calling this function several times. Internally, this function also
+	takes a reference to a dmabuf.
+
+void dmabuf_sync_put(struct dmabuf_sync *sync, struct dma_buf *dmabuf)
+	- Remove a given dmabuf from a sync object. Internally, this function
+	also release every reference to the given dmabuf.
+
+void dmabuf_sync_put_all(struct dmabuf_sync *sync)
+	- Remove all dmabufs added in a sync object. Internally, this function
+	also release every reference to the dmabufs of the sync object.
+
+int dmabuf_sync_lock(struct dmabuf_sync *sync)
+	- Lock all dmabufs added in a sync object. The caller should call this
+	function prior to CPU or DMA access to the dmabufs so that others can
+	not access the dmabufs. Internally, this function avoids dead lock
+	issue with ww-mutex.
+
+int dmabuf_sync_unlock(struct dmabuf_sync *sync)
+	- Unlock all dmabufs added in a sync object. The caller should call
+	this function after CPU or DMA access to the dmabufs is completed so
+	that others can access the dmabufs.
+
+
+Tutorial
+--------
+
+1. Allocate and Initialize a sync object:
+	struct dmabuf_sync *sync;
+
+	sync = dmabuf_sync_init(NULL, "test sync");
+	...
+
+2. Add a dmabuf to the sync object when setting up dma buffer relevant registers:
+	dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
+	...
+
+3. Lock all dmabufs of the sync object before DMA or CPU accesses the dmabufs:
+	dmabuf_sync_lock(sync);
+	...
+
+4. Now CPU or DMA can access all dmabufs locked in step 3.
+
+5. Unlock all dmabufs added in a sync object after DMA or CPU access to these
+   dmabufs is completed:
+	dmabuf_sync_unlock(sync);
+
+   And call the following functions to release all resources,
+	dmabuf_sync_put_all(sync);
+	dmabuf_sync_fini(sync);
+
+
+Cache operation
+---------------
+
+The framework performs cache operation based on the previous and current access
+types to the dmabufs after the locks to all dmabufs are taken:
+	Call dma_buf_begin_cpu_access() to invalidate cache if,
+		previous access type is DMA_BUF_ACCESS_WRITE | DMA and
+		current access type is DMA_BUF_ACCESS_READ
+
+	Call dma_buf_end_cpu_access() to clean cache if,
+		previous access type is DMA_BUF_ACCESS_WRITE and
+		current access type is DMA_BUF_ACCESS_READ | DMA
+
+Such cache operations are invoked via dma-buf interfaces. Thus, the dma buf
+exporter should implement dmabuf->ops->begin_cpu_access and end_cpu_access
+callbacks.
+
+
+References:
+[1] https://patchwork-mail1.kernel.org/patch/2625321/
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 5ccf182..54a1d5a 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -212,6 +212,13 @@  config FENCE_TRACE
 	  lockup related problems for dma-buffers shared across multiple
 	  devices.
 
+config DMABUF_SYNC
+	bool "DMABUF Synchronization Framework"
+	depends on DMA_SHARED_BUFFER
+	help
+	  This option enables dmabuf sync framework for buffer synchronization between
+	  DMA and DMA, CPU and DMA, and CPU and CPU.
+
 config CMA
 	bool "Contiguous Memory Allocator"
 	depends on HAVE_DMA_CONTIGUOUS && HAVE_MEMBLOCK
diff --git a/drivers/base/Makefile b/drivers/base/Makefile
index 8a55cb9..599f6c90 100644
--- a/drivers/base/Makefile
+++ b/drivers/base/Makefile
@@ -11,6 +11,7 @@  obj-y			+= power/
 obj-$(CONFIG_HAS_DMA)	+= dma-mapping.o
 obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
 obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf.o fence.o reservation.o
+obj-$(CONFIG_DMABUF_SYNC) += dmabuf-sync.o
 obj-$(CONFIG_ISA)	+= isa.o
 obj-$(CONFIG_FW_LOADER)	+= firmware_class.o
 obj-$(CONFIG_NUMA)	+= node.o
diff --git a/drivers/base/dmabuf-sync.c b/drivers/base/dmabuf-sync.c
new file mode 100644
index 0000000..eeb13ca
--- /dev/null
+++ b/drivers/base/dmabuf-sync.c
@@ -0,0 +1,545 @@ 
+/*
+ * Copyright (C) 2013 Samsung Electronics Co.Ltd
+ * Authors:
+ *	Inki Dae <inki.dae@samsung.com>
+ *
+ * This program is free software; you can redistribute  it and/or modify it
+ * under  the terms of  the GNU General  Public License as published by the
+ * Free Software Foundation;  either version 2 of the  License, or (at your
+ * option) any later version.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/debugfs.h>
+#include <linux/uaccess.h>
+
+#include <linux/dmabuf-sync.h>
+
+#define MAX_SYNC_TIMEOUT	5 /* Second. */
+
+#define NEED_BEGIN_CPU_ACCESS(old, new)	\
+			(old->accessed_type == DMA_BUF_ACCESS_DMA_W && \
+			 (new->access_type == DMA_BUF_ACCESS_R))
+
+#define NEED_END_CPU_ACCESS(old, new)	\
+			(old->accessed_type == DMA_BUF_ACCESS_W && \
+			 (new->access_type & DMA_BUF_ACCESS_DMA))
+
+int dmabuf_sync_enabled = 1;
+
+MODULE_PARM_DESC(enabled, "Check if dmabuf sync is supported or not");
+module_param_named(enabled, dmabuf_sync_enabled, int, 0444);
+
+static void dmabuf_sync_timeout_worker(struct work_struct *work)
+{
+	struct dmabuf_sync *sync = container_of(work, struct dmabuf_sync, work);
+	struct dmabuf_sync_object *sobj;
+
+	mutex_lock(&sync->lock);
+
+	list_for_each_entry(sobj, &sync->syncs, head) {
+		if (WARN_ON(!sobj->robj))
+			continue;
+
+		printk(KERN_WARNING "%s: timeout = 0x%x [type = %d, " \
+					"refcnt = %d, locked = %d]\n",
+					sync->name, (u32)sobj->dmabuf,
+					sobj->access_type,
+					atomic_read(&sobj->robj->shared_cnt),
+					sobj->robj->locked);
+
+		/* unlock only valid sync object. */
+		if (!sobj->robj->locked)
+			continue;
+
+		if (sobj->robj->shared &&
+		    atomic_add_unless(&sobj->robj->shared_cnt, -1, 1))
+			continue;
+
+		ww_mutex_unlock(&sobj->robj->lock);
+
+		if (sobj->access_type & DMA_BUF_ACCESS_R)
+			printk(KERN_WARNING "%s: r-unlocked = 0x%x\n",
+					sync->name, (u32)sobj->dmabuf);
+		else
+			printk(KERN_WARNING "%s: w-unlocked = 0x%x\n",
+					sync->name, (u32)sobj->dmabuf);
+	}
+
+	sync->status = 0;
+	mutex_unlock(&sync->lock);
+
+	dmabuf_sync_put_all(sync);
+	dmabuf_sync_fini(sync);
+}
+
+static void dmabuf_sync_cache_ops(struct dmabuf_sync *sync)
+{
+	struct dmabuf_sync_object *sobj;
+
+	mutex_lock(&sync->lock);
+
+	list_for_each_entry(sobj, &sync->syncs, head) {
+		struct dma_buf *dmabuf;
+
+		dmabuf = sobj->dmabuf;
+		if (WARN_ON(!dmabuf || !sobj->robj))
+			continue;
+
+		/* first time access. */
+		if (!sobj->robj->accessed_type)
+			goto out;
+
+		if (NEED_END_CPU_ACCESS(sobj->robj, sobj))
+			/* cache clean */
+			dma_buf_end_cpu_access(dmabuf, 0, dmabuf->size,
+							DMA_TO_DEVICE);
+		else if (NEED_BEGIN_CPU_ACCESS(sobj->robj, sobj))
+			/* cache invalidate */
+			dma_buf_begin_cpu_access(dmabuf, 0, dmabuf->size,
+							DMA_FROM_DEVICE);
+
+out:
+		/* Update access type to new one. */
+		sobj->robj->accessed_type = sobj->access_type;
+	}
+
+	mutex_unlock(&sync->lock);
+}
+
+static void dmabuf_sync_lock_timeout(unsigned long arg)
+{
+	struct dmabuf_sync *sync = (struct dmabuf_sync *)arg;
+
+	schedule_work(&sync->work);
+}
+
+static int dmabuf_sync_lock_objs(struct dmabuf_sync *sync,
+					struct ww_acquire_ctx *ctx)
+{
+	struct dmabuf_sync_object *contended_sobj = NULL;
+	struct dmabuf_sync_object *res_sobj = NULL;
+	struct dmabuf_sync_object *sobj = NULL;
+	int ret;
+
+	if (ctx)
+		ww_acquire_init(ctx, &reservation_ww_class);
+
+retry:
+	list_for_each_entry(sobj, &sync->syncs, head) {
+		if (WARN_ON(!sobj->robj))
+			continue;
+
+		/* Don't lock in case of read and read. */
+		if (sobj->robj->accessed_type & DMA_BUF_ACCESS_R &&
+		    sobj->access_type & DMA_BUF_ACCESS_R) {
+			atomic_inc(&sobj->robj->shared_cnt);
+			sobj->robj->shared = true;
+			continue;
+		}
+
+		if (sobj == res_sobj) {
+			res_sobj = NULL;
+			continue;
+		}
+
+		ret = ww_mutex_lock(&sobj->robj->lock, ctx);
+		if (ret < 0) {
+			contended_sobj = sobj;
+
+			if (ret == -EDEADLK)
+				printk(KERN_WARNING"%s: deadlock = 0x%x\n",
+					sync->name, (u32)sobj->dmabuf);
+			goto err;
+		}
+
+		sobj->robj->locked = true;
+	}
+
+	if (ctx)
+		ww_acquire_done(ctx);
+
+	init_timer(&sync->timer);
+
+	sync->timer.data = (unsigned long)sync;
+	sync->timer.function = dmabuf_sync_lock_timeout;
+	sync->timer.expires = jiffies + (HZ * MAX_SYNC_TIMEOUT);
+
+	add_timer(&sync->timer);
+
+	return 0;
+
+err:
+	list_for_each_entry_continue_reverse(sobj, &sync->syncs, head) {
+		/* Don't need to unlock in case of read and read. */
+		if (atomic_add_unless(&sobj->robj->shared_cnt, -1, 1))
+			continue;
+
+		ww_mutex_unlock(&sobj->robj->lock);
+		sobj->robj->locked = false;
+	}
+
+	if (res_sobj) {
+		if (!atomic_add_unless(&res_sobj->robj->shared_cnt, -1, 1)) {
+			ww_mutex_unlock(&res_sobj->robj->lock);
+			res_sobj->robj->locked = false;
+		}
+	}
+
+	if (ret == -EDEADLK) {
+		ww_mutex_lock_slow(&contended_sobj->robj->lock, ctx);
+		res_sobj = contended_sobj;
+
+		goto retry;
+	}
+
+	if (ctx)
+		ww_acquire_fini(ctx);
+
+	return ret;
+}
+
+static void dmabuf_sync_unlock_objs(struct dmabuf_sync *sync,
+					struct ww_acquire_ctx *ctx)
+{
+	struct dmabuf_sync_object *sobj;
+
+	if (list_empty(&sync->syncs))
+		return;
+
+	mutex_lock(&sync->lock);
+
+	list_for_each_entry(sobj, &sync->syncs, head) {
+		if (sobj->robj->shared) {
+			if (atomic_add_unless(&sobj->robj->shared_cnt, -1 , 1))
+				continue;
+
+			ww_mutex_unlock(&sobj->robj->lock);
+			sobj->robj->shared = false;
+			sobj->robj->locked = false;
+		} else {
+			ww_mutex_unlock(&sobj->robj->lock);
+			sobj->robj->locked = false;
+		}
+	}
+
+	mutex_unlock(&sync->lock);
+
+	if (ctx)
+		ww_acquire_fini(ctx);
+
+	del_timer(&sync->timer);
+}
+
+/**
+ * is_dmabuf_sync_supported - Check if dmabuf sync is supported or not.
+ */
+bool is_dmabuf_sync_supported(void)
+{
+	return dmabuf_sync_enabled == 1;
+}
+EXPORT_SYMBOL(is_dmabuf_sync_supported);
+
+/**
+ * dmabuf_sync_init - Allocate and initialize a dmabuf sync.
+ *
+ * @priv: A device private data.
+ * @name: A sync object name.
+ *
+ * This function should be called when a device context or an event
+ * context such as a page flip event is created. And the created
+ * dmabuf_sync object should be set to the context.
+ * The caller can get a new sync object for buffer synchronization
+ * through this function.
+ */
+struct dmabuf_sync *dmabuf_sync_init(void *priv, const char *name)
+{
+	struct dmabuf_sync *sync;
+
+	sync = kzalloc(sizeof(*sync), GFP_KERNEL);
+	if (!sync)
+		return ERR_PTR(-ENOMEM);
+
+	strncpy(sync->name, name, ARRAY_SIZE(sync->name) - 1);
+
+	sync->priv = priv;
+	INIT_LIST_HEAD(&sync->syncs);
+	mutex_init(&sync->lock);
+	INIT_WORK(&sync->work, dmabuf_sync_timeout_worker);
+
+	return sync;
+}
+EXPORT_SYMBOL(dmabuf_sync_init);
+
+/**
+ * dmabuf_sync_fini - Release a given dmabuf sync.
+ *
+ * @sync: An object to dmabuf_sync structure.
+ *
+ * This function should be called if some operation is failed after
+ * dmabuf_sync_init call to release relevant resources, and after
+ * dmabuf_sync_unlock function is called.
+ */
+void dmabuf_sync_fini(struct dmabuf_sync *sync)
+{
+	if (WARN_ON(!sync))
+		return;
+
+	kfree(sync);
+}
+EXPORT_SYMBOL(dmabuf_sync_fini);
+
+/*
+ * dmabuf_sync_get_obj - Add a given object to syncs list.
+ *
+ * @sync: An object to dmabuf_sync structure.
+ * @dmabuf: An object to dma_buf structure.
+ * @type: A access type to a dma buf.
+ *	The DMA_BUF_ACCESS_R means that this dmabuf could be accessed by
+ *	others for read access. On the other hand, the DMA_BUF_ACCESS_W
+ *	means that this dmabuf couldn't be accessed by others but would be
+ *	accessed by caller's dma exclusively. And the DMA_BUF_ACCESS_DMA can be
+ *	combined.
+ *
+ * This function creates and initializes a new dmabuf sync object and it adds
+ * the dmabuf sync object to syncs list to track and manage all dmabufs.
+ */
+static int dmabuf_sync_get_obj(struct dmabuf_sync *sync, struct dma_buf *dmabuf,
+					unsigned int type)
+{
+	struct dmabuf_sync_object *sobj;
+
+	if (!dmabuf->resv) {
+		WARN_ON(1);
+		return -EFAULT;
+	}
+
+	if (!IS_VALID_DMA_BUF_ACCESS_TYPE(type))
+		return -EINVAL;
+
+	if ((type & DMA_BUF_ACCESS_RW) == DMA_BUF_ACCESS_RW)
+		type &= ~DMA_BUF_ACCESS_R;
+
+	sobj = kzalloc(sizeof(*sobj), GFP_KERNEL);
+	if (!sobj) {
+		WARN_ON(1);
+		return -ENOMEM;
+	}
+
+	sobj->dmabuf = dmabuf;
+	sobj->robj = dmabuf->resv;
+
+	mutex_lock(&sync->lock);
+	list_add_tail(&sobj->head, &sync->syncs);
+	mutex_unlock(&sync->lock);
+
+	get_dma_buf(dmabuf);
+
+	sobj->access_type = type;
+
+	return 0;
+}
+
+/*
+ * dmabuf_sync_put_obj - Release a given sync object.
+ *
+ * @sync: An object to dmabuf_sync structure.
+ *
+ * This function should be called if some operation is failed after
+ * dmabuf_sync_get_obj call to release a given sync object.
+ */
+static void dmabuf_sync_put_obj(struct dmabuf_sync *sync,
+					struct dma_buf *dmabuf)
+{
+	struct dmabuf_sync_object *sobj;
+
+	mutex_lock(&sync->lock);
+
+	list_for_each_entry(sobj, &sync->syncs, head) {
+		if (sobj->dmabuf != dmabuf)
+			continue;
+
+		dma_buf_put(sobj->dmabuf);
+
+		list_del_init(&sobj->head);
+		kfree(sobj);
+		break;
+	}
+
+	if (list_empty(&sync->syncs))
+		sync->status = 0;
+
+	mutex_unlock(&sync->lock);
+}
+
+/*
+ * dmabuf_sync_put_objs - Release all sync objects of dmabuf_sync.
+ *
+ * @sync: An object to dmabuf_sync structure.
+ *
+ * This function should be called if some operation is failed after
+ * dmabuf_sync_get_obj call to release all sync objects.
+ */
+static void dmabuf_sync_put_objs(struct dmabuf_sync *sync)
+{
+	struct dmabuf_sync_object *sobj, *next;
+
+	mutex_lock(&sync->lock);
+
+	list_for_each_entry_safe(sobj, next, &sync->syncs, head) {
+		dma_buf_put(sobj->dmabuf);
+
+		list_del_init(&sobj->head);
+		kfree(sobj);
+	}
+
+	mutex_unlock(&sync->lock);
+
+	sync->status = 0;
+}
+
+/**
+ * dmabuf_sync_lock - lock all dmabufs added to syncs list.
+ *
+ * @sync: An object to dmabuf_sync structure.
+ *
+ * The caller should call this function prior to CPU or DMA access to
+ * the dmabufs so that others can not access the dmabufs.
+ * Internally, this function avoids dead lock issue with ww-mutex.
+ */
+int dmabuf_sync_lock(struct dmabuf_sync *sync)
+{
+	int ret;
+
+	if (!sync) {
+		WARN_ON(1);
+		return -EFAULT;
+	}
+
+	if (list_empty(&sync->syncs))
+		return -EINVAL;
+
+	if (sync->status != DMABUF_SYNC_GOT)
+		return -EINVAL;
+
+	ret = dmabuf_sync_lock_objs(sync, &sync->ctx);
+	if (ret < 0) {
+		WARN_ON(1);
+		return ret;
+	}
+
+	sync->status = DMABUF_SYNC_LOCKED;
+
+	dmabuf_sync_cache_ops(sync);
+
+	return ret;
+}
+EXPORT_SYMBOL(dmabuf_sync_lock);
+
+/**
+ * dmabuf_sync_unlock - unlock all objects added to syncs list.
+ *
+ * @sync: An object to dmabuf_sync structure.
+ *
+ * The caller should call this function after CPU or DMA access to
+ * the dmabufs is completed so that others can access the dmabufs.
+ */
+int dmabuf_sync_unlock(struct dmabuf_sync *sync)
+{
+	if (!sync) {
+		WARN_ON(1);
+		return -EFAULT;
+	}
+
+	/* If current dmabuf sync object wasn't reserved then just return. */
+	if (sync->status != DMABUF_SYNC_LOCKED)
+		return -EAGAIN;
+
+	dmabuf_sync_unlock_objs(sync, &sync->ctx);
+
+	return 0;
+}
+EXPORT_SYMBOL(dmabuf_sync_unlock);
+
+/**
+ * dmabuf_sync_get - initialize reservation entry and update
+ *				dmabuf sync.
+ *
+ * @sync: An object to dmabuf_sync structure.
+ * @sync_buf: A dma_buf object pointer that we want to be synchronized
+ *	with others.
+ *
+ * This function should be called after dmabuf_sync_init function is called.
+ * The caller can group multiple dmabufs by calling this function several
+ * times. Internally, this function also takes a reference to a dmabuf.
+ */
+int dmabuf_sync_get(struct dmabuf_sync *sync, void *sync_buf, unsigned int type)
+{
+	int ret;
+
+	if (!sync || !sync_buf) {
+		WARN_ON(1);
+		return -EFAULT;
+	}
+
+	ret = dmabuf_sync_get_obj(sync, sync_buf, type);
+	if (ret < 0) {
+		WARN_ON(1);
+		return ret;
+	}
+
+	sync->status = DMABUF_SYNC_GOT;
+
+	return 0;
+}
+EXPORT_SYMBOL(dmabuf_sync_get);
+
+/**
+ * dmabuf_sync_put - Release a given dmabuf.
+ *
+ * @sync: An object to dmabuf_sync structure.
+ * @dmabuf: An object to dma_buf structure.
+ *
+ * This function should be called if some operation is failed after
+ * dmabuf_sync_get function is called to release the dmabuf, or
+ * dmabuf_sync_unlock function is called.
+ */
+void dmabuf_sync_put(struct dmabuf_sync *sync, struct dma_buf *dmabuf)
+{
+	if (!sync || !dmabuf) {
+		WARN_ON(1);
+		return;
+	}
+
+	if (list_empty(&sync->syncs))
+		return;
+
+	dmabuf_sync_put_obj(sync, dmabuf);
+}
+EXPORT_SYMBOL(dmabuf_sync_put);
+
+/**
+ * dmabuf_sync_put_all - Release all sync objects
+ *
+ * @sync: An object to dmabuf_sync structure.
+ *
+ * This function should be called if some operation is failed after
+ * dmabuf_sync_get function is called to release all sync objects, or
+ * dmabuf_sync_unlock function is called.
+ */
+void dmabuf_sync_put_all(struct dmabuf_sync *sync)
+{
+	if (!sync) {
+		WARN_ON(1);
+		return;
+	}
+
+	if (list_empty(&sync->syncs))
+		return;
+
+	dmabuf_sync_put_objs(sync);
+}
+EXPORT_SYMBOL(dmabuf_sync_put_all);
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 34cfbac..ae0c87b 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -150,6 +150,20 @@  struct dma_buf_attachment {
 	void *priv;
 };
 
+#define	DMA_BUF_ACCESS_R	0x1
+#define DMA_BUF_ACCESS_W	0x2
+#define DMA_BUF_ACCESS_DMA	0x4
+#define DMA_BUF_ACCESS_RW	(DMA_BUF_ACCESS_R | DMA_BUF_ACCESS_W)
+#define DMA_BUF_ACCESS_DMA_R	(DMA_BUF_ACCESS_R | DMA_BUF_ACCESS_DMA)
+#define DMA_BUF_ACCESS_DMA_W	(DMA_BUF_ACCESS_W | DMA_BUF_ACCESS_DMA)
+#define DMA_BUF_ACCESS_DMA_RW	(DMA_BUF_ACCESS_DMA_R | DMA_BUF_ACCESS_DMA_W)
+#define IS_VALID_DMA_BUF_ACCESS_TYPE(t)	(t == DMA_BUF_ACCESS_R || \
+					 t == DMA_BUF_ACCESS_W || \
+					 t == DMA_BUF_ACCESS_DMA_R || \
+					 t == DMA_BUF_ACCESS_DMA_W || \
+					 t == DMA_BUF_ACCESS_RW || \
+					 t == DMA_BUF_ACCESS_DMA_RW)
+
 /**
  * get_dma_buf - convenience wrapper for get_file.
  * @dmabuf:	[in]	pointer to dma_buf
diff --git a/include/linux/dmabuf-sync.h b/include/linux/dmabuf-sync.h
new file mode 100644
index 0000000..44c37de
--- /dev/null
+++ b/include/linux/dmabuf-sync.h
@@ -0,0 +1,115 @@ 
+/*
+ * Copyright (C) 2013 Samsung Electronics Co.Ltd
+ * Authors:
+ *	Inki Dae <inki.dae@samsung.com>
+ *
+ * This program is free software; you can redistribute  it and/or modify it
+ * under  the terms of  the GNU General  Public License as published by the
+ * Free Software Foundation;  either version 2 of the  License, or (at your
+ * option) any later version.
+ *
+ */
+
+#include <linux/mutex.h>
+#include <linux/sched.h>
+#include <linux/dma-buf.h>
+#include <linux/reservation.h>
+
+enum dmabuf_sync_status {
+	DMABUF_SYNC_GOT		= 1,
+	DMABUF_SYNC_LOCKED,
+};
+
+/*
+ * A structure for dmabuf_sync_object.
+ *
+ * @head: A list head to be added to syncs list.
+ * @robj: A reservation_object object.
+ * @dma_buf: A dma_buf object.
+ * @access_type: Indicate how a current task tries to access
+ *	a given buffer.
+ */
+struct dmabuf_sync_object {
+	struct list_head		head;
+	struct reservation_object	*robj;
+	struct dma_buf			*dmabuf;
+	unsigned int			access_type;
+};
+
+/*
+ * A structure for dmabuf_sync.
+ *
+ * @syncs: A list head to sync object and this is global to system.
+ * @list: A list entry used as committed list node
+ * @lock: A mutex lock to current sync object.
+ * @ctx: A current context for ww mutex.
+ * @work: A work struct to release resources at timeout.
+ * @priv: A private data.
+ * @name: A string to dmabuf sync owner.
+ * @timer: A timer list to avoid lockup and release resources.
+ * @status: Indicate current status (DMABUF_SYNC_GOT or DMABUF_SYNC_LOCKED).
+ */
+struct dmabuf_sync {
+	struct list_head	syncs;
+	struct list_head	list;
+	struct mutex		lock;
+	struct ww_acquire_ctx	ctx;
+	struct work_struct	work;
+	void			*priv;
+	char			name[64];
+	struct timer_list	timer;
+	unsigned int		status;
+};
+
+#ifdef CONFIG_DMABUF_SYNC
+extern bool is_dmabuf_sync_supported(void);
+
+extern struct dmabuf_sync *dmabuf_sync_init(void *priv, const char *name);
+
+extern void dmabuf_sync_fini(struct dmabuf_sync *sync);
+
+extern int dmabuf_sync_lock(struct dmabuf_sync *sync);
+
+extern int dmabuf_sync_unlock(struct dmabuf_sync *sync);
+
+extern int dmabuf_sync_get(struct dmabuf_sync *sync, void *sync_buf,
+				unsigned int type);
+
+extern void dmabuf_sync_put(struct dmabuf_sync *sync, struct dma_buf *dmabuf);
+
+extern void dmabuf_sync_put_all(struct dmabuf_sync *sync);
+
+#else
+static inline bool is_dmabuf_sync_supported(void) { return false; }
+
+static inline struct dmabuf_sync *dmabuf_sync_init(void *priv,
+					const char *names)
+{
+	return ERR_PTR(0);
+}
+
+static inline void dmabuf_sync_fini(struct dmabuf_sync *sync) { }
+
+static inline int dmabuf_sync_lock(struct dmabuf_sync *sync)
+{
+	return 0;
+}
+
+static inline int dmabuf_sync_unlock(struct dmabuf_sync *sync)
+{
+	return 0;
+}
+
+static inline int dmabuf_sync_get(struct dmabuf_sync *sync,
+					void *sync_buf,
+					unsigned int type)
+{
+	return 0;
+}
+
+static inline void dmabuf_sync_put(struct dmabuf_sync *sync,
+					struct dma_buf *dmabuf) { }
+
+static inline void dmabuf_sync_put_all(struct dmabuf_sync *sync) { }
+
+#endif
diff --git a/include/linux/reservation.h b/include/linux/reservation.h
index 80050e2..4310192 100644
--- a/include/linux/reservation.h
+++ b/include/linux/reservation.h
@@ -50,6 +50,11 @@  struct reservation_object {
 	struct fence *fence_excl;
 	struct fence **fence_shared;
 	u32 fence_shared_count, fence_shared_max;
+
+	atomic_t		shared_cnt;
+	unsigned int		accessed_type;
+	unsigned int		shared;
+	unsigned int		locked;
 };
 
 static inline void
@@ -60,6 +65,8 @@  reservation_object_init(struct reservation_object *obj)
 	obj->fence_shared_count = obj->fence_shared_max = 0;
 	obj->fence_shared = NULL;
 	obj->fence_excl = NULL;
+
+	atomic_set(&obj->shared_cnt, 1);
 }
 
 static inline void