[2/7] dmaengine: omap-dma: Complete the cookie first on transfer completion

Message ID 20160714124242.7579-3-peter.ujfalusi@ti.com (mailing list archive)
State New, archived

Commit Message

Peter Ujfalusi July 14, 2016, 12:42 p.m. UTC
Before looking for the next descriptor to start, complete the just finished
cookie.

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
---
 drivers/dma/omap-dma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Russell King (Oracle) July 18, 2016, 10:34 a.m. UTC | #1
On Thu, Jul 14, 2016 at 03:42:37PM +0300, Peter Ujfalusi wrote:
> Before looking for the next descriptor to start, complete the just finished
> cookie.

This change will reduce performance as we no longer have an overlap
between the next request starting to be dealt with in the hardware
vs the previous request being completed.  Your commit log doesn't
say _why_ the change is being made, it merely tells us what the
patch is doing, which we can see already.

Please describe changes a little better.
Peter Ujfalusi July 19, 2016, 12:35 p.m. UTC | #2
On 07/18/16 13:34, Russell King - ARM Linux wrote:
> On Thu, Jul 14, 2016 at 03:42:37PM +0300, Peter Ujfalusi wrote:
>> Before looking for the next descriptor to start, complete the just finished
>> cookie.
> 
> This change will reduce performance as we no longer have an overlap
> between the next request starting to be dealt with in the hardware
> vs the previous request being completed.

vchan_cookie_complete() only marks the cookie completed, adds the vd to
the desc_completed list (it was deleted from the desc_issued list when it was
started by omap_dma_start_desc) and schedules the tasklet to deal with the real
completion later.
Marking the just finished descriptor/cookie done first then looking for
possible descriptors in the queue to start feels like a better sequence.
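
The split described above (cheap bookkeeping now, real completion later in
the tasklet) can be sketched in user space. This is a simplified model only,
not the virt-dma code: every model_* name is hypothetical, the list is a
plain singly linked stack so ordering is not modeled, and the tasklet is
reduced to a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for the virt-dma structures. */
struct model_desc {
	int cookie;
	struct model_desc *next;	/* link in the completed list */
};

struct model_chan {
	int completed_cookie;		/* last cookie marked complete */
	struct model_desc *completed;	/* stands in for desc_completed */
	bool tasklet_scheduled;		/* stands in for scheduling the tasklet */
};

/* Cheap part, done from the interrupt handler: bookkeeping only. */
static void model_cookie_complete(struct model_chan *c, struct model_desc *d)
{
	c->completed_cookie = d->cookie;
	d->next = c->completed;		/* list order is not modeled here */
	c->completed = d;
	c->tasklet_scheduled = true;
}

/* Deferred part, run later: returns how many client callbacks it would make. */
static int model_tasklet(struct model_chan *c)
{
	int callbacks = 0;

	assert(c->tasklet_scheduled);
	c->tasklet_scheduled = false;
	while (c->completed) {
		struct model_desc *d = c->completed;

		c->completed = d->next;
		d->next = NULL;
		callbacks++;		/* the client callback would run here */
	}
	return callbacks;
}
```

In this model the client only sees the completion when model_tasklet() runs,
however early model_cookie_complete() was called.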

After a quick grep in the kernel source: only omap-dma.c was starting the next
transfer before marking the current completed descriptor/cookie done.

> Your commit log doesn't
> say _why_ the change is being made, it merely tells us what the
> patch is doing, which we can see already.
> 
> Please describe changes a little better.
>
Russell King (Oracle) July 19, 2016, 4:20 p.m. UTC | #3
On Tue, Jul 19, 2016 at 03:35:18PM +0300, Peter Ujfalusi wrote:
> On 07/18/16 13:34, Russell King - ARM Linux wrote:
> > On Thu, Jul 14, 2016 at 03:42:37PM +0300, Peter Ujfalusi wrote:
> >> Before looking for the next descriptor to start, complete the just finished
> >> cookie.
> > 
> > This change will reduce performance as we no longer have an overlap
> > between the next request starting to be dealt with in the hardware
> > vs the previous request being completed.
> 
> vchan_cookie_complete() will only mark the cookie completed, adds the vd to
> the desc_completed list (it was deleted from desc_issued list when it was
> started by omap_dma_start_desc) and schedule the tasklet to deal with the real
> completion later.
> Marking the just finished descriptor/cookie done first then looking for
> possible descriptors in the queue to start feels like a better sequence.

I deliberately arranged the code in the original order so that the next
transfer was started on the hardware with the least amount of work by
the CPU.  Yes, there may not be much in it, but everything you mention
above adds to the number of CPU cycles that need to be executed before
the next transfer can be started.

More CPU cycles wasted means higher latency between transfers, which
means lower performance.
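
The ordering argument can be illustrated with a toy cost model. The cost
values below are assumptions chosen for illustration, not measurements, and
the trace_* helpers are hypothetical; the sketch only shows that the total
CPU work is the same while the work done before the hardware is restarted
differs.

```c
/* Hypothetical cost model: each helper adds its CPU work to a counter. */
struct irq_trace {
	int cpu_work;		/* arbitrary units of work done so far */
	int work_before_start;	/* work already done when the HW was started */
};

enum { COMPLETE_COST = 3, START_COST = 1 };	/* assumed, not measured */

static void trace_complete(struct irq_trace *t)
{
	t->cpu_work += COMPLETE_COST;
}

static void trace_start_next(struct irq_trace *t)
{
	t->work_before_start = t->cpu_work;	/* hardware kicks off here */
	t->cpu_work += START_COST;
}

/* Original order: start the next transfer, then complete the cookie. */
static struct irq_trace start_then_complete(void)
{
	struct irq_trace t = {0, 0};

	trace_start_next(&t);
	trace_complete(&t);
	return t;
}

/* Order proposed by the patch: complete the cookie, then start. */
static struct irq_trace complete_then_start(void)
{
	struct irq_trace t = {0, 0};

	trace_complete(&t);
	trace_start_next(&t);
	return t;
}
```

With these assumed costs the original order reaches the hardware after 0
units of work and the patched order after 3, while both spend 4 units in
total.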

> After a quick grep in the kernel source: only omap-dma.c was starting the
> next transfer before marking the current completed descriptor/cookie done.

Right, because I've thought about the issue, having been the author of
both virt-dma and omap-dma.
Peter Ujfalusi July 19, 2016, 7:23 p.m. UTC | #4
On 07/19/2016 07:20 PM, Russell King - ARM Linux wrote:
>> vchan_cookie_complete() will only mark the cookie completed, adds the vd to
>> the desc_completed list (it was deleted from desc_issued list when it was
>> started by omap_dma_start_desc) and schedule the tasklet to deal with the real
>> completion later.
>> Marking the just finished descriptor/cookie done first then looking for
>> possible descriptors in the queue to start feels like a better sequence.
> 
> I deliberately arranged the code in the original order so that the next
> transfer was started on the hardware with the least amount of work by
> the CPU.  Yes, there may not be much in it, but everything you mention
> above adds to the number of CPU cycles that need to be executed before
> the next transfer can be started.
> 
> More CPU cycles wasted means higher latency between transfers, which
> means lower performance.

OK. I will drop this patch in v2.

>> After a quick grep in the kernel source: only omap-dma.c was starting the
>> next transfer before marking the current completed descriptor/cookie done.
> 
> Right, because I've thought about the issue, having been the author of
> both virt-dma and omap-dma.
>
Robert Jarzmik July 20, 2016, 6:26 a.m. UTC | #5
Peter Ujfalusi <peter.ujfalusi@ti.com> writes:

> On 07/18/16 13:34, Russell King - ARM Linux wrote:
>> On Thu, Jul 14, 2016 at 03:42:37PM +0300, Peter Ujfalusi wrote:
>>> Before looking for the next descriptor to start, complete the just finished
>>> cookie.
>> 
>> This change will reduce performance as we no longer have an overlap
>> between the next request starting to be dealt with in the hardware
>> vs the previous request being completed.
>
> vchan_cookie_complete() will only mark the cookie completed, adds the vd to
> the desc_completed list (it was deleted from desc_issued list when it was
> started by omap_dma_start_desc) and schedule the tasklet to deal with the real
> completion later.
> Marking the just finished descriptor/cookie done first then looking for
> possible descriptors in the queue to start feels like a better sequence.
>
> After a quick grep in the kernel source: only omap-dma.c was starting the next
> transfer before marking the current completed descriptor/cookie done.

Euh actually I think it's done in other drivers as well :
 - Documentation/dmaengine/pxa_dma.txt (chapter "Transfers hot-chaining")
 - drivers/dma/pxa_dma.c
   => look for pxad_try_hotchain() and its impact on pxad_chan_handler() which
   will mark the completion while the next transfer is already pumped by the
   hardware.

Speaking of which, from a purely design point of view, as long as you think
beforehand what is your sequence, ie. what is the sequence of your link
chaining, completion handling, etc ..., both marking before or after next tx
start should be fine IMHO.

So in your quest for the "better sequence" the pxa driver's one might give you
some perspective :)

Cheers.
Peter Ujfalusi July 21, 2016, 9:33 a.m. UTC | #6
On 07/20/16 09:26, Robert Jarzmik wrote:
> Peter Ujfalusi <peter.ujfalusi@ti.com> writes:
> 
>> On 07/18/16 13:34, Russell King - ARM Linux wrote:
>>> On Thu, Jul 14, 2016 at 03:42:37PM +0300, Peter Ujfalusi wrote:
>>>> Before looking for the next descriptor to start, complete the just finished
>>>> cookie.
>>>
>>> This change will reduce performance as we no longer have an overlap
>>> between the next request starting to be dealt with in the hardware
>>> vs the previous request being completed.
>>
>> vchan_cookie_complete() will only mark the cookie completed, adds the vd to
>> the desc_completed list (it was deleted from desc_issued list when it was
>> started by omap_dma_start_desc) and schedule the tasklet to deal with the real
>> completion later.
>> Marking the just finished descriptor/cookie done first then looking for
>> possible descriptors in the queue to start feels like a better sequence.
>>
>> After a quick grep in the kernel source: only omap-dma.c was starting the next
>> transfer before marking the current completed descriptor/cookie done.
> 
> Euh actually I think it's done in other drivers as well :
>  - Documentation/dmaengine/pxa_dma.txt (chapter "Transfers hot-chaining)
>  - drivers/dma/pxa_dma.c
>    => look for pxad_try_hotchain() and it's impact on pxad_chan_handler() which
>    will mark the completion while the next transfer is already pumped by the
>    hardware.

The 'hot-chaining' is a bit different than what omap-dma is doing, if I got it
right. When the DMA is running and a new request comes in, the driver appends
the new transfer to the list used by the HW. This way no stop and restart is
needed; the DMA keeps running w/o interruption.
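
A rough user-space sketch of that idea, based only on the description above
and not on the pxa_dma code: model_hw and model_submit() are hypothetical,
and the race where the hardware finishes just as a descriptor is appended is
deliberately ignored.

```c
#include <stdbool.h>

#define MODEL_CHAIN_MAX 8

/* Hypothetical model of a controller walking a descriptor chain. */
struct model_hw {
	bool running;			/* is the engine currently active? */
	int chain[MODEL_CHAIN_MAX];	/* descriptor ids queued to the HW */
	int len;
	int restarts;			/* how often the engine was (re)started */
};

/*
 * Append a descriptor to the chain. The engine is only started when it is
 * idle; a running engine picks the new entry up without a stop/restart.
 */
static bool model_submit(struct model_hw *hw, int desc_id)
{
	if (hw->len == MODEL_CHAIN_MAX)
		return false;
	hw->chain[hw->len++] = desc_id;
	if (!hw->running) {
		hw->running = true;
		hw->restarts++;
	}
	return true;
}
```

Submitting several descriptors back to back starts the modeled engine once;
a driver that stops after each descriptor would restart it every time.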

> Speaking of which, from a purely design point of view, as long as you think
> beforehand what is your sequence, ie. what is the sequence of your link
> chaining, completion handling, etc ..., both marking before or after next tx
> start should be fine IMHO.

Yes, it might be a bit better from a performance point of view if we first
start the pending descriptor (if there is one) and then do the
vchan_cookie_complete(). On the other hand, if we care more about latency and
accuracy we should complete the transfer first and then look for pending
descriptors. But since virt_dma is using a tasklet for the real completion,
the latency is always going to be set by when the tasklet is given the chance
to execute.

> So in your quest for the "better sequence" the pxa driver's one might give you
> some perspective :)

I did think about similar 'hot-chaining' for TI's eDMA and sDMA. Especially
eDMA would benefit from it, but so far I see too many race conditions to
overcome to be brave enough to write something to test it, and I don't have
time for it atm ;)
Peter Ujfalusi July 21, 2016, 9:35 a.m. UTC | #7
On 07/21/16 12:33, Peter Ujfalusi wrote:
> On 07/20/16 09:26, Robert Jarzmik wrote:
>> Peter Ujfalusi <peter.ujfalusi@ti.com> writes:
>>
>>> On 07/18/16 13:34, Russell King - ARM Linux wrote:
>>>> On Thu, Jul 14, 2016 at 03:42:37PM +0300, Peter Ujfalusi wrote:
>>>>> Before looking for the next descriptor to start, complete the just finished
>>>>> cookie.
>>>>
>>>> This change will reduce performance as we no longer have an overlap
>>>> between the next request starting to be dealt with in the hardware
>>>> vs the previous request being completed.
>>>
>>> vchan_cookie_complete() will only mark the cookie completed, adds the vd to
>>> the desc_completed list (it was deleted from desc_issued list when it was
>>> started by omap_dma_start_desc) and schedule the tasklet to deal with the real
>>> completion later.
>>> Marking the just finished descriptor/cookie done first then looking for
>>> possible descriptors in the queue to start feels like a better sequence.
>>>
>>> After a quick grep in the kernel source: only omap-dma.c was starting the next
>>> transfer before marking the current completed descriptor/cookie done.
>>
>> Euh actually I think it's done in other drivers as well :
>>  - Documentation/dmaengine/pxa_dma.txt (chapter "Transfers hot-chaining)
>>  - drivers/dma/pxa_dma.c
>>    => look for pxad_try_hotchain() and it's impact on pxad_chan_handler() which
>>    will mark the completion while the next transfer is already pumped by the
>>    hardware.
> 
> The 'hot-chaining' is a bit different then what omap-dma is doing.

s/then/than

> If I got it
> right. When the DMA is running and a new request comes the driver will append
> the new transfer to the list used by the HW. This way there will be no stop
> and restart needed, the DMA is running w/o interruption.
> 
>> Speaking of which, from a purely design point of view, as long as you think
>> beforehand what is your sequence, ie. what is the sequence of your link
>> chaining, completion handling, etc ..., both marking before or after next tx
>> start should be fine IMHO.
> 
> Yes, it might be a bit better from performance point of view if we first start
> the pending descriptor (if there is one) then do the vchan_cookie_complete().
> On the other hand if we care more about latency and accuracy we should
> complete the transfer first then look for pending descriptors. But since
> virt_dma is using a tasklet for the real completion, the latency is always
> going to be when the tasklet is given the chance to execute.
> 
>> So in your quest for the "better sequence" the pxa driver's one might give you
>> some perspective :)
> 
> I did thought about similar 'hot-chaining' for TI's eDMA and sDMA. Especially
> eDMA would benefit from it, but so far I see too many race conditions to
> overcome to be brave enough to write something to test it. and I don't have
> time for it atm ;)
>
Russell King (Oracle) July 21, 2016, 9:47 a.m. UTC | #8
On Thu, Jul 21, 2016 at 12:33:12PM +0300, Peter Ujfalusi wrote:
> On 07/20/16 09:26, Robert Jarzmik wrote:
> > Speaking of which, from a purely design point of view, as long as you think
> > beforehand what is your sequence, ie. what is the sequence of your link
> > chaining, completion handling, etc ..., both marking before or after next tx
> > start should be fine IMHO.
> 
> Yes, it might be a bit better from performance point of view if we first start
> the pending descriptor (if there is one) then do the vchan_cookie_complete().
> On the other hand if we care more about latency and accuracy we should
> complete the transfer first then look for pending descriptors. But since
> virt_dma is using a tasklet for the real completion, the latency is always
> going to be when the tasklet is given the chance to execute.

I think this shows a slight misunderstanding of the DMA engine API.  The
DMA completion is defined by the API to always happen in tasklet context,
which is why the virt-dma stuff does it that way - and all other DMA
engine drivers.  It's one of the fundamentals of the API.

As it happens in tasklet context, and tasklets can be scheduled to run with
variable latency, any use of the DMA engine API which relies on a predictable
latency around the completion handling is going to be unreliable.

Remember also that with circular buffers, there's no guarantee of getting
period-based completion callbacks - several periods can complete and you
are only guaranteed to get one completion callback.

So, the idea that completion callbacks can have anything to do with low
latency or accuracy is totally incorrect.
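
The point about circular buffers can be sketched with a small model. It
assumes that scheduling an already pending tasklet has no additional effect,
so period interrupts that arrive before the tasklet runs collapse into one
callback. All model_* names are hypothetical and this is not the virt-dma
implementation.

```c
#include <stdbool.h>

/* Hypothetical model of cyclic completion handling. */
struct model_cyclic {
	int periods_done;	/* periods the hardware has finished */
	bool tasklet_pending;	/* tasklet scheduled but not yet run */
	int callbacks;		/* client callbacks actually delivered */
};

/* Interrupt side: one call per completed period. */
static void model_period_irq(struct model_cyclic *c)
{
	c->periods_done++;
	c->tasklet_pending = true;	/* scheduling twice has no extra effect */
}

/* Tasklet side: runs at some later, variable time. */
static void model_run_tasklet(struct model_cyclic *c)
{
	if (!c->tasklet_pending)
		return;
	c->tasklet_pending = false;
	c->callbacks++;		/* one callback, however many periods elapsed */
}
```

If three periods complete before the tasklet gets to run, the model delivers
a single callback, so a client cannot count periods by counting callbacks.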
Peter Ujfalusi July 22, 2016, 11 a.m. UTC | #9
On 07/21/16 12:47, Russell King - ARM Linux wrote:
> On Thu, Jul 21, 2016 at 12:33:12PM +0300, Peter Ujfalusi wrote:
>> On 07/20/16 09:26, Robert Jarzmik wrote:
>>> Speaking of which, from a purely design point of view, as long as you think
>>> beforehand what is your sequence, ie. what is the sequence of your link
>>> chaining, completion handling, etc ..., both marking before or after next tx
>>> start should be fine IMHO.
>>
>> Yes, it might be a bit better from performance point of view if we first start
>> the pending descriptor (if there is one) then do the vchan_cookie_complete().
>> On the other hand if we care more about latency and accuracy we should
>> complete the transfer first then look for pending descriptors. But since
>> virt_dma is using a tasklet for the real completion, the latency is always
>> going to be when the tasklet is given the chance to execute.
> 
> I think this shows a slight misunderstanding of the DMA engine API.  The
> DMA completion is defined by the API to always happen in tasklet context,
> which is why the virt-dma stuff does it that way - and all other DMA
> engine drivers.  It's one of the fundamentals of the API.
> 
> As it happens in tasklet context, tasklets can be scheduled to run with
> variable latency, so any use of the DMA engine API which has a predictable
> latency around the completion handling is going to be unreliable.
> 
> Remember also that with circular buffers, there's no guarantee of getting
> period-based completion callbacks - several periods can complete and you
> are only guaranteed to get one completion callback.
> 
> So, the idea that completion callbacks can have anything to do with low
> latency or accuracy is totally incorrect.

Thanks for refreshing my memory, you are absolutely right.
Vinod Koul July 24, 2016, 7:39 a.m. UTC | #10
On Tue, Jul 19, 2016 at 05:20:04PM +0100, Russell King - ARM Linux wrote:
> On Tue, Jul 19, 2016 at 03:35:18PM +0300, Peter Ujfalusi wrote:
> > On 07/18/16 13:34, Russell King - ARM Linux wrote:
> > > On Thu, Jul 14, 2016 at 03:42:37PM +0300, Peter Ujfalusi wrote:
> > >> Before looking for the next descriptor to start, complete the just finished
> > >> cookie.
> > > 
> > > This change will reduce performance as we no longer have an overlap
> > > between the next request starting to be dealt with in the hardware
> > > vs the previous request being completed.
> > 
> > vchan_cookie_complete() will only mark the cookie completed, adds the vd to
> > the desc_completed list (it was deleted from desc_issued list when it was
> > started by omap_dma_start_desc) and schedule the tasklet to deal with the real
> > completion later.
> > Marking the just finished descriptor/cookie done first then looking for
> > possible descriptors in the queue to start feels like a better sequence.
> 
> I deliberately arranged the code in the original order so that the next
> transfer was started on the hardware with the least amount of work by
> the CPU.  Yes, there may not be much in it, but everything you mention
> above adds to the number of CPU cycles that need to be executed before
> the next transfer can be started.

Yes that is really the right thing to do. Ideally people would want to
minimize the delay and submit the next one as soon as possible, but people
have been lazy on this and few other aspects :)

Patch

diff --git a/drivers/dma/omap-dma.c b/drivers/dma/omap-dma.c
index 7d56cd88c9a5..f7b0b0c668fb 100644
--- a/drivers/dma/omap-dma.c
+++ b/drivers/dma/omap-dma.c
@@ -449,8 +449,8 @@  static void omap_dma_callback(int ch, u16 status, void *data)
 			if (c->sgidx < d->sglen) {
 				omap_dma_start_sg(c, d);
 			} else {
-				omap_dma_start_desc(c);
 				vchan_cookie_complete(&d->vd);
+				omap_dma_start_desc(c);
 			}
 		} else {
 			vchan_cyclic_callback(&d->vd);