Message ID | 20160714124242.7579-3-peter.ujfalusi@ti.com (mailing list archive) |
---|---|
State | Changes Requested |
On Thu, Jul 14, 2016 at 03:42:37PM +0300, Peter Ujfalusi wrote:
> Before looking for the next descriptor to start, complete the just finished
> cookie.

This change will reduce performance as we no longer have an overlap between the next request starting to be dealt with in the hardware vs the previous request being completed.

Your commit log doesn't say _why_ the change is being made, it merely tells us what the patch is doing, which we can see already.

Please describe changes a little better.

On 07/18/16 13:34, Russell King - ARM Linux wrote:
> On Thu, Jul 14, 2016 at 03:42:37PM +0300, Peter Ujfalusi wrote:
>> Before looking for the next descriptor to start, complete the just finished
>> cookie.
>
> This change will reduce performance as we no longer have an overlap
> between the next request starting to be dealt with in the hardware
> vs the previous request being completed.

vchan_cookie_complete() will only mark the cookie completed, adds the vd to the desc_completed list (it was deleted from desc_issued list when it was started by omap_dma_start_desc) and schedule the tasklet to deal with the real completion later.
Marking the just finished descriptor/cookie done first then looking for possible descriptors in the queue to start feels like a better sequence.

After a quick grep in the kernel source: only omap-dma.c was starting the next transfer before marking the current completed descriptor/cookie done.

> Your commit log doesn't
> say _why_ the change is being made, it merely tells us what the
> patch is doing, which we can see already.
>
> Please describe changes a little better.
>

On Tue, Jul 19, 2016 at 03:35:18PM +0300, Peter Ujfalusi wrote:
> On 07/18/16 13:34, Russell King - ARM Linux wrote:
> > On Thu, Jul 14, 2016 at 03:42:37PM +0300, Peter Ujfalusi wrote:
> >> Before looking for the next descriptor to start, complete the just finished
> >> cookie.
> >
> > This change will reduce performance as we no longer have an overlap
> > between the next request starting to be dealt with in the hardware
> > vs the previous request being completed.
>
> vchan_cookie_complete() will only mark the cookie completed, adds the vd to
> the desc_completed list (it was deleted from desc_issued list when it was
> started by omap_dma_start_desc) and schedule the tasklet to deal with the real
> completion later.
> Marking the just finished descriptor/cookie done first then looking for
> possible descriptors in the queue to start feels like a better sequence.

I deliberately arranged the code in the original order so that the next transfer was started on the hardware with the least amount of work by the CPU. Yes, there may not be much in it, but everything you mention above adds to the number of CPU cycles that need to be executed before the next transfer can be started.

More CPU cycles wasted means higher latency between transfers, which means lower performance.

> After a quick grep in the kernel source: only omap-dma.c was starting the
> next transfer before marking the current completed descriptor/cookie done.

Right, because I've thought about the issue, having been the author of both virt-dma and omap-dma.

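Editorial sketch: the ordering argument above can be illustrated with a small toy model. This is not the omap-dma or virt-dma code; every name below (toy_chan, toy_cookie_complete, toy_start_desc and so on) is hypothetical, and the "CPU steps" counter is only a stand-in for the bookkeeping work that the thread says vchan_cookie_complete() performs (mark the cookie, move the descriptor to the completed list, schedule the tasklet). The sketch assumes each of those actions costs one step.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one DMA channel. Hypothetical names, not the kernel's structs. */
struct toy_chan {
	int issued;           /* descriptors queued, waiting for the hardware */
	int completed;        /* descriptors handed over to the deferred tasklet */
	int last_cookie;      /* most recent cookie marked complete */
	bool tasklet_pending; /* deferred completion has been scheduled */
	bool hw_busy;         /* hardware is working on a descriptor */
	int cpu_steps;        /* bookkeeping steps executed in the handler so far */
	int steps_at_start;   /* cpu_steps at the moment the hardware was restarted */
};

/* The three actions described in the thread: mark, move, schedule. */
static void toy_cookie_complete(struct toy_chan *c, int cookie)
{
	c->last_cookie = cookie;   c->cpu_steps++;
	c->completed++;            c->cpu_steps++;
	c->tasklet_pending = true; c->cpu_steps++;
}

/* Start the next issued descriptor, if any, and record how late that was. */
static void toy_start_desc(struct toy_chan *c)
{
	if (c->issued == 0) {
		c->hw_busy = false;
		return;
	}
	c->issued--;
	c->hw_busy = true;
	c->steps_at_start = c->cpu_steps;
}

/* Original order: restart the hardware, then do the bookkeeping. */
static int toy_irq_start_first(struct toy_chan *c, int cookie)
{
	toy_start_desc(c);
	toy_cookie_complete(c, cookie);
	return c->steps_at_start;
}

/* Order proposed by the patch: bookkeeping first, then restart. */
static int toy_irq_complete_first(struct toy_chan *c, int cookie)
{
	toy_cookie_complete(c, cookie);
	toy_start_desc(c);
	return c->steps_at_start;
}

/* Self-check: both orders end in the same state, only the start delay differs. */
static void toy_compare_orders(void)
{
	struct toy_chan a = { .issued = 1 };
	struct toy_chan b = { .issued = 1 };

	assert(toy_irq_start_first(&a, 7) == 0);
	assert(toy_irq_complete_first(&b, 7) == 3);
	assert(a.completed == b.completed && a.hw_busy && b.hw_busy);
}
```

Calling toy_compare_orders() from a main() exercises both handlers. In this model the final channel state is identical either way; what changes is how much handler work runs while the hardware sits idle, which is the point made in the reply above.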
On 07/19/2016 07:20 PM, Russell King - ARM Linux wrote:
>> vchan_cookie_complete() will only mark the cookie completed, adds the vd to
>> the desc_completed list (it was deleted from desc_issued list when it was
>> started by omap_dma_start_desc) and schedule the tasklet to deal with the real
>> completion later.
>> Marking the just finished descriptor/cookie done first then looking for
>> possible descriptors in the queue to start feels like a better sequence.
>
> I deliberately arranged the code in the original order so that the next
> transfer was started on the hardware with the least amount of work by
> the CPU. Yes, there may not be much in it, but everything you mention
> above adds to the number of CPU cycles that need to be executed before
> the next transfer can be started.
>
> More CPU cycles wasted means higher latency between transfers, which
> means lower performance.

OK. I will drop this patch in v2.

>> After a quick grep in the kernel source: only omap-dma.c was starting the
>> next transfer before marking the current completed descriptor/cookie done.
>
> Right, because I've thought about the issue, having been the author of
> both virt-dma and omap-dma.
>

Peter Ujfalusi <peter.ujfalusi@ti.com> writes:

> On 07/18/16 13:34, Russell King - ARM Linux wrote:
>> On Thu, Jul 14, 2016 at 03:42:37PM +0300, Peter Ujfalusi wrote:
>>> Before looking for the next descriptor to start, complete the just finished
>>> cookie.
>>
>> This change will reduce performance as we no longer have an overlap
>> between the next request starting to be dealt with in the hardware
>> vs the previous request being completed.
>
> vchan_cookie_complete() will only mark the cookie completed, adds the vd to
> the desc_completed list (it was deleted from desc_issued list when it was
> started by omap_dma_start_desc) and schedule the tasklet to deal with the real
> completion later.
> Marking the just finished descriptor/cookie done first then looking for
> possible descriptors in the queue to start feels like a better sequence.
>
> After a quick grep in the kernel source: only omap-dma.c was starting the next
> transfer before marking the current completed descriptor/cookie done.

Euh actually I think it's done in other drivers as well :
 - Documentation/dmaengine/pxa_dma.txt (chapter "Transfers hot-chaining)
 - drivers/dma/pxa_dma.c
   => look for pxad_try_hotchain() and it's impact on pxad_chan_handler() which will mark the completion while the next transfer is already pumped by the hardware.

Speaking of which, from a purely design point of view, as long as you think beforehand what is your sequence, ie. what is the sequence of your link chaining, completion handling, etc ..., both marking before or after next tx start should be fine IMHO.

So in your quest for the "better sequence" the pxa driver's one might give you some perspective :)

Cheers.

On 07/20/16 09:26, Robert Jarzmik wrote:
> Peter Ujfalusi <peter.ujfalusi@ti.com> writes:
>
>> On 07/18/16 13:34, Russell King - ARM Linux wrote:
>>> On Thu, Jul 14, 2016 at 03:42:37PM +0300, Peter Ujfalusi wrote:
>>>> Before looking for the next descriptor to start, complete the just finished
>>>> cookie.
>>>
>>> This change will reduce performance as we no longer have an overlap
>>> between the next request starting to be dealt with in the hardware
>>> vs the previous request being completed.
>>
>> vchan_cookie_complete() will only mark the cookie completed, adds the vd to
>> the desc_completed list (it was deleted from desc_issued list when it was
>> started by omap_dma_start_desc) and schedule the tasklet to deal with the real
>> completion later.
>> Marking the just finished descriptor/cookie done first then looking for
>> possible descriptors in the queue to start feels like a better sequence.
>>
>> After a quick grep in the kernel source: only omap-dma.c was starting the next
>> transfer before marking the current completed descriptor/cookie done.
>
> Euh actually I think it's done in other drivers as well :
> - Documentation/dmaengine/pxa_dma.txt (chapter "Transfers hot-chaining)
> - drivers/dma/pxa_dma.c
> => look for pxad_try_hotchain() and it's impact on pxad_chan_handler() which
> will mark the completion while the next transfer is already pumped by the
> hardware.

The 'hot-chaining' is a bit different then what omap-dma is doing. If I got it right. When the DMA is running and a new request comes the driver will append the new transfer to the list used by the HW. This way there will be no stop and restart needed, the DMA is running w/o interruption.

> Speaking of which, from a purely design point of view, as long as you think
> beforehand what is your sequence, ie. what is the sequence of your link
> chaining, completion handling, etc ..., both marking before or after next tx
> start should be fine IMHO.

Yes, it might be a bit better from performance point of view if we first start the pending descriptor (if there is one) then do the vchan_cookie_complete().
On the other hand if we care more about latency and accuracy we should complete the transfer first then look for pending descriptors. But since virt_dma is using a tasklet for the real completion, the latency is always going to be when the tasklet is given the chance to execute.

> So in your quest for the "better sequence" the pxa driver's one might give you
> some perspective :)

I did thought about similar 'hot-chaining' for TI's eDMA and sDMA. Especially eDMA would benefit from it, but so far I see too many race conditions to overcome to be brave enough to write something to test it.
and I don't have time for it atm ;)

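Editorial sketch: the hot-chaining idea as described in this message (append a new transfer to the list the hardware is already walking, so no stop and restart is needed) might look roughly like the toy model below. It is not taken from drivers/dma/pxa_dma.c; toy_desc, toy_hw, toy_submit and toy_hw_advance are hypothetical names, and the model deliberately ignores the race where real hardware finishes the last descriptor between the "is it running?" check and the link update, which is the kind of race the message above worries about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One transfer descriptor in a singly linked chain (hypothetical layout). */
struct toy_desc {
	int id;
	struct toy_desc *next;
};

/* Toy "hardware": walks the chain until it runs off the end. */
struct toy_hw {
	struct toy_desc *current; /* descriptor in flight, NULL when stopped */
	struct toy_desc *tail;    /* last descriptor linked into the running chain */
	int restarts;             /* how often the hardware had to be (re)started */
};

/*
 * Submit a descriptor. Returns true when it was hot-chained onto a running
 * chain, false when the hardware was idle and had to be started.
 */
static bool toy_submit(struct toy_hw *hw, struct toy_desc *d)
{
	d->next = NULL;
	if (hw->current != NULL) {
		/* Hardware still running: link behind the current tail. */
		hw->tail->next = d;
		hw->tail = d;
		return true;
	}
	/* Hardware idle: start a fresh chain. */
	hw->current = d;
	hw->tail = d;
	hw->restarts++;
	return false;
}

/* The hardware finishes the current descriptor and follows the link. */
static void toy_hw_advance(struct toy_hw *hw)
{
	if (hw->current != NULL)
		hw->current = hw->current->next;
	if (hw->current == NULL)
		hw->tail = NULL;
}

/* Self-check: chaining while running avoids a restart, chaining after stop does not. */
static void toy_hotchain_demo(void)
{
	struct toy_hw hw = { 0 };
	struct toy_desc d1 = { .id = 1 }, d2 = { .id = 2 }, d3 = { .id = 3 };

	assert(!toy_submit(&hw, &d1)); /* idle, so this starts the hardware */
	assert(toy_submit(&hw, &d2));  /* running, so this is hot-chained */
	toy_hw_advance(&hw);
	assert(hw.current == &d2);
	toy_hw_advance(&hw);
	assert(hw.current == NULL);    /* chain exhausted, hardware stopped */
	assert(!toy_submit(&hw, &d3)); /* needs a second start */
	assert(hw.restarts == 2);
}
```

Calling toy_hotchain_demo() from a main() walks through the three cases. The contrast with the omap-dma sequence discussed in the thread is that here the next transfer is already linked before the completion interrupt arrives, so the handler has nothing to start.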
On 07/21/16 12:33, Peter Ujfalusi wrote:
> On 07/20/16 09:26, Robert Jarzmik wrote:
>> Peter Ujfalusi <peter.ujfalusi@ti.com> writes:
>>
>>> On 07/18/16 13:34, Russell King - ARM Linux wrote:
>>>> On Thu, Jul 14, 2016 at 03:42:37PM +0300, Peter Ujfalusi wrote:
>>>>> Before looking for the next descriptor to start, complete the just finished
>>>>> cookie.
>>>>
>>>> This change will reduce performance as we no longer have an overlap
>>>> between the next request starting to be dealt with in the hardware
>>>> vs the previous request being completed.
>>>
>>> vchan_cookie_complete() will only mark the cookie completed, adds the vd to
>>> the desc_completed list (it was deleted from desc_issued list when it was
>>> started by omap_dma_start_desc) and schedule the tasklet to deal with the real
>>> completion later.
>>> Marking the just finished descriptor/cookie done first then looking for
>>> possible descriptors in the queue to start feels like a better sequence.
>>>
>>> After a quick grep in the kernel source: only omap-dma.c was starting the next
>>> transfer before marking the current completed descriptor/cookie done.
>>
>> Euh actually I think it's done in other drivers as well :
>> - Documentation/dmaengine/pxa_dma.txt (chapter "Transfers hot-chaining)
>> - drivers/dma/pxa_dma.c
>> => look for pxad_try_hotchain() and it's impact on pxad_chan_handler() which
>> will mark the completion while the next transfer is already pumped by the
>> hardware.
>
> The 'hot-chaining' is a bit different then what omap-dma is doing.

s/then/than

> If I got it
> right. When the DMA is running and a new request comes the driver will append
> the new transfer to the list used by the HW. This way there will be no stop
> and restart needed, the DMA is running w/o interruption.
>
>> Speaking of which, from a purely design point of view, as long as you think
>> beforehand what is your sequence, ie. what is the sequence of your link
>> chaining, completion handling, etc ..., both marking before or after next tx
>> start should be fine IMHO.
>
> Yes, it might be a bit better from performance point of view if we first start
> the pending descriptor (if there is one) then do the vchan_cookie_complete().
> On the other hand if we care more about latency and accuracy we should
> complete the transfer first then look for pending descriptors. But since
> virt_dma is using a tasklet for the real completion, the latency is always
> going to be when the tasklet is given the chance to execute.
>
>> So in your quest for the "better sequence" the pxa driver's one might give you
>> some perspective :)
>
> I did thought about similar 'hot-chaining' for TI's eDMA and sDMA. Especially
> eDMA would benefit from it, but so far I see too many race conditions to
> overcome to be brave enough to write something to test it. and I don't have
> time for it atm ;)
>

On Thu, Jul 21, 2016 at 12:33:12PM +0300, Peter Ujfalusi wrote:
> On 07/20/16 09:26, Robert Jarzmik wrote:
> > Speaking of which, from a purely design point of view, as long as you think
> > beforehand what is your sequence, ie. what is the sequence of your link
> > chaining, completion handling, etc ..., both marking before or after next tx
> > start should be fine IMHO.
>
> Yes, it might be a bit better from performance point of view if we first start
> the pending descriptor (if there is one) then do the vchan_cookie_complete().
> On the other hand if we care more about latency and accuracy we should
> complete the transfer first then look for pending descriptors. But since
> virt_dma is using a tasklet for the real completion, the latency is always
> going to be when the tasklet is given the chance to execute.

I think this shows a slight misunderstanding of the DMA engine API. The DMA completion is defined by the API to always happen in tasklet context, which is why the virt-dma stuff does it that way - and all other DMA engine drivers. It's one of the fundamentals of the API.

As it happens in tasklet context, tasklets can be scheduled to run with variable latency, so any use of the DMA engine API which has a predictable latency around the completion handling is going to be unreliable.

Remember also that with circular buffers, there's no guarantee of getting period-based completion callbacks - several periods can complete and you are only guaranteed to get one completion callback.

So, the idea that completion callbacks can have anything to do with low latency or accuracy is totally incorrect.

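Editorial sketch: the point about circular buffers (several periods can complete before a single completion callback runs) can be shown with a toy model of a deferred tasklet. The names toy_cyclic, toy_period_irq and toy_run_tasklet are hypothetical and this is not the virt-dma implementation; the sketch only assumes that scheduling an already-pending tasklet does not queue a second run, and that a client should read its progress from the buffer position instead of counting callbacks.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy cyclic channel. Hypothetical names, not the kernel's structures. */
struct toy_cyclic {
	unsigned int periods_done; /* advanced by the hardware, one per period */
	bool tasklet_pending;      /* scheduling twice before it runs is a no-op */
	unsigned int callbacks;    /* how often the client callback actually ran */
	unsigned int periods_seen; /* client's view, taken from the position */
};

/* Period interrupt: note the progress and schedule the deferred work. */
static void toy_period_irq(struct toy_cyclic *c)
{
	c->periods_done++;
	c->tasklet_pending = true;
}

/* Deferred work: runs at some later, variable time and calls back once. */
static void toy_run_tasklet(struct toy_cyclic *c)
{
	if (!c->tasklet_pending)
		return;
	c->tasklet_pending = false;
	c->callbacks++;
	/* The client reads the current position instead of counting callbacks. */
	c->periods_seen = c->periods_done;
}

/* Self-check: three periods elapse before the tasklet gets to run. */
static void toy_coalescing_demo(void)
{
	struct toy_cyclic c = { 0 };

	toy_period_irq(&c);
	toy_period_irq(&c);
	toy_period_irq(&c);
	toy_run_tasklet(&c);

	assert(c.callbacks == 1);    /* one callback for three periods */
	assert(c.periods_seen == 3); /* progress is still fully visible */
}
```

Calling toy_coalescing_demo() from a main() shows the coalescing: a client that counted callbacks would believe one period had elapsed, while one that reads the position sees all three.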
On 07/21/16 12:47, Russell King - ARM Linux wrote:
> On Thu, Jul 21, 2016 at 12:33:12PM +0300, Peter Ujfalusi wrote:
>> On 07/20/16 09:26, Robert Jarzmik wrote:
>>> Speaking of which, from a purely design point of view, as long as you think
>>> beforehand what is your sequence, ie. what is the sequence of your link
>>> chaining, completion handling, etc ..., both marking before or after next tx
>>> start should be fine IMHO.
>>
>> Yes, it might be a bit better from performance point of view if we first start
>> the pending descriptor (if there is one) then do the vchan_cookie_complete().
>> On the other hand if we care more about latency and accuracy we should
>> complete the transfer first then look for pending descriptors. But since
>> virt_dma is using a tasklet for the real completion, the latency is always
>> going to be when the tasklet is given the chance to execute.
>
> I think this shows a slight misunderstanding of the DMA engine API. The
> DMA completion is defined by the API to always happen in tasklet context,
> which is why the virt-dma stuff does it that way - and all other DMA
> engine drivers. It's one of the fundamentals of the API.
>
> As it happens in tasklet context, tasklets can be scheduled to run with
> variable latency, so any use of the DMA engine API which has a predictable
> latency around the completion handling is going to be unreliable.
>
> Remember also that with circular buffers, there's no guarantee of getting
> period-based completion callbacks - several periods can complete and you
> are only guaranteed to get one completion callback.
>
> So, the idea that completion callbacks can have anything to do with low
> latency or accuracy is totally incorrect.

Thanks for refreshing my memory, you are absolutely right.

On Tue, Jul 19, 2016 at 05:20:04PM +0100, Russell King - ARM Linux wrote:
> On Tue, Jul 19, 2016 at 03:35:18PM +0300, Peter Ujfalusi wrote:
> > On 07/18/16 13:34, Russell King - ARM Linux wrote:
> > > On Thu, Jul 14, 2016 at 03:42:37PM +0300, Peter Ujfalusi wrote:
> > >> Before looking for the next descriptor to start, complete the just finished
> > >> cookie.
> > >
> > > This change will reduce performance as we no longer have an overlap
> > > between the next request starting to be dealt with in the hardware
> > > vs the previous request being completed.
> >
> > vchan_cookie_complete() will only mark the cookie completed, adds the vd to
> > the desc_completed list (it was deleted from desc_issued list when it was
> > started by omap_dma_start_desc) and schedule the tasklet to deal with the real
> > completion later.
> > Marking the just finished descriptor/cookie done first then looking for
> > possible descriptors in the queue to start feels like a better sequence.
>
> I deliberately arranged the code in the original order so that the next
> transfer was started on the hardware with the least amount of work by
> the CPU. Yes, there may not be much in it, but everything you mention
> above adds to the number of CPU cycles that need to be executed before
> the next transfer can be started.

Yes that is really the right thing to do. Ideally people would want to minimize the delay and submit the next one as soon as possible, but people have been lazy on this and few other aspects :)

diff --git a/drivers/dma/omap-dma.c b/drivers/dma/omap-dma.c
index 7d56cd88c9a5..f7b0b0c668fb 100644
--- a/drivers/dma/omap-dma.c
+++ b/drivers/dma/omap-dma.c
@@ -449,8 +449,8 @@ static void omap_dma_callback(int ch, u16 status, void *data)
 			if (c->sgidx < d->sglen) {
 				omap_dma_start_sg(c, d);
 			} else {
-				omap_dma_start_desc(c);
 				vchan_cookie_complete(&d->vd);
+				omap_dma_start_desc(c);
 			}
 		} else {
 			vchan_cyclic_callback(&d->vd);

Before looking for the next descriptor to start, complete the just finished
cookie.

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
---
 drivers/dma/omap-dma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)