[v2] ARM: edma: fix residue race for cyclic

Message ID	87eggvt8hi.fsf_-_@linutronix.de (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org> From: John Ogness <john.ogness@linutronix.de> To: linux-kernel@vger.kernel.org Subject: [PATCH v2] ARM: edma: fix residue race for cyclic References: <87io68s931.fsf@linutronix.de> <561F862A.7040106@ti.com> Date: Fri, 16 Oct 2015 12:26:17 +0200 In-Reply-To: <561F862A.7040106@ti.com> (Peter Ujfalusi's message of "Thu, 15 Oct 2015 13:55:38 +0300") Message-ID: <87eggvt8hi.fsf_-_@linutronix.de> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.4 (gnu/linux) MIME-Version: 1.0 Precedence: list Cc: dmaengine@vger.kernel.org, vinod.koul@intel.com, nsekhar@ti.com, linux-arm-kernel@lists.infradead.org Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org

Message ID

87eggvt8hi.fsf_-_@linutronix.de (mailing list archive)

State

New, archived

Headers

From: John Ogness <john.ogness@linutronix.de>
To: linux-kernel@vger.kernel.org
Subject: [PATCH v2] ARM: edma: fix residue race for cyclic
References: <87io68s931.fsf@linutronix.de> <561F862A.7040106@ti.com>
Date: Fri, 16 Oct 2015 12:26:17 +0200
In-Reply-To: <561F862A.7040106@ti.com> (Peter Ujfalusi's message of "Thu, 15
	Oct 2015 13:55:38 +0300")
Message-ID: <87eggvt8hi.fsf_-_@linutronix.de>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.4 (gnu/linux)
MIME-Version: 1.0
Precedence: list
Cc: dmaengine@vger.kernel.org, vinod.koul@intel.com, nsekhar@ti.com,
	linux-arm-kernel@lists.infradead.org
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org>
Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org

Commit Message

John Ogness Oct. 16, 2015, 10:26 a.m. UTC

When retrieving the residue value for cyclic transfers, the
SRC/DST fields of the active PaRAM are read. However, the AM335x
Technical Reference Manual states:

  11.3.3.6 Parameter Set Updates

  After the TR is read from the PaRAM (and is in the process
  of being submitted to the EDMA3TC), the following fields are
  updated as needed: ... SRC DST

This means SRC/DST is incremented even though the DMA transfer
may not have started yet or is in progress. Thus if the reader
of the residue accesses the DMA buffer too quickly, the CPU is
misinformed about the data that has been successfully processed.

The CCSTAT.ACTV register is a boolean that is set if any TR is
being processed by either the EMDA3CC or EDMA3TC. By polling
this register it is possible to ensure that the residue value
returned is valid for immediate processing. However, since the
DMA engine may be active, polling may never hit a moment where
no TR is being processed. To handle this, the SRC/DST is also
polled to see if it changes. And as a last resort, a max loop
count for the busy waiting exists to avoid an infinite loop.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
---
 v1-v2 changes
 . rebased for next-20151016
 . added multiple exit conditions for busy wait loop

 drivers/dma/edma.c |   40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

Comments

Peter Ujfalusi Oct. 16, 2015, 11:46 a.m. UTC | #1

On 10/16/2015 01:26 PM, John Ogness wrote:
> When retrieving the residue value for cyclic transfers, the
> SRC/DST fields of the active PaRAM are read. However, the AM335x
> Technical Reference Manual states:
> 
>   11.3.3.6 Parameter Set Updates
> 
>   After the TR is read from the PaRAM (and is in the process
>   of being submitted to the EDMA3TC), the following fields are
>   updated as needed: ... SRC DST
> 
> This means SRC/DST is incremented even though the DMA transfer
> may not have started yet or is in progress. Thus if the reader
> of the residue accesses the DMA buffer too quickly, the CPU is
> misinformed about the data that has been successfully processed.
> 
> The CCSTAT.ACTV register is a boolean that is set if any TR is
> being processed by either the EMDA3CC or EDMA3TC. By polling
> this register it is possible to ensure that the residue value
> returned is valid for immediate processing. However, since the
> DMA engine may be active, polling may never hit a moment where
> no TR is being processed. To handle this, the SRC/DST is also
> polled to see if it changes. And as a last resort, a max loop
> count for the busy waiting exists to avoid an infinite loop.

I'm not sure what this actually going to solve, except that we are going to
wait for the next PaRAM parameter update.
As you have already described the issue is that when you first submit the
transfer, the first PaRAM will be submitted to TC and at this point the
parameters will be updated to be prepared for the next update.
This means that after we initiate the DMA the SRC/DST will be updated to point
to the next batch of data. Reading SRC/DST before the second parameter update
will always give you this. But this is also true further in the line.
We never know exactly where the DMA is, we only know that the DMA is somewhere
in between 0x1234 - (0x1234 + data until next parameter update). At the next
parameter update you again can not be sure where the DMA is, since it is now
somewhere between (0x1234 + parameter update number of data) - till the next
update, and so on.
In the cyclic case we use AB-sync mode, in this case the parameter update is
going to happen after each period if I'm not mistaken.

To achieve the same thing (waiting for the next parameter update to happen)
you could just poll the PaRAM's CCNT in AB-sync to see if it changing and when
it did you return the SRC/DST address since those will be close enough at the
time.

But what happens if the period size if big and the position is asked just
right after the parameter update? We can not know this, so we spin a bit here
and give up and return whatever we had in SRC/DST.

Not sure how to deal with this, but for sure needs more thought...

John Ogness Oct. 16, 2015, 12:28 p.m. UTC | #2

On 2015-10-16, Peter Ujfalusi <peter.ujfalusi@ti.com> wrote:
> On 10/16/2015 01:26 PM, John Ogness wrote:
>> When retrieving the residue value for cyclic transfers, the
>> SRC/DST fields of the active PaRAM are read. However, the AM335x
>> Technical Reference Manual states:
>> 
>>   11.3.3.6 Parameter Set Updates
>> 
>>   After the TR is read from the PaRAM (and is in the process
>>   of being submitted to the EDMA3TC), the following fields are
>>   updated as needed: ... SRC DST
>> 
>> This means SRC/DST is incremented even though the DMA transfer
>> may not have started yet or is in progress. Thus if the reader
>> of the residue accesses the DMA buffer too quickly, the CPU is
>> misinformed about the data that has been successfully processed.
>> 
>> The CCSTAT.ACTV register is a boolean that is set if any TR is
>> being processed by either the EMDA3CC or EDMA3TC. By polling
>> this register it is possible to ensure that the residue value
>> returned is valid for immediate processing. However, since the
>> DMA engine may be active, polling may never hit a moment where
>> no TR is being processed. To handle this, the SRC/DST is also
>> polled to see if it changes. And as a last resort, a max loop
>> count for the busy waiting exists to avoid an infinite loop.
>
> I'm not sure what this actually going to solve, except that we are
> going to wait for the next PaRAM parameter update.

We are waiting until the active transfers are complete. The main wait
condition is when ACTV is 0. When this happens, all transfers are
definately complete. In the normal case, this is the condition that
causes the loop exit.

> As you have already described the issue is that when you first submit
> the transfer, the first PaRAM will be submitted to TC and at this
> point the parameters will be updated to be prepared for the next
> update.  This means that after we initiate the DMA the SRC/DST will be
> updated to point to the next batch of data. Reading SRC/DST before the
> second parameter update will always give you this. But this is also
> true further in the line.  We never know exactly where the DMA is, we
> only know that the DMA is somewhere in between 0x1234 - (0x1234 + data
> until next parameter update). At the next parameter update you again
> can not be sure where the DMA is, since it is now somewhere between
> (0x1234 + parameter update number of data) - till the next update, and
> so on.
>
> In the cyclic case we use AB-sync mode, in this case the parameter update is
> going to happen after each period if I'm not mistaken.
>
> To achieve the same thing (waiting for the next parameter update to happen)
> you could just poll the PaRAM's CCNT in AB-sync to see if it changing and when
> it did you return the SRC/DST address since those will be close enough at the
> time.

Is there really a difference between polling CCNT and polling SRC/DST?
Notice that the function does _not_ return the polled SRC/DST. The extra
polling is only used as an additional loop exit condition in case we
missed ACTV being 0. This condition does occur once in while (during my
3Mbit UART tests, about once every 100000 transfers).

> But what happens if the period size if big and the position is asked
> just right after the parameter update? We can not know this, so we
> spin a bit here and give up and return whatever we had in SRC/DST.

I would hope that we never enter the "give up" condition. If a transfer
request was started, that means one burst of data is supposed to be
ready, which means the burst transfer should execute quickly. If
something unexpected happens (certain clocks suddenly turn off, etc),
then we fallback to the "give up" condition and return bad data to the
caller.

I was considering if dmaengine_tx_status() should somehow notify that it
is not possible to identify the residue (i.e. we gave up waiting for the
transfer to complete). But that really adds new semantics to the DMA
API, which is something much bigger.

An alternative would be to always return a "last known residue". But
this can really only be tracked during DMA completion interrupts and
successful edma_residue() calls. Both of which cannot guarentee that we
didn't miss a completed transfer. That is particularly a problem for the
UART driver, where a call to dmaengine_tx_status() is then followed by
PIO. That means dmaengine_tx_status() needs to return as much DMA data
as is really available.

> Not sure how to deal with this, but for sure needs more thought...

This patch closes a very real and reproducable race window in the eDMA
driver. Aside from adding busy waiting cycles it does not produce worse
results than before. But I am open to ideas.

John Ogness

Peter Ujfalusi Oct. 16, 2015, 1:46 p.m. UTC | #3

On 10/16/2015 03:28 PM, John Ogness wrote:
> On 2015-10-16, Peter Ujfalusi <peter.ujfalusi@ti.com> wrote:
>> On 10/16/2015 01:26 PM, John Ogness wrote:
>>> When retrieving the residue value for cyclic transfers, the
>>> SRC/DST fields of the active PaRAM are read. However, the AM335x
>>> Technical Reference Manual states:
>>>
>>>   11.3.3.6 Parameter Set Updates
>>>
>>>   After the TR is read from the PaRAM (and is in the process
>>>   of being submitted to the EDMA3TC), the following fields are
>>>   updated as needed: ... SRC DST
>>>
>>> This means SRC/DST is incremented even though the DMA transfer
>>> may not have started yet or is in progress. Thus if the reader
>>> of the residue accesses the DMA buffer too quickly, the CPU is
>>> misinformed about the data that has been successfully processed.
>>>
>>> The CCSTAT.ACTV register is a boolean that is set if any TR is
>>> being processed by either the EMDA3CC or EDMA3TC. By polling
>>> this register it is possible to ensure that the residue value
>>> returned is valid for immediate processing. However, since the
>>> DMA engine may be active, polling may never hit a moment where
>>> no TR is being processed. To handle this, the SRC/DST is also
>>> polled to see if it changes. And as a last resort, a max loop
>>> count for the busy waiting exists to avoid an infinite loop.
>>
>> I'm not sure what this actually going to solve, except that we are
>> going to wait for the next PaRAM parameter update.
> 
> We are waiting until the active transfers are complete. The main wait
> condition is when ACTV is 0. When this happens, all transfers are
> definately complete. In the normal case, this is the condition that
> causes the loop exit.

Let's say you are playing audio (from SD card), recording audio (to SD card)
and also run the UART test.
You might fail to catch the CC inactivity due to other users of the DMA - on
other TC for example.
Can you try if checking the SRC/DST change only (and not checking the ACTV
bit) is working in a similar way?

>> As you have already described the issue is that when you first submit
>> the transfer, the first PaRAM will be submitted to TC and at this
>> point the parameters will be updated to be prepared for the next
>> update.  This means that after we initiate the DMA the SRC/DST will be
>> updated to point to the next batch of data. Reading SRC/DST before the
>> second parameter update will always give you this. But this is also
>> true further in the line.  We never know exactly where the DMA is, we
>> only know that the DMA is somewhere in between 0x1234 - (0x1234 + data
>> until next parameter update). At the next parameter update you again
>> can not be sure where the DMA is, since it is now somewhere between
>> (0x1234 + parameter update number of data) - till the next update, and
>> so on.
>>
>> In the cyclic case we use AB-sync mode, in this case the parameter update is
>> going to happen after each period if I'm not mistaken.
>>
>> To achieve the same thing (waiting for the next parameter update to happen)
>> you could just poll the PaRAM's CCNT in AB-sync to see if it changing and when
>> it did you return the SRC/DST address since those will be close enough at the
>> time.
> 
> Is there really a difference between polling CCNT and polling SRC/DST?
> Notice that the function does _not_ return the polled SRC/DST. The extra
> polling is only used as an additional loop exit condition in case we
> missed ACTV being 0. This condition does occur once in while (during my
> 3Mbit UART tests, about once every 100000 transfers).

Which means that you were spinning 10000 times on the bit and on the SRC/DST
and still it did not found the condition to exit.
Looking at the code and the documentation: the SRC/DST address should be
updated after each burst (C-sync event) requested, or in case the burst is not
enabled after every B-sync event. I have not looked a the numbers regarding to
the length of these, for audio the burst is relatively small (max 64 words).

So even if the period size is big, the SRC/DST is updated more frequently.

Ideally we should report the previous address IMHO, but not sure how to
calculate that in a generic way (A-sync, AB-sync, cyclic, non-cyclic, etc).

So probably for now this would be fine, but can you try w/o checking the ACTV
and relying only on the change of address?


and few comments to the patch.

>> But what happens if the period size if big and the position is asked
>> just right after the parameter update? We can not know this, so we
>> spin a bit here and give up and return whatever we had in SRC/DST.
> 
> I would hope that we never enter the "give up" condition. If a transfer
> request was started, that means one burst of data is supposed to be
> ready, which means the burst transfer should execute quickly. If
> something unexpected happens (certain clocks suddenly turn off, etc),
> then we fallback to the "give up" condition and return bad data to the
> caller.
> 
> I was considering if dmaengine_tx_status() should somehow notify that it
> is not possible to identify the residue (i.e. we gave up waiting for the
> transfer to complete). But that really adds new semantics to the DMA
> API, which is something much bigger.
> 
> An alternative would be to always return a "last known residue". But
> this can really only be tracked during DMA completion interrupts and
> successful edma_residue() calls. Both of which cannot guarentee that we
> didn't miss a completed transfer. That is particularly a problem for the
> UART driver, where a call to dmaengine_tx_status() is then followed by
> PIO. That means dmaengine_tx_status() needs to return as much DMA data
> as is really available.
> 
>> Not sure how to deal with this, but for sure needs more thought...
> 
> This patch closes a very real and reproducable race window in the eDMA
> driver. Aside from adding busy waiting cycles it does not produce worse
> results than before. But I am open to ideas.
> 
> John Ogness
>

Peter Ujfalusi Oct. 16, 2015, 1:58 p.m. UTC | #4

On 10/16/2015 01:26 PM, John Ogness wrote:
> When retrieving the residue value for cyclic transfers, the
> SRC/DST fields of the active PaRAM are read. However, the AM335x
> Technical Reference Manual states:
> 
>   11.3.3.6 Parameter Set Updates
> 
>   After the TR is read from the PaRAM (and is in the process
>   of being submitted to the EDMA3TC), the following fields are
>   updated as needed: ... SRC DST
> 
> This means SRC/DST is incremented even though the DMA transfer
> may not have started yet or is in progress. Thus if the reader
> of the residue accesses the DMA buffer too quickly, the CPU is
> misinformed about the data that has been successfully processed.
> 
> The CCSTAT.ACTV register is a boolean that is set if any TR is
> being processed by either the EMDA3CC or EDMA3TC. By polling
> this register it is possible to ensure that the residue value
> returned is valid for immediate processing. However, since the
> DMA engine may be active, polling may never hit a moment where
> no TR is being processed. To handle this, the SRC/DST is also
> polled to see if it changes. And as a last resort, a max loop
> count for the busy waiting exists to avoid an infinite loop.
> 
> Signed-off-by: John Ogness <john.ogness@linutronix.de>
> ---
>  v1-v2 changes
>  . rebased for next-20151016
>  . added multiple exit conditions for busy wait loop
> 
>  drivers/dma/edma.c |   40 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 40 insertions(+)
> diff --git a/drivers/dma/edma.c b/drivers/dma/edma.c
> index 7eefbf1..8d3b3ac 100644
> --- a/drivers/dma/edma.c
> +++ b/drivers/dma/edma.c
> @@ -24,6 +24,8 @@
>  #include <linux/platform_device.h>
>  #include <linux/slab.h>
>  #include <linux/spinlock.h>
> +#include <linux/ratelimit.h>
> +#include <linux/printk.h>
>  #include <linux/of.h>
>  #include <linux/of_dma.h>
>  #include <linux/of_irq.h>
> @@ -1785,11 +1787,24 @@ static void edma_issue_pending(struct dma_chan *chan)
>  	spin_unlock_irqrestore(&echan->vchan.lock, flags);
>  }
>  
> +#define EDMA_CCSTAT_ACTV (1 << 4)
> +
> +/*
> + * This limit exists to avoid a possible infinite loop when waiting
> + * for confirmation that a particular transfer is completed. However,
> + * large bursts to/from slow devices might actually require many
> + * loops (in which case busy waiting is bad anyway). On an AM335x
> + * transfering 48 bytes from the UART RX-FIFO, as many as 55 loops
> + * have been seen.
> + */
> +#define EDMA_MAX_TR_WAIT_LOOPS 10000
> +
>  static u32 edma_residue(struct edma_desc *edesc)
>  {
>  	bool dst = edesc->direction == DMA_DEV_TO_MEM;
>  	struct edma_pset *pset = edesc->pset;

	struct edma_chan *echan = edesc->echan;

>  	dma_addr_t done, pos;
> +	int loop_count = 0;

	int loop_count = EDMA_MAX_TR_WAIT_LOOPS;

>  	int i;
>  
>  	/*
> @@ -1799,6 +1814,31 @@ static u32 edma_residue(struct edma_desc *edesc)
>  	pos = edma_get_position(edesc->echan->ecc, edesc->echan->slot[0], dst);

Use echan-> here directly and other places in this function.

>  
>  	/*
> +	 * "pos" may represent a transfer request that is still being
> +	 * processed by the EDMACC or EDMATC. We will busy wait until
> +	 * one of the situations occurs:
> +	 *   1. no transfer requests are active
> +	 *   2. a different transfer request is being processed
> +	 *   3. we hit the loop limit
> +	 */
> +	while (edma_read(edesc->echan->ecc, EDMA_CCSTAT) & EDMA_CCSTAT_ACTV) {
> +		/* check if a different transfer request is active */
> +		if (edma_get_position(edesc->echan->ecc,
> +				      edesc->echan->slot[0], dst) != pos) {
> +			break;
> +		}
> +
> +		loop_count++;
> +		if (loop_count == EDMA_MAX_TR_WAIT_LOOPS) {

		if (--loop_count) {

> +			pr_debug_ratelimited("%s: possibly returning "
> +					     "invalid residue\n", __func__);

			dev_dbg_ratelimited(echan->ecc->dev,
				"%s: timeout waiting for PaRAM update\n",
				__func__);

> +			break;
> +		}
> +
> +		cpu_relax();
> +	}
> +
> +	/*
>  	 * Cyclic is simple. Just subtract pset[0].addr from pos.
>  	 *
>  	 * We never update edesc->residue in the cyclic case, so we
>

diff --git a/drivers/dma/edma.c b/drivers/dma/edma.c
index 7eefbf1..8d3b3ac 100644
--- a/drivers/dma/edma.c
+++ b/drivers/dma/edma.c
@@ -24,6 +24,8 @@ 
 #include <linux/platform_device.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
+#include <linux/ratelimit.h>
+#include <linux/printk.h>
 #include <linux/of.h>
 #include <linux/of_dma.h>
 #include <linux/of_irq.h>
@@ -1785,11 +1787,24 @@  static void edma_issue_pending(struct dma_chan *chan)
 	spin_unlock_irqrestore(&echan->vchan.lock, flags);
 }
 
+#define EDMA_CCSTAT_ACTV (1 << 4)
+
+/*
+ * This limit exists to avoid a possible infinite loop when waiting
+ * for confirmation that a particular transfer is completed. However,
+ * large bursts to/from slow devices might actually require many
+ * loops (in which case busy waiting is bad anyway). On an AM335x
+ * transfering 48 bytes from the UART RX-FIFO, as many as 55 loops
+ * have been seen.
+ */
+#define EDMA_MAX_TR_WAIT_LOOPS 10000
+
 static u32 edma_residue(struct edma_desc *edesc)
 {
 	bool dst = edesc->direction == DMA_DEV_TO_MEM;
 	struct edma_pset *pset = edesc->pset;
 	dma_addr_t done, pos;
+	int loop_count = 0;
 	int i;
 
 	/*
@@ -1799,6 +1814,31 @@  static u32 edma_residue(struct edma_desc *edesc)
 	pos = edma_get_position(edesc->echan->ecc, edesc->echan->slot[0], dst);
 
 	/*
+	 * "pos" may represent a transfer request that is still being
+	 * processed by the EDMACC or EDMATC. We will busy wait until
+	 * one of the situations occurs:
+	 *   1. no transfer requests are active
+	 *   2. a different transfer request is being processed
+	 *   3. we hit the loop limit
+	 */
+	while (edma_read(edesc->echan->ecc, EDMA_CCSTAT) & EDMA_CCSTAT_ACTV) {
+		/* check if a different transfer request is active */
+		if (edma_get_position(edesc->echan->ecc,
+				      edesc->echan->slot[0], dst) != pos) {
+			break;
+		}
+
+		loop_count++;
+		if (loop_count == EDMA_MAX_TR_WAIT_LOOPS) {
+			pr_debug_ratelimited("%s: possibly returning "
+					     "invalid residue\n", __func__);
+			break;
+		}
+
+		cpu_relax();
+	}
+
+	/*
 	 * Cyclic is simple. Just subtract pset[0].addr from pos.
 	 *
 	 * We never update edesc->residue in the cyclic case, so we

[v2] ARM: edma: fix residue race for cyclic

Commit Message

Comments

Patch