diff mbox

[BUG] dmaengine: pxa_dma: + mmc: pxamci: race condition with DMA error on tx channel

Message ID 877f3zwwgo.fsf@belgarion.home (mailing list archive)
State New, archived
Headers show

Commit Message

Robert Jarzmik March 8, 2017, 4:43 p.m. UTC
Petr Cvek <petr.cvek@tul.cz> writes:

Hi Petr,
> I wasn't able to track the problem to a single patch as the problem occurs at
> random time (from the boot to like a half an hour) and it's maybe dependent on a
> level of a battery charge (maybe because of kernel log writes of charging
> messages).
Mmmh, long reproduction time, that will be bad.

> It seems that most occurrency is during writes on an SD card. Using an SDHC
> card decreases the time to fail. After failure the OS is unavailable (rootfs
> in on the card).
Okay, let me try to make write loop on my SD card to see if I manage to
reproduce this.

> From my poking in the kernel source code it seems there is a probability that pxamci_irq() takes longer to call and its subsequent call pxamci_data_done() isn't fast enough to set [1]
> 	host->data = NULL;

> From the DMA side, the DMA done interrupt is generated:
> 	pxad_chan_handler() -> vchan_cookie_complete()
> ...where a tasklet for vchan_complete() is scheduled
At least that seems to hint the DMA part is sound so for.
The bothering part is the log error "mmc0: DMA error on tx channel". I would
need a bit of guidance here, with the same log with [1] applied.

> , where finally with interrupts enabled (can pxamci_irq() be called here?) the
> callback pxamci_dma_irq() is called.
When DMA completes, there is a tiny window, before pxamci_dma_irq() is called,
when pxamci_irq() can be called, yes. As soon as the spinlock is taken in
pxamci_dma_irq() is taken, no more races.

> From my tests it seems at this point [2] the host->data is always NULL and rest
> of the callback is never called. It is called once with a nonempty host->data
> only just before the failure.
>
> During the testing I put udelay(100) at the start of pxamci_dma_irq() and fail
> occurred after like 2 hours (when I for the first time tapped the touchscreen -
> higher CPU usage and interrupts).
Mmm I would need more data here.

The biggest help I could get would be the pxa dma traces here :
echo -n 'file pxa_dma.c +p' > /sys/kernel/debug/dynamic_debug/control
echo -n 'file virt-dma.c +p' > /sys/kernel/debug/dynamic_debug/control

And then capture the last traces and send them to me.

Cheers.

Comments

Petr Cvek March 10, 2017, 12:49 a.m. UTC | #1
Dne 8.3.2017 v 17:43 Robert Jarzmik napsal(a):
> Petr Cvek <petr.cvek@tul.cz> writes:
> 
> Hi Petr,
>> I wasn't able to track the problem to a single patch as the problem occurs at
>> random time (from the boot to like a half an hour) and it's maybe dependent on a
>> level of a battery charge (maybe because of kernel log writes of charging
>> messages).
> Mmmh, long reproduction time, that will be bad.

Sent log is for a machine which finally died when I tried to start an X server. These debug printks slow it down.

>> It seems that most occurrency is during writes on an SD card. Using an SDHC
>> card decreases the time to fail. After failure the OS is unavailable (rootfs
>> in on the card).
> Okay, let me try to make write loop on my SD card to see if I manage to
> reproduce this.
> 

I tried to enable and disable soundcard (other part which is using DMA), but it does not have any effect. Concurrent interrupts may have.

>> From my poking in the kernel source code it seems there is a probability that pxamci_irq() takes longer to call and its subsequent call pxamci_data_done() isn't fast enough to set [1]
>> 	host->data = NULL;
> 
>> From the DMA side, the DMA done interrupt is generated:
>> 	pxad_chan_handler() -> vchan_cookie_complete()
>> ...where a tasklet for vchan_complete() is scheduled
> At least that seems to hint the DMA part is sound so for.
> The bothering part is the log error "mmc0: DMA error on tx channel". I would
> need a bit of guidance here, with the same log with [1] applied.
> 

Probably some cooperation of the DMA engine and the PXA MMC (like that other mail I've sent to you with UDC and g_webcam). I made another version of debug kernel with my own asserts:

In pxamci_dma_irq() before dmaengine_tx_status() call I've put:

pr_info("!!!cookie=%x complete=%x used=%x\n", 
	host->dma_cookie,chan->completed_cookie,chan->cookie);

It gets called only ONCE, just before failure, which values like this:
	cookie=372 complete=371 used=372

vchan_cyclic_callback() is called only with the soundcard in my machine IMO. I first thought there is regression with vd_completed in pxad_chan_handler() being changed during vchan_cyclic_callback() but it seems not.

pxad_tx_status() returns DMA_IN_PROGRESS before the failure.

>> , where finally with interrupts enabled (can pxamci_irq() be called here?) the
>> callback pxamci_dma_irq() is called.
> When DMA completes, there is a tiny window, before pxamci_dma_irq() is called,
> when pxamci_irq() can be called, yes. As soon as the spinlock is taken in
> pxamci_dma_irq() is taken, no more races.
> 

Does pxamci_dma_irq() depend on pxamci_irq() -> pxamci_data_done() NULLing host->data?

>> From my tests it seems at this point [2] the host->data is always NULL and rest
>> of the callback is never called. It is called once with a nonempty host->data
>> only just before the failure.
>>
>> During the testing I put udelay(100) at the start of pxamci_dma_irq() and fail
>> occurred after like 2 hours (when I for the first time tapped the touchscreen -
>> higher CPU usage and interrupts).
> Mmm I would need more data here.
> 
> The biggest help I could get would be the pxa dma traces here :
> echo -n 'file pxa_dma.c +p' > /sys/kernel/debug/dynamic_debug/control
> echo -n 'file virt-dma.c +p' > /sys/kernel/debug/dynamic_debug/control
> 
> And then capture the last traces and send them to me.

Last traces before the failure are in the attachment. It was copied (by a continual dmesg dump) from a ssh terminal over infraport (/var/log/messages gets corrupted after MMC DMA error).

> 
> Cheers.
>
Robert Jarzmik March 14, 2017, 9:11 p.m. UTC | #2
Petr Cvek <petr.cvek@tul.cz> writes:

Ok Petr, I've been trying for days to reproduce without any luck.

I had a look at your traces, and I'd like something else when it happens :
 1) The patch I provided earlier applied
 2) This done (the 'cat' after the bug) :
	mount -t debugfs none /sys/kernel/debug/
	cat /sys/kernel/debug/pxa-dma.0/channels/4/[sd]*

Cheers.

--
Robert
diff mbox

Patch

diff --git a/drivers/mmc/host/pxamci.c b/drivers/mmc/host/pxamci.c
index c763b404510f..ed3812b2a34d 100644
--- a/drivers/mmc/host/pxamci.c
+++ b/drivers/mmc/host/pxamci.c
@@ -571,8 +571,9 @@  static void pxamci_dma_irq(void *param)
 	if (likely(status == DMA_COMPLETE)) {
 		writel(BUF_PART_FULL, host->base + MMC_PRTBUF);
 	} else {
-		pr_err("%s: DMA error on %s channel\n", mmc_hostname(host->mmc),
-			host->data->flags & MMC_DATA_READ ? "rx" : "tx");
+		pr_err("%s: DMA error on %s channel: %d\n",
+		       mmc_hostname(host->mmc),
+		       host->data->flags & MMC_DATA_READ ? "rx" : "tx", status);
 		host->data->error = -EIO;
 		pxamci_data_done(host, 0);
 	}