diff mbox

dmatest regression in 3.10-rc1

Message ID 20130517123423.GR14863@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Vinod Koul May 17, 2013, 12:34 p.m. UTC
On Thu, May 16, 2013 at 04:35:53PM +0100, Will Deacon wrote:
> On Wed, May 15, 2013 at 04:28:03PM +0100, Will Deacon wrote:
> > I've been observing a regression in the dmatest module with 3.10-rc1. It
> > manifests as either:
> > 
> >  - a spurious timeout on one or more of the channel threads
> >  - a complete kernel lockup (loss of console)
> >  - a panic (see below, noting that the callback [dmatest_callback] is
> >    dereferencing a NULL pointer)
> > 
> > If I revert 77101ce578bb ("dmatest: cancel thread immediately when asked
> > for") then things are rosy again, but I'm not sure if this is hiding another
> > problem.
> 
> Right, so I think I understand what's causing this, but I'll leave it to
> Andriy to suggest a fix. The problem comes about because the dmatest
> module is now driven from debugfs, making it possible to unload the module
> whilst a test run is in progress. In this case:
> 
> 	- The DMA threads will return from wait_event_freezable_timeout(...)
> 	  due to kthread_should_stop() returning true, and subsequently
> 	  report failure because done.done is false.
> 
> 	- The DMA engines may not be idle, so the asynchronous callback can
> 	  be invoked after we've started cleaning up, explaining the NULL
> 	  dereference I'm seeing.
> 
> The solutions are either fixing the module exit code to cope with concurrent
> DMA transfers or to revert 77101ce578bb and not allow the channel threads to
> return mid-transfer.
We need to properly abort the channels on removal. This is already handled in
the code but the kthread_stop is called after the transactions are aborted. It
should be the other way round. Can you try with below patch

---

--
~Vinod

Comments

Will Deacon May 17, 2013, 2:18 p.m. UTC | #1
Hi Vinod,

Thanks for the reply.

On Fri, May 17, 2013 at 01:34:23PM +0100, Vinod Koul wrote:
> On Thu, May 16, 2013 at 04:35:53PM +0100, Will Deacon wrote:
> > Right, so I think I understand what's causing this, but I'll leave it to
> > Andriy to suggest a fix. The problem comes about because the dmatest
> > module is now driven from debugfs, making it possible to unload the module
> > whilst a test run is in progress. In this case:
> > 
> > 	- The DMA threads will return from wait_event_freezable_timeout(...)
> > 	  due to kthread_should_stop() returning true, and subsequently
> > 	  report failure because done.done is false.
> > 
> > 	- The DMA engines may not be idle, so the asynchronous callback can
> > 	  be invoked after we've started cleaning up, explaining the NULL
> > 	  dereference I'm seeing.
> > 
> > The solutions are either fixing the module exit code to cope with concurrent
> > DMA transfers or to revert 77101ce578bb and not allow the channel threads to
> > return mid-transfer.
> We need to properly abort the channels on removal. This is already handled in
> the code but the kthread_stop is called after the transactions are aborted. It
> should be the other way round. Can you try with below patch

Unfortunately, I can trigger the exact same panic with this patch applied.

Isn't there a race between terminating the dmaengine transfers
(dmaengine_terminate_all) and killing the test threads (kthread_stop) where
a new transfer could be kicked off by dmatest_func?

Will
Andy Shevchenko May 20, 2013, 7:52 a.m. UTC | #2
On Fri, 2013-05-17 at 15:18 +0100, Will Deacon wrote: 
> On Fri, May 17, 2013 at 01:34:23PM +0100, Vinod Koul wrote:
> > On Thu, May 16, 2013 at 04:35:53PM +0100, Will Deacon wrote:
> > > Right, so I think I understand what's causing this, but I'll leave it to
> > > Andriy to suggest a fix.

Thanks for the report.

Unfortunately neither me nor other guys in the team could not reproduce
mentioned issue on our hardware.
Nevertheless I will try to allocate time and suggest possible fix if any
comes to my mind. So, your testing and analysis is appreciated.
Will Deacon May 20, 2013, 9:58 a.m. UTC | #3
Hi Andy,

On Mon, May 20, 2013 at 08:52:34AM +0100, Andy Shevchenko wrote:
> On Fri, 2013-05-17 at 15:18 +0100, Will Deacon wrote: 
> > On Fri, May 17, 2013 at 01:34:23PM +0100, Vinod Koul wrote:
> > > On Thu, May 16, 2013 at 04:35:53PM +0100, Will Deacon wrote:
> > > > Right, so I think I understand what's causing this, but I'll leave it to
> > > > Andriy to suggest a fix.
> 
> Thanks for the report.
> 
> Unfortunately neither me nor other guys in the team could not reproduce
> mentioned issue on our hardware.
> Nevertheless I will try to allocate time and suggest possible fix if any
> comes to my mind. So, your testing and analysis is appreciated.

Sure. They way I trigger it is to unload the module whilst the DMA is
ongoing (I'm using a software model, so I can make the DMA nice and slow --
I guess you could try using some large buffers).

My script (I just run from busybox) is:

  #!/bin/sh

  mount proc -t proc /proc
  mount sysfs -t sysfs /sys
  mount debugfs -t debugfs /sys/kernel/debug

  echo 9 > /proc/sysrq-trigger

  modprobe dmatest
  echo 8 > /sys/kernel/debug/dmatest/iterations
  echo 8 > /sys/kernel/debug/dmatest/threads_per_chan
  echo 131072 > /sys/kernel/debug/dmatest/test_buf_size
  echo 1 > /sys/kernel/debug/dmatest/run
  modprobe -r dmatest

Will
Andy Shevchenko May 21, 2013, 12:31 p.m. UTC | #4
On Mon, 2013-05-20 at 10:58 +0100, Will Deacon wrote: 
> Sure. They way I trigger it is to unload the module whilst the DMA is
> ongoing (I'm using a software model, so I can make the DMA nice and slow --
> I guess you could try using some large buffers).

Thank you for the script. It looks like I managed to reproduce it with
help of your script. I'm about to send the patch. I will appreciate to
collect a Tested-by tags in case it solves the issue.
diff mbox

Patch

diff --git a/drivers/dma/dmatest.c b/drivers/dma/dmatest.c
index d8ce4ec..4e8d581 100644
--- a/drivers/dma/dmatest.c
+++ b/drivers/dma/dmatest.c
@@ -822,6 +822,9 @@  static void dmatest_cleanup_channel(struct dmatest_chan *dtc)
 	struct dmatest_thread	*_thread;
 	int			ret;
 
+	/* terminate all transfers on specified channels */
+	dmaengine_terminate_all(dtc->chan);
+
 	list_for_each_entry_safe(thread, _thread, &dtc->threads, node) {
 		ret = kthread_stop(thread->task);
 		pr_debug("dmatest: thread %s exited with status %d\n",
@@ -830,9 +833,6 @@  static void dmatest_cleanup_channel(struct dmatest_chan *dtc)
 		kfree(thread);
 	}
 
-	/* terminate all transfers on specified channels */
-	dmaengine_terminate_all(dtc->chan);
-
 	kfree(dtc);
 }