Serious memory leak in TI EDMA driver (drivers/dma/edma.c)

Message ID	550C27C3.10406@ti.com (mailing list archive)
State	New, archived
Headers
Return-Path: <linux-omap-owner@kernel.org>
Message-ID: <550C27C3.10406@ti.com>
Date: Fri, 20 Mar 2015 15:59:31 +0200
From: Peter Ujfalusi <peter.ujfalusi@ti.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0
MIME-Version: 1.0
To: Petr Kulhavy <petr@barix.com>, <linux-omap@vger.kernel.org>
Subject: Re: Serious memory leak in TI EDMA driver (drivers/dma/edma.c)
References: <55072E56.7050802@barix.com> <5508205D.7010106@ti.com> <55087A3A.6070300@barix.com> <55098414.9020301@ti.com> <5509EF23.8080404@barix.com>
In-Reply-To: <5509EF23.8080404@barix.com>
Content-Type: text/plain; charset="iso-8859-2"
Content-Transfer-Encoding: 8bit
Sender: linux-omap-owner@vger.kernel.org
Precedence: bulk


Commit Message

Peter Ujfalusi March 20, 2015, 1:59 p.m. UTC

Petr,

On 03/18/2015 11:33 PM, Petr Kulhavy wrote:
> Hi Peter,
> 
> Yes, we do not use DT.
> Our board design is very close to the da850evm reference board. So if you have
> a chance of getting hold of it you could try on that one.

I have been trying to reproduce this on my OMAP-L138-EVM (da850evm) but was
not able to.

There is a big difference between your setup and mine: the MMC on my EVM is
connected to the MMC/SD0 interface, while in your setup it seems to be
connected to MMC/SD1.
The DMA requests for MMC/SD0 are eDMA CC0 16/17, while those for MMC/SD1 are
eDMA CC1 28/29.
So they are handled by different channel controllers. I would not rule out
that the support for the second CC has issues.
The information from your logs also points in this direction:
You are writing to the MMC, so CC1 ch29 is active most of the time.
In your log the channel number is sometimes 65565 ((1 << 16) | 29) and
sometimes 29. In my case it is 17 every time.
I think there is something funny in how these channels on CC1 are
working, and this might be causing the leak for you.
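
For reference, the two channel numbers seen in the log can be decoded as
follows. This is only a minimal sketch: it assumes the controller index sits
in the bits above bit 15, as the ((1 << 16) | 29) value suggests, and the
helper names are made up for illustration rather than taken from the kernel.

```c
#include <assert.h>

/* Hypothetical helpers (not the kernel's macros): pack a channel
 * controller index and a channel number into one integer, with the
 * controller index in the bits above bit 15. */
static unsigned int ctlr_chan(unsigned int ctlr, unsigned int chan)
{
	assert(chan <= 0xffff);	/* channel must fit in the low 16 bits */
	return (ctlr << 16) | chan;
}

/* Recover the controller index from a packed value. */
static unsigned int ctlr_of(unsigned int id)
{
	return id >> 16;
}

/* Recover the channel number from a packed value. */
static unsigned int chan_of(unsigned int id)
{
	return id & 0xffff;
}
```

With this encoding, channel 29 on CC1 packs to 65565, while a bare 29 decodes
as channel 29 on CC0, which would explain why both values show up in the log
for the same transfer direction.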

Unfortunately I do not have HW where I could use any channel on CC1 so I can
not debug this further, but I'll look at the code to see if anything obvious
stands out.
And something does stand out:
arch/arm/common/edma.c: dma_irq_handler()
It calls the callback with a wrong channel number since it does not take
into account the different CC.
Can you try something like the patch below to see if it helps:

Comments

Petr Kulhavy March 20, 2015, 9:53 p.m. UTC | #1

Hi Peter,

Yes, one of the differences from the EVM is that the SD card is on
MMCSD1. This is due to the pin multiplexer and other peripherals we're
using.

Your patch is correct; however, edma_callback() uses the channel
number parameter for debug messages only. For the actual work
echan->ch_num is used. BTW, is that correct? Could there be a mismatch
between echan->ch_num and the ch_num parameter?

Something else seems to be odd in edma_alloc_chan_resources():

/* Alloc channel resources */
static int edma_alloc_chan_resources(struct dma_chan *chan)
{
         struct edma_chan *echan = to_edma_chan(chan);
         struct device *dev = chan->device->dev;
         int ret;
         int a_ch_num;
         LIST_HEAD(descs);

         a_ch_num = edma_alloc_channel(echan->ch_num, edma_callback,
                                         chan, EVENTQ_DEFAULT);


The third parameter to edma_alloc_channel() should be echan, not chan, 
since the edma_callback() interprets the callback data parameter as 
struct edma_chan *.

Let me know if you find something or if you have any advice for further
debugging.

Cheers
Petr




Peter Ujfalusi March 23, 2015, 3:28 p.m. UTC | #2

On 03/20/2015 11:53 PM, Petr Kulhavy wrote:
> Hi Peter,
> 
> yes, that is one of the differences to the EVM that the SD card is on MMCSD1.
> This is due to the pin multiplexer and other peripherals we're using.
> 
> Your patch is correct, however the edma_callback()  is using the channel
> number parameter for debug messages only. For the actual work echan->ch_num is
> used. BTW is that correct? Could there be a mismatch between the echan->ch_num
> and the ch_num parameter?
> 
> Something else seems to be odd in edma_alloc_chan_resources():
> 
> /* Alloc channel resources */
> static int edma_alloc_chan_resources(struct dma_chan *chan)
> {
>         struct edma_chan *echan = to_edma_chan(chan);
>         struct device *dev = chan->device->dev;
>         int ret;
>         int a_ch_num;
>         LIST_HEAD(descs);
> 
>         a_ch_num = edma_alloc_channel(echan->ch_num, edma_callback,
>                                         chan, EVENTQ_DEFAULT);

Hah, yes, this is wrong, but it has worked fine so far because:
struct edma_chan {
	struct virt_dma_chan		vchan;
...
};

struct virt_dma_chan {
	struct dma_chan	chan;
...
};

so &edma_chan == &edma_chan.vchan.chan;

But this is not why you see the leak.
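
The aliasing argument above can be checked in isolation. The sketch below
uses cut-down stand-ins for the three structures (only the nesting is kept;
the extra members and the helper names chan_to_echan() and callback_ch_num()
are assumptions for illustration), and shows both the accidental cast and a
conversion that does not depend on the member being first.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; only the nesting matters. */
struct dma_chan { int chan_id; };
struct virt_dma_chan { struct dma_chan chan; int pad; };
struct edma_chan { struct virt_dma_chan vchan; int ch_num; };

/* Conversion in the spirit of container_of(): subtract the offset of
 * the embedded dma_chan from its address. Works even if the embedded
 * member is later moved away from the start of the structure. */
static struct edma_chan *chan_to_echan(struct dma_chan *chan)
{
	size_t off = offsetof(struct edma_chan, vchan) +
		     offsetof(struct virt_dma_chan, chan);

	return (struct edma_chan *)((char *)chan - off);
}

/* What a callback does when it treats its data pointer as an edma_chan. */
static int callback_ch_num(void *data)
{
	struct edma_chan *echan = data;

	assert(echan != NULL);
	return echan->ch_num;
}
```

Passing &echan->vchan.chan to callback_ch_num() gives the right answer only
because both offsets are zero; passing echan itself, or converting with
chan_to_echan(), stays correct if the layout ever changes.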

On my board I have swapped cc0 and cc1 in SW, so mmc0 is now on cc1 from the
SW point of view, but I still cannot reproduce the issue.

> 
> The third parameter to edma_alloc_channel() should be echan, not chan, since
> the edma_callback() interprets the callback data parameter as struct edma_chan *.
> 
> Let me know if you find something or if you have an advice for more debug.

From the log I would go and see what happens in
vchan_get_all_descriptors() and vchan_dma_desc_free_list(): is the list we
get at terminate_all time empty? And if so, why?

It seems you got the terminate_all call before the transfer finished, which is
also interesting. I'll look at this more deeply.
What I see is that at terminate_all we clear echan->edesc and for some
reason the vchan code does not free the desc. Later the completion interrupt
comes, but since echan->edesc is NULL we do nothing. This causes the leak.

The question in all of this is why, and how to reproduce it?

Petr Kulhavy March 23, 2015, 3:45 p.m. UTC | #3

Hi Peter,

I've just posted a patch to the community, you should have received 
another email from me just a few minutes ago.
Basically there should be a kfree in terminate_all() just before 
echan->edesc is set to NULL.
The reason is that at that point the edesc is in none of the vchan lists 
(edma_execute() removes it from the "issued" list)
and vchan_get_all_descriptors() doesn't find it.
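
The situation described above can be modelled in user space. The sketch below
is a toy, not the driver: the structures, the fixed-size pool that stands in
for kmalloc()/kfree(), and all function names are assumptions made for
illustration. It only shows that a descriptor taken off the "issued" list by
execute() is no longer reachable through that list when terminate_all() runs.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4

/* Toy descriptor: "in_use" stands in for kmalloc()/kfree() bookkeeping. */
struct desc {
	struct desc *next;
	int in_use;
};

/* Toy channel: one "issued" list plus the descriptor being executed. */
struct chan {
	struct desc pool[POOL_SIZE];
	struct desc *issued;
	struct desc *edesc;
};

static struct desc *desc_alloc(struct chan *c)
{
	for (int i = 0; i < POOL_SIZE; i++) {
		if (!c->pool[i].in_use) {
			c->pool[i].in_use = 1;
			c->pool[i].next = NULL;
			return &c->pool[i];
		}
	}
	return NULL;
}

static void desc_free(struct desc *d)
{
	assert(d->in_use);	/* catch double frees in the model */
	d->in_use = 0;
}

/* Number of descriptors that were allocated and never freed. */
static int live_descs(const struct chan *c)
{
	int n = 0;

	for (int i = 0; i < POOL_SIZE; i++)
		n += c->pool[i].in_use;
	return n;
}

/* Queue a new descriptor at the tail of the issued list. */
static int submit(struct chan *c)
{
	struct desc *d = desc_alloc(c);
	struct desc **tail = &c->issued;

	if (!d)
		return -1;
	while (*tail)
		tail = &(*tail)->next;
	*tail = d;
	return 0;
}

/* Start the next descriptor: it leaves the issued list at this point. */
static void execute(struct chan *c)
{
	if (c->edesc || !c->issued)
		return;
	c->edesc = c->issued;
	c->issued = c->edesc->next;
	c->edesc->next = NULL;
}

/* Free what is still on the issued list and, optionally, the active one. */
static void terminate_all(struct chan *c, int free_active)
{
	if (free_active && c->edesc)
		desc_free(c->edesc);	/* the proposed extra free */
	c->edesc = NULL;
	while (c->issued) {
		struct desc *d = c->issued;

		c->issued = d->next;
		desc_free(d);
	}
}
```

After two submit() calls and one execute(), terminate_all(c, 0) leaves one
descriptor allocated with nothing pointing to it, while terminate_all(c, 1)
leaves none. Whether the active descriptor should instead stay on one of the
lists is a separate design question that the model does not settle.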

With the extra kfree the mem leak is not seen any more.
It's a good question why terminate_all is called before the transfer 
finished, but I thought it could be called at any time?

I'm also not sure whether echan->edesc shouldn't be freed in
edma_execute() as well. Could you please check that?

static void edma_execute(struct edma_chan *echan)
{
     struct virt_dma_desc *vdesc;
     struct edma_desc *edesc;
     struct device *dev = echan->vchan.chan.device->dev;
     int i, j, left, nslots;

     /* If either we processed all psets or we're still not started */
     if (!echan->edesc ||
         echan->edesc->pset_nr == echan->edesc->processed) {
         /* Get next vdesc */
         vdesc = vchan_next_desc(&echan->vchan);
         if (!vdesc) {
             dev_dbg(dev, "Setting edesc 0x%p to NULL\n",echan->edesc);
             echan->edesc = NULL;
#warning "possible memory leak here ?"
             return;
         }
         list_del(&vdesc->node);        // at this point node was in issued
         echan->edesc = to_edma_desc(&vdesc->tx);
     }

I'll post another patch for the wrong type in edma_alloc_chan_resources()


Regards
Petr




Peter Ujfalusi March 24, 2015, 12:59 p.m. UTC | #4

On 03/23/2015 05:45 PM, Petr Kulhavy wrote:
> Hi Peter,
> 
> I've just posted a patch to the community, you should have received another
> email from me just a few minutes ago.
> Basically there should be a kfree in terminate_all() just before echan->edesc
> is set to NULL.
> The reason is that at that point the edesc is in none of the vchan lists
> (edma_execute() removes it from the "issued" list)
> and vchan_get_all_descriptors() doesn't find it.

And this is the main issue in my view. The desc should be in one of the lists
IMHO.

> With the extra kfree the mem leak is not seen any more.
> It's a good question why terminate_all is called before the transfer finished,
> but I thought it could be called at any time?
> 
> I'm also not sure if the echan->edesc shouldn't be free also in
> edma_execute(), could you please check that? :
> 
> static void edma_execute(struct edma_chan *echan)
> {
>     struct virt_dma_desc *vdesc;
>     struct edma_desc *edesc;
>     struct device *dev = echan->vchan.chan.device->dev;
>     int i, j, left, nslots;
> 
>     /* If either we processed all psets or we're still not started */
>     if (!echan->edesc ||
>         echan->edesc->pset_nr == echan->edesc->processed) {
>         /* Get next vdesc */
>         vdesc = vchan_next_desc(&echan->vchan);
>         if (!vdesc) {
>             dev_dbg(dev, "Setting edesc 0x%p to NULL\n",echan->edesc);
>             echan->edesc = NULL;
> #warning "possible memory leak here ?"

Yes, it would be, but in fact this will never happen since the edma_execute()
will not be called when echan->edesc is not NULL _and_ (echan->edesc->pset_nr
== echan->edesc->processed).

If we have echan->edesc, we are in the middle of a transfer, moving from one
batch of sg to the next.

I do have cleanup patches for this part to clarify the situation.

>             return;
>         }
>         list_del(&vdesc->node);        // at this point node was in issued
>         echan->edesc = to_edma_desc(&vdesc->tx);
>     }
> 
> I'll post another patch for the wrong type in edma_alloc_chan_resources()
> 
> 
> Regards
> Petr
> 
> 

diff --git a/arch/arm/common/edma.c b/arch/arm/common/edma.c
index 6c49887d326e..e1d413c61e9e 100644
--- a/arch/arm/common/edma.c
+++ b/arch/arm/common/edma.c
@@ -405,7 +405,8 @@  static irqreturn_t dma_irq_handler(int irq, void *data)
 					BIT(slot));
 			if (edma_cc[ctlr]->intr_data[channel].callback)
 				edma_cc[ctlr]->intr_data[channel].callback(
-					channel, EDMA_DMA_COMPLETE,
+					EDMA_CTLR_CHAN(ctlr, channel),
+					EDMA_DMA_COMPLETE,
 					edma_cc[ctlr]->intr_data[channel].data);
 		}
 	} while (sh_ipr);
