diff mbox series

[v2,1/8] mmc: sdhci: Get rid of finish_tasklet

Message ID 20190215192033.24203-2-faiz_abbas@ti.com (mailing list archive)
State New, archived
Headers show
Series Port am335 and am437 devices to sdhci-omap | expand

Commit Message

Faiz Abbas Feb. 15, 2019, 7:20 p.m. UTC
sdhci.c has two bottom halves implemented. A threaded_irq for handling
card insert/remove operations and a tasklet for finishing mmc requests.
With the addition of external dma support, dmaengine APIs need to
terminate in non-atomic context before unmapping the dma buffers.

To facilitate this, remove the finish_tasklet and move the call of
sdhci_request_done() to the threaded_irq() callback. Also move the
interrupt result variable to sdhci_host so it can be populated from
anywhere inside the sdhci_irq handler.

Signed-off-by: Faiz Abbas <faiz_abbas@ti.com>
---
 drivers/mmc/host/sdhci.c | 43 +++++++++++++++-------------------------
 drivers/mmc/host/sdhci.h |  2 +-
 2 files changed, 17 insertions(+), 28 deletions(-)

Comments

Adrian Hunter Feb. 25, 2019, 8:17 a.m. UTC | #1
On 15/02/19 9:20 PM, Faiz Abbas wrote:
> sdhci.c has two bottom halves implemented. A threaded_irq for handling
> card insert/remove operations and a tasklet for finishing mmc requests.
> With the addition of external dma support, dmaengine APIs need to
> terminate in non-atomic context before unmapping the dma buffers.
> 
> To facilitate this, remove the finish_tasklet and move the call of
> sdhci_request_done() to the threaded_irq() callback.

The irq thread has a higher latency than the tasklet.  The performance drop
is measurable on the system I tried:

Before:

# dd if=/dev/mmcblk1 of=/dev/null bs=1G count=1 &
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 4.44502 s, 242 MB/s

After:

# dd if=/dev/mmcblk1 of=/dev/null bs=1G count=1 &
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 4.50898 s, 238 MB/s

So we only want to resort to the thread for the error case.
Faiz Abbas March 6, 2019, 10 a.m. UTC | #2
Adrian,

On 25/02/19 1:47 PM, Adrian Hunter wrote:
> On 15/02/19 9:20 PM, Faiz Abbas wrote:
>> sdhci.c has two bottom halves implemented. A threaded_irq for handling
>> card insert/remove operations and a tasklet for finishing mmc requests.
>> With the addition of external dma support, dmaengine APIs need to
>> terminate in non-atomic context before unmapping the dma buffers.
>>
>> To facilitate this, remove the finish_tasklet and move the call of
>> sdhci_request_done() to the threaded_irq() callback.
> 
> The irq thread has a higher latency than the tasklet.  The performance drop
> is measurable on the system I tried:
> 
> Before:
> 
> # dd if=/dev/mmcblk1 of=/dev/null bs=1G count=1 &
> 1+0 records in
> 1+0 records out
> 1073741824 bytes (1.1 GB) copied, 4.44502 s, 242 MB/s
> 
> After:
> 
> # dd if=/dev/mmcblk1 of=/dev/null bs=1G count=1 &
> 1+0 records in
> 1+0 records out
> 1073741824 bytes (1.1 GB) copied, 4.50898 s, 238 MB/s
> 
> So we only want to resort to the thread for the error case.
> 

Sorry for the late response here, but this is about 1.6% decrease. I
tried out the same commands on a dra7xx board here (with about 5
consecutive dd of 1GB) and the average decrease was 0.3%. I believe you
will also find a lesser percentage change if you average over multiple
dd commands.

Is this really so significant that we have to maintain two different
bottom halves and keep having difficulty with adding APIs that can sleep?

Also I am not sure how to implement only the error handling part in the
threaded_irq. We need to enter sdhci_request_done() and get the current
mrq before we can check for error conditions like I've done in patch 2:

/* Terminate and synchronize dma in case of an error */
if (data && (mrq->cmd->error || data->error) &&
    host->use_external_dma) {
	struct dma_chan *chan = sdhci_external_dma_channel(host, data);
	dmaengine_terminate_sync(chan);
}

On a related note, do we really need to protect everything in
sdhci_request_done() with spinlocks? In patch 2 I have only removed lock
for the terminate_sync() parts that I added but the whole
dma_unmap/dma_sync parts should be left unprotected IMO.

Thanks,
Faiz
Adrian Hunter March 8, 2019, 1:36 p.m. UTC | #3
On 6/03/19 12:00 PM, Faiz Abbas wrote:
> Adrian,
> 
> On 25/02/19 1:47 PM, Adrian Hunter wrote:
>> On 15/02/19 9:20 PM, Faiz Abbas wrote:
>>> sdhci.c has two bottom halves implemented. A threaded_irq for handling
>>> card insert/remove operations and a tasklet for finishing mmc requests.
>>> With the addition of external dma support, dmaengine APIs need to
>>> terminate in non-atomic context before unmapping the dma buffers.
>>>
>>> To facilitate this, remove the finish_tasklet and move the call of
>>> sdhci_request_done() to the threaded_irq() callback.
>>
>> The irq thread has a higher latency than the tasklet.  The performance drop
>> is measurable on the system I tried:
>>
>> Before:
>>
>> # dd if=/dev/mmcblk1 of=/dev/null bs=1G count=1 &
>> 1+0 records in
>> 1+0 records out
>> 1073741824 bytes (1.1 GB) copied, 4.44502 s, 242 MB/s
>>
>> After:
>>
>> # dd if=/dev/mmcblk1 of=/dev/null bs=1G count=1 &
>> 1+0 records in
>> 1+0 records out
>> 1073741824 bytes (1.1 GB) copied, 4.50898 s, 238 MB/s
>>
>> So we only want to resort to the thread for the error case.
>>
> 
> Sorry for the late response here, but this is about 1.6% decrease. I
> tried out the same commands on a dra7xx board here (with about 5
> consecutive dd of 1GB) and the average decrease was 0.3%. I believe you
> will also find a lesser percentage change if you average over multiple
> dd commands.
> 
> Is this really so significant that we have to maintain two different
> bottom halves and keep having difficulty with adding APIs that can sleep?

It is a performance drop that can be avoided, so it might as well be.
Splitting the success path from the failure path is common for I/O drivers
for similar reasons as here: the success path can be optimized whereas the
failure path potentially needs to sleep.

> 
> Also I am not sure how to implement only the error handling part in the
> threaded_irq. We need to enter sdhci_request_done() and get the current
> mrq before we can check for error conditions like I've done in patch 2:
> 
> /* Terminate and synchronize dma in case of an error */
> if (data && (mrq->cmd->error || data->error) &&
>     host->use_external_dma) {
> 	struct dma_chan *chan = sdhci_external_dma_channel(host, data);
> 	dmaengine_terminate_sync(chan);
> }
> 
> On a related note, do we really need to protect everything in
> sdhci_request_done() with spinlocks?
>                                      In patch 2 I have only removed lock
> for the terminate_sync() parts that I added but the whole
> dma_unmap/dma_sync parts should be left unprotected IMO.

As it is written, synchronization is needed to stop the same mrq being
finished twice.

I suggest doing the dmaengine_terminate_sync() before calling
sdhci_request_done().  Perhaps you could record the channel that needs to be
sync'd and then do:

	struct dma_chan *chan = READ_ONCE(host->sync_chan);

	if (chan) {
		dmaengine_terminate_sync(chan);
		host->sync_chan = NULL;
	}
Faiz Abbas March 12, 2019, 5:30 p.m. UTC | #4
Hi Adrian,

On 3/8/2019 7:06 PM, Adrian Hunter wrote:
> On 6/03/19 12:00 PM, Faiz Abbas wrote:
>> Adrian,
>>
>> On 25/02/19 1:47 PM, Adrian Hunter wrote:
>>> On 15/02/19 9:20 PM, Faiz Abbas wrote:
>>>> sdhci.c has two bottom halves implemented. A threaded_irq for handling
>>>> card insert/remove operations and a tasklet for finishing mmc requests.
>>>> With the addition of external dma support, dmaengine APIs need to
>>>> terminate in non-atomic context before unmapping the dma buffers.
>>>>
>>>> To facilitate this, remove the finish_tasklet and move the call of
>>>> sdhci_request_done() to the threaded_irq() callback.
>>>
>>> The irq thread has a higher latency than the tasklet.  The performance drop
>>> is measurable on the system I tried:
>>>
>>> Before:
>>>
>>> # dd if=/dev/mmcblk1 of=/dev/null bs=1G count=1 &
>>> 1+0 records in
>>> 1+0 records out
>>> 1073741824 bytes (1.1 GB) copied, 4.44502 s, 242 MB/s
>>>
>>> After:
>>>
>>> # dd if=/dev/mmcblk1 of=/dev/null bs=1G count=1 &
>>> 1+0 records in
>>> 1+0 records out
>>> 1073741824 bytes (1.1 GB) copied, 4.50898 s, 238 MB/s
>>>
>>> So we only want to resort to the thread for the error case.
>>>
>>
>> Sorry for the late response here, but this is about 1.6% decrease. I
>> tried out the same commands on a dra7xx board here (with about 5
>> consecutive dd of 1GB) and the average decrease was 0.3%. I believe you
>> will also find a lesser percentage change if you average over multiple
>> dd commands.
>>
>> Is this really so significant that we have to maintain two different
>> bottom halves and keep having difficulty with adding APIs that can sleep?
> 
> It is a performance drop that can be avoided, so it might as well be.
> Splitting the success path from the failure path is common for I/O drivers
> for similar reasons as here: the success path can be optimized whereas the
> failure path potentially needs to sleep.

Understood. You wanna keep the success path as fast as possible.
> 
>>
>> Also I am not sure how to implement only the error handling part in the
>> threaded_irq. We need to enter sdhci_request_done() and get the current
>> mrq before we can check for error conditions like I've done in patch 2:
>>
>> /* Terminate and synchronize dma in case of an error */
>> if (data && (mrq->cmd->error || data->error) &&
>>     host->use_external_dma) {
>> 	struct dma_chan *chan = sdhci_external_dma_channel(host, data);
>> 	dmaengine_terminate_sync(chan);
>> }
>>
>> On a related note, do we really need to protect everything in
>> sdhci_request_done() with spinlocks?
>>                                      In patch 2 I have only removed lock
>> for the terminate_sync() parts that I added but the whole
>> dma_unmap/dma_sync parts should be left unprotected IMO.
> 
> As it is written, synchronization is needed to stop the same mrq being
> finished twice.
> 
> I suggest doing the dmaengine_terminate_sync() before calling
> sdhci_request_done().  Perhaps you could record the channel that needs to be
> sync'd and then do:
> 
> 	struct dma_chan *chan = READ_ONCE(host->sync_chan);
> 
> 	if (chan) {
> 		dmaengine_terminate_sync(chan);
> 		host->sync_chan = NULL;
> 	}
> 

Will try this out and send next version.

Thanks,
Faiz
Grygorii Strashko March 14, 2019, 11:15 a.m. UTC | #5
On 12.03.19 19:30, Rizvi, Mohammad Faiz Abbas wrote:
> Hi Adrian,
> 
> On 3/8/2019 7:06 PM, Adrian Hunter wrote:
>> On 6/03/19 12:00 PM, Faiz Abbas wrote:
>>> Adrian,
>>>
>>> On 25/02/19 1:47 PM, Adrian Hunter wrote:
>>>> On 15/02/19 9:20 PM, Faiz Abbas wrote:
>>>>> sdhci.c has two bottom halves implemented. A threaded_irq for handling
>>>>> card insert/remove operations and a tasklet for finishing mmc requests.
>>>>> With the addition of external dma support, dmaengine APIs need to
>>>>> terminate in non-atomic context before unmapping the dma buffers.
>>>>>
>>>>> To facilitate this, remove the finish_tasklet and move the call of
>>>>> sdhci_request_done() to the threaded_irq() callback.
>>>>
>>>> The irq thread has a higher latency than the tasklet.  The performance drop
>>>> is measurable on the system I tried:
>>>>
>>>> Before:
>>>>
>>>> # dd if=/dev/mmcblk1 of=/dev/null bs=1G count=1 &
>>>> 1+0 records in
>>>> 1+0 records out
>>>> 1073741824 bytes (1.1 GB) copied, 4.44502 s, 242 MB/s
>>>>
>>>> After:
>>>>
>>>> # dd if=/dev/mmcblk1 of=/dev/null bs=1G count=1 &
>>>> 1+0 records in
>>>> 1+0 records out
>>>> 1073741824 bytes (1.1 GB) copied, 4.50898 s, 238 MB/s
>>>>
>>>> So we only want to resort to the thread for the error case.
>>>>
>>>
>>> Sorry for the late response here, but this is about 1.6% decrease. I
>>> tried out the same commands on a dra7xx board here (with about 5
>>> consecutive dd of 1GB) and the average decrease was 0.3%. I believe you
>>> will also find a lesser percentage change if you average over multiple
>>> dd commands.
>>>
>>> Is this really so significant that we have to maintain two different
>>> bottom halves and keep having difficulty with adding APIs that can sleep?
>>
>> It is a performance drop that can be avoided, so it might as well be.
>> Splitting the success path from the failure path is common for I/O drivers
>> for similar reasons as here: the success path can be optimized whereas the
>> failure path potentially needs to sleep.
> 
> Understood. You wanna keep the success path as fast as possible.

Sry, I've not completely followed this series, but I'd like to add 5c

It's good thing to get rid of tasklets hence RT Linux kernel is actively moving towards to LKML
and there everything handled in threads (even networking trying to get rid of softirqs).

Performance is pretty relative thing here - just try to run network traffic in parallel, and
there are no control over it comparing to threads. Now way to assign priority or pin to CPU.
Arnd Bergmann March 14, 2019, 11:40 a.m. UTC | #6
On Tue, Mar 12, 2019 at 6:32 PM Rizvi, Mohammad Faiz Abbas
<faiz_abbas@ti.com> wrote:
> On 3/8/2019 7:06 PM, Adrian Hunter wrote:
> > On 6/03/19 12:00 PM, Faiz Abbas wrote:
> > It is a performance drop that can be avoided, so it might as well be.
> > Splitting the success path from the failure path is common for I/O drivers
> > for similar reasons as here: the success path can be optimized whereas the
> > failure path potentially needs to sleep.
>
> Understood. You wanna keep the success path as fast as possible.

I looked at the sdhci_request_done() function and found that almost all
of it is executed inside of a 'spin_lock_irqsave()', including the potentially
expensive dma_sync_single_for_cpu() calls.

This means there is very little benefit in using the tasklet in the first place,
it could just as well run in the hwirq context that triggered it.

The part that is actually run with interrupts enabled in the tasklet
is mmc_blk_cqe_req_done(), but other drivers also call this from
IRQ context.

       Arnd
Faiz Abbas March 14, 2019, 11:41 a.m. UTC | #7
Hi,

On 14/03/19 4:45 PM, Grygorii Strashko wrote:
> 
> 
> On 12.03.19 19:30, Rizvi, Mohammad Faiz Abbas wrote:
>> Hi Adrian,
>>
>> On 3/8/2019 7:06 PM, Adrian Hunter wrote:
>>> On 6/03/19 12:00 PM, Faiz Abbas wrote:
>>>> Adrian,
>>>>
>>>> On 25/02/19 1:47 PM, Adrian Hunter wrote:
>>>>> On 15/02/19 9:20 PM, Faiz Abbas wrote:
>>>>>> sdhci.c has two bottom halves implemented. A threaded_irq for handling
>>>>>> card insert/remove operations and a tasklet for finishing mmc requests.
>>>>>> With the addition of external dma support, dmaengine APIs need to
>>>>>> terminate in non-atomic context before unmapping the dma buffers.
>>>>>>
>>>>>> To facilitate this, remove the finish_tasklet and move the call of
>>>>>> sdhci_request_done() to the threaded_irq() callback.
>>>>>
>>>>> The irq thread has a higher latency than the tasklet.  The performance drop
>>>>> is measurable on the system I tried:
>>>>>
>>>>> Before:
>>>>>
>>>>> # dd if=/dev/mmcblk1 of=/dev/null bs=1G count=1 &
>>>>> 1+0 records in
>>>>> 1+0 records out
>>>>> 1073741824 bytes (1.1 GB) copied, 4.44502 s, 242 MB/s
>>>>>
>>>>> After:
>>>>>
>>>>> # dd if=/dev/mmcblk1 of=/dev/null bs=1G count=1 &
>>>>> 1+0 records in
>>>>> 1+0 records out
>>>>> 1073741824 bytes (1.1 GB) copied, 4.50898 s, 238 MB/s
>>>>>
>>>>> So we only want to resort to the thread for the error case.
>>>>>
>>>>
>>>> Sorry for the late response here, but this is about 1.6% decrease. I
>>>> tried out the same commands on a dra7xx board here (with about 5
>>>> consecutive dd of 1GB) and the average decrease was 0.3%. I believe you
>>>> will also find a lesser percentage change if you average over multiple
>>>> dd commands.
>>>>
>>>> Is this really so significant that we have to maintain two different
>>>> bottom halves and keep having difficulty with adding APIs that can sleep?
>>>
>>> It is a performance drop that can be avoided, so it might as well be.
>>> Splitting the success path from the failure path is common for I/O drivers
>>> for similar reasons as here: the success path can be optimized whereas the
>>> failure path potentially needs to sleep.
>>
>> Understood. You wanna keep the success path as fast as possible.
> 
> Sry, I've not completely followed this series, but I'd like to add 5c
> 
> It's good thing to get rid of tasklets hence RT Linux kernel is actively moving towards to LKML
> and there everything handled in threads (even networking trying to get rid of softirqs).
> 
> Performance is pretty relative thing here - just try to run network traffic in parallel, and
> there are no control over it comparing to threads. Now way to assign priority or pin to CPU.

There is a 2007 LWN article(https://lwn.net/Articles/239633/) which
talks about removing tasklets altogether. I wonder what happened after that.

Thanks,
Faiz
Ulf Hansson March 18, 2019, 9:33 a.m. UTC | #8
+ Arnd, Grygorii

On Fri, 15 Feb 2019 at 20:17, Faiz Abbas <faiz_abbas@ti.com> wrote:
>
> sdhci.c has two bottom halves implemented. A threaded_irq for handling
> card insert/remove operations and a tasklet for finishing mmc requests.
> With the addition of external dma support, dmaengine APIs need to
> terminate in non-atomic context before unmapping the dma buffers.
>
> To facilitate this, remove the finish_tasklet and move the call of
> sdhci_request_done() to the threaded_irq() callback. Also move the
> interrupt result variable to sdhci_host so it can be populated from
> anywhere inside the sdhci_irq handler.
>
> Signed-off-by: Faiz Abbas <faiz_abbas@ti.com>

Adrian, I think it makes sense to apply this patch, even if there is
very minor negative impact throughput wise.

To me, it doesn't seems like MMC/SD/SDIO has good justification for
using tasklets, besides from the legacy point of view, of course.
Instead, I think we should try to move all mmc hosts into using
threaded IRQs.

So, what do you think? Can you overlook the throughput drop and
instead we can try to recover this on top with other optimizations?

Kind regards
Uffe

> ---
>  drivers/mmc/host/sdhci.c | 43 +++++++++++++++-------------------------
>  drivers/mmc/host/sdhci.h |  2 +-
>  2 files changed, 17 insertions(+), 28 deletions(-)
>
> diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
> index eba9bcc92ad3..20ed09b896d7 100644
> --- a/drivers/mmc/host/sdhci.c
> +++ b/drivers/mmc/host/sdhci.c
> @@ -1232,7 +1232,7 @@ static void __sdhci_finish_mrq(struct sdhci_host *host, struct mmc_request *mrq)
>
>         WARN_ON(i >= SDHCI_MAX_MRQS);
>
> -       tasklet_schedule(&host->finish_tasklet);
> +       host->result = IRQ_WAKE_THREAD;
>  }
>
>  static void sdhci_finish_mrq(struct sdhci_host *host, struct mmc_request *mrq)
> @@ -2705,14 +2705,6 @@ static bool sdhci_request_done(struct sdhci_host *host)
>         return false;
>  }
>
> -static void sdhci_tasklet_finish(unsigned long param)
> -{
> -       struct sdhci_host *host = (struct sdhci_host *)param;
> -
> -       while (!sdhci_request_done(host))
> -               ;
> -}
> -
>  static void sdhci_timeout_timer(struct timer_list *t)
>  {
>         struct sdhci_host *host;
> @@ -2995,11 +2987,12 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask)
>
>  static irqreturn_t sdhci_irq(int irq, void *dev_id)
>  {
> -       irqreturn_t result = IRQ_NONE;
>         struct sdhci_host *host = dev_id;
>         u32 intmask, mask, unexpected = 0;
>         int max_loops = 16;
>
> +       host->result = IRQ_NONE;
> +
>         spin_lock(&host->lock);
>
>         if (host->runtime_suspended && !sdhci_sdio_irq_enabled(host)) {
> @@ -3009,7 +3002,7 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
>
>         intmask = sdhci_readl(host, SDHCI_INT_STATUS);
>         if (!intmask || intmask == 0xffffffff) {
> -               result = IRQ_NONE;
> +               host->result = IRQ_NONE;
>                 goto out;
>         }
>
> @@ -3054,7 +3047,7 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
>
>                         host->thread_isr |= intmask & (SDHCI_INT_CARD_INSERT |
>                                                        SDHCI_INT_CARD_REMOVE);
> -                       result = IRQ_WAKE_THREAD;
> +                       host->result = IRQ_WAKE_THREAD;
>                 }
>
>                 if (intmask & SDHCI_INT_CMD_MASK)
> @@ -3074,7 +3067,7 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
>                     (host->ier & SDHCI_INT_CARD_INT)) {
>                         sdhci_enable_sdio_irq_nolock(host, false);
>                         host->thread_isr |= SDHCI_INT_CARD_INT;
> -                       result = IRQ_WAKE_THREAD;
> +                       host->result = IRQ_WAKE_THREAD;
>                 }
>
>                 intmask &= ~(SDHCI_INT_CARD_INSERT | SDHCI_INT_CARD_REMOVE |
> @@ -3087,8 +3080,8 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
>                         sdhci_writel(host, intmask, SDHCI_INT_STATUS);
>                 }
>  cont:
> -               if (result == IRQ_NONE)
> -                       result = IRQ_HANDLED;
> +               if (host->result == IRQ_NONE)
> +                       host->result = IRQ_HANDLED;
>
>                 intmask = sdhci_readl(host, SDHCI_INT_STATUS);
>         } while (intmask && --max_loops);
> @@ -3101,7 +3094,7 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
>                 sdhci_dumpregs(host);
>         }
>
> -       return result;
> +       return host->result;
>  }
>
>  static irqreturn_t sdhci_thread_irq(int irq, void *dev_id)
> @@ -3131,6 +3124,12 @@ static irqreturn_t sdhci_thread_irq(int irq, void *dev_id)
>                 spin_unlock_irqrestore(&host->lock, flags);
>         }
>
> +       if (!isr) {
> +               do {
> +                       isr = !sdhci_request_done(host);
> +               } while (isr);
> +       }
> +
>         return isr ? IRQ_HANDLED : IRQ_NONE;
>  }
>
> @@ -4212,12 +4211,6 @@ int __sdhci_add_host(struct sdhci_host *host)
>         struct mmc_host *mmc = host->mmc;
>         int ret;
>
> -       /*
> -        * Init tasklets.
> -        */
> -       tasklet_init(&host->finish_tasklet,
> -               sdhci_tasklet_finish, (unsigned long)host);
> -
>         timer_setup(&host->timer, sdhci_timeout_timer, 0);
>         timer_setup(&host->data_timer, sdhci_timeout_data_timer, 0);
>
> @@ -4230,7 +4223,7 @@ int __sdhci_add_host(struct sdhci_host *host)
>         if (ret) {
>                 pr_err("%s: Failed to request IRQ %d: %d\n",
>                        mmc_hostname(mmc), host->irq, ret);
> -               goto untasklet;
> +               return ret;
>         }
>
>         ret = sdhci_led_register(host);
> @@ -4263,8 +4256,6 @@ int __sdhci_add_host(struct sdhci_host *host)
>         sdhci_writel(host, 0, SDHCI_INT_ENABLE);
>         sdhci_writel(host, 0, SDHCI_SIGNAL_ENABLE);
>         free_irq(host->irq, host);
> -untasklet:
> -       tasklet_kill(&host->finish_tasklet);
>
>         return ret;
>  }
> @@ -4326,8 +4317,6 @@ void sdhci_remove_host(struct sdhci_host *host, int dead)
>         del_timer_sync(&host->timer);
>         del_timer_sync(&host->data_timer);
>
> -       tasklet_kill(&host->finish_tasklet);
> -
>         if (!IS_ERR(mmc->supply.vqmmc))
>                 regulator_disable(mmc->supply.vqmmc);
>
> diff --git a/drivers/mmc/host/sdhci.h b/drivers/mmc/host/sdhci.h
> index 6cc9a3c2ac66..624d5aa01995 100644
> --- a/drivers/mmc/host/sdhci.h
> +++ b/drivers/mmc/host/sdhci.h
> @@ -554,7 +554,7 @@ struct sdhci_host {
>
>         unsigned int desc_sz;   /* ADMA descriptor size */
>
> -       struct tasklet_struct finish_tasklet;   /* Tasklet structures */
> +       irqreturn_t result;     /* Result of IRQ handler */
>
>         struct timer_list timer;        /* Timer for timeouts */
>         struct timer_list data_timer;   /* Timer for data timeouts */
> --
> 2.19.2
>
Adrian Hunter March 26, 2019, 7:33 a.m. UTC | #9
On 18/03/19 11:33 AM, Ulf Hansson wrote:
> + Arnd, Grygorii
> 
> On Fri, 15 Feb 2019 at 20:17, Faiz Abbas <faiz_abbas@ti.com> wrote:
>>
>> sdhci.c has two bottom halves implemented. A threaded_irq for handling
>> card insert/remove operations and a tasklet for finishing mmc requests.
>> With the addition of external dma support, dmaengine APIs need to
>> terminate in non-atomic context before unmapping the dma buffers.
>>
>> To facilitate this, remove the finish_tasklet and move the call of
>> sdhci_request_done() to the threaded_irq() callback. Also move the
>> interrupt result variable to sdhci_host so it can be populated from
>> anywhere inside the sdhci_irq handler.
>>
>> Signed-off-by: Faiz Abbas <faiz_abbas@ti.com>
> 
> Adrian, I think it makes sense to apply this patch, even if there is
> very minor negative impact throughput wise.
> 
> To me, it doesn't seems like MMC/SD/SDIO has good justification for
> using tasklets, besides from the legacy point of view, of course.
> Instead, I think we should try to move all mmc hosts into using
> threaded IRQs.
> 
> So, what do you think? Can you overlook the throughput drop and
> instead we can try to recover this on top with other optimizations?

I tend to favour good results as expressed here:

	https://lkml.org/lkml/2007/6/22/360

So I want to do optimization first.

But performance is not the only problem with the patch.  Give me a few
days and I will see what I can come up with.

> 
> Kind regards
> Uffe
> 
>> ---
>>  drivers/mmc/host/sdhci.c | 43 +++++++++++++++-------------------------
>>  drivers/mmc/host/sdhci.h |  2 +-
>>  2 files changed, 17 insertions(+), 28 deletions(-)
>>
>> diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
>> index eba9bcc92ad3..20ed09b896d7 100644
>> --- a/drivers/mmc/host/sdhci.c
>> +++ b/drivers/mmc/host/sdhci.c
>> @@ -1232,7 +1232,7 @@ static void __sdhci_finish_mrq(struct sdhci_host *host, struct mmc_request *mrq)
>>
>>         WARN_ON(i >= SDHCI_MAX_MRQS);
>>
>> -       tasklet_schedule(&host->finish_tasklet);
>> +       host->result = IRQ_WAKE_THREAD;
>>  }
>>
>>  static void sdhci_finish_mrq(struct sdhci_host *host, struct mmc_request *mrq)
>> @@ -2705,14 +2705,6 @@ static bool sdhci_request_done(struct sdhci_host *host)
>>         return false;
>>  }
>>
>> -static void sdhci_tasklet_finish(unsigned long param)
>> -{
>> -       struct sdhci_host *host = (struct sdhci_host *)param;
>> -
>> -       while (!sdhci_request_done(host))
>> -               ;
>> -}
>> -
>>  static void sdhci_timeout_timer(struct timer_list *t)
>>  {
>>         struct sdhci_host *host;
>> @@ -2995,11 +2987,12 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask)
>>
>>  static irqreturn_t sdhci_irq(int irq, void *dev_id)
>>  {
>> -       irqreturn_t result = IRQ_NONE;
>>         struct sdhci_host *host = dev_id;
>>         u32 intmask, mask, unexpected = 0;
>>         int max_loops = 16;
>>
>> +       host->result = IRQ_NONE;
>> +
>>         spin_lock(&host->lock);
>>
>>         if (host->runtime_suspended && !sdhci_sdio_irq_enabled(host)) {
>> @@ -3009,7 +3002,7 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
>>
>>         intmask = sdhci_readl(host, SDHCI_INT_STATUS);
>>         if (!intmask || intmask == 0xffffffff) {
>> -               result = IRQ_NONE;
>> +               host->result = IRQ_NONE;
>>                 goto out;
>>         }
>>
>> @@ -3054,7 +3047,7 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
>>
>>                         host->thread_isr |= intmask & (SDHCI_INT_CARD_INSERT |
>>                                                        SDHCI_INT_CARD_REMOVE);
>> -                       result = IRQ_WAKE_THREAD;
>> +                       host->result = IRQ_WAKE_THREAD;
>>                 }
>>
>>                 if (intmask & SDHCI_INT_CMD_MASK)
>> @@ -3074,7 +3067,7 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
>>                     (host->ier & SDHCI_INT_CARD_INT)) {
>>                         sdhci_enable_sdio_irq_nolock(host, false);
>>                         host->thread_isr |= SDHCI_INT_CARD_INT;
>> -                       result = IRQ_WAKE_THREAD;
>> +                       host->result = IRQ_WAKE_THREAD;
>>                 }
>>
>>                 intmask &= ~(SDHCI_INT_CARD_INSERT | SDHCI_INT_CARD_REMOVE |
>> @@ -3087,8 +3080,8 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
>>                         sdhci_writel(host, intmask, SDHCI_INT_STATUS);
>>                 }
>>  cont:
>> -               if (result == IRQ_NONE)
>> -                       result = IRQ_HANDLED;
>> +               if (host->result == IRQ_NONE)
>> +                       host->result = IRQ_HANDLED;
>>
>>                 intmask = sdhci_readl(host, SDHCI_INT_STATUS);
>>         } while (intmask && --max_loops);
>> @@ -3101,7 +3094,7 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
>>                 sdhci_dumpregs(host);
>>         }
>>
>> -       return result;
>> +       return host->result;
>>  }
>>
>>  static irqreturn_t sdhci_thread_irq(int irq, void *dev_id)
>> @@ -3131,6 +3124,12 @@ static irqreturn_t sdhci_thread_irq(int irq, void *dev_id)
>>                 spin_unlock_irqrestore(&host->lock, flags);
>>         }
>>
>> +       if (!isr) {
>> +               do {
>> +                       isr = !sdhci_request_done(host);
>> +               } while (isr);
>> +       }
>> +
>>         return isr ? IRQ_HANDLED : IRQ_NONE;
>>  }
>>
>> @@ -4212,12 +4211,6 @@ int __sdhci_add_host(struct sdhci_host *host)
>>         struct mmc_host *mmc = host->mmc;
>>         int ret;
>>
>> -       /*
>> -        * Init tasklets.
>> -        */
>> -       tasklet_init(&host->finish_tasklet,
>> -               sdhci_tasklet_finish, (unsigned long)host);
>> -
>>         timer_setup(&host->timer, sdhci_timeout_timer, 0);
>>         timer_setup(&host->data_timer, sdhci_timeout_data_timer, 0);
>>
>> @@ -4230,7 +4223,7 @@ int __sdhci_add_host(struct sdhci_host *host)
>>         if (ret) {
>>                 pr_err("%s: Failed to request IRQ %d: %d\n",
>>                        mmc_hostname(mmc), host->irq, ret);
>> -               goto untasklet;
>> +               return ret;
>>         }
>>
>>         ret = sdhci_led_register(host);
>> @@ -4263,8 +4256,6 @@ int __sdhci_add_host(struct sdhci_host *host)
>>         sdhci_writel(host, 0, SDHCI_INT_ENABLE);
>>         sdhci_writel(host, 0, SDHCI_SIGNAL_ENABLE);
>>         free_irq(host->irq, host);
>> -untasklet:
>> -       tasklet_kill(&host->finish_tasklet);
>>
>>         return ret;
>>  }
>> @@ -4326,8 +4317,6 @@ void sdhci_remove_host(struct sdhci_host *host, int dead)
>>         del_timer_sync(&host->timer);
>>         del_timer_sync(&host->data_timer);
>>
>> -       tasklet_kill(&host->finish_tasklet);
>> -
>>         if (!IS_ERR(mmc->supply.vqmmc))
>>                 regulator_disable(mmc->supply.vqmmc);
>>
>> diff --git a/drivers/mmc/host/sdhci.h b/drivers/mmc/host/sdhci.h
>> index 6cc9a3c2ac66..624d5aa01995 100644
>> --- a/drivers/mmc/host/sdhci.h
>> +++ b/drivers/mmc/host/sdhci.h
>> @@ -554,7 +554,7 @@ struct sdhci_host {
>>
>>         unsigned int desc_sz;   /* ADMA descriptor size */
>>
>> -       struct tasklet_struct finish_tasklet;   /* Tasklet structures */
>> +       irqreturn_t result;     /* Result of IRQ handler */
>>
>>         struct timer_list timer;        /* Timer for timeouts */
>>         struct timer_list data_timer;   /* Timer for data timeouts */
>> --
>> 2.19.2
>>
>
Faiz Abbas April 2, 2019, 7:59 a.m. UTC | #10
Hi Adrian,

On 26/03/19 1:03 PM, Adrian Hunter wrote:
> On 18/03/19 11:33 AM, Ulf Hansson wrote:
>> + Arnd, Grygorii
>>
>> On Fri, 15 Feb 2019 at 20:17, Faiz Abbas <faiz_abbas@ti.com> wrote:
>>>
>>> sdhci.c has two bottom halves implemented. A threaded_irq for handling
>>> card insert/remove operations and a tasklet for finishing mmc requests.
>>> With the addition of external dma support, dmaengine APIs need to
>>> terminate in non-atomic context before unmapping the dma buffers.
>>>
>>> To facilitate this, remove the finish_tasklet and move the call of
>>> sdhci_request_done() to the threaded_irq() callback. Also move the
>>> interrupt result variable to sdhci_host so it can be populated from
>>> anywhere inside the sdhci_irq handler.
>>>
>>> Signed-off-by: Faiz Abbas <faiz_abbas@ti.com>
>>
>> Adrian, I think it makes sense to apply this patch, even if there is
>> very minor negative impact throughput wise.
>>
>> To me, it doesn't seems like MMC/SD/SDIO has good justification for
>> using tasklets, besides from the legacy point of view, of course.
>> Instead, I think we should try to move all mmc hosts into using
>> threaded IRQs.
>>
>> So, what do you think? Can you overlook the throughput drop and
>> instead we can try to recover this on top with other optimizations?
> 
> I tend to favour good results as expressed here:
> 
> 	https://lkml.org/lkml/2007/6/22/360
> 
> So I want to do optimization first.
> 
> But performance is not the only problem with the patch.  Give me a few
> days and I will see what I can come up with.
> 

Gentle ping on this.

Thanks,
Faiz
Adrian Hunter April 2, 2019, 1:12 p.m. UTC | #11
On 2/04/19 10:59 AM, Faiz Abbas wrote:
> Hi Adrian,
> 
> On 26/03/19 1:03 PM, Adrian Hunter wrote:
>> On 18/03/19 11:33 AM, Ulf Hansson wrote:
>>> + Arnd, Grygorii
>>>
>>> On Fri, 15 Feb 2019 at 20:17, Faiz Abbas <faiz_abbas@ti.com> wrote:
>>>>
>>>> sdhci.c has two bottom halves implemented. A threaded_irq for handling
>>>> card insert/remove operations and a tasklet for finishing mmc requests.
>>>> With the addition of external dma support, dmaengine APIs need to
>>>> terminate in non-atomic context before unmapping the dma buffers.
>>>>
>>>> To facilitate this, remove the finish_tasklet and move the call of
>>>> sdhci_request_done() to the threaded_irq() callback. Also move the
>>>> interrupt result variable to sdhci_host so it can be populated from
>>>> anywhere inside the sdhci_irq handler.
>>>>
>>>> Signed-off-by: Faiz Abbas <faiz_abbas@ti.com>
>>>
>>> Adrian, I think it makes sense to apply this patch, even if there is
>>> very minor negative impact throughput wise.
>>>
>>> To me, it doesn't seems like MMC/SD/SDIO has good justification for
>>> using tasklets, besides from the legacy point of view, of course.
>>> Instead, I think we should try to move all mmc hosts into using
>>> threaded IRQs.
>>>
>>> So, what do you think? Can you overlook the throughput drop and
>>> instead we can try to recover this on top with other optimizations?
>>
>> I tend to favour good results as expressed here:
>>
>> 	https://lkml.org/lkml/2007/6/22/360
>>
>> So I want to do optimization first.
>>
>> But performance is not the only problem with the patch.  Give me a few
>> days and I will see what I can come up with.
>>
> 
> Gentle ping on this.

Working on it :-)
diff mbox series

Patch

diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index eba9bcc92ad3..20ed09b896d7 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -1232,7 +1232,7 @@  static void __sdhci_finish_mrq(struct sdhci_host *host, struct mmc_request *mrq)
 
 	WARN_ON(i >= SDHCI_MAX_MRQS);
 
-	tasklet_schedule(&host->finish_tasklet);
+	host->result = IRQ_WAKE_THREAD;
 }
 
 static void sdhci_finish_mrq(struct sdhci_host *host, struct mmc_request *mrq)
@@ -2705,14 +2705,6 @@  static bool sdhci_request_done(struct sdhci_host *host)
 	return false;
 }
 
-static void sdhci_tasklet_finish(unsigned long param)
-{
-	struct sdhci_host *host = (struct sdhci_host *)param;
-
-	while (!sdhci_request_done(host))
-		;
-}
-
 static void sdhci_timeout_timer(struct timer_list *t)
 {
 	struct sdhci_host *host;
@@ -2995,11 +2987,12 @@  static void sdhci_data_irq(struct sdhci_host *host, u32 intmask)
 
 static irqreturn_t sdhci_irq(int irq, void *dev_id)
 {
-	irqreturn_t result = IRQ_NONE;
 	struct sdhci_host *host = dev_id;
 	u32 intmask, mask, unexpected = 0;
 	int max_loops = 16;
 
+	host->result = IRQ_NONE;
+
 	spin_lock(&host->lock);
 
 	if (host->runtime_suspended && !sdhci_sdio_irq_enabled(host)) {
@@ -3009,7 +3002,7 @@  static irqreturn_t sdhci_irq(int irq, void *dev_id)
 
 	intmask = sdhci_readl(host, SDHCI_INT_STATUS);
 	if (!intmask || intmask == 0xffffffff) {
-		result = IRQ_NONE;
+		host->result = IRQ_NONE;
 		goto out;
 	}
 
@@ -3054,7 +3047,7 @@  static irqreturn_t sdhci_irq(int irq, void *dev_id)
 
 			host->thread_isr |= intmask & (SDHCI_INT_CARD_INSERT |
 						       SDHCI_INT_CARD_REMOVE);
-			result = IRQ_WAKE_THREAD;
+			host->result = IRQ_WAKE_THREAD;
 		}
 
 		if (intmask & SDHCI_INT_CMD_MASK)
@@ -3074,7 +3067,7 @@  static irqreturn_t sdhci_irq(int irq, void *dev_id)
 		    (host->ier & SDHCI_INT_CARD_INT)) {
 			sdhci_enable_sdio_irq_nolock(host, false);
 			host->thread_isr |= SDHCI_INT_CARD_INT;
-			result = IRQ_WAKE_THREAD;
+			host->result = IRQ_WAKE_THREAD;
 		}
 
 		intmask &= ~(SDHCI_INT_CARD_INSERT | SDHCI_INT_CARD_REMOVE |
@@ -3087,8 +3080,8 @@  static irqreturn_t sdhci_irq(int irq, void *dev_id)
 			sdhci_writel(host, intmask, SDHCI_INT_STATUS);
 		}
 cont:
-		if (result == IRQ_NONE)
-			result = IRQ_HANDLED;
+		if (host->result == IRQ_NONE)
+			host->result = IRQ_HANDLED;
 
 		intmask = sdhci_readl(host, SDHCI_INT_STATUS);
 	} while (intmask && --max_loops);
@@ -3101,7 +3094,7 @@  static irqreturn_t sdhci_irq(int irq, void *dev_id)
 		sdhci_dumpregs(host);
 	}
 
-	return result;
+	return host->result;
 }
 
 static irqreturn_t sdhci_thread_irq(int irq, void *dev_id)
@@ -3131,6 +3124,12 @@  static irqreturn_t sdhci_thread_irq(int irq, void *dev_id)
 		spin_unlock_irqrestore(&host->lock, flags);
 	}
 
+	if (!isr) {
+		do {
+			isr = !sdhci_request_done(host);
+		} while (isr);
+	}
+
 	return isr ? IRQ_HANDLED : IRQ_NONE;
 }
 
@@ -4212,12 +4211,6 @@  int __sdhci_add_host(struct sdhci_host *host)
 	struct mmc_host *mmc = host->mmc;
 	int ret;
 
-	/*
-	 * Init tasklets.
-	 */
-	tasklet_init(&host->finish_tasklet,
-		sdhci_tasklet_finish, (unsigned long)host);
-
 	timer_setup(&host->timer, sdhci_timeout_timer, 0);
 	timer_setup(&host->data_timer, sdhci_timeout_data_timer, 0);
 
@@ -4230,7 +4223,7 @@  int __sdhci_add_host(struct sdhci_host *host)
 	if (ret) {
 		pr_err("%s: Failed to request IRQ %d: %d\n",
 		       mmc_hostname(mmc), host->irq, ret);
-		goto untasklet;
+		return ret;
 	}
 
 	ret = sdhci_led_register(host);
@@ -4263,8 +4256,6 @@  int __sdhci_add_host(struct sdhci_host *host)
 	sdhci_writel(host, 0, SDHCI_INT_ENABLE);
 	sdhci_writel(host, 0, SDHCI_SIGNAL_ENABLE);
 	free_irq(host->irq, host);
-untasklet:
-	tasklet_kill(&host->finish_tasklet);
 
 	return ret;
 }
@@ -4326,8 +4317,6 @@  void sdhci_remove_host(struct sdhci_host *host, int dead)
 	del_timer_sync(&host->timer);
 	del_timer_sync(&host->data_timer);
 
-	tasklet_kill(&host->finish_tasklet);
-
 	if (!IS_ERR(mmc->supply.vqmmc))
 		regulator_disable(mmc->supply.vqmmc);
 
diff --git a/drivers/mmc/host/sdhci.h b/drivers/mmc/host/sdhci.h
index 6cc9a3c2ac66..624d5aa01995 100644
--- a/drivers/mmc/host/sdhci.h
+++ b/drivers/mmc/host/sdhci.h
@@ -554,7 +554,7 @@  struct sdhci_host {
 
 	unsigned int desc_sz;	/* ADMA descriptor size */
 
-	struct tasklet_struct finish_tasklet;	/* Tasklet structures */
+	irqreturn_t result;	/* Result of IRQ handler */
 
 	struct timer_list timer;	/* Timer for timeouts */
 	struct timer_list data_timer;	/* Timer for data timeouts */