mmc: dw_mmc: Wait for data transfer after response errors
diff mbox

Message ID 1431964402-27457-1-git-send-email-dianders@chromium.org
State New
Headers show

Commit Message

Douglas Anderson May 18, 2015, 3:53 p.m. UTC
According to the DesignWare state machine description, after we get a
"response error" or "response CRC error" we move into data transfer
mode.  That means that we don't necessarily need to special case
trying to deal with the failure right away.  We can wait until we are
notified that the data transfer is complete (with or without errors)
and then we can deal with the failure.

It may sound strange to defer dealing with a command that we know will
fail anyway, but this appears to fix a bug.  During tuning (CMD19) on
a specific card on an rk3288-based system, we found that we could get
a "response CRC error".  Sending the stop command after the "response
CRC error" would then throw the system into a confused state causing
all future tuning phases to report failure.

When in the confused state, the controller would show these (hex codes
are interrupt status register):
 CMD ERR: 0x00000046 (cmd=19)
 CMD ERR: 0x0000004e (cmd=12)
 DATA ERR: 0x00000208
 DATA ERR: 0x0000020c
 CMD ERR: 0x00000104 (cmd=19)
 CMD ERR: 0x00000104 (cmd=12)
 DATA ERR: 0x00000208
 DATA ERR: 0x0000020c
 ...
 ...

It is inherently difficult to deal with the complexity of trying to
correctly send a stop command while a data transfer is taking place
since you need to deal with different corner cases caused by the fact
that the data transfer could complete (with errors or without errors)
during various places in sending the stop command (dw_mci_stop_dma,
send_stop_abort, etc)

Instead of adding a bunch of extra complexity to deal with this, it
seems much simpler to just use the more straightforward (and less
error-prone) path of letting the data transfer finish.  There
shouldn't be any huge benefit to sending the stop command slightly
earlier, anyway.

Signed-off-by: Doug Anderson <dianders@chromium.org>
---
 drivers/mmc/host/dw_mmc.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

Comments

Alim Akhtar May 26, 2015, 6:02 p.m. UTC | #1
Hi Doug,
On peach-pi, I got a hung task once in 4 cold boot as [1].
And every time got a hung task [2] on suspend/resume, triggered
exactly from this change. I have a debug print at $SUBJECT change.
[1]:
----------------
on boot:
[  240.197190] INFO: task kworker/u16:1:50 blocked for more than 120 seconds.
[  240.202602]       Not tainted 4.1.0-rc5-00001-g862e58a-dirty #6
[  240.208505] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  240.216307] kworker/u16:1   D c0505528     0    50      2 0x00000000
[  240.222650] Workqueue: kmmcd mmc_rescan
[  240.226448] [<c0505528>] (__schedule) from [<c0505924>] (schedule+0x34/0x98)
[  240.233496] [<c0505924>] (schedule) from [<c0509274>]
(schedule_timeout+0x110/0x164)
[  240.241207] [<c0509274>] (schedule_timeout) from [<c0506258>]
(wait_for_common+0xb8/0x14c)
[  240.249454] [<c0506258>] (wait_for_common) from [<c03b2b5c>]
(mmc_wait_for_req+0xb0/0x13c)
[  240.257691] [<c03b2b5c>] (mmc_wait_for_req) from [<c03b8684>]
(mmc_send_tuning+0x140/0x1b0)
[  240.266026] [<c03b8684>] (mmc_send_tuning) from [<c03cd45c>]
(dw_mci_exynos_execute_tuning+0x70/0x174)
[  240.275299] [<c03cd45c>] (dw_mci_exynos_execute_tuning) from
[<c03c97f0>] (dw_mci_execute_tuning+0x2c/0x38)
[  240.285021] [<c03c97f0>] (dw_mci_execute_tuning) from [<c03b3e90>]
(mmc_execute_tuning+0x30/0x60)
[  240.293864] [<c03b3e90>] (mmc_execute_tuning) from [<c03b6fe4>]
(mmc_init_card+0x8e0/0x1648)
[  240.302283] [<c03b6fe4>] (mmc_init_card) from [<c03b801c>]
(mmc_attach_mmc+0x90/0x15c)
[  240.310172] [<c03b801c>] (mmc_attach_mmc) from [<c03b4cf4>]
(mmc_rescan+0x29c/0x2e4)
[  240.317903] [<c03b4cf4>] (mmc_rescan) from [<c00342a0>]
(process_one_work+0x120/0x324)
[  240.325788] [<c00342a0>] (process_one_work) from [<c0034500>]
(worker_thread+0x28/0x490)
[  240.333861] [<c0034500>] (worker_thread) from [<c0039324>]
(kthread+0xd8/0xf4)
[  240.341058] [<c0039324>] (kthread) from [<c000f5a0>]
(ret_from_fork+0x14/0x34)
-----------


[2]: On s2r cycle:
========
[   71.107181] mmc_host mmc0: Bus speed (slot 0) = 50000000Hz (slot
req 400000Hz, actual 396825HZ div = 63)
[   71.147363] mmc_host mmc0: Bus speed (slot 0) = 200000000Hz (slot
req 200000000Hz, actual 200000000HZ div = 0)


[   71.155946] dwmmc_exynos 12200000.mmc: [ALIM] :
dw_mci_tasklet_func: 1593: cmd->data and err


[  240.197177] INFO: task sh:1645 blocked for more than 120 seconds.
[  240.201802]       Not tainted 4.1.0-rc5-00001-g862e58a-dirty #6
[  240.207713] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  240.215518] sh              D c0505528     0  1645      1 0x00000000
[  240.221857] [<c0505528>] (__schedule) from [<c0505924>] (schedule+0x34/0x98)
[  240.228887] [<c0505924>] (schedule) from [<c0509274>]
(schedule_timeout+0x110/0x164)
[  240.236606] [<c0509274>] (schedule_timeout) from [<c0506258>]
(wait_for_common+0xb8/0x14c)
[  240.244851] [<c0506258>] (wait_for_common) from [<c03b2b5c>]
(mmc_wait_for_req+0xb0/0x13c)
[  240.253090] [<c03b2b5c>] (mmc_wait_for_req) from [<c03b8684>]
(mmc_send_tuning+0x140/0x1b0)
[  240.261417] [<c03b8684>] (mmc_send_tuning) from [<c03cd45c>]
(dw_mci_exynos_execute_tuning+0x70/0x174)
[  240.270699] [<c03cd45c>] (dw_mci_exynos_execute_tuning) from
[<c03c97f0>] (dw_mci_execute_tuning+0x2c/0x38)
[  240.280415] [<c03c97f0>] (dw_mci_execute_tuning) from [<c03b3e90>]
(mmc_execute_tuning+0x30/0x60)
[  240.289263] [<c03b3e90>] (mmc_execute_tuning) from [<c03b6fe4>]
(mmc_init_card+0x8e0/0x1648)
[  240.297675] [<c03b6fe4>] (mmc_init_card) from [<c03b7d98>]
(_mmc_resume+0x4c/0x78)
[  240.305222] [<c03b7d98>] (_mmc_resume) from [<c03b7e8c>]
(mmc_resume+0x1c/0x58)
[  240.312508] [<c03b7e8c>] (mmc_resume) from [<c03b5128>]
(mmc_bus_resume+0x1c/0x50)
[  240.320060] [<c03b5128>] (mmc_bus_resume) from [<c02c0008>]
(dpm_run_callback.isra.8+0x1c/0x48)
[  240.328736] [<c02c0008>] (dpm_run_callback.isra.8) from
[<c02c00e0>] (device_resume+0xac/0x180)
[  240.337436] [<c02c00e0>] (device_resume) from [<c02c1358>]
(dpm_resume+0xe8/0x210)
[  240.344957] [<c02c1358>] (dpm_resume) from [<c02c161c>]
(dpm_resume_end+0xc/0x18)
[  240.352419] [<c02c161c>] (dpm_resume_end) from [<c00577f8>]
(suspend_devices_and_enter+0x10c/0x3e4)
[  240.361437] [<c00577f8>] (suspend_devices_and_enter) from
[<c0057c38>] (pm_suspend+0x168/0x208)
[  240.370110] [<c0057c38>] (pm_suspend) from [<c00568e0>]
(state_store+0x6c/0xbc)
[  240.377400] [<c00568e0>] (state_store) from [<c01f6ac8>]
(kobj_attr_store+0x14/0x20)
[  240.385123] [<c01f6ac8>] (kobj_attr_store) from [<c012215c>]
(sysfs_kf_write+0x44/0x48)
[  240.393103] [<c012215c>] (sysfs_kf_write) from [<c01217d0>]
(kernfs_fop_write+0xb8/0x194)
[  240.401257] [<c01217d0>] (kernfs_fop_write) from [<c00c91a4>]
(__vfs_write+0x2c/0xd8)
[  240.409064] [<c00c91a4>] (__vfs_write) from [<c00ca05c>]
(vfs_write+0x90/0x14c)
[  240.416350] [<c00ca05c>] (vfs_write) from [<c00ca2e4>] (SyS_write+0x40/0x8c)
[  240.423378] [<c00ca2e4>] (SyS_write) from [<c000f500>]
(ret_fast_syscall+0x0/0x34)
====

I was checking on v4.1-rc5, git hash as below:

862e58a mmc: dw_mmc: Wait for data transfer after response errors
ba155e2 Linux 4.1-rc5
5b13966

Not sure if I missed any dependent patch??
Have not checked the dw TRM for this change, will do that as soon as I
get access to it.

Regards,
Alim


On Mon, May 18, 2015 at 9:23 PM, Doug Anderson <dianders@chromium.org> wrote:
> According to the DesignWare state machine description, after we get a
> "response error" or "response CRC error" we move into data transfer
> mode.  That means that we don't necessarily need to special case
> trying to deal with the failure right away.  We can wait until we are
> notified that the data transfer is complete (with or without errors)
> and then we can deal with the failure.
>
> It may sound strange to defer dealing with a command that we know will
> fail anyway, but this appears to fix a bug.  During tuning (CMD19) on
> a specific card on an rk3288-based system, we found that we could get
> a "response CRC error".  Sending the stop command after the "response
> CRC error" would then throw the system into a confused state causing
> all future tuning phases to report failure.
>
> When in the confused state, the controller would show these (hex codes
> are interrupt status register):
>  CMD ERR: 0x00000046 (cmd=19)
>  CMD ERR: 0x0000004e (cmd=12)
>  DATA ERR: 0x00000208
>  DATA ERR: 0x0000020c
>  CMD ERR: 0x00000104 (cmd=19)
>  CMD ERR: 0x00000104 (cmd=12)
>  DATA ERR: 0x00000208
>  DATA ERR: 0x0000020c
>  ...
>  ...
>
> It is inherently difficult to deal with the complexity of trying to
> correctly send a stop command while a data transfer is taking place
> since you need to deal with different corner cases caused by the fact
> that the data transfer could complete (with errors or without errors)
> during various places in sending the stop command (dw_mci_stop_dma,
> send_stop_abort, etc)
>
> Instead of adding a bunch of extra complexity to deal with this, it
> seems much simpler to just use the more straightforward (and less
> error-prone) path of letting the data transfer finish.  There
> shouldn't be any huge benefit to sending the stop command slightly
> earlier, anyway.
>
> Signed-off-by: Doug Anderson <dianders@chromium.org>
> ---
>  drivers/mmc/host/dw_mmc.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
>
> diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
> index 5f5adaf..c081ce2 100644
> --- a/drivers/mmc/host/dw_mmc.c
> +++ b/drivers/mmc/host/dw_mmc.c
> @@ -1574,6 +1574,26 @@ static void dw_mci_tasklet_func(unsigned long priv)
>                         }
>
>                         if (cmd->data && err) {
> +                               /*
> +                                * Controller will move into a data transfer
> +                                * state after a response error or response CRC
> +                                * error.  Let's let that finish before trying
> +                                * to send a stop, so we'll go to
> +                                * STATE_SENDING_DATA.
> +                                *
> +                                * Although letting the data transfer take place
> +                                * will waste a bit of time (we already know
> +                                * the command was bad), it can't cause any
> +                                * errors since it's possible it would have
> +                                * taken place anyway if this tasklet got
> +                                * delayed.  Allowing the transfer to take place
> +                                * avoids races and keeps things simple.
> +                                */
> +                               if (err != -ETIMEDOUT) {
> +                                       state = STATE_SENDING_DATA;
> +                                       continue;
> +                               }
> +
>                                 dw_mci_stop_dma(host);
>                                 send_stop_abort(host, data);
>                                 state = STATE_SENDING_STOP;
> --
> 2.2.0.rc0.207.ga3a616c
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Douglas Anderson May 26, 2015, 8:44 p.m. UTC | #2
Alim,

On Tue, May 26, 2015 at 11:02 AM, Alim Akhtar <alim.akhtar@gmail.com> wrote:
> Hi Doug,
> On peach-pi, I got a hung task once in 4 cold boot as [1].

OK, I'll have to get my peach-pi or peach-pit up and running again.  I
ran out of desk space and I haven't gotten it set back up.  :(

I've been testing on an rk3288-based device.  Past experience has
taught me that the rk3288 dw_mmc works differently than the exynos
one, so perhaps this is a difference.

Could you possibly patch in something like
<https://chromium-review.googlesource.com/#/c/244347/> and provide the
console for the failure?  I'll put it on my list to try this myself,
too


> I was checking on v4.1-rc5, git hash as below:
>
> 862e58a mmc: dw_mmc: Wait for data transfer after response errors
> ba155e2 Linux 4.1-rc5
> 5b13966
>
> Not sure if I missed any dependent patch??

I'm currently testing out of tree, but my dw_mmc is very close to mainline.


> Have not checked the dw TRM for this change, will do that as soon as I
> get access to it.

OK, sounds good.  I have some old version of the DesignWare TRM, so
possibly something is different in the newer one...


-Doug
--
To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jaehoon Chung May 27, 2015, 1:53 a.m. UTC | #3
Hi, All.

Thanks for your effort and sorry for late!

On 05/27/2015 05:44 AM, Doug Anderson wrote:
> Alim,
> 
> On Tue, May 26, 2015 at 11:02 AM, Alim Akhtar <alim.akhtar@gmail.com> wrote:
>> Hi Doug,
>> On peach-pi, I got a hung task once in 4 cold boot as [1].
> 
> OK, I'll have to get my peach-pi or peach-pit up and running again.  I
> ran out of desk space and I haven't gotten it set back up.  :(
> 
> I've been testing on an rk3288-based device.  Past experience has
> taught me that the rk3288 dw_mmc works differently than the exynos
> one, so perhaps this is a difference.
> 
> Could you possibly patch in something like
> <https://chromium-review.googlesource.com/#/c/244347/> and provide the
> console for the failure?  I'll put it on my list to try this myself,
> too

I don't test with this..but if needs, i will test.
It needs to fix the Alim's problem.

> 
> 
>> I was checking on v4.1-rc5, git hash as below:
>>
>> 862e58a mmc: dw_mmc: Wait for data transfer after response errors
>> ba155e2 Linux 4.1-rc5
>> 5b13966
>>
>> Not sure if I missed any dependent patch??
> 
> I'm currently testing out of tree, but my dw_mmc is very close to mainline.
> 
> 
>> Have not checked the dw TRM for this change, will do that as soon as I
>> get access to it.
> 
> OK, sounds good.  I have some old version of the DesignWare TRM, so
> possibly something is different in the newer one...

Which version do you have? I also need to check the TRM version.

Best Regards,
Jaehoon Chung

> 
> 
> -Doug
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Douglas Anderson May 27, 2015, 4:52 p.m. UTC | #4
Hi,

On Tue, May 26, 2015 at 6:53 PM, Jaehoon Chung <jh80.chung@samsung.com> wrote:
>>> Have not checked the dw TRM for this change, will do that as soon as I
>>> get access to it.
>>
>> OK, sounds good.  I have some old version of the DesignWare TRM, so
>> possibly something is different in the newer one...
>
> Which version do you have? I also need to check the TRM version.

It is ancient.  It says 2.41a July 2011.  If anyone has the ability to
give me a newer version, I wouldn't object.

-Doug
--
To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
index 5f5adaf..c081ce2 100644
--- a/drivers/mmc/host/dw_mmc.c
+++ b/drivers/mmc/host/dw_mmc.c
@@ -1574,6 +1574,26 @@  static void dw_mci_tasklet_func(unsigned long priv)
 			}
 
 			if (cmd->data && err) {
+				/*
+				 * Controller will move into a data transfer
+				 * state after a response error or response CRC
+				 * error.  Let's let that finish before trying
+				 * to send a stop, so we'll go to
+				 * STATE_SENDING_DATA.
+				 *
+				 * Although letting the data transfer take place
+				 * will waste a bit of time (we already know
+				 * the command was bad), it can't cause any
+				 * errors since it's possible it would have
+				 * taken place anyway if this tasklet got
+				 * delayed.  Allowing the transfer to take place
+				 * avoids races and keeps things simple.
+				 */
+				if (err != -ETIMEDOUT) {
+					state = STATE_SENDING_DATA;
+					continue;
+				}
+
 				dw_mci_stop_dma(host);
 				send_stop_abort(host, data);
 				state = STATE_SENDING_STOP;