diff mbox

[v2,1/2] mmc: core: Fix CMD6 timeout issue

Message ID 1483433397-11756-2-git-send-email-yong.mao@mediatek.com (mailing list archive)
State New, archived
Headers show

Commit Message

Yong Mao Jan. 3, 2017, 8:49 a.m. UTC
From: yong mao <yong.mao@mediatek.com>

When initializing EMMC, after switch to HS400,
it will issue CMD6 to change ext_csd,
if first CMD6 got CRC error,
the repeat CMD6 may get timeout,
that's because card is not back to transfer state immediately.

For resolving this issue, it need check if card is busy
before sending repeat CMD6.

Not only CMD6 here has this issue, but also other R1B CMD has
the same issue.

Signed-off-by: Yong Mao <yong.mao@mediatek.com>
Signed-off-by: Chaotian Jing <chaotian.jing@mediatek.com>
---
 drivers/mmc/core/core.c |   19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

Comments

Ulf Hansson Jan. 12, 2017, 3:21 p.m. UTC | #1
- trimmed cc-list

On 3 January 2017 at 09:49, Yong Mao <yong.mao@mediatek.com> wrote:
> From: yong mao <yong.mao@mediatek.com>
>
> When initializing EMMC, after switch to HS400,
> it will issue CMD6 to change ext_csd,
> if first CMD6 got CRC error,
> the repeat CMD6 may get timeout,
> that's because card is not back to transfer state immediately.
>
> For resolving this issue, it need check if card is busy
> before sending repeat CMD6.

I agree that doing retries in this path might may not be correctly
done, but currently we do it as best effort.

We should probably change this to *not* retry the CMD6 command, in
cases when we have a MMC_RSP_R1B response, because of the reasons you
describe.
I get the feeling that these retry attempts for CMD6, may instead hide
other real issues and making it harder to narrow them down. Don't you
think?

This leads to my following question:
Why do you get a CRC error at the first CMD6 attempt? That shouldn't
happen, right?

Perhaps you can elaborate on what of the CMD6 commands in the HS400
enabling sequence that fails. It may help us to understand, perhaps
there may be something in that sequence that should be changed.

>
> Not only CMD6 here has this issue, but also other R1B CMD has
> the same issue.

Yes, agree!

However, can you please try to point out some other commands than CM6
that you see uses *retries*, has R1B response, and which you believe
may not be properly managed.

Dealing with R1B responses isn't always straight forward. Therefore I
am wondering whether we perhaps should just not allow "automatic
retries" in cases when R1B responses is used.

The reason why I think that is easier, is because of the complexity we
have when dealing with R1B responses.

As for example the timeout may differ depending on the command, so
just guessing that 500 ms might work, isn't good enough. Moreover we
would need to deal with MMC_CAP_WAIT_WHILE_BUSY, etc. Currently I am
not saying that we shouldn't do this, but then I first need to
understand how big of problem this is.

>
> Signed-off-by: Yong Mao <yong.mao@mediatek.com>
> Signed-off-by: Chaotian Jing <chaotian.jing@mediatek.com>
> ---
>  drivers/mmc/core/core.c |   19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
>
> diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
> index 1076b9d..8674dbb 100644
> --- a/drivers/mmc/core/core.c
> +++ b/drivers/mmc/core/core.c
> @@ -566,6 +566,25 @@ void mmc_wait_for_req_done(struct mmc_host *host, struct mmc_request *mrq)
>
>                 mmc_retune_recheck(host);
>
> +               /*
> +                * If a R1B CMD such as CMD6 occur CRC error,
> +                * it will retry 3 times here.
> +                * But before retrying, it must ensure card is in
> +                * transfer state.
> +                * Otherwise, the next retried CMD will got TMO error.
> +                */
> +               if (mmc_resp_type(cmd) == MMC_RSP_R1B && host->ops->card_busy) {
> +                       int tries = 500; /* Wait aprox 500ms at maximum */
> +
> +                       while (host->ops->card_busy(host) && --tries)
> +                               mmc_delay(1);
> +
> +                       if (tries == 0) {
> +                               cmd->error = -EBUSY;
> +                               break;
> +                       }
> +               }
> +
>                 pr_debug("%s: req failed (CMD%u): %d, retrying...\n",
>                          mmc_hostname(host), cmd->opcode, cmd->error);
>                 cmd->retries--;
> --
> 1.7.9.5
>

Kind regards
Uffe
Shawn Lin Jan. 13, 2017, 3:35 a.m. UTC | #2
On 2017/1/12 23:21, Ulf Hansson wrote:
> - trimmed cc-list
>
> On 3 January 2017 at 09:49, Yong Mao <yong.mao@mediatek.com> wrote:
>> From: yong mao <yong.mao@mediatek.com>
>>
>> When initializing EMMC, after switch to HS400,
>> it will issue CMD6 to change ext_csd,
>> if first CMD6 got CRC error,
>> the repeat CMD6 may get timeout,
>> that's because card is not back to transfer state immediately.
>>
>> For resolving this issue, it need check if card is busy
>> before sending repeat CMD6.
>
> I agree that doing retries in this path might may not be correctly
> done, but currently we do it as best effort.
>
> We should probably change this to *not* retry the CMD6 command, in
> cases when we have a MMC_RSP_R1B response, because of the reasons you
> describe.
> I get the feeling that these retry attempts for CMD6, may instead hide
> other real issues and making it harder to narrow them down. Don't you
> think?
>
> This leads to my following question:
> Why do you get a CRC error at the first CMD6 attempt? That shouldn't
> happen, right?
>
> Perhaps you can elaborate on what of the CMD6 commands in the HS400
> enabling sequence that fails. It may help us to understand, perhaps
> there may be something in that sequence that should be changed.
>
>>
>> Not only CMD6 here has this issue, but also other R1B CMD has
>> the same issue.
>
> Yes, agree!
>
> However, can you please try to point out some other commands than CM6
> that you see uses *retries*, has R1B response, and which you believe
> may not be properly managed.
>
> Dealing with R1B responses isn't always straight forward. Therefore I
> am wondering whether we perhaps should just not allow "automatic
> retries" in cases when R1B responses is used.
>
> The reason why I think that is easier, is because of the complexity we
> have when dealing with R1B responses.
>

I'm just thinking a interesting question: What will happen if someone
uses a userspace tool to switch the partition to RPMB when we are in
command queue mode? It will fails to finish CMD6 and the emmc should be
busy for a while. So now we shouldn't retry CMD6 but need to send a
CMD13 to make sure it's back to transfer state OR a HPI to break it. We
should be able to cover these cases not only from kernel context but
also the interaction between user and kernel.

> As for example the timeout may differ depending on the command, so
> just guessing that 500 ms might work, isn't good enough. Moreover we
> would need to deal with MMC_CAP_WAIT_WHILE_BUSY, etc. Currently I am
> not saying that we shouldn't do this, but then I first need to
> understand how big of problem this is.
>
>>
>> Signed-off-by: Yong Mao <yong.mao@mediatek.com>
>> Signed-off-by: Chaotian Jing <chaotian.jing@mediatek.com>
>> ---
>>  drivers/mmc/core/core.c |   19 +++++++++++++++++++
>>  1 file changed, 19 insertions(+)
>>
>> diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
>> index 1076b9d..8674dbb 100644
>> --- a/drivers/mmc/core/core.c
>> +++ b/drivers/mmc/core/core.c
>> @@ -566,6 +566,25 @@ void mmc_wait_for_req_done(struct mmc_host *host, struct mmc_request *mrq)
>>
>>                 mmc_retune_recheck(host);
>>
>> +               /*
>> +                * If a R1B CMD such as CMD6 occur CRC error,
>> +                * it will retry 3 times here.
>> +                * But before retrying, it must ensure card is in
>> +                * transfer state.
>> +                * Otherwise, the next retried CMD will got TMO error.
>> +                */
>> +               if (mmc_resp_type(cmd) == MMC_RSP_R1B && host->ops->card_busy) {
>> +                       int tries = 500; /* Wait aprox 500ms at maximum */
>> +
>> +                       while (host->ops->card_busy(host) && --tries)
>> +                               mmc_delay(1);
>> +
>> +                       if (tries == 0) {
>> +                               cmd->error = -EBUSY;
>> +                               break;
>> +                       }
>> +               }
>> +
>>                 pr_debug("%s: req failed (CMD%u): %d, retrying...\n",
>>                          mmc_hostname(host), cmd->opcode, cmd->error);
>>                 cmd->retries--;
>> --
>> 1.7.9.5
>>
>
> Kind regards
> Uffe
> --
> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
Yong Mao Jan. 19, 2017, 8:58 a.m. UTC | #3
On Thu, 2017-01-12 at 16:21 +0100, Ulf Hansson wrote:
> - trimmed cc-list
> 
> On 3 January 2017 at 09:49, Yong Mao <yong.mao@mediatek.com> wrote:
> > From: yong mao <yong.mao@mediatek.com>
> >
> > When initializing EMMC, after switch to HS400,
> > it will issue CMD6 to change ext_csd,
> > if first CMD6 got CRC error,
> > the repeat CMD6 may get timeout,
> > that's because card is not back to transfer state immediately.
> >
> > For resolving this issue, it need check if card is busy
> > before sending repeat CMD6.
> 
> I agree that doing retries in this path might may not be correctly
> done, but currently we do it as best effort.
> 
> We should probably change this to *not* retry the CMD6 command, in
> cases when we have a MMC_RSP_R1B response, because of the reasons you
> describe.
> I get the feeling that these retry attempts for CMD6, may instead hide
> other real issues and making it harder to narrow them down. Don't you
> think?

> This leads to my following question:
> Why do you get a CRC error at the first CMD6 attempt? That shouldn't
> happen, right?
> 
Host needs to ensure that there is no CRC error when sampling the
signal from device. However, in the high frequency mode, it is
inevitable that there will be CRC error.
Because of the presence of CRC error, we have the re-tune mechanism in
the mmc core layer to handle.
But in this case, it will occur CMD timeout error after re-sending the
R1B CMD. Currently, we still do not have mechanism to handle this case
in the mmc core layer.

> Perhaps you can elaborate on what of the CMD6 commands in the HS400
> enabling sequence that fails. It may help us to understand, perhaps
> there may be something in that sequence that should be changed.
> 
Sorry. Maybe I didn't describe this issue clearly.
The CMD6 command issue in my commit message is only a specific example
of this type of issue we actually encounter.

I describe the issue step by step as below:
Step 1: Send a R1B command and then get command CRC error.
Step 2: According to the current logic of mmc core layer code,
        it will resend this R1B command again.
Step 3: Because card may be in the busy state while resending this 
	R1B command, card will not response this R1B command.
	And then this R1B command will fail with the reason 
	of timeout.   

> >
> > Not only CMD6 here has this issue, but also other R1B CMD has
> > the same issue.
> 
> Yes, agree!
> 
> However, can you please try to point out some other commands than CM6
> that you see uses *retries*, has R1B response, and which you believe
> may not be properly managed.
> 
Actually, we just met this CMD6 timeout issue in our project. 
However, according to the analysis, all R1b command may have the same
issue.

> Dealing with R1B responses isn't always straight forward. Therefore I
> am wondering whether we perhaps should just not allow "automatic
> retries" in cases when R1B responses is used.
> 
> The reason why I think that is easier, is because of the complexity we
> have when dealing with R1B responses.
We think "automatic retries" should still be used in this case.
With "automatic retries", it have re-tune mechanism to let host
re-select a better parameters to handle command CRC error. 

> 
> As for example the timeout may differ depending on the command, so
> just guessing that 500 ms might work, isn't good enough. Moreover we
> would need to deal with MMC_CAP_WAIT_WHILE_BUSY, etc. Currently I am
> not saying that we shouldn't do this, but then I first need to
> understand how big of problem this is.
> 
Yes. You are right. We need to deal with MMC_CAP_WAIT_WHILE_BUSY here.
If the host has the capability of MMC_CAP_WAIT_WHILE_BUSY, it does not
need this patch. But the host, like our host, does not have this
capability, it needs this patch. I will add this logic in next version.
According to our experience, 500ms is enough here. 

> >
> > Signed-off-by: Yong Mao <yong.mao@mediatek.com>
> > Signed-off-by: Chaotian Jing <chaotian.jing@mediatek.com>
> > ---
> >  drivers/mmc/core/core.c |   19 +++++++++++++++++++
> >  1 file changed, 19 insertions(+)
> >
> > diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
> > index 1076b9d..8674dbb 100644
> > --- a/drivers/mmc/core/core.c
> > +++ b/drivers/mmc/core/core.c
> > @@ -566,6 +566,25 @@ void mmc_wait_for_req_done(struct mmc_host *host, struct mmc_request *mrq)
> >
> >                 mmc_retune_recheck(host);
> >
> > +               /*
> > +                * If a R1B CMD such as CMD6 occur CRC error,
> > +                * it will retry 3 times here.
> > +                * But before retrying, it must ensure card is in
> > +                * transfer state.
> > +                * Otherwise, the next retried CMD will got TMO error.
> > +                */
> > +               if (mmc_resp_type(cmd) == MMC_RSP_R1B && host->ops->card_busy) {
> > +                       int tries = 500; /* Wait aprox 500ms at maximum */
> > +
> > +                       while (host->ops->card_busy(host) && --tries)
> > +                               mmc_delay(1);
> > +
> > +                       if (tries == 0) {
> > +                               cmd->error = -EBUSY;
> > +                               break;
> > +                       }
> > +               }
> > +
> >                 pr_debug("%s: req failed (CMD%u): %d, retrying...\n",
> >                          mmc_hostname(host), cmd->opcode, cmd->error);
> >                 cmd->retries--;
> > --
> > 1.7.9.5
> >
> 
> Kind regards
> Uffe
Yong Mao Jan. 19, 2017, 9:06 a.m. UTC | #4
On Fri, 2017-01-13 at 11:35 +0800, Shawn Lin wrote:
> On 2017/1/12 23:21, Ulf Hansson wrote:
> > - trimmed cc-list
> >
> > On 3 January 2017 at 09:49, Yong Mao <yong.mao@mediatek.com> wrote:
> >> From: yong mao <yong.mao@mediatek.com>
> >>
> >> When initializing EMMC, after switch to HS400,
> >> it will issue CMD6 to change ext_csd,
> >> if first CMD6 got CRC error,
> >> the repeat CMD6 may get timeout,
> >> that's because card is not back to transfer state immediately.
> >>
> >> For resolving this issue, it need check if card is busy
> >> before sending repeat CMD6.
> >
> > I agree that doing retries in this path might may not be correctly
> > done, but currently we do it as best effort.
> >
> > We should probably change this to *not* retry the CMD6 command, in
> > cases when we have a MMC_RSP_R1B response, because of the reasons you
> > describe.
> > I get the feeling that these retry attempts for CMD6, may instead hide
> > other real issues and making it harder to narrow them down. Don't you
> > think?
> >
> > This leads to my following question:
> > Why do you get a CRC error at the first CMD6 attempt? That shouldn't
> > happen, right?
> >
> > Perhaps you can elaborate on what of the CMD6 commands in the HS400
> > enabling sequence that fails. It may help us to understand, perhaps
> > there may be something in that sequence that should be changed.
> >
> >>
> >> Not only CMD6 here has this issue, but also other R1B CMD has
> >> the same issue.
> >
> > Yes, agree!
> >
> > However, can you please try to point out some other commands than CM6
> > that you see uses *retries*, has R1B response, and which you believe
> > may not be properly managed.
> >
> > Dealing with R1B responses isn't always straight forward. Therefore I
> > am wondering whether we perhaps should just not allow "automatic
> > retries" in cases when R1B responses is used.
> >
> > The reason why I think that is easier, is because of the complexity we
> > have when dealing with R1B responses.
> >
> 
> I'm just thinking a interesting question: What will happen if someone
> uses a userspace tool to switch the partition to RPMB when we are in
> command queue mode? It will fails to finish CMD6 and the emmc should be
> busy for a while. So now we shouldn't retry CMD6 but need to send a
> CMD13 to make sure it's back to transfer state OR a HPI to break it. We
> should be able to cover these cases not only from kernel context but
> also the interaction between user and kernel.
> 
Specification can ensure that your question will not occur.
"NOTE if hosts need to access RPMB partition, the host should disable
the Command Queue mechanism and access RPMB partition not through the
command queue."
"General Purpose partitions may be accessed when command queuing is
enabled.The queue must be empty when CMD6 is sent(to switch partitions
or to disable command queuing). Sending CMD6 while the queue is not
empty shall be regarded as illegal command(see 6.6.39.9, Supported
Commands)."

> > As for example the timeout may differ depending on the command, so
> > just guessing that 500 ms might work, isn't good enough. Moreover we
> > would need to deal with MMC_CAP_WAIT_WHILE_BUSY, etc. Currently I am
> > not saying that we shouldn't do this, but then I first need to
> > understand how big of problem this is.
> >
> >>
> >> Signed-off-by: Yong Mao <yong.mao@mediatek.com>
> >> Signed-off-by: Chaotian Jing <chaotian.jing@mediatek.com>
> >> ---
> >>  drivers/mmc/core/core.c |   19 +++++++++++++++++++
> >>  1 file changed, 19 insertions(+)
> >>
> >> diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
> >> index 1076b9d..8674dbb 100644
> >> --- a/drivers/mmc/core/core.c
> >> +++ b/drivers/mmc/core/core.c
> >> @@ -566,6 +566,25 @@ void mmc_wait_for_req_done(struct mmc_host *host, struct mmc_request *mrq)
> >>
> >>                 mmc_retune_recheck(host);
> >>
> >> +               /*
> >> +                * If a R1B CMD such as CMD6 occur CRC error,
> >> +                * it will retry 3 times here.
> >> +                * But before retrying, it must ensure card is in
> >> +                * transfer state.
> >> +                * Otherwise, the next retried CMD will got TMO error.
> >> +                */
> >> +               if (mmc_resp_type(cmd) == MMC_RSP_R1B && host->ops->card_busy) {
> >> +                       int tries = 500; /* Wait aprox 500ms at maximum */
> >> +
> >> +                       while (host->ops->card_busy(host) && --tries)
> >> +                               mmc_delay(1);
> >> +
> >> +                       if (tries == 0) {
> >> +                               cmd->error = -EBUSY;
> >> +                               break;
> >> +                       }
> >> +               }
> >> +
> >>                 pr_debug("%s: req failed (CMD%u): %d, retrying...\n",
> >>                          mmc_hostname(host), cmd->opcode, cmd->error);
> >>                 cmd->retries--;
> >> --
> >> 1.7.9.5
> >>
> >
> > Kind regards
> > Uffe
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> 
>
Ulf Hansson Jan. 19, 2017, 9:13 p.m. UTC | #5
>
>> This leads to my following question:
>> Why do you get a CRC error at the first CMD6 attempt? That shouldn't
>> happen, right?
>>
> Host needs to ensure that there is no CRC error when sampling the
> signal from device. However, in the high frequency mode, it is
> inevitable that there will be CRC error.
> Because of the presence of CRC error, we have the re-tune mechanism in
> the mmc core layer to handle.

Hmm. I assume your issue is only when sending a CMD6 and quite far
into the HS400 mode enabling sequence.

> But in this case, it will occur CMD timeout error after re-sending the
> R1B CMD. Currently, we still do not have mechanism to handle this case
> in the mmc core layer.
>
>> Perhaps you can elaborate on what of the CMD6 commands in the HS400
>> enabling sequence that fails. It may help us to understand, perhaps
>> there may be something in that sequence that should be changed.
>>
> Sorry. Maybe I didn't describe this issue clearly.
> The CMD6 command issue in my commit message is only a specific example
> of this type of issue we actually encounter.
>
> I describe the issue step by step as below:
> Step 1: Send a R1B command and then get command CRC error.
> Step 2: According to the current logic of mmc core layer code,
>         it will resend this R1B command again.
> Step 3: Because card may be in the busy state while resending this
>         R1B command, card will not response this R1B command.
>         And then this R1B command will fail with the reason
>         of timeout.

Thanks for the clarification!

>
>> >
>> > Not only CMD6 here has this issue, but also other R1B CMD has
>> > the same issue.
>>
>> Yes, agree!
>>
>> However, can you please try to point out some other commands than CM6
>> that you see uses *retries*, has R1B response, and which you believe
>> may not be properly managed.
>>
> Actually, we just met this CMD6 timeout issue in our project.
> However, according to the analysis, all R1b command may have the same
> issue.

Okay. I see.

To me it seems like you have two options to fix this

1. Try fix the problem your host driver. I think it's odd to a CRC
error, immediately after you have done a tuning. Maybe the tuning
actually failed in the first step?
2.  Implement some kind of re-try mechanism specific for the HS400
enabling sequence when sending the CMD6 command you refer to, and if
so that in conjunction of removing the retries for CMD6 in
__mmc_switch().

>
>> Dealing with R1B responses isn't always straight forward. Therefore I
>> am wondering whether we perhaps should just not allow "automatic
>> retries" in cases when R1B responses is used.
>>
>> The reason why I think that is easier, is because of the complexity we
>> have when dealing with R1B responses.
> We think "automatic retries" should still be used in this case.
> With "automatic retries", it have re-tune mechanism to let host
> re-select a better parameters to handle command CRC error.
>
>>
>> As for example the timeout may differ depending on the command, so
>> just guessing that 500 ms might work, isn't good enough. Moreover we
>> would need to deal with MMC_CAP_WAIT_WHILE_BUSY, etc. Currently I am
>> not saying that we shouldn't do this, but then I first need to
>> understand how big of problem this is.
>>
> Yes. You are right. We need to deal with MMC_CAP_WAIT_WHILE_BUSY here.
> If the host has the capability of MMC_CAP_WAIT_WHILE_BUSY, it does not
> need this patch. But the host, like our host, does not have this
> capability, it needs this patch. I will add this logic in next version.
> According to our experience, 500ms is enough here.

It seems the problem mostly concerns switching to HS400 mode. Perhaps,
in the end it's actually a tuning issue specific to your controller?

Therefore, I would rather suggest you to try one of the two approaches
I suggested above, than continuing this path.

[...]

Kind regards
Uffe
diff mbox

Patch

diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 1076b9d..8674dbb 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -566,6 +566,25 @@  void mmc_wait_for_req_done(struct mmc_host *host, struct mmc_request *mrq)
 
 		mmc_retune_recheck(host);
 
+		/*
+		 * If a R1B CMD such as CMD6 occur CRC error,
+		 * it will retry 3 times here.
+		 * But before retrying, it must ensure card is in
+		 * transfer state.
+		 * Otherwise, the next retried CMD will got TMO error.
+		 */
+		if (mmc_resp_type(cmd) == MMC_RSP_R1B && host->ops->card_busy) {
+			int tries = 500; /* Wait aprox 500ms at maximum */
+
+			while (host->ops->card_busy(host) && --tries)
+				mmc_delay(1);
+
+			if (tries == 0) {
+				cmd->error = -EBUSY;
+				break;
+			}
+		}
+
 		pr_debug("%s: req failed (CMD%u): %d, retrying...\n",
 			 mmc_hostname(host), cmd->opcode, cmd->error);
 		cmd->retries--;