diff mbox series

mmc: core: Add a card quirk for broken boot partitions

Message ID 20210628232955.3327484-1-linus.walleij@linaro.org (mailing list archive)
State New, archived
Headers show
Series mmc: core: Add a card quirk for broken boot partitions | expand

Commit Message

Linus Walleij June 28, 2021, 11:29 p.m. UTC
Some boot partitions on the Samsung 4GB KLM4G1YE4C "4YMD1R" card appear
broken when accessed randomly. CMD6 to switch back to the main partition
randomly stalls after CMD18 access to the boot partition 1, and the card
never comes back online. The accesses to the boot partitions work several
times before this happens.

I tried using only single blocks with CMD17 on the boot partitions with the
result that it crashed even faster.

The card may survive on older kernels, but this seems to be because recent
kernel partition parsers are accessing the boot partitions more, and we get
more block traffic to the boot partitions during kernel startup.

My only conclusion is that the boot partitions are somehow broken on this
card, and we add a quirk for this.

After applying this patch, the main partition can be accessed and mounted
without problems.

This eMMC card is found in the Samsung GT-S7710 mobile phone.

Cc: phone-devel@vger.kernel.org
Cc: Stephan Gerhold <stephan@gerhold.net>
Reported-by: newbyte@disroot.org
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
---
 drivers/mmc/core/block.c  | 6 ++++++
 drivers/mmc/core/quirks.h | 7 +++++++
 include/linux/mmc/card.h  | 1 +
 3 files changed, 14 insertions(+)

Comments

Ulf Hansson June 29, 2021, 1:56 p.m. UTC | #1
On Tue, 29 Jun 2021 at 01:32, Linus Walleij <linus.walleij@linaro.org> wrote:
>
> Some boot partitions on the Samsung 4GB KLM4G1YE4C "4YMD1R" card appear
> broken when accessed randomly. CMD6 to switch back to the main partition
> randomly stalls after CMD18 access to the boot partition 1, and the card
> never comes back online. The accesses to the boot partitions work several
> times before this happens.
>
> I tried using only single blocks with CMD17 on the boot partitions with the
> result that it crashed even faster.
>
> The card may survive on older kernels, but this seems to be because recent
> kernel partition parsers are accessing the boot partitions more, and we get
> more block traffic to the boot partitions during kernel startup.
>
> My only conclusion is that the boot partitions are somehow broken on this
> card, and we add a quirk for this.

You may very well be right on the above conclusion. However, I would
prefer some more details before I am ready to apply this. For example,
more exactly, what happens when the stall is triggered? Can you
provide some more debug info, perhaps with the sequence of commands
that triggers the error?

Moreover, it would be nice to know exactly where in the code the stall
happens. Could it be that it's the mmc_wait_for_cmd() that never
returns in __mmc_switch(), when sending the CMD6 to switch the
partition?

My main worry is that we should not set a card quirk for an eMMC that
could be working fine (on some other platforms), that's why I want us
to be sure.

>
> After applying this patch, the main partition can be accessed and mounted
> without problems.
>
> This eMMC card is found in the Samsung GT-S7710 mobile phone.
>
> Cc: phone-devel@vger.kernel.org
> Cc: Stephan Gerhold <stephan@gerhold.net>
> Reported-by: newbyte@disroot.org
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>

Kind regards
Uffe

> ---
>  drivers/mmc/core/block.c  | 6 ++++++
>  drivers/mmc/core/quirks.h | 7 +++++++
>  include/linux/mmc/card.h  | 1 +
>  3 files changed, 14 insertions(+)
>
> diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
> index 689eb9afeeed..e6cde68cda0e 100644
> --- a/drivers/mmc/core/block.c
> +++ b/drivers/mmc/core/block.c
> @@ -2621,6 +2621,12 @@ static int mmc_blk_alloc_parts(struct mmc_card *card, struct mmc_blk_data *md)
>                                 card->part[idx].name);
>                         if (ret)
>                                 return ret;
> +               } else if((card->part[idx].area_type & MMC_BLK_DATA_AREA_BOOT) &&
> +                         (card->quirks & MMC_QUIRK_BROKEN_BOOT_PARTITIONS)) {
> +                       pr_info("%s: skipping broken boot partition %s\n",
> +                               mmc_hostname(card->host),
> +                               card->part[idx].name);
> +                       continue;
>                 } else if (card->part[idx].size) {
>                         ret = mmc_blk_alloc_part(card, md,
>                                 card->part[idx].part_cfg,
> diff --git a/drivers/mmc/core/quirks.h b/drivers/mmc/core/quirks.h
> index d68e6e513a4f..aa4060c10aa9 100644
> --- a/drivers/mmc/core/quirks.h
> +++ b/drivers/mmc/core/quirks.h
> @@ -99,6 +99,13 @@ static const struct mmc_fixup __maybe_unused mmc_blk_fixups[] = {
>         MMC_FIXUP("V10016", CID_MANFID_KINGSTON, CID_OEMID_ANY, add_quirk_mmc,
>                   MMC_QUIRK_TRIM_BROKEN),
>
> +       /*
> +        * Some Samsung eMMCs have broken boot partitions, accessing them
> +        * randomly will make the device lock up and crash.
> +        */
> +       MMC_FIXUP("4YMD1R", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
> +                 MMC_QUIRK_BROKEN_BOOT_PARTITIONS),
> +
>         END_FIXUP
>  };
>
> diff --git a/include/linux/mmc/card.h b/include/linux/mmc/card.h
> index f9ad35dd6012..4006736f59dd 100644
> --- a/include/linux/mmc/card.h
> +++ b/include/linux/mmc/card.h
> @@ -259,6 +259,7 @@ struct mmc_card {
>                                                 /* for byte mode */
>  #define MMC_QUIRK_NONSTD_SDIO  (1<<2)          /* non-standard SDIO card attached */
>                                                 /* (missing CIA registers) */
> +#define MMC_QUIRK_BROKEN_BOOT_PARTITIONS (1<<3)        /* Disable broken boot partitions */
>  #define MMC_QUIRK_NONSTD_FUNC_IF (1<<4)                /* SDIO card has nonstd function interfaces */
>  #define MMC_QUIRK_DISABLE_CD   (1<<5)          /* disconnect CD/DAT[3] resistor */
>  #define MMC_QUIRK_INAND_CMD38  (1<<6)          /* iNAND devices have broken CMD38 */
> --
> 2.31.1
>
Linus Walleij June 29, 2021, 5:23 p.m. UTC | #2
On Tue, Jun 29, 2021 at 3:56 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:

> I would
> prefer some more details before I am ready to apply this. For example,
> more exactly, what happens when the stall is triggered? Can you
> provide some more debug info, perhaps with the sequence of commands
> that triggers the error?

I have added debug prints and it looks like this, first tons of normal
traffick switching between the different partitions with CMD6,
both 0 (main), 1 (boot0) and 2 (boot1) are investigated before
results. At the end of the trail this happens:

[   54.171173] mmc_host mmc2: start CMD18 arg 00005010
[   54.171478] mmc_host mmc2: start CMD18 arg 00005028
[   54.172058] mmc_host mmc2: start CMD18 arg 00005048
[   54.172790] mmc_host mmc2: start CMD18 arg 00005088
[   54.174926] mmc_host mmc2: start CMD18 arg 00005108
[   54.177581] mmc_host mmc2: start CMD18 arg 00015038
[   54.178039] mmc_host mmc2: start CMD18 arg 00066400
[   54.178222] mmc_host mmc2: start CMD18 arg 00004880
[   54.178497] mmc_host mmc2: start CMD18 arg 00000040
[   54.178649] mmc_host mmc2: start CMD18 arg 00015078
[   54.178894] mmc_host mmc2: start CMD18 arg 00066800
[   54.179382] mmc2: modify EXT_CSD, index 179, value: 1, set 1, timing 0
[   54.179412] mmc_host mmc2: start CMD6 arg 03b30101
[   54.180145] mmc2 modify EXT_CSD completed (0)
[   54.180175] mmc_host mmc2: start CMD13 arg 00010000
[   54.180267] mmc_host mmc2: start CMD18 arg 00001ff0
[   54.180755] mmc2: modify EXT_CSD, index 179, value: 0, set 1, timing 0
[   54.180786] mmc_host mmc2: start CMD6 arg 03b30001
[   54.180847] mmc2 modify EXT_CSD completed (0)
[   54.180847] mmc_host mmc2: start CMD13 arg 00010000
[   54.180938] mmc_host mmc2: start CMD18 arg 00015010
[   54.181121] mmc_host mmc2: start CMD18 arg 00015028
[   54.181365] mmc_host mmc2: start CMD18 arg 00015048
[   54.182312] mmc_host mmc2: start CMD18 arg 00015088
[   54.183654] mmc_host mmc2: start CMD18 arg 00015108
[   54.186187] mmc_host mmc2: start CMD18 arg 00066018
[   54.186767] mmc_host mmc2: start CMD18 arg 00004900
[   54.186950] mmc_host mmc2: start CMD18 arg 00000080
[   54.187225] mmc_host mmc2: start CMD18 arg 00066038
[   54.187469] mmc_host mmc2: start CMD18 arg 00004a00
[   54.187713] mmc_host mmc2: start CMD18 arg 00066078
[   54.187988] mmc_host mmc2: start CMD18 arg 00004c00
[   54.188293] mmc2: modify EXT_CSD, index 179, value: 1, set 1, timing 0
[   54.188323] mmc_host mmc2: start CMD6 arg 03b30101
[   54.188354] mmc2 modify EXT_CSD completed (0)
[   54.188385] mmc_host mmc2: start CMD13 arg 00010000
[   54.188446] mmc_host mmc2: start CMD18 arg 00000000
[   54.189300] mmc2: modify EXT_CSD, index 179, value: 0, set 1, timing 0
[   54.189331] mmc_host mmc2: start CMD6 arg 03b30001

After this CMD6 just hangs there. Nothing proceeds on mmc2 (the eMMC)
ever again.

> Moreover, it would be nice to know exactly where in the code the stall
> happens. Could it be that it's the mmc_wait_for_cmd() that never
> returns in __mmc_switch(), when sending the CMD6 to switch the
> partition?

__mmc_switch() initiates a partition switch, then
issues CMD6. Then:

__mmc_switch()
  mmc_wait_for_cmd()
    mmc_wait_for_req()
    __mmc_start_req()
      mmc_wait_for_req_done()

We wait forever in mmc_wait_for_req_done() since the completion
never arrives.

> My main worry is that we should not set a card quirk for an eMMC that
> could be working fine (on some other platforms), that's why I want us
> to be sure.

Yes we need to get to the bottom of this.

One theory as we discussed on chat is that the busy detect
for the previous command isn't working as it should leading us
to prematurely start CMD6 while the previous CMD18 is actually still
busy.

After testing, this appears to be true!

I disabled hardware busy detect in mmci.c like this:

diff --git a/drivers/mmc/host/mmci.c b/drivers/mmc/host/mmci.c
index 984d35055156..cffe95e32b9a 100644
--- a/drivers/mmc/host/mmci.c
+++ b/drivers/mmc/host/mmci.c
@@ -180,10 +180,10 @@ static struct variant_data variant_ux500 = {
        .f_max                  = 100000000,
        .signal_direction       = true,
        .pwrreg_clkgate         = true,
-       .busy_detect            = true,
-       .busy_dpsm_flag         = MCI_DPSM_ST_BUSYMODE,
-       .busy_detect_flag       = MCI_ST_CARDBUSY,
-       .busy_detect_mask       = MCI_ST_BUSYENDMASK,
+       //.busy_detect          = true,
+       //.busy_dpsm_flag               = MCI_DPSM_ST_BUSYMODE,
+       //.busy_detect_flag     = MCI_ST_CARDBUSY,
+       //.busy_detect_mask     = MCI_ST_BUSYENDMASK,
        .pwrreg_nopower         = true,
        .mmcimask1              = true,
        .irq_pio_mask           = MCI_IRQ_PIO_MASK,
@@ -215,10 +215,10 @@ static struct variant_data variant_ux500v2 = {
        .f_max                  = 100000000,
        .signal_direction       = true,
        .pwrreg_clkgate         = true,
-       .busy_detect            = true,
-       .busy_dpsm_flag         = MCI_DPSM_ST_BUSYMODE,
-       .busy_detect_flag       = MCI_ST_CARDBUSY,
-       .busy_detect_mask       = MCI_ST_BUSYENDMASK,
+       // .busy_detect         = true,
+       //.busy_dpsm_flag               = MCI_DPSM_ST_BUSYMODE,
+       //.busy_detect_flag     = MCI_ST_CARDBUSY,
+       //.busy_detect_mask     = MCI_ST_BUSYENDMASK,
        .pwrreg_nopower         = true,
        .mmcimask1              = true,
        .irq_pio_mask           = MCI_IRQ_PIO_MASK,

Now it works.

So now I am thinking about a quirk that disables hardware busy
detect on some older eMMC cards instead, my current
assumption is that the hardware busy detect on these eMMCs
is glitchy. It works a bit but not good enough.

I think you also mentioned the timeout in EXT_CSD maybe not being
long enough? How do I play around with this?
MMC_QUIRK_LONG_READ_TIME?

Yours,
Linus Walleij
Ulf Hansson June 30, 2021, 11:49 a.m. UTC | #3
On Tue, 29 Jun 2021 at 19:23, Linus Walleij <linus.walleij@linaro.org> wrote:
>
> On Tue, Jun 29, 2021 at 3:56 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> > I would
> > prefer some more details before I am ready to apply this. For example,
> > more exactly, what happens when the stall is triggered? Can you
> > provide some more debug info, perhaps with the sequence of commands
> > that triggers the error?
>
> I have added debug prints and it looks like this, first tons of normal
> traffick switching between the different partitions with CMD6,
> both 0 (main), 1 (boot0) and 2 (boot1) are investigated before
> results. At the end of the trail this happens:
>
> [   54.171173] mmc_host mmc2: start CMD18 arg 00005010
> [   54.171478] mmc_host mmc2: start CMD18 arg 00005028
> [   54.172058] mmc_host mmc2: start CMD18 arg 00005048
> [   54.172790] mmc_host mmc2: start CMD18 arg 00005088
> [   54.174926] mmc_host mmc2: start CMD18 arg 00005108
> [   54.177581] mmc_host mmc2: start CMD18 arg 00015038
> [   54.178039] mmc_host mmc2: start CMD18 arg 00066400
> [   54.178222] mmc_host mmc2: start CMD18 arg 00004880
> [   54.178497] mmc_host mmc2: start CMD18 arg 00000040
> [   54.178649] mmc_host mmc2: start CMD18 arg 00015078
> [   54.178894] mmc_host mmc2: start CMD18 arg 00066800
> [   54.179382] mmc2: modify EXT_CSD, index 179, value: 1, set 1, timing 0
> [   54.179412] mmc_host mmc2: start CMD6 arg 03b30101
> [   54.180145] mmc2 modify EXT_CSD completed (0)
> [   54.180175] mmc_host mmc2: start CMD13 arg 00010000
> [   54.180267] mmc_host mmc2: start CMD18 arg 00001ff0
> [   54.180755] mmc2: modify EXT_CSD, index 179, value: 0, set 1, timing 0
> [   54.180786] mmc_host mmc2: start CMD6 arg 03b30001
> [   54.180847] mmc2 modify EXT_CSD completed (0)
> [   54.180847] mmc_host mmc2: start CMD13 arg 00010000
> [   54.180938] mmc_host mmc2: start CMD18 arg 00015010
> [   54.181121] mmc_host mmc2: start CMD18 arg 00015028
> [   54.181365] mmc_host mmc2: start CMD18 arg 00015048
> [   54.182312] mmc_host mmc2: start CMD18 arg 00015088
> [   54.183654] mmc_host mmc2: start CMD18 arg 00015108
> [   54.186187] mmc_host mmc2: start CMD18 arg 00066018
> [   54.186767] mmc_host mmc2: start CMD18 arg 00004900
> [   54.186950] mmc_host mmc2: start CMD18 arg 00000080
> [   54.187225] mmc_host mmc2: start CMD18 arg 00066038
> [   54.187469] mmc_host mmc2: start CMD18 arg 00004a00
> [   54.187713] mmc_host mmc2: start CMD18 arg 00066078
> [   54.187988] mmc_host mmc2: start CMD18 arg 00004c00
> [   54.188293] mmc2: modify EXT_CSD, index 179, value: 1, set 1, timing 0
> [   54.188323] mmc_host mmc2: start CMD6 arg 03b30101
> [   54.188354] mmc2 modify EXT_CSD completed (0)
> [   54.188385] mmc_host mmc2: start CMD13 arg 00010000
> [   54.188446] mmc_host mmc2: start CMD18 arg 00000000
> [   54.189300] mmc2: modify EXT_CSD, index 179, value: 0, set 1, timing 0
> [   54.189331] mmc_host mmc2: start CMD6 arg 03b30001
>
> After this CMD6 just hangs there. Nothing proceeds on mmc2 (the eMMC)
> ever again.
>
> > Moreover, it would be nice to know exactly where in the code the stall
> > happens. Could it be that it's the mmc_wait_for_cmd() that never
> > returns in __mmc_switch(), when sending the CMD6 to switch the
> > partition?
>
> __mmc_switch() initiates a partition switch, then
> issues CMD6. Then:
>
> __mmc_switch()
>   mmc_wait_for_cmd()
>     mmc_wait_for_req()
>     __mmc_start_req()
>       mmc_wait_for_req_done()
>
> We wait forever in mmc_wait_for_req_done() since the completion
> never arrives.

Thanks for sharing these details. It looks like the mmci driver is
waiting for the busy completion IRQ, forever.

>
> > My main worry is that we should not set a card quirk for an eMMC that
> > could be working fine (on some other platforms), that's why I want us
> > to be sure.
>
> Yes we need to get to the bottom of this.
>
> One theory as we discussed on chat is that the busy detect
> for the previous command isn't working as it should leading us
> to prematurely start CMD6 while the previous CMD18 is actually still
> busy.
>
> After testing, this appears to be true!
>
> I disabled hardware busy detect in mmci.c like this:
>
> diff --git a/drivers/mmc/host/mmci.c b/drivers/mmc/host/mmci.c
> index 984d35055156..cffe95e32b9a 100644
> --- a/drivers/mmc/host/mmci.c
> +++ b/drivers/mmc/host/mmci.c
> @@ -180,10 +180,10 @@ static struct variant_data variant_ux500 = {
>         .f_max                  = 100000000,
>         .signal_direction       = true,
>         .pwrreg_clkgate         = true,
> -       .busy_detect            = true,
> -       .busy_dpsm_flag         = MCI_DPSM_ST_BUSYMODE,
> -       .busy_detect_flag       = MCI_ST_CARDBUSY,
> -       .busy_detect_mask       = MCI_ST_BUSYENDMASK,
> +       //.busy_detect          = true,
> +       //.busy_dpsm_flag               = MCI_DPSM_ST_BUSYMODE,
> +       //.busy_detect_flag     = MCI_ST_CARDBUSY,
> +       //.busy_detect_mask     = MCI_ST_BUSYENDMASK,
>         .pwrreg_nopower         = true,
>         .mmcimask1              = true,
>         .irq_pio_mask           = MCI_IRQ_PIO_MASK,
> @@ -215,10 +215,10 @@ static struct variant_data variant_ux500v2 = {
>         .f_max                  = 100000000,
>         .signal_direction       = true,
>         .pwrreg_clkgate         = true,
> -       .busy_detect            = true,
> -       .busy_dpsm_flag         = MCI_DPSM_ST_BUSYMODE,
> -       .busy_detect_flag       = MCI_ST_CARDBUSY,
> -       .busy_detect_mask       = MCI_ST_BUSYENDMASK,
> +       // .busy_detect         = true,
> +       //.busy_dpsm_flag               = MCI_DPSM_ST_BUSYMODE,
> +       //.busy_detect_flag     = MCI_ST_CARDBUSY,
> +       //.busy_detect_mask     = MCI_ST_BUSYENDMASK,
>         .pwrreg_nopower         = true,
>         .mmcimask1              = true,
>         .irq_pio_mask           = MCI_IRQ_PIO_MASK,
>
> Now it works.

Alright, this certainly brings us a bit closer to a fix.

Either the HW busy detection isn't working properly in mmci or the
eMMC card is behaving a bit odd (continues to deassert the DAT0 line
forever).

What certainly is missing in the mmci driver, is a software based
timeout for cases like these. The mmci driver should better complete
the request with -ETIMEDOUT error for the cmd, rather than hanging
forever and waiting for the busy completion IRQ.

>
> So now I am thinking about a quirk that disables hardware busy
> detect on some older eMMC cards instead, my current
> assumption is that the hardware busy detect on these eMMCs
> is glitchy. It works a bit but not good enough.
>
> I think you also mentioned the timeout in EXT_CSD maybe not being
> long enough? How do I play around with this?
> MMC_QUIRK_LONG_READ_TIME?

As mmci doesn't care about busy timeouts, but waits forever, this is
likely not the problem.

However, I would like to try to narrow down the problem even further.
Would you mind try the below debug patch?

--

Subject: [PATCH] mmc: mmci: HACK/DEBUG: Drop IRQ based HW busy detect

This change make MMC_CAP_WAIT_WHILE_BUSY to become unset, but keeps the
->card_busy ops. In this way, the core will sometimes (for CMD6 for
example), use this callback rather than polling with CMD13.

Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
---
 drivers/mmc/host/mmci.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/mmc/host/mmci.c b/drivers/mmc/host/mmci.c
index 984d35055156..cbc67c0e64b1 100644
--- a/drivers/mmc/host/mmci.c
+++ b/drivers/mmc/host/mmci.c
@@ -180,7 +180,7 @@ static struct variant_data variant_ux500 = {
        .f_max                  = 100000000,
        .signal_direction       = true,
        .pwrreg_clkgate         = true,
-       .busy_detect            = true,
+       .busy_detect            = false,
        .busy_dpsm_flag         = MCI_DPSM_ST_BUSYMODE,
        .busy_detect_flag       = MCI_ST_CARDBUSY,
        .busy_detect_mask       = MCI_ST_BUSYENDMASK,
@@ -215,7 +215,7 @@ static struct variant_data variant_ux500v2 = {
        .f_max                  = 100000000,
        .signal_direction       = true,
        .pwrreg_clkgate         = true,
-       .busy_detect            = true,
+       .busy_detect            = false,
        .busy_dpsm_flag         = MCI_DPSM_ST_BUSYMODE,
        .busy_detect_flag       = MCI_ST_CARDBUSY,
        .busy_detect_mask       = MCI_ST_BUSYENDMASK,
@@ -2143,7 +2143,7 @@ static int mmci_probe(struct amba_device *dev,
        /*
         * Enable busy detection.
         */
-       if (variant->busy_detect) {
+//     if (variant->busy_detect) {
                mmci_ops.card_busy = mmci_card_busy;
                /*
                 * Not all variants have a flag to enable busy detection
@@ -2152,8 +2152,8 @@ static int mmci_probe(struct amba_device *dev,
                if (variant->busy_dpsm_flag)
                        mmci_write_datactrlreg(host,
                                               host->variant->busy_dpsm_flag);
-               mmc->caps |= MMC_CAP_WAIT_WHILE_BUSY;
-       }
+//             mmc->caps |= MMC_CAP_WAIT_WHILE_BUSY;
+//     }

        /* Variants with mandatory busy timeout in HW needs R1B responses. */
        if (variant->busy_timeout)
Linus Walleij June 30, 2021, 2:25 p.m. UTC | #4
On Wed, Jun 30, 2021 at 1:50 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:

> > We wait forever in mmc_wait_for_req_done() since the completion
> > never arrives.
>
> Thanks for sharing these details. It looks like the mmci driver is
> waiting for the busy completion IRQ, forever.

Yep

> Either the HW busy detection isn't working properly in mmci or the
> eMMC card is behaving a bit odd (continues to deassert the DAT0 line
> forever).

Yep. I think the card is odd because I have this working fine on
3 other ux500 phones, this is the odd one out. I will try to check
what eMMC is in the others as well.

> What certainly is missing in the mmci driver, is a software based
> timeout for cases like these. The mmci driver should better complete
> the request with -ETIMEDOUT error for the cmd, rather than hanging
> forever and waiting for the busy completion IRQ.

That is true, it would make the driver more robust.

> > I think you also mentioned the timeout in EXT_CSD maybe not being
> > long enough? How do I play around with this?
> > MMC_QUIRK_LONG_READ_TIME?
>
> As mmci doesn't care about busy timeouts, but waits forever, this is
> likely not the problem.
>
> However, I would like to try to narrow down the problem even further.
> Would you mind try the below debug patch?

With this patch mmc2 comes up and I can mount and browse
the eMMC.

I think it is because these lines in mmci_cmd_irq() will not
be executed:

        /* Handle busy detection on DAT0 if the variant supports it. */
        if (busy_resp && host->variant->busy_detect)
                if (!host->ops->busy_complete(host, status, err_msk))
                        return;

These seemed to be especially problematic to me.

However the core can still use ->card_busy() at times?

Yours,
Linus Walleij
Ulf Hansson June 30, 2021, 3:27 p.m. UTC | #5
On Wed, 30 Jun 2021 at 16:26, Linus Walleij <linus.walleij@linaro.org> wrote:
>
> On Wed, Jun 30, 2021 at 1:50 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> > > We wait forever in mmc_wait_for_req_done() since the completion
> > > never arrives.
> >
> > Thanks for sharing these details. It looks like the mmci driver is
> > waiting for the busy completion IRQ, forever.
>
> Yep
>
> > Either the HW busy detection isn't working properly in mmci or the
> > eMMC card is behaving a bit odd (continues to deassert the DAT0 line
> > forever).
>
> Yep. I think the card is odd because I have this working fine on
> 3 other ux500 phones, this is the odd one out. I will try to check
> what eMMC is in the others as well.
>
> > What certainly is missing in the mmci driver, is a software based
> > timeout for cases like these. The mmci driver should better complete
> > the request with -ETIMEDOUT error for the cmd, rather than hanging
> > forever and waiting for the busy completion IRQ.
>
> That is true, it would make the driver more robust.
>
> > > I think you also mentioned the timeout in EXT_CSD maybe not being
> > > long enough? How do I play around with this?
> > > MMC_QUIRK_LONG_READ_TIME?
> >
> > As mmci doesn't care about busy timeouts, but waits forever, this is
> > likely not the problem.
> >
> > However, I would like to try to narrow down the problem even further.
> > Would you mind try the below debug patch?
>
> With this patch mmc2 comes up and I can mount and browse
> the eMMC.
>
> I think it is because these lines in mmci_cmd_irq() will not
> be executed:
>
>         /* Handle busy detection on DAT0 if the variant supports it. */
>         if (busy_resp && host->variant->busy_detect)
>                 if (!host->ops->busy_complete(host, status, err_msk))
>                         return;
>
> These seemed to be especially problematic to me.

Yes, exactly. The IRQ based busy detection code gets disabled with my
debug patch.

It looks like there are some race conditions in the HW busy detection
path for mmci, which gets triggered by this eMMC card.

>
> However the core can still use ->card_busy() at times?

It can and will.

Although, it's more optimal to receive an IRQ when busy on DAT0 is
de-asserted, rather than polling with ->card_busy(). Hence we also
have MMC_CAP_WAIT_WHILE_BUSY.

I will have another look at the code in mmci, to see if there is
something we can try to improve.

Kind regards
Uffe
Linus Walleij June 30, 2021, 10:33 p.m. UTC | #6
On Wed, Jun 30, 2021 at 5:28 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:

> >         /* Handle busy detection on DAT0 if the variant supports it. */
> >         if (busy_resp && host->variant->busy_detect)
> >                 if (!host->ops->busy_complete(host, status, err_msk))
> >                         return;
> >
> > These seemed to be especially problematic to me.
>
> Yes, exactly. The IRQ based busy detection code gets disabled with my
> debug patch.
>
> It looks like there are some race conditions in the HW busy detection
> path for mmci, which gets triggered by this eMMC card.
(...)
> Although, it's more optimal to receive an IRQ when busy on DAT0 is
> de-asserted, rather than polling with ->card_busy(). Hence we also
> have MMC_CAP_WAIT_WHILE_BUSY.

Hmmmmm it kind of assumes that DAT0 will be de-asserted *before*
we get a command response, never after. I think that is what the card
is doing. If that is out-of-spec then we need to have a quirk like
this but if it is legal behaviour, we rather need to fix the mmci driver.

Yours,
Linus Walleij
Ulf Hansson July 1, 2021, 2:26 p.m. UTC | #7
+ Jean, Ludovic

On Thu, 1 Jul 2021 at 00:33, Linus Walleij <linus.walleij@linaro.org> wrote:
>
> On Wed, Jun 30, 2021 at 5:28 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> > >         /* Handle busy detection on DAT0 if the variant supports it. */
> > >         if (busy_resp && host->variant->busy_detect)
> > >                 if (!host->ops->busy_complete(host, status, err_msk))
> > >                         return;
> > >
> > > These seemed to be especially problematic to me.
> >
> > Yes, exactly. The IRQ based busy detection code gets disabled with my
> > debug patch.
> >
> > It looks like there are some race conditions in the HW busy detection
> > path for mmci, which gets triggered by this eMMC card.
> (...)
> > Although, it's more optimal to receive an IRQ when busy on DAT0 is
> > de-asserted, rather than polling with ->card_busy(). Hence we also
> > have MMC_CAP_WAIT_WHILE_BUSY.
>
> Hmmmmm it kind of assumes that DAT0 will be de-asserted *before*
> we get a command response, never after. I think that is what the card
> is doing. If that is out-of-spec then we need to have a quirk like
> this but if it is legal behaviour, we rather need to fix the mmci driver.

That's correct and this could very well be the reason why polling
works better for this case.

On the other hand, I am still a bit puzzled why the mmci driver hangs,
waiting for the busy completion IRQ to be raised.

I did some more inspection of the code in ux500_busy_complete() and
found that there may be a potential race condition. I tried to fix it
up, but I don't know if it really makes any difference. Can you please
test the below patch and see if it helps.

---

From: Ulf Hansson <ulf.hansson@linaro.org>
Date: Thu, 1 Jul 2021 13:12:48 +0200
Subject: [PATCH] mmc: mmci: Fix busy detect completion

One of the pre-conditions to set the ->busy_detect_mask in the MMCIMASK0
register, is to first re-read the MMCISTATUS register to verify that the
->busy_detect_flag is set. The intent is to avoid enabling IRQ based busy
completion if the card does not signal busy.

Assuming the busy_detect_flag is set, we enter a small window before the
actual write of the ->busy_detect_mask hits the HW. In this window, the
->busy_detect_flag in the MMCISTATUS register may change to not indicate
busy any more. This could lead to that we end up waiting for a busy
completion IRQ forever.

Fix this, by writing the ->busy_detect_mask to the MMCIMASK0 first, but
clear it if it turns out that the card wasn't signaling busy.

Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
---
 drivers/mmc/host/mmci.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/mmc/host/mmci.c b/drivers/mmc/host/mmci.c
index 984d35055156..122de99759a5 100644
--- a/drivers/mmc/host/mmci.c
+++ b/drivers/mmc/host/mmci.c
@@ -671,14 +671,20 @@ static bool ux500_busy_complete(struct mmci_host
*host, u32 status, u32 err_msk)
         * while, to allow it to be set, but tests indicates that it
         * isn't needed.
         */
-       if (!host->busy_status && !(status & err_msk) &&
-           (readl(base + MMCISTATUS) & host->variant->busy_detect_flag)) {
-               writel(readl(base + MMCIMASK0) |
-                      host->variant->busy_detect_mask,
+       if (!host->busy_status && !(status & err_msk)) {
+               u32 mask0 = readl(base + MMCIMASK0);
+
+               writel(mask0 | host->variant->busy_detect_mask,
                       base + MMCIMASK0);
+               wmb();

-               host->busy_status = status & (MCI_CMDSENT | MCI_CMDRESPEND);
-               return false;
+               if (readl(base + MMCISTATUS) &
+                   host->variant->busy_detect_flag) {
+                       host->busy_status = status &
+                                           (MCI_CMDSENT | MCI_CMDRESPEND);
+                       return false;
+               }
+               writel(mask0, base + MMCIMASK0);
        }

        /*
Linus Walleij July 1, 2021, 3:52 p.m. UTC | #8
On Thu, Jul 1, 2021 at 4:27 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> On Thu, 1 Jul 2021 at 00:33, Linus Walleij <linus.walleij@linaro.org> wrote:

> > > It looks like there are some race conditions in the HW busy detection
> > > path for mmci, which gets triggered by this eMMC card.
> > (...)
> > > Although, it's more optimal to receive an IRQ when busy on DAT0 is
> > > de-asserted, rather than polling with ->card_busy(). Hence we also
> > > have MMC_CAP_WAIT_WHILE_BUSY.
> >
> > Hmmmmm it kind of assumes that DAT0 will be de-asserted *before*
> > we get a command response, never after. I think that is what the card
> > is doing. If that is out-of-spec then we need to have a quirk like
> > this but if it is legal behaviour, we rather need to fix the mmci driver.
>
> That's correct and this could very well be the reason why polling
> works better for this case.
>
> On the other hand, I am still a bit puzzled why the mmci driver hangs,
> waiting for the busy completion IRQ to be raised.
>
> I did some more inspection of the code in ux500_busy_complete() and
> found that there may be a potential race condition. I tried to fix it
> up, but I don't know if it really makes any difference. Can you please
> test the below patch and see if it helps.

I tested it and sadly it does not help :(

The system hangs on CMD6 after CMD18 as before.
It looks like this with my debug prints:

[   53.940399] mmc_host mmc2: start CMD13 arg 00010000
[   53.940490] mmc_host mmc2: start CMD18 arg 0000d010
[   53.940765] mmc_host mmc2: start CMD18 arg 0000d028
[   53.941162] mmc_host mmc2: start CMD18 arg 0000d048
[   53.941864] mmc_host mmc2: start CMD18 arg 0000d088
[   53.943878] mmc_host mmc2: start CMD18 arg 0000d108
[   53.946563] mmc_host mmc2: start CMD18 arg 00004810
[   53.947174] mmc_host mmc2: start CMD18 arg 00015018
[   53.947357] mmc_host mmc2: start CMD18 arg 0005e078
[   53.947845] mmc2: modify EXT_CSD, index 179, value: 1, set 1, timing 0
[   53.947875] mmc_host mmc2: start CMD6 arg 03b30101
[   53.947937] mmc2 modify EXT_CSD completed (0)
[   53.947967] mmc_host mmc2: start CMD13 arg 00010000
[   53.948059] mmc_host mmc2: start CMD18 arg 00001c70
[   53.948364] mmc2: modify EXT_CSD, index 179, value: 0, set 1, timing 0
[   53.948394] mmc_host mmc2: start CMD6 arg 03b30001
[   53.948455] mmc2 modify EXT_CSD completed (0)
[   53.948486] mmc_host mmc2: start CMD13 arg 00010000
[   53.948516] mmc_host mmc2: start CMD18 arg 00004ff8
[   53.949005] mmc_host mmc2: start CMD18 arg 0005e010
[   53.949340] mmc_host mmc2: start CMD18 arg 0005e028
[   53.949707] mmc_host mmc2: start CMD18 arg 0005e048
[   53.950378] mmc_host mmc2: start CMD18 arg 0005e088
[   53.951812] mmc_host mmc2: start CMD18 arg 0005e108
[   53.954589] mmc_host mmc2: start CMD18 arg 00015038
[   53.955047] mmc2: modify EXT_CSD, index 179, value: 1, set 1, timing 0
[   53.955078] mmc_host mmc2: start CMD6 arg 03b30101
[   53.955169] mmc2 modify EXT_CSD completed (0)
[   53.955169] mmc_host mmc2: start CMD13 arg 00010000
[   53.955627] mmc_host mmc2: start CMD18 arg 00001c30
[   53.956115] mmc2: modify EXT_CSD, index 179, value: 0, set 1, timing 0
[   53.956146] mmc_host mmc2: start CMD6 arg 03b30001

Here it hangs forever, this last CMD6 never completes.

Yours,
Linus Walleij
diff mbox series

Patch

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 689eb9afeeed..e6cde68cda0e 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -2621,6 +2621,12 @@  static int mmc_blk_alloc_parts(struct mmc_card *card, struct mmc_blk_data *md)
 				card->part[idx].name);
 			if (ret)
 				return ret;
+		} else if((card->part[idx].area_type & MMC_BLK_DATA_AREA_BOOT) &&
+			  (card->quirks & MMC_QUIRK_BROKEN_BOOT_PARTITIONS)) {
+			pr_info("%s: skipping broken boot partition %s\n",
+				mmc_hostname(card->host),
+				card->part[idx].name);
+			continue;
 		} else if (card->part[idx].size) {
 			ret = mmc_blk_alloc_part(card, md,
 				card->part[idx].part_cfg,
diff --git a/drivers/mmc/core/quirks.h b/drivers/mmc/core/quirks.h
index d68e6e513a4f..aa4060c10aa9 100644
--- a/drivers/mmc/core/quirks.h
+++ b/drivers/mmc/core/quirks.h
@@ -99,6 +99,13 @@  static const struct mmc_fixup __maybe_unused mmc_blk_fixups[] = {
 	MMC_FIXUP("V10016", CID_MANFID_KINGSTON, CID_OEMID_ANY, add_quirk_mmc,
 		  MMC_QUIRK_TRIM_BROKEN),
 
+	/*
+	 * Some Samsung eMMCs have broken boot partitions, accessing them
+	 * randomly will make the device lock up and crash.
+	 */
+	MMC_FIXUP("4YMD1R", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
+		  MMC_QUIRK_BROKEN_BOOT_PARTITIONS),
+
 	END_FIXUP
 };
 
diff --git a/include/linux/mmc/card.h b/include/linux/mmc/card.h
index f9ad35dd6012..4006736f59dd 100644
--- a/include/linux/mmc/card.h
+++ b/include/linux/mmc/card.h
@@ -259,6 +259,7 @@  struct mmc_card {
 						/* for byte mode */
 #define MMC_QUIRK_NONSTD_SDIO	(1<<2)		/* non-standard SDIO card attached */
 						/* (missing CIA registers) */
+#define MMC_QUIRK_BROKEN_BOOT_PARTITIONS (1<<3)	/* Disable broken boot partitions */
 #define MMC_QUIRK_NONSTD_FUNC_IF (1<<4)		/* SDIO card has nonstd function interfaces */
 #define MMC_QUIRK_DISABLE_CD	(1<<5)		/* disconnect CD/DAT[3] resistor */
 #define MMC_QUIRK_INAND_CMD38	(1<<6)		/* iNAND devices have broken CMD38 */