diff mbox series

[v3] mmc: core: Add a card quirk for non-hw busy detection

Message ID 20210720144115.1525257-1-linus.walleij@linaro.org (mailing list archive)
State New, archived
Headers show
Series [v3] mmc: core: Add a card quirk for non-hw busy detection | expand

Commit Message

Linus Walleij July 20, 2021, 2:41 p.m. UTC
Some boot partitions on the Samsung 4GB KLM4G1YE4C "4YMD1R" and "M4G1YC"
cards appear broken when accessed randomly. CMD6 to switch back to the main
partition randomly stalls after CMD18 access to the boot partition 1, and
the card never comes back online. The accesses to the boot partitions work
several times before this happens, but eventually the card access hangs
while initializing the card.

Some problematic eMMC cards are found in the Samsung GT-S7710 (Skomer)
and SGH-I407 (Kyle) mobile phones.

I tried using only single blocks with CMD17 on the boot partitions with the
result that it crashed even faster.

After a bit of root cause analysis it turns out that these old eMMC cards
probably cannot do hardware busy detection (monitoring DAT0) properly.

The card survives on older kernels, but this is because recent kernels have
added busy detection handling for the SoC used in these phones, exposing
the issue.

Construct a quirk that makes the MMC cord avoid using the ->card_busy()
callback if the card is listed with MMC_QUIRK_BROKEN_HW_BUSY_DETECT and
register the known problematic cards. The core changes are pretty
straight-forward with a helper inline to check of we can use hardware
busy detection.

On the MMCI host we have to counter the fact that if the host was able to
use busy detect, it would be used unsolicited in the command IRQ callback.
Rewrite this so that MMCI will not attempt to use hardware busy detection
in the command IRQ until:
- A card is attached to the host and
- We know that the card can handle this busy detection

I have glanced over the ->card_busy() callbacks on some other hosts and
they seem to mostly read a register reflecting the value of DAT0 for this
which works fine with the quirk in this patch. However if the error appear
on other hosts they might need additional fixes.

After applying this patch, the main partition can be accessed and mounted
without problems on Samsung GT-S7710 and SGH-I407.

Fixes: cb0335b778c7 ("mmc: mmci: add busy_complete callback")
Cc: stable@vger.kernel.org
Cc: phone-devel@vger.kernel.org
Cc: Ludovic Barre <ludovic.barre@st.com>
Cc: Stephan Gerhold <stephan@gerhold.net>
Reported-by: newbyte@disroot.org
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
---
ChangeLog v2->v3:
- Rebase on v5.14-rc1
- Reword the commit message slightly.
ChangeLog v1->v2:
- Rewrite to reflect the actual problem of broken busy detection.
---
 drivers/mmc/core/core.c    |  8 ++++----
 drivers/mmc/core/core.h    | 17 +++++++++++++++++
 drivers/mmc/core/mmc_ops.c |  4 ++--
 drivers/mmc/core/quirks.h  | 21 +++++++++++++++++++++
 drivers/mmc/host/mmci.c    | 22 ++++++++++++++++++++--
 include/linux/mmc/card.h   |  1 +
 6 files changed, 65 insertions(+), 8 deletions(-)

Comments

Ulf Hansson Aug. 16, 2021, 1:31 p.m. UTC | #1
On Tue, 20 Jul 2021 at 16:43, Linus Walleij <linus.walleij@linaro.org> wrote:
>
> Some boot partitions on the Samsung 4GB KLM4G1YE4C "4YMD1R" and "M4G1YC"
> cards appear broken when accessed randomly. CMD6 to switch back to the main
> partition randomly stalls after CMD18 access to the boot partition 1, and
> the card never comes back online. The accesses to the boot partitions work
> several times before this happens, but eventually the card access hangs
> while initializing the card.
>
> Some problematic eMMC cards are found in the Samsung GT-S7710 (Skomer)
> and SGH-I407 (Kyle) mobile phones.
>
> I tried using only single blocks with CMD17 on the boot partitions with the
> result that it crashed even faster.
>
> After a bit of root cause analysis it turns out that these old eMMC cards
> probably cannot do hardware busy detection (monitoring DAT0) properly.
>
> The card survives on older kernels, but this is because recent kernels have
> added busy detection handling for the SoC used in these phones, exposing
> the issue.
>
> Construct a quirk that makes the MMC cord avoid using the ->card_busy()
> callback if the card is listed with MMC_QUIRK_BROKEN_HW_BUSY_DETECT and
> register the known problematic cards. The core changes are pretty
> straight-forward with a helper inline to check of we can use hardware
> busy detection.
>
> On the MMCI host we have to counter the fact that if the host was able to
> use busy detect, it would be used unsolicited in the command IRQ callback.
> Rewrite this so that MMCI will not attempt to use hardware busy detection
> in the command IRQ until:
> - A card is attached to the host and
> - We know that the card can handle this busy detection
>
> I have glanced over the ->card_busy() callbacks on some other hosts and
> they seem to mostly read a register reflecting the value of DAT0 for this
> which works fine with the quirk in this patch. However if the error appear
> on other hosts they might need additional fixes.
>
> After applying this patch, the main partition can be accessed and mounted
> without problems on Samsung GT-S7710 and SGH-I407.
>
> Fixes: cb0335b778c7 ("mmc: mmci: add busy_complete callback")
> Cc: stable@vger.kernel.org
> Cc: phone-devel@vger.kernel.org
> Cc: Ludovic Barre <ludovic.barre@st.com>
> Cc: Stephan Gerhold <stephan@gerhold.net>
> Reported-by: newbyte@disroot.org
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>

Hi Linus,

My apologies for the delay. Unfortunately I am still not convinced
that what you're suggesting in the $subject patch is the correct
solution. I think we should debug the issue further so we really
understand what goes on.

For example, when the controller hangs on the CMD6, what state are the
controler in? Did you confirm that the mmci driver is waiting for the
busyend IRQ, as our earlier discussions indicated? Perhaps a register
dump at the point of when the driver is hanging would tell us more,
for example.

I can certainly help with some suggestions for debugging, just let me
know when you have some time, then we can discuss this.

Kind regards
Uffe

> ---
> ChangeLog v2->v3:
> - Rebase on v5.14-rc1
> - Reword the commit message slightly.
> ChangeLog v1->v2:
> - Rewrite to reflect the actual problem of broken busy detection.
> ---
>  drivers/mmc/core/core.c    |  8 ++++----
>  drivers/mmc/core/core.h    | 17 +++++++++++++++++
>  drivers/mmc/core/mmc_ops.c |  4 ++--
>  drivers/mmc/core/quirks.h  | 21 +++++++++++++++++++++
>  drivers/mmc/host/mmci.c    | 22 ++++++++++++++++++++--
>  include/linux/mmc/card.h   |  1 +
>  6 files changed, 65 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
> index 95fedcf56e4a..e08dd9ea3d46 100644
> --- a/drivers/mmc/core/core.c
> +++ b/drivers/mmc/core/core.c
> @@ -232,7 +232,7 @@ static void __mmc_start_request(struct mmc_host *host, struct mmc_request *mrq)
>          * And bypass I/O abort, reset and bus suspend operations.
>          */
>         if (sdio_is_io_busy(mrq->cmd->opcode, mrq->cmd->arg) &&
> -           host->ops->card_busy) {
> +           mmc_hw_busy_detect(host)) {
>                 int tries = 500; /* Wait aprox 500ms at maximum */
>
>                 while (host->ops->card_busy(host) && --tries)
> @@ -1200,7 +1200,7 @@ int mmc_set_uhs_voltage(struct mmc_host *host, u32 ocr)
>          */
>         if (!host->ops->start_signal_voltage_switch)
>                 return -EPERM;
> -       if (!host->ops->card_busy)
> +       if (!mmc_hw_busy_detect(host))
>                 pr_warn("%s: cannot verify signal voltage switch\n",
>                         mmc_hostname(host));
>
> @@ -1220,7 +1220,7 @@ int mmc_set_uhs_voltage(struct mmc_host *host, u32 ocr)
>          * after the response of cmd11, but wait 1 ms to be sure
>          */
>         mmc_delay(1);
> -       if (host->ops->card_busy && !host->ops->card_busy(host)) {
> +       if (mmc_hw_busy_detect(host) && !host->ops->card_busy(host)) {
>                 err = -EAGAIN;
>                 goto power_cycle;
>         }
> @@ -1241,7 +1241,7 @@ int mmc_set_uhs_voltage(struct mmc_host *host, u32 ocr)
>          * Failure to switch is indicated by the card holding
>          * dat[0:3] low
>          */
> -       if (host->ops->card_busy && host->ops->card_busy(host))
> +       if (mmc_hw_busy_detect(host) && host->ops->card_busy(host))
>                 err = -EAGAIN;
>
>  power_cycle:
> diff --git a/drivers/mmc/core/core.h b/drivers/mmc/core/core.h
> index 0c4de2030b3f..6a5619eed4a6 100644
> --- a/drivers/mmc/core/core.h
> +++ b/drivers/mmc/core/core.h
> @@ -181,4 +181,21 @@ static inline int mmc_flush_cache(struct mmc_host *host)
>         return 0;
>  }
>
> +/**
> + * mmc_hw_busy_detect() - Can we use hw busy detection?
> + * @host: the host in question
> + */
> +static inline bool mmc_hw_busy_detect(struct mmc_host *host)
> +{
> +       struct mmc_card *card = host->card;
> +       bool has_ops;
> +       bool able = true;
> +
> +       has_ops = (host->ops->card_busy != NULL);
> +       if (card)
> +               able = !(card->quirks & MMC_QUIRK_BROKEN_HW_BUSY_DETECT);
> +
> +       return (has_ops && able);
> +}
> +
>  #endif
> diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
> index 973756ed4016..546fc799a8e5 100644
> --- a/drivers/mmc/core/mmc_ops.c
> +++ b/drivers/mmc/core/mmc_ops.c
> @@ -435,7 +435,7 @@ static int mmc_busy_cb(void *cb_data, bool *busy)
>         u32 status = 0;
>         int err;
>
> -       if (host->ops->card_busy) {
> +       if (mmc_hw_busy_detect(host)) {
>                 *busy = host->ops->card_busy(host);
>                 return 0;
>         }
> @@ -597,7 +597,7 @@ int __mmc_switch(struct mmc_card *card, u8 set, u8 index, u8 value,
>          * when it's not allowed to poll by using CMD13, then we need to rely on
>          * waiting the stated timeout to be sufficient.
>          */
> -       if (!send_status && !host->ops->card_busy) {
> +       if (!send_status && !mmc_hw_busy_detect(host)) {
>                 mmc_delay(timeout_ms);
>                 goto out_tim;
>         }
> diff --git a/drivers/mmc/core/quirks.h b/drivers/mmc/core/quirks.h
> index d68e6e513a4f..8da6526f0eb0 100644
> --- a/drivers/mmc/core/quirks.h
> +++ b/drivers/mmc/core/quirks.h
> @@ -99,6 +99,27 @@ static const struct mmc_fixup __maybe_unused mmc_blk_fixups[] = {
>         MMC_FIXUP("V10016", CID_MANFID_KINGSTON, CID_OEMID_ANY, add_quirk_mmc,
>                   MMC_QUIRK_TRIM_BROKEN),
>
> +       /*
> +        * Some older Samsung eMMCs have broken hardware busy detection.
> +        * Enabling this feature in the host controller can make the card
> +        * accesses lock up completely.
> +        */
> +       MMC_FIXUP("4YMD1R", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
> +                 MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
> +       /* Samsung KLMxGxxE4x eMMCs from 2012: 4, 8, 16, 32 and 64 GB */
> +       MMC_FIXUP("M4G1YC", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
> +                 MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
> +       MMC_FIXUP("M8G1WA", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
> +                 MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
> +       MMC_FIXUP("MAG2WA", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
> +                 MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
> +       MMC_FIXUP("MBG4WA", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
> +                 MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
> +       MMC_FIXUP("MAG2WA", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
> +                 MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
> +       MMC_FIXUP("MCG8WA", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
> +                 MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
> +
>         END_FIXUP
>  };
>
> diff --git a/drivers/mmc/host/mmci.c b/drivers/mmc/host/mmci.c
> index 984d35055156..3046917b2b67 100644
> --- a/drivers/mmc/host/mmci.c
> +++ b/drivers/mmc/host/mmci.c
> @@ -347,6 +347,24 @@ static int mmci_card_busy(struct mmc_host *mmc)
>         return busy;
>  }
>
> +/* Use this if the MMCI variant AND the card supports it */
> +static bool mmci_use_busy_detect(struct mmci_host *host)
> +{
> +       struct mmc_card *card = host->mmc->card;
> +
> +       if (!host->variant->busy_detect)
> +               return false;
> +
> +       /* We don't allow this until we know that the card can handle it */
> +       if (!card)
> +               return false;
> +
> +       if (card->quirks & MMC_QUIRK_BROKEN_HW_BUSY_DETECT)
> +               return false;
> +
> +       return true;
> +}
> +
>  static void mmci_reg_delay(struct mmci_host *host)
>  {
>         /*
> @@ -1381,7 +1399,7 @@ mmci_cmd_irq(struct mmci_host *host, struct mmc_command *cmd,
>                 return;
>
>         /* Handle busy detection on DAT0 if the variant supports it. */
> -       if (busy_resp && host->variant->busy_detect)
> +       if (busy_resp && mmci_use_busy_detect(host))
>                 if (!host->ops->busy_complete(host, status, err_msk))
>                         return;
>
> @@ -1725,7 +1743,7 @@ static void mmci_set_max_busy_timeout(struct mmc_host *mmc)
>         struct mmci_host *host = mmc_priv(mmc);
>         u32 max_busy_timeout = 0;
>
> -       if (!host->variant->busy_detect)
> +       if (!mmci_use_busy_detect(host))
>                 return;
>
>         if (host->variant->busy_timeout && mmc->actual_clock)
> diff --git a/include/linux/mmc/card.h b/include/linux/mmc/card.h
> index 74e6c0624d27..525a39951c6d 100644
> --- a/include/linux/mmc/card.h
> +++ b/include/linux/mmc/card.h
> @@ -280,6 +280,7 @@ struct mmc_card {
>                                                 /* for byte mode */
>  #define MMC_QUIRK_NONSTD_SDIO  (1<<2)          /* non-standard SDIO card attached */
>                                                 /* (missing CIA registers) */
> +#define MMC_QUIRK_BROKEN_HW_BUSY_DETECT (1<<3) /* Disable hardware busy detection on DAT0 */
>  #define MMC_QUIRK_NONSTD_FUNC_IF (1<<4)                /* SDIO card has nonstd function interfaces */
>  #define MMC_QUIRK_DISABLE_CD   (1<<5)          /* disconnect CD/DAT[3] resistor */
>  #define MMC_QUIRK_INAND_CMD38  (1<<6)          /* iNAND devices have broken CMD38 */
> --
> 2.31.1
>
Yann Gautier Aug. 16, 2021, 2:03 p.m. UTC | #2
On 7/20/21 4:41 PM, Linus Walleij wrote:
> Some boot partitions on the Samsung 4GB KLM4G1YE4C "4YMD1R" and "M4G1YC"
> cards appear broken when accessed randomly. CMD6 to switch back to the main
> partition randomly stalls after CMD18 access to the boot partition 1, and
> the card never comes back online. The accesses to the boot partitions work
> several times before this happens, but eventually the card access hangs
> while initializing the card.
> 
> Some problematic eMMC cards are found in the Samsung GT-S7710 (Skomer)
> and SGH-I407 (Kyle) mobile phones.
> 
> I tried using only single blocks with CMD17 on the boot partitions with the
> result that it crashed even faster.
> 
> After a bit of root cause analysis it turns out that these old eMMC cards
> probably cannot do hardware busy detection (monitoring DAT0) properly.
> 
> The card survives on older kernels, but this is because recent kernels have
> added busy detection handling for the SoC used in these phones, exposing
> the issue.
> 
> Construct a quirk that makes the MMC cord avoid using the ->card_busy()
> callback if the card is listed with MMC_QUIRK_BROKEN_HW_BUSY_DETECT and
> register the known problematic cards. The core changes are pretty
> straight-forward with a helper inline to check of we can use hardware
> busy detection.
> 
> On the MMCI host we have to counter the fact that if the host was able to
> use busy detect, it would be used unsolicited in the command IRQ callback.
> Rewrite this so that MMCI will not attempt to use hardware busy detection
> in the command IRQ until:
> - A card is attached to the host and
> - We know that the card can handle this busy detection
> 
> I have glanced over the ->card_busy() callbacks on some other hosts and
> they seem to mostly read a register reflecting the value of DAT0 for this
> which works fine with the quirk in this patch. However if the error appear
> on other hosts they might need additional fixes.
> 
> After applying this patch, the main partition can be accessed and mounted
> without problems on Samsung GT-S7710 and SGH-I407.
> 
> Fixes: cb0335b778c7 ("mmc: mmci: add busy_complete callback")
> Cc: stable@vger.kernel.org
> Cc: phone-devel@vger.kernel.org
> Cc: Ludovic Barre <ludovic.barre@st.com>
> Cc: Stephan Gerhold <stephan@gerhold.net>
> Reported-by: newbyte@disroot.org
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>

Hi Linus,

I was just testing your patch on top of mmc/next.
Whereas mmc/next is fine, with your patch I fail to pass MMC test 5 
(Multi-block write).
I've got this error on STM32MP157C-EV1 board:
[  108.956218] mmc0: Starting tests of card mmc0:aaaa...
[  108.959862] mmc0: Test case 5. Multi-block write...
[  108.995615] mmc0: Warning: Host did not wait for busy state to end.
[  109.000483] mmc0: Result: ERROR (-110)
Then nothing more happens.

The test was done on an SD-card Sandisk Extreme Pro SDXC UHS-I mark 3, 
in DDR50 mode.

I'll try to add more traces to see what happens.


Best regards,
Yann


> ---
> ChangeLog v2->v3:
> - Rebase on v5.14-rc1
> - Reword the commit message slightly.
> ChangeLog v1->v2:
> - Rewrite to reflect the actual problem of broken busy detection.
> ---
>   drivers/mmc/core/core.c    |  8 ++++----
>   drivers/mmc/core/core.h    | 17 +++++++++++++++++
>   drivers/mmc/core/mmc_ops.c |  4 ++--
>   drivers/mmc/core/quirks.h  | 21 +++++++++++++++++++++
>   drivers/mmc/host/mmci.c    | 22 ++++++++++++++++++++--
>   include/linux/mmc/card.h   |  1 +
>   6 files changed, 65 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
> index 95fedcf56e4a..e08dd9ea3d46 100644
> --- a/drivers/mmc/core/core.c
> +++ b/drivers/mmc/core/core.c
> @@ -232,7 +232,7 @@ static void __mmc_start_request(struct mmc_host *host, struct mmc_request *mrq)
>   	 * And bypass I/O abort, reset and bus suspend operations.
>   	 */
>   	if (sdio_is_io_busy(mrq->cmd->opcode, mrq->cmd->arg) &&
> -	    host->ops->card_busy) {
> +	    mmc_hw_busy_detect(host)) {
>   		int tries = 500; /* Wait aprox 500ms at maximum */
>   
>   		while (host->ops->card_busy(host) && --tries)
> @@ -1200,7 +1200,7 @@ int mmc_set_uhs_voltage(struct mmc_host *host, u32 ocr)
>   	 */
>   	if (!host->ops->start_signal_voltage_switch)
>   		return -EPERM;
> -	if (!host->ops->card_busy)
> +	if (!mmc_hw_busy_detect(host))
>   		pr_warn("%s: cannot verify signal voltage switch\n",
>   			mmc_hostname(host));
>   
> @@ -1220,7 +1220,7 @@ int mmc_set_uhs_voltage(struct mmc_host *host, u32 ocr)
>   	 * after the response of cmd11, but wait 1 ms to be sure
>   	 */
>   	mmc_delay(1);
> -	if (host->ops->card_busy && !host->ops->card_busy(host)) {
> +	if (mmc_hw_busy_detect(host) && !host->ops->card_busy(host)) {
>   		err = -EAGAIN;
>   		goto power_cycle;
>   	}
> @@ -1241,7 +1241,7 @@ int mmc_set_uhs_voltage(struct mmc_host *host, u32 ocr)
>   	 * Failure to switch is indicated by the card holding
>   	 * dat[0:3] low
>   	 */
> -	if (host->ops->card_busy && host->ops->card_busy(host))
> +	if (mmc_hw_busy_detect(host) && host->ops->card_busy(host))
>   		err = -EAGAIN;
>   
>   power_cycle:
> diff --git a/drivers/mmc/core/core.h b/drivers/mmc/core/core.h
> index 0c4de2030b3f..6a5619eed4a6 100644
> --- a/drivers/mmc/core/core.h
> +++ b/drivers/mmc/core/core.h
> @@ -181,4 +181,21 @@ static inline int mmc_flush_cache(struct mmc_host *host)
>   	return 0;
>   }
>   
> +/**
> + * mmc_hw_busy_detect() - Can we use hw busy detection?
> + * @host: the host in question
> + */
> +static inline bool mmc_hw_busy_detect(struct mmc_host *host)
> +{
> +	struct mmc_card *card = host->card;
> +	bool has_ops;
> +	bool able = true;
> +
> +	has_ops = (host->ops->card_busy != NULL);
> +	if (card)
> +		able = !(card->quirks & MMC_QUIRK_BROKEN_HW_BUSY_DETECT);
> +
> +	return (has_ops && able);
> +}
> +
>   #endif
> diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
> index 973756ed4016..546fc799a8e5 100644
> --- a/drivers/mmc/core/mmc_ops.c
> +++ b/drivers/mmc/core/mmc_ops.c
> @@ -435,7 +435,7 @@ static int mmc_busy_cb(void *cb_data, bool *busy)
>   	u32 status = 0;
>   	int err;
>   
> -	if (host->ops->card_busy) {
> +	if (mmc_hw_busy_detect(host)) {
>   		*busy = host->ops->card_busy(host);
>   		return 0;
>   	}
> @@ -597,7 +597,7 @@ int __mmc_switch(struct mmc_card *card, u8 set, u8 index, u8 value,
>   	 * when it's not allowed to poll by using CMD13, then we need to rely on
>   	 * waiting the stated timeout to be sufficient.
>   	 */
> -	if (!send_status && !host->ops->card_busy) {
> +	if (!send_status && !mmc_hw_busy_detect(host)) {
>   		mmc_delay(timeout_ms);
>   		goto out_tim;
>   	}
> diff --git a/drivers/mmc/core/quirks.h b/drivers/mmc/core/quirks.h
> index d68e6e513a4f..8da6526f0eb0 100644
> --- a/drivers/mmc/core/quirks.h
> +++ b/drivers/mmc/core/quirks.h
> @@ -99,6 +99,27 @@ static const struct mmc_fixup __maybe_unused mmc_blk_fixups[] = {
>   	MMC_FIXUP("V10016", CID_MANFID_KINGSTON, CID_OEMID_ANY, add_quirk_mmc,
>   		  MMC_QUIRK_TRIM_BROKEN),
>   
> +	/*
> +	 * Some older Samsung eMMCs have broken hardware busy detection.
> +	 * Enabling this feature in the host controller can make the card
> +	 * accesses lock up completely.
> +	 */
> +	MMC_FIXUP("4YMD1R", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
> +		  MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
> +	/* Samsung KLMxGxxE4x eMMCs from 2012: 4, 8, 16, 32 and 64 GB */
> +	MMC_FIXUP("M4G1YC", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
> +		  MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
> +	MMC_FIXUP("M8G1WA", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
> +		  MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
> +	MMC_FIXUP("MAG2WA", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
> +		  MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
> +	MMC_FIXUP("MBG4WA", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
> +		  MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
> +	MMC_FIXUP("MAG2WA", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
> +		  MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
> +	MMC_FIXUP("MCG8WA", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
> +		  MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
> +
>   	END_FIXUP
>   };
>   
> diff --git a/drivers/mmc/host/mmci.c b/drivers/mmc/host/mmci.c
> index 984d35055156..3046917b2b67 100644
> --- a/drivers/mmc/host/mmci.c
> +++ b/drivers/mmc/host/mmci.c
> @@ -347,6 +347,24 @@ static int mmci_card_busy(struct mmc_host *mmc)
>   	return busy;
>   }
>   
> +/* Use this if the MMCI variant AND the card supports it */
> +static bool mmci_use_busy_detect(struct mmci_host *host)
> +{
> +	struct mmc_card *card = host->mmc->card;
> +
> +	if (!host->variant->busy_detect)
> +		return false;
> +
> +	/* We don't allow this until we know that the card can handle it */
> +	if (!card)
> +		return false;
> +
> +	if (card->quirks & MMC_QUIRK_BROKEN_HW_BUSY_DETECT)
> +		return false;
> +
> +	return true;
> +}
> +
>   static void mmci_reg_delay(struct mmci_host *host)
>   {
>   	/*
> @@ -1381,7 +1399,7 @@ mmci_cmd_irq(struct mmci_host *host, struct mmc_command *cmd,
>   		return;
>   
>   	/* Handle busy detection on DAT0 if the variant supports it. */
> -	if (busy_resp && host->variant->busy_detect)
> +	if (busy_resp && mmci_use_busy_detect(host))
>   		if (!host->ops->busy_complete(host, status, err_msk))
>   			return;
>   
> @@ -1725,7 +1743,7 @@ static void mmci_set_max_busy_timeout(struct mmc_host *mmc)
>   	struct mmci_host *host = mmc_priv(mmc);
>   	u32 max_busy_timeout = 0;
>   
> -	if (!host->variant->busy_detect)
> +	if (!mmci_use_busy_detect(host))
>   		return;
>   
>   	if (host->variant->busy_timeout && mmc->actual_clock)
> diff --git a/include/linux/mmc/card.h b/include/linux/mmc/card.h
> index 74e6c0624d27..525a39951c6d 100644
> --- a/include/linux/mmc/card.h
> +++ b/include/linux/mmc/card.h
> @@ -280,6 +280,7 @@ struct mmc_card {
>   						/* for byte mode */
>   #define MMC_QUIRK_NONSTD_SDIO	(1<<2)		/* non-standard SDIO card attached */
>   						/* (missing CIA registers) */
> +#define MMC_QUIRK_BROKEN_HW_BUSY_DETECT (1<<3)	/* Disable hardware busy detection on DAT0 */
>   #define MMC_QUIRK_NONSTD_FUNC_IF (1<<4)		/* SDIO card has nonstd function interfaces */
>   #define MMC_QUIRK_DISABLE_CD	(1<<5)		/* disconnect CD/DAT[3] resistor */
>   #define MMC_QUIRK_INAND_CMD38	(1<<6)		/* iNAND devices have broken CMD38 */
>
Linus Walleij Aug. 16, 2021, 10:37 p.m. UTC | #3
On Mon, Aug 16, 2021 at 4:03 PM Yann Gautier <yann.gautier@foss.st.com> wrote:

> I was just testing your patch on top of mmc/next.
> Whereas mmc/next is fine, with your patch I fail to pass MMC test 5
> (Multi-block write).
> I've got this error on STM32MP157C-EV1 board:
> [  108.956218] mmc0: Starting tests of card mmc0:aaaa...
> [  108.959862] mmc0: Test case 5. Multi-block write...
> [  108.995615] mmc0: Warning: Host did not wait for busy state to end.
> [  109.000483] mmc0: Result: ERROR (-110)
> Then nothing more happens.
>
> The test was done on an SD-card Sandisk Extreme Pro SDXC UHS-I mark 3,
> in DDR50 mode.
>
> I'll try to add more traces to see what happens.

What I think happens is:
- You are using the MMCI driver (correct?)
- My patch augments the driver to not use busydetect until we have
  determined that the card can do it (after reading extcsd etc)
- Before this patch, the MMCI would unconditionally use HW
  busy detect on any card.

Either we have managed to wire the MMCI driver so that it doesn't
work without HW busy detect anymore, you can easily test this
by doing this:

diff --git a/drivers/mmc/host/mmci.c b/drivers/mmc/host/mmci.c
index 3765e2f4ad98..3a35f65491c8 100644
--- a/drivers/mmc/host/mmci.c
+++ b/drivers/mmc/host/mmci.c
@@ -270,10 +270,10 @@ static struct variant_data variant_stm32_sdmmc = {
        .datactrl_any_blocksz   = true,
        .datactrl_mask_sdio     = MCI_DPSM_ST_SDIOEN,
        .stm32_idmabsize_mask   = GENMASK(12, 5),
-       .busy_timeout           = true,
-       .busy_detect            = true,
-       .busy_detect_flag       = MCI_STM32_BUSYD0,
-       .busy_detect_mask       = MCI_STM32_BUSYD0ENDMASK,
+       //.busy_timeout         = true,
+       //.busy_detect          = true,
+       //.busy_detect_flag     = MCI_STM32_BUSYD0,
+       //.busy_detect_mask     = MCI_STM32_BUSYD0ENDMASK,
        .init                   = sdmmc_variant_init,
 };

@@ -297,10 +297,10 @@ static struct variant_data variant_stm32_sdmmcv2 = {
        .datactrl_mask_sdio     = MCI_DPSM_ST_SDIOEN,
        .stm32_idmabsize_mask   = GENMASK(16, 5),
        .dma_lli                = true,
-       .busy_timeout           = true,
-       .busy_detect            = true,
-       .busy_detect_flag       = MCI_STM32_BUSYD0,
-       .busy_detect_mask       = MCI_STM32_BUSYD0ENDMASK,
+       //.busy_timeout         = true,
+       //.busy_detect          = true,
+       //.busy_detect_flag     = MCI_STM32_BUSYD0,
+       //.busy_detect_mask     = MCI_STM32_BUSYD0ENDMASK,
        .init                   = sdmmc_variant_init,

Or else there is a card that cannot work without busy detect which
I find unlikely.

Yours,
Linus Walleij
Yann Gautier Aug. 24, 2021, 3:50 p.m. UTC | #4
On 8/17/21 12:37 AM, Linus Walleij wrote:
> On Mon, Aug 16, 2021 at 4:03 PM Yann Gautier <yann.gautier@foss.st.com> wrote:
> 
>> I was just testing your patch on top of mmc/next.
>> Whereas mmc/next is fine, with your patch I fail to pass MMC test 5
>> (Multi-block write).
>> I've got this error on STM32MP157C-EV1 board:
>> [  108.956218] mmc0: Starting tests of card mmc0:aaaa...
>> [  108.959862] mmc0: Test case 5. Multi-block write...
>> [  108.995615] mmc0: Warning: Host did not wait for busy state to end.
>> [  109.000483] mmc0: Result: ERROR (-110)
>> Then nothing more happens.
>>
>> The test was done on an SD-card Sandisk Extreme Pro SDXC UHS-I mark 3,
>> in DDR50 mode.
>>
>> I'll try to add more traces to see what happens.
> 

Hi Linus

> What I think happens is:
> - You are using the MMCI driver (correct?)
Yes

> - My patch augments the driver to not use busydetect until we have
>    determined that the card can do it (after reading extcsd etc)
> - Before this patch, the MMCI would unconditionally use HW
>    busy detect on any card.
I finally found the problem.
The assignment of host->card is done at the end of mmc_sd_init_card().
But mmci_set_max_busy_timeout() is called in mmc_sd_init_card() before 
that, and card is then NULL at that time. This let me a 
mmc->max_busy_timeout = 0. And this value is no more updated.
mmci_start_command() will then have a unexpected behavior with that 0 value.

Maybe we should not use mmci_use_busy_detect() in 
mmci_set_max_busy_timeout()?

If I use this patch on top of yours (reverting the 
mmci_set_max_busy_timeout() change), all the mmc tests pass on the 
SD-card I was testing:

@@ -1741,11 +1741,11 @@ static void mmci_request(struct mmc_host *mmc, 
struct mmc_request *mrq)
  static void mmci_set_max_busy_timeout(struct mmc_host *mmc)
  {
  	struct mmci_host *host = mmc_priv(mmc);
  	u32 max_busy_timeout = 0;

-	if (!mmci_use_busy_detect(host))
+	if (!host->variant->busy_detect)
  		return;

  	if (host->variant->busy_timeout && mmc->actual_clock)
  		max_busy_timeout = ~0UL / (mmc->actual_clock / MSEC_PER_SEC);


> 
> Either we have managed to wire the MMCI driver so that it doesn't
> work without HW busy detect anymore, you can easily test this
> by doing this:
> 
> diff --git a/drivers/mmc/host/mmci.c b/drivers/mmc/host/mmci.c
> index 3765e2f4ad98..3a35f65491c8 100644
> --- a/drivers/mmc/host/mmci.c
> +++ b/drivers/mmc/host/mmci.c
> @@ -270,10 +270,10 @@ static struct variant_data variant_stm32_sdmmc = {
>          .datactrl_any_blocksz   = true,
>          .datactrl_mask_sdio     = MCI_DPSM_ST_SDIOEN,
>          .stm32_idmabsize_mask   = GENMASK(12, 5),
> -       .busy_timeout           = true,
> -       .busy_detect            = true,
> -       .busy_detect_flag       = MCI_STM32_BUSYD0,
> -       .busy_detect_mask       = MCI_STM32_BUSYD0ENDMASK,
> +       //.busy_timeout         = true,
> +       //.busy_detect          = true,
> +       //.busy_detect_flag     = MCI_STM32_BUSYD0,
> +       //.busy_detect_mask     = MCI_STM32_BUSYD0ENDMASK,
>          .init                   = sdmmc_variant_init,
>   };
> 
> @@ -297,10 +297,10 @@ static struct variant_data variant_stm32_sdmmcv2 = {
>          .datactrl_mask_sdio     = MCI_DPSM_ST_SDIOEN,
>          .stm32_idmabsize_mask   = GENMASK(16, 5),
>          .dma_lli                = true,
> -       .busy_timeout           = true,
> -       .busy_detect            = true,
> -       .busy_detect_flag       = MCI_STM32_BUSYD0,
> -       .busy_detect_mask       = MCI_STM32_BUSYD0ENDMASK,
> +       //.busy_timeout         = true,
> +       //.busy_detect          = true,
> +       //.busy_detect_flag     = MCI_STM32_BUSYD0,
> +       //.busy_detect_mask     = MCI_STM32_BUSYD0ENDMASK,

This was working, but disabling HW busy detection is not really what we 
want.

>          .init                   = sdmmc_variant_init,
> 
> Or else there is a card that cannot work without busy detect which
> I find unlikely.
> 
> Yours,
> Linus Walleij >


I have the same kind of issues with the eMMC on the STM32MP157C-EV1 
board. But here it fails at boot when trying to enable HPI, in mmc_switch().


I then updated the patch like this:
@@ -357,7 +357,7 @@ static bool mmci_use_busy_detect(struct mmci_host *host)

         /* We don't allow this until we know that the card can handle it */
         if (!card)
-               return false;
+               return true;


And it then works for all my use-cases, but I suppose that's not what 
you wanted to do.

So I guess we need to have the mmc_card structure, to determine if we 
have the quirk, but not from the mmc_host. Through some new callback?


Best regards,
Yann
Linus Walleij Feb. 13, 2022, 11:56 p.m. UTC | #5
Some updates on this issue:

We no longer get lockups as of kernel v5.17-rc1.

The console however says things like this:

[    1.979485] mmci-pl18x 80005000.mmc: mmc2: PL180 manf 80 rev4 at
0x80005000 irq 93,0 (pio)
[    1.987943] mmci-pl18x 80005000.mmc: DMA channels RX dma0chan6, TX dma0chan7
[    3.204496] mmc2: Card stuck being busy! __mmc_poll_for_busy
[    4.284431] mmc2: Card stuck being busy! __mmc_poll_for_busy

And after this it never finds the partitions on the card.

So the issue isn't gone, eMMC is still unusable on Skomer and Codina,
but the issue no longer hangs the whole MMC/SD stack which is nicer.

Yours,
Linus Walleij
diff mbox series

Patch

diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 95fedcf56e4a..e08dd9ea3d46 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -232,7 +232,7 @@  static void __mmc_start_request(struct mmc_host *host, struct mmc_request *mrq)
 	 * And bypass I/O abort, reset and bus suspend operations.
 	 */
 	if (sdio_is_io_busy(mrq->cmd->opcode, mrq->cmd->arg) &&
-	    host->ops->card_busy) {
+	    mmc_hw_busy_detect(host)) {
 		int tries = 500; /* Wait aprox 500ms at maximum */
 
 		while (host->ops->card_busy(host) && --tries)
@@ -1200,7 +1200,7 @@  int mmc_set_uhs_voltage(struct mmc_host *host, u32 ocr)
 	 */
 	if (!host->ops->start_signal_voltage_switch)
 		return -EPERM;
-	if (!host->ops->card_busy)
+	if (!mmc_hw_busy_detect(host))
 		pr_warn("%s: cannot verify signal voltage switch\n",
 			mmc_hostname(host));
 
@@ -1220,7 +1220,7 @@  int mmc_set_uhs_voltage(struct mmc_host *host, u32 ocr)
 	 * after the response of cmd11, but wait 1 ms to be sure
 	 */
 	mmc_delay(1);
-	if (host->ops->card_busy && !host->ops->card_busy(host)) {
+	if (mmc_hw_busy_detect(host) && !host->ops->card_busy(host)) {
 		err = -EAGAIN;
 		goto power_cycle;
 	}
@@ -1241,7 +1241,7 @@  int mmc_set_uhs_voltage(struct mmc_host *host, u32 ocr)
 	 * Failure to switch is indicated by the card holding
 	 * dat[0:3] low
 	 */
-	if (host->ops->card_busy && host->ops->card_busy(host))
+	if (mmc_hw_busy_detect(host) && host->ops->card_busy(host))
 		err = -EAGAIN;
 
 power_cycle:
diff --git a/drivers/mmc/core/core.h b/drivers/mmc/core/core.h
index 0c4de2030b3f..6a5619eed4a6 100644
--- a/drivers/mmc/core/core.h
+++ b/drivers/mmc/core/core.h
@@ -181,4 +181,21 @@  static inline int mmc_flush_cache(struct mmc_host *host)
 	return 0;
 }
 
+/**
+ * mmc_hw_busy_detect() - Can we use hw busy detection?
+ * @host: the host in question
+ */
+static inline bool mmc_hw_busy_detect(struct mmc_host *host)
+{
+	struct mmc_card *card = host->card;
+	bool has_ops;
+	bool able = true;
+
+	has_ops = (host->ops->card_busy != NULL);
+	if (card)
+		able = !(card->quirks & MMC_QUIRK_BROKEN_HW_BUSY_DETECT);
+
+	return (has_ops && able);
+}
+
 #endif
diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
index 973756ed4016..546fc799a8e5 100644
--- a/drivers/mmc/core/mmc_ops.c
+++ b/drivers/mmc/core/mmc_ops.c
@@ -435,7 +435,7 @@  static int mmc_busy_cb(void *cb_data, bool *busy)
 	u32 status = 0;
 	int err;
 
-	if (host->ops->card_busy) {
+	if (mmc_hw_busy_detect(host)) {
 		*busy = host->ops->card_busy(host);
 		return 0;
 	}
@@ -597,7 +597,7 @@  int __mmc_switch(struct mmc_card *card, u8 set, u8 index, u8 value,
 	 * when it's not allowed to poll by using CMD13, then we need to rely on
 	 * waiting the stated timeout to be sufficient.
 	 */
-	if (!send_status && !host->ops->card_busy) {
+	if (!send_status && !mmc_hw_busy_detect(host)) {
 		mmc_delay(timeout_ms);
 		goto out_tim;
 	}
diff --git a/drivers/mmc/core/quirks.h b/drivers/mmc/core/quirks.h
index d68e6e513a4f..8da6526f0eb0 100644
--- a/drivers/mmc/core/quirks.h
+++ b/drivers/mmc/core/quirks.h
@@ -99,6 +99,27 @@  static const struct mmc_fixup __maybe_unused mmc_blk_fixups[] = {
 	MMC_FIXUP("V10016", CID_MANFID_KINGSTON, CID_OEMID_ANY, add_quirk_mmc,
 		  MMC_QUIRK_TRIM_BROKEN),
 
+	/*
+	 * Some older Samsung eMMCs have broken hardware busy detection.
+	 * Enabling this feature in the host controller can make the card
+	 * accesses lock up completely.
+	 */
+	MMC_FIXUP("4YMD1R", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
+		  MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
+	/* Samsung KLMxGxxE4x eMMCs from 2012: 4, 8, 16, 32 and 64 GB */
+	MMC_FIXUP("M4G1YC", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
+		  MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
+	MMC_FIXUP("M8G1WA", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
+		  MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
+	MMC_FIXUP("MAG2WA", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
+		  MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
+	MMC_FIXUP("MBG4WA", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
+		  MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
+	MMC_FIXUP("MAG2WA", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
+		  MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
+	MMC_FIXUP("MCG8WA", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
+		  MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
+
 	END_FIXUP
 };
 
diff --git a/drivers/mmc/host/mmci.c b/drivers/mmc/host/mmci.c
index 984d35055156..3046917b2b67 100644
--- a/drivers/mmc/host/mmci.c
+++ b/drivers/mmc/host/mmci.c
@@ -347,6 +347,24 @@  static int mmci_card_busy(struct mmc_host *mmc)
 	return busy;
 }
 
+/* Use this if the MMCI variant AND the card supports it */
+static bool mmci_use_busy_detect(struct mmci_host *host)
+{
+	struct mmc_card *card = host->mmc->card;
+
+	if (!host->variant->busy_detect)
+		return false;
+
+	/* We don't allow this until we know that the card can handle it */
+	if (!card)
+		return false;
+
+	if (card->quirks & MMC_QUIRK_BROKEN_HW_BUSY_DETECT)
+		return false;
+
+	return true;
+}
+
 static void mmci_reg_delay(struct mmci_host *host)
 {
 	/*
@@ -1381,7 +1399,7 @@  mmci_cmd_irq(struct mmci_host *host, struct mmc_command *cmd,
 		return;
 
 	/* Handle busy detection on DAT0 if the variant supports it. */
-	if (busy_resp && host->variant->busy_detect)
+	if (busy_resp && mmci_use_busy_detect(host))
 		if (!host->ops->busy_complete(host, status, err_msk))
 			return;
 
@@ -1725,7 +1743,7 @@  static void mmci_set_max_busy_timeout(struct mmc_host *mmc)
 	struct mmci_host *host = mmc_priv(mmc);
 	u32 max_busy_timeout = 0;
 
-	if (!host->variant->busy_detect)
+	if (!mmci_use_busy_detect(host))
 		return;
 
 	if (host->variant->busy_timeout && mmc->actual_clock)
diff --git a/include/linux/mmc/card.h b/include/linux/mmc/card.h
index 74e6c0624d27..525a39951c6d 100644
--- a/include/linux/mmc/card.h
+++ b/include/linux/mmc/card.h
@@ -280,6 +280,7 @@  struct mmc_card {
 						/* for byte mode */
 #define MMC_QUIRK_NONSTD_SDIO	(1<<2)		/* non-standard SDIO card attached */
 						/* (missing CIA registers) */
+#define MMC_QUIRK_BROKEN_HW_BUSY_DETECT (1<<3)	/* Disable hardware busy detection on DAT0 */
 #define MMC_QUIRK_NONSTD_FUNC_IF (1<<4)		/* SDIO card has nonstd function interfaces */
 #define MMC_QUIRK_DISABLE_CD	(1<<5)		/* disconnect CD/DAT[3] resistor */
 #define MMC_QUIRK_INAND_CMD38	(1<<6)		/* iNAND devices have broken CMD38 */