diff mbox series

mmc: dw_mmc: Disable SDIO interrupts while suspended to fix suspend/resume

Message ID 20190410221237.160856-1-dianders@chromium.org
State New, archived
Headers show
Series mmc: dw_mmc: Disable SDIO interrupts while suspended to fix suspend/resume | expand

Commit Message

Doug Anderson April 10, 2019, 10:12 p.m. UTC
Processing SDIO interrupts while dw_mmc is suspended (or partly
suspended) seems like a bad idea.  We really don't want to be
processing them until we've gotten ourselves fully powered up.

You might be wondering how it's even possible to become suspended when
an SDIO interrupt is active.  As can be seen in
dw_mci_enable_sdio_irq(), we explicitly keep dw_mmc out of runtime
suspend when the SDIO interrupt is enabled.  ...but even though we
stop normal runtime suspend transitions when SDIO interrupts are
enabled, the dw_mci_runtime_suspend() can still get called for a full
system suspend.

Let's handle all this by explicitly masking SDIO interrupts in the
suspend call and unmasking them later in the resume call.  To do this
cleanly I'll keep track of whether the client requested that SDIO
interrupts be enabled so that we can reliably restore them regardless
of whether we're masking them for one reason or another.

Without this fix it can be seen that rk3288-veyron Chromebooks with
Marvell WiFi would sometimes fail to resume WiFi even after picking my
recent mwifiex patch [1].  Specifically you'd see messages like this:
  mwifiex_sdio mmc1:0001:1: Firmware wakeup failed
  mwifiex_sdio mmc1:0001:1: PREP_CMD: FW in reset state

...and tracing through the resume code in the failing cases showed
that we were processing a SDIO interrupt really early in the resume
call.

NOTE: downstream in Chrome OS 3.14 and 3.18 kernels (both of which
support the Marvell SDIO WiFi card) we had a patch ("CHROMIUM: sdio:
Defer SDIO interrupt handling until after resume") [2].  Presumably
this is the same problem that was solved by that patch.

[1] https://lkml.kernel.org/r/20190404040106.40519-1-dianders@chromium.org
[2] https://crrev.com/c/230765

Cc: <stable@vger.kernel.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
I didn't put any "Fixes" tag here, but presumably this could be
backported to whichever kernels folks found it useful for.  I have at
least confirmed that kernels v4.14 and v4.19 (as well as v5.1-rc2)
show the problem.  It is very easy to pick this to v4.19 and it
definitely fixes the problem there.

I haven't spent the time to pick this to 4.14 myself, but presumably
it wouldn't be too hard to backport this as far as v4.13 since that
contains commit 32dba73772f8 ("mmc: dw_mmc: Convert to use
MMC_CAP2_SDIO_IRQ_NOTHREAD for SDIO IRQs").  Prior to that it might
make sense for anyone experiencing this problem to just pick the old
CHROMIUM patch to fix them.

 drivers/mmc/host/dw_mmc.c | 24 ++++++++++++++++++++----
 drivers/mmc/host/dw_mmc.h |  3 +++
 2 files changed, 23 insertions(+), 4 deletions(-)

Comments

Doug Anderson April 22, 2019, 3:21 p.m. UTC | #1
Hi,

On Wed, Apr 10, 2019 at 3:13 PM Douglas Anderson <dianders@chromium.org> wrote:
>
> Processing SDIO interrupts while dw_mmc is suspended (or partly
> suspended) seems like a bad idea.  We really don't want to be
> processing them until we've gotten ourselves fully powered up.
>
> You might be wondering how it's even possible to become suspended when
> an SDIO interrupt is active.  As can be seen in
> dw_mci_enable_sdio_irq(), we explicitly keep dw_mmc out of runtime
> suspend when the SDIO interrupt is enabled.  ...but even though we
> stop normal runtime suspend transitions when SDIO interrupts are
> enabled, the dw_mci_runtime_suspend() can still get called for a full
> system suspend.
>
> Let's handle all this by explicitly masking SDIO interrupts in the
> suspend call and unmasking them later in the resume call.  To do this
> cleanly I'll keep track of whether the client requested that SDIO
> interrupts be enabled so that we can reliably restore them regardless
> of whether we're masking them for one reason or another.
>
> Without this fix it can be seen that rk3288-veyron Chromebooks with
> Marvell WiFi would sometimes fail to resume WiFi even after picking my
> recent mwifiex patch [1].  Specifically you'd see messages like this:
>   mwifiex_sdio mmc1:0001:1: Firmware wakeup failed
>   mwifiex_sdio mmc1:0001:1: PREP_CMD: FW in reset state
>
> ...and tracing through the resume code in the failing cases showed
> that we were processing a SDIO interrupt really early in the resume
> call.
>
> NOTE: downstream in Chrome OS 3.14 and 3.18 kernels (both of which
> support the Marvell SDIO WiFi card) we had a patch ("CHROMIUM: sdio:
> Defer SDIO interrupt handling until after resume") [2].  Presumably
> this is the same problem that was solved by that patch.
>
> [1] https://lkml.kernel.org/r/20190404040106.40519-1-dianders@chromium.org
> [2] https://crrev.com/c/230765
>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
> I didn't put any "Fixes" tag here, but presumably this could be
> backported to whichever kernels folks found it useful for.  I have at
> least confirmed that kernels v4.14 and v4.19 (as well as v5.1-rc2)
> show the problem.  It is very easy to pick this to v4.19 and it
> definitely fixes the problem there.
>
> I haven't spent the time to pick this to 4.14 myself, but presumably
> it wouldn't be too hard to backport this as far as v4.13 since that
> contains commit 32dba73772f8 ("mmc: dw_mmc: Convert to use
> MMC_CAP2_SDIO_IRQ_NOTHREAD for SDIO IRQs").  Prior to that it might
> make sense for anyone experiencing this problem to just pick the old
> CHROMIUM patch to fix them.
>
>  drivers/mmc/host/dw_mmc.c | 24 ++++++++++++++++++++----
>  drivers/mmc/host/dw_mmc.h |  3 +++
>  2 files changed, 23 insertions(+), 4 deletions(-)

Jaehoon / Shawn: any thoughts on this patch?

-Doug
Shawn Lin April 24, 2019, 12:57 a.m. UTC | #2
On 2019/4/22 23:21, Doug Anderson wrote:
> Hi,
> 
> On Wed, Apr 10, 2019 at 3:13 PM Douglas Anderson <dianders@chromium.org> wrote:
>>
>> Processing SDIO interrupts while dw_mmc is suspended (or partly
>> suspended) seems like a bad idea.  We really don't want to be
>> processing them until we've gotten ourselves fully powered up.
>>
>> You might be wondering how it's even possible to become suspended when
>> an SDIO interrupt is active.  As can be seen in
>> dw_mci_enable_sdio_irq(), we explicitly keep dw_mmc out of runtime
>> suspend when the SDIO interrupt is enabled.  ...but even though we
>> stop normal runtime suspend transitions when SDIO interrupts are
>> enabled, the dw_mci_runtime_suspend() can still get called for a full
>> system suspend.
>>
>> Let's handle all this by explicitly masking SDIO interrupts in the
>> suspend call and unmasking them later in the resume call.  To do this
>> cleanly I'll keep track of whether the client requested that SDIO
>> interrupts be enabled so that we can reliably restore them regardless
>> of whether we're masking them for one reason or another.
>>
>> Without this fix it can be seen that rk3288-veyron Chromebooks with
>> Marvell WiFi would sometimes fail to resume WiFi even after picking my
>> recent mwifiex patch [1].  Specifically you'd see messages like this:
>>    mwifiex_sdio mmc1:0001:1: Firmware wakeup failed
>>    mwifiex_sdio mmc1:0001:1: PREP_CMD: FW in reset state
>>
>> ...and tracing through the resume code in the failing cases showed
>> that we were processing a SDIO interrupt really early in the resume
>> call.
>>
>> NOTE: downstream in Chrome OS 3.14 and 3.18 kernels (both of which
>> support the Marvell SDIO WiFi card) we had a patch ("CHROMIUM: sdio:
>> Defer SDIO interrupt handling until after resume") [2].  Presumably
>> this is the same problem that was solved by that patch.
>>
>> [1] https://lkml.kernel.org/r/20190404040106.40519-1-dianders@chromium.org
>> [2] https://crrev.com/c/230765
>>
>> Cc: <stable@vger.kernel.org>
>> Signed-off-by: Douglas Anderson <dianders@chromium.org>
>> ---
>> I didn't put any "Fixes" tag here, but presumably this could be
>> backported to whichever kernels folks found it useful for.  I have at
>> least confirmed that kernels v4.14 and v4.19 (as well as v5.1-rc2)
>> show the problem.  It is very easy to pick this to v4.19 and it
>> definitely fixes the problem there.
>>
>> I haven't spent the time to pick this to 4.14 myself, but presumably
>> it wouldn't be too hard to backport this as far as v4.13 since that
>> contains commit 32dba73772f8 ("mmc: dw_mmc: Convert to use
>> MMC_CAP2_SDIO_IRQ_NOTHREAD for SDIO IRQs").  Prior to that it might
>> make sense for anyone experiencing this problem to just pick the old
>> CHROMIUM patch to fix them.
>>
>>   drivers/mmc/host/dw_mmc.c | 24 ++++++++++++++++++++----
>>   drivers/mmc/host/dw_mmc.h |  3 +++
>>   2 files changed, 23 insertions(+), 4 deletions(-)
> 
> Jaehoon / Shawn: any thoughts on this patch?

The intention seems reasonable to me, but just wonder if we need
mask/unmask SDIO interrupt when it's never used?  It's the same
situation for SDMMC_CLKEN_LOW_PWR that we couldn't stop providing
clock for SDIO cards, so I guess we need to check MMC_CAP_SDIO_IRQ
as well.

> 
> -Doug
> 
> 
>
Doug Anderson April 24, 2019, 1:09 a.m. UTC | #3
Hi,

On Tue, Apr 23, 2019 at 5:57 PM Shawn Lin <shawn.lin@rock-chips.com> wrote:
>
> The intention seems reasonable to me, but just wonder if we need
> mask/unmask SDIO interrupt when it's never used?

I don't think we do.

Specifically "client_sdio_enb" starts out as false.  If nobody ever
calls dw_mci_enable_sdio_irq() then "client_sdio_enb" will always
continue to be false.

Now at suspend time we'll call "__dw_mci_enable_sdio_irq".  Because
"client_sdio_enb" is false then the local variable "enb" will always
be false.  Sure we'll clear the "SDMMC_INT_SDIO" from the "INTMASK"
register, but it should already have been cleared so this is a no-op.

...at resume time we'll have a similar situation where
"client_sdio_enb" is false and thus we'll (again) just clear the
"SDMMC_INT_SDIO" from the "INTMASK".

I could potentially optimize away the "mci_writel()" if we're not
changing anything if you're worried about that?


> It's the same
> situation for SDMMC_CLKEN_LOW_PWR that we couldn't stop providing
> clock for SDIO cards, so I guess we need to check MMC_CAP_SDIO_IRQ
> as well.

I think it might be a slightly different situation though.  In this
case I believe it's not just a problem with clock stoppage.  I believe
the problem is that the interrupt will be passed to the SDIO device
driver right away and that'll call back into dw_mmc.  dw_mmc is just
not in a state to handle it until we've more fully resumed things.

In any case with my patch the only way we'd ever end up unmasking the
SDIO IRQ here would be if dw_mci_enable_sdio_irq() was called.  That
will only happen if there's an SDIO card plugged in.


-Doug
Emil Renner Berthing April 24, 2019, 8:19 a.m. UTC | #4
Hi Douglas,

Unfortunately this seems to beak resume on my rk3399-gru-kevin. I have
a semi-complicated setup with my rootfs as a btrfs on dmcrypt on
mmcblk0 which is the dw_mmc, so I'm guessing something goes wrong when
waking up the dm_mmc which probably wasn't suspended before this
patch. It's not 100% consistent though. Sometimes I see it resume the
first time I try suspending, but then 2nd time I suspend it won't come
back.

Let me know if I can do something to help debug this.

/Emil


On Thu, 11 Apr 2019 at 00:13, Douglas Anderson <dianders@chromium.org> wrote:
>
> Processing SDIO interrupts while dw_mmc is suspended (or partly
> suspended) seems like a bad idea.  We really don't want to be
> processing them until we've gotten ourselves fully powered up.
>
> You might be wondering how it's even possible to become suspended when
> an SDIO interrupt is active.  As can be seen in
> dw_mci_enable_sdio_irq(), we explicitly keep dw_mmc out of runtime
> suspend when the SDIO interrupt is enabled.  ...but even though we
> stop normal runtime suspend transitions when SDIO interrupts are
> enabled, the dw_mci_runtime_suspend() can still get called for a full
> system suspend.
>
> Let's handle all this by explicitly masking SDIO interrupts in the
> suspend call and unmasking them later in the resume call.  To do this
> cleanly I'll keep track of whether the client requested that SDIO
> interrupts be enabled so that we can reliably restore them regardless
> of whether we're masking them for one reason or another.
>
> Without this fix it can be seen that rk3288-veyron Chromebooks with
> Marvell WiFi would sometimes fail to resume WiFi even after picking my
> recent mwifiex patch [1].  Specifically you'd see messages like this:
>   mwifiex_sdio mmc1:0001:1: Firmware wakeup failed
>   mwifiex_sdio mmc1:0001:1: PREP_CMD: FW in reset state
>
> ...and tracing through the resume code in the failing cases showed
> that we were processing a SDIO interrupt really early in the resume
> call.
>
> NOTE: downstream in Chrome OS 3.14 and 3.18 kernels (both of which
> support the Marvell SDIO WiFi card) we had a patch ("CHROMIUM: sdio:
> Defer SDIO interrupt handling until after resume") [2].  Presumably
> this is the same problem that was solved by that patch.
>
> [1] https://lkml.kernel.org/r/20190404040106.40519-1-dianders@chromium.org
> [2] https://crrev.com/c/230765
>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
> I didn't put any "Fixes" tag here, but presumably this could be
> backported to whichever kernels folks found it useful for.  I have at
> least confirmed that kernels v4.14 and v4.19 (as well as v5.1-rc2)
> show the problem.  It is very easy to pick this to v4.19 and it
> definitely fixes the problem there.
>
> I haven't spent the time to pick this to 4.14 myself, but presumably
> it wouldn't be too hard to backport this as far as v4.13 since that
> contains commit 32dba73772f8 ("mmc: dw_mmc: Convert to use
> MMC_CAP2_SDIO_IRQ_NOTHREAD for SDIO IRQs").  Prior to that it might
> make sense for anyone experiencing this problem to just pick the old
> CHROMIUM patch to fix them.
>
>  drivers/mmc/host/dw_mmc.c | 24 ++++++++++++++++++++----
>  drivers/mmc/host/dw_mmc.h |  3 +++
>  2 files changed, 23 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
> index 80dc2fd6576c..432f6e3ddd43 100644
> --- a/drivers/mmc/host/dw_mmc.c
> +++ b/drivers/mmc/host/dw_mmc.c
> @@ -1664,7 +1664,8 @@ static void dw_mci_init_card(struct mmc_host *mmc, struct mmc_card *card)
>         }
>  }
>
> -static void __dw_mci_enable_sdio_irq(struct dw_mci_slot *slot, int enb)
> +static void __dw_mci_enable_sdio_irq(struct dw_mci_slot *slot, bool enb,
> +                                    bool client_requested)
>  {
>         struct dw_mci *host = slot->host;
>         unsigned long irqflags;
> @@ -1672,6 +1673,17 @@ static void __dw_mci_enable_sdio_irq(struct dw_mci_slot *slot, int enb)
>
>         spin_lock_irqsave(&host->irq_lock, irqflags);
>
> +       /*
> +        * If this was requested by the client save the request.  If this
> +        * wasn't required by the client then logically AND it with the
> +        * client request since we want to disable if either the client
> +        * disabled OR we have some other reason to disable.
> +        */
> +       if (client_requested)
> +               host->client_sdio_enb = enb;
> +       else if (!host->client_sdio_enb)
> +               enb = 0;
> +
>         /* Enable/disable Slot Specific SDIO interrupt */
>         int_mask = mci_readl(host, INTMASK);
>         if (enb)
> @@ -1688,7 +1700,7 @@ static void dw_mci_enable_sdio_irq(struct mmc_host *mmc, int enb)
>         struct dw_mci_slot *slot = mmc_priv(mmc);
>         struct dw_mci *host = slot->host;
>
> -       __dw_mci_enable_sdio_irq(slot, enb);
> +       __dw_mci_enable_sdio_irq(slot, enb, true);
>
>         /* Avoid runtime suspending the device when SDIO IRQ is enabled */
>         if (enb)
> @@ -1701,7 +1713,7 @@ static void dw_mci_ack_sdio_irq(struct mmc_host *mmc)
>  {
>         struct dw_mci_slot *slot = mmc_priv(mmc);
>
> -       __dw_mci_enable_sdio_irq(slot, 1);
> +       __dw_mci_enable_sdio_irq(slot, true, false);
>  }
>
>  static int dw_mci_execute_tuning(struct mmc_host *mmc, u32 opcode)
> @@ -2734,7 +2746,7 @@ static irqreturn_t dw_mci_interrupt(int irq, void *dev_id)
>                 if (pending & SDMMC_INT_SDIO(slot->sdio_id)) {
>                         mci_writel(host, RINTSTS,
>                                    SDMMC_INT_SDIO(slot->sdio_id));
> -                       __dw_mci_enable_sdio_irq(slot, 0);
> +                       __dw_mci_enable_sdio_irq(slot, false, false);
>                         sdio_signal_irq(slot->mmc);
>                 }
>
> @@ -3424,6 +3436,8 @@ int dw_mci_runtime_suspend(struct device *dev)
>  {
>         struct dw_mci *host = dev_get_drvdata(dev);
>
> +       __dw_mci_enable_sdio_irq(host->slot, false, false);
> +
>         if (host->use_dma && host->dma_ops->exit)
>                 host->dma_ops->exit(host);
>
> @@ -3490,6 +3504,8 @@ int dw_mci_runtime_resume(struct device *dev)
>         /* Now that slots are all setup, we can enable card detect */
>         dw_mci_enable_cd(host);
>
> +       __dw_mci_enable_sdio_irq(host->slot, true, false);
> +
>         return 0;
>
>  err:
> diff --git a/drivers/mmc/host/dw_mmc.h b/drivers/mmc/host/dw_mmc.h
> index 46e9f8ec5398..dfbace0f5043 100644
> --- a/drivers/mmc/host/dw_mmc.h
> +++ b/drivers/mmc/host/dw_mmc.h
> @@ -127,6 +127,7 @@ struct dw_mci_dma_slave {
>   * @cmd11_timer: Timer for SD3.0 voltage switch over scheme.
>   * @cto_timer: Timer for broken command transfer over scheme.
>   * @dto_timer: Timer for broken data transfer over scheme.
> + * @client_sdio_enb: The value last passed to enable_sdio_irq.
>   *
>   * Locking
>   * =======
> @@ -234,6 +235,8 @@ struct dw_mci {
>         struct timer_list       cmd11_timer;
>         struct timer_list       cto_timer;
>         struct timer_list       dto_timer;
> +
> +       bool                    client_sdio_enb;
>  };
>
>  /* DMA ops for Internal/External DMAC interface */
> --
> 2.21.0.392.gf8f6787159e-goog
>
>
> _______________________________________________
> Linux-rockchip mailing list
> Linux-rockchip@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-rockchip
Doug Anderson April 25, 2019, 9:24 p.m. UTC | #5
Hi,

On Wed, Apr 24, 2019 at 1:19 AM Emil Renner Berthing
<emil.renner.berthing@gmail.com> wrote:
>
> Hi Douglas,
>
> Unfortunately this seems to beak resume on my rk3399-gru-kevin. I have
> a semi-complicated setup with my rootfs as a btrfs on dmcrypt on
> mmcblk0 which is the dw_mmc, so I'm guessing something goes wrong when
> waking up the dm_mmc which probably wasn't suspended before this
> patch. It's not 100% consistent though. Sometimes I see it resume the
> first time I try suspending, but then 2nd time I suspend it won't come
> back.

Thanks for testing!

Can you give me more details about your kernel version?  Any local
patches?  What config are you using?

I tried booting up my rk3388-gru-kevin on the Chrome OS 4.19 kernel (I
know, not quite fully upstream, but linuxnext just crashed upon boot
for me when I tried it).  I have this patch in place and I booted from
SD card.  I ran a Chrome OS tool which will test 20 cycles of
suspend/resume (uses the RTC to schedule a wakeup):

suspend_stress_test -c20 --suspend_min=15 --suspend_max=20

...and I didn't see failures.

...so I'm pretty baffled.  On rk3399-gru-kevin you should have no SDIO
devices unless you've managed to cram an SDIO card into your micro SD
slot.  Specifically WiFi is connected via PCIE.

As I wrote Shawn, I'm pretty sure my patch is a effectively a no-op in
these cases.  "client_sdio_enb" should always be 0 and thus runtime
suspend and resume should just be:

spin_lock_irqsave(&host->irq_lock, irqflags);
int_mask = mci_readl(host, INTMASK);
mci_writel(host, INTMASK, int_mask);
spin_unlock_irqrestore(&host->irq_lock, irqflags);

...other than changing the timing slightly that shouldn't do anything at all.


My first guess is that this patch is your (un)lucky shirt, as in
"every time I wear my lucky shirt I win at slots in Las Vegas".  Since
the problem you're seeing doesn't happen every time I'm going to guess
that you got lucky in that it seemed to go away when you reverted my
patch.


> Let me know if I can do something to help debug this.

Assuming the failure and this patch aren't just correlated by luck...

Logs would be a start.  If you don't have serial console then
hopefully you've got console-ramoops working?  Then if it's wedged you
can do the warmest reset you can (Alt-topRowVolUp-R) and hopefully the
ramoops will give you some logs.

Presumably you could also add some logs to double-check that all my
assertions are true.  That is:

1. host->client_sdio_enb should always be false in __dw_mci_enable_sdio_irq()

2. In __dw_mci_enable_sdio_irq() confirm that the int_mask we are
writing is the same as the one we're reading.  AKA the "if (enb) ...
else ..." makes no change to the value of int_mask.


Assuming my assertions are correct then pretty much everything is
_supposed_ to be a no-op on your system, so you can start gutting
things one at a time and see when the problem goes away:

a) Delete the call from suspend.  Does it fix it?

b) Delete the call from resume.  Does it fix it?

c) Delete parts of __dw_mci_enable_sdio_irq().  Get rid of the
mci_wirtel().  Does it fix it?  What about if you get rid of the read
and the write and just have the spinlock?

===

If I can help with real-time debugging let me know.  I'm in California
time zone and can be found as dianders in the #linux-rockchip channel
on freenode.



-Doug
Emil Renner Berthing April 26, 2019, 5:19 p.m. UTC | #6
Hi Doug,

TLDR: I'm no longer convinced this patch breaks suspend/resume more
than it already is. Sorry about the noise.

On Thu, 25 Apr 2019 at 23:25, Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Wed, Apr 24, 2019 at 1:19 AM Emil Renner Berthing
> <emil.renner.berthing@gmail.com> wrote:
> >
> > Hi Douglas,
> >
> > Unfortunately this seems to beak resume on my rk3399-gru-kevin. I have
> > a semi-complicated setup with my rootfs as a btrfs on dmcrypt on
> > mmcblk0 which is the dw_mmc, so I'm guessing something goes wrong when
> > waking up the dm_mmc which probably wasn't suspended before this
> > patch. It's not 100% consistent though. Sometimes I see it resume the
> > first time I try suspending, but then 2nd time I suspend it won't come
> > back.
>
> Thanks for testing!

Thanks for your detailed response. It made me want to make absolutely
sure that this patch is the culprit.
As a baseline I booted a vanilla 5.0.9 and suspend/resumed it about a
dusin times without any errors.
So I applied this patch and immediately it crashed on suspend, but in
a way that I could still see the kernel log,
and it was the mwifiex driver that crashed. I rebooted and tried
supend/resume again and
this time it seemed like it was the dwc3 or usb3-phy that crashed.
I still have the kernel log if anyone is interested.
However 3rd time booting 5.0.9 with this patch suspend/resume just works.
At least the 2 dusin times I tried before giving up on making it crash.
I went back to vanilla 5.0.9 and after a few tries I managed to make
that one crash too.
I guess that means this patch is off the hook. I'm sorry about the
false report :/

/Emil
Doug Anderson April 26, 2019, 10:08 p.m. UTC | #7
Hi,

On Fri, Apr 26, 2019 at 10:19 AM Emil Renner Berthing
<emil.renner.berthing@gmail.com> wrote:
>
> Hi Doug,
>
> TLDR: I'm no longer convinced this patch breaks suspend/resume more
> than it already is. Sorry about the noise.
>
> On Thu, 25 Apr 2019 at 23:25, Doug Anderson <dianders@chromium.org> wrote:
> >
> > Hi,
> >
> > On Wed, Apr 24, 2019 at 1:19 AM Emil Renner Berthing
> > <emil.renner.berthing@gmail.com> wrote:
> > >
> > > Hi Douglas,
> > >
> > > Unfortunately this seems to beak resume on my rk3399-gru-kevin. I have
> > > a semi-complicated setup with my rootfs as a btrfs on dmcrypt on
> > > mmcblk0 which is the dw_mmc, so I'm guessing something goes wrong when
> > > waking up the dm_mmc which probably wasn't suspended before this
> > > patch. It's not 100% consistent though. Sometimes I see it resume the
> > > first time I try suspending, but then 2nd time I suspend it won't come
> > > back.
> >
> > Thanks for testing!
>
> Thanks for your detailed response. It made me want to make absolutely
> sure that this patch is the culprit.
> As a baseline I booted a vanilla 5.0.9 and suspend/resumed it about a
> dusin times without any errors.
> So I applied this patch and immediately it crashed on suspend, but in
> a way that I could still see the kernel log,
> and it was the mwifiex driver that crashed. I rebooted and tried
> supend/resume again and
> this time it seemed like it was the dwc3 or usb3-phy that crashed.
> I still have the kernel log if anyone is interested.
> However 3rd time booting 5.0.9 with this patch suspend/resume just works.
> At least the 2 dusin times I tried before giving up on making it crash.
> I went back to vanilla 5.0.9 and after a few tries I managed to make
> that one crash too.
> I guess that means this patch is off the hook. I'm sorry about the
> false report :/

No worries, I've certainly been there and I'm super happy to have
people testing patches.  :-)

Odd that you're having suspend/resume patches.  My first guess for
super randomness would be WiFi.  The PCIe bus on rk3399 causes the
most impossible to debug problems if you try to access it at the wrong
time.  If you disable WiFi do all your problems go away?  I tried
putting v5.0.9 on the kevin sitting on my desk and it seems to
suspend/resume OK (25 cycles), but:

* I just jammed it straight onto a normal Chrome OS root filesystem.
Since that filesystem expects the GPU to be there, I'm just booting to
a serial prompt and the screen just displays the boot splash.

* I didn't try to configure WiFi or anything.

* I'm using the Chrome OS "fallback config" for the kernel (the config
our build system picks if building an upstream kernel without the
normal split config).  AKA:
<https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/refs/heads/master/eclass/cros-kernel/rockchip64_defconfig>.
I'm not 100% sure everything is enabled there...

* I'm booting w/ serial console enabled and doing my testing with "no
console suspend" which can certainly affect suspend/resume timing.


Best of luck tracking your problems down!  I suppose if things used to
work maybe a bisect would be possible?

-Doug
diff mbox series

Patch

diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
index 80dc2fd6576c..432f6e3ddd43 100644
--- a/drivers/mmc/host/dw_mmc.c
+++ b/drivers/mmc/host/dw_mmc.c
@@ -1664,7 +1664,8 @@  static void dw_mci_init_card(struct mmc_host *mmc, struct mmc_card *card)
 	}
 }
 
-static void __dw_mci_enable_sdio_irq(struct dw_mci_slot *slot, int enb)
+static void __dw_mci_enable_sdio_irq(struct dw_mci_slot *slot, bool enb,
+				     bool client_requested)
 {
 	struct dw_mci *host = slot->host;
 	unsigned long irqflags;
@@ -1672,6 +1673,17 @@  static void __dw_mci_enable_sdio_irq(struct dw_mci_slot *slot, int enb)
 
 	spin_lock_irqsave(&host->irq_lock, irqflags);
 
+	/*
+	 * If this was requested by the client save the request.  If this
+	 * wasn't required by the client then logically AND it with the
+	 * client request since we want to disable if either the client
+	 * disabled OR we have some other reason to disable.
+	 */
+	if (client_requested)
+		host->client_sdio_enb = enb;
+	else if (!host->client_sdio_enb)
+		enb = 0;
+
 	/* Enable/disable Slot Specific SDIO interrupt */
 	int_mask = mci_readl(host, INTMASK);
 	if (enb)
@@ -1688,7 +1700,7 @@  static void dw_mci_enable_sdio_irq(struct mmc_host *mmc, int enb)
 	struct dw_mci_slot *slot = mmc_priv(mmc);
 	struct dw_mci *host = slot->host;
 
-	__dw_mci_enable_sdio_irq(slot, enb);
+	__dw_mci_enable_sdio_irq(slot, enb, true);
 
 	/* Avoid runtime suspending the device when SDIO IRQ is enabled */
 	if (enb)
@@ -1701,7 +1713,7 @@  static void dw_mci_ack_sdio_irq(struct mmc_host *mmc)
 {
 	struct dw_mci_slot *slot = mmc_priv(mmc);
 
-	__dw_mci_enable_sdio_irq(slot, 1);
+	__dw_mci_enable_sdio_irq(slot, true, false);
 }
 
 static int dw_mci_execute_tuning(struct mmc_host *mmc, u32 opcode)
@@ -2734,7 +2746,7 @@  static irqreturn_t dw_mci_interrupt(int irq, void *dev_id)
 		if (pending & SDMMC_INT_SDIO(slot->sdio_id)) {
 			mci_writel(host, RINTSTS,
 				   SDMMC_INT_SDIO(slot->sdio_id));
-			__dw_mci_enable_sdio_irq(slot, 0);
+			__dw_mci_enable_sdio_irq(slot, false, false);
 			sdio_signal_irq(slot->mmc);
 		}
 
@@ -3424,6 +3436,8 @@  int dw_mci_runtime_suspend(struct device *dev)
 {
 	struct dw_mci *host = dev_get_drvdata(dev);
 
+	__dw_mci_enable_sdio_irq(host->slot, false, false);
+
 	if (host->use_dma && host->dma_ops->exit)
 		host->dma_ops->exit(host);
 
@@ -3490,6 +3504,8 @@  int dw_mci_runtime_resume(struct device *dev)
 	/* Now that slots are all setup, we can enable card detect */
 	dw_mci_enable_cd(host);
 
+	__dw_mci_enable_sdio_irq(host->slot, true, false);
+
 	return 0;
 
 err:
diff --git a/drivers/mmc/host/dw_mmc.h b/drivers/mmc/host/dw_mmc.h
index 46e9f8ec5398..dfbace0f5043 100644
--- a/drivers/mmc/host/dw_mmc.h
+++ b/drivers/mmc/host/dw_mmc.h
@@ -127,6 +127,7 @@  struct dw_mci_dma_slave {
  * @cmd11_timer: Timer for SD3.0 voltage switch over scheme.
  * @cto_timer: Timer for broken command transfer over scheme.
  * @dto_timer: Timer for broken data transfer over scheme.
+ * @client_sdio_enb: The value last passed to enable_sdio_irq.
  *
  * Locking
  * =======
@@ -234,6 +235,8 @@  struct dw_mci {
 	struct timer_list       cmd11_timer;
 	struct timer_list       cto_timer;
 	struct timer_list       dto_timer;
+
+	bool			client_sdio_enb;
 };
 
 /* DMA ops for Internal/External DMAC interface */