diff mbox

Regulator regression in next-20180305

Message ID 20180306163035.GE13586@sirena.org.uk (mailing list archive)
State New, archived
Headers show

Commit Message

Mark Brown March 6, 2018, 4:30 p.m. UTC
On Mon, Mar 05, 2018 at 08:22:26PM -0300, Fabio Estevam wrote:
> On Mon, Mar 5, 2018 at 8:12 PM, Tony Lindgren <tony@atomide.com> wrote:

> > Looks like with next-20180305 there's a regulator regression
> > where mmc0 won't show any cards or produces errors:

> > mmcblk0: error -110 requesting status
> > mmc1: new high speed SDIO card at address 0001
> > mmcblk0: error -110 requesting status
> > mmcblk0: recovery failed!
> > print_req_error: I/O error, dev mmcblk0, sector 0
> > Buffer I/O error on dev mmcblk0, logical block 0, async page read
> > mmcblk0: error -110 requesting status
> > mmcblk0: recovery failed!

No other error messages?  That seems like there's something going on
that's very different to what Fabio was reporting...  I'm guessing some
voltage application didn't go through but it's hard to tell with so
little data.  dra7 does seem to have what Fabio had though so there's
definitely some effect on the OMAP platforms.

> I have also seen regulator issues due to this series:
> https://lkml.org/lkml/2018/3/5/731

Looking at your stuff I'm having trouble figuring out what's going on -
we're getting double locking of a parent regulator during enable
according to your backtraces but it's not clear to me what took that
lock already.  regulator_enable() walks the supplies before it takes
the lock on the regulator it's immediately being called on, not holding
any locks on supplies while enabling.  regulator_balance_voltage() then
tries to lock the supplies again but lockdep says the lock is already
held by regulator_enable().  It's also weird that this doesn't seem to
be showing up on other boards in kernelci, the regulator setup on those
i.MX boards looks to be quite simple so I'd expect a much wider impact.

I'm wondering if your case is more pain from mutex_lock_nested(), both
regulator_lock_coupled() and regulator_lock_supply() will end up using 
indexes starting at 0 for the locking classes.  That doesn't smell right
though, but in case my straw clutching works:

If we can't figure it out I'll just drop the series but I'd prefer to at
least understand what's going on.

Comments

Fabio Estevam March 6, 2018, 4:56 p.m. UTC | #1
Hi Mark,

On Tue, Mar 6, 2018 at 1:30 PM, Mark Brown <broonie@kernel.org> wrote:

> diff --git a/drivers/regulator/core.c b/drivers/regulator/core.c
> index e685f8b94acf..2c5b20a97f51 100644
> --- a/drivers/regulator/core.c
> +++ b/drivers/regulator/core.c
> @@ -159,7 +159,7 @@ static void regulator_lock_supply(struct regulator_dev *rdev)
>  {
>         int i;
>
> -       for (i = 0; rdev; rdev = rdev_get_supply(rdev), i++)
> +       for (i = 1000; rdev; rdev = rdev_get_supply(rdev), i++)
>                 mutex_lock_nested(&rdev->mutex, i);

With this change the log is a bit different, but still get a kernel hang:
https://pastebin.com/eF08TnuT

Thanks
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tony Lindgren March 6, 2018, 5:18 p.m. UTC | #2
* Fabio Estevam <festevam@gmail.com> [180306 16:57]:
> Hi Mark,
> 
> On Tue, Mar 6, 2018 at 1:30 PM, Mark Brown <broonie@kernel.org> wrote:
> 
> > diff --git a/drivers/regulator/core.c b/drivers/regulator/core.c
> > index e685f8b94acf..2c5b20a97f51 100644
> > --- a/drivers/regulator/core.c
> > +++ b/drivers/regulator/core.c
> > @@ -159,7 +159,7 @@ static void regulator_lock_supply(struct regulator_dev *rdev)
> >  {
> >         int i;
> >
> > -       for (i = 0; rdev; rdev = rdev_get_supply(rdev), i++)
> > +       for (i = 1000; rdev; rdev = rdev_get_supply(rdev), i++)
> >                 mutex_lock_nested(&rdev->mutex, i);
> 
> With this change the log is a bit different, but still get a kernel hang:
> https://pastebin.com/eF08TnuT

That patch does not seem to change anything for me. The errors
I posted yesterday for mmc0 only happen on duovero, other omap
variants just fail silently with no cards showing.

Regards,

Tony
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Brown March 6, 2018, 7:12 p.m. UTC | #3
On Tue, Mar 06, 2018 at 01:56:42PM -0300, Fabio Estevam wrote:
> On Tue, Mar 6, 2018 at 1:30 PM, Mark Brown <broonie@kernel.org> wrote:

> > -       for (i = 0; rdev; rdev = rdev_get_supply(rdev), i++)
> > +       for (i = 1000; rdev; rdev = rdev_get_supply(rdev), i++)
> >                 mutex_lock_nested(&rdev->mutex, i);

> With this change the log is a bit different, but still get a kernel hang:
> https://pastebin.com/eF08TnuT

Yeah, that's what I expected.  What we really need to figure out I think
is what exactly is taking the lock that we end up colliding with, I
don't seem to be able to reproduce so I'm having to just go on staring
at the code here.
Fabio Estevam March 6, 2018, 8:06 p.m. UTC | #4
Hi Mark,

On Tue, Mar 6, 2018 at 4:12 PM, Mark Brown <broonie@kernel.org> wrote:

> Yeah, that's what I expected.  What we really need to figure out I think
> is what exactly is taking the lock that we end up colliding with, I
> don't seem to be able to reproduce so I'm having to just go on staring
> at the code here.

Any debug options I should turn on in order to help us to understand
more on the lock issue?
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Brown March 6, 2018, 9:33 p.m. UTC | #5
On Tue, Mar 06, 2018 at 05:06:48PM -0300, Fabio Estevam wrote:
> On Tue, Mar 6, 2018 at 4:12 PM, Mark Brown <broonie@kernel.org> wrote:

> > Yeah, that's what I expected.  What we really need to figure out I think
> > is what exactly is taking the lock that we end up colliding with, I
> > don't seem to be able to reproduce so I'm having to just go on staring
> > at the code here.

> Any debug options I should turn on in order to help us to understand
> more on the lock issue?

No, unfortunately, apart from the generic kernel ones (like lockdep).
I'll try to have a poke at adding some tomorrow, though I'm also
considering just dropping the series for now.
Fabio Estevam March 6, 2018, 10:10 p.m. UTC | #6
On Tue, Mar 6, 2018 at 6:33 PM, Mark Brown <broonie@kernel.org> wrote:

> No, unfortunately, apart from the generic kernel ones (like lockdep).
> I'll try to have a poke at adding some tomorrow, though I'm also
> considering just dropping the series for now.

Ok, lockdep is already enabled by default in the defconfig I am using
(imx_v6_v7_defconfig).

Yesterday I spent a few hours trying to understand this locking issue,
but didn't get any success on this task.

Thanks
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Maciej Purski March 7, 2018, 12:57 p.m. UTC | #7
Hi all,
sorry it took me so long to answer.

On 03/06/2018 05:30 PM, Mark Brown wrote:
> On Mon, Mar 05, 2018 at 08:22:26PM -0300, Fabio Estevam wrote:
>> On Mon, Mar 5, 2018 at 8:12 PM, Tony Lindgren <tony@atomide.com> wrote:
> 
>>> Looks like with next-20180305 there's a regulator regression
>>> where mmc0 won't show any cards or produces errors:
> 
>>> mmcblk0: error -110 requesting status
>>> mmc1: new high speed SDIO card at address 0001
>>> mmcblk0: error -110 requesting status
>>> mmcblk0: recovery failed!
>>> print_req_error: I/O error, dev mmcblk0, sector 0
>>> Buffer I/O error on dev mmcblk0, logical block 0, async page read
>>> mmcblk0: error -110 requesting status
>>> mmcblk0: recovery failed!
> 
> No other error messages?  That seems like there's something going on
> that's very different to what Fabio was reporting...  I'm guessing some
> voltage application didn't go through but it's hard to tell with so
> little data.  dra7 does seem to have what Fabio had though so there's
> definitely some effect on the OMAP platforms.
>  >> I have also seen regulator issues due to this series:
>> https://lkml.org/lkml/2018/3/5/731
> 
> Looking at your stuff I'm having trouble figuring out what's going on -
> we're getting double locking of a parent regulator during enable
> according to your backtraces but it's not clear to me what took that
> lock already.  regulator_enable() walks the supplies before it takes
> the lock on the regulator it's immediately being called on, not holding
> any locks on supplies while enabling.  regulator_balance_voltage() then
> tries to lock the supplies again but lockdep says the lock is already
> held by regulator_enable().  It's also weird that this doesn't seem to
> be showing up on other boards in kernelci, the regulator setup on those
> i.MX boards looks to be quite simple so I'd expect a much wider impact.
>

I'm trying to figure out what is so special about these boards. The only strange 
thing, that I haven't noticed at first, is that all regulators share a common 
supply - dummy regulator. It is defined in anatop_regulator.c.


> I'm wondering if your case is more pain from mutex_lock_nested(), both
> regulator_lock_coupled() and regulator_lock_supply() will end up using
> indexes starting at 0 for the locking classes.  That doesn't smell right
> though, but in case my straw clutching works:
> 
> If we can't figure it out I'll just drop the series but I'd prefer to at
> least understand what's going on.
>

I have been struggling to reproduce the issue on my exynos boards, but all I 
have achieved is getting the same lockdep warning, but everything else works 
fine. I think it was a false positive caused by using the same indices in 
lock_coupled() and lock_supply().

> diff --git a/drivers/regulator/core.c b/drivers/regulator/core.c
> index e685f8b94acf..2c5b20a97f51 100644
> --- a/drivers/regulator/core.c
> +++ b/drivers/regulator/core.c
> @@ -159,7 +159,7 @@ static void regulator_lock_supply(struct regulator_dev *rdev)
>   {
>   	int i;
>   
> -	for (i = 0; rdev; rdev = rdev_get_supply(rdev), i++)
> +	for (i = 1000; rdev; rdev = rdev_get_supply(rdev), i++)
>   		mutex_lock_nested(&rdev->mutex, i);
>   }
>   
> 

Best regards,
Maciej Purski
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Brown March 7, 2018, 2:10 p.m. UTC | #8
On Wed, Mar 07, 2018 at 01:57:12PM +0100, Maciej Purski wrote:

> I'm trying to figure out what is so special about these boards. The only
> strange thing, that I haven't noticed at first, is that all regulators share
> a common supply - dummy regulator. It is defined in anatop_regulator.c.

No, that's a regulator framework thing - the regulator framework will
use the dummy regulator as a supply when there's nothing described in
the DT so long as the client doesn't explicitly tell it that the supply
might be optional.
Maciej Purski March 7, 2018, 2:37 p.m. UTC | #9
On 03/07/2018 03:10 PM, Mark Brown wrote:
> On Wed, Mar 07, 2018 at 01:57:12PM +0100, Maciej Purski wrote:
> 
>> I'm trying to figure out what is so special about these boards. The only
>> strange thing, that I haven't noticed at first, is that all regulators share
>> a common supply - dummy regulator. It is defined in anatop_regulator.c.
> 
> No, that's a regulator framework thing - the regulator framework will
> use the dummy regulator as a supply when there's nothing described in
> the DT so long as the client doesn't explicitly tell it that the supply
> might be optional.
> 

Ok, thanks for explanation. I think I have found a possibly dangerous scenario, 
but I can't see this situation possible in Fabio's case.

Assume, that we have a chain of supplies, consisting of at least 3. Say: A->B->C.

When we're setting voltage on A, we lock it, call balance_voltage(), lock 
suppliers and call set_voltage_rdev(). So we have regulators A, B, C locked. 
Then set_voltage_rdev() is trying to set voltage of its supply by calling 
set_voltage_unlocked().

Now we're on the regulator B. Set_voltage_unlocked() calls balance_voltage(), 
which again locks its supplies, if they exist. B's supply is C, so we end up 
with having a deadlock on regulator C.

Tony and Fabio, do you find this scenario possible on your boards?

Best regards
Maciej Purski
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Brown March 7, 2018, 2:45 p.m. UTC | #10
On Tue, Mar 06, 2018 at 07:10:01PM -0300, Fabio Estevam wrote:

> Yesterday I spent a few hours trying to understand this locking issue,
> but didn't get any success on this task.

Right, so I'm thinking that if it's taking this much effort to think
about the problem we should just drop the patch set for now.  I've moved
the code to a branch test/coupled in my git so it's not getting in
people's way in -next.  Maciej, I suspect that you might be able to
reproduce this if you set supply names in the regulator definitons for
the PMICs on your boards.  I think that's the difference here.
Tony Lindgren March 7, 2018, 3:17 p.m. UTC | #11
* Maciej Purski <m.purski@samsung.com> [180307 14:38]:
> 
> On 03/07/2018 03:10 PM, Mark Brown wrote:
> > On Wed, Mar 07, 2018 at 01:57:12PM +0100, Maciej Purski wrote:
> > 
> > > I'm trying to figure out what is so special about these boards. The only
> > > strange thing, that I haven't noticed at first, is that all regulators share
> > > a common supply - dummy regulator. It is defined in anatop_regulator.c.
> > 
> > No, that's a regulator framework thing - the regulator framework will
> > use the dummy regulator as a supply when there's nothing described in
> > the DT so long as the client doesn't explicitly tell it that the supply
> > might be optional.
> > 
> 
> Ok, thanks for explanation. I think I have found a possibly dangerous
> scenario, but I can't see this situation possible in Fabio's case.
> 
> Assume, that we have a chain of supplies, consisting of at least 3. Say: A->B->C.
> 
> When we're setting voltage on A, we lock it, call balance_voltage(), lock
> suppliers and call set_voltage_rdev(). So we have regulators A, B, C locked.
> Then set_voltage_rdev() is trying to set voltage of its supply by calling
> set_voltage_unlocked().
> 
> Now we're on the regulator B. Set_voltage_unlocked() calls
> balance_voltage(), which again locks its supplies, if they exist. B's supply
> is C, so we end up with having a deadlock on regulator C.
> 
> Tony and Fabio, do you find this scenario possible on your boards?

For mmc0 at least typically the supply is on the PMIC for omap
variants and it's controlled over i2c or spi bus like most other
regulators. Then boards some have additional GPIO controlled
regulators. So yeah some kind of locking issue between two or
more regulators on i2c or spi bus could be the reason.

Regards,

Tony
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/regulator/core.c b/drivers/regulator/core.c
index e685f8b94acf..2c5b20a97f51 100644
--- a/drivers/regulator/core.c
+++ b/drivers/regulator/core.c
@@ -159,7 +159,7 @@  static void regulator_lock_supply(struct regulator_dev *rdev)
 {
 	int i;
 
-	for (i = 0; rdev; rdev = rdev_get_supply(rdev), i++)
+	for (i = 1000; rdev; rdev = rdev_get_supply(rdev), i++)
 		mutex_lock_nested(&rdev->mutex, i);
 }