diff mbox series

[RFC] PM / core: skip suspend next time if resume returns an error

Message ID 20180927203523.60856-1-dianders@chromium.org (mailing list archive)
State RFC, archived
Headers show
Series [RFC] PM / core: skip suspend next time if resume returns an error | expand

Commit Message

Doug Anderson Sept. 27, 2018, 8:35 p.m. UTC
In general Linux doesn't behave super great if you get an error while
executing a device's resume handler.  Nothing will come along later
and and try again to resume the device (and all devices that depend on
it), so pretty much you're left with a non-functioning device and
that's not good.

However, even though you'll end up with a non-functioning device we
still don't consider resume failures to be fatal to the system.  We'll
keep chugging along and just hope that the device that failed to
resume wasn't too critical.  This establishes the precedent that we
should at least try our best not to fully bork the system after a
resume failure.

I will argue that the best way to keep the system in the best shape is
to assume that if a resume callback failed that it did as close to
no-op as possible.  Because of this we should consider the device
still suspended and shouldn't try to suspend the device again next
time around.  Today that's not what happens.  AKA if you have a device
that fails resume every other time then you'll see these calls:
1. suspend
2. resume (no error)
3. suspend
4. resume (fails!)
5. suspend
6. resume (no error)

I believe that a more correct thing to do is to remove step #5 from
the above.  To understand why, let's imagine a hypothetical device.
In such a device let's say we have a regulator and clock to turn off.

We'll say:
  hypothetical_suspend(...) {
    ret = regulator_disable(...);
    if (ret)
      return ret;
    ret = clk_disable(...);
    if (ret)
      regulator_enable(...);
    return ret;

  hypothetical_resume(...) {
    ret = clk_enable(...);
    if (ret)
      return ret;
    ret = regulator_enable(...);
    if (ret)
      clk_disable(...);
    return ret;

Let's imagine that the regulator_enable() at resume fails sometimes.
You'll see that on the next call to hypothetical_suspend() we'll try
to disable a regulator that was never enabled.

...but if we change things to skip the next suspend callback then
actually we might get the device functioning again after another
suspend/resume cycle (especially if the resume failure was
intermittent for some reason).

Obviously this patch is pretty simplistic and certainly doesn't fix
the world, but perhaps it moves us in the right direction?

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

 drivers/base/power/main.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Pavel Machek Oct. 2, 2018, 8:05 a.m. UTC | #1
Hi!

> In general Linux doesn't behave super great if you get an error while
> executing a device's resume handler.  Nothing will come along later
> and and try again to resume the device (and all devices that depend on
> it), so pretty much you're left with a non-functioning device and
> that's not good.
> 
> However, even though you'll end up with a non-functioning device we
> still don't consider resume failures to be fatal to the system.  We'll
> keep chugging along and just hope that the device that failed to
> resume wasn't too critical.  This establishes the precedent that we
> should at least try our best not to fully bork the system after a
> resume failure.
> 
> I will argue that the best way to keep the system in the best shape is
> to assume that if a resume callback failed that it did as close to
> no-op as possible.  Because of this we should consider the device
> still suspended and shouldn't try to suspend the device again next
> time around.  Today that's not what happens.  AKA if you have a
> device

I don't think there are many guarantees when device resume fail. It
may have done nothing, and it may have resumed the device almost
fully.

I guess the best option would be to refuse system suspend after some
device failed like that.

That leaves user possibility to debug it...

								Pavel-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Rafael J. Wysocki Oct. 2, 2018, 8:28 a.m. UTC | #2
On Tue, Oct 2, 2018 at 10:05 AM Pavel Machek <pavel@ucw.cz> wrote:
>
> Hi!
>
> > In general Linux doesn't behave super great if you get an error while
> > executing a device's resume handler.  Nothing will come along later
> > and and try again to resume the device (and all devices that depend on
> > it), so pretty much you're left with a non-functioning device and
> > that's not good.
> >
> > However, even though you'll end up with a non-functioning device we
> > still don't consider resume failures to be fatal to the system.  We'll
> > keep chugging along and just hope that the device that failed to
> > resume wasn't too critical.  This establishes the precedent that we
> > should at least try our best not to fully bork the system after a
> > resume failure.
> >
> > I will argue that the best way to keep the system in the best shape is
> > to assume that if a resume callback failed that it did as close to
> > no-op as possible.  Because of this we should consider the device
> > still suspended and shouldn't try to suspend the device again next
> > time around.  Today that's not what happens.  AKA if you have a
> > device
>
> I don't think there are many guarantees when device resume fail. It
> may have done nothing, and it may have resumed the device almost
> fully.
>
> I guess the best option would be to refuse system suspend after some
> device failed like that.
>
> That leaves user possibility to debug it...

I guess so.

Doing that in all cases is kind of risky IMO, because we haven't taken
the return values of the ->resume* callbacks into account so far
(except for printing the information that is), so there may be
non-lethal cases when that happens and the $subject patch would make
them not work any more.

Thanks,
Rafael
Doug Anderson Oct. 2, 2018, 9:01 p.m. UTC | #3
Hi,

On Tue, Oct 2, 2018 at 1:29 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Tue, Oct 2, 2018 at 10:05 AM Pavel Machek <pavel@ucw.cz> wrote:
> >
> > Hi!
> >
> > > In general Linux doesn't behave super great if you get an error while
> > > executing a device's resume handler.  Nothing will come along later
> > > and and try again to resume the device (and all devices that depend on
> > > it), so pretty much you're left with a non-functioning device and
> > > that's not good.
> > >
> > > However, even though you'll end up with a non-functioning device we
> > > still don't consider resume failures to be fatal to the system.  We'll
> > > keep chugging along and just hope that the device that failed to
> > > resume wasn't too critical.  This establishes the precedent that we
> > > should at least try our best not to fully bork the system after a
> > > resume failure.
> > >
> > > I will argue that the best way to keep the system in the best shape is
> > > to assume that if a resume callback failed that it did as close to
> > > no-op as possible.  Because of this we should consider the device
> > > still suspended and shouldn't try to suspend the device again next
> > > time around.  Today that's not what happens.  AKA if you have a
> > > device
> >
> > I don't think there are many guarantees when device resume fail. It
> > may have done nothing, and it may have resumed the device almost
> > fully.

With today's drivers you're probably right.  I don't think too many
drivers properly undo everything when they fail in the resume path.

...but it makes sense to have some sort of standard for how drivers
ought to behave.  IMO the most sane thing to do (and what we do in
probe--which is why EPROBE_DEFER mostly works) is that if a function
returns an error then it should try its best to undo everything it
did.  Then you can always try calling it again later.


> > I guess the best option would be to refuse system suspend after some
> > device failed like that.

Yeah, I thought about that.  I think we sometimes end up in that
situation today (without my patch) because the suspend callback
doesn't always work so great if the previous resume failed, so it's
entirely likely it'll return an error and block suspend...

 ...another thing I thought about was possibly trying to forcibly
unbind the device.  AKA: it's misbehaving so kick it out of the
system...  Of course if "resume" failed perhaps "remove" won't be so
happy...


> > That leaves user possibility to debug it...

You mean they wouldn't notice that some peripheral failed to resume
but they'd notice that their computer doesn't suspend any more?  Maybe
they'd notice, but it'd probably because they pulled their laptop out
of their bag and it was hot and had a dead batter.


> I guess so.
>
> Doing that in all cases is kind of risky IMO, because we haven't taken
> the return values of the ->resume* callbacks into account so far
> (except for printing the information that is), so there may be
> non-lethal cases when that happens and the $subject patch would make
> them not work any more.

I think you're arguing that the best option is to leave the code / API
exactly as-is because someone could be relying on the existing
behavior?  That is certainly the least likely to introduce any new
bugs.  ;-P

...would you accept a patch adding a comment codifying the existing
behavior (AKA suspend will be called again even if resume failed) as
the officially documented behavior?  Then we can start making new
drivers behave correctly at least.  If nothing else I can add a
boolean inside my driver data that says "resume failed, ignore the
next suspend".

...or if the official word is that if your resume fails you're totally
unrecoverable then I can start simplifying the error handling in
resume.  AKA instead of:

  hypothetical_resume(...) {
    ret = clk_enable(...);
    if (ret)
      return ret;
    ret = regulator_enable(...);
    if (ret)
      clk_disable(...);
    return ret;

...I can just change it to:

  hypothetical_resume(...) {
    ret = clk_enable(...);
    if (ret)
      return ret;
    return regulator_enable(...);

...the above would leave no way to recover the system because if
hypothetical_resume() returned an error we'd have no idea if the clock
was left enabled or not.  ...but if we're unrecoverable anyway why not
save the code?


-Doug
Rafael J. Wysocki Oct. 2, 2018, 9:16 p.m. UTC | #4
On Tue, Oct 2, 2018 at 11:01 PM Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Tue, Oct 2, 2018 at 1:29 AM Rafael J. Wysocki <rafael@kernel.org> wrote:

[cut]

> > I guess so.
> >
> > Doing that in all cases is kind of risky IMO, because we haven't taken
> > the return values of the ->resume* callbacks into account so far
> > (except for printing the information that is), so there may be
> > non-lethal cases when that happens and the $subject patch would make
> > them not work any more.
>
> I think you're arguing that the best option is to leave the code / API
> exactly as-is because someone could be relying on the existing
> behavior?  That is certainly the least likely to introduce any new
> bugs.  ;-P
>
> ...would you accept a patch adding a comment codifying the existing
> behavior (AKA suspend will be called again even if resume failed) as
> the officially documented behavior?

It is documented already IIRC, but yes.

> Then we can start making new
> drivers behave correctly at least.  If nothing else I can add a
> boolean inside my driver data that says "resume failed, ignore the
> next suspend".

Or maybe "fail the next suspend" even?

> ...or if the official word is that if your resume fails you're totally
> unrecoverable then I can start simplifying the error handling in
> resume.  AKA instead of:
>
>   hypothetical_resume(...) {
>     ret = clk_enable(...);
>     if (ret)
>       return ret;
>     ret = regulator_enable(...);
>     if (ret)
>       clk_disable(...);
>     return ret;
>
> ...I can just change it to:
>
>   hypothetical_resume(...) {
>     ret = clk_enable(...);
>     if (ret)
>       return ret;
>     return regulator_enable(...);
>
> ...the above would leave no way to recover the system because if
> hypothetical_resume() returned an error we'd have no idea if the clock
> was left enabled or not.  ...but if we're unrecoverable anyway why not
> save the code?

This really depends on the particular case.

If you deal with clocks directly, then you pretty much know whether or
not things are recoverable after a failing device resume, but if AML
tells you that it failed (say), you don't really know what happened.
In many cases the device that failed to resume will not work correctly
in the working state, but attempting to suspend it again may be fine.
It may recover after the next suspend-resume cycle even sometimes.  So
IMO drivers can do "smart" things if they really want to and know
enough, but there really is too much variation to handle it in the
core in a uniform way.

Thanks,
Rafael
Doug Anderson Oct. 2, 2018, 10:16 p.m. UTC | #5
Hi,

On Tue, Oct 2, 2018 at 2:16 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Tue, Oct 2, 2018 at 11:01 PM Doug Anderson <dianders@chromium.org> wrote:
> >
> > Hi,
> >
> > On Tue, Oct 2, 2018 at 1:29 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> [cut]
>
> > > I guess so.
> > >
> > > Doing that in all cases is kind of risky IMO, because we haven't taken
> > > the return values of the ->resume* callbacks into account so far
> > > (except for printing the information that is), so there may be
> > > non-lethal cases when that happens and the $subject patch would make
> > > them not work any more.
> >
> > I think you're arguing that the best option is to leave the code / API
> > exactly as-is because someone could be relying on the existing
> > behavior?  That is certainly the least likely to introduce any new
> > bugs.  ;-P
> >
> > ...would you accept a patch adding a comment codifying the existing
> > behavior (AKA suspend will be called again even if resume failed) as
> > the officially documented behavior?
>
> It is documented already IIRC, but yes.

Ah ha!  I'm guessing this is the documentation you're talking about in pm.h?

 * All of the above callbacks, except for @complete(), return error codes.
 * However, the error codes returned by @resume(), @thaw(), @restore(),
 * @resume_noirq(), @thaw_noirq(), and @restore_noirq(), do not cause the PM
 * core to abort the resume transition during which they are returned.  The
 * error codes returned in those cases are only printed to the system logs for
 * debugging purposes.  Still, it is recommended that drivers only return error
 * codes from their resume methods in case of an unrecoverable failure (i.e.
 * when the device being handled refuses to resume and becomes unusable) to
 * allow the PM core to be modified in the future, so that it can avoid
 * attempting to handle devices that failed to resume and their children.

To me the above reads as "the behavior of the kernel right now isn't
quite right, but we'll fix it in the future".  It also don't
explicitly state that the next "suspend" will still be called.  That's
implicit in the "all we do is print a message" but the "we'll fix it
in the future" makes me feel like that might change.

...if there's some other documentation you're thinking of then I'm
happy to keep looking.


> > ...or if the official word is that if your resume fails you're totally
> > unrecoverable then I can start simplifying the error handling in
> > resume.  AKA instead of:
> >
> >   hypothetical_resume(...) {
> >     ret = clk_enable(...);
> >     if (ret)
> >       return ret;
> >     ret = regulator_enable(...);
> >     if (ret)
> >       clk_disable(...);
> >     return ret;
> >
> > ...I can just change it to:
> >
> >   hypothetical_resume(...) {
> >     ret = clk_enable(...);
> >     if (ret)
> >       return ret;
> >     return regulator_enable(...);
> >
> > ...the above would leave no way to recover the system because if
> > hypothetical_resume() returned an error we'd have no idea if the clock
> > was left enabled or not.  ...but if we're unrecoverable anyway why not
> > save the code?
>
> This really depends on the particular case.
>
> If you deal with clocks directly, then you pretty much know whether or
> not things are recoverable after a failing device resume, but if AML
> tells you that it failed (say), you don't really know what happened.
> In many cases the device that failed to resume will not work correctly
> in the working state, but attempting to suspend it again may be fine.
> It may recover after the next suspend-resume cycle even sometimes.  So
> IMO drivers can do "smart" things if they really want to and know
> enough, but there really is too much variation to handle it in the
> core in a uniform way.

Got it.  Right that every driver will be different and we can't
possibly magically "fix the world" and universally recover from all
errors.  ...and putting too much smarts in the drivers doesn't make a
lot of sense since really we're in a mostly unrecoverable place
anyway.

In any case, if I don't hear anything else I'll assume that the
officially documented suggestion is to assume that suspend() will
still be called after a failed resume() (AKA today's behavior) and I
should code drivers to that standard until I hear otherwise.

Thanks for your time.

-Doug
Rafael J. Wysocki Oct. 3, 2018, 8:46 a.m. UTC | #6
On Wednesday, October 3, 2018 12:16:51 AM CEST Doug Anderson wrote:
> Hi,
> 
> On Tue, Oct 2, 2018 at 2:16 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Tue, Oct 2, 2018 at 11:01 PM Doug Anderson <dianders@chromium.org> wrote:
> > >
> > > Hi,
> > >
> > > On Tue, Oct 2, 2018 at 1:29 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > [cut]
> >
> > > > I guess so.
> > > >
> > > > Doing that in all cases is kind of risky IMO, because we haven't taken
> > > > the return values of the ->resume* callbacks into account so far
> > > > (except for printing the information that is), so there may be
> > > > non-lethal cases when that happens and the $subject patch would make
> > > > them not work any more.
> > >
> > > I think you're arguing that the best option is to leave the code / API
> > > exactly as-is because someone could be relying on the existing
> > > behavior?  That is certainly the least likely to introduce any new
> > > bugs.  ;-P
> > >
> > > ...would you accept a patch adding a comment codifying the existing
> > > behavior (AKA suspend will be called again even if resume failed) as
> > > the officially documented behavior?
> >
> > It is documented already IIRC, but yes.
> 
> Ah ha!  I'm guessing this is the documentation you're talking about in pm.h?
> 
>  * All of the above callbacks, except for @complete(), return error codes.
>  * However, the error codes returned by @resume(), @thaw(), @restore(),
>  * @resume_noirq(), @thaw_noirq(), and @restore_noirq(), do not cause the PM
>  * core to abort the resume transition during which they are returned.  The
>  * error codes returned in those cases are only printed to the system logs for
>  * debugging purposes.  Still, it is recommended that drivers only return error
>  * codes from their resume methods in case of an unrecoverable failure (i.e.
>  * when the device being handled refuses to resume and becomes unusable) to
>  * allow the PM core to be modified in the future, so that it can avoid
>  * attempting to handle devices that failed to resume and their children.
> 
> To me the above reads as "the behavior of the kernel right now isn't
> quite right, but we'll fix it in the future".

This is just a recommendation due to a possible change in the core in future,
not a FIXME comment or similar.

And this recommendation hasn't been universally followed AFAICS.

> It also don't explicitly state that the next "suspend" will still be called.

It also doesn't explicitly state that the next "suspend" will not be called. :-)

> That's implicit in the "all we do is print a message" but the "we'll fix it
> in the future" makes me feel like that might change.

It might, but it didn't. :-)

> ...if there's some other documentation you're thinking of then I'm
> happy to keep looking.

There is the more detailed suspend and resume description in
Documentation/driver-api/pm/devices.rst (but that doesn't say what
will happen on the next suspend if the current resume fails too, which
basically means to expect it to be carried out as usual).

> > > ...or if the official word is that if your resume fails you're totally
> > > unrecoverable then I can start simplifying the error handling in
> > > resume.  AKA instead of:
> > >
> > >   hypothetical_resume(...) {
> > >     ret = clk_enable(...);
> > >     if (ret)
> > >       return ret;
> > >     ret = regulator_enable(...);
> > >     if (ret)
> > >       clk_disable(...);
> > >     return ret;
> > >
> > > ...I can just change it to:
> > >
> > >   hypothetical_resume(...) {
> > >     ret = clk_enable(...);
> > >     if (ret)
> > >       return ret;
> > >     return regulator_enable(...);
> > >
> > > ...the above would leave no way to recover the system because if
> > > hypothetical_resume() returned an error we'd have no idea if the clock
> > > was left enabled or not.  ...but if we're unrecoverable anyway why not
> > > save the code?
> >
> > This really depends on the particular case.
> >
> > If you deal with clocks directly, then you pretty much know whether or
> > not things are recoverable after a failing device resume, but if AML
> > tells you that it failed (say), you don't really know what happened.
> > In many cases the device that failed to resume will not work correctly
> > in the working state, but attempting to suspend it again may be fine.
> > It may recover after the next suspend-resume cycle even sometimes.  So
> > IMO drivers can do "smart" things if they really want to and know
> > enough, but there really is too much variation to handle it in the
> > core in a uniform way.
> 
> Got it.  Right that every driver will be different and we can't
> possibly magically "fix the world" and universally recover from all
> errors.  ...and putting too much smarts in the drivers doesn't make a
> lot of sense since really we're in a mostly unrecoverable place
> anyway.
> 
> In any case, if I don't hear anything else I'll assume that the
> officially documented suggestion is to assume that suspend() will
> still be called after a failed resume() (AKA today's behavior) and I
> should code drivers to that standard until I hear otherwise.

Yes, that's the current behavior and there are no plans to change it.

Thanks,
Rafael
diff mbox series

Patch

diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
index 3f68e2919dc5..27c7d1f76cee 100644
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -999,7 +999,7 @@  static int device_resume(struct device *dev, pm_message_t state, bool async)
 
  End:
 	error = dpm_run_callback(callback, dev, state, info);
-	dev->power.is_suspended = false;
+	dev->power.is_suspended = error;
 
  Unlock:
 	device_unlock(dev);
@@ -1750,6 +1750,9 @@  static int __device_suspend(struct device *dev, pm_message_t state, bool async)
 	dpm_watchdog_set(&wd, dev);
 	device_lock(dev);
 
+	if (dev->power.is_suspended)
+		goto End;
+
 	if (dev->pm_domain) {
 		info = "power domain ";
 		callback = pm_op(&dev->pm_domain->ops, state);