diff mbox

[2/4] PM / Loss: power loss management

Message ID 1305220265-9020-3-git-send-email-lamiaposta71@gmail.com (mailing list archive)
State RFC, archived
Headers show

Commit Message

Raffaele Recalcati May 12, 2011, 5:11 p.m. UTC
From: Davide Ciminaghi <ciminaghi@gnudd.com>

On some embedded systems, an asynchronous power failure notification event is
available some tens to hundreds milliseconds before the actual power failure
takes place.
Such an event can be used to trigger some actions, typically shutting down
all non-vital power sinks, thus allowing the board to survive longer to
temporary power losses.

Signed-off-by: Davide Ciminaghi <ciminaghi@gnudd.com>
Signed-off-by: Raffaele Recalcati <raffaele.recalcati@bticino.it>
---
 Documentation/power/loss.txt |  191 +++++++++++++
 drivers/base/bus.c           |    6 +
 drivers/base/init.c          |    1 +
 drivers/base/platform.c      |   14 +
 drivers/base/power/Makefile  |    2 +
 drivers/base/power/loss.c    |  648 ++++++++++++++++++++++++++++++++++++++++++
 drivers/base/power/power.h   |   14 +
 include/linux/pm.h           |   16 +
 include/linux/pm_loss.h      |  109 +++++++
 kernel/power/Kconfig         |   27 ++
 10 files changed, 1028 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/power/loss.txt
 create mode 100644 drivers/base/power/loss.c
 create mode 100644 include/linux/pm_loss.h

Comments

Davide Ciminaghi May 13, 2011, 1:02 p.m. UTC | #1
On Thu, May 12, 2011 at 09:57:46PM +0200, Rafael J. Wysocki wrote:
> Hi,
>
> On Thursday, May 12, 2011, Raffaele Recalcati wrote:
> > From: Davide Ciminaghi <ciminaghi@gnudd.com>
> > 

hello, I (Davide) will reply as the patch author.

....................
> 
> I'm not really sure about that.  Do you want to hardcode policies in some
> platform/board-specific files inside of the kernel?
> 

I think a little history of the patch can help: it looks like recent mmc 
devices can do any sort of weird things in case of sudden power losses during 
write operations. This is still to be confirmed by actual tests on our
particular hardware, but we suspect that even a journalling fs could be 
corrupted in such a case. 
Stopping the mmc request queue when a power loss is detected, does not 
completely solve the problem, but at least mitigates the risk of disasters.
Plus, we needed to get our board to immediately turn off any unneeded power 
sink in case of temporary power losses (just try to survive as long as you 
can ...). 
Our hardware triggers an interrupt about 100 millisenconds before the cpu 
dies, so we needed to notify the drivers about the event as fast as we could.
That was one of the two main reasons why we chose to put this stuff inside 
the kernel. The other reason was that the mmc request queue can be easily 
stopped from kernel space.
Since forcing a given policy seemed the wrong thing to do, we chose to adopt
a modular policy approach.

> > +The header file include/linux/pm_loss.h contains all the pm_loss function
> > +prototypes, together with definitions of data structures.
> > +For what concerns power loss policies, the framework already contains a couple
> > +of predefined policies: "nop" and "default" (see later paragraphs for more
> > +details).
> 
> I think it would be cleaner to split the patch so that the actual functionality
> is added first and the policy part goes in a separate patch on top of that.
> 

agreed.

> > +
> > +1.1 Sysfs interface.
> > +
> > +It can be useful (for instance during tests), to be able to quickly switch
> > +from one power loss policy to another, or to simulate power fail and resume
> > +events. To this end, a sysfs interface of the following form has been devised:
> > +
> > +/sys/power/loss +
> > +		|
> > +		+-- active_policy
> > +		|
> > +		+-- policies -+
> > +		|	      |
> > +		|	      +-- nop
> > +		|	      |
> > +		|	      +-- default +
> > +		|	      |		  |
> > +		|	      |	 	  +-- bus_table
> > +		|	      |
> > +		|	      +-- ....
> > +		|
> > +		+-- pwrfail_sim
> > +
> > +2. Details
> > +
> > +2.1 Sending events to the power loss management core.
> > +
> > +The board specific code shall trigger a power failure event notification by
> > +calling pm_loss_power_changed(SYS_PWR_FAILING).
> > +In case of a temporary power loss, the same function shall be called with
> > +SYS_PWR_GOOD argument on power resume. pm_loss_power_changed() can sleep, so
> > +it shall not be called from interrupt context.
> > +
> > +2.3 Effects on bus and device drivers.
> > +
> > +One more entry has been added to the device PM callbacks:
> > +
> > +   int (*power_changed)(struct device *, enum sys_power_state);
> 
> Please don't use the second argument.  Make it two (or as many as you need)
> callbacks each with one struct device * argument instead (following the
> convention we have in struct dev_pm_ops already).
> 
> That said, I'm not quite sure we _need_ those additional callbacks at all.
> Your example mmc implementation appears to use a simple "runtime suspend on
> power fail, runtime resume on power good" approach, for which it's not
> necessary to add any new callbacks.
> 

maybe this is design choice comes from our little knowledge of pm_runtime, 
but the reason why we didn't use the existing callbacks was that we wanted 
to avoid any possible conflict.
In our understanding, pm_runtime works more or less like this: "when 
you're done with a peripheral, turn it off" (well, maybe also wait a little 
while to avoid continuously toggling power on and off, thus wasting more 
power than just doing nothing). This is because its goal is to save power 
during normal operation.
On the other hand, we needed something like this: "when some external event 
takes place, just turn off anything you can as fast as you can".
Actually, one of the attempts we made before adding pm_loss was to implement 
it via pm_runtime, by just immediately stopping devices on idle and refusing
to turn them on again in case a power loss event took place in the meanwhile.
This worked, and was ok for drivers with no implementation of the pm_runtime 
callbacks, but introduced a potential conflict as far as pm_runtime enabled 
drivers were concerned. That's why we (maybe wrongfully) thought a new 
callback was needed.

> > +This function can be optionally invoked by power loss policies in case of
> > +system power changes (loss and/or resume). Of course every bus or device driver
> > +can react to such events in some specific way. For instance the mmc block
> > +driver will probably block its request queue during a temporary power loss.
> 
> I think you'd have to deal with user space somehow before suspending devices.
> Runtime PM suspends devices when they aren't in use (as indicated by the
> usage counters), but in this power loss case we may really end up suspending
> devices that _are_ in use, right?
> 
yes. This is not a problem for us right now, because when a power loss happens,
the "power bad" condition just lasts for about 100ms. At present, pm_loss
callbacks are implemented for the mmc block device and a video capture device
only.
In practice what happens on our platform is that processes accessing the mmc 
can be put to sleep until either the board dies or a power resume event takes 
place. If video capture is running, some "black" frames are generated, but 
that is not considered a real fault because power losses should not occur 
during normal operation (unless you really turn the board off, of course).

That said, I think you're right anyway, userspace should be notified somehow.
SIGPWR to init ?. As far as I know, there's no single place in the kernel 
where SIGPWR is sent to init:

find . -name \*.c  | xargs grep SIGPWR
...
./drivers/char/snsc_event.c:            kill_cad_pid(SIGPWR, 0);
...
./arch/s390/kernel/nmi.c:               kill_cad_pid(SIGPWR, 1);

Maybe pm_loss could become such a place ?

> > +2.3.1 The platform bus.
> > +
> > +For what concerns platform bus drivers, platform specific code can override
> > +the power_changed() callback :
> > +
> > +platform_pm_loss_power_changed(struct device *dev, enum sys_power_state s)
> > +
> > +whose default (empty) version is defined as a __weak symbol in
> > +drivers/base/platform.c.
> 
> Please don't use any more __weak symbols in that file.
>
ok.
 
> > +2.4 Power loss policies.
> > +
> > +A power loss policy can be registered via the pm_loss_register_policy()
> > +function, which receives:
> > +
> > +      + A pointer to a struct pm_loss_policy_ops , which hosts the pointers
> > +      to the policy's specific callbacks (see below for more details).
> > +      + A pointer to a struct kobj_type, which will allow the pm loss core
> > +      to setup the policy related directory under /sys/power/loss/policies.
> > +      See Documentation/kobject.txt for more details on struct kobj_type.
> > +      + A pointer to an array of chars containing the name of the policy.
> > +      + A pointer to the policy's private data (void *).
> 
> This sounds like quilte a bit of new infrastructure and it seems it might
> be implemented in a more straightforward way (presumably in a less general
> but still useful manner).
>
well, we just followed the most generic approach. Maybe it could be simplified.
 
> > +A power loss policy can define the following callbacks:
> > +
> > +      + int (*bus_added)(struct pm_loss_policy *, struct bus_type *) : this
> > +      is invoked whenever a new bus has been registered within the system.
> > +      Since power related events are often managed at bus level, it can be
> > +      useful for the policy to have a list of available busses within the
> > +      system. When a policy is registered, this callback is invoked once
> > +      for every already registered bus.
> 
> Well, we have a list of registered bus types anyway.  They may or may not
> register "power loss" callbacks.  Why don't you simply execute the callbacks
> from the bus types that defined them?  What exactly is the purpose of the
> new list?
> 
this list is ordered by priority.

> > +      + int (*bus_removed)(struct pm_loss_policy *, struct bus_type *):
> > +      this is invoked when a bus is removed from the system.
> > +      + int (*start)(struct pm_loss_policy *): this is called when a policy
> > +      becomes active.
> > +      + void (*stop)(struct pm_loss_policy *): this is called when a policy
> > +      becomes inactive.
> 
> What are the start/stop callbacks needed for?  IOW, what's your envisioned
> use case for those callbacks?
> 

Honestly they were added without having nothing special in mind (it just looked
more flexible), and in fact we were going to remove them, until they showed
useful if a policy table is owned by a module. In such case 
you need some place to do module_get/put, and start/stop looked just the 
right place. So, since we found at least one possible use for them, we decided
to leave them in place.
Actually, your question made me review the code and I've just found that
try_module_get() is called from pm_loss_setup_default_policy(), not from
the start() callback of the default policy. This looks strange, I think I
should review this part.

> > +2.4.1 The nop predefined policy
> > +
> > +The nop policy is the first active policy when the pm loss framework is
> > +initialized. It just does nothing in case of power loss / power resume
> > +events.
> > +
> > +2.4.2 The default predefined policy
> > +
> > +When a power failure warning is received, the default policy walks through a
> > +list of critical busses and invokes their drivers' power_changed() callback.
> 
> Why not to walk all bus types here?
> 
because of priorities. More prioritary busses are notified first.

> > +The default policy can be customized and registered by calling:
> > +
> > +    pm_loss_setup_default_policy(struct pm_loss_default_policy_table *);
> > +
> > +This function receives a pointer to a pm_loss_default_policy_table structure,
> > +which defines a priority ordered list of critical buffers:
> > +
> > +      struct pm_loss_default_policy_table {
> > +	     struct module *owner ;
> > +	     const char *name ;
> > +	     struct pm_loss_default_policy_item *items;
> > +	     int nitems;
> > +      };
> > +
> > +Here's a short description of such structure:
> > +
> > +      + owner  : shall point to the module registering the table (or NULL
> > +		 if the table is not instantiated by a module).
> > +      + name   : this is the name given to this particular customization of
> > +		 the default policy.
> > +      + items  : pointer to an array of table items.
> > +      + nitems : number of the items in the array.
> > +
> > +Each item is a struct pm_loss_default_policy_item, defined as follows:
> > +
> > +     struct pm_loss_default_policy_item {
> > +	const char *bus_name;
> > +	int bus_priority ;
> > +     };
> > +
> > +where bus_name is the name of a bus and bus_priority is a numerical priority
> > +of the bus itself. Numerically higher priorities correspond to more prioritary
> > +busses.
> 
> So I can understand the need for priorities, but why do you use the name
> instead of a pointer to struct bus_type?
>
The idea here was to make it simpler writing a bus priority table.
For instance this is how a sample table looks now:

struct pm_loss_default_policy_item my_board_pm_loss_policy_items[] = {
        {
                .bus_name = "mmc",
                .bus_priority = MY_BOARD_PWR_FAIL_PRIO_1,
        },
        {
                .bus_name = "platform",
                .bus_priority = MY_BOARD_PWR_FAIL_PRIO_2,
        },
};

A struct bus_type * would be less immediate.
Using a name is slower, but the operation of finding a bus given its name
is only done when you register the policy (or when a new bus is added after
policy registration).
 
> > +
> > +2.4.3 Activating a specific policy.
> > +
> > +A policy can be activated:
> > +
> > +	+ From within the kernel by calling pm_loss_set_policy(char *name).
> > +	The argument passed to this function shall be the name of the policy
> > +	to be activated.
> > +
> > +	+ From user space by writing the name of the policy to be activated
> > +	to /sys/power/loss/active_policy.
> > +
> > +2.4.4 Unregistering a policy
> > +
> > +For a generic user defined policy, just call :
> > +
> > +     pm_loss_unregister_policy(struct pm_loss_policy *);
> > +
> > +For a default policy customization:
> > +
> > +     pm_loss_shutdown_default_policy(struct pm_loss_policy *);
> > +
> > +3. Kernel configuration
> > +
> > +The following configuration options have been defined:
> > +
> > +    CONFIG_PM_LOSS     : this enables the generic pm_loss framework.
> > +    CONFIG_PM_LOSS_SIM : this adds the pwrfail_sim file to the sysfs interface
> > +		         and allows power loss events simulation.
> > +    CONFIG_PM_LOSS_DEBUG : this option enables some debug printk's .
> > +    CONFIG_PM_LOSS_TEST: this enables the compilation of a sample test module
> > +                         containing a customized default policy definition
> 
> OK, so I think that this feature may be kind of useful, but it will require
> some work before it's ready to go into the kernel.
> 
> Please separate the policy part for starters and let's see what's left.
> 
ok.

Thanks and regards
Davide
Rafael Wysocki May 17, 2011, 11:32 p.m. UTC | #2
Hi,

Sorry for the delay.

On Friday, May 13, 2011, Davide Ciminaghi wrote:
> On Thu, May 12, 2011 at 09:57:46PM +0200, Rafael J. Wysocki wrote:
> > Hi,
> >
> > On Thursday, May 12, 2011, Raffaele Recalcati wrote:
> > > From: Davide Ciminaghi <ciminaghi@gnudd.com>
> > > 
> 
> hello, I (Davide) will reply as the patch author.
> 
> ....................
> > 
> > I'm not really sure about that.  Do you want to hardcode policies in some
> > platform/board-specific files inside of the kernel?
> > 
> 
> I think a little history of the patch can help: it looks like recent mmc 
> devices can do any sort of weird things in case of sudden power losses during 
> write operations. This is still to be confirmed by actual tests on our
> particular hardware, but we suspect that even a journalling fs could be 
> corrupted in such a case.

That very well may be the case.

> Stopping the mmc request queue when a power loss is detected, does not 
> completely solve the problem, but at least mitigates the risk of disasters.
> Plus, we needed to get our board to immediately turn off any unneeded power 
> sink in case of temporary power losses (just try to survive as long as you 
> can ...). 

I understand the motivation I think. :-)

> Our hardware triggers an interrupt about 100 millisenconds before the cpu 
> dies, so we needed to notify the drivers about the event as fast as we could.
> That was one of the two main reasons why we chose to put this stuff inside 
> the kernel. The other reason was that the mmc request queue can be easily 
> stopped from kernel space.

OK, but it's not really sufficient to notify drivers in that case, because
they may crash user space if they simply switch off devices.  Now, if the
power loss actually happens, that probably doesn't matter, but if power is
restored afterwards, that won't be nice.

The mmc request queue is a special case that shouldn't affect user space
directly, but other devices may be accessed in many different ways and
not all of them are safe to intercept forcibly.

However, even for the mmc case you'd need a mechanism that will force
the filesystem on it to flush everything to the storage immediately.

> Since forcing a given policy seemed the wrong thing to do, we chose to adopt
> a modular policy approach.
> 
> > > +The header file include/linux/pm_loss.h contains all the pm_loss function
> > > +prototypes, together with definitions of data structures.
> > > +For what concerns power loss policies, the framework already contains a couple
> > > +of predefined policies: "nop" and "default" (see later paragraphs for more
> > > +details).
> > 
> > I think it would be cleaner to split the patch so that the actual functionality
> > is added first and the policy part goes in a separate patch on top of that.
> > 
> 
> agreed.
> 
> > > +
> > > +1.1 Sysfs interface.
> > > +
> > > +It can be useful (for instance during tests), to be able to quickly switch
> > > +from one power loss policy to another, or to simulate power fail and resume
> > > +events. To this end, a sysfs interface of the following form has been devised:
> > > +
> > > +/sys/power/loss +
> > > +		|
> > > +		+-- active_policy
> > > +		|
> > > +		+-- policies -+
> > > +		|	      |
> > > +		|	      +-- nop
> > > +		|	      |
> > > +		|	      +-- default +
> > > +		|	      |		  |
> > > +		|	      |	 	  +-- bus_table
> > > +		|	      |
> > > +		|	      +-- ....
> > > +		|
> > > +		+-- pwrfail_sim
> > > +
> > > +2. Details
> > > +
> > > +2.1 Sending events to the power loss management core.
> > > +
> > > +The board specific code shall trigger a power failure event notification by
> > > +calling pm_loss_power_changed(SYS_PWR_FAILING).
> > > +In case of a temporary power loss, the same function shall be called with
> > > +SYS_PWR_GOOD argument on power resume. pm_loss_power_changed() can sleep, so
> > > +it shall not be called from interrupt context.
> > > +
> > > +2.3 Effects on bus and device drivers.
> > > +
> > > +One more entry has been added to the device PM callbacks:
> > > +
> > > +   int (*power_changed)(struct device *, enum sys_power_state);
> > 
> > Please don't use the second argument.  Make it two (or as many as you need)
> > callbacks each with one struct device * argument instead (following the
> > convention we have in struct dev_pm_ops already).
> > 
> > That said, I'm not quite sure we _need_ those additional callbacks at all.
> > Your example mmc implementation appears to use a simple "runtime suspend on
> > power fail, runtime resume on power good" approach, for which it's not
> > necessary to add any new callbacks.
> > 
> 
> maybe this is design choice comes from our little knowledge of pm_runtime, 
> but the reason why we didn't use the existing callbacks was that we wanted 
> to avoid any possible conflict.
> In our understanding, pm_runtime works more or less like this: "when 
> you're done with a peripheral, turn it off" (well, maybe also wait a little 
> while to avoid continuously toggling power on and off, thus wasting more 
> power than just doing nothing). This is because its goal is to save power 
> during normal operation.
> On the other hand, we needed something like this: "when some external event 
> takes place, just turn off anything you can as fast as you can".

This is more similar to system suspend than to runtime PM.

> Actually, one of the attempts we made before adding pm_loss was to implement 
> it via pm_runtime, by just immediately stopping devices on idle and refusing
> to turn them on again in case a power loss event took place in the meanwhile.
> This worked, and was ok for drivers with no implementation of the pm_runtime 
> callbacks, but introduced a potential conflict as far as pm_runtime enabled 
> drivers were concerned. That's why we (maybe wrongfully) thought a new 
> callback was needed.

Having considered that for a while I think that additional callbacks may be
necessary, because they generally may need to do more than just suspending
devices.

> > > +This function can be optionally invoked by power loss policies in case of
> > > +system power changes (loss and/or resume). Of course every bus or device driver
> > > +can react to such events in some specific way. For instance the mmc block
> > > +driver will probably block its request queue during a temporary power loss.
> > 
> > I think you'd have to deal with user space somehow before suspending devices.
> > Runtime PM suspends devices when they aren't in use (as indicated by the
> > usage counters), but in this power loss case we may really end up suspending
> > devices that _are_ in use, right?
> > 
> yes. This is not a problem for us right now, because when a power loss happens,
> the "power bad" condition just lasts for about 100ms. At present, pm_loss
> callbacks are implemented for the mmc block device and a video capture device
> only.

While that may not be a problem to you, it is a problem in general and it
should be taken into account if we are to provide a generic framework for
such things.

> In practice what happens on our platform is that processes accessing the mmc 
> can be put to sleep until either the board dies or a power resume event takes 
> place. If video capture is running, some "black" frames are generated, but 
> that is not considered a real fault because power losses should not occur 
> during normal operation (unless you really turn the board off, of course).
> 
> That said, I think you're right anyway, userspace should be notified somehow.
> SIGPWR to init ?. As far as I know, there's no single place in the kernel 
> where SIGPWR is sent to init:
> 
> find . -name \*.c  | xargs grep SIGPWR
> ...
> ./drivers/char/snsc_event.c:            kill_cad_pid(SIGPWR, 0);
> ...
> ./arch/s390/kernel/nmi.c:               kill_cad_pid(SIGPWR, 1);
> 
> Maybe pm_loss could become such a place ?

I think so.  SIGPWR seems to be a good candidate, but that would change the
current (documented) behavior where SIGPWR is a synonym for SIGINFO.

Thanks,
Rafael
Davide Ciminaghi May 19, 2011, 9:26 a.m. UTC | #3
On Wed, May 18, 2011 at 01:32:30AM +0200, Rafael J. Wysocki wrote:
> Hi,
> 
> Sorry for the delay.
>
hi,

no problem. I'm having some extremely busy days at the moment, so I'm always 
late too (as you see).

... 

> > That was one of the two main reasons why we chose to put this stuff inside 
> > the kernel. The other reason was that the mmc request queue can be easily 
> > stopped from kernel space.
> 
> OK, but it's not really sufficient to notify drivers in that case, because
> they may crash user space if they simply switch off devices.  Now, if the
> power loss actually happens, that probably doesn't matter, but if power is
> restored afterwards, that won't be nice.
> 
> The mmc request queue is a special case that shouldn't affect user space
> directly, but other devices may be accessed in many different ways and
> not all of them are safe to intercept forcibly.
> 

well, I don't know if this is completely correct, but my initial idea was to 
find some way to just block user space processes when we're not sure about 
power. Most devices can be accessed via some form of read()/write() or 
enqueue/dequeue mechanism. What I would like to do is to temporarily blocking 
such operations (or returning -EAGAIN in case of non-blocking open, for 
instance). This should slightly reduce power consumption
anyway, because user space processes cannot run. Then, each single driver 
should decide whether it is safe or not to actually turn its device off,
and then restore its state in case of power resume.

> However, even for the mmc case you'd need a mechanism that will force
> the filesystem on it to flush everything to the storage immediately.
> 

I'm not completely sure about this. What we wanted to do was to avoid powering
down the mmc while it is physically writing data into its internal memory.
If we force a sync when the power loss warning event warning happens,
it is very difficult to be able to guarantee that all buffered data will be 
written before power actually dies. So we preferred to follow another strategy:
let the mmc finish any running write operation, and then stop its request 
queue. If power really goes down, then we hope that the file system journal 
will fix things on next boot (yes, some data could get lost, but the fs should
still be mountable). On the other hand, if power resumes, nothing bad should 
happen for user space processes.

> > Since forcing a given policy seemed the wrong thing to do, we chose to adopt
....
> > 
> > maybe this is design choice comes from our little knowledge of pm_runtime, 
> > but the reason why we didn't use the existing callbacks was that we wanted 
> > to avoid any possible conflict.
> > In our understanding, pm_runtime works more or less like this: "when 
> > you're done with a peripheral, turn it off" (well, maybe also wait a little 
> > while to avoid continuously toggling power on and off, thus wasting more 
> > power than just doing nothing). This is because its goal is to save power 
> > during normal operation.
> > On the other hand, we needed something like this: "when some external event 
> > takes place, just turn off anything you can as fast as you can".
> 
> This is more similar to system suspend than to runtime PM.
> 
> > Actually, one of the attempts we made before adding pm_loss was to implement 
> > it via pm_runtime, by just immediately stopping devices on idle and refusing
> > to turn them on again in case a power loss event took place in the meanwhile.
> > This worked, and was ok for drivers with no implementation of the pm_runtime 
> > callbacks, but introduced a potential conflict as far as pm_runtime enabled 
> > drivers were concerned. That's why we (maybe wrongfully) thought a new 
> > callback was needed.
> 
> Having considered that for a while I think that additional callbacks may be
> necessary, because they generally may need to do more than just suspending
> devices.
> 

ok.

> > > > +This function can be optionally invoked by power loss policies in case of
> > > > +system power changes (loss and/or resume). Of course every bus or device driver
> > > > +can react to such events in some specific way. For instance the mmc block
> > > > +driver will probably block its request queue during a temporary power loss.
> > > 
> > > I think you'd have to deal with user space somehow before suspending devices.
> > > Runtime PM suspends devices when they aren't in use (as indicated by the
> > > usage counters), but in this power loss case we may really end up suspending
> > > devices that _are_ in use, right?
> > > 
> > yes. This is not a problem for us right now, because when a power loss happens,
> > the "power bad" condition just lasts for about 100ms. At present, pm_loss
> > callbacks are implemented for the mmc block device and a video capture device
> > only.
> 
> While that may not be a problem to you, it is a problem in general and it
> should be taken into account if we are to provide a generic framework for
> such things.
> 
> > In practice what happens on our platform is that processes accessing the mmc 
> > can be put to sleep until either the board dies or a power resume event takes 
> > place. If video capture is running, some "black" frames are generated, but 
> > that is not considered a real fault because power losses should not occur 
> > during normal operation (unless you really turn the board off, of course).
> > 
> > That said, I think you're right anyway, userspace should be notified somehow.
> > SIGPWR to init ?. As far as I know, there's no single place in the kernel 
> > where SIGPWR is sent to init:
> > 
> > find . -name \*.c  | xargs grep SIGPWR
> > ...
> > ./drivers/char/snsc_event.c:            kill_cad_pid(SIGPWR, 0);
> > ...
> > ./arch/s390/kernel/nmi.c:               kill_cad_pid(SIGPWR, 1);
> > 
> > Maybe pm_loss could become such a place ?
> 
> I think so.  SIGPWR seems to be a good candidate, but that would change the
> current (documented) behavior where SIGPWR is a synonym for SIGINFO.
> 
ok, I'll think about this.

A new version of the patchset is underway, but as I told you I'm having quite
a busy time, so I need some more days to finish and do some test.

Thanks and regards
Davide
Alan Stern May 19, 2011, 2:25 p.m. UTC | #4
On Thu, 19 May 2011, Davide Ciminaghi wrote:

> I'm not completely sure about this. What we wanted to do was to avoid powering
> down the mmc while it is physically writing data into its internal memory.
> If we force a sync when the power loss warning event warning happens,
> it is very difficult to be able to guarantee that all buffered data will be 
> written before power actually dies. So we preferred to follow another strategy:
> let the mmc finish any running write operation, and then stop its request 
> queue. If power really goes down, then we hope that the file system journal 
> will fix things on next boot (yes, some data could get lost, but the fs should
> still be mountable). On the other hand, if power resumes, nothing bad should 
> happen for user space processes.

You could consider a totally different approach.

Each platform will have a different set of high-power devices it wants
to turn off when a power-loss warning occurs.  So instead of changing
the core PM interface, you could add a new "power_loss" notifier list.  
Only the most critical drivers would need to listen for notifications, 
and this could be different drivers on different platforms.

Alan Stern
Rafael Wysocki May 19, 2011, 8:52 p.m. UTC | #5
On Thursday, May 19, 2011, Alan Stern wrote:
> On Thu, 19 May 2011, Davide Ciminaghi wrote:
> 
> > I'm not completely sure about this. What we wanted to do was to avoid powering
> > down the mmc while it is physically writing data into its internal memory.
> > If we force a sync when the power loss warning event warning happens,
> > it is very difficult to be able to guarantee that all buffered data will be 
> > written before power actually dies. So we preferred to follow another strategy:
> > let the mmc finish any running write operation, and then stop its request 
> > queue. If power really goes down, then we hope that the file system journal 
> > will fix things on next boot (yes, some data could get lost, but the fs should
> > still be mountable). On the other hand, if power resumes, nothing bad should 
> > happen for user space processes.
> 
> You could consider a totally different approach.
> 
> Each platform will have a different set of high-power devices it wants
> to turn off when a power-loss warning occurs.  So instead of changing
> the core PM interface, you could add a new "power_loss" notifier list.  
> Only the most critical drivers would need to listen for notifications, 
> and this could be different drivers on different platforms.

Moreover, it would allow not only drivers, but also filesystems (for
one example) to get notifications.

Thanks,
Rafael
diff mbox

Patch

diff --git a/Documentation/power/loss.txt b/Documentation/power/loss.txt
new file mode 100644
index 0000000..10da89c
--- /dev/null
+++ b/Documentation/power/loss.txt
@@ -0,0 +1,191 @@ 
+**** POWER LOSS MANAGEMENT ****
+
+Davide Ciminaghi <ciminaghi@gnudd.com> 2011
+
+1. Overview
+
+On some embedded systems, an asynchronous power failure notification event is
+available some tens to hundreds milliseconds before the actual power failure
+takes place.
+Such an event can be used to trigger some actions, typically shutting down
+all non-vital power sinks, thus allowing the board to survive longer to
+temporary power losses. Sometimes, also flash-based block devices can be
+stopped after a power loss event notification has been received. This should
+be useful for mmc devices, for which a sudden power failure while a write
+command is being executed can threaten file system integrity even in case a
+journalling fs is in use.
+Generally speaking, one can expect the course of action taken when a power
+failure warning is received to be deeply system specific. Similarly, the
+mechanism used for power failure notifications can equally be board/platform
+specific. For these reasons, support for power loss management has been split
+into three parts:
+
+	+ Generic code (board and driver independent).
+	+ Board dependent code.
+	+ Power loss policy definition.
+
+The generic part of the code is located under drivers/base/power/loss.c .
+On the other hand, board dependent code and power loss policies definitions
+should live somewhere in the platform/board specific files.
+The header file include/linux/pm_loss.h contains all the pm_loss function
+prototypes, together with definitions of data structures.
+For what concerns power loss policies, the framework already contains a couple
+of predefined policies: "nop" and "default" (see later paragraphs for more
+details).
+
+1.1 Sysfs interface.
+
+It can be useful (for instance during tests), to be able to quickly switch
+from one power loss policy to another, or to simulate power fail and resume
+events. To this end, a sysfs interface of the following form has been devised:
+
+/sys/power/loss +
+		|
+		+-- active_policy
+		|
+		+-- policies -+
+		|	      |
+		|	      +-- nop
+		|	      |
+		|	      +-- default +
+		|	      |		  |
+		|	      |	 	  +-- bus_table
+		|	      |
+		|	      +-- ....
+		|
+		+-- pwrfail_sim
+
+2. Details
+
+2.1 Sending events to the power loss management core.
+
+The board specific code shall trigger a power failure event notification by
+calling pm_loss_power_changed(SYS_PWR_FAILING).
+In case of a temporary power loss, the same function shall be called with
+SYS_PWR_GOOD argument on power resume. pm_loss_power_changed() can sleep, so
+it shall not be called from interrupt context.
+
+2.3 Effects on bus and device drivers.
+
+One more entry has been added to the device PM callbacks:
+
+   int (*power_changed)(struct device *, enum sys_power_state);
+
+This function can be optionally invoked by power loss policies in case of
+system power changes (loss and/or resume). Of course every bus or device driver
+can react to such events in some specific way. For instance the mmc block
+driver will probably block its request queue during a temporary power loss.
+
+2.3.1 The platform bus.
+
+For what concerns platform bus drivers, platform specific code can override
+the power_changed() callback :
+
+platform_pm_loss_power_changed(struct device *dev, enum sys_power_state s)
+
+whose default (empty) version is defined as a __weak symbol in
+drivers/base/platform.c.
+
+2.4 Power loss policies.
+
+A power loss policy can be registered via the pm_loss_register_policy()
+function, which receives:
+
+      + A pointer to a struct pm_loss_policy_ops , which hosts the pointers
+      to the policy's specific callbacks (see below for more details).
+      + A pointer to a struct kobj_type, which will allow the pm loss core
+      to setup the policy related directory under /sys/power/loss/policies.
+      See Documentation/kobject.txt for more details on struct kobj_type.
+      + A pointer to an array of chars containing the name of the policy.
+      + A pointer to the policy's private data (void *).
+
+A power loss policy can define the following callbacks:
+
+      + int (*bus_added)(struct pm_loss_policy *, struct bus_type *) : this
+      is invoked whenever a new bus has been registered within the system.
+      Since power related events are often managed at bus level, it can be
+      useful for the policy to have a list of available busses within the
+      system. When a policy is registered, this callback is invoked once
+      for every already registered bus.
+      + int (*bus_removed)(struct pm_loss_policy *, struct bus_type *):
+      this is invoked when a bus is removed from the system.
+      + int (*start)(struct pm_loss_policy *): this is called when a policy
+      becomes active.
+      + void (*stop)(struct pm_loss_policy *): this is called when a policy
+      becomes inactive.
+
+2.4.1 The nop predefined policy
+
+The nop policy is the first active policy when the pm loss framework is
+initialized. It just does nothing in case of power loss / power resume
+events.
+
+2.4.2 The default predefined policy
+
+When a power failure warning is received, the default policy walks through a
+list of critical busses and invokes their drivers' power_changed() callback.
+The default policy can be customized and registered by calling:
+
+    pm_loss_setup_default_policy(struct pm_loss_default_policy_table *);
+
+This function receives a pointer to a pm_loss_default_policy_table structure,
+which defines a priority ordered list of critical buffers:
+
+      struct pm_loss_default_policy_table {
+	     struct module *owner ;
+	     const char *name ;
+	     struct pm_loss_default_policy_item *items;
+	     int nitems;
+      };
+
+Here's a short description of such structure:
+
+      + owner  : shall point to the module registering the table (or NULL
+		 if the table is not instantiated by a module).
+      + name   : this is the name given to this particular customization of
+		 the default policy.
+      + items  : pointer to an array of table items.
+      + nitems : number of the items in the array.
+
+Each item is a struct pm_loss_default_policy_item, defined as follows:
+
+     struct pm_loss_default_policy_item {
+	const char *bus_name;
+	int bus_priority ;
+     };
+
+where bus_name is the name of a bus and bus_priority is a numerical priority
+of the bus itself. Numerically higher priorities correspond to more prioritary
+busses.
+
+2.4.3 Activating a specific policy.
+
+A policy can be activated:
+
+	+ From within the kernel by calling pm_loss_set_policy(char *name).
+	The argument passed to this function shall be the name of the policy
+	to be activated.
+
+	+ From user space by writing the name of the policy to be activated
+	to /sys/power/loss/active_policy.
+
+2.4.4 Unregistering a policy
+
+For a generic user defined policy, just call :
+
+     pm_loss_unregister_policy(struct pm_loss_policy *);
+
+For a default policy customization:
+
+     pm_loss_shutdown_default_policy(struct pm_loss_policy *);
+
+3. Kernel configuration
+
+The following configuration options have been defined:
+
+    CONFIG_PM_LOSS     : this enables the generic pm_loss framework.
+    CONFIG_PM_LOSS_SIM : this adds the pwrfail_sim file to the sysfs interface
+		         and allows power loss events simulation.
+    CONFIG_PM_LOSS_DEBUG : this option enables some debug printk's .
+    CONFIG_PM_LOSS_TEST: this enables the compilation of a sample test module
+                         containing a customized default policy definition
diff --git a/drivers/base/bus.c b/drivers/base/bus.c
index 2134248..d67a506 100644
--- a/drivers/base/bus.c
+++ b/drivers/base/bus.c
@@ -910,6 +910,9 @@  int bus_register(struct bus_type *bus)
 	if (retval)
 		goto bus_attrs_fail;
 
+#ifdef CONFIG_PM_LOSS
+	pm_loss_on_bus_added(bus);
+#endif
 	pr_debug("bus: '%s': registered\n", bus->name);
 	return 0;
 
@@ -940,6 +943,9 @@  EXPORT_SYMBOL_GPL(bus_register);
 void bus_unregister(struct bus_type *bus)
 {
 	pr_debug("bus: '%s': unregistering\n", bus->name);
+#ifdef CONFIG_PM_LOSS
+	pm_loss_on_bus_removed(bus);
+#endif
 	bus_remove_attrs(bus);
 	remove_probe_files(bus);
 	kset_unregister(bus->p->drivers_kset);
diff --git a/drivers/base/init.c b/drivers/base/init.c
index c8a934e..d979da9 100644
--- a/drivers/base/init.c
+++ b/drivers/base/init.c
@@ -10,6 +10,7 @@ 
 #include <linux/memory.h>
 
 #include "base.h"
+#include "power/power.h"
 
 /**
  * driver_init - initialize driver model.
diff --git a/drivers/base/platform.c b/drivers/base/platform.c
index f051cff..427b686 100644
--- a/drivers/base/platform.c
+++ b/drivers/base/platform.c
@@ -20,6 +20,7 @@ 
 #include <linux/err.h>
 #include <linux/slab.h>
 #include <linux/pm_runtime.h>
+#include <linux/pm_loss.h>
 
 #include "base.h"
 
@@ -947,6 +948,18 @@  int __weak platform_pm_runtime_idle(struct device *dev)
 
 #endif /* !CONFIG_PM_RUNTIME */
 
+#ifdef CONFIG_PM_LOSS
+
+int __weak platform_pm_loss_power_changed(struct device *dev,
+					  enum sys_power_state s)
+{
+	return -ENOSYS;
+}
+
+#else
+#define platform_pm_loss_power_changed NULL
+#endif
+
 static const struct dev_pm_ops platform_dev_pm_ops = {
 	.prepare = platform_pm_prepare,
 	.complete = platform_pm_complete,
@@ -965,6 +978,7 @@  static const struct dev_pm_ops platform_dev_pm_ops = {
 	.runtime_suspend = platform_pm_runtime_suspend,
 	.runtime_resume = platform_pm_runtime_resume,
 	.runtime_idle = platform_pm_runtime_idle,
+	.power_changed = platform_pm_loss_power_changed,
 };
 
 struct bus_type platform_bus_type = {
diff --git a/drivers/base/power/Makefile b/drivers/base/power/Makefile
index abe46ed..6373944 100644
--- a/drivers/base/power/Makefile
+++ b/drivers/base/power/Makefile
@@ -1,6 +1,8 @@ 
 obj-$(CONFIG_PM)	+= sysfs.o
 obj-$(CONFIG_PM_SLEEP)	+= main.o wakeup.o
 obj-$(CONFIG_PM_RUNTIME)	+= runtime.o
+obj-$(CONFIG_PM_LOSS)            += loss.o
+obj-$(CONFIG_PM_LOSS_TEST)       += loss_test.o
 obj-$(CONFIG_PM_OPS)	+= generic_ops.o
 obj-$(CONFIG_PM_TRACE_RTC)	+= trace.o
 obj-$(CONFIG_PM_OPP)	+= opp.o
diff --git a/drivers/base/power/loss.c b/drivers/base/power/loss.c
new file mode 100644
index 0000000..a265b8b
--- /dev/null
+++ b/drivers/base/power/loss.c
@@ -0,0 +1,648 @@ 
+/*
+ * drivers/base/power/loss.c - Power loss management related functions
+ *
+ * Copyright (C) 2011 Davide Ciminaghi <ciminaghi@gnudd.com>
+ * Copyright (C) 2011 BTicino S.p.A.
+ *
+ * This file is released under the GPLv2.
+ */
+
+#ifdef CONFIG_PM_LOSS_DEBUG
+#define DEBUG
+#endif
+
+#include <linux/kernel.h>
+#include <linux/kobject.h>
+#include <linux/list.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/pm_loss.h>
+#include "../base.h"
+
+#define LOSS_STR "pm_loss"
+
+#define xstr(s) str(s)
+#define str(s) #s
+#define MAX_POLICY_NAME_LENGHT 20
+
+struct power_loss_state_struct {
+	spinlock_t lock;
+	enum sys_power_state state;
+	struct pm_loss_policy *active_policy;
+	wait_queue_head_t wq;
+	struct kobject *loss_kobj;
+	struct kset *policies;
+};
+
+static struct power_loss_state_struct power_state;
+
+
+void pm_loss_power_changed(enum sys_power_state s)
+{
+	pr_debug_pm_loss("pm_loss_power_changed, "
+			 "power_state.active_policy = %p\n",
+			 power_state.active_policy);
+	if (!power_state.active_policy ||
+	    !power_state.active_policy->ops ||
+	    !power_state.active_policy->ops->power_changed) {
+		pr_debug_pm_loss("pm_loss_power_changed with no active "
+				 "policy or "
+				 "policy with no sys_power_changed cb\n");
+		return ;
+	}
+	pr_debug_pm_loss("pm_loss_power_changed(%d)\n", s);
+	spin_lock_irq(&power_state.lock);
+	if (power_state.state == s) {
+		spin_unlock_irq(&power_state.lock);
+		return ;
+	}
+	if (power_state.state == SYS_PWR_NOTIFYING) {
+		DEFINE_WAIT(wait);
+
+		for (;;) {
+			prepare_to_wait(&power_state.wq, &wait,
+					TASK_UNINTERRUPTIBLE);
+			if (power_state.state != SYS_PWR_NOTIFYING)
+				break;
+			spin_unlock_irq(&power_state.lock);
+			schedule();
+			spin_lock_irq(&power_state.lock);
+		}
+	}
+	power_state.state = SYS_PWR_NOTIFYING ;
+	spin_unlock_irq(&power_state.lock);
+	if (power_state.active_policy->ops->power_changed) {
+		pr_debug_pm_loss("invoking sys_power_changed cb\n");
+		power_state.active_policy->ops->power_changed
+			(power_state.active_policy, s);
+	}
+	power_state.state = s;
+}
+EXPORT_SYMBOL(pm_loss_power_changed);
+
+#define to_bus(obj) container_of(obj, struct subsys_private, subsys.kobj)
+
+static int __bus_added_removed(struct pm_loss_policy *p,
+			       struct bus_type *bus, bool added)
+{
+	int ret = 0;
+	spin_lock_irq(&power_state.lock);
+	if (power_state.state == SYS_PWR_NOTIFYING) {
+		DEFINE_WAIT(wait);
+
+		for (;;) {
+			prepare_to_wait(&power_state.wq, &wait,
+					TASK_UNINTERRUPTIBLE);
+			if (power_state.state != SYS_PWR_NOTIFYING)
+				break;
+			spin_unlock_irq(&power_state.lock);
+			schedule();
+			spin_lock_irq(&power_state.lock);
+		}
+	}
+	if (added)
+		ret = p->ops->bus_added ? p->ops->bus_added(p, bus) : 0;
+	else
+		ret = p->ops->bus_removed ? p->ops->bus_removed(p, bus) : 0;
+	spin_unlock_irq(&power_state.lock);
+	return ret;
+}
+
+static int bus_added(struct pm_loss_policy *p, struct bus_type *bus)
+{
+	return __bus_added_removed(p, bus, true);
+}
+
+static int bus_removed(struct pm_loss_policy *p, struct bus_type *bus)
+{
+	return __bus_added_removed(p, bus, false);
+}
+
+static int policy_add_busses(struct pm_loss_policy *p)
+{
+	struct kobject *k;
+	int stat;
+
+	if (!p || !p->ops)
+		return -EINVAL;
+	if (!p->ops->bus_added)
+		return 0;
+
+	pr_debug_pm_loss("policy_add_busses()\n");
+
+	if (!bus_kset) {
+		pr_debug_pm_loss("bus_kset still NULL\n");
+		return 0;
+	}
+
+	spin_lock(&bus_kset->list_lock);
+	list_for_each_entry(k, &bus_kset->list, entry) {
+		spin_unlock(&bus_kset->list_lock);
+		/* This might sleep */
+		pr_debug_pm_loss("invoking bus_added()\n");
+		stat = bus_added(p, to_bus(k)->bus);
+		spin_lock(&bus_kset->list_lock);
+	}
+	spin_unlock(&bus_kset->list_lock);
+	return 0;
+}
+
+static void __pm_loss_on_bus_added_removed(struct bus_type *bus, bool added)
+{
+	struct list_head *i;
+	struct kobject *k;
+
+	if (!power_state.policies)
+		/* Not yet initialized */
+		return ;
+
+	spin_lock(&power_state.policies->list_lock);
+	list_for_each(i, &power_state.policies->list) {
+		k = container_of(i, struct kobject, entry);
+		if (added)
+			bus_added(to_policy(k), bus);
+		else
+			bus_removed(to_policy(k), bus);
+	}
+	spin_unlock(&power_state.policies->list_lock);
+}
+
+void pm_loss_on_bus_added(struct bus_type *bus)
+{
+	__pm_loss_on_bus_added_removed(bus, true);
+}
+EXPORT_SYMBOL(pm_loss_on_bus_added);
+
+void pm_loss_on_bus_removed(struct bus_type *bus)
+{
+	__pm_loss_on_bus_added_removed(bus, false);
+}
+EXPORT_SYMBOL(pm_loss_on_bus_removed);
+
+struct pm_loss_policy *
+pm_loss_register_policy(const struct pm_loss_policy_ops *ops,
+			struct kobj_type *ktype,
+			const char *name, void *priv)
+{
+	struct pm_loss_policy *out = NULL;
+	int ret ;
+
+	if (strlen(name) > MAX_POLICY_NAME_LENGHT)
+		printk(KERN_WARNING LOSS_STR "warning : name %s is too long, "
+		       "will be truncated\n", name);
+
+	out = kzalloc(sizeof(struct pm_loss_policy), GFP_KERNEL);
+	if (!out)
+		return NULL;
+	out->ops = ops;
+	out->priv = priv;
+	out->kobj.kset = power_state.policies;
+
+	if (policy_add_busses(out) < 0)
+		goto err0;
+
+	ret = kobject_init_and_add(&out->kobj, ktype, NULL, "%s", name);
+	if (ret < 0)
+		goto err1;
+
+	kobject_uevent(&out->kobj, KOBJ_ADD);
+
+	return out;
+
+ err1:
+	kobject_put(&out->kobj);
+ err0:
+	kfree(out);
+	return NULL;
+}
+EXPORT_SYMBOL(pm_loss_register_policy);
+
+int pm_loss_unregister_policy(struct pm_loss_policy *p)
+{
+	pr_debug_pm_loss("pm_loss_unregister_policy(%p)\n", p);
+	if (power_state.active_policy == p) {
+		pr_debug_pm_loss("cannot unregister policy (busy)\n");
+		return -EBUSY ;
+	}
+	kobject_del(&p->kobj);
+	kobject_put(&p->kobj);
+	return 0;
+}
+EXPORT_SYMBOL(pm_loss_unregister_policy);
+
+int pm_loss_set_policy(const char *name)
+{
+	struct kobject *o;
+	struct pm_loss_policy *p ;
+	spin_lock_irq(&power_state.lock);
+	o = kset_find_obj(power_state.policies, name);
+	if (!o) {
+		spin_unlock_irq(&power_state.lock);
+		return -ENODEV;
+	}
+	p = to_policy(o);
+	if (power_state.active_policy) {
+		if (power_state.active_policy->ops->stop)
+			power_state.active_policy->ops->stop
+				(power_state.active_policy);
+	}
+	power_state.active_policy = p;
+	if (power_state.active_policy->ops->start)
+		power_state.active_policy->ops->start
+			(power_state.active_policy);
+	spin_unlock_irq(&power_state.lock);
+	return 0;
+}
+EXPORT_SYMBOL(pm_loss_set_policy);
+
+/*
+ * Default release function for pm_loss_policy object
+ */
+static void pm_loss_policy_release(struct kobject *kobj)
+{
+	struct pm_loss_policy *p;
+	p = to_policy(kobj);
+	pr_debug_pm_loss("pm_loss_policy_release(%p)\n", p);
+	kfree(p);
+}
+
+/*
+ * Dummy nop policy implementation (all null pointers)
+ */
+static const struct pm_loss_policy_ops nop_policy_ops = {
+};
+
+static const struct sysfs_ops nop_sysfs_ops = {
+	.show = NULL,
+	.store = NULL,
+};
+
+static struct attribute *nop_default_attrs[] = {
+	NULL,
+};
+
+static struct kobj_type nop_ktype = {
+	.sysfs_ops = &nop_sysfs_ops,
+	.release = pm_loss_policy_release,
+	.default_attrs = nop_default_attrs,
+};
+
+/*
+ * Default policy implementation
+ */
+
+struct pm_loss_default_policy_data {
+	struct pm_loss_default_policy_table *table;
+	struct list_head busses ;
+};
+
+struct pm_loss_default_policy_bus_data {
+	struct bus_type *bus;
+	int priority ;
+	struct list_head list;
+};
+
+static struct pm_loss_default_policy_item *
+__find_bus_by_name(const char *name,
+		   struct pm_loss_default_policy_table *table)
+{
+	int i;
+	struct pm_loss_default_policy_item *out = NULL;
+	for (i = 0; i < table->nitems; i++)
+		if (!strcmp(table->items[i].bus_name, name)) {
+			out = &table->items[i];
+			break;
+		}
+	return out;
+}
+
+static void __add_bus(struct pm_loss_default_policy_data *data,
+		      struct pm_loss_default_policy_bus_data *new)
+{
+	struct pm_loss_default_policy_bus_data *bd;
+	list_for_each_entry(bd, &data->busses, list)
+		if (new->priority > bd->priority) {
+			list_add_tail(&new->list, &bd->list);
+			return;
+		}
+	list_add_tail(&new->list, &bd->list);
+}
+
+static int pm_loss_default_policy_bus_added(struct pm_loss_policy *p,
+					   struct bus_type *bus)
+{
+	struct pm_loss_default_policy_item *item ;
+	struct pm_loss_default_policy_data *pd = p->priv;
+	struct pm_loss_default_policy_bus_data *bd =
+		kzalloc(sizeof(struct pm_loss_default_policy_bus_data),
+			GFP_KERNEL);
+
+	BUG_ON(!pd);
+	if (!bd)
+		return -ENOMEM;
+	pr_debug_pm_loss("pm_loss_default_policy_bus_added(), name = %s\n",
+		 bus->name);
+	item = __find_bus_by_name(bus->name, pd->table);
+	if (!item)
+		return 0;
+
+	pr_debug_pm_loss("bus_found\n");
+
+	bd->priority = item->bus_priority ;
+	bd->bus = bus ;
+	__add_bus(pd, bd);
+
+	return 0;
+}
+
+static int pm_loss_default_policy_bus_removed(struct pm_loss_policy *p,
+					     struct bus_type *bus)
+{
+	struct pm_loss_default_policy_data *pd = p->priv;
+	struct pm_loss_default_policy_bus_data *bd, *tmp;
+
+	BUG_ON(!pd);
+	list_for_each_entry_safe(bd, tmp, &pd->busses, list)
+		if (bd->bus == bus) {
+			list_del(&bd->list);
+			kfree(bd);
+			break;
+		}
+	return 0;
+}
+
+struct bus_sys_pwr_changed_data {
+	struct bus_type *bus;
+	enum sys_power_state s;
+};
+
+static int __bus_sys_pwr_changed(struct device *dev, void *_d)
+{
+	struct bus_sys_pwr_changed_data *data =
+		(struct bus_sys_pwr_changed_data *)_d;
+	struct bus_type *bus = data->bus;
+
+	BUG_ON(!dev);
+	BUG_ON(!bus);
+
+	pr_debug_pm_loss("__bus_sys_pwr_changed(%p), bus %s\n", dev,
+			 bus->name);
+
+	if (bus->pm && bus->pm->power_changed)
+		bus->pm->power_changed(dev, data->s);
+	return 0;
+}
+
+
+static void
+pm_loss_default_policy_sys_power_changed(struct pm_loss_policy *p,
+					enum sys_power_state s)
+{
+	struct pm_loss_default_policy_data *pd = p->priv;
+	struct pm_loss_default_policy_bus_data *bd ;
+	struct bus_sys_pwr_changed_data cb_data ;
+
+	pr_debug_pm_loss("pm_loss_default_policy_sys_power_changed()\n");
+
+	BUG_ON(!pd);
+
+	list_for_each_entry(bd, &pd->busses, list) {
+		struct bus_type *bus = bd->bus;
+		const struct dev_pm_ops *pm = bus->pm;
+		pr_debug_pm_loss("bus = %s\n", bus->name);
+		if (pm && pm->power_changed) {
+			cb_data.bus = bus ; cb_data.s = s;
+			pr_debug_pm_loss("before bus_for_each_dev\n");
+			bus_for_each_dev(bus, NULL, (void *)&cb_data,
+					 __bus_sys_pwr_changed);
+		}
+	}
+}
+
+static int pm_loss_default_policy_start(struct pm_loss_policy *p)
+{
+	return 0;
+}
+
+static void pm_loss_default_policy_stop(struct pm_loss_policy *p)
+{
+	struct pm_loss_default_policy_data *pd = p->priv;
+	if (pd->table->owner)
+		module_put(pd->table->owner);
+}
+
+static const struct pm_loss_policy_ops default_policy_ops = {
+	.bus_added = pm_loss_default_policy_bus_added,
+	.bus_removed = pm_loss_default_policy_bus_removed,
+	.power_changed = pm_loss_default_policy_sys_power_changed,
+	.start = pm_loss_default_policy_start,
+	.stop = pm_loss_default_policy_stop,
+};
+
+/* sysfs stuff */
+
+struct dflt_bus_table_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct pm_loss_policy *,
+			struct dflt_bus_table_attribute *attr,
+			char *buf);
+};
+#define to_dflt_bus_table_attr(x) \
+container_of(x, struct dflt_bus_table_attribute, attr)
+
+
+static ssize_t bus_table_show(struct pm_loss_policy *p,
+			      struct dflt_bus_table_attribute *attr,
+			      char *buf)
+{
+	struct pm_loss_default_policy_data *pd = p->priv;
+	struct pm_loss_default_policy_bus_data *bd ;
+	char *ptr = buf;
+
+	BUG_ON(!pd);
+
+	list_for_each_entry(bd, &pd->busses, list) {
+		struct bus_type *bus = bd->bus;
+		ptr += snprintf(ptr, PAGE_SIZE - (ptr - buf), "%s:%d,",
+				bus->name, bd->priority);
+	}
+	return ptr - buf;
+}
+
+static struct dflt_bus_table_attribute bus_table_attribute =
+__ATTR_RO(bus_table);
+
+
+static struct attribute *dflt_default_attrs[] = {
+	&bus_table_attribute.attr,
+	NULL,
+};
+
+/* dflt sysfs ops */
+
+static ssize_t dflt_show(struct kobject *kobj, struct attribute *_attr,
+			 char *buf)
+{
+	if (!strcmp(_attr->name, "bus_table")) {
+		struct dflt_bus_table_attribute *attr =
+			to_dflt_bus_table_attr(_attr);
+		struct pm_loss_policy *p =
+			to_policy(kobj);
+		return attr->show(p, attr, buf);
+	}
+	return -EIO;
+}
+
+static ssize_t dflt_store(struct kobject *kobj,
+			  struct attribute *attr,
+			  const char *buf, size_t len)
+{
+	return -EIO;
+}
+
+static const struct sysfs_ops dflt_sysfs_ops = {
+	.show = dflt_show,
+	.store = dflt_store,
+};
+
+static struct kobj_type dflt_ktype = {
+	.sysfs_ops = &dflt_sysfs_ops,
+	.release = pm_loss_policy_release,
+	.default_attrs = dflt_default_attrs,
+};
+
+struct pm_loss_policy *
+pm_loss_setup_default_policy(struct pm_loss_default_policy_table *table)
+{
+	struct pm_loss_policy *p ;
+	struct pm_loss_default_policy_data *data ;
+
+	if (!table)
+		return NULL;
+
+	if (table->owner && try_module_get(table->owner) < 0)
+		return NULL;
+
+	data = kzalloc(sizeof(struct pm_loss_default_policy_data), GFP_KERNEL);
+	if (!data)
+		return NULL;
+	INIT_LIST_HEAD(&data->busses);
+	data->table = table;
+
+	p = pm_loss_register_policy(&default_policy_ops, &dflt_ktype,
+				    table->name, data);
+	if (!p) {
+		kfree(data);
+		return p;
+	}
+	return p;
+}
+EXPORT_SYMBOL(pm_loss_setup_default_policy);
+
+int pm_loss_shutdown_default_policy(struct pm_loss_policy *p)
+{
+	int ret;
+	pr_debug_pm_loss("pm_loss_shutdown_default_policy()\n");
+	ret = pm_loss_unregister_policy(p);
+	if (ret < 0)
+		pr_debug_pm_loss("cannot unregister policy\n");
+	return ret;
+}
+EXPORT_SYMBOL(pm_loss_shutdown_default_policy);
+
+static ssize_t active_policy_show(struct kobject *kobj,
+				  struct kobj_attribute *attr,
+				  char *buf)
+{
+	return sprintf(buf, "%s\n",
+		       kobject_name(&power_state.active_policy->kobj));
+}
+
+
+static ssize_t active_policy_store(struct kobject *kobj,
+				   struct kobj_attribute *attr,
+				   const char *buf, size_t count)
+{
+	char pname[MAX_POLICY_NAME_LENGHT];
+	int ret;
+
+	sscanf(buf, "%" xstr(MAX_POLICY_NAME_LENGHT) "s", pname);
+	ret = pm_loss_set_policy(pname);
+	return ret < 0 ? ret : count;
+}
+
+static struct kobj_attribute active_policy_attribute =
+	__ATTR(active_policy, 0644, active_policy_show, active_policy_store);
+
+#ifdef CONFIG_PM_LOSS_SIM
+static ssize_t pwrfail_sim_store(struct kobject *kobj,
+				 struct kobj_attribute *attr,
+				 const char *buf, size_t count)
+{
+	int n;
+	sscanf(buf, "%d", &n);
+	pm_loss_power_changed(n ? SYS_PWR_FAILING : SYS_PWR_GOOD);
+	return count;
+}
+
+
+static struct kobj_attribute pwrfail_sim_attribute =
+	__ATTR(pwrfail_sim, 0200, NULL, pwrfail_sim_store);
+#endif
+
+
+static struct attribute *loss_attrs[] = {
+	&active_policy_attribute.attr,
+#ifdef CONFIG_PM_LOSS_SIM
+	&pwrfail_sim_attribute.attr,
+#endif
+	NULL,
+};
+
+static struct attribute_group loss_attr_group = {
+	.attrs = loss_attrs,
+};
+
+/*
+ * Init function
+ */
+int __init pm_loss_init(void)
+{
+	struct pm_loss_policy *p ;
+	int ret = 0;
+
+	pr_debug_pm_loss("pm_loss_init()\n");
+	spin_lock_init(&power_state.lock);
+	power_state.state = SYS_PWR_GOOD;
+	init_waitqueue_head(&power_state.wq);
+	power_state.active_policy = NULL;
+	power_state.loss_kobj = kobject_create_and_add("loss", power_kobj);
+	if (!power_state.loss_kobj) {
+		ret = -ENOMEM;
+		goto err0;
+	}
+	power_state.policies =
+		kset_create_and_add("policies", NULL, power_state.loss_kobj);
+	if (!power_state.policies) {
+		ret = -ENOMEM;
+		goto err1;
+	}
+	ret = sysfs_create_group(power_state.loss_kobj, &loss_attr_group);
+	if (ret < 0)
+		goto err2;
+	pr_debug_pm_loss("invoking pm_loss_register_policy (nop)\n");
+	p = pm_loss_register_policy(&nop_policy_ops, &nop_ktype, "nop", NULL);
+	BUG_ON(!p);
+	pr_debug_pm_loss("setting nop policy as current one\n");
+	pm_loss_set_policy("nop");
+	return ret;
+
+ err2:
+	kset_unregister(power_state.policies);
+ err1:
+	kobject_put(power_state.loss_kobj);
+ err0:
+	return ret;
+}
+
+core_initcall(pm_loss_init);
diff --git a/drivers/base/power/power.h b/drivers/base/power/power.h
index 698dde7..ad54ef5 100644
--- a/drivers/base/power/power.h
+++ b/drivers/base/power/power.h
@@ -10,6 +10,20 @@  static inline void pm_runtime_remove(struct device *dev) {}
 
 #endif /* !CONFIG_PM_RUNTIME */
 
+#ifdef CONFIG_PM_LOSS
+
+extern void pm_loss_init(void) ;
+extern void pm_loss_on_bus_added(struct bus_type *);
+extern void pm_loss_on_bus_removed(struct bus_type *);
+
+#else /* !CONFIG_PM_LOSS */
+
+static inline void pm_loss_init(void) {}
+static inline void pm_loss_on_bus_added(struct bus_type *b) {};
+static inline void pm_loss_on_bus_removed(struct bus_type *b) {};
+
+#endif
+
 #ifdef CONFIG_PM_SLEEP
 
 /* kernel/power/main.c */
diff --git a/include/linux/pm.h b/include/linux/pm.h
index 21415cc..0316cd0 100644
--- a/include/linux/pm.h
+++ b/include/linux/pm.h
@@ -40,6 +40,7 @@  extern void (*pm_power_off_prepare)(void);
  */
 
 struct device;
+enum sys_power_state ;
 
 #ifdef CONFIG_PM
 extern const char power_group_name[];		/* = "power" */
@@ -199,6 +200,10 @@  typedef struct pm_message {
  *	power state if all of the necessary conditions are satisfied.  Check
  *	these conditions and handle the device as appropriate, possibly queueing
  *	a suspend request for it.  The return value is ignored by the PM core.
+ *
+ * @power_changed: device power has changed state (used by pm_loss). This
+ *      is invoked whenever system power changes its state (from GOOD to
+ *      FAILING or viceversa)
  */
 
 struct dev_pm_ops {
@@ -219,6 +224,7 @@  struct dev_pm_ops {
 	int (*runtime_suspend)(struct device *dev);
 	int (*runtime_resume)(struct device *dev);
 	int (*runtime_idle)(struct device *dev);
+	int (*power_changed)(struct device *, enum sys_power_state);
 };
 
 #ifdef CONFIG_PM_SLEEP
@@ -421,6 +427,16 @@  enum rpm_request {
 
 struct wakeup_source;
 
+/**
+ * System power states
+ *
+ */
+enum sys_power_state {
+	SYS_PWR_GOOD = 0,
+	SYS_PWR_FAILING,
+	SYS_PWR_NOTIFYING,
+};
+
 struct dev_pm_info {
 	pm_message_t		power_state;
 	unsigned int		can_wakeup:1;
diff --git a/include/linux/pm_loss.h b/include/linux/pm_loss.h
new file mode 100644
index 0000000..46af47c
--- /dev/null
+++ b/include/linux/pm_loss.h
@@ -0,0 +1,109 @@ 
+/*
+ * pm_loss.h - Power loss management related functions, header file
+ *
+ * Copyright (C) 2011 Davide Ciminaghi <ciminaghi@gnudd.com>
+ * Copyright (C) 2011 BTicino S.p.A.
+ *
+ * This file is released under the GPLv2.
+ */
+
+#ifndef _LINUX_PM_LOSS_H
+#define _LINUX_PM_LOSS_H
+
+#include <linux/device.h>
+#include <linux/pm.h>
+
+struct pm_loss_policy ;
+
+struct pm_loss_policy_ops {
+	int (*bus_added)(struct pm_loss_policy *,
+			 struct bus_type *);
+	int (*bus_removed)(struct pm_loss_policy *,
+			   struct bus_type *);
+	void (*power_changed)(struct pm_loss_policy *,
+			      enum sys_power_state);
+	int (*start)(struct pm_loss_policy *);
+	void (*stop)(struct pm_loss_policy *);
+};
+
+struct pm_loss_policy {
+	const struct pm_loss_policy_ops *ops;
+	void *priv;
+	struct list_head list;
+	struct kobject kobj;
+};
+
+#define to_policy(k) container_of(k, struct pm_loss_policy, kobj)
+
+struct pm_loss_default_policy_table {
+	struct module *owner ;
+	const char *name ;
+	struct pm_loss_default_policy_item *items;
+	int nitems;
+};
+
+#ifdef CONFIG_PM_LOSS
+
+#ifdef CONFIG_PM_LOSS_DEBUG
+#define PM_LOSS_STR "pm_loss:"
+#define pr_debug_pm_loss(a, args...) \
+printk(KERN_INFO PM_LOSS_STR a, ## args)
+#else
+#define pr_debug_pm_loss(a, args...)
+#endif
+
+extern void pm_loss_power_changed(enum sys_power_state s);
+
+extern struct pm_loss_policy *
+pm_loss_register_policy(const struct pm_loss_policy_ops *ops,
+			struct kobj_type *ktype,
+			const char *name, void *priv);
+
+extern int pm_loss_unregister_policy(struct pm_loss_policy *);
+
+extern int pm_loss_set_policy(const char *name);
+
+struct pm_loss_default_policy_item {
+	const char *bus_name;
+	int bus_priority ;
+};
+
+extern struct pm_loss_policy *
+pm_loss_setup_default_policy(struct pm_loss_default_policy_table *);
+
+extern int pm_loss_shutdown_default_policy(struct pm_loss_policy *);
+
+#else /* !CONFIG_PM_LOSS */
+
+static inline struct pm_loss_policy *
+pm_loss_register_policy(const struct pm_loss_policy_ops *ops,
+		       void *priv)
+{
+	return NULL ;
+}
+
+static inline int pm_loss_unregister_policy(struct pm_loss_policy *p)
+{
+	return -ENOSYS ;
+}
+
+static inline int pm_loss_set_policy(struct pm_loss_policy *p)
+{
+	return -ENOSYS ;
+}
+
+static inline struct pm_loss_policy *
+pm_loss_setup_default_policy(struct pm_loss_default_policy_table *t)
+{
+	return NULL;
+}
+
+static inline int
+pm_loss_shutdown_default_policy(struct pm_loss_policy *p)
+{
+	return -ENOSYS;
+}
+
+#endif /* !CONFIG_PM_LOSS */
+
+#endif /* _LINUX_PM_LOSS_H */
diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig
index 2657299..7527847 100644
--- a/kernel/power/Kconfig
+++ b/kernel/power/Kconfig
@@ -236,6 +236,33 @@  config PM_RUNTIME
 	  responsible for the actual handling of the autosuspend requests and
 	  wake-up events.
 
+config PM_LOSS
+	bool "Help drivers reacting to system power losses"
+	depends on PM
+	---help---
+	  Enable power loss management. On some embedded systems, an
+          asynchronous power failure notification event is sometimes available
+	  less than one second before the actual power failure takes place.
+	  Custom policies can be setup to react to such an event. See
+	  Documentation/power/loss.txt for more details.
+	  If unsure, say No.
+
+config PM_LOSS_SIM
+	bool "Simulate power fail events"
+	depends on PM_LOSS
+	---help---
+	  Add the /sys/power/loss/pwrfail_sim file. Do
+		echo 1 > /sys/power/loss/pwrfail_sim
+	  to simulate a power failure and
+		echo 0 > /sys/power/loss/pwrfail_sim
+	  to simulate a power resume
+
+config PM_LOSS_DEBUG
+	bool "Enable power fail debug"
+	depends on PM_LOSS
+	---help---
+	This option enables debugging printk's for the PM_LOSS framework
+
 config PM_OPS
 	bool
 	depends on PM_SLEEP || PM_RUNTIME