mbox series

[v1,0/3] introduce priority-based shutdown support

Message ID 20231124145338.3112416-1-o.rempel@pengutronix.de (mailing list archive)
Headers show
Series introduce priority-based shutdown support | expand

Message

Oleksij Rempel Nov. 24, 2023, 2:53 p.m. UTC
Hi,

This patch series introduces support for prioritized device shutdown.
The main goal is to enable prioritization for shutting down specific
devices, particularly crucial in scenarios like power loss where
hardware damage can occur if not handled properly.

Oleksij Rempel (3):
  driver core: move core part of device_shutdown() to a separate
    function
  driver core: introduce prioritized device shutdown sequence
  mmc: core: increase shutdown priority for MMC devices

 drivers/base/core.c    | 157 +++++++++++++++++++++++++++--------------
 drivers/mmc/core/bus.c |   2 +
 include/linux/device.h |  51 ++++++++++++-
 kernel/reboot.c        |   4 +-
 4 files changed, 157 insertions(+), 57 deletions(-)

Comments

Greg Kroah-Hartman Nov. 24, 2023, 3:05 p.m. UTC | #1
On Fri, Nov 24, 2023 at 03:53:35PM +0100, Oleksij Rempel wrote:
> Hi,
> 
> This patch series introduces support for prioritized device shutdown.
> The main goal is to enable prioritization for shutting down specific
> devices, particularly crucial in scenarios like power loss where
> hardware damage can occur if not handled properly.

Oh fun, now we will have drivers and subsystems fighting over their
priority, with each one insisting that they are the most important!

/s

Anyway, this is ripe for problems and issues in the long-run, what is so
special about this hardware that it can not just shutdown in the
existing order that it has to be "first" over everyone else?  What
exactly does this prevent and what devices are requiring this?

And most importantly, what has changed in the past 20+ years to
suddenly require this new functionality and how does any other operating
system handle it?

thanks,

greg k-h
Mark Brown Nov. 24, 2023, 3:21 p.m. UTC | #2
On Fri, Nov 24, 2023 at 03:05:47PM +0000, Greg Kroah-Hartman wrote:

> Anyway, this is ripe for problems and issues in the long-run, what is so
> special about this hardware that it can not just shutdown in the
> existing order that it has to be "first" over everyone else?  What
> exactly does this prevent and what devices are requiring this?

> And most importantly, what has changed in the past 20+ years to
> suddenly require this new functionality and how does any other operating
> system handle it?

This came out of some discussions about trying to handle emergency power
failure notifications.
Greg Kroah-Hartman Nov. 24, 2023, 3:27 p.m. UTC | #3
On Fri, Nov 24, 2023 at 03:21:40PM +0000, Mark Brown wrote:
> On Fri, Nov 24, 2023 at 03:05:47PM +0000, Greg Kroah-Hartman wrote:
> 
> > Anyway, this is ripe for problems and issues in the long-run, what is so
> > special about this hardware that it can not just shutdown in the
> > existing order that it has to be "first" over everyone else?  What
> > exactly does this prevent and what devices are requiring this?
> 
> > And most importantly, what has changed in the past 20+ years to
> > suddenly require this new functionality and how does any other operating
> > system handle it?
> 
> This came out of some discussions about trying to handle emergency power
> failure notifications.

I'm sorry, but I don't know what that means.  Are you saying that the
kernel is now going to try to provide a hard guarantee that some devices
are going to be shut down in X number of seconds when asked?  If so, why
not do this in userspace?

thanks,

greg k-h
Mark Brown Nov. 24, 2023, 3:49 p.m. UTC | #4
On Fri, Nov 24, 2023 at 03:27:48PM +0000, Greg Kroah-Hartman wrote:
> On Fri, Nov 24, 2023 at 03:21:40PM +0000, Mark Brown wrote:

> > This came out of some discussions about trying to handle emergency power
> > failure notifications.

> I'm sorry, but I don't know what that means.  Are you saying that the
> kernel is now going to try to provide a hard guarantee that some devices
> are going to be shut down in X number of seconds when asked?  If so, why
> not do this in userspace?

No, it was initially (or when I initially saw it anyway) handling of
notifications from regulators that they're in trouble and we have some
small amount of time to do anything we might want to do about it before
we expire.
Greg Kroah-Hartman Nov. 24, 2023, 3:56 p.m. UTC | #5
On Fri, Nov 24, 2023 at 03:49:46PM +0000, Mark Brown wrote:
> On Fri, Nov 24, 2023 at 03:27:48PM +0000, Greg Kroah-Hartman wrote:
> > On Fri, Nov 24, 2023 at 03:21:40PM +0000, Mark Brown wrote:
> 
> > > This came out of some discussions about trying to handle emergency power
> > > failure notifications.
> 
> > I'm sorry, but I don't know what that means.  Are you saying that the
> > kernel is now going to try to provide a hard guarantee that some devices
> > are going to be shut down in X number of seconds when asked?  If so, why
> > not do this in userspace?
> 
> No, it was initially (or when I initially saw it anyway) handling of
> notifications from regulators that they're in trouble and we have some
> small amount of time to do anything we might want to do about it before
> we expire.

So we are going to guarantee a "time" in which we are going to do
something?  Again, if that's required, why not do it in userspace using
a RT kernel?

thanks,

greg k-h
Oleksij Rempel Nov. 24, 2023, 4:32 p.m. UTC | #6
On Fri, Nov 24, 2023 at 03:56:19PM +0000, Greg Kroah-Hartman wrote:
> On Fri, Nov 24, 2023 at 03:49:46PM +0000, Mark Brown wrote:
> > On Fri, Nov 24, 2023 at 03:27:48PM +0000, Greg Kroah-Hartman wrote:
> > > On Fri, Nov 24, 2023 at 03:21:40PM +0000, Mark Brown wrote:
> > 
> > > > This came out of some discussions about trying to handle emergency power
> > > > failure notifications.
> > 
> > > I'm sorry, but I don't know what that means.  Are you saying that the
> > > kernel is now going to try to provide a hard guarantee that some devices
> > > are going to be shut down in X number of seconds when asked?  If so, why
> > > not do this in userspace?
> > 
> > No, it was initially (or when I initially saw it anyway) handling of
> > notifications from regulators that they're in trouble and we have some
> > small amount of time to do anything we might want to do about it before
> > we expire.
> 
> So we are going to guarantee a "time" in which we are going to do
> something?  Again, if that's required, why not do it in userspace using
> a RT kernel?

For the HW in question I have only 100ms time before power loss. By
doing it over use space some we will have even less time to react.

In fact, this is not a new requirement. It exist on different flavors of
automotive Linux for about 10 years. Linux in cars should be able to
handle voltage drops for example on ignition and so on. The only new thing is
the attempt to mainline it.

Regards,
Oleksij
Greg Kroah-Hartman Nov. 24, 2023, 5:26 p.m. UTC | #7
On Fri, Nov 24, 2023 at 05:32:34PM +0100, Oleksij Rempel wrote:
> On Fri, Nov 24, 2023 at 03:56:19PM +0000, Greg Kroah-Hartman wrote:
> > On Fri, Nov 24, 2023 at 03:49:46PM +0000, Mark Brown wrote:
> > > On Fri, Nov 24, 2023 at 03:27:48PM +0000, Greg Kroah-Hartman wrote:
> > > > On Fri, Nov 24, 2023 at 03:21:40PM +0000, Mark Brown wrote:
> > > 
> > > > > This came out of some discussions about trying to handle emergency power
> > > > > failure notifications.
> > > 
> > > > I'm sorry, but I don't know what that means.  Are you saying that the
> > > > kernel is now going to try to provide a hard guarantee that some devices
> > > > are going to be shut down in X number of seconds when asked?  If so, why
> > > > not do this in userspace?
> > > 
> > > No, it was initially (or when I initially saw it anyway) handling of
> > > notifications from regulators that they're in trouble and we have some
> > > small amount of time to do anything we might want to do about it before
> > > we expire.
> > 
> > So we are going to guarantee a "time" in which we are going to do
> > something?  Again, if that's required, why not do it in userspace using
> > a RT kernel?
> 
> For the HW in question I have only 100ms time before power loss. By
> doing it over use space some we will have even less time to react.

Why can't userspace react that fast?  Why will the kernel be somehow
faster?  Speed should be the same, just get the "power is cut" signal
and have userspace flush and unmount the disk before power is gone.  Why
can the kernel do this any differently?

> In fact, this is not a new requirement. It exist on different flavors of
> automotive Linux for about 10 years. Linux in cars should be able to
> handle voltage drops for example on ignition and so on. The only new thing is
> the attempt to mainline it.

But your patch is not guaranteeing anything, it's just doing a "I want
this done before the other devices are handled", that's it.  There is no
chance that 100ms is going to be a requirement, or that some other
device type is not going to come along and demand to be ahead of your
device in the list.

So you are going to have a constant fight among device types over the
years, and people complaining that the kernel is now somehow going to
guarantee that a device is shutdown in a set amount of time, which
again, the kernel can not guarantee here.

This might work as a one-off for a specific hardware platform, which is
odd, but not anything you really should be adding for anyone else to use
here as your reasoning for it does not reflect what the code does.

thanks,

greg k-h
Oleksij Rempel Nov. 24, 2023, 6:57 p.m. UTC | #8
On Fri, Nov 24, 2023 at 05:26:30PM +0000, Greg Kroah-Hartman wrote:
> On Fri, Nov 24, 2023 at 05:32:34PM +0100, Oleksij Rempel wrote:
> > On Fri, Nov 24, 2023 at 03:56:19PM +0000, Greg Kroah-Hartman wrote:
> > > On Fri, Nov 24, 2023 at 03:49:46PM +0000, Mark Brown wrote:
> > > > On Fri, Nov 24, 2023 at 03:27:48PM +0000, Greg Kroah-Hartman wrote:
> > > > > On Fri, Nov 24, 2023 at 03:21:40PM +0000, Mark Brown wrote:
> > > > 
> > > > > > This came out of some discussions about trying to handle emergency power
> > > > > > failure notifications.
> > > > 
> > > > > I'm sorry, but I don't know what that means.  Are you saying that the
> > > > > kernel is now going to try to provide a hard guarantee that some devices
> > > > > are going to be shut down in X number of seconds when asked?  If so, why
> > > > > not do this in userspace?
> > > > 
> > > > No, it was initially (or when I initially saw it anyway) handling of
> > > > notifications from regulators that they're in trouble and we have some
> > > > small amount of time to do anything we might want to do about it before
> > > > we expire.
> > > 
> > > So we are going to guarantee a "time" in which we are going to do
> > > something?  Again, if that's required, why not do it in userspace using
> > > a RT kernel?
> > 
> > For the HW in question I have only 100ms time before power loss. By
> > doing it over use space some we will have even less time to react.
> 
> Why can't userspace react that fast?  Why will the kernel be somehow
> faster?  Speed should be the same, just get the "power is cut" signal
> and have userspace flush and unmount the disk before power is gone.  Why
> can the kernel do this any differently?
> 
> > In fact, this is not a new requirement. It exist on different flavors of
> > automotive Linux for about 10 years. Linux in cars should be able to
> > handle voltage drops for example on ignition and so on. The only new thing is
> > the attempt to mainline it.
> 
> But your patch is not guaranteeing anything, it's just doing a "I want
> this done before the other devices are handled", that's it.  There is no
> chance that 100ms is going to be a requirement, or that some other
> device type is not going to come along and demand to be ahead of your
> device in the list.
> 
> So you are going to have a constant fight among device types over the
> years, and people complaining that the kernel is now somehow going to
> guarantee that a device is shutdown in a set amount of time, which
> again, the kernel can not guarantee here.
> 
> This might work as a one-off for a specific hardware platform, which is
> odd, but not anything you really should be adding for anyone else to use
> here as your reasoning for it does not reflect what the code does.

I see. Good point.

In my case umount is not needed, there is not enough time to write down
the data. We should send a shutdown command to the eMMC ASAP.

@Ulf, are there a way request mmc shutdown from user space?
If I see it correctly, sysfs-devices-power-control support only "auto" and
"on". Unbinding the module will not execute MMC shutdown notification.
If user space is the way to go, do sysfs-devices-power-control "off"
command will be acceptable?

The other option I have is to add a regulator event handler to the MMC
framework and do shutdown notification on under-voltage event.

Are there other options?

Regards,
Oleksij
Greg Kroah-Hartman Nov. 25, 2023, 6:51 a.m. UTC | #9
On Fri, Nov 24, 2023 at 07:57:25PM +0100, Oleksij Rempel wrote:
> On Fri, Nov 24, 2023 at 05:26:30PM +0000, Greg Kroah-Hartman wrote:
> > On Fri, Nov 24, 2023 at 05:32:34PM +0100, Oleksij Rempel wrote:
> > > On Fri, Nov 24, 2023 at 03:56:19PM +0000, Greg Kroah-Hartman wrote:
> > > > On Fri, Nov 24, 2023 at 03:49:46PM +0000, Mark Brown wrote:
> > > > > On Fri, Nov 24, 2023 at 03:27:48PM +0000, Greg Kroah-Hartman wrote:
> > > > > > On Fri, Nov 24, 2023 at 03:21:40PM +0000, Mark Brown wrote:
> > > > > 
> > > > > > > This came out of some discussions about trying to handle emergency power
> > > > > > > failure notifications.
> > > > > 
> > > > > > I'm sorry, but I don't know what that means.  Are you saying that the
> > > > > > kernel is now going to try to provide a hard guarantee that some devices
> > > > > > are going to be shut down in X number of seconds when asked?  If so, why
> > > > > > not do this in userspace?
> > > > > 
> > > > > No, it was initially (or when I initially saw it anyway) handling of
> > > > > notifications from regulators that they're in trouble and we have some
> > > > > small amount of time to do anything we might want to do about it before
> > > > > we expire.
> > > > 
> > > > So we are going to guarantee a "time" in which we are going to do
> > > > something?  Again, if that's required, why not do it in userspace using
> > > > a RT kernel?
> > > 
> > > For the HW in question I have only 100ms time before power loss. By
> > > doing it over use space some we will have even less time to react.
> > 
> > Why can't userspace react that fast?  Why will the kernel be somehow
> > faster?  Speed should be the same, just get the "power is cut" signal
> > and have userspace flush and unmount the disk before power is gone.  Why
> > can the kernel do this any differently?
> > 
> > > In fact, this is not a new requirement. It exist on different flavors of
> > > automotive Linux for about 10 years. Linux in cars should be able to
> > > handle voltage drops for example on ignition and so on. The only new thing is
> > > the attempt to mainline it.
> > 
> > But your patch is not guaranteeing anything, it's just doing a "I want
> > this done before the other devices are handled", that's it.  There is no
> > chance that 100ms is going to be a requirement, or that some other
> > device type is not going to come along and demand to be ahead of your
> > device in the list.
> > 
> > So you are going to have a constant fight among device types over the
> > years, and people complaining that the kernel is now somehow going to
> > guarantee that a device is shutdown in a set amount of time, which
> > again, the kernel can not guarantee here.
> > 
> > This might work as a one-off for a specific hardware platform, which is
> > odd, but not anything you really should be adding for anyone else to use
> > here as your reasoning for it does not reflect what the code does.
> 
> I see. Good point.
> 
> In my case umount is not needed, there is not enough time to write down
> the data. We should send a shutdown command to the eMMC ASAP.

If you don't care about the data, why is a shutdown command to the
hardware needed?  What does that do that makes anything "safe" if your
data is lost.

thanks,

greg k-h
Oleksij Rempel Nov. 25, 2023, 8:50 a.m. UTC | #10
On Sat, Nov 25, 2023 at 06:51:55AM +0000, Greg Kroah-Hartman wrote:
> On Fri, Nov 24, 2023 at 07:57:25PM +0100, Oleksij Rempel wrote:
> > On Fri, Nov 24, 2023 at 05:26:30PM +0000, Greg Kroah-Hartman wrote:
> > > On Fri, Nov 24, 2023 at 05:32:34PM +0100, Oleksij Rempel wrote:
> > > > On Fri, Nov 24, 2023 at 03:56:19PM +0000, Greg Kroah-Hartman wrote:
> > > > > On Fri, Nov 24, 2023 at 03:49:46PM +0000, Mark Brown wrote:
> > > > > > On Fri, Nov 24, 2023 at 03:27:48PM +0000, Greg Kroah-Hartman wrote:
> > > > > > > On Fri, Nov 24, 2023 at 03:21:40PM +0000, Mark Brown wrote:
> > > > > > 
> > > > > > > > This came out of some discussions about trying to handle emergency power
> > > > > > > > failure notifications.
> > > > > > 
> > > > > > > I'm sorry, but I don't know what that means.  Are you saying that the
> > > > > > > kernel is now going to try to provide a hard guarantee that some devices
> > > > > > > are going to be shut down in X number of seconds when asked?  If so, why
> > > > > > > not do this in userspace?
> > > > > > 
> > > > > > No, it was initially (or when I initially saw it anyway) handling of
> > > > > > notifications from regulators that they're in trouble and we have some
> > > > > > small amount of time to do anything we might want to do about it before
> > > > > > we expire.
> > > > > 
> > > > > So we are going to guarantee a "time" in which we are going to do
> > > > > something?  Again, if that's required, why not do it in userspace using
> > > > > a RT kernel?
> > > > 
> > > > For the HW in question I have only 100ms time before power loss. By
> > > > doing it over use space some we will have even less time to react.
> > > 
> > > Why can't userspace react that fast?  Why will the kernel be somehow
> > > faster?  Speed should be the same, just get the "power is cut" signal
> > > and have userspace flush and unmount the disk before power is gone.  Why
> > > can the kernel do this any differently?
> > > 
> > > > In fact, this is not a new requirement. It exist on different flavors of
> > > > automotive Linux for about 10 years. Linux in cars should be able to
> > > > handle voltage drops for example on ignition and so on. The only new thing is
> > > > the attempt to mainline it.
> > > 
> > > But your patch is not guaranteeing anything, it's just doing a "I want
> > > this done before the other devices are handled", that's it.  There is no
> > > chance that 100ms is going to be a requirement, or that some other
> > > device type is not going to come along and demand to be ahead of your
> > > device in the list.
> > > 
> > > So you are going to have a constant fight among device types over the
> > > years, and people complaining that the kernel is now somehow going to
> > > guarantee that a device is shutdown in a set amount of time, which
> > > again, the kernel can not guarantee here.
> > > 
> > > This might work as a one-off for a specific hardware platform, which is
> > > odd, but not anything you really should be adding for anyone else to use
> > > here as your reasoning for it does not reflect what the code does.
> > 
> > I see. Good point.
> > 
> > In my case umount is not needed, there is not enough time to write down
> > the data. We should send a shutdown command to the eMMC ASAP.
> 
> If you don't care about the data, why is a shutdown command to the
> hardware needed?  What does that do that makes anything "safe" if your
> data is lost.

It prevents HW damage. In a typical automotive under-voltage labor it is
usually possible to reproduce X amount of bricked eMMCs or NANDs on Y
amount of under-voltage cycles (I do not have exact numbers right now).
Even if the numbers not so high in the labor tests (sometimes something
like one bricked device in a month of tests), the field returns are
significant enough to care about software solution for this problem.

Same problem was seen not only in automotive devices, but also in
industrial or agricultural. With other words, it is important enough to bring
some kind of solution mainline.
Greg Kroah-Hartman Nov. 25, 2023, 9:09 a.m. UTC | #11
On Sat, Nov 25, 2023 at 09:50:38AM +0100, Oleksij Rempel wrote:
> On Sat, Nov 25, 2023 at 06:51:55AM +0000, Greg Kroah-Hartman wrote:
> > On Fri, Nov 24, 2023 at 07:57:25PM +0100, Oleksij Rempel wrote:
> > > On Fri, Nov 24, 2023 at 05:26:30PM +0000, Greg Kroah-Hartman wrote:
> > > > On Fri, Nov 24, 2023 at 05:32:34PM +0100, Oleksij Rempel wrote:
> > > > > On Fri, Nov 24, 2023 at 03:56:19PM +0000, Greg Kroah-Hartman wrote:
> > > > > > On Fri, Nov 24, 2023 at 03:49:46PM +0000, Mark Brown wrote:
> > > > > > > On Fri, Nov 24, 2023 at 03:27:48PM +0000, Greg Kroah-Hartman wrote:
> > > > > > > > On Fri, Nov 24, 2023 at 03:21:40PM +0000, Mark Brown wrote:
> > > > > > > 
> > > > > > > > > This came out of some discussions about trying to handle emergency power
> > > > > > > > > failure notifications.
> > > > > > > 
> > > > > > > > I'm sorry, but I don't know what that means.  Are you saying that the
> > > > > > > > kernel is now going to try to provide a hard guarantee that some devices
> > > > > > > > are going to be shut down in X number of seconds when asked?  If so, why
> > > > > > > > not do this in userspace?
> > > > > > > 
> > > > > > > No, it was initially (or when I initially saw it anyway) handling of
> > > > > > > notifications from regulators that they're in trouble and we have some
> > > > > > > small amount of time to do anything we might want to do about it before
> > > > > > > we expire.
> > > > > > 
> > > > > > So we are going to guarantee a "time" in which we are going to do
> > > > > > something?  Again, if that's required, why not do it in userspace using
> > > > > > a RT kernel?
> > > > > 
> > > > > For the HW in question I have only 100ms time before power loss. By
> > > > > doing it over use space some we will have even less time to react.
> > > > 
> > > > Why can't userspace react that fast?  Why will the kernel be somehow
> > > > faster?  Speed should be the same, just get the "power is cut" signal
> > > > and have userspace flush and unmount the disk before power is gone.  Why
> > > > can the kernel do this any differently?
> > > > 
> > > > > In fact, this is not a new requirement. It exist on different flavors of
> > > > > automotive Linux for about 10 years. Linux in cars should be able to
> > > > > handle voltage drops for example on ignition and so on. The only new thing is
> > > > > the attempt to mainline it.
> > > > 
> > > > But your patch is not guaranteeing anything, it's just doing a "I want
> > > > this done before the other devices are handled", that's it.  There is no
> > > > chance that 100ms is going to be a requirement, or that some other
> > > > device type is not going to come along and demand to be ahead of your
> > > > device in the list.
> > > > 
> > > > So you are going to have a constant fight among device types over the
> > > > years, and people complaining that the kernel is now somehow going to
> > > > guarantee that a device is shutdown in a set amount of time, which
> > > > again, the kernel can not guarantee here.
> > > > 
> > > > This might work as a one-off for a specific hardware platform, which is
> > > > odd, but not anything you really should be adding for anyone else to use
> > > > here as your reasoning for it does not reflect what the code does.
> > > 
> > > I see. Good point.
> > > 
> > > In my case umount is not needed, there is not enough time to write down
> > > the data. We should send a shutdown command to the eMMC ASAP.
> > 
> > If you don't care about the data, why is a shutdown command to the
> > hardware needed?  What does that do that makes anything "safe" if your
> > data is lost.
> 
> It prevents HW damage. In a typical automotive under-voltage labor it is
> usually possible to reproduce X amount of bricked eMMCs or NANDs on Y
> amount of under-voltage cycles (I do not have exact numbers right now).
> Even if the numbers not so high in the labor tests (sometimes something
> like one bricked device in a month of tests), the field returns are
> significant enough to care about software solution for this problem.

So hardware is attempting to rely on software in order to prevent the
destruction of that same hardware?  Surely hardware designers aren't
that crazy, right?  (rhetorical question, I know...)

> Same problem was seen not only in automotive devices, but also in
> industrial or agricultural. With other words, it is important enough to bring
> some kind of solution mainline.

But you are not providing a real solution here, only a "I am going to
attempt to shut down a specific type of device before the others, there
are no time or ordering guarantees here, so good luck!" solution.

And again, how are you going to prevent the in-fighting of all device
types to be "first" in the list?

thanks,

greg k-h
Mark Brown Nov. 25, 2023, 10:30 a.m. UTC | #12
On Sat, Nov 25, 2023 at 09:09:01AM +0000, Greg Kroah-Hartman wrote:
> On Sat, Nov 25, 2023 at 09:50:38AM +0100, Oleksij Rempel wrote:

> > It prevents HW damage. In a typical automotive under-voltage labor it is
> > usually possible to reproduce X amount of bricked eMMCs or NANDs on Y
> > amount of under-voltage cycles (I do not have exact numbers right now).
> > Even if the numbers not so high in the labor tests (sometimes something
> > like one bricked device in a month of tests), the field returns are
> > significant enough to care about software solution for this problem.

> So hardware is attempting to rely on software in order to prevent the
> destruction of that same hardware?  Surely hardware designers aren't
> that crazy, right?  (rhetorical question, I know...)

Surely software people aren't going to make no effort to integrate with
the notification features that the hardware engineers have so helpfully
provided us with?

> > Same problem was seen not only in automotive devices, but also in
> > industrial or agricultural. With other words, it is important enough to bring
> > some kind of solution mainline.

> But you are not providing a real solution here, only a "I am going to
> attempt to shut down a specific type of device before the others, there
> are no time or ordering guarantees here, so good luck!" solution.

I'm not sure there are great solutions here, the system integrators are
constrained by the what the application appropriate silicon that's on
the market is capable of, the siicon is constrained by the area costs of
dealing with corner cases for system robustness and how much of the
market cares about fixing these issues and software is constrained by
what hardware ends up being built.  Everyone's just got to try their
best with the reality they're confronted with, hopefully what's possible
will improve with time.

> And again, how are you going to prevent the in-fighting of all device
> types to be "first" in the list?

It doesn't seem like the most complex integration challenge we've ever
had to deal with TBH.
Greg Kroah-Hartman Nov. 25, 2023, 2:35 p.m. UTC | #13
On Sat, Nov 25, 2023 at 10:30:42AM +0000, Mark Brown wrote:
> On Sat, Nov 25, 2023 at 09:09:01AM +0000, Greg Kroah-Hartman wrote:
> > On Sat, Nov 25, 2023 at 09:50:38AM +0100, Oleksij Rempel wrote:
> 
> > > It prevents HW damage. In a typical automotive under-voltage labor it is
> > > usually possible to reproduce X amount of bricked eMMCs or NANDs on Y
> > > amount of under-voltage cycles (I do not have exact numbers right now).
> > > Even if the numbers not so high in the labor tests (sometimes something
> > > like one bricked device in a month of tests), the field returns are
> > > significant enough to care about software solution for this problem.
> 
> > So hardware is attempting to rely on software in order to prevent the
> > destruction of that same hardware?  Surely hardware designers aren't
> > that crazy, right?  (rhetorical question, I know...)
> 
> Surely software people aren't going to make no effort to integrate with
> the notification features that the hardware engineers have so helpfully
> provided us with?

That would be great, but I don't see that here, do you?  All I see is
the shutdown sequence changing because someone wants it to go "faster"
with the threat of hardware breaking if we don't meet that "faster"
number, yet no knowledge or guarantee that this number can ever be known
or happen.

> > > Same problem was seen not only in automotive devices, but also in
> > > industrial or agricultural. With other words, it is important enough to bring
> > > some kind of solution mainline.
> 
> > But you are not providing a real solution here, only a "I am going to
> > attempt to shut down a specific type of device before the others, there
> > are no time or ordering guarantees here, so good luck!" solution.
> 
> I'm not sure there are great solutions here, the system integrators are
> constrained by the what the application appropriate silicon that's on
> the market is capable of, the siicon is constrained by the area costs of
> dealing with corner cases for system robustness and how much of the
> market cares about fixing these issues and software is constrained by
> what hardware ends up being built.  Everyone's just got to try their
> best with the reality they're confronted with, hopefully what's possible
> will improve with time.

Agreed, but I don't think this patch is going to actually work properly
over time as there is no time values involved :)

> > And again, how are you going to prevent the in-fighting of all device
> > types to be "first" in the list?
> 
> It doesn't seem like the most complex integration challenge we've ever
> had to deal with TBH.

True, but we all know how this grows and thinking about how to handle it
now is key for this to be acceptable.

thanks,

greg k-h
Mark Brown Nov. 25, 2023, 3:43 p.m. UTC | #14
On Sat, Nov 25, 2023 at 02:35:41PM +0000, Greg Kroah-Hartman wrote:
> On Sat, Nov 25, 2023 at 10:30:42AM +0000, Mark Brown wrote:
> > On Sat, Nov 25, 2023 at 09:09:01AM +0000, Greg Kroah-Hartman wrote:

> > > So hardware is attempting to rely on software in order to prevent the
> > > destruction of that same hardware?  Surely hardware designers aren't
> > > that crazy, right?  (rhetorical question, I know...)

> > Surely software people aren't going to make no effort to integrate with
> > the notification features that the hardware engineers have so helpfully
> > provided us with?

> That would be great, but I don't see that here, do you?  All I see is
> the shutdown sequence changing because someone wants it to go "faster"
> with the threat of hardware breaking if we don't meet that "faster"
> number, yet no knowledge or guarantee that this number can ever be known
> or happen.

The idea was to have somewhere to send notifications when the hardware
starts reporting things like power supplies starting to fail.  We do
have those from hardware, we just don't do anything terribly useful
with them yet.

TBH it does seem reasonable that there will be systems that can usefully
detect these issues but hasn't got a detailed characterisation of
exactly how long you've got before things expire, it's also likely that
the actual bound is going to be highly variable depending on what the
system is up to at the point of detection.  It's quite likely that we'd
only get a worst case bound so it's also likely that we'd have more time
in practice than in spec.  I'd expect characterisation that does happen
to be very system specific at this point, I don't think we can rely on
getting that information.  I'd certainly expect that we have vastly more
systems can usefully detect issues than systems where we have firm
numbers.

> > > > Same problem was seen not only in automotive devices, but also in
> > > > industrial or agricultural. With other words, it is important enough to bring
> > > > some kind of solution mainline.

> > > But you are not providing a real solution here, only a "I am going to
> > > attempt to shut down a specific type of device before the others, there
> > > are no time or ordering guarantees here, so good luck!" solution.

> > I'm not sure there are great solutions here, the system integrators are
> > constrained by the what the application appropriate silicon that's on
> > the market is capable of, the siicon is constrained by the area costs of
> > dealing with corner cases for system robustness and how much of the
> > market cares about fixing these issues and software is constrained by
> > what hardware ends up being built.  Everyone's just got to try their
> > best with the reality they're confronted with, hopefully what's possible
> > will improve with time.

> Agreed, but I don't think this patch is going to actually work properly
> over time as there is no time values involved :)

This seems to be more into the area of mitigation than firm solution, I
suspect users will be pleased if they can make a noticable dent in the
number of failures they're seeing.

> > > And again, how are you going to prevent the in-fighting of all device
> > > types to be "first" in the list?

> > It doesn't seem like the most complex integration challenge we've ever
> > had to deal with TBH.

> True, but we all know how this grows and thinking about how to handle it
> now is key for this to be acceptable.

It feels like if we're concerned about mitigating physical damage during
the process of power failure that's a very limited set of devices - the
storage case where we're in the middle of writing to flash or whatever
is the most obvious case.
Greg Kroah-Hartman Nov. 25, 2023, 7:58 p.m. UTC | #15
On Sat, Nov 25, 2023 at 03:43:02PM +0000, Mark Brown wrote:
> On Sat, Nov 25, 2023 at 02:35:41PM +0000, Greg Kroah-Hartman wrote:
> > On Sat, Nov 25, 2023 at 10:30:42AM +0000, Mark Brown wrote:
> > > On Sat, Nov 25, 2023 at 09:09:01AM +0000, Greg Kroah-Hartman wrote:
> 
> > > > So hardware is attempting to rely on software in order to prevent the
> > > > destruction of that same hardware?  Surely hardware designers aren't
> > > > that crazy, right?  (rhetorical question, I know...)
> 
> > > Surely software people aren't going to make no effort to integrate with
> > > the notification features that the hardware engineers have so helpfully
> > > provided us with?
> 
> > That would be great, but I don't see that here, do you?  All I see is
> > the shutdown sequence changing because someone wants it to go "faster"
> > with the threat of hardware breaking if we don't meet that "faster"
> > number, yet no knowledge or guarantee that this number can ever be known
> > or happen.
> 
> The idea was to have somewhere to send notifications when the hardware
> starts reporting things like power supplies starting to fail.  We do
> have those from hardware, we just don't do anything terribly useful
> with them yet.

Ok, but that's not what I recall this patchset doing, or did I missing
something?  All I saw was a "reorder the shutdown sequence" set of
changes.  Or at least that's all I remember at this point in time,
sorry, it's been a few days, but at least that lines up with what the
Subject line says above :)

> TBH it does seem reasonable that there will be systems that can usefully
> detect these issues but hasn't got a detailed characterisation of
> exactly how long you've got before things expire, it's also likely that
> the actual bound is going to be highly variable depending on what the
> system is up to at the point of detection.  It's quite likely that we'd
> only get a worst case bound so it's also likely that we'd have more time
> in practice than in spec.  I'd expect characterisation that does happen
> to be very system specific at this point, I don't think we can rely on
> getting that information.  I'd certainly expect that we have vastly more
> systems can usefully detect issues than systems where we have firm
> numbers.

Sure, that all sounds good, but again, I don't think that's what is
happening here.

> > > > > Same problem was seen not only in automotive devices, but also in
> > > > > industrial or agricultural. With other words, it is important enough to bring
> > > > > some kind of solution mainline.
> 
> > > > But you are not providing a real solution here, only a "I am going to
> > > > attempt to shut down a specific type of device before the others, there
> > > > are no time or ordering guarantees here, so good luck!" solution.
> 
> > > I'm not sure there are great solutions here, the system integrators are
> > > constrained by the what the application appropriate silicon that's on
> > > the market is capable of, the siicon is constrained by the area costs of
> > > dealing with corner cases for system robustness and how much of the
> > > market cares about fixing these issues and software is constrained by
> > > what hardware ends up being built.  Everyone's just got to try their
> > > best with the reality they're confronted with, hopefully what's possible
> > > will improve with time.

Note, if you attempt to mitigate broken hardware with software fixes,
hardware will never get unbroken as it never needs to change.  Push back
on this, it's the only real way forward here.  I know it's not always
possible, but the number of times I have heard hardware engineers say
"but no one ever told us that was broken/impossible/whatever, we just
assumed software could handle it" is uncountable.

> > Agreed, but I don't think this patch is going to actually work properly
> > over time as there is no time values involved :)
> 
> This seems to be more into the area of mitigation than firm solution, I
> suspect users will be pleased if they can make a noticable dent in the
> number of failures they're seeing.

Mitigation is good, but this patch series is just a hack by doing "throw
this device type at the front of the shutdown list because we have
hardware that crashes a lot" :)

> > > > And again, how are you going to prevent the in-fighting of all device
> > > > types to be "first" in the list?
> 
> > > It doesn't seem like the most complex integration challenge we've ever
> > > had to deal with TBH.
> 
> > True, but we all know how this grows and thinking about how to handle it
> > now is key for this to be acceptable.
> 
> It feels like if we're concerned about mitigating physical damage during
> the process of power failure that's a very limited set of devices - the
> storage case where we're in the middle of writing to flash or whatever
> is the most obvious case.

Then why isn't userspace handling this?  This is a policy decision that
it needs to take to properly know what hardware needs to be shut down,
and what needs to happen in order to do that (i.e. flush, unmount,
etc.?)  And userspace today should be able to say, "power down this
device now!" for any device in the system based on the sysfs device
tree, or at the very least, force it to a specific power state.  So why
not handle this policy there?

thanks,

greg k-h
Mark Brown Nov. 26, 2023, 10:14 a.m. UTC | #16
On Sat, Nov 25, 2023 at 07:58:12PM +0000, Greg Kroah-Hartman wrote:
> On Sat, Nov 25, 2023 at 03:43:02PM +0000, Mark Brown wrote:
> > On Sat, Nov 25, 2023 at 02:35:41PM +0000, Greg Kroah-Hartman wrote:

> > > That would be great, but I don't see that here, do you?  All I see is
> > > the shutdown sequence changing because someone wants it to go "faster"
> > > with the threat of hardware breaking if we don't meet that "faster"
> > > number, yet no knowledge or guarantee that this number can ever be known
> > > or happen.

> > The idea was to have somewhere to send notifications when the hardware
> > starts reporting things like power supplies starting to fail.  We do
> > have those from hardware, we just don't do anything terribly useful
> > with them yet.

> Ok, but that's not what I recall this patchset doing, or did I missing
> something?  All I saw was a "reorder the shutdown sequence" set of
> changes.  Or at least that's all I remember at this point in time,
> sorry, it's been a few days, but at least that lines up with what the
> Subject line says above :)

That's not in the series, a bunch of it is merged in some form (eg, see
hw_protection_shutdown()) and more of it would need to be built on top
if this were merged.

> > > Agreed, but I don't think this patch is going to actually work properly
> > > over time as there is no time values involved :)

> > This seems to be more into the area of mitigation than firm solution, I
> > suspect users will be pleased if they can make a noticable dent in the
> > number of failures they're seeing.

> Mitigation is good, but this patch series is just a hack by doing "throw
> this device type at the front of the shutdown list because we have
> hardware that crashes a lot" :)

Sounds like a mitigation to me.

> > It feels like if we're concerned about mitigating physical damage during
> > the process of power failure that's a very limited set of devices - the
> > storage case where we're in the middle of writing to flash or whatever
> > is the most obvious case.

> Then why isn't userspace handling this?  This is a policy decision that
> it needs to take to properly know what hardware needs to be shut down,
> and what needs to happen in order to do that (i.e. flush, unmount,
> etc.?)  And userspace today should be able to say, "power down this
> device now!" for any device in the system based on the sysfs device
> tree, or at the very least, force it to a specific power state.  So why
> not handle this policy there?

Given the tight timelines it does seem reasonable to have some of this
in the kernel - the specific decisions about how to handle these events
can always be controlled from userspace (eg, with a sysfs file like we
do for autosuspend delay times which seem to be in a similar ballpark).
Oleksij Rempel Nov. 26, 2023, 7:31 p.m. UTC | #17
On Sun, Nov 26, 2023 at 10:14:45AM +0000, Mark Brown wrote:
> On Sat, Nov 25, 2023 at 07:58:12PM +0000, Greg Kroah-Hartman wrote:
> > On Sat, Nov 25, 2023 at 03:43:02PM +0000, Mark Brown wrote:
> > > On Sat, Nov 25, 2023 at 02:35:41PM +0000, Greg Kroah-Hartman wrote:
> 
> > > > That would be great, but I don't see that here, do you?  All I see is
> > > > the shutdown sequence changing because someone wants it to go "faster"
> > > > with the threat of hardware breaking if we don't meet that "faster"
> > > > number, yet no knowledge or guarantee that this number can ever be known
> > > > or happen.
> 
> > > The idea was to have somewhere to send notifications when the hardware
> > > starts reporting things like power supplies starting to fail.  We do
> > > have those from hardware, we just don't do anything terribly useful
> > > with them yet.
> 
> > Ok, but that's not what I recall this patchset doing, or did I missing
> > something?  All I saw was a "reorder the shutdown sequence" set of
> > changes.  Or at least that's all I remember at this point in time,
> > sorry, it's been a few days, but at least that lines up with what the
> > Subject line says above :)
> 
> That's not in the series, a bunch of it is merged in some form (eg, see
> hw_protection_shutdown()) and more of it would need to be built on top
> if this were merged.

The current kernel has enough infrastructure to manage essential functions
related to hardware protection:
- The Device Tree specifies the source of interrupts for detecting
  under-voltage events. It also details critical system regulators and some
  of specification of backup power supplied by the board.
- Various frameworks within the kernel can identify critical hardware
  conditions like over-temperature and under-voltage. Upon detection, these
  frameworks invoke the hw_protection_shutdown() function.

> > > > Agreed, but I don't think this patch is going to actually work properly
> > > > over time as there is no time values involved :)

If we're to implement a deadline for each shutdown call (as the requirement for
"time values" suggests?), then prioritization becomes essential. Without
establishing a shutdown order, the inclusion of time values might not be
effectively utilized.  Am I overlooking anything in this regard?

> > > This seems to be more into the area of mitigation than firm solution, I
> > > suspect users will be pleased if they can make a noticable dent in the
> > > number of failures they're seeing.
>
> > Mitigation is good, but this patch series is just a hack by doing "throw
> > this device type at the front of the shutdown list because we have
> > hardware that crashes a lot" :)

The root of the issue seems to be the choice of primary storage device.

All storage technologies - HDD, SSD, eMMC, NAND - are vulnerable to power
loss. The only foolproof safeguard is a backup power source, but this
introduces its own set of challenges:

1. Batteries: While they provide a backup, they come with limitations like a
finite number of charge cycles, sensitivity to temperature (a significant
concern in industrial and automotive environments), higher costs, and
increased device size. For most embedded applications, a UPS isn't a viable
solution.

2. Capacitors: A potential alternative, but they cannot offer prolonged
backup time. Increasing the number of capacitors to extend backup time leads
to additional issues:
   - Increased costs and space requirements on the PCB.
   - The need to manage partially charged capacitors during power failures.
   - The requirement for a power supply capable of rapid charging.
   - The risk of not reaching a safe state before the backup energy
     depletes.
   - In specific environments, like explosive atmospheres, storing large
     amounts of energy can be hazardous.

Given these considerations, it's crucial to understand that such design choices
aren't merely "hacks". They represent a balance between different types of
trade-offs.

> > > It feels like if we're concerned about mitigating physical damage during
> > > the process of power failure that's a very limited set of devices - the
> > > storage case where we're in the middle of writing to flash or whatever
> > > is the most obvious case.
> 
> > Then why isn't userspace handling this?  This is a policy decision that
> > it needs to take to properly know what hardware needs to be shut down,
> > and what needs to happen in order to do that (i.e. flush, unmount,
> > etc.?)  And userspace today should be able to say, "power down this
> > device now!" for any device in the system based on the sysfs device
> > tree, or at the very least, force it to a specific power state.  So why
> > not handle this policy there?
> 
> Given the tight timelines it does seem reasonable to have some of this
> in the kernel - the specific decisions about how to handle these events
> can always be controlled from userspace (eg, with a sysfs file like we
> do for autosuspend delay times which seem to be in a similar ballpark).

Upon investigating the feasibility of a user space solution for eMMC
power control, I've concluded that it's likely not possible. The primary
issue is that most board designs don't include reset signaling for
eMMCs. Additionally, the eMMC power rail is usually linked to the
system's main power controller. While powering off is doable, cleanly
powering it back on isn’t feasible. This is especially problematic when
the rootfs is located on the eMMC, as power cycling the storage device
could lead to system instability.

Therefore, any user space method to power off eMMC wouldn't be reliable
or safe, as there's no way to ensure it can be turned back on without
risking the integrity of the system. The design rationale is clear:
avoiding the risks associated with powering off the primary storage
device.

Considering these constraints, the only practical implementation I see
is integrating this functionality into the system's shutdown sequence.
This approach ensures a controlled environment for powering off the
eMMC, avoiding potential issues.
Ferry Toth Nov. 26, 2023, 7:42 p.m. UTC | #18
Ha ha,

Funny discussion. As a hardware engineer (with no experience in 
automotive, but actual experience in industrial applications and 
debugging issues arising from bad shutdowns) let me add my 5ct at the end.

Op 26-11-2023 om 11:14 schreef Mark Brown:
> On Sat, Nov 25, 2023 at 07:58:12PM +0000, Greg Kroah-Hartman wrote:
>> On Sat, Nov 25, 2023 at 03:43:02PM +0000, Mark Brown wrote:
>>> On Sat, Nov 25, 2023 at 02:35:41PM +0000, Greg Kroah-Hartman wrote:
> 
>>>> That would be great, but I don't see that here, do you?  All I see is
>>>> the shutdown sequence changing because someone wants it to go "faster"
>>>> with the threat of hardware breaking if we don't meet that "faster"
>>>> number, yet no knowledge or guarantee that this number can ever be known
>>>> or happen.
> 
>>> The idea was to have somewhere to send notifications when the hardware
>>> starts reporting things like power supplies starting to fail.  We do
>>> have those from hardware, we just don't do anything terribly useful
>>> with them yet.
> 
>> Ok, but that's not what I recall this patchset doing, or did I missing
>> something?  All I saw was a "reorder the shutdown sequence" set of
>> changes.  Or at least that's all I remember at this point in time,
>> sorry, it's been a few days, but at least that lines up with what the
>> Subject line says above :)
> 
> That's not in the series, a bunch of it is merged in some form (eg, see
> hw_protection_shutdown()) and more of it would need to be built on top
> if this were merged.
> 
>>>> Agreed, but I don't think this patch is going to actually work properly
>>>> over time as there is no time values involved :)
> 
>>> This seems to be more into the area of mitigation than firm solution, I
>>> suspect users will be pleased if they can make a noticable dent in the
>>> number of failures they're seeing.
> 
>> Mitigation is good, but this patch series is just a hack by doing "throw
>> this device type at the front of the shutdown list because we have
>> hardware that crashes a lot" :)
> 
> Sounds like a mitigation to me.
> 
>>> It feels like if we're concerned about mitigating physical damage during
>>> the process of power failure that's a very limited set of devices - the
>>> storage case where we're in the middle of writing to flash or whatever
>>> is the most obvious case.
> 
>> Then why isn't userspace handling this?  This is a policy decision that
>> it needs to take to properly know what hardware needs to be shut down,
>> and what needs to happen in order to do that (i.e. flush, unmount,
>> etc.?)  And userspace today should be able to say, "power down this
>> device now!" for any device in the system based on the sysfs device
>> tree, or at the very least, force it to a specific power state.  So why
>> not handle this policy there?
> 
> Given the tight timelines it does seem reasonable to have some of this
> in the kernel - the specific decisions about how to handle these events
> can always be controlled from userspace (eg, with a sysfs file like we
> do for autosuspend delay times which seem to be in a similar ballpark).

I'd prefer not to call the HW broken in this case. The life of hardware 
(unlike software) continues during and after power down. That means 
there may be requirements and specs for it to conform to during those 
transitions and states. Unlike broken hardware, which does not conform 
to its specs. Typically, a HDD that autoparks its heads to a safe 
position on its last rotation energy, that's not broken, that's 
carefully designed.

That said, I agree with Greg, if there is a hard requirement to shutdown 
safely to prevent damage, the solution is not to shutdown fast. The 
solution is to shutdown on time.

In fact, if the software needs more energy to shutdown safely, any 
hardware engineer will consider that a requirement. And ask the 
appropriate question: "how much energy do you need exactly?". There are 
various reasons why that can not be answered in general. The most funny 
answer I ever got (thanks Albert) being: "My software doesn't consume 
energy".
Now, we do need to keep in mind that storing J in a supercap, executing 
a CPU at GHz, storing GB data do not come free. So, after making sure 
things shutdown in time, it often pays off to shorten that deadline, and 
indeed make it faster.

Looking at the above discussion from the different angles:
1) The hardware mentioned does not need to shutdown (as said, it doesn't 
need to be unmounted). It needs to be placed into a safe state on time. 
And the only thing here that can know for the particular hardware what 
is a safe state, is the driver itself.
2) To get a signal (Low Power Warning) to the driver on time, the 
PREEMPT_RT kernel seems like a natural choice.
3) To me (but hey who am I) it makes sense to have a generic mechanism 
from drivers to transition to their safe state if they require that.
4) I wouldn't worry about drivers fighting for priority, these systems 
are normally "embedded" with fixed hardware. Otherwise there is no way 
to calculate shutdown energy required and do proper hardware design.
Christian Loehle Nov. 27, 2023, 10:13 a.m. UTC | #19
On 25/11/2023 08:50, Oleksij Rempel wrote:
> On Sat, Nov 25, 2023 at 06:51:55AM +0000, Greg Kroah-Hartman wrote:
>> On Fri, Nov 24, 2023 at 07:57:25PM +0100, Oleksij Rempel wrote:
>>> On Fri, Nov 24, 2023 at 05:26:30PM +0000, Greg Kroah-Hartman wrote:
>>>> On Fri, Nov 24, 2023 at 05:32:34PM +0100, Oleksij Rempel wrote:
>>>>> On Fri, Nov 24, 2023 at 03:56:19PM +0000, Greg Kroah-Hartman wrote:
>>>>>> On Fri, Nov 24, 2023 at 03:49:46PM +0000, Mark Brown wrote:
>>>>>>> On Fri, Nov 24, 2023 at 03:27:48PM +0000, Greg Kroah-Hartman wrote:
>>>>>>>> On Fri, Nov 24, 2023 at 03:21:40PM +0000, Mark Brown wrote:
>>>>>>>
>>>>>>>>> This came out of some discussions about trying to handle emergency power
>>>>>>>>> failure notifications.
>>>>>>>
>>>>>>>> I'm sorry, but I don't know what that means.  Are you saying that the
>>>>>>>> kernel is now going to try to provide a hard guarantee that some devices
>>>>>>>> are going to be shut down in X number of seconds when asked?  If so, why
>>>>>>>> not do this in userspace?
>>>>>>>
>>>>>>> No, it was initially (or when I initially saw it anyway) handling of
>>>>>>> notifications from regulators that they're in trouble and we have some
>>>>>>> small amount of time to do anything we might want to do about it before
>>>>>>> we expire.
>>>>>>
>>>>>> So we are going to guarantee a "time" in which we are going to do
>>>>>> something?  Again, if that's required, why not do it in userspace using
>>>>>> a RT kernel?
>>>>>
>>>>> For the HW in question I have only 100ms time before power loss. By
>>>>> doing it over use space some we will have even less time to react.
>>>>
>>>> Why can't userspace react that fast?  Why will the kernel be somehow
>>>> faster?  Speed should be the same, just get the "power is cut" signal
>>>> and have userspace flush and unmount the disk before power is gone.  Why
>>>> can the kernel do this any differently?
>>>>
>>>>> In fact, this is not a new requirement. It exist on different flavors of
>>>>> automotive Linux for about 10 years. Linux in cars should be able to
>>>>> handle voltage drops for example on ignition and so on. The only new thing is
>>>>> the attempt to mainline it.
>>>>
>>>> But your patch is not guaranteeing anything, it's just doing a "I want
>>>> this done before the other devices are handled", that's it.  There is no
>>>> chance that 100ms is going to be a requirement, or that some other
>>>> device type is not going to come along and demand to be ahead of your
>>>> device in the list.
>>>>
>>>> So you are going to have a constant fight among device types over the
>>>> years, and people complaining that the kernel is now somehow going to
>>>> guarantee that a device is shutdown in a set amount of time, which
>>>> again, the kernel can not guarantee here.
>>>>
>>>> This might work as a one-off for a specific hardware platform, which is
>>>> odd, but not anything you really should be adding for anyone else to use
>>>> here as your reasoning for it does not reflect what the code does.
>>>
>>> I see. Good point.
>>>
>>> In my case umount is not needed, there is not enough time to write down
>>> the data. We should send a shutdown command to the eMMC ASAP.
>>
>> If you don't care about the data, why is a shutdown command to the
>> hardware needed?  What does that do that makes anything "safe" if your
>> data is lost.
> 
> It prevents HW damage. In a typical automotive under-voltage labor it is
> usually possible to reproduce X amount of bricked eMMCs or NANDs on Y
> amount of under-voltage cycles (I do not have exact numbers right now).
> Even if the numbers not so high in the labor tests (sometimes something
> like one bricked device in a month of tests), the field returns are
> significant enough to care about software solution for this problem.
> 
> Same problem was seen not only in automotive devices, but also in
> industrial or agricultural. With other words, it is important enough to bring
> some kind of solution mainline.
> 

IMO that is a serious problem with the used storage / eMMC in that case and it
is not suitable for industrial/automotive uses?
Any industrial/automotive-suitable storage device should detect under-voltage and
just treat it as a power-down/loss, and while that isn't nice for the storage device,
it really shouldn't be able to brick a device (within <1M cycles anyway).
What does the storage module vendor say about this?

BR,
Christian
Christian Loehle Nov. 27, 2023, 11:27 a.m. UTC | #20
On 26/11/2023 19:31, Oleksij Rempel wrote:
> On Sun, Nov 26, 2023 at 10:14:45AM +0000, Mark Brown wrote:
>> On Sat, Nov 25, 2023 at 07:58:12PM +0000, Greg Kroah-Hartman wrote:
>>> On Sat, Nov 25, 2023 at 03:43:02PM +0000, Mark Brown wrote:
>>>> On Sat, Nov 25, 2023 at 02:35:41PM +0000, Greg Kroah-Hartman wrote:
>>
>>>>> That would be great, but I don't see that here, do you?  All I see is
>>>>> the shutdown sequence changing because someone wants it to go "faster"
>>>>> with the threat of hardware breaking if we don't meet that "faster"
>>>>> number, yet no knowledge or guarantee that this number can ever be known
>>>>> or happen.
>>
>>>> The idea was to have somewhere to send notifications when the hardware
>>>> starts reporting things like power supplies starting to fail.  We do
>>>> have those from hardware, we just don't do anything terribly useful
>>>> with them yet.
>>
>>> Ok, but that's not what I recall this patchset doing, or did I missing
>>> something?  All I saw was a "reorder the shutdown sequence" set of
>>> changes.  Or at least that's all I remember at this point in time,
>>> sorry, it's been a few days, but at least that lines up with what the
>>> Subject line says above :)
>>
>> That's not in the series, a bunch of it is merged in some form (eg, see
>> hw_protection_shutdown()) and more of it would need to be built on top
>> if this were merged.
> 
> The current kernel has enough infrastructure to manage essential functions
> related to hardware protection:
> - The Device Tree specifies the source of interrupts for detecting
>   under-voltage events. It also details critical system regulators and some
>   of specification of backup power supplied by the board.
> - Various frameworks within the kernel can identify critical hardware
>   conditions like over-temperature and under-voltage. Upon detection, these
>   frameworks invoke the hw_protection_shutdown() function.
> 
>>>>> Agreed, but I don't think this patch is going to actually work properly
>>>>> over time as there is no time values involved :)
> 
> If we're to implement a deadline for each shutdown call (as the requirement for
> "time values" suggests?), then prioritization becomes essential. Without
> establishing a shutdown order, the inclusion of time values might not be
> effectively utilized.  Am I overlooking anything in this regard?
> 
>>>> This seems to be more into the area of mitigation than firm solution, I
>>>> suspect users will be pleased if they can make a noticable dent in the
>>>> number of failures they're seeing.
>>
>>> Mitigation is good, but this patch series is just a hack by doing "throw
>>> this device type at the front of the shutdown list because we have
>>> hardware that crashes a lot" :)
> 
> The root of the issue seems to be the choice of primary storage device.
> 
> All storage technologies - HDD, SSD, eMMC, NAND - are vulnerable to power
> loss. The only foolproof safeguard is a backup power source, but this
> introduces its own set of challenges:

I disagree and would say that any storage device sold as "industrial" should
guarantee power-fail safety. Plus, you mentioned data loss isn't even your concern,
but the storage device fails/bricks.
> 
> 1. Batteries: While they provide a backup, they come with limitations like a
> finite number of charge cycles, sensitivity to temperature (a significant
> concern in industrial and automotive environments), higher costs, and
> increased device size. For most embedded applications, a UPS isn't a viable
> solution.
> 
> 2. Capacitors: A potential alternative, but they cannot offer prolonged
> backup time. Increasing the number of capacitors to extend backup time leads
> to additional issues:
>    - Increased costs and space requirements on the PCB.
>    - The need to manage partially charged capacitors during power failures.
>    - The requirement for a power supply capable of rapid charging.
>    - The risk of not reaching a safe state before the backup energy
>      depletes.
>    - In specific environments, like explosive atmospheres, storing large
>      amounts of energy can be hazardous.

And also just practically, ensuring a safe power down could be in the order
of a second, so it would be quite a capacitor.

> 
> Given these considerations, it's crucial to understand that such design choices
> aren't merely "hacks". They represent a balance between different types of
> trade-offs.
> 
>>>> It feels like if we're concerned about mitigating physical damage during
>>>> the process of power failure that's a very limited set of devices - the
>>>> storage case where we're in the middle of writing to flash or whatever
>>>> is the most obvious case.
>>
>>> Then why isn't userspace handling this?  This is a policy decision that
>>> it needs to take to properly know what hardware needs to be shut down,
>>> and what needs to happen in order to do that (i.e. flush, unmount,
>>> etc.?)  And userspace today should be able to say, "power down this
>>> device now!" for any device in the system based on the sysfs device
>>> tree, or at the very least, force it to a specific power state.  So why
>>> not handle this policy there?
>>
>> Given the tight timelines it does seem reasonable to have some of this
>> in the kernel - the specific decisions about how to handle these events
>> can always be controlled from userspace (eg, with a sysfs file like we
>> do for autosuspend delay times which seem to be in a similar ballpark).
> 
> Upon investigating the feasibility of a user space solution for eMMC
> power control, I've concluded that it's likely not possible. The primary
> issue is that most board designs don't include reset signaling for
> eMMCs. Additionally, the eMMC power rail is usually linked to the
> system's main power controller. While powering off is doable, cleanly
> powering it back on isn’t feasible. This is especially problematic when
> the rootfs is located on the eMMC, as power cycling the storage device
> could lead to system instability.
> 
> Therefore, any user space method to power off eMMC wouldn't be reliable
> or safe, as there's no way to ensure it can be turned back on without
> risking the integrity of the system. The design rationale is clear:
> avoiding the risks associated with powering off the primary storage
> device.
> 
> Considering these constraints, the only practical implementation I see
> is integrating this functionality into the system's shutdown sequence.
> This approach ensures a controlled environment for powering off the
> eMMC, avoiding potential issues.

You don't need the RST signal, in fact even if you had it it would be
the wrong thing to do. (Implementation is vendor-specific but RST
assumes that eMMCs' VCC and VCCQ are left untouched.)
You can try turning off eMMC cache completely and/or sending power down
notification on 'emergency shutdown', but since power-loss/fail behavior
is vendor-specific asking the storage device vendor how to ensure a safe
power-down.
Anyway the proper eMMC power-down methods are up to a second in timeouts,
so infeasible for your requirements from what I can see.

BR,
Christian
Oleksij Rempel Nov. 27, 2023, 11:36 a.m. UTC | #21
On Mon, Nov 27, 2023 at 10:13:49AM +0000, Christian Loehle wrote:
> > Same problem was seen not only in automotive devices, but also in
> > industrial or agricultural. With other words, it is important enough to bring
> > some kind of solution mainline.
> > 
> 
> IMO that is a serious problem with the used storage / eMMC in that case and it
> is not suitable for industrial/automotive uses?
> Any industrial/automotive-suitable storage device should detect under-voltage and
> just treat it as a power-down/loss, and while that isn't nice for the storage device,
> it really shouldn't be able to brick a device (within <1M cycles anyway).
> What does the storage module vendor say about this?

Good question. I do not have insights ATM. I'll forward it.
Oleksij Rempel Nov. 27, 2023, 11:44 a.m. UTC | #22
On Mon, Nov 27, 2023 at 11:27:31AM +0000, Christian Loehle wrote:
> On 26/11/2023 19:31, Oleksij Rempel wrote:
> > On Sun, Nov 26, 2023 at 10:14:45AM +0000, Mark Brown wrote:
> >> On Sat, Nov 25, 2023 at 07:58:12PM +0000, Greg Kroah-Hartman wrote:
> >>> On Sat, Nov 25, 2023 at 03:43:02PM +0000, Mark Brown wrote:
> >>>> On Sat, Nov 25, 2023 at 02:35:41PM +0000, Greg Kroah-Hartman wrote:
> >>
> >>>>> That would be great, but I don't see that here, do you?  All I see is
> >>>>> the shutdown sequence changing because someone wants it to go "faster"
> >>>>> with the threat of hardware breaking if we don't meet that "faster"
> >>>>> number, yet no knowledge or guarantee that this number can ever be known
> >>>>> or happen.
> >>
> >>>> The idea was to have somewhere to send notifications when the hardware
> >>>> starts reporting things like power supplies starting to fail.  We do
> >>>> have those from hardware, we just don't do anything terribly useful
> >>>> with them yet.
> >>
> >>> Ok, but that's not what I recall this patchset doing, or did I missing
> >>> something?  All I saw was a "reorder the shutdown sequence" set of
> >>> changes.  Or at least that's all I remember at this point in time,
> >>> sorry, it's been a few days, but at least that lines up with what the
> >>> Subject line says above :)
> >>
> >> That's not in the series, a bunch of it is merged in some form (eg, see
> >> hw_protection_shutdown()) and more of it would need to be built on top
> >> if this were merged.
> > 
> > The current kernel has enough infrastructure to manage essential functions
> > related to hardware protection:
> > - The Device Tree specifies the source of interrupts for detecting
> >   under-voltage events. It also details critical system regulators and some
> >   of specification of backup power supplied by the board.
> > - Various frameworks within the kernel can identify critical hardware
> >   conditions like over-temperature and under-voltage. Upon detection, these
> >   frameworks invoke the hw_protection_shutdown() function.
> > 
> >>>>> Agreed, but I don't think this patch is going to actually work properly
> >>>>> over time as there is no time values involved :)
> > 
> > If we're to implement a deadline for each shutdown call (as the requirement for
> > "time values" suggests?), then prioritization becomes essential. Without
> > establishing a shutdown order, the inclusion of time values might not be
> > effectively utilized.  Am I overlooking anything in this regard?
> > 
> >>>> This seems to be more into the area of mitigation than firm solution, I
> >>>> suspect users will be pleased if they can make a noticable dent in the
> >>>> number of failures they're seeing.
> >>
> >>> Mitigation is good, but this patch series is just a hack by doing "throw
> >>> this device type at the front of the shutdown list because we have
> >>> hardware that crashes a lot" :)
> > 
> > The root of the issue seems to be the choice of primary storage device.
> > 
> > All storage technologies - HDD, SSD, eMMC, NAND - are vulnerable to power
> > loss. The only foolproof safeguard is a backup power source, but this
> > introduces its own set of challenges:
> 
> I disagree and would say that any storage device sold as "industrial" should
> guarantee power-fail safety. Plus, you mentioned data loss isn't even your concern,
> but the storage device fails/bricks.
> > 
> > 1. Batteries: While they provide a backup, they come with limitations like a
> > finite number of charge cycles, sensitivity to temperature (a significant
> > concern in industrial and automotive environments), higher costs, and
> > increased device size. For most embedded applications, a UPS isn't a viable
> > solution.
> > 
> > 2. Capacitors: A potential alternative, but they cannot offer prolonged
> > backup time. Increasing the number of capacitors to extend backup time leads
> > to additional issues:
> >    - Increased costs and space requirements on the PCB.
> >    - The need to manage partially charged capacitors during power failures.
> >    - The requirement for a power supply capable of rapid charging.
> >    - The risk of not reaching a safe state before the backup energy
> >      depletes.
> >    - In specific environments, like explosive atmospheres, storing large
> >      amounts of energy can be hazardous.
> 
> And also just practically, ensuring a safe power down could be in the order
> of a second, so it would be quite a capacitor.
> 
> > 
> > Given these considerations, it's crucial to understand that such design choices
> > aren't merely "hacks". They represent a balance between different types of
> > trade-offs.
> > 
> >>>> It feels like if we're concerned about mitigating physical damage during
> >>>> the process of power failure that's a very limited set of devices - the
> >>>> storage case where we're in the middle of writing to flash or whatever
> >>>> is the most obvious case.
> >>
> >>> Then why isn't userspace handling this?  This is a policy decision that
> >>> it needs to take to properly know what hardware needs to be shut down,
> >>> and what needs to happen in order to do that (i.e. flush, unmount,
> >>> etc.?)  And userspace today should be able to say, "power down this
> >>> device now!" for any device in the system based on the sysfs device
> >>> tree, or at the very least, force it to a specific power state.  So why
> >>> not handle this policy there?
> >>
> >> Given the tight timelines it does seem reasonable to have some of this
> >> in the kernel - the specific decisions about how to handle these events
> >> can always be controlled from userspace (eg, with a sysfs file like we
> >> do for autosuspend delay times which seem to be in a similar ballpark).
> > 
> > Upon investigating the feasibility of a user space solution for eMMC
> > power control, I've concluded that it's likely not possible. The primary
> > issue is that most board designs don't include reset signaling for
> > eMMCs. Additionally, the eMMC power rail is usually linked to the
> > system's main power controller. While powering off is doable, cleanly
> > powering it back on isn’t feasible. This is especially problematic when
> > the rootfs is located on the eMMC, as power cycling the storage device
> > could lead to system instability.
> > 
> > Therefore, any user space method to power off eMMC wouldn't be reliable
> > or safe, as there's no way to ensure it can be turned back on without
> > risking the integrity of the system. The design rationale is clear:
> > avoiding the risks associated with powering off the primary storage
> > device.
> > 
> > Considering these constraints, the only practical implementation I see
> > is integrating this functionality into the system's shutdown sequence.
> > This approach ensures a controlled environment for powering off the
> > eMMC, avoiding potential issues.
> 
> You don't need the RST signal, in fact even if you had it it would be
> the wrong thing to do. (Implementation is vendor-specific but RST
> assumes that eMMCs' VCC and VCCQ are left untouched.)

It means, if VCC and VCCQ are off on reboot or watchdog reset, there is
potentially bigger problem?

> You can try turning off eMMC cache completely and/or sending power down
> notification on 'emergency shutdown', but since power-loss/fail behavior
> is vendor-specific asking the storage device vendor how to ensure a safe
> power-down.
> Anyway the proper eMMC power-down methods are up to a second in timeouts,
> so infeasible for your requirements from what I can see.

Ok. So, increasing capacity at least to one second should be main goal
for now? But even if capacity is increased, emergency shutdown should
notify eMMCs as early as possible?
Christian Loehle Nov. 27, 2023, 11:57 a.m. UTC | #23
On 27/11/2023 11:44, Oleksij Rempel wrote:
> On Mon, Nov 27, 2023 at 11:27:31AM +0000, Christian Loehle wrote:
>> On 26/11/2023 19:31, Oleksij Rempel wrote:
>>> On Sun, Nov 26, 2023 at 10:14:45AM +0000, Mark Brown wrote:
>>>> On Sat, Nov 25, 2023 at 07:58:12PM +0000, Greg Kroah-Hartman wrote:
>>>>> On Sat, Nov 25, 2023 at 03:43:02PM +0000, Mark Brown wrote:
>>>>>> On Sat, Nov 25, 2023 at 02:35:41PM +0000, Greg Kroah-Hartman wrote:
>>>>
>>>>>>> That would be great, but I don't see that here, do you?  All I see is
>>>>>>> the shutdown sequence changing because someone wants it to go "faster"
>>>>>>> with the threat of hardware breaking if we don't meet that "faster"
>>>>>>> number, yet no knowledge or guarantee that this number can ever be known
>>>>>>> or happen.
>>>>
>>>>>> The idea was to have somewhere to send notifications when the hardware
>>>>>> starts reporting things like power supplies starting to fail.  We do
>>>>>> have those from hardware, we just don't do anything terribly useful
>>>>>> with them yet.
>>>>
>>>>> Ok, but that's not what I recall this patchset doing, or did I missing
>>>>> something?  All I saw was a "reorder the shutdown sequence" set of
>>>>> changes.  Or at least that's all I remember at this point in time,
>>>>> sorry, it's been a few days, but at least that lines up with what the
>>>>> Subject line says above :)
>>>>
>>>> That's not in the series, a bunch of it is merged in some form (eg, see
>>>> hw_protection_shutdown()) and more of it would need to be built on top
>>>> if this were merged.
>>>
>>> The current kernel has enough infrastructure to manage essential functions
>>> related to hardware protection:
>>> - The Device Tree specifies the source of interrupts for detecting
>>>   under-voltage events. It also details critical system regulators and some
>>>   of specification of backup power supplied by the board.
>>> - Various frameworks within the kernel can identify critical hardware
>>>   conditions like over-temperature and under-voltage. Upon detection, these
>>>   frameworks invoke the hw_protection_shutdown() function.
>>>
>>>>>>> Agreed, but I don't think this patch is going to actually work properly
>>>>>>> over time as there is no time values involved :)
>>>
>>> If we're to implement a deadline for each shutdown call (as the requirement for
>>> "time values" suggests?), then prioritization becomes essential. Without
>>> establishing a shutdown order, the inclusion of time values might not be
>>> effectively utilized.  Am I overlooking anything in this regard?
>>>
>>>>>> This seems to be more into the area of mitigation than firm solution, I
>>>>>> suspect users will be pleased if they can make a noticable dent in the
>>>>>> number of failures they're seeing.
>>>>
>>>>> Mitigation is good, but this patch series is just a hack by doing "throw
>>>>> this device type at the front of the shutdown list because we have
>>>>> hardware that crashes a lot" :)
>>>
>>> The root of the issue seems to be the choice of primary storage device.
>>>
>>> All storage technologies - HDD, SSD, eMMC, NAND - are vulnerable to power
>>> loss. The only foolproof safeguard is a backup power source, but this
>>> introduces its own set of challenges:
>>
>> I disagree and would say that any storage device sold as "industrial" should
>> guarantee power-fail safety. Plus, you mentioned data loss isn't even your concern,
>> but the storage device fails/bricks.
>>>
>>> 1. Batteries: While they provide a backup, they come with limitations like a
>>> finite number of charge cycles, sensitivity to temperature (a significant
>>> concern in industrial and automotive environments), higher costs, and
>>> increased device size. For most embedded applications, a UPS isn't a viable
>>> solution.
>>>
>>> 2. Capacitors: A potential alternative, but they cannot offer prolonged
>>> backup time. Increasing the number of capacitors to extend backup time leads
>>> to additional issues:
>>>    - Increased costs and space requirements on the PCB.
>>>    - The need to manage partially charged capacitors during power failures.
>>>    - The requirement for a power supply capable of rapid charging.
>>>    - The risk of not reaching a safe state before the backup energy
>>>      depletes.
>>>    - In specific environments, like explosive atmospheres, storing large
>>>      amounts of energy can be hazardous.
>>
>> And also just practically, ensuring a safe power down could be in the order
>> of a second, so it would be quite a capacitor.
>>
>>>
>>> Given these considerations, it's crucial to understand that such design choices
>>> aren't merely "hacks". They represent a balance between different types of
>>> trade-offs.
>>>
>>>>>> It feels like if we're concerned about mitigating physical damage during
>>>>>> the process of power failure that's a very limited set of devices - the
>>>>>> storage case where we're in the middle of writing to flash or whatever
>>>>>> is the most obvious case.
>>>>
>>>>> Then why isn't userspace handling this?  This is a policy decision that
>>>>> it needs to take to properly know what hardware needs to be shut down,
>>>>> and what needs to happen in order to do that (i.e. flush, unmount,
>>>>> etc.?)  And userspace today should be able to say, "power down this
>>>>> device now!" for any device in the system based on the sysfs device
>>>>> tree, or at the very least, force it to a specific power state.  So why
>>>>> not handle this policy there?
>>>>
>>>> Given the tight timelines it does seem reasonable to have some of this
>>>> in the kernel - the specific decisions about how to handle these events
>>>> can always be controlled from userspace (eg, with a sysfs file like we
>>>> do for autosuspend delay times which seem to be in a similar ballpark).
>>>
>>> Upon investigating the feasibility of a user space solution for eMMC
>>> power control, I've concluded that it's likely not possible. The primary
>>> issue is that most board designs don't include reset signaling for
>>> eMMCs. Additionally, the eMMC power rail is usually linked to the
>>> system's main power controller. While powering off is doable, cleanly
>>> powering it back on isn’t feasible. This is especially problematic when
>>> the rootfs is located on the eMMC, as power cycling the storage device
>>> could lead to system instability.
>>>
>>> Therefore, any user space method to power off eMMC wouldn't be reliable
>>> or safe, as there's no way to ensure it can be turned back on without
>>> risking the integrity of the system. The design rationale is clear:
>>> avoiding the risks associated with powering off the primary storage
>>> device.
>>>
>>> Considering these constraints, the only practical implementation I see
>>> is integrating this functionality into the system's shutdown sequence.
>>> This approach ensures a controlled environment for powering off the
>>> eMMC, avoiding potential issues.
>>
>> You don't need the RST signal, in fact even if you had it it would be
>> the wrong thing to do. (Implementation is vendor-specific but RST
>> assumes that eMMCs' VCC and VCCQ are left untouched.)
> 
> It means, if VCC and VCCQ are off on reboot or watchdog reset, there is
> potentially bigger problem?

Just to confirm I was talking about EMMC_RST signal, which I understood you
where talking about, too?
Sending a EMMC_RST pulse does not have to trigger a safe shutdown for the eMMC,
(again it could), but both VCC and VCCQ should be left untouched and stable.
(Otherwise with the short timeout on EMMC_RST you might as well toggle EMMC_VCC.)
If you toggle EMMC_VCC and EMMC_VCCQ on system reboot/reset is a design choice,
but if your eMMC module has trouble with 'sudden' power-loss, leaving them
on could be beneficial?
Anyway definitely not required and not really related to your issue.

> 
>> You can try turning off eMMC cache completely and/or sending power down
>> notification on 'emergency shutdown', but since power-loss/fail behavior
>> is vendor-specific asking the storage device vendor how to ensure a safe
>> power-down.
>> Anyway the proper eMMC power-down methods are up to a second in timeouts,
>> so infeasible for your requirements from what I can see.
> 
> Ok. So, increasing capacity at least to one second should be main goal
> for now? But even if capacity is increased, emergency shutdown should
> notify eMMCs as early as possible? 

Well if that is an option that is a path to explore for sure, that is,
if you don't want to switch the eMMC to something more fitting.
Yes you would need to notify the eMMC somehow, otherwise you just delayed
the power-fail for a second.
Sending a sleep, power-down notification or flushing the cache could all be
ways to trigger a safe shutdown for the eMMC, but again, you really would have
to confirm this with the eMMC vendor, as all of these can essentially be
implemented as a NOP (apart from the state-machine transition) and be
spec-compliant.

BR,
Christian
Matti Vaittinen Nov. 27, 2023, 12:54 p.m. UTC | #24
pe 24. marrask. 2023 klo 19.26 Greg Kroah-Hartman
(gregkh@linuxfoundation.org) kirjoitti:
>
> On Fri, Nov 24, 2023 at 05:32:34PM +0100, Oleksij Rempel wrote:
> > On Fri, Nov 24, 2023 at 03:56:19PM +0000, Greg Kroah-Hartman wrote:
> > > On Fri, Nov 24, 2023 at 03:49:46PM +0000, Mark Brown wrote:
> > > > On Fri, Nov 24, 2023 at 03:27:48PM +0000, Greg Kroah-Hartman wrote:
> > > > > On Fri, Nov 24, 2023 at 03:21:40PM +0000, Mark Brown wrote:
> > > >
> > > > > > This came out of some discussions about trying to handle emergency power
> > > > > > failure notifications.
> > > >
> > > > > I'm sorry, but I don't know what that means.  Are you saying that the
> > > > > kernel is now going to try to provide a hard guarantee that some devices
> > > > > are going to be shut down in X number of seconds when asked?  If so, why
> > > > > not do this in userspace?
> > > >
> > > > No, it was initially (or when I initially saw it anyway) handling of
> > > > notifications from regulators that they're in trouble and we have some
> > > > small amount of time to do anything we might want to do about it before
> > > > we expire.
> > >
> > > So we are going to guarantee a "time" in which we are going to do
> > > something?  Again, if that's required, why not do it in userspace using
> > > a RT kernel?
> >
> > For the HW in question I have only 100ms time before power loss. By
> > doing it over use space some we will have even less time to react.
>
> Why can't userspace react that fast?  Why will the kernel be somehow
> faster?  Speed should be the same, just get the "power is cut" signal
> and have userspace flush and unmount the disk before power is gone.  Why
> can the kernel do this any differently?
>
> > In fact, this is not a new requirement. It exist on different flavors of
> > automotive Linux for about 10 years. Linux in cars should be able to
> > handle voltage drops for example on ignition and so on. The only new thing is
> > the attempt to mainline it.
>
> But your patch is not guaranteeing anything, it's just doing a "I want
> this done before the other devices are handled", that's it.  There is no
> chance that 100ms is going to be a requirement, or that some other
> device type is not going to come along and demand to be ahead of your
> device in the list.
>
> So you are going to have a constant fight among device types over the
> years, and people complaining that the kernel is now somehow going to
> guarantee that a device is shutdown in a set amount of time, which
> again, the kernel can not guarantee here.
>
> This might work as a one-off for a specific hardware platform, which is
> odd, but not anything you really should be adding for anyone else to use
> here as your reasoning for it does not reflect what the code does.

I was (am) interested in knowing how/where the regulator error
notifications are utilized - hence I asked this in ELCE last summer.
Replies indeed mostly pointed to automotive and handling the under
voltage events.

As to what has changed (I think this was asked in another mail on this
topic) - I understood from the discussions that the demand of running
systems with as low power as possible is even more
important/desirable. Hence, the under-voltage events are more usual
than they were when cars used to be working by burning flammable
liquids :)

Anyways, what I thought I'd comment on is that the severity of the
regulator error notifications can be given from device-tree. Rationale
behind this is that figuring out whether a certain detected problem is
fatal or not (in embedded systems) should be done by the board
designers, per board. Maybe the understanding which hardware should
react first is also a property of hardware and could come from the
device-tree? Eg, instead of having a "DEVICE_SHUTDOWN_PRIO_STORAGE"
set unconditionally for EMMC, systems could set shutdown priority per
board and per device explicitly using device-tree?

Yours,
    -- Matti
Greg Kroah-Hartman Nov. 27, 2023, 1:08 p.m. UTC | #25
On Mon, Nov 27, 2023 at 02:54:21PM +0200, Matti Vaittinen wrote:
> pe 24. marrask. 2023 klo 19.26 Greg Kroah-Hartman
> (gregkh@linuxfoundation.org) kirjoitti:
> >
> > On Fri, Nov 24, 2023 at 05:32:34PM +0100, Oleksij Rempel wrote:
> > > On Fri, Nov 24, 2023 at 03:56:19PM +0000, Greg Kroah-Hartman wrote:
> > > > On Fri, Nov 24, 2023 at 03:49:46PM +0000, Mark Brown wrote:
> > > > > On Fri, Nov 24, 2023 at 03:27:48PM +0000, Greg Kroah-Hartman wrote:
> > > > > > On Fri, Nov 24, 2023 at 03:21:40PM +0000, Mark Brown wrote:
> > > > >
> > > > > > > This came out of some discussions about trying to handle emergency power
> > > > > > > failure notifications.
> > > > >
> > > > > > I'm sorry, but I don't know what that means.  Are you saying that the
> > > > > > kernel is now going to try to provide a hard guarantee that some devices
> > > > > > are going to be shut down in X number of seconds when asked?  If so, why
> > > > > > not do this in userspace?
> > > > >
> > > > > No, it was initially (or when I initially saw it anyway) handling of
> > > > > notifications from regulators that they're in trouble and we have some
> > > > > small amount of time to do anything we might want to do about it before
> > > > > we expire.
> > > >
> > > > So we are going to guarantee a "time" in which we are going to do
> > > > something?  Again, if that's required, why not do it in userspace using
> > > > a RT kernel?
> > >
> > > For the HW in question I have only 100ms time before power loss. By
> > > doing it over use space some we will have even less time to react.
> >
> > Why can't userspace react that fast?  Why will the kernel be somehow
> > faster?  Speed should be the same, just get the "power is cut" signal
> > and have userspace flush and unmount the disk before power is gone.  Why
> > can the kernel do this any differently?
> >
> > > In fact, this is not a new requirement. It exist on different flavors of
> > > automotive Linux for about 10 years. Linux in cars should be able to
> > > handle voltage drops for example on ignition and so on. The only new thing is
> > > the attempt to mainline it.
> >
> > But your patch is not guaranteeing anything, it's just doing a "I want
> > this done before the other devices are handled", that's it.  There is no
> > chance that 100ms is going to be a requirement, or that some other
> > device type is not going to come along and demand to be ahead of your
> > device in the list.
> >
> > So you are going to have a constant fight among device types over the
> > years, and people complaining that the kernel is now somehow going to
> > guarantee that a device is shutdown in a set amount of time, which
> > again, the kernel can not guarantee here.
> >
> > This might work as a one-off for a specific hardware platform, which is
> > odd, but not anything you really should be adding for anyone else to use
> > here as your reasoning for it does not reflect what the code does.
> 
> I was (am) interested in knowing how/where the regulator error
> notifications are utilized - hence I asked this in ELCE last summer.
> Replies indeed mostly pointed to automotive and handling the under
> voltage events.
> 
> As to what has changed (I think this was asked in another mail on this
> topic) - I understood from the discussions that the demand of running
> systems with as low power as possible is even more
> important/desirable. Hence, the under-voltage events are more usual
> than they were when cars used to be working by burning flammable
> liquids :)
> 
> Anyways, what I thought I'd comment on is that the severity of the
> regulator error notifications can be given from device-tree. Rationale
> behind this is that figuring out whether a certain detected problem is
> fatal or not (in embedded systems) should be done by the board
> designers, per board. Maybe the understanding which hardware should
> react first is also a property of hardware and could come from the
> device-tree? Eg, instead of having a "DEVICE_SHUTDOWN_PRIO_STORAGE"
> set unconditionally for EMMC, systems could set shutdown priority per
> board and per device explicitly using device-tree?

Yes, using device tree would be good, but now you have created something
that is device-tree-specific and not all the world is device tree :(

Also, many devices are finally moving out to non-device-tree busses,
like PCI and USB, so how would you handle them in this type of scheme?

thanks,

greg k-h
Mark Brown Nov. 27, 2023, 2:09 p.m. UTC | #26
On Sun, Nov 26, 2023 at 08:42:02PM +0100, Ferry Toth wrote:

> Funny discussion. As a hardware engineer (with no experience in automotive,
> but actual experience in industrial applications and debugging issues
> arising from bad shutdowns) let me add my 5ct at the end.

I suspect there's also a space here beyond systems that were designed
with these failure modes in mind where people run into issues once they
have the hardware and are trying to improve what they can after the fact.

> Now, we do need to keep in mind that storing J in a supercap, executing a
> CPU at GHz, storing GB data do not come free. So, after making sure things
> shutdown in time, it often pays off to shorten that deadline, and indeed
> make it faster.

Indeed.
Mark Brown Nov. 27, 2023, 2:24 p.m. UTC | #27
On Mon, Nov 27, 2023 at 01:08:24PM +0000, Greg Kroah-Hartman wrote:

> Yes, using device tree would be good, but now you have created something
> that is device-tree-specific and not all the world is device tree :(

AFAICT the idiomatic thing for ACPI would be platform quirks based on
DMI information.  Yay ACPI.  If the system is more Linux targetted then
you can use _DSD properties to store DT properties, these can then be
parsed out in a firmware interface neutral way via the fwnode API.  I'm
not sure there's any avoiding dealing with firmware interface specifics
at some point if we need platform description.

> Also, many devices are finally moving out to non-device-tree busses,
> like PCI and USB, so how would you handle them in this type of scheme?

DT does have bindings for devices on discoverable buses like PCI - I
think the original thing was for vendors cheaping out on EEPROMs though
it's also useful when things are soldered down in embedded systems.
Matti Vaittinen Nov. 27, 2023, 2:49 p.m. UTC | #28
On 11/27/23 15:08, Greg Kroah-Hartman wrote:
> On Mon, Nov 27, 2023 at 02:54:21PM +0200, Matti Vaittinen wrote:
>> pe 24. marrask. 2023 klo 19.26 Greg Kroah-Hartman
>> (gregkh@linuxfoundation.org) kirjoitti:
>>>
>>> On Fri, Nov 24, 2023 at 05:32:34PM +0100, Oleksij Rempel wrote:
>>>> On Fri, Nov 24, 2023 at 03:56:19PM +0000, Greg Kroah-Hartman wrote:
>>>>> On Fri, Nov 24, 2023 at 03:49:46PM +0000, Mark Brown wrote:
>>>>>> On Fri, Nov 24, 2023 at 03:27:48PM +0000, Greg Kroah-Hartman wrote:
>>>>>>> On Fri, Nov 24, 2023 at 03:21:40PM +0000, Mark Brown wrote:
>>>>>>
>>>>>>>> This came out of some discussions about trying to handle emergency power
>>>>>>>> failure notifications.
>>>>>>
>>>>>>> I'm sorry, but I don't know what that means.  Are you saying that the
>>>>>>> kernel is now going to try to provide a hard guarantee that some devices
>>>>>>> are going to be shut down in X number of seconds when asked?  If so, why
>>>>>>> not do this in userspace?
>>>>>>
>>>>>> No, it was initially (or when I initially saw it anyway) handling of
>>>>>> notifications from regulators that they're in trouble and we have some
>>>>>> small amount of time to do anything we might want to do about it before
>>>>>> we expire.
>>>>>
>>>>> So we are going to guarantee a "time" in which we are going to do
>>>>> something?  Again, if that's required, why not do it in userspace using
>>>>> a RT kernel?
>>>>
>>>> For the HW in question I have only 100ms time before power loss. By
>>>> doing it over use space some we will have even less time to react.
>>>
>>> Why can't userspace react that fast?  Why will the kernel be somehow
>>> faster?  Speed should be the same, just get the "power is cut" signal
>>> and have userspace flush and unmount the disk before power is gone.  Why
>>> can the kernel do this any differently?
>>>
>>>> In fact, this is not a new requirement. It exist on different flavors of
>>>> automotive Linux for about 10 years. Linux in cars should be able to
>>>> handle voltage drops for example on ignition and so on. The only new thing is
>>>> the attempt to mainline it.
>>>
>>> But your patch is not guaranteeing anything, it's just doing a "I want
>>> this done before the other devices are handled", that's it.  There is no
>>> chance that 100ms is going to be a requirement, or that some other
>>> device type is not going to come along and demand to be ahead of your
>>> device in the list.
>>>
>>> So you are going to have a constant fight among device types over the
>>> years, and people complaining that the kernel is now somehow going to
>>> guarantee that a device is shutdown in a set amount of time, which
>>> again, the kernel can not guarantee here.
>>>
>>> This might work as a one-off for a specific hardware platform, which is
>>> odd, but not anything you really should be adding for anyone else to use
>>> here as your reasoning for it does not reflect what the code does.
>>
>> I was (am) interested in knowing how/where the regulator error
>> notifications are utilized - hence I asked this in ELCE last summer.
>> Replies indeed mostly pointed to automotive and handling the under
>> voltage events.
>>
>> As to what has changed (I think this was asked in another mail on this
>> topic) - I understood from the discussions that the demand of running
>> systems with as low power as possible is even more
>> important/desirable. Hence, the under-voltage events are more usual
>> than they were when cars used to be working by burning flammable
>> liquids :)
>>
>> Anyways, what I thought I'd comment on is that the severity of the
>> regulator error notifications can be given from device-tree. Rationale
>> behind this is that figuring out whether a certain detected problem is
>> fatal or not (in embedded systems) should be done by the board
>> designers, per board. Maybe the understanding which hardware should
>> react first is also a property of hardware and could come from the
>> device-tree? Eg, instead of having a "DEVICE_SHUTDOWN_PRIO_STORAGE"
>> set unconditionally for EMMC, systems could set shutdown priority per
>> board and per device explicitly using device-tree?
> 
> Yes, using device tree would be good, but now you have created something
> that is device-tree-specific and not all the world is device tree :(

True. However, my understanding is that the regulator subsystem is 
largely written to work with DT-based systems. Hence supporting the 
DT-based solution would probably fit to this specific use-case as source 
of problem notifications is the regulator subsystem.

> Also, many devices are finally moving out to non-device-tree busses,
> like PCI and USB, so how would you handle them in this type of scheme?

I do readily admit I don't have [all ;) ] the answers. I also think that 
if we add support for prioritized shutdown on device-tree-based systems, 
people may eventually want to use this on non device-tree setups too. 
There may also be other use-cases for prioritized shutdown (Don't know 
what they would be though).

For now I would leave that to be the problem of the folks who need non 
device-tree systems when (if) this needs realizes. Assuming there was 
the handling of priorities in place, the missing piece would then be to 
find out the place to store this hardware specific priority information. 
If this is solved for the non DT cases, then the DT-based and non 
DT-based solutions can co-exist.

Just a suggestion though. I am not working on under-voltage "stuff" 
right now.

Yours,
	-- Matti
Mark Brown Nov. 27, 2023, 4:23 p.m. UTC | #29
On Mon, Nov 27, 2023 at 04:49:49PM +0200, Matti Vaittinen wrote:
> On 11/27/23 15:08, Greg Kroah-Hartman wrote:

> > Yes, using device tree would be good, but now you have created something
> > that is device-tree-specific and not all the world is device tree :(

> True. However, my understanding is that the regulator subsystem is largely
> written to work with DT-based systems. Hence supporting the DT-based
> solution would probably fit to this specific use-case as source of problem
> notifications is the regulator subsystem.

Yes, ACPI has a strong model that things like regulators and clocks are
not visible to the OS.
Ulf Hansson Nov. 30, 2023, 9:57 a.m. UTC | #30
On Fri, 24 Nov 2023 at 15:53, Oleksij Rempel <o.rempel@pengutronix.de> wrote:
>
> Hi,
>
> This patch series introduces support for prioritized device shutdown.
> The main goal is to enable prioritization for shutting down specific
> devices, particularly crucial in scenarios like power loss where
> hardware damage can occur if not handled properly.
>
> Oleksij Rempel (3):
>   driver core: move core part of device_shutdown() to a separate
>     function
>   driver core: introduce prioritized device shutdown sequence
>   mmc: core: increase shutdown priority for MMC devices
>
>  drivers/base/core.c    | 157 +++++++++++++++++++++++++++--------------
>  drivers/mmc/core/bus.c |   2 +
>  include/linux/device.h |  51 ++++++++++++-
>  kernel/reboot.c        |   4 +-
>  4 files changed, 157 insertions(+), 57 deletions(-)
>

Sorry for joining the discussions a bit late! Besides the valuable
feedback that you already received from others (which indicates that
we have quite some work to do in the commit messages to better explain
and justify these changes), I wanted to share my overall thoughts
around this.

So, I fully understand the reason behind the $subject series, as we
unfortunately can't rely on flash-based (NAND/NOR) storage devices
being 100% tolerant to sudden-power failures. Besides for the reasons
already discussed in the thread, the robustness simply depends on the
"quality" of the FTL (flash translation layer) and the NAND/NOR/etc
device it runs.

For example, back in the days when Android showed up, we were testing
YAFFS and UBIFS on rawNAND, which failed miserably after just a few
thousands of power-cycles. It was even worse with ext3/4 on the early
variants of eMMC devices, as those survived only a few hundreds of
power-cycles. Now, I assume this has improved a lot over the years,
but I haven't really verified this myself.

That said, for eMMC and other flash-based storage devices, industrial
or not, I think it would make sense to try to notify the device about
the power-failure, if possible. This would add another level of
mitigation, I think.

From an implementation point of view, it looks to me that the approach
in the $subject series has some potential. Although, rather than
diving into the details, I will defer to review the next version.

Kind regards
Uffe
Francesco Dolcini Nov. 30, 2023, 9:59 p.m. UTC | #31
Hello all,

On Mon, Nov 27, 2023 at 12:36:11PM +0100, Oleksij Rempel wrote:
> On Mon, Nov 27, 2023 at 10:13:49AM +0000, Christian Loehle wrote:
> > > Same problem was seen not only in automotive devices, but also in
> > > industrial or agricultural. With other words, it is important enough to bring
> > > some kind of solution mainline.
> > > 
> > 
> > IMO that is a serious problem with the used storage / eMMC in that case and it
> > is not suitable for industrial/automotive uses?
> > Any industrial/automotive-suitable storage device should detect under-voltage and
> > just treat it as a power-down/loss, and while that isn't nice for the storage device,
> > it really shouldn't be able to brick a device (within <1M cycles anyway).
> > What does the storage module vendor say about this?
> 
> Good question. I do not have insights ATM. I'll forward it.

From personal experience I can tell that bricked eMMC devices happen because of
eMMC controller firmware bugs. You might find some recently committed quirk
on this regard.

Waiting for any additional details you might be able to find, given my past
experience I would agree with what Christian wrote.

Francesco