mbox series

[0/3] thermal: hook in with reboot and crash

Message ID 20230525211655.627415-1-evalenti@kernel.org (mailing list archive)
Headers show
Series thermal: hook in with reboot and crash | expand

Message

Eduardo Valentin May 25, 2023, 9:16 p.m. UTC
From: Eduardo Valentin <eduval@amazon.com>

Hello,

This small series of changes teaches thermal core about
reboot and crash callbacks. The intention is to have the core
to get notified and the pass in the event to thermal governors
that are willing to perform actions during reboot or crash events.
The thermal workers will be teared down in the process too.

There is no code dependency this series was built on top of:
https://lkml.org/lkml/2023/5/18/1207

Separate governor changes will be sent in another series.

BR,

Eduardo Valentin (3):
  thermal: core: introduce governor .reboot_prepare()
  thermal: core: register reboot nb
  thermal: core: register a crash callback

 drivers/thermal/thermal_core.c | 54 ++++++++++++++++++++++++++++++++++
 include/linux/thermal.h        |  4 +++
 2 files changed, 58 insertions(+)

Comments

Daniel Lezcano May 26, 2023, 8:27 a.m. UTC | #1
Hi,

On 25/05/2023 23:16, Eduardo Valentin wrote:
> From: Eduardo Valentin <eduval@amazon.com>
> 
> Hello,
> 
> This small series of changes teaches thermal core about
> reboot and crash callbacks. The intention is to have the core
> to get notified and the pass in the event to thermal governors
> that are willing to perform actions during reboot or crash events.
> The thermal workers will be teared down in the process too.

What problem does it solve?


> There is no code dependency this series was built on top of:
> https://lkml.org/lkml/2023/5/18/1207
> 
> Separate governor changes will be sent in another series.
> 
> BR,
> 
> Eduardo Valentin (3):
>    thermal: core: introduce governor .reboot_prepare()
>    thermal: core: register reboot nb
>    thermal: core: register a crash callback
> 
>   drivers/thermal/thermal_core.c | 54 ++++++++++++++++++++++++++++++++++
>   include/linux/thermal.h        |  4 +++
>   2 files changed, 58 insertions(+)
>
Eduardo Valentin June 5, 2023, 11:27 p.m. UTC | #2
On Fri, May 26, 2023 at 10:27:39AM +0200, Daniel Lezcano wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> Hi,
> 
> On 25/05/2023 23:16, Eduardo Valentin wrote:
> > From: Eduardo Valentin <eduval@amazon.com>
> > 
> > Hello,
> > 
> > This small series of changes teaches thermal core about
> > reboot and crash callbacks. The intention is to have the core
> > to get notified and the pass in the event to thermal governors
> > that are willing to perform actions during reboot or crash events.
> > The thermal workers will be teared down in the process too.
> 
> What problem does it solve?

This cover letter was not clear enough. In fact, the context for
all patches I am sending now and will be sending in near future
is when the thermal subsystem is configured to control temperature
of a target device. The thermal subsystem is configured to have
cooling devices that will act on the target system, and has
input, temperature sensors, to have visibility to the target system
temperature.  In this case, the problem is when the controlling system
becomes unresponsive upon reboot or crash, therefore losing
control of temperature of the target system. This series solves the
problem by giving knowledge to the governors of such events, allowing
the governors to have opportunity to act before the actual event happens. 

> 
> 
> > There is no code dependency this series was built on top of:
> > https://lkml.org/lkml/2023/5/18/1207
> > 
> > Separate governor changes will be sent in another series.
> > 
> > BR,
> > 
> > Eduardo Valentin (3):
> >    thermal: core: introduce governor .reboot_prepare()
> >    thermal: core: register reboot nb
> >    thermal: core: register a crash callback
> > 
> >   drivers/thermal/thermal_core.c | 54 ++++++++++++++++++++++++++++++++++
> >   include/linux/thermal.h        |  4 +++
> >   2 files changed, 58 insertions(+)
> > 
> 
> --
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
> 
> Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog
>
Eduardo Valentin June 5, 2023, 11:32 p.m. UTC | #3
On Mon, Jun 05, 2023 at 04:27:55PM -0700, Eduardo Valentin wrote:
> 
> 
> 
> On Fri, May 26, 2023 at 10:27:39AM +0200, Daniel Lezcano wrote:
> >
> >
> >
> > Hi,
> >
> > On 25/05/2023 23:16, Eduardo Valentin wrote:
> > > From: Eduardo Valentin <eduval@amazon.com>
> > >
> > > Hello,
> > >
> > > This small series of changes teaches thermal core about
> > > reboot and crash callbacks. The intention is to have the core
> > > to get notified and the pass in the event to thermal governors
> > > that are willing to perform actions during reboot or crash events.
> > > The thermal workers will be teared down in the process too.
> >
> > What problem does it solve?
> 
> This cover letter was not clear enough. In fact, the context for
> all patches I am sending now and will be sending in near future
> is when the thermal subsystem is configured to control temperature
> of a target device. The thermal subsystem is configured to have
> cooling devices that will act on the target system, and has
> input, temperature sensors, to have visibility to the target system
> temperature.  In this case, the problem is when the controlling system
> becomes unresponsive upon reboot or crash, therefore losing
> control of temperature of the target system. This series solves the
> problem by giving knowledge to the governors of such events, allowing
> the governors to have opportunity to act before the actual event happens.


Again, this is a different situation than a emergence shutdown due to 
temperature/overheat on the typical application of the thermal subsystem.
Where it runs in the same system it controls the temperature of.

Here we want to reduce the likelihood of loosing control of temperature of a target systems
upon events where the controlling system is unavailable.
Rafael J. Wysocki June 29, 2023, 7:07 p.m. UTC | #4
On Tue, Jun 6, 2023 at 1:32 AM Eduardo Valentin <evalenti@kernel.org> wrote:
>
> On Mon, Jun 05, 2023 at 04:27:55PM -0700, Eduardo Valentin wrote:
> >
> >
> >
> > On Fri, May 26, 2023 at 10:27:39AM +0200, Daniel Lezcano wrote:
> > >
> > >
> > >
> > > Hi,
> > >
> > > On 25/05/2023 23:16, Eduardo Valentin wrote:
> > > > From: Eduardo Valentin <eduval@amazon.com>
> > > >
> > > > Hello,
> > > >
> > > > This small series of changes teaches thermal core about
> > > > reboot and crash callbacks. The intention is to have the core
> > > > to get notified and the pass in the event to thermal governors
> > > > that are willing to perform actions during reboot or crash events.
> > > > The thermal workers will be teared down in the process too.
> > >
> > > What problem does it solve?
> >
> > This cover letter was not clear enough. In fact, the context for
> > all patches I am sending now and will be sending in near future
> > is when the thermal subsystem is configured to control temperature
> > of a target device. The thermal subsystem is configured to have
> > cooling devices that will act on the target system, and has
> > input, temperature sensors, to have visibility to the target system
> > temperature.  In this case, the problem is when the controlling system
> > becomes unresponsive upon reboot or crash, therefore losing
> > control of temperature of the target system. This series solves the
> > problem by giving knowledge to the governors of such events, allowing
> > the governors to have opportunity to act before the actual event happens.
>
>
> Again, this is a different situation than a emergence shutdown due to
> temperature/overheat on the typical application of the thermal subsystem.
> Where it runs in the same system it controls the temperature of.
>
> Here we want to reduce the likelihood of loosing control of temperature of a target systems
> upon events where the controlling system is unavailable.

So the use case for this seems to be a BMC running Linux that is
responsible for the thermal control of a host system.

It kind of escapes me why you want a thermal governor in the BMC's
kernel to be part of this.
Eduardo Valentin July 1, 2023, 1:36 a.m. UTC | #5
Hey Rafael,

On Thu, Jun 29, 2023 at 09:07:36PM +0200, Rafael J. Wysocki wrote:
> 
> 
> 
> On Tue, Jun 6, 2023 at 1:32 AM Eduardo Valentin <evalenti@kernel.org> wrote:
> >
> > On Mon, Jun 05, 2023 at 04:27:55PM -0700, Eduardo Valentin wrote:
> > >
> > >
> > >
> > > On Fri, May 26, 2023 at 10:27:39AM +0200, Daniel Lezcano wrote:
> > > >
> > > >
> > > >
> > > > Hi,
> > > >
> > > > On 25/05/2023 23:16, Eduardo Valentin wrote:
> > > > > From: Eduardo Valentin <eduval@amazon.com>
> > > > >
> > > > > Hello,
> > > > >
> > > > > This small series of changes teaches thermal core about
> > > > > reboot and crash callbacks. The intention is to have the core
> > > > > to get notified and the pass in the event to thermal governors
> > > > > that are willing to perform actions during reboot or crash events.
> > > > > The thermal workers will be teared down in the process too.
> > > >
> > > > What problem does it solve?
> > >
> > > This cover letter was not clear enough. In fact, the context for
> > > all patches I am sending now and will be sending in near future
> > > is when the thermal subsystem is configured to control temperature
> > > of a target device. The thermal subsystem is configured to have
> > > cooling devices that will act on the target system, and has
> > > input, temperature sensors, to have visibility to the target system
> > > temperature.  In this case, the problem is when the controlling system
> > > becomes unresponsive upon reboot or crash, therefore losing
> > > control of temperature of the target system. This series solves the
> > > problem by giving knowledge to the governors of such events, allowing
> > > the governors to have opportunity to act before the actual event happens.
> >
> >
> > Again, this is a different situation than a emergence shutdown due to
> > temperature/overheat on the typical application of the thermal subsystem.
> > Where it runs in the same system it controls the temperature of.
> >
> > Here we want to reduce the likelihood of loosing control of temperature of a target systems
> > upon events where the controlling system is unavailable.
> 
> So the use case for this seems to be a BMC running Linux that is
> responsible for the thermal control of a host system.

BMC is a great example here, yes. But not limited to it.

BMC is responsible of thermal control of not only host though,
but the entire server thermal, considering not only
host temperature sensor as inputs, but plenty of other sensing
in the motherboard, for instance.

The changes are not limited to BMC controlling a host temperature though.
Any system controlling a target device temperature will fit.

> 
> It kind of escapes me why you want a thermal governor in the BMC's
> kernel to be part of this.

Yeah, I understand the typical BMC design would have its control
in userland. But as I said this not necessarily only about BMCs
controlling host temperature.

Primary reason to have in kernel is reliability compared to a
userspace based control, avoiding downtime of the control,
having the control available and stable during the entire
life time of the controlling system, e.g from
early boot, and avoiding unnecessary down time of the control
due to application crashes. Sure the kernel can crash too :-),
that is why this patch was written in the first place, but
it is a more rare case.

Userspace may still be involved here where user wants to have,
for example a fixed override cooling state. But in that case
the user would need to disable the in kernel control and
take care of the cooling state and corner scenarios, like
crash and reboot.

I also do not see anything that would prevent a control
of the controlling system, like a BMC, be put in the kernel,
given all the abstractions needed, cooling devices, governor,
sensors, are available there. Mechanically and physically
the thermal zone is still something part of the hardware
where the controlling system, say a BMC, is responsible for.