diff mbox series

[v4,7/7] thermal: core: Record PSCR before hw_protection_shutdown()

Message ID 20250306093900.2199442-8-o.rempel@pengutronix.de (mailing list archive)
State Handled Elsewhere, archived
Headers show
Series Introduction of PSCR Framework and Related Components | expand

Commit Message

Oleksij Rempel March 6, 2025, 9:38 a.m. UTC
Enhance the thermal core to record the Power State Change Reason (PSCR)
prior to invoking hw_protection_shutdown(). This change integrates the
PSCR framework with the thermal subsystem, ensuring that reasons for
power state changes, such as overtemperature events, are stored in a
dedicated non-volatile memory (NVMEM) cell.

This 'black box' recording is crucial for post-mortem analysis, enabling
a deeper understanding of system failures and abrupt shutdowns,
especially in scenarios where PMICs or watchdog timers are incapable of
logging such events.  The recorded data can be utilized during system
recovery routines in the bootloader or early kernel stages of subsequent
boots, significantly enhancing system diagnostics, reliability, and
debugging capabilities.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
---
 drivers/thermal/thermal_core.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Daniel Lezcano March 12, 2025, 3:35 p.m. UTC | #1
Hi Oleksij,


On 06/03/2025 10:38, Oleksij Rempel wrote:
> Enhance the thermal core to record the Power State Change Reason (PSCR)
> prior to invoking hw_protection_shutdown(). This change integrates the
> PSCR framework with the thermal subsystem, ensuring that reasons for
> power state changes, such as overtemperature events, are stored in a
> dedicated non-volatile memory (NVMEM) cell.
> 
> This 'black box' recording is crucial for post-mortem analysis, enabling
> a deeper understanding of system failures and abrupt shutdowns,
> especially in scenarios where PMICs or watchdog timers are incapable of
> logging such events.  The recorded data can be utilized during system
> recovery routines in the bootloader or early kernel stages of subsequent
> boots, significantly enhancing system diagnostics, reliability, and
> debugging capabilities.
> 
> Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
> ---
>   drivers/thermal/thermal_core.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index 2328ac0d8561..af4e9cf22bf6 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -16,6 +16,7 @@
>   #include <linux/kdev_t.h>
>   #include <linux/idr.h>
>   #include <linux/thermal.h>
> +#include <linux/pscrr.h>
>   #include <linux/reboot.h>
>   #include <linux/string.h>
>   #include <linux/of.h>
> @@ -380,6 +381,8 @@ static void thermal_zone_device_halt(struct thermal_zone_device *tz, bool shutdo
>   
>   	dev_emerg(&tz->device, "%s: critical temperature reached\n", tz->type);
>   
> +	set_power_state_change_reason(PSCR_OVERTEMPERATURE);
> +
>   	if (shutdown)
>   		hw_protection_shutdown(msg, poweroff_delay_ms);
>   	else

In the future could you add me as recipient to the series instead of 
this one ? so I can get more context.

Given there are no so much hw_protection_shutdown() users in the kernel, 
it could be more interesting to change the function to receive a enum 
pscr_reason and then in the hw_protection_shutdown() call 
pscrr_reason_to_str().
Oleksij Rempel March 12, 2025, 4:51 p.m. UTC | #2
Hi Daniel,

On Wed, Mar 12, 2025 at 04:35:51PM +0100, Daniel Lezcano wrote:
> > @@ -380,6 +381,8 @@ static void thermal_zone_device_halt(struct thermal_zone_device *tz, bool shutdo
> >   	dev_emerg(&tz->device, "%s: critical temperature reached\n", tz->type);
> > +	set_power_state_change_reason(PSCR_OVERTEMPERATURE);
> > +
> >   	if (shutdown)
> >   		hw_protection_shutdown(msg, poweroff_delay_ms);
> >   	else
> 
> In the future could you add me as recipient to the series instead of this
> one ? so I can get more context.

ack.

> Given there are no so much hw_protection_shutdown() users in the kernel, it
> could be more interesting to change the function to receive a enum
> pscr_reason and then in the hw_protection_shutdown() call
> pscrr_reason_to_str().
 
Do you mean, make it work with CONFIG_PSCRR=n?

Beside, the latest version is v5:
https://lore.kernel.org/all/20250310103732.423542-1-o.rempel@pengutronix.de/
Daniel Lezcano March 12, 2025, 5:22 p.m. UTC | #3
On 12/03/2025 17:51, Oleksij Rempel wrote:
> Hi Daniel,
> 
> On Wed, Mar 12, 2025 at 04:35:51PM +0100, Daniel Lezcano wrote:
>>> @@ -380,6 +381,8 @@ static void thermal_zone_device_halt(struct thermal_zone_device *tz, bool shutdo
>>>    	dev_emerg(&tz->device, "%s: critical temperature reached\n", tz->type);
>>> +	set_power_state_change_reason(PSCR_OVERTEMPERATURE);
>>> +
>>>    	if (shutdown)
>>>    		hw_protection_shutdown(msg, poweroff_delay_ms);
>>>    	else
>>
>> In the future could you add me as recipient to the series instead of this
>> one ? so I can get more context.
> 
> ack.
> 
>> Given there are no so much hw_protection_shutdown() users in the kernel, it
>> could be more interesting to change the function to receive a enum
>> pscr_reason and then in the hw_protection_shutdown() call
>> pscrr_reason_to_str().
>   
> Do you mean, make it work with CONFIG_PSCRR=n?

No I meant instead of doing:

+	set_power_state_change_reason(PSCR_OVERTEMPERATURE);
+
  	if (shutdown)
  		hw_protection_shutdown(msg, poweroff_delay_ms);

Replace it by:

  	if (shutdown)
  		hw_protection_shutdown(PSCR_OVERTEMPERATURE, poweroff_delay_ms);


and in hw_protection_shutdown() use pscrr_reason_to_str() to display a msg.

That can work with CONFIG_PSCRR=n


> Beside, the latest version is v5:
> https://lore.kernel.org/all/20250310103732.423542-1-o.rempel@pengutronix.de/

Ah ok thanks for the pointer
diff mbox series

Patch

diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 2328ac0d8561..af4e9cf22bf6 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -16,6 +16,7 @@ 
 #include <linux/kdev_t.h>
 #include <linux/idr.h>
 #include <linux/thermal.h>
+#include <linux/pscrr.h>
 #include <linux/reboot.h>
 #include <linux/string.h>
 #include <linux/of.h>
@@ -380,6 +381,8 @@  static void thermal_zone_device_halt(struct thermal_zone_device *tz, bool shutdo
 
 	dev_emerg(&tz->device, "%s: critical temperature reached\n", tz->type);
 
+	set_power_state_change_reason(PSCR_OVERTEMPERATURE);
+
 	if (shutdown)
 		hw_protection_shutdown(msg, poweroff_delay_ms);
 	else