
thermal: core: Add a back up thermal shutdown mechanism

Message ID 1490941820-13511-1-git-send-email-j-keerthy@ti.com (mailing list archive)
State Changes Requested
Delegated to: Eduardo Valentin

Commit Message

J, KEERTHY March 31, 2017, 6:30 a.m. UTC
orderly_poweroff is triggered when a graceful shutdown of the system is
desired. It may be used in many critical states of the kernel, such as
when a subsystem detects a critical temperature. However, in certain
situations during the boot sequence, such as in the middle of driver
probes, userspace will be unable to power off the system cleanly, which
leaves the system in a critical state. In such cases /sbin/poweroff
returns success (having forked off to attempt powering off the system),
but the system as a whole fails to power off completely (since other
modules are still being probed) and remains functional with no userspace
(since that has already shut itself off).

However, there is no clean way of detecting such failure of userspace
powering off the system. In such scenarios, it is necessary for a backup
workqueue to be able to force a shutdown of the system when orderly
shutdown is not successful after a configurable time period.

Reported-by: Nishanth Menon <nm@ti.com>
Signed-off-by: Keerthy <j-keerthy@ti.com>
---
 drivers/thermal/Kconfig        | 13 +++++++++++++
 drivers/thermal/thermal_core.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+)

Comments

Eduardo Valentin April 11, 2017, 5:29 p.m. UTC | #1
Hey,

On Fri, Mar 31, 2017 at 12:00:20PM +0530, Keerthy wrote:
> orderly_poweroff is triggered when a graceful shutdown
> of system is desired. This may be used in many critical states of the
> kernel such as when subsystems detects conditions such as critical
> temperature conditions. However, in certain conditions in system
> boot up sequences like those in the middle of driver probes being
> initiated, userspace will be unable to power off the system in a clean
> manner and leaves the system in a critical state. In cases like these,
> the /sbin/poweroff will return success (having forked off to attempt
> powering off the system. However, the system overall will fail to
> completely poweroff (since other modules will be probed) and the system
> is still functional with no userspace (since that would have shut itself
> off).

OK... This still seems to me to be a corner case that should be fixed in
orderly_poweroff, not in thermal. But..

> 
> However, there is no clean way of detecting such failure of userspace
> powering off the system. In such scenarios, it is necessary for a backup
> workqueue to be able to force a shutdown of the system when orderly
> shutdown is not successful after a configurable time period.
> 

Given that a system running hot is a thermal issue, I guess we care more
about this matter, then..

> Reported-by: Nishanth Menon <nm@ti.com>
> Signed-off-by: Keerthy <j-keerthy@ti.com>
> ---
>  drivers/thermal/Kconfig        | 13 +++++++++++++
>  drivers/thermal/thermal_core.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 55 insertions(+)
> 
> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> index 0a16cf4..4cc55f9 100644
> --- a/drivers/thermal/Kconfig
> +++ b/drivers/thermal/Kconfig
> @@ -15,6 +15,19 @@ menuconfig THERMAL
>  
>  if THERMAL
>  
> +config THERMAL_EMERGENCY_POWEROFF_DELAY_MS
> +	int "Emergency poweroff delay in milli-seconds"
> +	depends on THERMAL
> +	default 0
> +	help
> +	  The number of milliseconds to delay before emergency
> +	  poweroff kicks in. The delay should be carefully profiled
> +	  so as to give adequate time for orderly_poweroff. In case
> +	  of failure of an orderly_poweroff the emergency poweroff
> +	  kicks in after the delay has elapsed and shuts down the system.
> +
> +	  If set to 0 poweroff will happen immediately.
> +
>  config THERMAL_HWMON
>  	bool
>  	prompt "Expose thermal sensors as hwmon device"
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index 11f0675..dc7fdd4 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -322,6 +322,47 @@ static void handle_non_critical_trips(struct thermal_zone_device *tz,
>  		       def_governor->throttle(tz, trip);
>  }
>  
> +/**
> + * emergency_poweroff_func - emergency poweroff work after a known delay
> + * @work: work_struct associated with the emergency poweroff function
> + *
> + * This function is called in very critical situations to force
> + * a kernel poweroff after a configurable timeout value.
> + */
> +static void emergency_poweroff_func(struct work_struct *work)
> +{
> +	/**
> +	 * We have reached here after the emergency thermal shutdown
> +	 * Waiting period has expired. This means orderly_poweroff has
> +	 * not been able to shut off the system for some reason.
> +	 * Try to shut down the system immediately using pm_power_off
> +	 * if populated
> +	 */

The above is not a kernel doc entry...

> +	pr_warn("Attempting kernel_power_off\n");
> +	if (pm_power_off)
> +		pm_power_off();

Why not call kernel_power_off() directly instead? That is what orderly
poweroff calls when it fails, which seems to be the missing part when
userland returns success and kernel_power_off() is therefore never
called. That path goes through machine_power_off(), which seems to be
the default for pm_power_off() anyway.

kernel_power_off() handles the power off system call too.
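
For anyone comparing the two paths, here is a rough userspace model, not
kernel code. The function names mirror the kernel ones mentioned above,
but the bodies, the event log, and the run_*() helpers are assumptions
made up for illustration; the real kernel_power_off() does more and
differs between kernel versions and architectures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Event log so the order of steps can be inspected (illustrative only). */
#define MAX_EVENTS 8
static const char *events[MAX_EVENTS];
static size_t n_events;

static void log_event(const char *name)
{
	assert(n_events < MAX_EVENTS);
	events[n_events++] = name;
}

/* Stand-in for the platform hook; NULL when no handler is registered. */
static void (*pm_power_off)(void);

static void fake_platform_power_off(void)
{
	log_event("pm_power_off");
}

/* Model of the arch layer: calls the hook only if it is populated. */
static void machine_power_off(void)
{
	log_event("machine_power_off");
	if (pm_power_off)
		pm_power_off();
}

/* Model of kernel_power_off(): shutdown preparation, then the arch layer. */
static void kernel_power_off(void)
{
	log_event("shutdown_prepare");	/* notifiers, device shutdown */
	log_event("syscore_shutdown");
	machine_power_off();
}

/* Hypothetical helper: run the full path with a hook installed. */
size_t run_kernel_power_off_path(void)
{
	n_events = 0;
	pm_power_off = fake_platform_power_off;
	kernel_power_off();
	return n_events;
}

/* Hypothetical helper: call the hook directly, as the patch does. */
size_t run_direct_path(void)
{
	n_events = 0;
	pm_power_off = fake_platform_power_off;
	if (pm_power_off)
		pm_power_off();
	return n_events;
}
```

In this model the direct call skips the preparation steps entirely,
which is the trade-off in question: fewer things that can go wrong on
the way, but nothing gets a chance to shut down cleanly.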

> +
> +	/**

not a kernel doc entry...

> +	 * Worst of the worst case trigger emergency restart
> +	 */
> +	pr_warn("kernel_power_off has failed! Attempting emergency_restart\n");
> +	emergency_restart();
> +}
> +
> +static DECLARE_DELAYED_WORK(emergency_poweroff_work, emergency_poweroff_func);
> +
> +/**
> + * emergency_poweroff - Trigger an emergency system poweroff
> + *
> + * This may be called from any critical situation to trigger a system shutdown
> + * after a known period of time. By default the delay is 0 millisecond
> + */
> +void thermal_emergency_poweroff(void)
> +{
> +	schedule_delayed_work(&emergency_poweroff_work,
> +			      msecs_to_jiffies(CONFIG_THERMAL_EMERGENCY_POWEROFF_DELAY_MS));
> +}
> +
>  static void handle_critical_trips(struct thermal_zone_device *tz,
>  				  int trip, enum thermal_trip_type trip_type)
>  {
> @@ -343,6 +384,7 @@ static void handle_critical_trips(struct thermal_zone_device *tz,
>  			  "critical temperature reached(%d C),shutting down\n",
>  			  tz->temperature / 1000);
>  		orderly_poweroff(true);
> +		thermal_emergency_poweroff();

Shouldn't we start counting the timeout before calling orderly_poweroff?

>  	}
>  }
>  
> -- 
> 1.9.1
>
J, KEERTHY April 12, 2017, 2:49 a.m. UTC | #2
On Tuesday 11 April 2017 10:59 PM, Eduardo Valentin wrote:
> Hey,
> 
> On Fri, Mar 31, 2017 at 12:00:20PM +0530, Keerthy wrote:
>> orderly_poweroff is triggered when a graceful shutdown
>> of system is desired. This may be used in many critical states of the
>> kernel such as when subsystems detects conditions such as critical
>> temperature conditions. However, in certain conditions in system
>> boot up sequences like those in the middle of driver probes being
>> initiated, userspace will be unable to power off the system in a clean
>> manner and leaves the system in a critical state. In cases like these,
>> the /sbin/poweroff will return success (having forked off to attempt
>> powering off the system. However, the system overall will fail to
>> completely poweroff (since other modules will be probed) and the system
>> is still functional with no userspace (since that would have shut itself
>> off).
> 
> OK... This seams to me, still a corner case supposed to be fixed at
> orderly_power_off, not at thermal. But..
> 
>>
>> However, there is no clean way of detecting such failure of userspace
>> powering off the system. In such scenarios, it is necessary for a backup
>> workqueue to be able to force a shutdown of the system when orderly
>> shutdown is not successful after a configurable time period.
>>
> 
> Given that system running hot is a thermal issue, I guess we care more
> on this matter then..

Yes!

> 
>> Reported-by: Nishanth Menon <nm@ti.com>
>> Signed-off-by: Keerthy <j-keerthy@ti.com>
>> ---
>>  drivers/thermal/Kconfig        | 13 +++++++++++++
>>  drivers/thermal/thermal_core.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 55 insertions(+)
>>
>> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
>> index 0a16cf4..4cc55f9 100644
>> --- a/drivers/thermal/Kconfig
>> +++ b/drivers/thermal/Kconfig
>> @@ -15,6 +15,19 @@ menuconfig THERMAL
>>  
>>  if THERMAL
>>  
>> +config THERMAL_EMERGENCY_POWEROFF_DELAY_MS
>> +	int "Emergency poweroff delay in milli-seconds"
>> +	depends on THERMAL
>> +	default 0
>> +	help
>> +	  The number of milliseconds to delay before emergency
>> +	  poweroff kicks in. The delay should be carefully profiled
>> +	  so as to give adequate time for orderly_poweroff. In case
>> +	  of failure of an orderly_poweroff the emergency poweroff
>> +	  kicks in after the delay has elapsed and shuts down the system.
>> +
>> +	  If set to 0 poweroff will happen immediately.
>> +
>>  config THERMAL_HWMON
>>  	bool
>>  	prompt "Expose thermal sensors as hwmon device"
>> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
>> index 11f0675..dc7fdd4 100644
>> --- a/drivers/thermal/thermal_core.c
>> +++ b/drivers/thermal/thermal_core.c
>> @@ -322,6 +322,47 @@ static void handle_non_critical_trips(struct thermal_zone_device *tz,
>>  		       def_governor->throttle(tz, trip);
>>  }
>>  
>> +/**
>> + * emergency_poweroff_func - emergency poweroff work after a known delay
>> + * @work: work_struct associated with the emergency poweroff function
>> + *
>> + * This function is called in very critical situations to force
>> + * a kernel poweroff after a configurable timeout value.
>> + */
>> +static void emergency_poweroff_func(struct work_struct *work)
>> +{
>> +	/**
>> +	 * We have reached here after the emergency thermal shutdown
>> +	 * Waiting period has expired. This means orderly_poweroff has
>> +	 * not been able to shut off the system for some reason.
>> +	 * Try to shut down the system immediately using pm_power_off
>> +	 * if populated
>> +	 */
> 
> The above is not a kernel doc entry...

I will fix that.

> 
>> +	pr_warn("Attempting kernel_power_off\n");
>> +	if (pm_power_off)
>> +		pm_power_off();
> 
> Why not calling kernel_power_off() directly instead? That is what is called by orderly
> power off in case it fails, which seams to be  the missing part when
> user land returns success, and therefore we don't call
> kernel_power_off(). That path goes through the machine_power_off(),
> which seams to be the default for pm_power_off() anyway.
> 
> kernel_power_off() handles the power off system call too.

Yes. This runs after orderly_poweroff has failed, so I felt there was no
reason to go through kernel_power_off; calling pm_power_off directly
pulls the power plug. This is a dire-straits situation, hence I
preferred to call the last piece directly.

> 
>> +
>> +	/**
> 
> not a kernel doc entry...

Okay.

> 
>> +	 * Worst of the worst case trigger emergency restart
>> +	 */
>> +	pr_warn("kernel_power_off has failed! Attempting emergency_restart\n");
>> +	emergency_restart();
>> +}
>> +
>> +static DECLARE_DELAYED_WORK(emergency_poweroff_work, emergency_poweroff_func);
>> +
>> +/**
>> + * emergency_poweroff - Trigger an emergency system poweroff
>> + *
>> + * This may be called from any critical situation to trigger a system shutdown
>> + * after a known period of time. By default the delay is 0 millisecond
>> + */
>> +void thermal_emergency_poweroff(void)
>> +{
>> +	schedule_delayed_work(&emergency_poweroff_work,
>> +			      msecs_to_jiffies(CONFIG_THERMAL_EMERGENCY_POWEROFF_DELAY_MS));
>> +}
>> +
>>  static void handle_critical_trips(struct thermal_zone_device *tz,
>>  				  int trip, enum thermal_trip_type trip_type)
>>  {
>> @@ -343,6 +384,7 @@ static void handle_critical_trips(struct thermal_zone_device *tz,
>>  			  "critical temperature reached(%d C),shutting down\n",
>>  			  tz->temperature / 1000);
>>  		orderly_poweroff(true);
>> +		thermal_emergency_poweroff();
> 
> Shouldn't we start count the timeout before calling orderly_poweroff?

Okay, yes. That makes more sense: queue the emergency function, start
the countdown, and immediately call orderly_poweroff. I will fix the
above comments and send a v2. I still want to go with pm_power_off over
kernel_power_off, as the timeout has already elapsed by then and the
first thing we want is to shut off the SoC! Let me know.
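
As a rough illustration of that ordering, here is a small userspace
model, not the v2 patch itself. The struct and the model_*() functions
are hypothetical names invented for this sketch, and time is just a
millisecond counter passed in by the caller.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state; none of these names exist in the kernel. */
struct shutdown_model {
	bool watchdog_armed;
	unsigned int deadline_ms;	/* time at which the fallback fires */
	bool orderly_requested;
	bool powered_off;
	bool forced;			/* true if the fallback did the power-off */
};

/*
 * Critical trip: arm the fallback first so the countdown covers the
 * whole orderly attempt, then request the orderly shutdown.
 */
void model_critical_trip(struct shutdown_model *m, unsigned int now_ms,
			 unsigned int delay_ms)
{
	assert(m);
	m->watchdog_armed = true;
	m->deadline_ms = now_ms + delay_ms;
	m->orderly_requested = true;	/* stands in for orderly_poweroff(true) */
}

/*
 * Advance the model clock. userspace_done says whether userspace has
 * managed to complete the shutdown by now_ms.
 */
void model_tick(struct shutdown_model *m, unsigned int now_ms,
		bool userspace_done)
{
	assert(m);
	if (m->powered_off)
		return;
	if (m->orderly_requested && userspace_done) {
		m->powered_off = true;	/* graceful path won */
		return;
	}
	if (m->watchdog_armed && now_ms >= m->deadline_ms) {
		m->powered_off = true;	/* fallback path fired */
		m->forced = true;
	}
}
```

With a 5000 ms delay armed at t=100, a tick at t=2000 with userspace
still stuck leaves the model running, and a tick at t=5100 forces the
power-off. If userspace finishes before the deadline, the fallback never
fires.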


> 
>>  	}
>>  }
>>  
>> -- 
>> 1.9.1
>>
Zhang, Rui April 12, 2017, 3:20 a.m. UTC | #3
On Wed, 2017-04-12 at 08:19 +0530, Keerthy wrote:
> 
> On Tuesday 11 April 2017 10:59 PM, Eduardo Valentin wrote:
> > 
> > Hey,
> > 
> > On Fri, Mar 31, 2017 at 12:00:20PM +0530, Keerthy wrote:
> > > 
> > > orderly_poweroff is triggered when a graceful shutdown
> > > of system is desired. This may be used in many critical states of
> > > the
> > > kernel such as when subsystems detects conditions such as
> > > critical
> > > temperature conditions. However, in certain conditions in system
> > > boot up sequences like those in the middle of driver probes being
> > > initiated, userspace will be unable to power off the system in a
> > > clean
> > > manner and leaves the system in a critical state. In cases like
> > > these,
> > > the /sbin/poweroff will return success (having forked off to
> > > attempt
> > > powering off the system. However, the system overall will fail to
> > > completely poweroff (since other modules will be probed) and the
> > > system
> > > is still functional with no userspace (since that would have shut
> > > itself
> > > off).
> > OK... This seams to me, still a corner case supposed to be fixed at
> > orderly_power_off, not at thermal. But..
> > 
> > > 
> > > 
> > > However, there is no clean way of detecting such failure of
> > > userspace
> > > powering off the system. In such scenarios, it is necessary for a
> > > backup
> > > workqueue to be able to force a shutdown of the system when
> > > orderly
> > > shutdown is not successful after a configurable time period.
> > > 
> > Given that system running hot is a thermal issue, I guess we care
> > more
> > on this matter then..
> Yes!
> 
I just read this thread again to recall the previous discussion:
https://patchwork.kernel.org/patch/8024581/

https://patchwork.kernel.org/patch/8149891/
https://patchwork.kernel.org/patch/8149861/
should be the solution based on Ingo's suggestion, right?

To me, that sounds like the right direction to go. Thermal does not
need a backup shutdown solution; it just needs a kernel function call
that guarantees the system can be shut down or rebooted immediately.

Is there any reason that patch 1/2 was not accepted?

thanks,
rui
> > 
> > 
> > > 
> > > Reported-by: Nishanth Menon <nm@ti.com>
> > > Signed-off-by: Keerthy <j-keerthy@ti.com>
> > > ---
> > >  drivers/thermal/Kconfig        | 13 +++++++++++++
> > >  drivers/thermal/thermal_core.c | 42
> > > ++++++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 55 insertions(+)
> > > 
> > > diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> > > index 0a16cf4..4cc55f9 100644
> > > --- a/drivers/thermal/Kconfig
> > > +++ b/drivers/thermal/Kconfig
> > > @@ -15,6 +15,19 @@ menuconfig THERMAL
> > >  
> > >  if THERMAL
> > >  
> > > +config THERMAL_EMERGENCY_POWEROFF_DELAY_MS
> > > +	int "Emergency poweroff delay in milli-seconds"
> > > +	depends on THERMAL
> > > +	default 0
> > > +	help
> > > +	  The number of milliseconds to delay before emergency
> > > +	  poweroff kicks in. The delay should be carefully
> > > profiled
> > > +	  so as to give adequate time for orderly_poweroff. In
> > > case
> > > +	  of failure of an orderly_poweroff the emergency
> > > poweroff
> > > +	  kicks in after the delay has elapsed and shuts down
> > > the system.
> > > +
> > > +	  If set to 0 poweroff will happen immediately.
> > > +
> > >  config THERMAL_HWMON
> > >  	bool
> > >  	prompt "Expose thermal sensors as hwmon device"
> > > diff --git a/drivers/thermal/thermal_core.c
> > > b/drivers/thermal/thermal_core.c
> > > index 11f0675..dc7fdd4 100644
> > > --- a/drivers/thermal/thermal_core.c
> > > +++ b/drivers/thermal/thermal_core.c
> > > @@ -322,6 +322,47 @@ static void handle_non_critical_trips(struct
> > > thermal_zone_device *tz,
> > >  		       def_governor->throttle(tz, trip);
> > >  }
> > >  
> > > +/**
> > > + * emergency_poweroff_func - emergency poweroff work after a
> > > known delay
> > > + * @work: work_struct associated with the emergency poweroff
> > > function
> > > + *
> > > + * This function is called in very critical situations to force
> > > + * a kernel poweroff after a configurable timeout value.
> > > + */
> > > +static void emergency_poweroff_func(struct work_struct *work)
> > > +{
> > > +	/**
> > > +	 * We have reached here after the emergency thermal
> > > shutdown
> > > +	 * Waiting period has expired. This means
> > > orderly_poweroff has
> > > +	 * not been able to shut off the system for some reason.
> > > +	 * Try to shut down the system immediately using
> > > pm_power_off
> > > +	 * if populated
> > > +	 */
> > The above is not a kernel doc entry...
> I will fix that.
> 
> > 
> > 
> > > 
> > > +	pr_warn("Attempting kernel_power_off\n");
> > > +	if (pm_power_off)
> > > +		pm_power_off();
> > Why not calling kernel_power_off() directly instead? That is what
> > is called by orderly
> > power off in case it fails, which seams to be  the missing part
> > when
> > user land returns success, and therefore we don't call
> > kernel_power_off(). That path goes through the machine_power_off(),
> > which seams to be the default for pm_power_off() anyway.
> > 
> > kernel_power_off() handles the power off system call too.
> Yes. This is after orderly_poweroff fails so i felt why go through
> kernel_power_off and directly call pm_power_off which directly pulls
> out
> the power plug. This is in dire straits situation. Hence preferred to
> call the last piece directly.
> 
> > 
> > 
> > > 
> > > +
> > > +	/**
> > not a kernel doc entry...
> Okay.
> 
> > 
> > 
> > > 
> > > +	 * Worst of the worst case trigger emergency restart
> > > +	 */
> > > +	pr_warn("kernel_power_off has failed! Attempting
> > > emergency_restart\n");
> > > +	emergency_restart();
> > > +}
> > > +
> > > +static DECLARE_DELAYED_WORK(emergency_poweroff_work,
> > > emergency_poweroff_func);
> > > +
> > > +/**
> > > + * emergency_poweroff - Trigger an emergency system poweroff
> > > + *
> > > + * This may be called from any critical situation to trigger a
> > > system shutdown
> > > + * after a known period of time. By default the delay is 0
> > > millisecond
> > > + */
> > > +void thermal_emergency_poweroff(void)
> > > +{
> > > +	schedule_delayed_work(&emergency_poweroff_work,
> > > +			      msecs_to_jiffies(CONFIG_THERMAL_EM
> > > ERGENCY_POWEROFF_DELAY_MS));
> > > +}
> > > +
> > >  static void handle_critical_trips(struct thermal_zone_device
> > > *tz,
> > >  				  int trip, enum
> > > thermal_trip_type trip_type)
> > >  {
> > > @@ -343,6 +384,7 @@ static void handle_critical_trips(struct
> > > thermal_zone_device *tz,
> > >  			  "critical temperature reached(%d
> > > C),shutting down\n",
> > >  			  tz->temperature / 1000);
> > >  		orderly_poweroff(true);
> > > +		thermal_emergency_poweroff();
> > Shouldn't we start count the timeout before calling
> > orderly_poweroff?
> Okay yes. That makes more sense. Queue the emergency function, start
> the
> countdown and immediately call the orderly_poweroff. I will fix the
> above comments and send a v2. I still want to go with pm_power_off
> over
> kernel_poweroff as we have already elapsed the time out and the first
> thing we want is to shut off the SoC! Let me know.
> 
> 
> > 
> > 
> > > 
> > >  	}
> > >  }
> > >
J, KEERTHY April 12, 2017, 3:39 a.m. UTC | #4
On Wednesday 12 April 2017 08:50 AM, Zhang Rui wrote:
> On Wed, 2017-04-12 at 08:19 +0530, Keerthy wrote:
>>
>> On Tuesday 11 April 2017 10:59 PM, Eduardo Valentin wrote:
>>>
>>> Hey,
>>>
>>> On Fri, Mar 31, 2017 at 12:00:20PM +0530, Keerthy wrote:
>>>>
>>>> orderly_poweroff is triggered when a graceful shutdown
>>>> of system is desired. This may be used in many critical states of
>>>> the
>>>> kernel such as when subsystems detects conditions such as
>>>> critical
>>>> temperature conditions. However, in certain conditions in system
>>>> boot up sequences like those in the middle of driver probes being
>>>> initiated, userspace will be unable to power off the system in a
>>>> clean
>>>> manner and leaves the system in a critical state. In cases like
>>>> these,
>>>> the /sbin/poweroff will return success (having forked off to
>>>> attempt
>>>> powering off the system. However, the system overall will fail to
>>>> completely poweroff (since other modules will be probed) and the
>>>> system
>>>> is still functional with no userspace (since that would have shut
>>>> itself
>>>> off).
>>> OK... This seams to me, still a corner case supposed to be fixed at
>>> orderly_power_off, not at thermal. But..
>>>
>>>>
>>>>
>>>> However, there is no clean way of detecting such failure of
>>>> userspace
>>>> powering off the system. In such scenarios, it is necessary for a
>>>> backup
>>>> workqueue to be able to force a shutdown of the system when
>>>> orderly
>>>> shutdown is not successful after a configurable time period.
>>>>
>>> Given that system running hot is a thermal issue, I guess we care
>>> more
>>> on this matter then..
>> Yes!
>>
> I just read this thread again https://patchwork.kernel.org/patch/802458
> 1/ to recall the previous discussion.
> 
> https://patchwork.kernel.org/patch/8149891/
> https://patchwork.kernel.org/patch/8149861/
> should be the solution made based on Ingo' suggestion, right?
> 
> And to me, this sounds like the right direction to go, thermal does not
> need a back up shutdown solution, it just needs a kernel function call
> which guarantees the system can be shutdown/reboot immediately.
> 
> is there any reason that patch 1/2 is not accepted?

Zhang,

http://www.serverphorums.com/read.php?12,1400964

I got a NAK from Alan and was pointed towards a thermal_poweroff
approach, which is more or less what this patch does.

Thanks,
Keerthy

> 
> thanks,
> rui
>>>
>>>
>>>>
>>>> Reported-by: Nishanth Menon <nm@ti.com>
>>>> Signed-off-by: Keerthy <j-keerthy@ti.com>
>>>> ---
>>>>  drivers/thermal/Kconfig        | 13 +++++++++++++
>>>>  drivers/thermal/thermal_core.c | 42
>>>> ++++++++++++++++++++++++++++++++++++++++++
>>>>  2 files changed, 55 insertions(+)
>>>>
>>>> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
>>>> index 0a16cf4..4cc55f9 100644
>>>> --- a/drivers/thermal/Kconfig
>>>> +++ b/drivers/thermal/Kconfig
>>>> @@ -15,6 +15,19 @@ menuconfig THERMAL
>>>>  
>>>>  if THERMAL
>>>>  
>>>> +config THERMAL_EMERGENCY_POWEROFF_DELAY_MS
>>>> +	int "Emergency poweroff delay in milli-seconds"
>>>> +	depends on THERMAL
>>>> +	default 0
>>>> +	help
>>>> +	  The number of milliseconds to delay before emergency
>>>> +	  poweroff kicks in. The delay should be carefully
>>>> profiled
>>>> +	  so as to give adequate time for orderly_poweroff. In
>>>> case
>>>> +	  of failure of an orderly_poweroff the emergency
>>>> poweroff
>>>> +	  kicks in after the delay has elapsed and shuts down
>>>> the system.
>>>> +
>>>> +	  If set to 0 poweroff will happen immediately.
>>>> +
>>>>  config THERMAL_HWMON
>>>>  	bool
>>>>  	prompt "Expose thermal sensors as hwmon device"
>>>> diff --git a/drivers/thermal/thermal_core.c
>>>> b/drivers/thermal/thermal_core.c
>>>> index 11f0675..dc7fdd4 100644
>>>> --- a/drivers/thermal/thermal_core.c
>>>> +++ b/drivers/thermal/thermal_core.c
>>>> @@ -322,6 +322,47 @@ static void handle_non_critical_trips(struct
>>>> thermal_zone_device *tz,
>>>>  		       def_governor->throttle(tz, trip);
>>>>  }
>>>>  
>>>> +/**
>>>> + * emergency_poweroff_func - emergency poweroff work after a
>>>> known delay
>>>> + * @work: work_struct associated with the emergency poweroff
>>>> function
>>>> + *
>>>> + * This function is called in very critical situations to force
>>>> + * a kernel poweroff after a configurable timeout value.
>>>> + */
>>>> +static void emergency_poweroff_func(struct work_struct *work)
>>>> +{
>>>> +	/**
>>>> +	 * We have reached here after the emergency thermal
>>>> shutdown
>>>> +	 * Waiting period has expired. This means
>>>> orderly_poweroff has
>>>> +	 * not been able to shut off the system for some reason.
>>>> +	 * Try to shut down the system immediately using
>>>> pm_power_off
>>>> +	 * if populated
>>>> +	 */
>>> The above is not a kernel doc entry...
>> I will fix that.
>>
>>>
>>>
>>>>
>>>> +	pr_warn("Attempting kernel_power_off\n");
>>>> +	if (pm_power_off)
>>>> +		pm_power_off();
>>> Why not calling kernel_power_off() directly instead? That is what
>>> is called by orderly
>>> power off in case it fails, which seams to be  the missing part
>>> when
>>> user land returns success, and therefore we don't call
>>> kernel_power_off(). That path goes through the machine_power_off(),
>>> which seams to be the default for pm_power_off() anyway.
>>>
>>> kernel_power_off() handles the power off system call too.
>> Yes. This is after orderly_poweroff fails so i felt why go through
>> kernel_power_off and directly call pm_power_off which directly pulls
>> out
>> the power plug. This is in dire straits situation. Hence preferred to
>> call the last piece directly.
>>
>>>
>>>
>>>>
>>>> +
>>>> +	/**
>>> not a kernel doc entry...
>> Okay.
>>
>>>
>>>
>>>>
>>>> +	 * Worst of the worst case trigger emergency restart
>>>> +	 */
>>>> +	pr_warn("kernel_power_off has failed! Attempting
>>>> emergency_restart\n");
>>>> +	emergency_restart();
>>>> +}
>>>> +
>>>> +static DECLARE_DELAYED_WORK(emergency_poweroff_work,
>>>> emergency_poweroff_func);
>>>> +
>>>> +/**
>>>> + * emergency_poweroff - Trigger an emergency system poweroff
>>>> + *
>>>> + * This may be called from any critical situation to trigger a
>>>> system shutdown
>>>> + * after a known period of time. By default the delay is 0
>>>> millisecond
>>>> + */
>>>> +void thermal_emergency_poweroff(void)
>>>> +{
>>>> +	schedule_delayed_work(&emergency_poweroff_work,
>>>> +			      msecs_to_jiffies(CONFIG_THERMAL_EM
>>>> ERGENCY_POWEROFF_DELAY_MS));
>>>> +}
>>>> +
>>>>  static void handle_critical_trips(struct thermal_zone_device
>>>> *tz,
>>>>  				  int trip, enum
>>>> thermal_trip_type trip_type)
>>>>  {
>>>> @@ -343,6 +384,7 @@ static void handle_critical_trips(struct
>>>> thermal_zone_device *tz,
>>>>  			  "critical temperature reached(%d
>>>> C),shutting down\n",
>>>>  			  tz->temperature / 1000);
>>>>  		orderly_poweroff(true);
>>>> +		thermal_emergency_poweroff();
>>> Shouldn't we start count the timeout before calling
>>> orderly_poweroff?
>> Okay yes. That makes more sense. Queue the emergency function, start
>> the
>> countdown and immediately call the orderly_poweroff. I will fix the
>> above comments and send a v2. I still want to go with pm_power_off
>> over
>> kernel_poweroff as we have already elapsed the time out and the first
>> thing we want is to shut off the SoC! Let me know.
>>
>>
>>>
>>>
>>>>
>>>>  	}
>>>>  }
>>>>
Eduardo Valentin April 12, 2017, 4:05 a.m. UTC | #5
Keerthy,

On Wed, Apr 12, 2017 at 09:09:36AM +0530, Keerthy wrote:
> 
> 
> On Wednesday 12 April 2017 08:50 AM, Zhang Rui wrote:
> > On Wed, 2017-04-12 at 08:19 +0530, Keerthy wrote:
> >>
> >> On Tuesday 11 April 2017 10:59 PM, Eduardo Valentin wrote:
> >>>
> >>> Hey,
> >>>
> >>> On Fri, Mar 31, 2017 at 12:00:20PM +0530, Keerthy wrote:
> >>>>
> >>>> off).

<cut>

> >>> OK... This seams to me, still a corner case supposed to be fixed at
> >>> orderly_power_off, not at thermal. But..
> >>>

^^^ Then again, this must not be fixed in the thermal core. And
re-reading the history of this thread, this really seems to be something
broken on OMAP/DRA7, as mentioned in previous messages. That is probably
why you are pushing for pm_power_off(), which seems to be the one that
works for you, pulling the plug correctly (DRA7).

> >>>>
> >>>>
> >>>> However, there is no clean way of detecting such failure of
> >>>> userspace
> >>>> powering off the system. In such scenarios, it is necessary for a
> >>>> backup
> >>>> workqueue to be able to force a shutdown of the system when
> >>>> orderly
> >>>> shutdown is not successful after a configurable time period.
> >>>>
> >>> Given that system running hot is a thermal issue, I guess we care
> >>> more
> >>> on this matter then..
> >> Yes!
> >>
> > I just read this thread again https://patchwork.kernel.org/patch/802458
> > 1/ to recall the previous discussion.
> > 
> > https://patchwork.kernel.org/patch/8149891/
> > https://patchwork.kernel.org/patch/8149861/
> > should be the solution made based on Ingo' suggestion, right?
> > 
> > And to me, this sounds like the right direction to go, thermal does not
> > need a back up shutdown solution, it just needs a kernel function call
> > which guarantees the system can be shutdown/reboot immediately.
> > 
> > is there any reason that patch 1/2 is not accepted?
> 
> Zhang,
> 
> http://www.serverphorums.com/read.php?12,1400964
> 
> I got a NAK from Alan and was given this direction on a thermal_poweroff
> which is more or less what is done in this patch.
> 


Actually, Alan's suggestion was more for you to define a
thermal_poweroff() that can be implemented per architecture.

Also, please keep track of your patch versions and copy everybody who
has stated an opinion in previous discussions. These patches must have
Ingo, Alan, and RMK copied too. That way we avoid losing track of what
has been suggested, and we converge faster on something everybody (or
most of us) agrees with. Please fix that in the next version.


To me, thermal core needs a function that simply powers off the system.
No timeouts, delayed works, backups, etc. Simple and straight.

The idea of having a per architecture implementation, as per Alan's
suggestion, makes sense to me too. Having something different from
pm_power_off(), specific to thermal, might also give the opportunity to
save the power off reason.
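
One possible reading of that idea, sketched as a userspace model rather
than a kernel interface. Everything here is hypothetical: the reason
enum, the registration function, and the counters exist only for this
illustration, and a real implementation would have to store the reason
somewhere that survives the power cut.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reason codes; not an existing kernel enum. */
enum poweroff_reason {
	POWEROFF_REASON_NONE = 0,
	POWEROFF_REASON_THERMAL,
};

static enum poweroff_reason last_reason = POWEROFF_REASON_NONE;
static int generic_calls;	/* how often the generic fallback ran */
static int arch_calls;		/* how often the arch handler ran */

/* Generic fallback used when no architecture supplies its own handler. */
static void generic_thermal_poweroff(void)
{
	generic_calls++;
}

/* Hypothetical per-architecture override, NULL until registered. */
static void (*arch_thermal_poweroff)(void);

/* Example architecture handler, standing in for a platform power cut. */
void example_arch_poweroff(void)
{
	arch_calls++;
}

void register_arch_thermal_poweroff(void (*fn)(void))
{
	assert(fn != NULL);
	arch_thermal_poweroff = fn;
}

/* Single entry point the thermal core would call on a critical trip. */
void thermal_poweroff(void)
{
	/* Record why we are going down before power is actually cut. */
	last_reason = POWEROFF_REASON_THERMAL;
	if (arch_thermal_poweroff)
		arch_thermal_poweroff();
	else
		generic_thermal_poweroff();
}

enum poweroff_reason get_last_poweroff_reason(void)
{
	return last_reason;
}
```

The thermal core would then have a single call to make, with no timers
of its own, and each architecture decides what "power off now" means.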

BR,

Eduardo Valentin
J, KEERTHY April 12, 2017, 4:18 a.m. UTC | #6
On Wednesday 12 April 2017 09:35 AM, Eduardo Valentin wrote:
> Keerthy,
> 
> On Wed, Apr 12, 2017 at 09:09:36AM +0530, Keerthy wrote:
>>
>>
>> On Wednesday 12 April 2017 08:50 AM, Zhang Rui wrote:
>>> On Wed, 2017-04-12 at 08:19 +0530, Keerthy wrote:
>>>>
>>>> On Tuesday 11 April 2017 10:59 PM, Eduardo Valentin wrote:
>>>>>
>>>>> Hey,
>>>>>
>>>>> On Fri, Mar 31, 2017 at 12:00:20PM +0530, Keerthy wrote:
>>>>>>
>>>>>> off).
> 
> <cut>
> 
>>>>> OK... This seams to me, still a corner case supposed to be fixed at
>>>>> orderly_power_off, not at thermal. But..
>>>>>
> 
> ^^^ Then again, this must be fixed not at thermal core. And re-reading
> the history of this thread, this seams to be really something broken at
> OMAP/DRA7, as mentioned in previous messages. That is probably why you
> are pushing for pm_power_off(), which seams to be the one that works for
> you, pulling the plug correctly (DRA7).
> 
>>>>>>
>>>>>>
>>>>>> However, there is no clean way of detecting such failure of
>>>>>> userspace
>>>>>> powering off the system. In such scenarios, it is necessary for a
>>>>>> backup
>>>>>> workqueue to be able to force a shutdown of the system when
>>>>>> orderly
>>>>>> shutdown is not successful after a configurable time period.
>>>>>>
>>>>> Given that system running hot is a thermal issue, I guess we care
>>>>> more
>>>>> on this matter then..
>>>> Yes!
>>>>
>>> I just read this thread again https://patchwork.kernel.org/patch/802458
>>> 1/ to recall the previous discussion.
>>>
>>> https://patchwork.kernel.org/patch/8149891/
>>> https://patchwork.kernel.org/patch/8149861/
>>> should be the solution made based on Ingo' suggestion, right?
>>>
>>> And to me, this sounds like the right direction to go, thermal does not
>>> need a back up shutdown solution, it just needs a kernel function call
>>> which guarantees the system can be shutdown/reboot immediately.
>>>
>>> is there any reason that patch 1/2 is not accepted?
>>
>> Zhang,
>>
>> http://www.serverphorums.com/read.php?12,1400964
>>
>> I got a NAK from Alan and was given this direction on a thermal_poweroff
>> which is more or less what is done in this patch.
>>
> 
> 
> Actually, Alan's suggestion is more for you to define a
> thermal_poweroff() that can be defined per architecture.
> 
> Also, please, keep track of your patch versions and also do copy
> everybody who has stated their opinion on previous discussions. These
> patches must have Ingo, Alan, and RMK copied too. In this way we avoid
> loosing track of what has been suggested and we also converge faster to
> something everybody (or most of us) agree. Next version, please, fix
> that.

Sure. This was resurrected from last year. I will add the links to the
previous discussions. My bad.

> 
> 
> To me, thermal core needs a function that simply powers off the system.
> No timeouts, delayed works, backups, etc. Simple and straight.

You mean replacing orderly_poweroff() when a critical trip point is
crossed with a thermal-specific thermal_poweroff() function that ensures
the hardware is indeed shut off?

> 
> The idea of having a per architecture implementation, as per Alan's
> suggestion, makes sense to me too. Having something different from
> pm_power_off(), specific to thermal, might also give the opportunity to
> save the power off reason.

I did not get the 'save the power off reason' point. Care to explain more?

> 
> BR,
> 
> Eduardo Valentin
>
J, KEERTHY April 12, 2017, 7:55 a.m. UTC | #7
On Wednesday 12 April 2017 09:35 AM, Eduardo Valentin wrote:
> Keerthy,
> 
> On Wed, Apr 12, 2017 at 09:09:36AM +0530, Keerthy wrote:
>>
>>
>> On Wednesday 12 April 2017 08:50 AM, Zhang Rui wrote:
>>> On Wed, 2017-04-12 at 08:19 +0530, Keerthy wrote:
>>>>
>>>> On Tuesday 11 April 2017 10:59 PM, Eduardo Valentin wrote:
>>>>>
>>>>> Hey,
>>>>>
>>>>> On Fri, Mar 31, 2017 at 12:00:20PM +0530, Keerthy wrote:
>>>>>>
>>>>>> off).
> 
> <cut>
> 
>>>>> OK... This seams to me, still a corner case supposed to be fixed at
>>>>> orderly_power_off, not at thermal. But..
>>>>>
> 
> ^^^ Then again, this must be fixed not at thermal core. And re-reading
> the history of this thread, this seams to be really something broken at
> OMAP/DRA7, as mentioned in previous messages. That is probably why you
> are pushing for pm_power_off(), which seams to be the one that works for
> you, pulling the plug correctly (DRA7).

Zhang/Eduardo,

OMAP5/DRA7 is one case.

I believe this is the root cause of the failure:

thermal_zone_device_check --> thermal_zone_device_update -->
handle_thermal_trip --> handle_critical_trips --> orderly_poweroff

The above sequence runs every 250/500 ms, depending on the configuration,
so orderly_poweroff() is called every 250/500 ms. With a full-fledged NFS
file system, shutdown takes at least 5-10 seconds, and during that time
we bombard the system with orderly_poweroff() calls because
thermal_zone_device_check keeps triggering periodically.

To confirm this, I made sure that handle_critical_trips calls
orderly_poweroff only once, and I no longer see the failure on the
DRA72-EVM board.

So IMHO, once we reach the handle_critical_trips case where we call
orderly_poweroff, we need to do the following:

1) Make sure that orderly_poweroff is called only once.
2) Cancel all the scheduled work queues that monitor the temperature, as
we have already reached the point of shutting down the system.
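
To illustrate point 1, here is a minimal user-space sketch (not the
actual thermal core code) of a run-once guard. The sim_* names and the
poweroff_triggered flag are made up for illustration, and the stub only
counts calls in place of the real orderly_poweroff():

```c
#include <stdbool.h>

/* Hypothetical simulation state; nothing here exists in the kernel. */
struct sim_state {
	int poweroff_calls;      /* how many times the stub was invoked */
	bool poweroff_triggered; /* set once the first request is issued */
};

/* Stand-in for orderly_poweroff(): it only counts the requests. */
static void sim_orderly_poweroff(struct sim_state *s)
{
	s->poweroff_calls++;
}

/*
 * Models the critical-trip handler. With use_guard set, only the first
 * invocation issues a poweroff request; later ones return early.
 */
static void sim_handle_critical(struct sim_state *s, bool use_guard)
{
	if (use_guard && s->poweroff_triggered)
		return;
	s->poweroff_triggered = true;
	sim_orderly_poweroff(s);
}

/*
 * Polls every poll_ms for shutdown_ms while the temperature stays
 * critical, and returns the number of poweroff requests issued.
 */
static int sim_poll_during_shutdown(int shutdown_ms, int poll_ms,
				    bool use_guard)
{
	struct sim_state s = { 0, false };

	for (int t = 0; t < shutdown_ms; t += poll_ms)
		sim_handle_critical(&s, use_guard);
	return s.poweroff_calls;
}
```

With a 5 s shutdown and 250 ms polling, the unguarded loop issues 20
requests, while the guarded one issues a single request.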

Let me know your thoughts on this.

Best Regards,
Keerthy
> 
>>>>>>
>>>>>>
>>>>>> However, there is no clean way of detecting such failure of
>>>>>> userspace
>>>>>> powering off the system. In such scenarios, it is necessary for a
>>>>>> backup
>>>>>> workqueue to be able to force a shutdown of the system when
>>>>>> orderly
>>>>>> shutdown is not successful after a configurable time period.
>>>>>>
>>>>> Given that system running hot is a thermal issue, I guess we care
>>>>> more
>>>>> on this matter then..
>>>> Yes!
>>>>
>>> I just read this thread again https://patchwork.kernel.org/patch/802458
>>> 1/ to recall the previous discussion.
>>>
>>> https://patchwork.kernel.org/patch/8149891/
>>> https://patchwork.kernel.org/patch/8149861/
>>> should be the solution made based on Ingo' suggestion, right?
>>>
>>> And to me, this sounds like the right direction to go, thermal does not
>>> need a back up shutdown solution, it just needs a kernel function call
>>> which guarantees the system can be shutdown/reboot immediately.
>>>
>>> is there any reason that patch 1/2 is not accepted?
>>
>> Zhang,
>>
>> http://www.serverphorums.com/read.php?12,1400964
>>
>> I got a NAK from Alan and was given this direction on a thermal_poweroff
>> which is more or less what is done in this patch.
>>
> 
> 
> Actually, Alan's suggestion is more for you to define a
> thermal_poweroff() that can be defined per architecture.
> 
> Also, please, keep track of your patch versions and also do copy
> everybody who has stated their opinion on previous discussions. These
> patches must have Ingo, Alan, and RMK copied too. In this way we avoid
> loosing track of what has been suggested and we also converge faster to
> something everybody (or most of us) agree. Next version, please, fix
> that.
> 
> 
> To me, thermal core needs a function that simply powers off the system.
> No timeouts, delayed works, backups, etc. Simple and straight.
> 
> The idea of having a per architecture implementation, as per Alan's
> suggestion, makes sense to me too. Having something different from
> pm_power_off(), specific to thermal, might also give the opportunity to
> save the power off reason.
> 
> BR,
> 
> Eduardo Valentin
>
Zhang, Rui April 12, 2017, 8:26 a.m. UTC | #8
On Wed, 2017-04-12 at 13:25 +0530, Keerthy wrote:
> 
> On Wednesday 12 April 2017 09:35 AM, Eduardo Valentin wrote:
> > 
> > Keerthy,
> > 
> > On Wed, Apr 12, 2017 at 09:09:36AM +0530, Keerthy wrote:
> > > 
> > > 
> > > 
> > > On Wednesday 12 April 2017 08:50 AM, Zhang Rui wrote:
> > > > 
> > > > On Wed, 2017-04-12 at 08:19 +0530, Keerthy wrote:
> > > > > 
> > > > > 
> > > > > On Tuesday 11 April 2017 10:59 PM, Eduardo Valentin wrote:
> > > > > > 
> > > > > > 
> > > > > > Hey,
> > > > > > 
> > > > > > On Fri, Mar 31, 2017 at 12:00:20PM +0530, Keerthy wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > off).
> > <cut>
> > 
> > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > OK... This seams to me, still a corner case supposed to be
> > > > > > fixed at
> > > > > > orderly_power_off, not at thermal. But..
> > > > > > 
> > ^^^ Then again, this must be fixed not at thermal core. And re-
> > reading
> > the history of this thread, this seams to be really something
> > broken at
> > OMAP/DRA7, as mentioned in previous messages. That is probably why
> > you
> > are pushing for pm_power_off(), which seams to be the one that
> > works for
> > you, pulling the plug correctly (DRA7).
> Zhang/Eduardo,
> 
> OMAP5/DRA7 is one case.
> 
> I believe i this is the root cause of this failure.
> 
> thermal_zone_device_check --> thermal_zone_device_update -->
> handle_thermal_trip --> handle_critical_trips --> orderly_poweroff
> 
> The above sequence happens every 250/500 mS based on the
> configuration.
> The orderly_poweroff function is getting called every 250/500 mS and
> i
> see with a full fledged nfs file system it takes at least 5-10
> Seconds
> to shutdown and during that time we bombard with orderly_poweroff
> calls
> multiple times due to the thermal_zone_device_check triggering
> periodically.
> 
> To confirm that i made sure that handle_critical_trips calls
> orderly_poweroff only once and i no longer see the failure on DRA72-
> EVM
> board.
> 
Nice catch!

> So IMHO once we get to handle_critical_trips case where we do
> orderly_poweroff we need to do the following:
> 
> 1) Make sure that orderly_poweroff is called only once.

agreed.

> 2) Cancel all the scheduled work queues to monitor the temperature as
> we have already reached a point of shutting down the system.
> 
agreed.

Now I think we've found the root cause of the problem.
orderly_poweroff() is not reentrant, and it does not have to be.
If we're using orderly_poweroff() for emergency power off, we have to
use it correctly.

Will you generate a patch to do this?

thanks,
rui

> Let me know your thoughts on this.
> 
> Best Regards,
> Keerthy
> > 
> > 
> > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > However, there is no clean way of detecting such failure
> > > > > > > of
> > > > > > > userspace
> > > > > > > powering off the system. In such scenarios, it is
> > > > > > > necessary for a
> > > > > > > backup
> > > > > > > workqueue to be able to force a shutdown of the system
> > > > > > > when
> > > > > > > orderly
> > > > > > > shutdown is not successful after a configurable time
> > > > > > > period.
> > > > > > > 
> > > > > > Given that system running hot is a thermal issue, I guess
> > > > > > we care
> > > > > > more
> > > > > > on this matter then..
> > > > > Yes!
> > > > > 
> > > > I just read this thread again https://patchwork.kernel.org/patc
> > > > h/802458
> > > > 1/ to recall the previous discussion.
> > > > 
> > > > https://patchwork.kernel.org/patch/8149891/
> > > > https://patchwork.kernel.org/patch/8149861/
> > > > should be the solution made based on Ingo' suggestion, right?
> > > > 
> > > > And to me, this sounds like the right direction to go, thermal
> > > > does not
> > > > need a back up shutdown solution, it just needs a kernel
> > > > function call
> > > > which guarantees the system can be shutdown/reboot immediately.
> > > > 
> > > > is there any reason that patch 1/2 is not accepted?
> > > Zhang,
> > > 
> > > http://www.serverphorums.com/read.php?12,1400964
> > > 
> > > I got a NAK from Alan and was given this direction on a
> > > thermal_poweroff
> > > which is more or less what is done in this patch.
> > > 
> > 
> > Actually, Alan's suggestion is more for you to define a
> > thermal_poweroff() that can be defined per architecture.
> > 
> > Also, please, keep track of your patch versions and also do copy
> > everybody who has stated their opinion on previous discussions.
> > These
> > patches must have Ingo, Alan, and RMK copied too. In this way we
> > avoid
> > loosing track of what has been suggested and we also converge
> > faster to
> > something everybody (or most of us) agree. Next version, please,
> > fix
> > that.
> > 
> > 
> > To me, thermal core needs a function that simply powers off the
> > system.
> > No timeouts, delayed works, backups, etc. Simple and straight.
> > 
> > The idea of having a per architecture implementation, as per Alan's
> > suggestion, makes sense to me too. Having something different from
> > pm_power_off(), specific to thermal, might also give the
> > opportunity to
> > save the power off reason.
> > 
> > BR,
> > 
> > Eduardo Valentin
> >
J, KEERTHY April 12, 2017, 8:36 a.m. UTC | #9
On Wednesday 12 April 2017 01:56 PM, Zhang Rui wrote:
> On Wed, 2017-04-12 at 13:25 +0530, Keerthy wrote:
>>
>> On Wednesday 12 April 2017 09:35 AM, Eduardo Valentin wrote:
>>>
>>> Keerthy,
>>>
>>> On Wed, Apr 12, 2017 at 09:09:36AM +0530, Keerthy wrote:
>>>>
>>>>
>>>>
>>>> On Wednesday 12 April 2017 08:50 AM, Zhang Rui wrote:
>>>>>
>>>>> On Wed, 2017-04-12 at 08:19 +0530, Keerthy wrote:
>>>>>>
>>>>>>
>>>>>> On Tuesday 11 April 2017 10:59 PM, Eduardo Valentin wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hey,
>>>>>>>
>>>>>>> On Fri, Mar 31, 2017 at 12:00:20PM +0530, Keerthy wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> off).
>>> <cut>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> OK... This seams to me, still a corner case supposed to be
>>>>>>> fixed at
>>>>>>> orderly_power_off, not at thermal. But..
>>>>>>>
>>> ^^^ Then again, this must be fixed not at thermal core. And re-
>>> reading
>>> the history of this thread, this seams to be really something
>>> broken at
>>> OMAP/DRA7, as mentioned in previous messages. That is probably why
>>> you
>>> are pushing for pm_power_off(), which seams to be the one that
>>> works for
>>> you, pulling the plug correctly (DRA7).
>> Zhang/Eduardo,
>>
>> OMAP5/DRA7 is one case.
>>
>> I believe i this is the root cause of this failure.
>>
>> thermal_zone_device_check --> thermal_zone_device_update -->
>> handle_thermal_trip --> handle_critical_trips --> orderly_poweroff
>>
>> The above sequence happens every 250/500 mS based on the
>> configuration.
>> The orderly_poweroff function is getting called every 250/500 mS and
>> i
>> see with a full fledged nfs file system it takes at least 5-10
>> Seconds
>> to shutdown and during that time we bombard with orderly_poweroff
>> calls
>> multiple times due to the thermal_zone_device_check triggering
>> periodically.
>>
>> To confirm that i made sure that handle_critical_trips calls
>> orderly_poweroff only once and i no longer see the failure on DRA72-
>> EVM
>> board.
>>
> Nice catch!

Thanks.

> 
>> So IMHO once we get to handle_critical_trips case where we do
>> orderly_poweroff we need to do the following:
>>
>> 1) Make sure that orderly_poweroff is called only once.
> 
> agreed.
> 
>> 2) Cancel all the scheduled work queues to monitor the temperature as
>> we have already reached a point of shutting down the system.
>>
> agreed.
> 
> now I think we've found the root cause of the problem.
> orderly_poweroff() is not reenterable and it does not have to be.
> If we're using orderly_poweroff() for emergency power off, we have to
> use it correctly.
> 
> will you generate a patch to do this?

Sure. I will send a patch for 1), making sure that orderly_poweroff is
called only once, right away. I have already tested it.

For 2), cancelling all the scheduled work queues that monitor the
temperature, I will need some more time to write and test it.

Is that okay? Or do you want me to send both together?

Regards,
Keerthy

> 
> thanks,
> rui
> 
>> Let me know your thoughts on this.
>>
>> Best Regards,
>> Keerthy
>>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> However, there is no clean way of detecting such failure
>>>>>>>> of
>>>>>>>> userspace
>>>>>>>> powering off the system. In such scenarios, it is
>>>>>>>> necessary for a
>>>>>>>> backup
>>>>>>>> workqueue to be able to force a shutdown of the system
>>>>>>>> when
>>>>>>>> orderly
>>>>>>>> shutdown is not successful after a configurable time
>>>>>>>> period.
>>>>>>>>
>>>>>>> Given that system running hot is a thermal issue, I guess
>>>>>>> we care
>>>>>>> more
>>>>>>> on this matter then..
>>>>>> Yes!
>>>>>>
>>>>> I just read this thread again https://patchwork.kernel.org/patc
>>>>> h/802458
>>>>> 1/ to recall the previous discussion.
>>>>>
>>>>> https://patchwork.kernel.org/patch/8149891/
>>>>> https://patchwork.kernel.org/patch/8149861/
>>>>> should be the solution made based on Ingo' suggestion, right?
>>>>>
>>>>> And to me, this sounds like the right direction to go, thermal
>>>>> does not
>>>>> need a back up shutdown solution, it just needs a kernel
>>>>> function call
>>>>> which guarantees the system can be shutdown/reboot immediately.
>>>>>
>>>>> is there any reason that patch 1/2 is not accepted?
>>>> Zhang,
>>>>
>>>> http://www.serverphorums.com/read.php?12,1400964
>>>>
>>>> I got a NAK from Alan and was given this direction on a
>>>> thermal_poweroff
>>>> which is more or less what is done in this patch.
>>>>
>>>
>>> Actually, Alan's suggestion is more for you to define a
>>> thermal_poweroff() that can be defined per architecture.
>>>
>>> Also, please, keep track of your patch versions and also do copy
>>> everybody who has stated their opinion on previous discussions.
>>> These
>>> patches must have Ingo, Alan, and RMK copied too. In this way we
>>> avoid
>>> loosing track of what has been suggested and we also converge
>>> faster to
>>> something everybody (or most of us) agree. Next version, please,
>>> fix
>>> that.
>>>
>>>
>>> To me, thermal core needs a function that simply powers off the
>>> system.
>>> No timeouts, delayed works, backups, etc. Simple and straight.
>>>
>>> The idea of having a per architecture implementation, as per Alan's
>>> suggestion, makes sense to me too. Having something different from
>>> pm_power_off(), specific to thermal, might also give the
>>> opportunity to
>>> save the power off reason.
>>>
>>> BR,
>>>
>>> Eduardo Valentin
>>>
Zhang, Rui April 12, 2017, 8:45 a.m. UTC | #10
On Wed, 2017-04-12 at 14:06 +0530, Keerthy wrote:
> 
> On Wednesday 12 April 2017 01:56 PM, Zhang Rui wrote:
> > 
> > On Wed, 2017-04-12 at 13:25 +0530, Keerthy wrote:
> > > 
> > > 
> > > On Wednesday 12 April 2017 09:35 AM, Eduardo Valentin wrote:
> > > > 
> > > > 
> > > > Keerthy,
> > > > 
> > > > On Wed, Apr 12, 2017 at 09:09:36AM +0530, Keerthy wrote:
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > On Wednesday 12 April 2017 08:50 AM, Zhang Rui wrote:
> > > > > > 
> > > > > > 
> > > > > > On Wed, 2017-04-12 at 08:19 +0530, Keerthy wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > On Tuesday 11 April 2017 10:59 PM, Eduardo Valentin
> > > > > > > wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Hey,
> > > > > > > > 
> > > > > > > > On Fri, Mar 31, 2017 at 12:00:20PM +0530, Keerthy
> > > > > > > > wrote:
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > off).
> > > > <cut>
> > > > 
> > > > > 
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > OK... This seams to me, still a corner case supposed to
> > > > > > > > be
> > > > > > > > fixed at
> > > > > > > > orderly_power_off, not at thermal. But..
> > > > > > > > 
> > > > ^^^ Then again, this must be fixed not at thermal core. And re-
> > > > reading
> > > > the history of this thread, this seams to be really something
> > > > broken at
> > > > OMAP/DRA7, as mentioned in previous messages. That is probably
> > > > why
> > > > you
> > > > are pushing for pm_power_off(), which seams to be the one that
> > > > works for
> > > > you, pulling the plug correctly (DRA7).
> > > Zhang/Eduardo,
> > > 
> > > OMAP5/DRA7 is one case.
> > > 
> > > I believe i this is the root cause of this failure.
> > > 
> > > thermal_zone_device_check --> thermal_zone_device_update -->
> > > handle_thermal_trip --> handle_critical_trips -->
> > > orderly_poweroff
> > > 
> > > The above sequence happens every 250/500 mS based on the
> > > configuration.
> > > The orderly_poweroff function is getting called every 250/500 mS
> > > and
> > > i
> > > see with a full fledged nfs file system it takes at least 5-10
> > > Seconds
> > > to shutdown and during that time we bombard with orderly_poweroff
> > > calls
> > > multiple times due to the thermal_zone_device_check triggering
> > > periodically.
> > > 
> > > To confirm that i made sure that handle_critical_trips calls
> > > orderly_poweroff only once and i no longer see the failure on
> > > DRA72-
> > > EVM
> > > board.
> > > 
> > Nice catch!
> Thanks.
> 
> > 
> > 
> > > 
> > > So IMHO once we get to handle_critical_trips case where we do
> > > orderly_poweroff we need to do the following:
> > > 
> > > 1) Make sure that orderly_poweroff is called only once.
> > agreed.
> > 
> > > 
> > > 2) Cancel all the scheduled work queues to monitor the
> > > temperature as
> > > we have already reached a point of shutting down the system.
> > > 
> > agreed.
> > 
> > now I think we've found the root cause of the problem.
> > orderly_poweroff() is not reenterable and it does not have to be.
> > If we're using orderly_poweroff() for emergency power off, we have
> > to
> > use it correctly.
> > 
> > will you generate a patch to do this?
> Sure. I will generate a patch to take care of 1) To make sure that
> orderly_poweroff is called only once right away. I have already
> tested.
> 
> for 2) Cancel all the scheduled work queues to monitor the
> temperature.
> I will take some more time to make it and test.
> 
> Is that okay? Or you want me to send both together?
> 
I think you can send the patch for step 1 first.

thanks,
rui
> Regards,
> Keerthy
> 
> > 
> > 
> > thanks,
> > rui
> > 
> > > 
> > > Let me know your thoughts on this.
> > > 
> > > Best Regards,
> > > Keerthy
> > > > 
> > > > 
> > > > 
> > > > > 
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > However, there is no clean way of detecting such
> > > > > > > > > failure
> > > > > > > > > of
> > > > > > > > > userspace
> > > > > > > > > powering off the system. In such scenarios, it is
> > > > > > > > > necessary for a
> > > > > > > > > backup
> > > > > > > > > workqueue to be able to force a shutdown of the
> > > > > > > > > system
> > > > > > > > > when
> > > > > > > > > orderly
> > > > > > > > > shutdown is not successful after a configurable time
> > > > > > > > > period.
> > > > > > > > > 
> > > > > > > > Given that system running hot is a thermal issue, I
> > > > > > > > guess
> > > > > > > > we care
> > > > > > > > more
> > > > > > > > on this matter then..
> > > > > > > Yes!
> > > > > > > 
> > > > > > I just read this thread again https://patchwork.kernel.org/
> > > > > > patc
> > > > > > h/802458
> > > > > > 1/ to recall the previous discussion.
> > > > > > 
> > > > > > https://patchwork.kernel.org/patch/8149891/
> > > > > > https://patchwork.kernel.org/patch/8149861/
> > > > > > should be the solution made based on Ingo' suggestion,
> > > > > > right?
> > > > > > 
> > > > > > And to me, this sounds like the right direction to go,
> > > > > > thermal
> > > > > > does not
> > > > > > need a back up shutdown solution, it just needs a kernel
> > > > > > function call
> > > > > > which guarantees the system can be shutdown/reboot
> > > > > > immediately.
> > > > > > 
> > > > > > is there any reason that patch 1/2 is not accepted?
> > > > > Zhang,
> > > > > 
> > > > > http://www.serverphorums.com/read.php?12,1400964
> > > > > 
> > > > > I got a NAK from Alan and was given this direction on a
> > > > > thermal_poweroff
> > > > > which is more or less what is done in this patch.
> > > > > 
> > > > Actually, Alan's suggestion is more for you to define a
> > > > thermal_poweroff() that can be defined per architecture.
> > > > 
> > > > Also, please, keep track of your patch versions and also do
> > > > copy
> > > > everybody who has stated their opinion on previous discussions.
> > > > These
> > > > patches must have Ingo, Alan, and RMK copied too. In this way
> > > > we
> > > > avoid
> > > > loosing track of what has been suggested and we also converge
> > > > faster to
> > > > something everybody (or most of us) agree. Next version,
> > > > please,
> > > > fix
> > > > that.
> > > > 
> > > > 
> > > > To me, thermal core needs a function that simply powers off the
> > > > system.
> > > > No timeouts, delayed works, backups, etc. Simple and straight.
> > > > 
> > > > The idea of having a per architecture implementation, as per
> > > > Alan's
> > > > suggestion, makes sense to me too. Having something different
> > > > from
> > > > pm_power_off(), specific to thermal, might also give the
> > > > opportunity to
> > > > save the power off reason.
> > > > 
> > > > BR,
> > > > 
> > > > Eduardo Valentin
> > > >
Eduardo Valentin April 12, 2017, 3:44 p.m. UTC | #11
Hello,

On Wed, Apr 12, 2017 at 04:45:44PM +0800, Zhang Rui wrote:

<cut>

> > > > Zhang/Eduardo,
> > > > 
> > > > OMAP5/DRA7 is one case.
> > > > 
> > > > I believe i this is the root cause of this failure.
> > > > 
> > > > thermal_zone_device_check --> thermal_zone_device_update -->
> > > > handle_thermal_trip --> handle_critical_trips -->
> > > > orderly_poweroff
> > > > 
> > > > The above sequence happens every 250/500 mS based on the
> > > > configuration.
> > > > The orderly_poweroff function is getting called every 250/500 mS
> > > > and
> > > > i
> > > > see with a full fledged nfs file system it takes at least 5-10
> > > > Seconds
> > > > to shutdown and during that time we bombard with orderly_poweroff
> > > > calls
> > > > multiple times due to the thermal_zone_device_check triggering
> > > > periodically.

I see. A couple of questions here:
a. A regular shutdown command on your setup takes 5 to 10 s? What is the
PHY underneath your NFS? 56K modem?
b. Or did you mean it takes 5 to 10 s because you keep calling
orderly_poweroff?

> > > > 
> > > > To confirm that i made sure that handle_critical_trips calls
> > > > orderly_poweroff only once and i no longer see the failure on
> > > > DRA72-
> > > > EVM
> > > > board.
> > > > 


> > > Nice catch!

Ok. Nice. But how long does it take?

> > Thanks.
> > 
> > > 
> > > 
> > > > 
> > > > So IMHO once we get to handle_critical_trips case where we do
> > > > orderly_poweroff we need to do the following:
> > > > 
> > > > 1) Make sure that orderly_poweroff is called only once.
> > > agreed.
> > > 
> > > > 
> > > > 2) Cancel all the scheduled work queues to monitor the
> > > > temperature as
> > > > we have already reached a point of shutting down the system.
> > > > 
> > > agreed.
> > > 
> > > now I think we've found the root cause of the problem.
> > > orderly_poweroff() is not reenterable and it does not have to be.


Well, why not? Because we assume that all sources of shutdown within the
kernel are going to happen at different times? What if thermal calls it
and another subsystem/driver calls it too? Does it work if user space
also calls shutdown in the middle of a thermal shutdown? I think we need
to think this through a bit more.

> > > If we're using orderly_poweroff() for emergency power off, we have
> > > to
> > > use it correctly.
> > > 

I agree. But there is nothing that says it is not reentrant. If you
saw something along these lines, can you please share?

> > > will you generate a patch to do this?
> > Sure. I will generate a patch to take care of 1) To make sure that
> > orderly_poweroff is called only once right away. I have already
> > tested.
> > 
> > for 2) Cancel all the scheduled work queues to monitor the
> > temperature.
> > I will take some more time to make it and test.
> > 
> > Is that okay? Or you want me to send both together?
> > 
> I think you can send patch for step 1 first.

I am happy to see that Keerthy found the problem with his setup and a
possible solution. But I have a few concerns here.

1. If the regular shutdown process takes 10 seconds, that is a ballpark
that thermal should never wait for. orderly_poweroff() calls run_cmd()
with the wait flag set. That means that if a regular userland shutdown
takes 10 s, we are waiting for it. Obviously this is not acceptable,
especially if you set up the critical trip at 125C. Now, if you properly
size the critical trip to fire before the hotspot really reaches 125C,
with 10 s (or however long shutdown takes) of margin, then fine. But
based on what was described in this thread, his system waits 10 s on a
regular shutdown, and his silicon is at an out-of-spec temperature for
those 10 s, which is wrong.

2. The above scenario is not acceptable in the long run, especially from
a reliability perspective. If orderly_poweroff() can simply never return
(or take too long), I would say the thermal subsystem is using the wrong
API.


If you are going to implement the above two patches, keep in mind:
i. At least within the thermal subsystem, you need to take care of all
zones that could trigger a shutdown.
ii. Serializing the calls to orderly_poweroff() seems to be a bigger
concern than cancelling all monitoring.
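
As a rough user-space sketch of point ii, assuming a single flag shared
by every zone (the zone_* names are hypothetical, and the stub stands in
for orderly_poweroff()), a test-and-set lets only the first zone that
hits its critical trip issue the request:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag shared by all zones; clear until the first request. */
static atomic_flag zone_poweroff_requested = ATOMIC_FLAG_INIT;

/* Counts how often the stand-in for orderly_poweroff() was invoked. */
static int zone_poweroff_calls;

static void zone_stub_orderly_poweroff(void)
{
	zone_poweroff_calls++;
}

/*
 * Called by any zone that crosses its critical trip. Returns true only
 * for the single caller that actually issued the poweroff request.
 */
static bool zone_critical_trip(int zone_id)
{
	(void)zone_id; /* every zone shares the same flag */

	/* test-and-set returns the previous value: true means "already set" */
	if (atomic_flag_test_and_set(&zone_poweroff_requested))
		return false;

	zone_stub_orderly_poweroff();
	return true;
}
```

This only covers callers inside thermal. It does not address another
subsystem or user space requesting a shutdown at the same time.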


BR,

Eduardo Valentin
> 
> thanks,
> rui
> > Regards,
> > Keerthy
> > 
> > > 
> > > 
> > > thanks,
> > > rui
> > > 
> > > > 
> > > > Let me know your thoughts on this.
> > > > 
> > > > Best Regards,
> > > > Keerthy
> > > > > 
> > > > > 
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > However, there is no clean way of detecting such
> > > > > > > > > > failure
> > > > > > > > > > of
> > > > > > > > > > userspace
> > > > > > > > > > powering off the system. In such scenarios, it is
> > > > > > > > > > necessary for a
> > > > > > > > > > backup
> > > > > > > > > > workqueue to be able to force a shutdown of the
> > > > > > > > > > system
> > > > > > > > > > when
> > > > > > > > > > orderly
> > > > > > > > > > shutdown is not successful after a configurable time
> > > > > > > > > > period.
> > > > > > > > > > 
> > > > > > > > > Given that system running hot is a thermal issue, I
> > > > > > > > > guess
> > > > > > > > > we care
> > > > > > > > > more
> > > > > > > > > on this matter then..
> > > > > > > > Yes!
> > > > > > > > 
> > > > > > > I just read this thread again https://patchwork.kernel.org/
> > > > > > > patc
> > > > > > > h/802458
> > > > > > > 1/ to recall the previous discussion.
> > > > > > > 
> > > > > > > https://patchwork.kernel.org/patch/8149891/
> > > > > > > https://patchwork.kernel.org/patch/8149861/
> > > > > > > should be the solution made based on Ingo' suggestion,
> > > > > > > right?
> > > > > > > 
> > > > > > > And to me, this sounds like the right direction to go,
> > > > > > > thermal
> > > > > > > does not
> > > > > > > need a back up shutdown solution, it just needs a kernel
> > > > > > > function call
> > > > > > > which guarantees the system can be shutdown/reboot
> > > > > > > immediately.
> > > > > > > 
> > > > > > > is there any reason that patch 1/2 is not accepted?
> > > > > > Zhang,
> > > > > > 
> > > > > > http://www.serverphorums.com/read.php?12,1400964
> > > > > > 
> > > > > > I got a NAK from Alan and was given this direction on a
> > > > > > thermal_poweroff
> > > > > > which is more or less what is done in this patch.
> > > > > > 
> > > > > Actually, Alan's suggestion is more for you to define a
> > > > > thermal_poweroff() that can be defined per architecture.
> > > > > 
> > > > > Also, please, keep track of your patch versions and also do
> > > > > copy
> > > > > everybody who has stated their opinion on previous discussions.
> > > > > These
> > > > > patches must have Ingo, Alan, and RMK copied too. In this way
> > > > > we
> > > > > avoid
> > > > > loosing track of what has been suggested and we also converge
> > > > > faster to
> > > > > something everybody (or most of us) agree. Next version,
> > > > > please,
> > > > > fix
> > > > > that.
> > > > > 
> > > > > 
> > > > > To me, thermal core needs a function that simply powers off the
> > > > > system.
> > > > > No timeouts, delayed works, backups, etc. Simple and straight.
> > > > > 
> > > > > The idea of having a per architecture implementation, as per
> > > > > Alan's
> > > > > suggestion, makes sense to me too. Having something different
> > > > > from
> > > > > pm_power_off(), specific to thermal, might also give the
> > > > > opportunity to
> > > > > save the power off reason.
> > > > > 
> > > > > BR,
> > > > > 
> > > > > Eduardo Valentin
> > > > >
J, KEERTHY April 12, 2017, 4:16 p.m. UTC | #12
On Wednesday 12 April 2017 09:14 PM, Eduardo Valentin wrote:
> Hello,
> 
> On Wed, Apr 12, 2017 at 04:45:44PM +0800, Zhang Rui wrote:
> 
> <cut>
> 
>>>>> Zhang/Eduardo,
>>>>>
>>>>> OMAP5/DRA7 is one case.
>>>>>
>>>>> I believe i this is the root cause of this failure.
>>>>>
>>>>> thermal_zone_device_check --> thermal_zone_device_update -->
>>>>> handle_thermal_trip --> handle_critical_trips -->
>>>>> orderly_poweroff
>>>>>
>>>>> The above sequence happens every 250/500 mS based on the
>>>>> configuration.
>>>>> The orderly_poweroff function is getting called every 250/500 mS
>>>>> and
>>>>> i
>>>>> see with a full fledged nfs file system it takes at least 5-10
>>>>> Seconds
>>>>> to shutdown and during that time we bombard with orderly_poweroff
>>>>> calls
>>>>> multiple times due to the thermal_zone_device_check triggering
>>>>> periodically.
> 
> I see. A couple of questions here:
> a. A regular shutdown command on your setup takes 5 to 10 s? What is the
> PHY underneath your NFS? 56K modem?

It's not a 56K modem, but I am also not running on busybox!
It's a full-fledged Arago file system. Yes, I have run a basic poweroff
and it takes about 5 s. I will share the logs with timings first
thing tomorrow.

> b. Or did you mean it takes 5 to 10 s because you keep calling
> orderly_poweroff?

If we keep calling orderly_poweroff, the system never shuts down. Hence
the issue.

> 
>>>>>
>>>>> To confirm that i made sure that handle_critical_trips calls
>>>>> orderly_poweroff only once and i no longer see the failure on
>>>>> DRA72-
>>>>> EVM
>>>>> board.
>>>>>
> 
> 
>>>> Nice catch!
> 
> Ok. Nice. But how long does it take?

About 5-10 s, as I mentioned.

First and foremost, there is an issue here wherein we keep calling
orderly_poweroff, and that needs to be addressed.

> 
>>> Thanks.
>>>
>>>>
>>>>
>>>>>
>>>>> So IMHO once we get to handle_critical_trips case where we do
>>>>> orderly_poweroff we need to do the following:
>>>>>
>>>>> 1) Make sure that orderly_poweroff is called only once.
>>>> agreed.
>>>>
>>>>>
>>>>> 2) Cancel all the scheduled work queues to monitor the
>>>>> temperature as
>>>>> we have already reached a point of shutting down the system.
>>>>>
>>>> agreed.
>>>>
>>>> now I think we've found the root cause of the problem.
>>>> orderly_poweroff() is not reenterable and it does not have to be.
> 
> 
> Well, why not? Because we assume that all sources of shutdown within
> kernel are all gonna happen in different time? What if thermal calls and
> another subsystem/driver calls it too. Does work if user space also
> calls shutdown in the middle of a thermal shutdown? I think we need to
> think this through a bit more..

Definitely we need to think a lot more, but point agreed. Why is the
thermal framework calling orderly_poweroff multiple times? Even if you
manage to shut off in 2 seconds, you still end up calling it 4 to 8 times
depending on a 500 ms or 250 ms delay.
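
To put numbers on that, here is a back-of-the-envelope sketch in plain
userspace C (not kernel code; the helper name is made up for
illustration). It assumes the zone keeps polling at a fixed delay while
the shutdown is in progress, and that every poll above the critical trip
calls orderly_poweroff once.

```c
#include <assert.h>

/*
 * Hypothetical helper: how many times a periodic thermal check would
 * call orderly_poweroff while a shutdown lasting shutdown_ms is running,
 * assuming one call per polling period of poll_ms.
 */
unsigned int poweroff_calls_during_shutdown(unsigned int shutdown_ms,
					    unsigned int poll_ms)
{
	assert(poll_ms > 0);	/* a zero delay would mean no polling at all */
	return shutdown_ms / poll_ms;
}
```

For a 2 second shutdown this gives 4 calls at a 500 ms delay and 8 calls
at a 250 ms delay.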

> 
>>>> If we're using orderly_poweroff() for emergency power off, we have
>>>> to
>>>> use it correctly.
>>>>
> 
> I agree. But there it nothing that says it is not reenterable. If you
> saw something in this line, can you please share?
> 
>>>> will you generate a patch to do this?
>>> Sure. I will generate a patch to take care of 1) To make sure that
>>> orderly_poweroff is called only once right away. I have already
>>> tested.
>>>
>>> for 2) Cancel all the scheduled work queues to monitor the
>>> temperature.
>>> I will take some more time to make it and test.
>>>
>>> Is that okay? Or you want me to send both together?
>>>
>> I think you can send patch for step 1 first.
> 
> I am happy to see that Keerthy found the problem with his setup and a
> possible solution. But I have a few concerns here.
> 
> 1. If regular shutdown process takes 10seconds, that is a ballpark that
> thermal should never wait. orderly_poweroff() calls run_cmd() with wait
> flag set. That means, if regular userland shutdown takes 10s, we are
> waiting for it. Obviously this not acceptable. Specially if you setup
> critical trip to be 125C. Now, if you properly size the critical trip to
> fire before hotspot really reach 125C, for 10s (or the time it takes to
> shutdown), then fine. But based on what was described in this thread,
> his system is waiting 10s on regular shutdown, and his silicon is on
> out-of-spec temperature for 10s, which is wrong.

Two approaches can be taken here:

1) Reduce the critical temperature to something lower than the hardware
critical point.

Or

2) Call kernel_power_off directly, as you are in a pretty critical
situation! That takes less than a second and powers off the PMIC, at
least on OMAP5/DRA7.

> 
> 2. The above scenario is not acceptable in a long run, specially from a
> reliability perspective. If orderly_poweroff() has a possibility to
> simply never return (or take too long), I would say the thermal
> subsystem is using the wrong API.

As mentioned above, kernel_power_off?

> 
> 
> If you are going to implement the above two patches, keep in mind:
> i. At least within the thermal subsystem, you need to take care of all
> zones that could trigger a shutdown.

Do you think it makes sense for all the 'n' sensors to trigger
orderly_poweroff one by one? Or should we worry about the first source
and ensure that it shuts off the system?

Is it not enough to catch the first critical alert and power off?

> ii. serializing the calls to orderly_poweroff() seams to be more
> concerning than cancelling all monitoring.
> 
> 
> BR,
> 
> Eduardo Valentin
Grygorii Strashko April 12, 2017, 4:31 p.m. UTC | #13
On 04/12/2017 10:44 AM, Eduardo Valentin wrote:
> Hello,
> 
...

> 
> I agree. But there it nothing that says it is not reenterable. If you
> saw something in this line, can you please share?
> 
>>>> will you generate a patch to do this?
>>> Sure. I will generate a patch to take care of 1) To make sure that
>>> orderly_poweroff is called only once right away. I have already
>>> tested.
>>>
>>> for 2) Cancel all the scheduled work queues to monitor the
>>> temperature.
>>> I will take some more time to make it and test.
>>>
>>> Is that okay? Or you want me to send both together?
>>>
>> I think you can send patch for step 1 first.
> 
> I am happy to see that Keerthy found the problem with his setup and a
> possible solution. But I have a few concerns here.
> 
> 1. If regular shutdown process takes 10seconds, that is a ballpark that
> thermal should never wait. orderly_poweroff() calls run_cmd() with wait
> flag set. That means, if regular userland shutdown takes 10s, we are
> waiting for it. Obviously this not acceptable. Specially if you setup
> critical trip to be 125C. Now, if you properly size the critical trip to
> fire before hotspot really reach 125C, for 10s (or the time it takes to
> shutdown), then fine. But based on what was described in this thread,
> his system is waiting 10s on regular shutdown, and his silicon is on
> out-of-spec temperature for 10s, which is wrong.
> 
> 2. The above scenario is not acceptable in a long run, specially from a
> reliability perspective. If orderly_poweroff() has a possibility to
> simply never return (or take too long), I would say the thermal
> subsystem is using the wrong API.
> 


Hm, I do not see that orderly_poweroff() waits for anything now:
void orderly_poweroff(bool force)
{
	if (force) /* do not override the pending "true" */
		poweroff_force = true;
	schedule_work(&poweroff_work); 
^^^^^^^ async call. Even here there can be a pretty big delay if the system is under pressure
}


static int __orderly_poweroff(bool force)
{
	int ret;

	ret = run_cmd(poweroff_cmd);
^^^^ no wait for the process - only for exec. flags == UMH_WAIT_EXEC

	if (ret && force) {
		pr_warn("Failed to start orderly shutdown: forcing the issue\n");

		/*
		 * I guess this should try to kick off some daemon to sync and
		 * poweroff asap.  Or not even bother syncing if we're doing an
		 * emergency shutdown?
		 */
		emergency_sync();
		kernel_power_off();
^^^ force power off, but only if run_cmd() failed - for example /sbin/poweroff doesn't exist
	}

	return ret;
}

static bool poweroff_force;

static void poweroff_work_func(struct work_struct *work)
{
	__orderly_poweroff(poweroff_force);
}

As a result, thermal no longer has any control over the power off after
calling orderly_poweroff(), and it cannot get the result of the userspace
(US) poweroff binary execution.
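
The check quoted above can be reduced to a tiny userspace model (a
sketch only; the function name is made up, and ret stands for whatever
run_cmd() returned):

```c
#include <stdbool.h>

/*
 * Userspace model of the "if (ret && force)" check in
 * __orderly_poweroff(): ret is the return value of run_cmd(),
 * force is the poweroff_force flag. Returns true when the
 * emergency_sync() + kernel_power_off() path would be taken.
 */
bool forced_poweroff_taken(int ret, bool force)
{
	return ret != 0 && force;
}
```

So with ret == 0 (the exec of the poweroff command started fine) the
forced path is skipped even when force is true, no matter how long
userspace then takes to actually power off.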

> 
> If you are going to implement the above two patches, keep in mind:
> i. At least within the thermal subsystem, you need to take care of all
> zones that could trigger a shutdown.
> ii. serializing the calls to orderly_poweroff() seams to be more
> concerning than cancelling all monitoring.
> 
>
Eduardo Valentin April 12, 2017, 4:34 p.m. UTC | #14
Hey,

On Wed, Apr 12, 2017 at 11:31:18AM -0500, Grygorii Strashko wrote:
> 
> 
> On 04/12/2017 10:44 AM, Eduardo Valentin wrote:
> > Hello,
> > 
> ...
> 
> > 
> > I agree. But there it nothing that says it is not reenterable. If you
> > saw something in this line, can you please share?
> > 
> >>>> will you generate a patch to do this?
> >>> Sure. I will generate a patch to take care of 1) To make sure that
> >>> orderly_poweroff is called only once right away. I have already
> >>> tested.
> >>>
> >>> for 2) Cancel all the scheduled work queues to monitor the
> >>> temperature.
> >>> I will take some more time to make it and test.
> >>>
> >>> Is that okay? Or you want me to send both together?
> >>>
> >> I think you can send patch for step 1 first.
> > 
> > I am happy to see that Keerthy found the problem with his setup and a
> > possible solution. But I have a few concerns here.
> > 
> > 1. If regular shutdown process takes 10seconds, that is a ballpark that
> > thermal should never wait. orderly_poweroff() calls run_cmd() with wait
> > flag set. That means, if regular userland shutdown takes 10s, we are
> > waiting for it. Obviously this not acceptable. Specially if you setup
> > critical trip to be 125C. Now, if you properly size the critical trip to
> > fire before hotspot really reach 125C, for 10s (or the time it takes to
> > shutdown), then fine. But based on what was described in this thread,
> > his system is waiting 10s on regular shutdown, and his silicon is on
> > out-of-spec temperature for 10s, which is wrong.
> > 
> > 2. The above scenario is not acceptable in a long run, specially from a
> > reliability perspective. If orderly_poweroff() has a possibility to
> > simply never return (or take too long), I would say the thermal
> > subsystem is using the wrong API.
> > 
> 
> 
> Hh, I do not see that orderly_poweroff() will wait for anything now:
> void orderly_poweroff(bool force)
> {
> 	if (force) /* do not override the pending "true" */
> 		poweroff_force = true;
> 	schedule_work(&poweroff_work); 
> ^^^^^^^ async call. even here can be pretty big delay if system is under pressure
> }
> 
> 
> static int __orderly_poweroff(bool force)
> {
> 	int ret;
> 
> 	ret = run_cmd(poweroff_cmd);
> ^^^^ no wait for the process - only for exec. flags == UMH_WAIT_EXEC

Yeah, and that is what I really meant. Sorry for the confusion. The exec
is problematic in his scenario too, given he is running on a very
interesting NFS setup. Yes, the WAIT_EXEC is set:
static int run_cmd(const char *cmd)
{
	char **argv;
	static char *envp[] = {
		"HOME=/",
		"PATH=/sbin:/bin:/usr/sbin:/usr/bin",
		NULL
	};
	int ret;
	argv = argv_split(GFP_KERNEL, cmd, NULL);
	if (argv) {
		ret = call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC);
		argv_free(argv);
	} else {
		ret = -ENOMEM;
	}

	return ret;
}


> 
> 	if (ret && force) {
> 		pr_warn("Failed to start orderly shutdown: forcing the issue\n");
> 
> 		/*
> 		 * I guess this should try to kick off some daemon to sync and
> 		 * poweroff asap.  Or not even bother syncing if we're doing an
> 		 * emergency shutdown?
> 		 */
> 		emergency_sync();
> 		kernel_power_off();
> ^^^ force power off, but only if run_cmd() failed - for example /sbin/poweroff doesn't exist
> 	}
> 
> 	return ret;
> }
> 
> static bool poweroff_force;
> 
> static void poweroff_work_func(struct work_struct *work)
> {
> 	__orderly_poweroff(poweroff_force);
> }
> 
> As result thermal has no control of power off any more after calling orderly_poweroff() and can get the result
> of US poweroff binary execution.
> 
> > 
> > If you are going to implement the above two patches, keep in mind:
> > i. At least within the thermal subsystem, you need to take care of all
> > zones that could trigger a shutdown.
> > ii. serializing the calls to orderly_poweroff() seams to be more
> > concerning than cancelling all monitoring.
> > 
> > 
> 
> -- 
> regards,
> -grygorii
J, KEERTHY April 12, 2017, 4:44 p.m. UTC | #15
On Wednesday 12 April 2017 10:01 PM, Grygorii Strashko wrote:
> 
> 
> On 04/12/2017 10:44 AM, Eduardo Valentin wrote:
>> Hello,
>>
> ...
> 
>>
>> I agree. But there it nothing that says it is not reenterable. If you
>> saw something in this line, can you please share?
>>
>>>>> will you generate a patch to do this?
>>>> Sure. I will generate a patch to take care of 1) To make sure that
>>>> orderly_poweroff is called only once right away. I have already
>>>> tested.
>>>>
>>>> for 2) Cancel all the scheduled work queues to monitor the
>>>> temperature.
>>>> I will take some more time to make it and test.
>>>>
>>>> Is that okay? Or you want me to send both together?
>>>>
>>> I think you can send patch for step 1 first.
>>
>> I am happy to see that Keerthy found the problem with his setup and a
>> possible solution. But I have a few concerns here.
>>
>> 1. If regular shutdown process takes 10seconds, that is a ballpark that
>> thermal should never wait. orderly_poweroff() calls run_cmd() with wait
>> flag set. That means, if regular userland shutdown takes 10s, we are
>> waiting for it. Obviously this not acceptable. Specially if you setup
>> critical trip to be 125C. Now, if you properly size the critical trip to
>> fire before hotspot really reach 125C, for 10s (or the time it takes to
>> shutdown), then fine. But based on what was described in this thread,
>> his system is waiting 10s on regular shutdown, and his silicon is on
>> out-of-spec temperature for 10s, which is wrong.
>>
>> 2. The above scenario is not acceptable in a long run, specially from a
>> reliability perspective. If orderly_poweroff() has a possibility to
>> simply never return (or take too long), I would say the thermal
>> subsystem is using the wrong API.
>>
> 
> 
> Hh, I do not see that orderly_poweroff() will wait for anything now:
> void orderly_poweroff(bool force)
> {
> 	if (force) /* do not override the pending "true" */
> 		poweroff_force = true;
> 	schedule_work(&poweroff_work); 
> ^^^^^^^ async call. even here can be pretty big delay if system is under pressure
> }
> 
> 
> static int __orderly_poweroff(bool force)
> {
> 	int ret;
> 
> 	ret = run_cmd(poweroff_cmd);

When I tried with multiple orderly_poweroff calls, ret was always 0.
So every 250 ms I see ret = 0.

> ^^^^ no wait for the process - only for exec. flags == UMH_WAIT_EXEC
> 
> 	if (ret && force) {

So it never entered this path: ret = 0, so the if body is not executed.

> 		pr_warn("Failed to start orderly shutdown: forcing the issue\n");
> 
> 		/*
> 		 * I guess this should try to kick off some daemon to sync and
> 		 * poweroff asap.  Or not even bother syncing if we're doing an
> 		 * emergency shutdown?
> 		 */
> 		emergency_sync();
> 		kernel_power_off();
> ^^^ force power off, but only if run_cmd() failed - for example /sbin/poweroff doesn't exist
> 	}
> 
> 	return ret;
> }
> 
> static bool poweroff_force;
> 
> static void poweroff_work_func(struct work_struct *work)
> {
> 	__orderly_poweroff(poweroff_force);
> }
> 
> As result thermal has no control of power off any more after calling orderly_poweroff() and can get the result
> of US poweroff binary execution.
> 
>>
>> If you are going to implement the above two patches, keep in mind:
>> i. At least within the thermal subsystem, you need to take care of all
>> zones that could trigger a shutdown.
>> ii. serializing the calls to orderly_poweroff() seams to be more
>> concerning than cancelling all monitoring.
>>
>>
>
Eduardo Valentin April 12, 2017, 4:50 p.m. UTC | #16
Hey

On Wed, Apr 12, 2017 at 09:46:47PM +0530, Keerthy wrote:
> 
> 
> On Wednesday 12 April 2017 09:14 PM, Eduardo Valentin wrote:
> > Hello,
> > 
> > On Wed, Apr 12, 2017 at 04:45:44PM +0800, Zhang Rui wrote:
> > 
> > <cut>
> > 
> >>>>> Zhang/Eduardo,
> >>>>>
> >>>>> OMAP5/DRA7 is one case.
> >>>>>
> >>>>> I believe i this is the root cause of this failure.
> >>>>>
> >>>>> thermal_zone_device_check --> thermal_zone_device_update -->
> >>>>> handle_thermal_trip --> handle_critical_trips -->
> >>>>> orderly_poweroff
> >>>>>
> >>>>> The above sequence happens every 250/500 mS based on the
> >>>>> configuration.
> >>>>> The orderly_poweroff function is getting called every 250/500 mS
> >>>>> and
> >>>>> i
> >>>>> see with a full fledged nfs file system it takes at least 5-10
> >>>>> Seconds
> >>>>> to shutdown and during that time we bombard with orderly_poweroff
> >>>>> calls
> >>>>> multiple times due to the thermal_zone_device_check triggering
> >>>>> periodically.
> > 
> > I see. A couple of questions here:
> > a. A regular shutdown command on your setup takes 5 to 10 s? What is the
> > PHY underneath your NFS? 56K modem?
> 
> Its not 56K modem but also i am not running on busybox!

OK. :-)

> Its a full fledged arago file system. Yes i have run a basic poweroff
> and it takes about 5S. I will share the logs with timings the first
> thing tomorrow.
> 

I see. 

> > b. Or did you mean it takes 5 to 10 s because you keep calling
> > orderly_poweroff?
> 
> If we keep calling orderly_poweroff it would never shutdown. Hence the
> issue.

Yeah, if you could share the logs, it would be great to understand where
the wait sits.

> 
> > 
> >>>>>
> >>>>> To confirm that i made sure that handle_critical_trips calls
> >>>>> orderly_poweroff only once and i no longer see the failure on
> >>>>> DRA72-
> >>>>> EVM
> >>>>> board.
> >>>>>
> > 
> > 
> >>>> Nice catch!
> > 
> > Ok. Nice. But how long does it take?
> 
> About 5-10S as i mentioned.
> 
> First and foremost there is an issue here where in we keep calling
> orderly_poweroff which needs to be addressed.
> 

I agree here. Apparently, the expectations of the API were wrong. I
agree on refraining from calling it multiple times before it finishes.

But, I said this before, and I will repeat myself. I believe thermal is
not the only user of this API. Maybe the problem is more apparent for
thermal because we call it multiple times, and we want it to finish,
but even after fixing the serialization on the thermal side, we can still
collide with other parts of the kernel and userland.

> > 
> >>> Thanks.
> >>>
> >>>>
> >>>>
> >>>>>
> >>>>> So IMHO once we get to handle_critical_trips case where we do
> >>>>> orderly_poweroff we need to do the following:
> >>>>>
> >>>>> 1) Make sure that orderly_poweroff is called only once.
> >>>> agreed.
> >>>>
> >>>>>
> >>>>> 2) Cancel all the scheduled work queues to monitor the
> >>>>> temperature as
> >>>>> we have already reached a point of shutting down the system.
> >>>>>
> >>>> agreed.
> >>>>
> >>>> now I think we've found the root cause of the problem.
> >>>> orderly_poweroff() is not reenterable and it does not have to be.
> > 
> > 
> > Well, why not? Because we assume that all sources of shutdown within
> > kernel are all gonna happen in different time? What if thermal calls and
> > another subsystem/driver calls it too. Does work if user space also
> > calls shutdown in the middle of a thermal shutdown? I think we need to
> > think this through a bit more..
> 
> Definitely we need to think a lot more but point agreed. Why is thermal
> framework calling orderly_poweroff multiple times? Say even if you
> manage to shut off in 2 seconds you still end up calling 4 to 8 times
> depending on 500mS or 250mS delay.

I agree here. Also, a graceful thermal shutdown may also mean displaying
a message, etc. In this case, you have to size the trip properly,
accounting for the shutdown time and your reliability expectations.


> 
> > 
> >>>> If we're using orderly_poweroff() for emergency power off, we have
> >>>> to
> >>>> use it correctly.
> >>>>
> > 
> > I agree. But there it nothing that says it is not reenterable. If you
> > saw something in this line, can you please share?
> > 
> >>>> will you generate a patch to do this?
> >>> Sure. I will generate a patch to take care of 1) To make sure that
> >>> orderly_poweroff is called only once right away. I have already
> >>> tested.
> >>>
> >>> for 2) Cancel all the scheduled work queues to monitor the
> >>> temperature.
> >>> I will take some more time to make it and test.
> >>>
> >>> Is that okay? Or you want me to send both together?
> >>>
> >> I think you can send patch for step 1 first.
> > 
> > I am happy to see that Keerthy found the problem with his setup and a
> > possible solution. But I have a few concerns here.
> > 
> > 1. If regular shutdown process takes 10seconds, that is a ballpark that
> > thermal should never wait. orderly_poweroff() calls run_cmd() with wait
> > flag set. That means, if regular userland shutdown takes 10s, we are
> > waiting for it. Obviously this not acceptable. Specially if you setup
> > critical trip to be 125C. Now, if you properly size the critical trip to
> > fire before hotspot really reach 125C, for 10s (or the time it takes to
> > shutdown), then fine. But based on what was described in this thread,
> > his system is waiting 10s on regular shutdown, and his silicon is on
> > out-of-spec temperature for 10s, which is wrong.
> 
> 2 approaches can be taken here:
> 
> 1) Reduce the critical temperature to something lesser than the hardware
> critical point.
> 
> Or
> 
> 2) Call kernel_power_off directly as you are in a pretty critical
> situation! That only takes less than a second and powers off the PMIC at
> least on OMAP5/DRA7.


I think the code needs to allow doing both, actually. Considering both
the silicon and system reliability, and userland (and end user)
interaction, the thermal shutdown typically needs to:
1. Make sure it avoids reliability problems, i.e., one shall not allow
the device to run at out-of-spec temperature.
2. Give the system the opportunity to shut down gracefully, so you
have the time to keep the system state sane (save your data, notify the
user, etc), even if you are on a 56K modem :-)

> 
> > 
> > 2. The above scenario is not acceptable in a long run, specially from a
> > reliability perspective. If orderly_poweroff() has a possibility to
> > simply never return (or take too long), I would say the thermal
> > subsystem is using the wrong API.
> 
> As mentioned above kernel_power_off?
> 
> > 
> > 
> > If you are going to implement the above two patches, keep in mind:
> > i. At least within the thermal subsystem, you need to take care of all
> > zones that could trigger a shutdown.
> 
> Do you think it makes sense for all the 'n' sensors to trigger
> orderly_poweroff one by one? Or we should worry about the first source
> and ensure that it shuts off the system?
> 
> Is it not enough to catch the first critical alert and poweroff

I think it is enough if we make sure the first one goes through
properly. For accountability purposes, some people would also like to
know if other sensors are too hot too, and could also be firing the
shutdown.

Making sure that the first shutdown goes all the way through, and
blocking any other thermal shutdowns, is enough. Then again, I do
not think you need to cancel all the monitoring in the system.

Given the above points, my suggestion is to:
1. Still call orderly_poweroff(), so you still give userland the
opportunity to power off gracefully.
2. But still make sure that, once one of the zones hits critical, no
other zone will call orderly_poweroff().
3. Also, when in the critical path, make sure there is no way back and no
long delays, allowing the system engineer to size the shutdown wait.
Shutdown wait is a system property, not a zone property. That is, we
eventually call kernel_power_off().

All in all, 1. and 2. above are part of what you found and what has been
proposed to make sure we call orderly_poweroff() only once, system wide
(or at least thermal subsystem wide). And 3. is pretty much the proposed
patch in this series. I think this still needs to go in, and I am
convinced that the thermal core is the best place to write the backup
mechanism, given the expected variability of orderly_poweroff().
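
To make 1. to 3. a bit more concrete, here is a rough userspace model
(not the patch itself; every name in it is made up for illustration, and
time is passed in by the caller instead of coming from a delayed work):

```c
#include <stdbool.h>

enum shutdown_action {
	ACTION_NONE,		/* nothing to do on this call */
	ACTION_ORDERLY,		/* ask userland to power off (orderly_poweroff) */
	ACTION_FORCE,		/* stop waiting (kernel_power_off) */
};

/* One instance for the whole system, not one per zone. */
struct thermal_shutdown {
	bool in_progress;		/* a critical trip was already handled */
	bool forced;			/* ACTION_FORCE was already handed out */
	unsigned long deadline_ms;	/* when the orderly attempt expires */
	unsigned long wait_ms;		/* system-wide shutdown wait */
};

/* Called by any zone that crosses its critical trip at time now_ms. */
enum shutdown_action thermal_shutdown_critical(struct thermal_shutdown *s,
					       unsigned long now_ms)
{
	if (s->in_progress)
		return ACTION_NONE;	/* 1. and 2.: only the first zone acts */

	s->in_progress = true;
	s->deadline_ms = now_ms + s->wait_ms;
	return ACTION_ORDERLY;
}

/* Called periodically at time now_ms to check the backup deadline. */
enum shutdown_action thermal_shutdown_tick(struct thermal_shutdown *s,
					   unsigned long now_ms)
{
	if (!s->in_progress || s->forced || now_ms < s->deadline_ms)
		return ACTION_NONE;

	s->forced = true;		/* 3.: no way back once the wait is over */
	return ACTION_FORCE;
}
```

The wait lives in the single shared structure, which is one way of
treating it as a system property instead of a zone property.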

BR,

Eduardo Valentin
Eduardo Valentin April 12, 2017, 4:54 p.m. UTC | #17
Keerthy,

On Wed, Apr 12, 2017 at 10:14:36PM +0530, Keerthy wrote:
> 
> 
> On Wednesday 12 April 2017 10:01 PM, Grygorii Strashko wrote:
> > 
> > 
> > On 04/12/2017 10:44 AM, Eduardo Valentin wrote:
> >> Hello,
> >>
> > ...
> > 
> >>
> >> I agree. But there it nothing that says it is not reenterable. If you
> >> saw something in this line, can you please share?
> >>
> >>>>> will you generate a patch to do this?
> >>>> Sure. I will generate a patch to take care of 1) To make sure that
> >>>> orderly_poweroff is called only once right away. I have already
> >>>> tested.
> >>>>
> >>>> for 2) Cancel all the scheduled work queues to monitor the
> >>>> temperature.
> >>>> I will take some more time to make it and test.
> >>>>
> >>>> Is that okay? Or you want me to send both together?
> >>>>
> >>> I think you can send patch for step 1 first.
> >>
> >> I am happy to see that Keerthy found the problem with his setup and a
> >> possible solution. But I have a few concerns here.
> >>
> >> 1. If regular shutdown process takes 10seconds, that is a ballpark that
> >> thermal should never wait. orderly_poweroff() calls run_cmd() with wait
> >> flag set. That means, if regular userland shutdown takes 10s, we are
> >> waiting for it. Obviously this not acceptable. Specially if you setup
> >> critical trip to be 125C. Now, if you properly size the critical trip to
> >> fire before hotspot really reach 125C, for 10s (or the time it takes to
> >> shutdown), then fine. But based on what was described in this thread,
> >> his system is waiting 10s on regular shutdown, and his silicon is on
> >> out-of-spec temperature for 10s, which is wrong.
> >>
> >> 2. The above scenario is not acceptable in a long run, specially from a
> >> reliability perspective. If orderly_poweroff() has a possibility to
> >> simply never return (or take too long), I would say the thermal
> >> subsystem is using the wrong API.
> >>
> > 
> > 
> > Hh, I do not see that orderly_poweroff() will wait for anything now:
> > void orderly_poweroff(bool force)
> > {
> > 	if (force) /* do not override the pending "true" */
> > 		poweroff_force = true;
> > 	schedule_work(&poweroff_work); 
> > ^^^^^^^ async call. even here can be pretty big delay if system is under pressure
> > }
> > 
> > 
> > static int __orderly_poweroff(bool force)
> > {
> > 	int ret;
> > 
> > 	ret = run_cmd(poweroff_cmd);
> 
> When i tried with multiple orderly_poweroff calls ret was always 0.
> So every 250mS i see this ret = 0.
> 
> > ^^^^ no wait for the process - only for exec. flags == UMH_WAIT_EXEC
> > 
> > 	if (ret && force) {
> 
> So it never entered this path. ret = 0 so if is not executed.

I think your setup has two major problems then:
1. When the kernel runs the userspace power off, the exec succeeds, but in
fact the power off is not triggered.
2. When you finally exec it, it takes 5 s to finish.

If this is correct, I think my suggestions in the other email
still hold.

BR,
J, KEERTHY April 12, 2017, 5:07 p.m. UTC | #18
On Wednesday 12 April 2017 10:24 PM, Eduardo Valentin wrote:
> Keerthy,
> 
> On Wed, Apr 12, 2017 at 10:14:36PM +0530, Keerthy wrote:
>>
>>
>> On Wednesday 12 April 2017 10:01 PM, Grygorii Strashko wrote:
>>>
>>>
>>> On 04/12/2017 10:44 AM, Eduardo Valentin wrote:
>>>> Hello,
>>>>
>>> ...
>>>
>>>>
>>>> I agree. But there it nothing that says it is not reenterable. If you
>>>> saw something in this line, can you please share?
>>>>
>>>>>>> will you generate a patch to do this?
>>>>>> Sure. I will generate a patch to take care of 1) To make sure that
>>>>>> orderly_poweroff is called only once right away. I have already
>>>>>> tested.
>>>>>>
>>>>>> for 2) Cancel all the scheduled work queues to monitor the
>>>>>> temperature.
>>>>>> I will take some more time to make it and test.
>>>>>>
>>>>>> Is that okay? Or you want me to send both together?
>>>>>>
>>>>> I think you can send patch for step 1 first.
>>>>
>>>> I am happy to see that Keerthy found the problem with his setup and a
>>>> possible solution. But I have a few concerns here.
>>>>
>>>> 1. If regular shutdown process takes 10seconds, that is a ballpark that
>>>> thermal should never wait. orderly_poweroff() calls run_cmd() with wait
>>>> flag set. That means, if regular userland shutdown takes 10s, we are
>>>> waiting for it. Obviously this not acceptable. Specially if you setup
>>>> critical trip to be 125C. Now, if you properly size the critical trip to
>>>> fire before hotspot really reach 125C, for 10s (or the time it takes to
>>>> shutdown), then fine. But based on what was described in this thread,
>>>> his system is waiting 10s on regular shutdown, and his silicon is on
>>>> out-of-spec temperature for 10s, which is wrong.
>>>>
>>>> 2. The above scenario is not acceptable in a long run, specially from a
>>>> reliability perspective. If orderly_poweroff() has a possibility to
>>>> simply never return (or take too long), I would say the thermal
>>>> subsystem is using the wrong API.
>>>>
>>>
>>>
>>> Hh, I do not see that orderly_poweroff() will wait for anything now:
>>> void orderly_poweroff(bool force)
>>> {
>>> 	if (force) /* do not override the pending "true" */
>>> 		poweroff_force = true;
>>> 	schedule_work(&poweroff_work); 
>>> ^^^^^^^ async call. even here can be pretty big delay if system is under pressure
>>> }
>>>
>>>
>>> static int __orderly_poweroff(bool force)
>>> {
>>> 	int ret;
>>>
>>> 	ret = run_cmd(poweroff_cmd);
>>
>> When i tried with multiple orderly_poweroff calls ret was always 0.
>> So every 250mS i see this ret = 0.
>>
>>> ^^^^ no wait for the process - only for exec. flags == UMH_WAIT_EXEC
>>>
>>> 	if (ret && force) {
>>
>> So it never entered this path. ret = 0 so if is not executed.
> 
> I think your setup has two major problems then:
> 1. when kernel runs userspace power off, it execs properly, in fact, it
> is not triggered.

It does work cleanly when orderly_poweroff is called once: it gracefully
shuts down the system. The problem I see is when run_cmd is called
repeatedly, once every 250 ms.
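
A minimal userspace sketch of the "call it only once" idea, assuming a
simple atomic flag is enough to serialize the request. The names
fake_orderly_poweroff() and request_poweroff_once() are hypothetical and
are not taken from the patch; the real fix may look different.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for orderly_poweroff(); it only counts calls. */
static int poweroff_requests;

static void fake_orderly_poweroff(bool force)
{
	(void)force;
	poweroff_requests++;
}

/* Set by the first critical-trip handler that requests a poweroff. */
static atomic_flag shutdown_requested = ATOMIC_FLAG_INIT;

/*
 * Meant to be called on every poll of a zone that is above its critical
 * trip. Only the first caller issues the request; later polls return
 * false without starting another poweroff.
 */
static bool request_poweroff_once(void)
{
	if (atomic_flag_test_and_set(&shutdown_requested))
		return false;	/* a poweroff request is already in flight */

	fake_orderly_poweroff(true);
	return true;
}
```

With a guard like this, polling every 250 ms would still reach the
handler, but only one poweroff helper would be started.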

> 2. when you finally exec it, it takes 5s to finish.

I will share the logs.

> 
> If this is correct, I think my suggestions on the other email
> still holds.
> 
> BR,
>
Grygorii Strashko April 12, 2017, 5:08 p.m. UTC | #19
On 04/12/2017 11:44 AM, Keerthy wrote:
> 
> 
> On Wednesday 12 April 2017 10:01 PM, Grygorii Strashko wrote:
>>
>>
>> On 04/12/2017 10:44 AM, Eduardo Valentin wrote:
>>> Hello,
>>>
>> ...
>>
>>>
>>> I agree. But there it nothing that says it is not reenterable. If you
>>> saw something in this line, can you please share?
>>>
>>>>>> will you generate a patch to do this?
>>>>> Sure. I will generate a patch to take care of 1) To make sure that
>>>>> orderly_poweroff is called only once right away. I have already
>>>>> tested.
>>>>>
>>>>> for 2) Cancel all the scheduled work queues to monitor the
>>>>> temperature.
>>>>> I will take some more time to make it and test.
>>>>>
>>>>> Is that okay? Or you want me to send both together?
>>>>>
>>>> I think you can send patch for step 1 first.
>>>
>>> I am happy to see that Keerthy found the problem with his setup and a
>>> possible solution. But I have a few concerns here.
>>>
>>> 1. If regular shutdown process takes 10seconds, that is a ballpark that
>>> thermal should never wait. orderly_poweroff() calls run_cmd() with wait
>>> flag set. That means, if regular userland shutdown takes 10s, we are
>>> waiting for it. Obviously this not acceptable. Specially if you setup
>>> critical trip to be 125C. Now, if you properly size the critical trip to
>>> fire before hotspot really reach 125C, for 10s (or the time it takes to
>>> shutdown), then fine. But based on what was described in this thread,
>>> his system is waiting 10s on regular shutdown, and his silicon is on
>>> out-of-spec temperature for 10s, which is wrong.
>>>
>>> 2. The above scenario is not acceptable in a long run, specially from a
>>> reliability perspective. If orderly_poweroff() has a possibility to
>>> simply never return (or take too long), I would say the thermal
>>> subsystem is using the wrong API.

^ this question just repeats what was already discussed in previous
versions of this patch: orderly_poweroff() is not good for critical
shutdown/poweroff, but what should be used instead?


>>>
>>
>>
>> Hh, I do not see that orderly_poweroff() will wait for anything now:
>> void orderly_poweroff(bool force)
>> {
>> 	if (force) /* do not override the pending "true" */
>> 		poweroff_force = true;
>> 	schedule_work(&poweroff_work); 
>> ^^^^^^^ async call. even here can be pretty big delay if system is under pressure
>> }
>>
>>
>> static int __orderly_poweroff(bool force)
>> {
>> 	int ret;
>>
>> 	ret = run_cmd(poweroff_cmd);
> 
> When i tried with multiple orderly_poweroff calls ret was always 0.
> So every 250mS i see this ret = 0.
> 
>> ^^^^ no wait for the process - only for exec. flags == UMH_WAIT_EXEC
>>
>> 	if (ret && force) {
> 
> So it never entered this path. ret = 0 so if is not executed.

Correct, because exec can find the poweroff tool and start it, so you most
probably have a bunch of instances of this tool running in parallel (some
of them can fail or block).
Issue 1, the one you've sent a fix for, is a real issue :).

Again, thermal has no control over the poweroff process once run_cmd() has
returned, and it does not know what the userspace (US) poweroff binary is
doing or how much time it can take (which includes disk maintenance, a
long delay).
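
To make the point about the return value concrete, here is a simplified
userspace model, not kernel code. It assumes that a caller waiting only
for exec sees an error only when the helper could not be started. All
names (model_wait, model_helper, model_run_cmd, and so on) are
hypothetical.

```c
#include <stdbool.h>

/* Simplified model: how long the caller waits for the helper. */
enum model_wait { MODEL_WAIT_EXEC, MODEL_WAIT_PROC };

/* What actually happens to the helper (hypothetical fields). */
struct model_helper {
	int exec_err;	/* 0 if the binary was found and started */
	int exit_code;	/* what the process eventually returns */
};

/* Returns what the caller would observe for the given wait mode. */
static int model_run_cmd(const struct model_helper *h, enum model_wait wait)
{
	if (h->exec_err)
		return h->exec_err;	/* helper never started */
	if (wait == MODEL_WAIT_EXEC)
		return 0;		/* started; later failures are invisible */
	return h->exit_code;
}

/* Mirrors the "if (ret && force)" test quoted in this thread. */
static bool model_forced_poweroff_taken(const struct model_helper *h,
					bool force)
{
	int ret = model_run_cmd(h, MODEL_WAIT_EXEC);

	return ret && force;
}
```

In this model a helper that starts but later fails or blocks never takes
the forced path, which matches the ret = 0 observation above.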

> 
>> 		pr_warn("Failed to start orderly shutdown: forcing the issue\n");
>>
>> 		/*
>> 		 * I guess this should try to kick off some daemon to sync and
>> 		 * poweroff asap.  Or not even bother syncing if we're doing an
>> 		 * emergency shutdown?
>> 		 */
>> 		emergency_sync();
>> 		kernel_power_off();
>> ^^^ force power off, but only if run_cmd() failed - for example /sbin/poweroff doesn't exist
>> 	}
>>
>> 	return ret;
>> }
>>
>> static bool poweroff_force;
>>
>> static void poweroff_work_func(struct work_struct *work)
>> {
>> 	__orderly_poweroff(poweroff_force);
>> }
>>
>> As result thermal has no control of power off any more after calling orderly_poweroff() and can get the result
>> of US poweroff binary execution.
>>
>>>
>>> If you are going to implement the above two patches, keep in mind:
>>> i. At least within the thermal subsystem, you need to take care of all
>>> zones that could trigger a shutdown.
>>> ii. serializing the calls to orderly_poweroff() seams to be more
>>> concerning than cancelling all monitoring.
>>>
>>>
>>
J, KEERTHY April 12, 2017, 5:11 p.m. UTC | #20
On Wednesday 12 April 2017 10:38 PM, Grygorii Strashko wrote:
> 
> 
> On 04/12/2017 11:44 AM, Keerthy wrote:
>>
>>
>> On Wednesday 12 April 2017 10:01 PM, Grygorii Strashko wrote:
>>>
>>>
>>> On 04/12/2017 10:44 AM, Eduardo Valentin wrote:
>>>> Hello,
>>>>
>>> ...
>>>
>>>>
>>>> I agree. But there it nothing that says it is not reenterable. If you
>>>> saw something in this line, can you please share?
>>>>
>>>>>>> will you generate a patch to do this?
>>>>>> Sure. I will generate a patch to take care of 1) To make sure that
>>>>>> orderly_poweroff is called only once right away. I have already
>>>>>> tested.
>>>>>>
>>>>>> for 2) Cancel all the scheduled work queues to monitor the
>>>>>> temperature.
>>>>>> I will take some more time to make it and test.
>>>>>>
>>>>>> Is that okay? Or you want me to send both together?
>>>>>>
>>>>> I think you can send patch for step 1 first.
>>>>
>>>> I am happy to see that Keerthy found the problem with his setup and a
>>>> possible solution. But I have a few concerns here.
>>>>
>>>> 1. If regular shutdown process takes 10seconds, that is a ballpark that
>>>> thermal should never wait. orderly_poweroff() calls run_cmd() with wait
>>>> flag set. That means, if regular userland shutdown takes 10s, we are
>>>> waiting for it. Obviously this not acceptable. Specially if you setup
>>>> critical trip to be 125C. Now, if you properly size the critical trip to
>>>> fire before hotspot really reach 125C, for 10s (or the time it takes to
>>>> shutdown), then fine. But based on what was described in this thread,
>>>> his system is waiting 10s on regular shutdown, and his silicon is on
>>>> out-of-spec temperature for 10s, which is wrong.
>>>>
>>>> 2. The above scenario is not acceptable in a long run, specially from a
>>>> reliability perspective. If orderly_poweroff() has a possibility to
>>>> simply never return (or take too long), I would say the thermal
>>>> subsystem is using the wrong API.
> 
> ^ this question just repeat everything which was already discussed in
> previous versions of this patch - orderly_poweroff() is not good for critical shutdown/poweroff,
> but what to use instead?
> 
> 
>>>>
>>>
>>>
>>> Hh, I do not see that orderly_poweroff() will wait for anything now:
>>> void orderly_poweroff(bool force)
>>> {
>>> 	if (force) /* do not override the pending "true" */
>>> 		poweroff_force = true;
>>> 	schedule_work(&poweroff_work); 
>>> ^^^^^^^ async call. even here can be pretty big delay if system is under pressure
>>> }
>>>
>>>
>>> static int __orderly_poweroff(bool force)
>>> {
>>> 	int ret;
>>>
>>> 	ret = run_cmd(poweroff_cmd);
>>
>> When i tried with multiple orderly_poweroff calls ret was always 0.
>> So every 250mS i see this ret = 0.
>>
>>> ^^^^ no wait for the process - only for exec. flags == UMH_WAIT_EXEC
>>>
>>> 	if (ret && force) {
>>
>> So it never entered this path. ret = 0 so if is not executed.
> 
> correct, because exec can find poweroff tool and start it, so you,
> most probably, have bunch of this tool instance running in parallel (some of them can fail or block)
> Issue 1 - you've sent fix for is actual :).

Precisely yes!

> 
> Again, thermal has no control of power off process once  run_cmd() is returned,
> and it do not know what US poweroff binary is doing and how much time can it take
> (which include disks maintenance - loooong delay).
> 
>>
>>> 		pr_warn("Failed to start orderly shutdown: forcing the issue\n");
>>>
>>> 		/*
>>> 		 * I guess this should try to kick off some daemon to sync and
>>> 		 * poweroff asap.  Or not even bother syncing if we're doing an
>>> 		 * emergency shutdown?
>>> 		 */
>>> 		emergency_sync();
>>> 		kernel_power_off();
>>> ^^^ force power off, but only if run_cmd() failed - for example /sbin/poweroff doesn't exist
>>> 	}
>>>
>>> 	return ret;
>>> }
>>>
>>> static bool poweroff_force;
>>>
>>> static void poweroff_work_func(struct work_struct *work)
>>> {
>>> 	__orderly_poweroff(poweroff_force);
>>> }
>>>
>>> As result thermal has no control of power off any more after calling orderly_poweroff() and can get the result
>>> of US poweroff binary execution.
>>>
>>>>
>>>> If you are going to implement the above two patches, keep in mind:
>>>> i. At least within the thermal subsystem, you need to take care of all
>>>> zones that could trigger a shutdown.
>>>> ii. serializing the calls to orderly_poweroff() seams to be more
>>>> concerning than cancelling all monitoring.
>>>>
>>>>
>>>
>
Eduardo Valentin April 12, 2017, 5:24 p.m. UTC | #21
On Wed, Apr 12, 2017 at 10:41:00PM +0530, Keerthy wrote:
> 
> 
> On Wednesday 12 April 2017 10:38 PM, Grygorii Strashko wrote:
> > 
> > 
> > On 04/12/2017 11:44 AM, Keerthy wrote:
> >>
> >>
> >> On Wednesday 12 April 2017 10:01 PM, Grygorii Strashko wrote:
> >>>
> >>>
> >>> On 04/12/2017 10:44 AM, Eduardo Valentin wrote:
> >>>> Hello,
> >>>>
> >>> ...
> >>>
> >>>>
> >>>> I agree. But there it nothing that says it is not reenterable. If you
> >>>> saw something in this line, can you please share?
> >>>>
> >>>>>>> will you generate a patch to do this?
> >>>>>> Sure. I will generate a patch to take care of 1) To make sure that
> >>>>>> orderly_poweroff is called only once right away. I have already
> >>>>>> tested.
> >>>>>>
> >>>>>> for 2) Cancel all the scheduled work queues to monitor the
> >>>>>> temperature.
> >>>>>> I will take some more time to make it and test.
> >>>>>>
> >>>>>> Is that okay? Or you want me to send both together?
> >>>>>>
> >>>>> I think you can send patch for step 1 first.
> >>>>
> >>>> I am happy to see that Keerthy found the problem with his setup and a
> >>>> possible solution. But I have a few concerns here.
> >>>>
> >>>> 1. If regular shutdown process takes 10seconds, that is a ballpark that
> >>>> thermal should never wait. orderly_poweroff() calls run_cmd() with wait
> >>>> flag set. That means, if regular userland shutdown takes 10s, we are
> >>>> waiting for it. Obviously this not acceptable. Specially if you setup
> >>>> critical trip to be 125C. Now, if you properly size the critical trip to
> >>>> fire before hotspot really reach 125C, for 10s (or the time it takes to
> >>>> shutdown), then fine. But based on what was described in this thread,
> >>>> his system is waiting 10s on regular shutdown, and his silicon is on
> >>>> out-of-spec temperature for 10s, which is wrong.
> >>>>
> >>>> 2. The above scenario is not acceptable in a long run, specially from a
> >>>> reliability perspective. If orderly_poweroff() has a possibility to
> >>>> simply never return (or take too long), I would say the thermal
> >>>> subsystem is using the wrong API.
> > 
> > ^ this question just repeat everything which was already discussed in
> > previous versions of this patch - orderly_poweroff() is not good for critical shutdown/poweroff,
> > but what to use instead?

It is still useful on a properly sized system. The point is that the
thermal subsystem still wants to give one opportunity to gracefully shut
down the running system in a thermal scenario, as I explained in the other
email. But you have to do this while accounting for the down time and your
reliability concerns.

> > 
> > 
> >>>>
> >>>
> >>>
> >>> Hh, I do not see that orderly_poweroff() will wait for anything now:
> >>> void orderly_poweroff(bool force)
> >>> {
> >>> 	if (force) /* do not override the pending "true" */
> >>> 		poweroff_force = true;
> >>> 	schedule_work(&poweroff_work); 
> >>> ^^^^^^^ async call. even here can be pretty big delay if system is under pressure
> >>> }
> >>>
> >>>
> >>> static int __orderly_poweroff(bool force)
> >>> {
> >>> 	int ret;
> >>>
> >>> 	ret = run_cmd(poweroff_cmd);
> >>
> >> When i tried with multiple orderly_poweroff calls ret was always 0.
> >> So every 250mS i see this ret = 0.
> >>
> >>> ^^^^ no wait for the process - only for exec. flags == UMH_WAIT_EXEC
> >>>
> >>> 	if (ret && force) {
> >>
> >> So it never entered this path. ret = 0 so if is not executed.
> > 
> > correct, because exec can find poweroff tool and start it, so you,
> > most probably, have bunch of this tool instance running in parallel (some of them can fail or block)
> > Issue 1 - you've sent fix for is actual :).
> 
> Precisely yes!
> 

As I mentioned, the fix is twofold: a. avoid spamming
orderly_poweroff(), and b. make sure we eventually shut down.

> > 
> > Again, thermal has no control of power off process once  run_cmd() is returned,
> > and it do not know what US poweroff binary is doing and how much time can it take
> > (which include disks maintenance - loooong delay).
> > 
> >>
> >>> 		pr_warn("Failed to start orderly shutdown: forcing the issue\n");
> >>>
> >>> 		/*
> >>> 		 * I guess this should try to kick off some daemon to sync and
> >>> 		 * poweroff asap.  Or not even bother syncing if we're doing an
> >>> 		 * emergency shutdown?
> >>> 		 */
> >>> 		emergency_sync();
> >>> 		kernel_power_off();
> >>> ^^^ force power off, but only if run_cmd() failed - for example /sbin/poweroff doesn't exist
> >>> 	}
> >>>
> >>> 	return ret;
> >>> }
> >>>
> >>> static bool poweroff_force;
> >>>
> >>> static void poweroff_work_func(struct work_struct *work)
> >>> {
> >>> 	__orderly_poweroff(poweroff_force);
> >>> }
> >>>
> >>> As result thermal has no control of power off any more after calling orderly_poweroff() and can get the result
> >>> of US poweroff binary execution.
> >>>
> >>>>
> >>>> If you are going to implement the above two patches, keep in mind:
> >>>> i. At least within the thermal subsystem, you need to take care of all
> >>>> zones that could trigger a shutdown.
> >>>> ii. serializing the calls to orderly_poweroff() seams to be more
> >>>> concerning than cancelling all monitoring.
> >>>>
> >>>>
> >>>
> >
Tero Kristo April 12, 2017, 6:43 p.m. UTC | #22
On 12/04/17 20:24, Eduardo Valentin wrote:
> On Wed, Apr 12, 2017 at 10:41:00PM +0530, Keerthy wrote:
>>
>>
>> On Wednesday 12 April 2017 10:38 PM, Grygorii Strashko wrote:
>>>
>>>
>>> On 04/12/2017 11:44 AM, Keerthy wrote:
>>>>
>>>>
>>>> On Wednesday 12 April 2017 10:01 PM, Grygorii Strashko wrote:
>>>>>
>>>>>
>>>>> On 04/12/2017 10:44 AM, Eduardo Valentin wrote:
>>>>>> Hello,
>>>>>>
>>>>> ...
>>>>>
>>>>>>
>>>>>> I agree. But there it nothing that says it is not reenterable. If you
>>>>>> saw something in this line, can you please share?
>>>>>>
>>>>>>>>> will you generate a patch to do this?
>>>>>>>> Sure. I will generate a patch to take care of 1) To make sure that
>>>>>>>> orderly_poweroff is called only once right away. I have already
>>>>>>>> tested.
>>>>>>>>
>>>>>>>> for 2) Cancel all the scheduled work queues to monitor the
>>>>>>>> temperature.
>>>>>>>> I will take some more time to make it and test.
>>>>>>>>
>>>>>>>> Is that okay? Or you want me to send both together?
>>>>>>>>
>>>>>>> I think you can send patch for step 1 first.
>>>>>>
>>>>>> I am happy to see that Keerthy found the problem with his setup and a
>>>>>> possible solution. But I have a few concerns here.
>>>>>>
>>>>>> 1. If regular shutdown process takes 10seconds, that is a ballpark that
>>>>>> thermal should never wait. orderly_poweroff() calls run_cmd() with wait
>>>>>> flag set. That means, if regular userland shutdown takes 10s, we are
>>>>>> waiting for it. Obviously this not acceptable. Specially if you setup
>>>>>> critical trip to be 125C. Now, if you properly size the critical trip to
>>>>>> fire before hotspot really reach 125C, for 10s (or the time it takes to
>>>>>> shutdown), then fine. But based on what was described in this thread,
>>>>>> his system is waiting 10s on regular shutdown, and his silicon is on
>>>>>> out-of-spec temperature for 10s, which is wrong.
>>>>>>
>>>>>> 2. The above scenario is not acceptable in a long run, specially from a
>>>>>> reliability perspective. If orderly_poweroff() has a possibility to
>>>>>> simply never return (or take too long), I would say the thermal
>>>>>> subsystem is using the wrong API.
>>>
>>> ^ this question just repeat everything which was already discussed in
>>> previous versions of this patch - orderly_poweroff() is not good for critical shutdown/poweroff,
>>> but what to use instead?
>
> It is still useful on a properly sized system. The point is the thermal
> subsystem still wants to give one opportunity to gracefully shutdown the
> running system on a thermal scenario, as I explained in the other email.
> But, you have to do this accounting the down time, and your reliability
> concerns.
>
>>>
>>>
>>>>>>
>>>>>
>>>>>
>>>>> Hh, I do not see that orderly_poweroff() will wait for anything now:
>>>>> void orderly_poweroff(bool force)
>>>>> {
>>>>> 	if (force) /* do not override the pending "true" */
>>>>> 		poweroff_force = true;
>>>>> 	schedule_work(&poweroff_work);
>>>>> ^^^^^^^ async call. even here can be pretty big delay if system is under pressure
>>>>> }
>>>>>
>>>>>
>>>>> static int __orderly_poweroff(bool force)
>>>>> {
>>>>> 	int ret;
>>>>>
>>>>> 	ret = run_cmd(poweroff_cmd);
>>>>
>>>> When i tried with multiple orderly_poweroff calls ret was always 0.
>>>> So every 250mS i see this ret = 0.
>>>>
>>>>> ^^^^ no wait for the process - only for exec. flags == UMH_WAIT_EXEC
>>>>>
>>>>> 	if (ret && force) {
>>>>
>>>> So it never entered this path. ret = 0 so if is not executed.
>>>
>>> correct, because exec can find poweroff tool and start it, so you,
>>> most probably, have bunch of this tool instance running in parallel (some of them can fail or block)
>>> Issue 1 - you've sent fix for is actual :).
>>
>> Precisely yes!
>>
>
> As I mentioned, the fix is a two fold, a. avoid spam of
> orderly_poweroff(), but make sure eventually we shutdown.

Just chiming in here a bit myself: the long latencies in executing
poweroff are basically because, in our case, it will do all of
the following:

- stop all running daemons
- kill all remaining processes
- unload all modules
- sync / unmount all filesystems
- etc.
- poweroff the system when everything else has been gracefully done

The order of these things is not necessarily what I listed above, but
overall it takes quite a bit of time. It doesn't matter if you execute
all of this over NFS, SD card, or ramdisk; it is a long procedure.

-Tero

>
>>>
>>> Again, thermal has no control of power off process once  run_cmd() is returned,
>>> and it do not know what US poweroff binary is doing and how much time can it take
>>> (which include disks maintenance - loooong delay).
>>>
>>>>
>>>>> 		pr_warn("Failed to start orderly shutdown: forcing the issue\n");
>>>>>
>>>>> 		/*
>>>>> 		 * I guess this should try to kick off some daemon to sync and
>>>>> 		 * poweroff asap.  Or not even bother syncing if we're doing an
>>>>> 		 * emergency shutdown?
>>>>> 		 */
>>>>> 		emergency_sync();
>>>>> 		kernel_power_off();
>>>>> ^^^ force power off, but only if run_cmd() failed - for example /sbin/poweroff doesn't exist
>>>>> 	}
>>>>>
>>>>> 	return ret;
>>>>> }
>>>>>
>>>>> static bool poweroff_force;
>>>>>
>>>>> static void poweroff_work_func(struct work_struct *work)
>>>>> {
>>>>> 	__orderly_poweroff(poweroff_force);
>>>>> }
>>>>>
>>>>> As result thermal has no control of power off any more after calling orderly_poweroff() and can get the result
>>>>> of US poweroff binary execution.
>>>>>
>>>>>>
>>>>>> If you are going to implement the above two patches, keep in mind:
>>>>>> i. At least within the thermal subsystem, you need to take care of all
>>>>>> zones that could trigger a shutdown.
>>>>>> ii. serializing the calls to orderly_poweroff() seams to be more
>>>>>> concerning than cancelling all monitoring.
>>>>>>
>>>>>>
>>>>>
>>>
J, KEERTHY April 13, 2017, 3:50 a.m. UTC | #23
On Thursday 13 April 2017 12:13 AM, Tero Kristo wrote:
> On 12/04/17 20:24, Eduardo Valentin wrote:
>> On Wed, Apr 12, 2017 at 10:41:00PM +0530, Keerthy wrote:
>>>
>>>
>>> On Wednesday 12 April 2017 10:38 PM, Grygorii Strashko wrote:
>>>>
>>>>
>>>> On 04/12/2017 11:44 AM, Keerthy wrote:
>>>>>
>>>>>
>>>>> On Wednesday 12 April 2017 10:01 PM, Grygorii Strashko wrote:
>>>>>>
>>>>>>
>>>>>> On 04/12/2017 10:44 AM, Eduardo Valentin wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>> ...
>>>>>>
>>>>>>>
>>>>>>> I agree. But there it nothing that says it is not reenterable. If
>>>>>>> you
>>>>>>> saw something in this line, can you please share?
>>>>>>>
>>>>>>>>>> will you generate a patch to do this?
>>>>>>>>> Sure. I will generate a patch to take care of 1) To make sure that
>>>>>>>>> orderly_poweroff is called only once right away. I have already
>>>>>>>>> tested.
>>>>>>>>>
>>>>>>>>> for 2) Cancel all the scheduled work queues to monitor the
>>>>>>>>> temperature.
>>>>>>>>> I will take some more time to make it and test.
>>>>>>>>>
>>>>>>>>> Is that okay? Or you want me to send both together?
>>>>>>>>>
>>>>>>>> I think you can send patch for step 1 first.
>>>>>>>
>>>>>>> I am happy to see that Keerthy found the problem with his setup
>>>>>>> and a
>>>>>>> possible solution. But I have a few concerns here.
>>>>>>>
>>>>>>> 1. If regular shutdown process takes 10seconds, that is a
>>>>>>> ballpark that
>>>>>>> thermal should never wait. orderly_poweroff() calls run_cmd()
>>>>>>> with wait
>>>>>>> flag set. That means, if regular userland shutdown takes 10s, we are
>>>>>>> waiting for it. Obviously this not acceptable. Specially if you
>>>>>>> setup
>>>>>>> critical trip to be 125C. Now, if you properly size the critical
>>>>>>> trip to
>>>>>>> fire before hotspot really reach 125C, for 10s (or the time it
>>>>>>> takes to
>>>>>>> shutdown), then fine. But based on what was described in this
>>>>>>> thread,
>>>>>>> his system is waiting 10s on regular shutdown, and his silicon is on
>>>>>>> out-of-spec temperature for 10s, which is wrong.
>>>>>>>
>>>>>>> 2. The above scenario is not acceptable in a long run, specially
>>>>>>> from a
>>>>>>> reliability perspective. If orderly_poweroff() has a possibility to
>>>>>>> simply never return (or take too long), I would say the thermal
>>>>>>> subsystem is using the wrong API.
>>>>
>>>> ^ this question just repeat everything which was already discussed in
>>>> previous versions of this patch - orderly_poweroff() is not good for
>>>> critical shutdown/poweroff,
>>>> but what to use instead?
>>
>> It is still useful on a properly sized system. The point is the thermal
>> subsystem still wants to give one opportunity to gracefully shutdown the
>> running system on a thermal scenario, as I explained in the other email.
>> But, you have to do this accounting the down time, and your reliability
>> concerns.
>>
>>>>
>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> Hh, I do not see that orderly_poweroff() will wait for anything now:
>>>>>> void orderly_poweroff(bool force)
>>>>>> {
>>>>>>     if (force) /* do not override the pending "true" */
>>>>>>         poweroff_force = true;
>>>>>>     schedule_work(&poweroff_work);
>>>>>> ^^^^^^^ async call. even here can be pretty big delay if system is
>>>>>> under pressure
>>>>>> }
>>>>>>
>>>>>>
>>>>>> static int __orderly_poweroff(bool force)
>>>>>> {
>>>>>>     int ret;
>>>>>>
>>>>>>     ret = run_cmd(poweroff_cmd);
>>>>>
>>>>> When i tried with multiple orderly_poweroff calls ret was always 0.
>>>>> So every 250mS i see this ret = 0.
>>>>>
>>>>>> ^^^^ no wait for the process - only for exec. flags == UMH_WAIT_EXEC
>>>>>>
>>>>>>     if (ret && force) {
>>>>>
>>>>> So it never entered this path. ret = 0 so if is not executed.
>>>>
>>>> correct, because exec can find poweroff tool and start it, so you,
>>>> most probably, have bunch of this tool instance running in parallel
>>>> (some of them can fail or block)
>>>> Issue 1 - you've sent fix for is actual :).
>>>
>>> Precisely yes!
>>>
>>
>> As I mentioned, the fix is a two fold, a. avoid spam of
>> orderly_poweroff(), but make sure eventually we shutdown.
> 
> Just chirping in here a bit myself also, the long latencies in the
> poweroff executing are basically because in our case it will do all of
> the following:
> 
> - stop all running daemons
> - kill all remaining processes
> - unload all modules
> - sync / unmount all filesystems
> - etc.
> - poweroff the system when everything else has been gracefully done
> 
> The order of these things are not necessarily what I listed above, but
> overall it takes quite a bit of time. It doesn't matter if you execute
> all of this over NFS or SD card or ramdisk, it is a long procedure.

Yes. Thanks for the pointers Tero.

As I had mentioned, here is the log on DRA72-EVM with the Arago filesystem
over NFS, on the next branch, with my patch to restrict orderly_poweroff
to one cycle.

http://pastebin.ubuntu.com/24371980/

I triggered thermal shutdown by using THERMAL_EMULATION.

5-10 s was on a good run; with a full-size filesystem over NFS,
orderly_poweroff is taking 30+ seconds.

I also profiled the timing of a poweroff command. That also takes more than
20 seconds. Here is the log:
http://pastebin.ubuntu.com/24372012/

As Eduardo pointed out, this is pretty long. I had two suggestions for that:

1) Decrease the thermal critical threshold to below the actual hardware
thermal shutdown threshold.

2) Have a backup thermal shutdown that calls kernel_power_off to shut the
system off directly if a configured time expires after we have triggered
orderly_poweroff.
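
Suggestion 1 is mostly arithmetic. A rough sketch, assuming a constant
worst-case heating rate (the 0.5 C/s used in the comment is a made-up
figure, not a measurement) and temperatures in millidegrees Celsius. The
function name is hypothetical.

```c
/*
 * Highest software critical trip (in millicelsius) that still leaves
 * shutdown_s seconds before the hardware limit is reached, assuming the
 * temperature keeps rising at heat_rate_mc_per_s.
 *
 * Example: a 125 C limit, an assumed 0.5 C/s rise and a 30 s shutdown
 * give 125000 - 500 * 30 = 110000, i.e. 110 C.
 */
static int max_critical_trip_mc(int hw_limit_mc, int heat_rate_mc_per_s,
				int shutdown_s)
{
	return hw_limit_mc - heat_rate_mc_per_s * shutdown_s;
}
```

The margin grows linearly with shutdown time, so a 30+ second poweroff
over NFS costs three times the headroom of a 10 second one.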

Regards,
Keerthy

> 
> -Tero
> 
>>
>>>>
>>>> Again, thermal has no control of power off process once  run_cmd()
>>>> is returned,
>>>> and it do not know what US poweroff binary is doing and how much
>>>> time can it take
>>>> (which include disks maintenance - loooong delay).
>>>>
>>>>>
>>>>>>         pr_warn("Failed to start orderly shutdown: forcing the
>>>>>> issue\n");
>>>>>>
>>>>>>         /*
>>>>>>          * I guess this should try to kick off some daemon to sync
>>>>>> and
>>>>>>          * poweroff asap.  Or not even bother syncing if we're
>>>>>> doing an
>>>>>>          * emergency shutdown?
>>>>>>          */
>>>>>>         emergency_sync();
>>>>>>         kernel_power_off();
>>>>>> ^^^ force power off, but only if run_cmd() failed - for example
>>>>>> /sbin/poweroff doesn't exist
>>>>>>     }
>>>>>>
>>>>>>     return ret;
>>>>>> }
>>>>>>
>>>>>> static bool poweroff_force;
>>>>>>
>>>>>> static void poweroff_work_func(struct work_struct *work)
>>>>>> {
>>>>>>     __orderly_poweroff(poweroff_force);
>>>>>> }
>>>>>>
>>>>>> As result thermal has no control of power off any more after
>>>>>> calling orderly_poweroff() and can get the result
>>>>>> of US poweroff binary execution.
>>>>>>
>>>>>>>
>>>>>>> If you are going to implement the above two patches, keep in mind:
>>>>>>> i. At least within the thermal subsystem, you need to take care
>>>>>>> of all
>>>>>>> zones that could trigger a shutdown.
>>>>>>> ii. serializing the calls to orderly_poweroff() seams to be more
>>>>>>> concerning than cancelling all monitoring.
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>

Patch

diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index 0a16cf4..4cc55f9 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -15,6 +15,19 @@  menuconfig THERMAL
 
 if THERMAL
 
+config THERMAL_EMERGENCY_POWEROFF_DELAY_MS
+	int "Emergency poweroff delay in milliseconds"
+	depends on THERMAL
+	default 0
+	help
+	  The number of milliseconds to delay before emergency
+	  poweroff kicks in. The delay should be carefully profiled
+	  so as to give adequate time for orderly_poweroff. In case
+	  of failure of an orderly_poweroff the emergency poweroff
+	  kicks in after the delay has elapsed and shuts down the system.
+
+	  If set to 0 poweroff will happen immediately.
+
 config THERMAL_HWMON
 	bool
 	prompt "Expose thermal sensors as hwmon device"
diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 11f0675..dc7fdd4 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -322,6 +322,47 @@  static void handle_non_critical_trips(struct thermal_zone_device *tz,
 		       def_governor->throttle(tz, trip);
 }
 
+/**
+ * emergency_poweroff_func - emergency poweroff work after a known delay
+ * @work: work_struct associated with the emergency poweroff function
+ *
+ * This function is called in very critical situations to force
+ * a kernel poweroff after a configurable timeout value.
+ */
+static void emergency_poweroff_func(struct work_struct *work)
+{
+	/*
+	 * We have reached here after the emergency thermal shutdown
+	 * waiting period has expired. This means orderly_poweroff has
+	 * not been able to shut off the system for some reason.
+	 * Try to shut down the system immediately using pm_power_off
+	 * if populated.
+	 */
+	pr_warn("Attempting kernel_power_off\n");
+	if (pm_power_off)
+		pm_power_off();
+
+	/*
+	 * Worst of the worst case: trigger emergency restart.
+	 */
+	pr_warn("kernel_power_off has failed! Attempting emergency_restart\n");
+	emergency_restart();
+}
+
+static DECLARE_DELAYED_WORK(emergency_poweroff_work, emergency_poweroff_func);
+
+/**
+ * thermal_emergency_poweroff - Trigger an emergency system poweroff
+ *
+ * This may be called from any critical situation to trigger a system shutdown
+ * after a known period of time. By default the delay is 0 milliseconds.
+ */
+void thermal_emergency_poweroff(void)
+{
+	schedule_delayed_work(&emergency_poweroff_work,
+			      msecs_to_jiffies(CONFIG_THERMAL_EMERGENCY_POWEROFF_DELAY_MS));
+}
+
 static void handle_critical_trips(struct thermal_zone_device *tz,
 				  int trip, enum thermal_trip_type trip_type)
 {
@@ -343,6 +384,7 @@  static void handle_critical_trips(struct thermal_zone_device *tz,
 			  "critical temperature reached(%d C),shutting down\n",
 			  tz->temperature / 1000);
 		orderly_poweroff(true);
+		thermal_emergency_poweroff();
 	}
 }
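
The Kconfig help in the patch says that a delay of 0 makes the poweroff
happen immediately, which would leave the orderly path little or no time
to run. One possible variant, sketched below as a userspace model and not
part of the patch as posted, treats a non-positive delay as "backup
disabled". The enum and function names are hypothetical.

```c
/* Possible outcomes of the backup check (hypothetical names). */
enum backup_action {
	BACKUP_DISABLED,	/* no backup poweroff configured */
	BACKUP_WAIT,		/* orderly poweroff still has time left */
	BACKUP_FORCE_POWEROFF	/* delay expired, force the poweroff */
};

/*
 * Decide what the backup mechanism should do, given the configured delay
 * and the time elapsed since the orderly poweroff was requested.
 */
static enum backup_action backup_poweroff_action(int delay_ms, int elapsed_ms)
{
	if (delay_ms <= 0)
		return BACKUP_DISABLED;
	if (elapsed_ms < delay_ms)
		return BACKUP_WAIT;
	return BACKUP_FORCE_POWEROFF;
}
```

With this policy the default of 0 keeps the existing behavior, and a
platform opts in by profiling its shutdown time and setting a delay.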