[3/3] arm64: dts: qcom: pm8998: Add thermal zone

Message ID 20180628210915.160893-3-mka@chromium.org (mailing list archive)
State New, archived

Commit Message

Matthias Kaehlcke June 28, 2018, 9:09 p.m. UTC
Add pm8998 thermal zone based on the examples in the spmi-temp-alarm
bindings.

Note: devices with the pm8998 need to have a 'thermal-zones' node (which
may be empty) with a label 'thermal_zones'.

Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
---
 arch/arm64/boot/dts/qcom/pm8998.dtsi | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

Comments

Doug Anderson June 28, 2018, 10:58 p.m. UTC | #1
Hi,

On Thu, Jun 28, 2018 at 2:09 PM, Matthias Kaehlcke <mka@chromium.org> wrote:
> Add pm8998 thermal zone based on the examples in the spmi-temp-alarm
> bindings.
>
> Note: devices with the pm8998 need to have a 'thermal-zones' node (which
> may be empty) with a label 'thermal_zones'.
>
> Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
> ---
>  arch/arm64/boot/dts/qcom/pm8998.dtsi | 28 ++++++++++++++++++++++++++++
>  1 file changed, 28 insertions(+)

Do you know if this patch actually does anything since you didn't
define a cooling-maps?  Hopefully at least the critical shuts things
down?

Do you have any idea how we'll eventually want to specify a
cooling-maps?  Are we just going to assume we're included by an sdm845
device and refer to the big/little CPU phandles?


> diff --git a/arch/arm64/boot/dts/qcom/pm8998.dtsi b/arch/arm64/boot/dts/qcom/pm8998.dtsi
> index f2d39074ed74..d85ceb4f976b 100644
> --- a/arch/arm64/boot/dts/qcom/pm8998.dtsi
> +++ b/arch/arm64/boot/dts/qcom/pm8998.dtsi
> @@ -3,6 +3,7 @@
>
>  #include <dt-bindings/spmi/spmi.h>
>  #include <dt-bindings/interrupt-controller/irq.h>
> +#include <dt-bindings/thermal/thermal.h>
>
>  &spmi_bus {
>         pm8998_lsid0: pmic@0 {
> @@ -59,3 +60,30 @@
>                 #size-cells = <0>;
>         };
>  };
> +
> +&thermal_zones {

As per comments in patch #1, don't rely on a label.  Just put your
stuff in a top-level "thermal-zones" node.

> +       pm8998 {
> +               polling-delay-passive = <250>;
> +               polling-delay = <1000>;
> +
> +               thermal-sensors = <&pm8998_temp>;
> +
> +               trips {
> +                       passive {

IMO you should proactively put a label on these trips even if there's
no cooling device yet.  It's expected that at some point someone will
add a cooling device and refer to them, right?
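
A minimal sketch of how both review comments could be addressed together: the zone is placed in a top-level thermal-zones node, so no 'thermal_zones' label is needed, and each trip carries a label. The label names (pm8998_passive, pm8998_crit) and the trip temperature, hysteresis and type values are assumptions for illustration; the trip values of the actual patch are not visible in this thread.

```dts
/ {
	thermal-zones {
		pm8998 {
			polling-delay-passive = <250>;
			polling-delay = <1000>;

			thermal-sensors = <&pm8998_temp>;

			trips {
				/* Label names are hypothetical; values are assumed. */
				pm8998_passive: passive {
					temperature = <105000>; /* millidegrees Celsius */
					hysteresis = <2000>;
					type = "passive";
				};

				pm8998_crit: crit {
					temperature = <125000>;
					hysteresis = <2000>;
					type = "critical";
				};
			};
		};
	};
};
```

Because dtc merges nodes with the same path, a board file that also declares /thermal-zones would simply be combined with this one.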



-Doug
Matthias Kaehlcke June 29, 2018, 6:51 p.m. UTC | #2
On Thu, Jun 28, 2018 at 03:58:41PM -0700, Doug Anderson wrote:
> Hi,
> 
> On Thu, Jun 28, 2018 at 2:09 PM, Matthias Kaehlcke <mka@chromium.org> wrote:
> > Add pm8998 thermal zone based on the examples in the spmi-temp-alarm
> > bindings.
> >
> > Note: devices with the pm8998 need to have a 'thermal-zones' node (which
> > may be empty) with a label 'thermal_zones'.
> >
> > Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
> > ---
> >  arch/arm64/boot/dts/qcom/pm8998.dtsi | 28 ++++++++++++++++++++++++++++
> >  1 file changed, 28 insertions(+)
> 
> Do you know if this patch actually does anything since you didn't
> define a cooling-maps?  Hopefully at least the critical shuts things
> down?

I need to do some additional testing, currently waiting to get the
heat gun back ...

I would expect the critical trip point to shut the system down, though
I'm not sure whether the HW temperature watchdog wouldn't cut power
before that. The documentation I have access to contains some register
descriptions, but isn't very verbose about the overall behavior, and
the driver code doesn't make it clear to me either. The driver
"disables software override of stage 2 and 3 shutdowns", which makes me
guess that a hardware shutdown kicks in at stage 2 (135°C ?). This
would be roughly in line with a system reset I observed in an earlier
test at a temperature > 125°C. If that's correct the trip points need
to be revisited.

Maybe David Collins who recently extended the driver to add support
for GEN2 PMIC peripherals can provide more details.

> Do you have any idea how we'll eventually want to specify a
> cooling-maps?  Are we just going to assume we're included by an sdm845
> device and refer to the big/little CPU phandles?

No clear idea on my side at this point, but limiting CPU frequencies
seems likely, potentially also devfreq devices.
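
One possible shape for such a mapping, as a rough sketch only: a cooling-maps entry that ties a labeled passive trip to a CPU cooling device. The CPU4 label, the pm8998_passive trip label and the trip values are assumptions; the referenced CPU node would also need '#cooling-cells = <2>' for this to work.

```dts
#include <dt-bindings/thermal/thermal.h>

/ {
	thermal-zones {
		pm8998 {
			trips {
				/* Hypothetical label and assumed values. */
				pm8998_passive: passive {
					temperature = <105000>;
					hysteresis = <2000>;
					type = "passive";
				};
			};

			cooling-maps {
				map0 {
					trip = <&pm8998_passive>;
					/* CPU4 is an assumed label of a big-cluster CPU node. */
					cooling-device = <&CPU4 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
				};
			};
		};
	};
};
```

Since the CPU phandles live in the SoC file, a map like this would more naturally sit in the SoC or board .dts than in pm8998.dtsi.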

> > diff --git a/arch/arm64/boot/dts/qcom/pm8998.dtsi b/arch/arm64/boot/dts/qcom/pm8998.dtsi
> > index f2d39074ed74..d85ceb4f976b 100644
> > --- a/arch/arm64/boot/dts/qcom/pm8998.dtsi
> > +++ b/arch/arm64/boot/dts/qcom/pm8998.dtsi
> > @@ -3,6 +3,7 @@
> >
> >  #include <dt-bindings/spmi/spmi.h>
> >  #include <dt-bindings/interrupt-controller/irq.h>
> > +#include <dt-bindings/thermal/thermal.h>
> >
> >  &spmi_bus {
> >         pm8998_lsid0: pmic@0 {
> > @@ -59,3 +60,30 @@
> >                 #size-cells = <0>;
> >         };
> >  };
> > +
> > +&thermal_zones {
> 
> As per comments in patch #1, don't rely on a label.  Just put your
> stuff in a top-level "thermal-zones" node.

ack

> > +       pm8998 {
> > +               polling-delay-passive = <250>;
> > +               polling-delay = <1000>;
> > +
> > +               thermal-sensors = <&pm8998_temp>;
> > +
> > +               trips {
> > +                       passive {
> 
> IMO you should proactively put a label on these trips even if there's
> no cooling device yet.  It's expected that at some point someone will
> add a cooling device and refer to them, right?

ok
David Collins June 29, 2018, 9:29 p.m. UTC | #3
Hello Matthias,

On 06/29/2018 11:51 AM, Matthias Kaehlcke wrote:
> On Thu, Jun 28, 2018 at 03:58:41PM -0700, Doug Anderson wrote:
>> Hi,
>>
>> On Thu, Jun 28, 2018 at 2:09 PM, Matthias Kaehlcke <mka@chromium.org> wrote:
>>> Add pm8998 thermal zone based on the examples in the spmi-temp-alarm
>>> bindings.
>>>
>>> Note: devices with the pm8998 need to have a 'thermal-zones' node (which
>>> may be empty) with a label 'thermal_zones'.
>>>
>>> Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
>>> ---
>>>  arch/arm64/boot/dts/qcom/pm8998.dtsi | 28 ++++++++++++++++++++++++++++
>>>  1 file changed, 28 insertions(+)
>>
>> Do you know if this patch actually does anything since you didn't
>> define a cooling-maps?  Hopefully at least the critical shuts things
>> down?
> 
> I need to do some additional testing, currently waiting to get the
> heat gun back ...
> 
> I would expect the critical trip point to shut the system down, though
> I'm not sure whether the HW temperature watchdog wouldn't cut power
> before that. The documentation I have access to contains some register
> descriptions, but isn't very verbose about the overall behavior and
> from the driver code that's also not really clear to me. The driver
> "disables software override of stage 2 and 3 shutdowns" which makes me
> guess that a hardware shutdown kicks in at stage 2 (135°C ?). This
> would be roughly in line with a system reset I observed in an earlier
> test at a temperature > 125°C. If that's correct the trip points need
> to be revisited.
> 
> Maybe David Collins who recently extended the driver to add support
> for GEN2 PMIC peripherals can provide more details.

The PMIC TEMP_ALARM hardware peripheral will perform an automatic partial
PMIC shutdown upon hitting over-temperature stage 2 (125 C).  This turns
off peripherals within the PMIC that are expected to draw significant
current.  The set of peripherals included varies between PMICs.  This
partial shutdown will occur simultaneously with the triggering of an
interrupt to the APPS processor that informs the qcom-spmi-temp-alarm
driver that an over-temperature threshold has been crossed.

The TEMP_ALARM peripheral will perform an automatic full PMIC shutdown
upon hitting over-temperature stage 3 (145 C).  Software won't receive an
interrupt in this case because all power is cut.

If you are not specifying an ADC channel for the qcom-spmi-temp-alarm
device (which would allow for polling of the real-time PMIC die
temperature), then notifications about stage 0 -> 1 and 1 -> 0 transitions
(105 C) are the only time that software could take meaningful corrective
action to avoid a PMIC automatic partial or full shutdown.
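
For illustration, wiring an ADC channel to the temp-alarm node could look roughly like the fragment below. The io-channels/io-channel-names properties are the optional ones described in the spmi-temp-alarm binding; the pm8998_adc label and the ADC5_DIE_TEMP channel macro are assumptions here and depend on the ADC driver and bindings actually in use.

```dts
&pm8998_temp {
	/*
	 * Assumption: pm8998_adc and ADC5_DIE_TEMP are placeholder names
	 * for the PMIC ADC node and its die temperature channel.
	 */
	io-channels = <&pm8998_adc ADC5_DIE_TEMP>;
	io-channel-names = "thermal";
};
```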

Take care,
David
Matthias Kaehlcke June 29, 2018, 11:54 p.m. UTC | #4
On Fri, Jun 29, 2018 at 02:29:55PM -0700, David Collins wrote:
> Hello Matthias,
> 
> On 06/29/2018 11:51 AM, Matthias Kaehlcke wrote:
> > On Thu, Jun 28, 2018 at 03:58:41PM -0700, Doug Anderson wrote:
> >> Hi,
> >>
> >> On Thu, Jun 28, 2018 at 2:09 PM, Matthias Kaehlcke <mka@chromium.org> wrote:
> >>> Add pm8998 thermal zone based on the examples in the spmi-temp-alarm
> >>> bindings.
> >>>
> >>> Note: devices with the pm8998 need to have a 'thermal-zones' node (which
> >>> may be empty) with a label 'thermal_zones'.
> >>>
> >>> Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
> >>> ---
> >>>  arch/arm64/boot/dts/qcom/pm8998.dtsi | 28 ++++++++++++++++++++++++++++
> >>>  1 file changed, 28 insertions(+)
> >>
> >> Do you know if this patch actually does anything since you didn't
> >> define a cooling-maps?  Hopefully at least the critical shuts things
> >> down?
> > 
> > I need to do some additional testing, currently waiting to get the
> > heat gun back ...
> > 
> > I would expect the critical trip point to shut the system down, though
> > I'm not sure whether the HW temperature watchdog wouldn't cut power
> > before that. The documentation I have access to contains some register
> > descriptions, but isn't very verbose about the overall behavior and
> > from the driver code that's also not really clear to me. The driver
> > "disables software override of stage 2 and 3 shutdowns" which makes me
> > guess that a hardware shutdown kicks in at stage 2 (135°C ?). This
> > would be roughly in line with a system reset I observed in an earlier
> > test at a temperature > 125°C. If that's correct the trip points need
> > to be revisited.
> > 
> > Maybe David Collins who recently extended the driver to add support
> > for GEN2 PMIC peripherals can provide more details.
> 
> The PMIC TEMP_ALARM hardware peripheral will perform an automatic partial
> PMIC shutdown upon hitting over-temperature stage 2 (125 C).  This turns
> off peripherals within the PMIC that are expected to draw significant
> current.  The set of peripherals included varies between PMICs.  This
> partial shutdown will occur simultaneously with the triggering of an
> interrupt to the APPS processor that informs the qcom-spmi-temp-alarm
> driver that an over-temperature threshold has been crossed.
> 
> The TEMP_ALARM peripheral will perform an automatic full PMIC shutdown
> upon hitting over-temperature stage 3 (145 C).  Software won't receive an
> interrupt in this case because all power is cut.

This information is very useful, thanks David!

The (partial) hardware shutdown seems like a good measure of last
resort; however, I suppose we prefer Linux to initiate a shutdown
before losing part of the peripherals (drivers might not be happy
about this and probably won't recover even when the temperature goes
down again) or reaching a full PMIC shutdown.

Please let me know if there are reasons to prefer going with the hardware
limits. Device makers also have the option to override these
settings if they want different behavior.

> If you are not specifying an ADC channel for the qcom-spmi-temp-alarm
> device (which would allow for polling of the real-time PMIC die
> temperature), then notifications about stage 0 -> 1 and 1 -> 0 transitions
> (105 C) are the only time that software could take meaningful corrective
> action to avoid a PMIC automatic partial or full shutdown.

Thanks, I already experimented a bit with this. For the record, the
driver is https://patchwork.kernel.org/patch/10494771/ (this version
is broken though).

Cheers

Matthias
David Collins July 10, 2018, 5:45 p.m. UTC | #5
Hello Matthias,

On 06/29/2018 04:54 PM, Matthias Kaehlcke wrote:
> On Fri, Jun 29, 2018 at 02:29:55PM -0700, David Collins wrote:
...
>> The PMIC TEMP_ALARM hardware peripheral will perform an automatic partial
>> PMIC shutdown upon hitting over-temperature stage 2 (125 C).  This turns
>> off peripherals within the PMIC that are expected to draw significant
>> current.  The set of peripherals included varies between PMICs.  This
>> partial shutdown will occur simultaneously with the triggering of an
>> interrupt to the APPS processor that informs the qcom-spmi-temp-alarm
>> driver that an over-temperature threshold has been crossed.
>>
>> The TEMP_ALARM peripheral will perform an automatic full PMIC shutdown
>> upon hitting over-temperature stage 3 (145 C).  Software won't receive an
>> interrupt in this case because all power is cut.
> 
> This information is very useful, thanks David!
> 
> The (partial) hardware shutdown seems like a good measure of last
> resort, however I suppose we prefer Linux to initiate a shutdown
> before losing part of the peripherals (drivers might not be happy
> about this and probably not recover even when the temperature goes
> down again) or reach a full PMIC shutdown.
> 
> Please let me know if there are reasons to prefer to go the hardware
> limits, it's also an option for device makers to overwrite these
> settings if they want different behavior.

Disabling stage 3 automatic full PMIC shutdown at 145 C is definitely a
bad idea.  This exists as a last resort in order to save the hardware and
ensure end user safety in case of excessive temperature even if software
is locked up.

Disabling stage 2 automatic partial PMIC shutdown at 125 C is not
recommended as the PMIC is already outside of reasonable operating
conditions and needs to take corrective action quickly.  However, doing so
may be acceptable if software is taking action to shut down the system
immediately upon receiving the stage 2 over-temperature interrupt.

Take care,
David
Doug Anderson July 11, 2018, 9:56 p.m. UTC | #6
Hi David,

On Tue, Jul 10, 2018 at 10:45 AM, David Collins <collinsd@codeaurora.org> wrote:
> Hello Matthias,
>
> On 06/29/2018 04:54 PM, Matthias Kaehlcke wrote:
>> On Fri, Jun 29, 2018 at 02:29:55PM -0700, David Collins wrote:
> ...
>>> The PMIC TEMP_ALARM hardware peripheral will perform an automatic partial
>>> PMIC shutdown upon hitting over-temperature stage 2 (125 C).  This turns
>>> off peripherals within the PMIC that are expected to draw significant
>>> current.  The set of peripherals included varies between PMICs.  This
>>> partial shutdown will occur simultaneously with the triggering of an
>>> interrupt to the APPS processor that informs the qcom-spmi-temp-alarm
>>> driver that an over-temperature threshold has been crossed.
>>>
>>> The TEMP_ALARM peripheral will perform an automatic full PMIC shutdown
>>> upon hitting over-temperature stage 3 (145 C).  Software won't receive an
>>> interrupt in this case because all power is cut.
>>
>> This information is very useful, thanks David!
>>
>> The (partial) hardware shutdown seems like a good measure of last
>> resort, however I suppose we prefer Linux to initiate a shutdown
>> before losing part of the peripherals (drivers might not be happy
>> about this and probably not recover even when the temperature goes
>> down again) or reach a full PMIC shutdown.
>>
>> Please let me know if there are reasons to prefer to go the hardware
>> limits, it's also an option for device makers to overwrite these
>> settings if they want different behavior.
>
> Disabling stage 3 automatic full PMIC shutdown at 145 C is definitely a
> bad idea.  This exists as a last resort in order to save the hardware and
> ensure end user safety in case of excessive temperature even if software
> is locked up.
>
> Disabling stage 2 automatic partial PMIC shutdown at 125 C is not
> recommended as the PMIC is already outside of reasonable operating
> conditions and needs to take corrective action quickly.  However, doing so
> may be acceptable if software is taking action to shut down the system
> immediately upon receiving the stage 2 over-temperature interrupt.

Just to confirm: is it expected that at stage 2 the CPUs on the SoC
should continue running even with partial PMIC shutdown enabled?  It
sounded to me like partial PMIC shutdown was supposed to shut down
high-power rails that were not essential to the task of performing an
orderly shutdown.

I think Matthias was seeing that when he reached stage 2 and partial
PMIC shutdown happened that the system was just falling on the floor.
...maybe we just have things configured incorrectly?

-Doug
David Collins July 11, 2018, 10:36 p.m. UTC | #7
Hello Doug,

> On Tue, Jul 10, 2018 at 10:45 AM, David Collins <collinsd@codeaurora.org> wrote:
>> On 06/29/2018 04:54 PM, Matthias Kaehlcke wrote:
>>> On Fri, Jun 29, 2018 at 02:29:55PM -0700, David Collins wrote:
>> ...
>>>> The PMIC TEMP_ALARM hardware peripheral will perform an automatic partial
>>>> PMIC shutdown upon hitting over-temperature stage 2 (125 C).  This turns
>>>> off peripherals within the PMIC that are expected to draw significant
>>>> current.  The set of peripherals included varies between PMICs.  This
>>>> partial shutdown will occur simultaneously with the triggering of an
>>>> interrupt to the APPS processor that informs the qcom-spmi-temp-alarm
>>>> driver that an over-temperature threshold has been crossed.
>>>>
>>>> The TEMP_ALARM peripheral will perform an automatic full PMIC shutdown
>>>> upon hitting over-temperature stage 3 (145 C).  Software won't receive an
>>>> interrupt in this case because all power is cut.
>>>
>>> This information is very useful, thanks David!
>>>
>>> The (partial) hardware shutdown seems like a good measure of last
>>> resort, however I suppose we prefer Linux to initiate a shutdown
>>> before losing part of the peripherals (drivers might not be happy
>>> about this and probably not recover even when the temperature goes
>>> down again) or reach a full PMIC shutdown.
>>>
>>> Please let me know if there are reasons to prefer to go the hardware
>>> limits, it's also an option for device makers to overwrite these
>>> settings if they want different behavior.
>>
>> Disabling stage 3 automatic full PMIC shutdown at 145 C is definitely a
>> bad idea.  This exists as a last resort in order to save the hardware and
>> ensure end user safety in case of excessive temperature even if software
>> is locked up.
>>
>> Disabling stage 2 automatic partial PMIC shutdown at 125 C is not
>> recommended as the PMIC is already outside of reasonable operating
>> conditions and needs to take corrective action quickly.  However, doing so
>> may be acceptable if software is taking action to shut down the system
>> immediately upon receiving the stage 2 over-temperature interrupt.
>
> Just to confirm: is it expected that at stage 2 the CPUs on the SoC
> should continue running even with partial PMIC shutdown enabled?

This is not guaranteed.


> It sounded to me like partial PMIC shutdown was supposed to shut down
> high-power rails that were not essential to the task of performing an
> orderly shutdown.

Shutting down high-power peripherals is accurate; however, special care is
not taken to ensure that an orderly shutdown is possible.  At the very
least, the HW and SW state will be out of sync for the peripherals that
are shut down.


> I think Matthias was seeing that when he reached stage 2 and partial
> PMIC shutdown happened that the system was just falling on the floor.
> ...maybe we just have things configured incorrectly?

More information about the exact crash steps would be helpful to
investigate this further.  I'm not sure how much time you want to put into
it though.  Disabling stage 2 partial shutdown and then using software to
perform a controlled shutdown at 125 C is probably the best option for you
at this point.

Take care,
David
Doug Anderson July 11, 2018, 10:43 p.m. UTC | #8
Hi

On Wed, Jul 11, 2018 at 3:36 PM, David Collins <collinsd@codeaurora.org> wrote:
> Hello Doug,
>
>> On Tue, Jul 10, 2018 at 10:45 AM, David Collins <collinsd@codeaurora.org> wrote:
>>> On 06/29/2018 04:54 PM, Matthias Kaehlcke wrote:
>>>> On Fri, Jun 29, 2018 at 02:29:55PM -0700, David Collins wrote:
>>> ...
>>>>> The PMIC TEMP_ALARM hardware peripheral will perform an automatic partial
>>>>> PMIC shutdown upon hitting over-temperature stage 2 (125 C).  This turns
>>>>> off peripherals within the PMIC that are expected to draw significant
>>>>> current.  The set of peripherals included varies between PMICs.  This
>>>>> partial shutdown will occur simultaneously with the triggering of an
>>>>> interrupt to the APPS processor that informs the qcom-spmi-temp-alarm
>>>>> driver that an over-temperature threshold has been crossed.
>>>>>
>>>>> The TEMP_ALARM peripheral will perform an automatic full PMIC shutdown
>>>>> upon hitting over-temperature stage 3 (145 C).  Software won't receive an
>>>>> interrupt in this case because all power is cut.
>>>>
>>>> This information is very useful, thanks David!
>>>>
>>>> The (partial) hardware shutdown seems like a good measure of last
>>>> resort, however I suppose we prefer Linux to initiate a shutdown
>>>> before losing part of the peripherals (drivers might not be happy
>>>> about this and probably not recover even when the temperature goes
>>>> down again) or reach a full PMIC shutdown.
>>>>
>>>> Please let me know if there are reasons to prefer to go the hardware
>>>> limits, it's also an option for device makers to overwrite these
>>>> settings if they want different behavior.
>>>
>>> Disabling stage 3 automatic full PMIC shutdown at 145 C is definitely a
>>> bad idea.  This exists as a last resort in order to save the hardware and
>>> ensure end user safety in case of excessive temperature even if software
>>> is locked up.
>>>
>>> Disabling stage 2 automatic partial PMIC shutdown at 125 C is not
>>> recommended as the PMIC is already outside of reasonable operating
>>> conditions and needs to take corrective action quickly.  However, doing so
>>> may be acceptable if software is taking action to shut down the system
>>> immediately upon receiving the stage 2 over-temperature interrupt.
>>
>> Just to confirm: is it expected that at stage 2 the CPUs on the SoC
>> should continue running even with partial PMIC shutdown enabled?
>
> This is not guaranteed.
>
>
>> It sounded to me like partial PMIC shutdown was supposed to shut down
>> high-power rails that were not essential to the task of performing an
>> orderly shutdown.
>
> Shutting down high-power peripherals is accurate; however, special care is
> not taken to ensure that an orderly shutdown is possible.  At the very
> least, the HW and SW state will be out of sync for the peripherals that
> are shut down.

OK, I guess I'm confused now.  Why does partial PMIC shutdown even
exist then?  What is the point of leaving some rails alive if software
could stop running?  It seems like it would be better to just shut
everything down.

Said another way: can you describe what benefit you see for only
partially shutting down the PMIC at stage 2 compared to just fully
shutting it down at stage 2?


>> I think Matthias was seeing that when he reached stage 2 and partial
>> PMIC shutdown happened that the system was just falling on the floor.
>> ...maybe we just have things configured incorrectly?
>
> More information about the exact crash steps would be helpful to
> investigate this further.  I'm not sure how much time you want to put into
> it though.

Matthias can add more, but basically he heated the system up and when
it reached the stage 2 shutdown it was no longer responsive.


> Disabling stage 2 partial shutdown and then using software to
> perform a controlled shutdown at 125 C is probably the best option for you
> at this point.

This seems OK to me given that I don't understand the original purpose
of the partial PMIC shutdown.  Would you expect that all upstream PMIC
users would want stage 2 partial shutdown disabled, so we should just
do this for all users of the PMIC?


-Doug
Matthias Kaehlcke July 11, 2018, 10:53 p.m. UTC | #9
On Wed, Jul 11, 2018 at 03:43:34PM -0700, Doug Anderson wrote:
> Hi
> 
> On Wed, Jul 11, 2018 at 3:36 PM, David Collins <collinsd@codeaurora.org> wrote:
> > Hello Doug,
> >
> >> On Tue, Jul 10, 2018 at 10:45 AM, David Collins <collinsd@codeaurora.org> wrote:
> >>> On 06/29/2018 04:54 PM, Matthias Kaehlcke wrote:
> >>>> On Fri, Jun 29, 2018 at 02:29:55PM -0700, David Collins wrote:
> >>> ...
> >>>>> The PMIC TEMP_ALARM hardware peripheral will perform an automatic partial
> >>>>> PMIC shutdown upon hitting over-temperature stage 2 (125 C).  This turns
> >>>>> off peripherals within the PMIC that are expected to draw significant
> >>>>> current.  The set of peripherals included varies between PMICs.  This
> >>>>> partial shutdown will occur simultaneously with the triggering of an
> >>>>> interrupt to the APPS processor that informs the qcom-spmi-temp-alarm
> >>>>> driver that an over-temperature threshold has been crossed.
> >>>>>
> >>>>> The TEMP_ALARM peripheral will perform an automatic full PMIC shutdown
> >>>>> upon hitting over-temperature stage 3 (145 C).  Software won't receive an
> >>>>> interrupt in this case because all power is cut.
> >>>>
> >>>> This information is very useful, thanks David!
> >>>>
> >>>> The (partial) hardware shutdown seems like a good measure of last
> >>>> resort, however I suppose we prefer Linux to initiate a shutdown
> >>>> before losing part of the peripherals (drivers might not be happy
> >>>> about this and probably not recover even when the temperature goes
> >>>> down again) or reach a full PMIC shutdown.
> >>>>
> >>>> Please let me know if there are reasons to prefer to go the hardware
> >>>> limits, it's also an option for device makers to overwrite these
> >>>> settings if they want different behavior.
> >>>
> >>> Disabling stage 3 automatic full PMIC shutdown at 145 C is definitely a
> >>> bad idea.  This exists as a last resort in order to save the hardware and
> >>> ensure end user safety in case of excessive temperature even if software
> >>> is locked up.
> >>>
> >>> Disabling stage 2 automatic partial PMIC shutdown at 125 C is not
> >>> recommended as the PMIC is already outside of reasonable operating
> >>> conditions and needs to take corrective action quickly.  However, doing so
> >>> may be acceptable if software is taking action to shut down the system
> >>> immediately upon receiving the stage 2 over-temperature interrupt.
> >>
> >> Just to confirm: is it expected that at stage 2 the CPUs on the SoC
> >> should continue running even with partial PMIC shutdown enabled?
> >
> > This is not guaranteed.
> >
> >
> >> It sounded to me like partial PMIC shutdown was supposed to shut down
> >> high-power rails that were not essential to the task of performing an
> >> orderly shutdown.
> >
> > Shutting down high-power peripherals is accurate; however, special care is
> > not taken to ensure that an orderly shutdown is possible.  At the very
> > least, the HW and SW state will be out of sync for the peripherals that
> > are shut down.
> 
> OK, I guess I'm confused now.  Why does partial PMIC shutdown even
> exist then?  What is the point of leaving some rails alive if software
> could stop running?  It seems like it would be better to just shut
> everything down.
> 
> Said another way: can you describe what benefit you see for only
> partially shutting down the PMIC at stage 2 compared to just fully
> shutting it down at stage 2?
> 
> 
> >> I think Matthias was seeing that when he reached stage 2 and partial
> >> PMIC shutdown happened that the system was just falling on the floor.
> >> ...maybe we just have things configured incorrectly?
> >
> > More information about the exact crash steps would be helpful to
> > investigate this further.  I'm not sure how much time you want to put into
> > it though.
> 
> Matthias can add more, but basically he heated the system up and when
> it reached the stage 2 shutdown it was no longer responsive.

The system behaved as on a warm reset when reaching stage 2
temperature, no kernel crash, but messages in /dev/pstore were
preserved.
David Collins July 12, 2018, 12:10 a.m. UTC | #10
Hello Doug,

On 07/11/2018 03:43 PM, Doug Anderson wrote:
> On Wed, Jul 11, 2018 at 3:36 PM, David Collins <collinsd@codeaurora.org> wrote:
>>> On Tue, Jul 10, 2018 at 10:45 AM, David Collins <collinsd@codeaurora.org> wrote:
>>>> On 06/29/2018 04:54 PM, Matthias Kaehlcke wrote:
>>>>> On Fri, Jun 29, 2018 at 02:29:55PM -0700, David Collins wrote:
>>>> ...
>>>>>> The PMIC TEMP_ALARM hardware peripheral will perform an automatic partial
>>>>>> PMIC shutdown upon hitting over-temperature stage 2 (125 C).  This turns
>>>>>> off peripherals within the PMIC that are expected to draw significant
>>>>>> current.  The set of peripherals included varies between PMICs.  This
>>>>>> partial shutdown will occur simultaneously with the triggering of an
>>>>>> interrupt to the APPS processor that informs the qcom-spmi-temp-alarm
>>>>>> driver that an over-temperature threshold has been crossed.
>>>>>>
>>>>>> The TEMP_ALARM peripheral will perform an automatic full PMIC shutdown
>>>>>> upon hitting over-temperature stage 3 (145 C).  Software won't receive an
>>>>>> interrupt in this case because all power is cut.
>>>>>
>>>>> This information is very useful, thanks David!
>>>>>
>>>>> The (partial) hardware shutdown seems like a good measure of last
>>>>> resort, however I suppose we prefer Linux to initiate a shutdown
>>>>> before losing part of the peripherals (drivers might not be happy
>>>>> about this and probably not recover even when the temperature goes
>>>>> down again) or reach a full PMIC shutdown.
>>>>>
>>>>> Please let me know if there are reasons to prefer to go the hardware
>>>>> limits, it's also an option for device makers to overwrite these
>>>>> settings if they want different behavior.
>>>>
>>>> Disabling stage 3 automatic full PMIC shutdown at 145 C is definitely a
>>>> bad idea.  This exists as a last resort in order to save the hardware and
>>>> ensure end user safety in case of excessive temperature even if software
>>>> is locked up.
>>>>
>>>> Disabling stage 2 automatic partial PMIC shutdown at 125 C is not
>>>> recommended as the PMIC is already outside of reasonable operating
>>>> conditions and needs to take corrective action quickly.  However, doing so
>>>> may be acceptable if software is taking action to shut down the system
>>>> immediately upon receiving the stage 2 over-temperature interrupt.
>>>
>>> Just to confirm: is it expected that at stage 2 the CPUs on the SoC
>>> should continue running even with partial PMIC shutdown enabled?
>>
>> This is not guaranteed.
>>
>>
>>> It sounded to me like partial PMIC shutdown was supposed to shut down
>>> high-power rails that were not essential to the task of performing an
>>> orderly shutdown.
>>
>> Shutting down high-power peripherals is accurate; however, special care is
>> not taken to ensure that an orderly shutdown is possible.  At the very
>> least, the HW and SW state will be out of sync for the peripherals that
>> are shut down.
> 
> OK, I guess I'm confused now.  Why does partial PMIC shutdown even
> exist then?  What is the point of leaving some rails alive if software
> could stop running?  It seems like it would be better to just shut
> everything down.
> 
> Said another way: can you describe what benefit you see for only
> partially shutting down the PMIC at stage 2 compared to just fully
> shutting it down at stage 2?

Stage 2 partial shutdown is present on PM8998 for legacy reasons.  It is
being phased out on future PMICs.  My understanding is that it was
originally intended to be a less aggressive mitigation option than a full
shutdown and that it allows for more post-mitigation analysis (e.g.
preserved RAM contents).

The set of peripherals which are disabled during stage 2 partial shutdown
is not well defined which leads to the kind of uncertainty and ill-defined
behavior being discussed in this thread.


>> Disabling stage 2 partial shutdown and then using software to
>> perform a controlled shutdown at 125 C is probably the best option for you
>> at this point.
> 
> This seems OK to me given that I don't understand the original purpose
> of the partial PMIC shutdown.  Would you expect that all upstream PMIC
> users would want stage 2 partial shutdown disabled, so we should just
> do this for all users of the PMIC?

I'd think that we only want to override stage 2 partial shutdown if
thermal nodes are defined which cause a graceful software controlled
shutdown in place of the PMIC partial shutdown.  Therefore, management of
the feature should probably be tied to a boolean DT property.

Take care,
David
Matthias Kaehlcke July 13, 2018, 4:49 p.m. UTC | #11
On Wed, Jul 11, 2018 at 05:10:50PM -0700, David Collins wrote:
> Hello Doug,
> 
> On 07/11/2018 03:43 PM, Doug Anderson wrote:
> > On Wed, Jul 11, 2018 at 3:36 PM, David Collins <collinsd@codeaurora.org> wrote:
> >>> On Tue, Jul 10, 2018 at 10:45 AM, David Collins <collinsd@codeaurora.org> wrote:
> >>>> On 06/29/2018 04:54 PM, Matthias Kaehlcke wrote:
> >>>>> On Fri, Jun 29, 2018 at 02:29:55PM -0700, David Collins wrote:
> >>>> ...
> >>>>>> The PMIC TEMP_ALARM hardware peripheral will perform an automatic partial
> >>>>>> PMIC shutdown upon hitting over-temperature stage 2 (125 C).  This turns
> >>>>>> off peripherals within the PMIC that are expected to draw significant
> >>>>>> current.  The set of peripherals included varies between PMICs.  This
> >>>>>> partial shutdown will occur simultaneously with the triggering of an
> >>>>>> interrupt to the APPS processor that informs the qcom-spmi-temp-alarm
> >>>>>> driver that an over-temperature threshold has been crossed.
> >>>>>>
> >>>>>> The TEMP_ALARM peripheral will perform an automatic full PMIC shutdown
> >>>>>> upon hitting over-temperature stage 3 (145 C).  Software won't receive an
> >>>>>> interrupt in this case because all power is cut.
> >>>>>
> >>>>> This information is very useful, thanks David!
> >>>>>
> >>>>> The (partial) hardware shutdown seems like a good measure of last
> >>>>> resort, however I suppose we prefer Linux to initiate a shutdown
> >>>>> before losing part of the peripherals (drivers might not be happy
> >>>>> about this and probably not recover even when the temperature goes
> >>>>> down again) or reach a full PMIC shutdown.
> >>>>>
> >>>>> Please let me know if there are reasons to prefer to go with the hardware
> >>>>> limits, it's also an option for device makers to overwrite these
> >>>>> settings if they want different behavior.
> >>>>
> >>>> Disabling stage 3 automatic full PMIC shutdown at 145 C is definitely a
> >>>> bad idea.  This exists as a last resort in order to save the hardware and
> >>>> ensure end user safety in case of excessive temperature even if software
> >>>> is locked up.
> >>>>
> >>>> Disabling stage 2 automatic partial PMIC shutdown at 125 C is not
> >>>> recommended as the PMIC is already outside of reasonable operating
> >>>> conditions and needs to take corrective action quickly.  However, doing so
> >>>> may be acceptable if software is taking action to shut down the system
> >>>> immediately upon receiving the stage 2 over-temperature interrupt.
> >>>
> >>> Just to confirm: is it expected that at stage 2 the CPUs on the SoC
> >>> should continue running even with partial PMIC shutdown enabled?
> >>
> >> This is not guaranteed.
> >>
> >>
> >>> It sounded to me like partial PMIC shutdown was supposed to shut down
> >>> high-power rails that were not essential to the task of performing an
> >>> orderly shutdown.
> >>
> >> Shutting down high-power peripherals is accurate; however, special care is
> >> not taken to ensure that an orderly shutdown is possible.  At the very
> >> least, the HW and SW state will be out of sync for the peripherals that
> >> are shut down.
> > 
> > OK, I guess I'm confused now.  Why does partial PMIC shutdown even
> > exist then?  What is the point of leaving some rails alive if software
> > could stop running?  It seems like it would be better to just shut
> > everything down.
> > 
> > Said another way: can you describe what benefit you see for only
> > partially shutting down the PMIC at stage 2 compared to just fully
> > shutting it down at stage 2?
> 
> Stage 2 partial shutdown is present on PM8998 for legacy reasons.  It is
> being phased out on future PMICs.  My understanding is that it was
> originally intended to be a less aggressive mitigation option than a full
> shutdown and that it allows for more post-mitigation analysis (e.g.
> preserved RAM contents).
> 
> The set of peripherals which are disabled during stage 2 partial shutdown
> is not well defined which leads to the kind of uncertainty and ill-defined
> behavior being discussed in this thread.

Thanks for the information!

> >> Disabling stage 2 partial shutdown and then using software to
> >> perform a controlled shutdown at 125 C is probably the best option for you
> >> at this point.
> > 
> > This seems OK to me given that I don't understand the original purpose
> > of the partial PMIC shutdown.  Would you expect that all upstream PMIC
> > users would want stage 2 partial shutdown disabled, so we should just
> > do this for all users of the PMIC?
> 
> I'd think that we only want to override stage 2 partial shutdown if
> thermal nodes are defined which cause a graceful software controlled
> shutdown in place of the PMIC partial shutdown.  Therefore, management of
> the feature should probably be tied to a boolean DT property.

Sounds good, I'll send a patch to disable the partial shutdown through
a DT property soon.
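
As a rough sketch of what a board opt-in could look like (the property
name below is hypothetical, picked only for illustration, and is not
taken from any binding):

```dts
&pm8998_temp {
	/*
	 * Hypothetical boolean property: ask the driver to disable the
	 * stage 2 partial shutdown so that software can perform a
	 * controlled shutdown at 125 C instead. Not an existing binding.
	 */
	qcom,example-disable-stage2-shutdown;
};
```

Boards that don't define thermal nodes for a graceful shutdown would
leave the property out and keep the hardware default. The stage 3 full
shutdown at 145 C stays enabled either way.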

Patch

diff --git a/arch/arm64/boot/dts/qcom/pm8998.dtsi b/arch/arm64/boot/dts/qcom/pm8998.dtsi
index f2d39074ed74..d85ceb4f976b 100644
--- a/arch/arm64/boot/dts/qcom/pm8998.dtsi
+++ b/arch/arm64/boot/dts/qcom/pm8998.dtsi
@@ -3,6 +3,7 @@ 
 
 #include <dt-bindings/spmi/spmi.h>
 #include <dt-bindings/interrupt-controller/irq.h>
+#include <dt-bindings/thermal/thermal.h>
 
 &spmi_bus {
 	pm8998_lsid0: pmic@0 {
@@ -59,3 +60,30 @@ 
 		#size-cells = <0>;
 	};
 };
+
+&thermal_zones {
+	pm8998 {
+		polling-delay-passive = <250>;
+		polling-delay = <1000>;
+
+		thermal-sensors = <&pm8998_temp>;
+
+		trips {
+			passive {
+				temperature = <105000>;
+				hysteresis = <2000>;
+				type = "passive";
+			};
+			alert {
+				temperature = <125000>;
+				hysteresis = <2000>;
+				type = "hot";
+			};
+			crit {
+				temperature = <145000>;
+				hysteresis = <2000>;
+				type = "critical";
+			};
+		};
+	};
+};
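
For the &thermal_zones reference above to resolve, a board file that
includes pm8998.dtsi has to provide a labelled 'thermal-zones' node, as
noted in the commit message. A minimal sketch, assuming the SoC .dtsi
that defines spmi_bus has already been included earlier in the file:

```dts
/ {
	/*
	 * May be empty. The label must be 'thermal_zones', and it has to
	 * be defined before pm8998.dtsi is included, otherwise dtc cannot
	 * resolve the &thermal_zones reference.
	 */
	thermal_zones: thermal-zones {
	};
};

#include "pm8998.dtsi"
```

This only illustrates the label dependency of this version of the patch.
The review comments ask not to rely on a label, so a later revision may
not need it.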