
[1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

Message ID 20191014212101.25719-1-srinivas.pandruvada@linux.intel.com (mailing list archive)
State New, archived

Commit Message

Srinivas Pandruvada Oct. 14, 2019, 9:21 p.m. UTC
Some modern systems have very tight thermal tolerances. Because of this
they may cross thermal thresholds when running normal workloads (even
during boot). The CPU hardware will react by limiting power/frequency
and using duty cycles to bring the temperature back into normal range.

Thus users may see a "critical" message about the "temperature above
threshold" which is soon followed by "temperature/speed normal". These
messages are rate limited, but still may repeat every few minutes.

The solution here is to set a timeout when the temperature first exceeds
the threshold. If the CPU returns to normal before the timeout fires,
we skip printing any messages. If we reach the timeout, then there may be
a real thermal issue (e.g. inoperative or blocked fan) and we print the
message (together with a count of how many thermal events have occurred).
A rate control method is used to avoid printing repeatedly on these broken
systems.

Some experimentation with fans enabled showed that temperature returned
to normal on a laptop in ~4 seconds. With fans disabled it took over 10
seconds. The default timeout is thus set to 8 seconds, but it may be
changed with the kernel boot parameter "x86_therm_warn_delay". This
default is twice the typical sampling interval at which user space
thermal control software performs cooling via the running average
power limit (RAPL).

In addition, a new sysfs attribute is added to show the maximum amount
of time, in milliseconds, that the system was in the throttled state.
This allows changing x86_therm_warn_delay, if required.

Suggested-by: Alan Cox <alan@linux.intel.com>
Commit-comment-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
 arch/x86/kernel/cpu/mce/therm_throt.c | 94 ++++++++++++++++++++++-----
 1 file changed, 77 insertions(+), 17 deletions(-)
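The delayed-warning scheme the commit message describes can be modeled roughly as below. This is an illustrative sketch with hypothetical names, not the actual therm_throt.c code; in particular, the real patch arms a timer so the warning can fire while the CPU is still hot, whereas this simplified model only evaluates the excursion when the temperature returns to normal:

```c
#include <stdbool.h>

/* Hypothetical sketch of the delayed-warning scheme; names are
 * illustrative, not the actual therm_throt.c code.
 * All times are in milliseconds. */

#define THERM_WARN_DELAY_MS (8 * 1000)   /* default: 8 seconds */

struct therm_state {
	bool over_threshold;       /* currently above the trip point?  */
	unsigned long hot_since;   /* timestamp when we crossed it     */
	unsigned long max_hot_ms;  /* longest throttled period seen    */
	unsigned long event_count; /* total threshold crossings        */
};

/* Called on a "temperature above threshold" interrupt. */
static void therm_trip(struct therm_state *s, unsigned long now)
{
	if (!s->over_threshold) {
		s->over_threshold = true;
		s->hot_since = now;
		s->event_count++;
	}
}

/* Called on a "temperature back to normal" interrupt.
 * Returns true when the excursion lasted long enough to warn. */
static bool therm_clear(struct therm_state *s, unsigned long now)
{
	unsigned long hot_ms;

	if (!s->over_threshold)
		return false;

	s->over_threshold = false;
	hot_ms = now - s->hot_since;
	if (hot_ms > s->max_hot_ms)
		s->max_hot_ms = hot_ms; /* value the new sysfs attribute would expose */

	/* Short excursion: stay silent. Long one: worth a message. */
	return hot_ms >= THERM_WARN_DELAY_MS;
}
```

A 4-second excursion (the fans-on laptop case above) would stay silent; a 10-second one (fans off) would produce the warning.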

Comments

Borislav Petkov Oct. 14, 2019, 9:36 p.m. UTC | #1
On Mon, Oct 14, 2019 at 02:21:00PM -0700, Srinivas Pandruvada wrote:
> Some modern systems have very tight thermal tolerances. Because of this
> they may cross thermal thresholds when running normal workloads (even
> during boot). The CPU hardware will react by limiting power/frequency
> and using duty cycles to bring the temperature back into normal range.
> 
> Thus users may see a "critical" message about the "temperature above
> threshold" which is soon followed by "temperature/speed normal". These
> messages are rate limited, but still may repeat every few minutes.
> 
> The solution here is to set a timeout when the temperature first exceeds
> the threshold. If the CPU returns to normal before the timeout fires,
> we skip printing any messages. If we reach the timeout, then there may be
> a real thermal issue (e.g. inoperative or blocked fan) and we print the
> message (together with a count of how many thermal events have occurred).
> A rate control method is used to avoid printing repeatedly on these broken
> systems.
> 
> Some experimentation with fans enabled showed that temperature returned
> to normal on a laptop in ~4 seconds. With fans disabled it took over 10
> seconds. Default timeout is thus set to 8 seconds, but may be changed
> with kernel boot parameter: "x86_therm_warn_delay". This default interval
> is twice of typical sampling interval for cooling using running average
> power limit from user space thermal control softwares.
> 
> In addition a new sysfs attribute is added to show what is the maximum
> amount of time in miili-seconds the system was in throttled state. This
> will allow to change x86_therm_warn_delay, if required.

This description is already *begging* for this delay value to be
automatically set by the kernel. Putting yet another knob in front of
the user who doesn't have a clue most of the time shows one more time
that we haven't done our job properly by asking her to know what we
already do.

IOW, a simple history feedback mechanism which sets the timeout based on
the last couple of values is much smarter. The thing would have a max
value, of course, which, when exceeded should mean an anomaly, etc, but
almost anything else is better than merely asking the user to make an
educated guess.

> Suggested-by: Alan Cox <alan@linux.intel.com>
> Commit-comment-by: Tony Luck <tony.luck@intel.com>
  ^^^^^^^^^^^^^^^^^^

What's that?
Tony Luck Oct. 14, 2019, 10:27 p.m. UTC | #2
On Mon, Oct 14, 2019 at 11:36:18PM +0200, Borislav Petkov wrote:
> This description is already *begging* for this delay value to be
> automatically set by the kernel. Putting yet another knob in front of
> the user who doesn't have a clue most of the time shows one more time
> that we haven't done our job properly by asking her to know what we
> already do.
> 
> IOW, a simple history feedback mechanism which sets the timeout based on
> the last couple of values is much smarter. The thing would have a max
> value, of course, which, when exceeded should mean an anomaly, etc, but
> almost anything else is better than merely asking the user to make an
> educated guess.

You need a plausible start point for the "when to worry the user"
message.  Maybe that is your "max value"?

So if the system has a couple of excursions above temperature lasting
1 second and then 2 seconds ... would you like to see those ignored
(because they are below the initial max)? But now that we have a couple
of data points, do we pick some new value to be the threshold for reporting?

What value should we pick (based on 1 sec, then 2 sec)?

I would be worried that it would self tune to the point where it
does report something that it really didn't need to (e.g. as a result
of a few consecutive very short excursions).

We also need to take into account the "typical sampling interval"
for user space thermal control software.

Srinivas: Maybe this needs to have some more detail on what user
solutions are being taken into account here.

> > Suggested-by: Alan Cox <alan@linux.intel.com>
> > Commit-comment-by: Tony Luck <tony.luck@intel.com>
>   ^^^^^^^^^^^^^^^^^^
> 
> What's that?

My fault ... during the review process I pretty much rewrote the
whole commit message to follow the form of:
	"What is the problem?"
	"How are we fixing it"
But I didn't want Srinivas to take the heat for any mistakes
that were my fault. "Co-developed-by" really didn't explain
what happened (since I didn't write any code, just made suggestions
on things that needed to be changed/improved).

-Tony
Srinivas Pandruvada Oct. 14, 2019, 10:41 p.m. UTC | #3
On Mon, 2019-10-14 at 23:36 +0200, Borislav Petkov wrote:
> On Mon, Oct 14, 2019 at 02:21:00PM -0700, Srinivas Pandruvada wrote:
> > Some modern systems have very tight thermal tolerances. Because of
> > this
> > they may cross thermal thresholds when running normal workloads
> > (even
> > during boot). The CPU hardware will react by limiting
> > power/frequency
> > and using duty cycles to bring the temperature back into normal
> > range.
> > 
> > Thus users may see a "critical" message about the "temperature
> > above
> > threshold" which is soon followed by "temperature/speed normal".
> > These
> > messages are rate limited, but still may repeat every few minutes.
> > 
> > The solution here is to set a timeout when the temperature first
> > exceeds
> > the threshold. If the CPU returns to normal before the timeout
> > fires,
> > we skip printing any messages. If we reach the timeout, then there
> > may be
> > a real thermal issue (e.g. inoperative or blocked fan) and we print
> > the
> > message (together with a count of how many thermal events have
> > occurred).
> > A rate control method is used to avoid printing repeatedly on these
> > broken
> > systems.
> > 
> > Some experimentation with fans enabled showed that temperature
> > returned
> > to normal on a laptop in ~4 seconds. With fans disabled it took
> > over 10
> > seconds. Default timeout is thus set to 8 seconds, but may be
> > changed
> > with kernel boot parameter: "x86_therm_warn_delay". This default
> > interval
> > is twice of typical sampling interval for cooling using running
> > average
> > power limit from user space thermal control softwares.
> > 
> > In addition a new sysfs attribute is added to show what is the
> > maximum
> > amount of time in miili-seconds the system was in throttled state.
> > This
> > will allow to change x86_therm_warn_delay, if required.
> 
> This description is already *begging* for this delay value to be
> automatically set by the kernel. Putting yet another knob in front of
> the user who doesn't have a clue most of the time shows one more time
> that we haven't done our job properly by asking her to know what we
> already do.
I experimented on systems released from the Sandy Bridge era onward. But
for someone running a 10-year-old system, this is a fallback mechanism.
I don't expect users to have to tune away from the default, but saying
so with certainty is difficult. The source of this PROCHOT signal can be
anything on the board.
So users who have had issues on their systems can try this with the patch.
We can drop the parameter until it becomes a real issue.

> 
> IOW, a simple history feedback mechanism which sets the timeout based
> on
> the last couple of values is much smarter. The thing would have a max
> value, of course, which, when exceeded should mean an anomaly, etc,
> but
> almost anything else is better than merely asking the user to make an
> educated guess.
The temperature is a function of load, time, and the heat dissipation
capacity of the system. I have to think more about this to come up with
heuristics that still warn users about real thermal issues.
Since the value is not persistent, the next boot will again start from
the default.

> 
> > Suggested-by: Alan Cox <alan@linux.intel.com>
> > Commit-comment-by: Tony Luck <tony.luck@intel.com>
> 
>   ^^^^^^^^^^^^^^^^^^
> 
> What's that?
Tony suggested this to indicate that he rewrote the commit description,
as he didn't like mine. Checkpatch definitely doesn't like this tag.

Thanks,
Srinivas
Borislav Petkov Oct. 15, 2019, 8:36 a.m. UTC | #4
On Mon, Oct 14, 2019 at 03:27:35PM -0700, Luck, Tony wrote:
> You need a plausible start point for the "when to worry the user"
> message.  Maybe that is your "max value"?

Yes, that would be a good start.

You need that anyway because the experiments you guys did to get
your numbers were done at some ambient temperature X. I betcha
when the ambient temperature differs considerably from yours, the
numbers don't mean a whole lot.

Which makes a dynamic adjustment even more important.

> So if the system has a couple of excursions above temperature lasting
> 1 second and then 2 seconds ... would you like to see those ignored
> (because they are below the initial max)? But now we have a couple
> of data points pick some new value to be the threshold for reporting?
> 
> What value should we pick (based on 1 sec, then 2 sec)?
> 
> I would be worried that it would self tune to the point where it
> does report something that it really didn't need to (e.g. as a result
> of a few consecutive very short excursions).

You select a history feedback formula in which sudden changes influence
the timeout value only slowly, keeping the current timeout value rather
inert. They would take effect only when such spikes persist for a longer
time, i.e., take up a longer chunk of the sampling interval.

> We also need to take into account the "typical sampling interval"
> for user space thermal control software.

Yes to the sampling interval, not so sure about doing anything in
luserspace. This should all be done in the kernel automatically.

> My fault ... during review process I pretty much re-wrote the
> whole commit message to follow the form of:
> 	"What is the problem?"
> 	"How are we fixing it"

Cool.

> But I didn't want Srinivas to take the heat for any mistakes
> that were my fault. "Co-developed-by" really didn't explain
> what happened (since I didn't write any code, just made suggestions
> on things that needed to be changed/improved).

Yeah, so stuff like that is usually added with free text at the end of
the commit message where you have more than a couple of words in a tag
to explain what happened.
Borislav Petkov Oct. 15, 2019, 8:46 a.m. UTC | #5
On Mon, Oct 14, 2019 at 03:41:38PM -0700, Srinivas Pandruvada wrote:
> So some users who had issues in their systems can try with this patch.
> We can get rid of this, till it becomes real issue.

We don't add command line parameters which we maybe can get rid of
later.

> The temperature is function of load, time and heat dissipation capacity
> of the system. I have to think more about this to come up with some
> heuristics where we still warning users about real thermal issues.
> Since value is not persistent, then next boot again will start from the
> default.

Yes, and the fact that each machine's temperature is influenced by the
specific *individual* environment and load the machine runs, shows that
you need to adjust this timeout automatically and dynamically.

With the command line parameter you're basically putting the onus on the
user to do that which is just silly. And then she'd need to do it during
runtime too, if the ambient temperature or machine load, etc, changes.

The whole thing is crying "dynamic".

For a simple example, see mce_timer_fn() where we switch to polling
during CMCI storms.
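For reference, the adaptive pattern Boris points at in mce_timer_fn() — poll faster while events keep arriving, back off toward the slow rate when quiet — looks roughly like this. The bounds and names here are illustrative assumptions, not the actual mce code:

```c
#include <stdbool.h>

/* Illustrative bounds; the real mce code derives its limits from
 * check_interval. */
#define INTERVAL_MIN_MS   100
#define INTERVAL_MAX_MS (5UL * 60 * 1000)

/* Return the next polling interval given whether an event was seen
 * during the one that just expired. */
static unsigned long adapt_interval(unsigned long cur_ms, bool event_seen)
{
	if (event_seen) {
		/* Events are arriving: poll twice as often. */
		cur_ms /= 2;
		if (cur_ms < INTERVAL_MIN_MS)
			cur_ms = INTERVAL_MIN_MS;
	} else {
		/* Quiet: back off exponentially toward the slow rate. */
		cur_ms *= 2;
		if (cur_ms > INTERVAL_MAX_MS)
			cur_ms = INTERVAL_MAX_MS;
	}
	return cur_ms;
}
```

The same halve-on-activity, double-on-quiet shape could drive a self-adjusting warning timeout instead of a boot parameter.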
Peter Zijlstra Oct. 15, 2019, 8:48 a.m. UTC | #6
On Mon, Oct 14, 2019 at 02:21:00PM -0700, Srinivas Pandruvada wrote:
> Some modern systems have very tight thermal tolerances. Because of this
> they may cross thermal thresholds when running normal workloads (even
> during boot). The CPU hardware will react by limiting power/frequency
> and using duty cycles to bring the temperature back into normal range.
> 
> Thus users may see a "critical" message about the "temperature above
> threshold" which is soon followed by "temperature/speed normal". These
> messages are rate limited, but still may repeat every few minutes.
> 
> The solution here is to set a timeout when the temperature first exceeds
> the threshold.

Why can we even reach critical thresholds when the fans are working? I
always thought it was BAD to ever reach the critical temps and have the
hardware throttle.
Peter Zijlstra Oct. 15, 2019, 8:52 a.m. UTC | #7
On Mon, Oct 14, 2019 at 03:27:35PM -0700, Luck, Tony wrote:
> On Mon, Oct 14, 2019 at 11:36:18PM +0200, Borislav Petkov wrote:
> > This description is already *begging* for this delay value to be
> > automatically set by the kernel. Putting yet another knob in front of
> > the user who doesn't have a clue most of the time shows one more time
> > that we haven't done our job properly by asking her to know what we
> > already do.
> > 
> > IOW, a simple history feedback mechanism which sets the timeout based on
> > the last couple of values is much smarter. The thing would have a max
> > value, of course, which, when exceeded should mean an anomaly, etc, but
> > almost anything else is better than merely asking the user to make an
> > educated guess.
> 
> You need a plausible start point for the "when to worry the user"
> message.  Maybe that is your "max value"?
> 
> So if the system has a couple of excursions above temperature lasting
> 1 second and then 2 seconds ... would you like to see those ignored
> (because they are below the initial max)? But now we have a couple
> of data points pick some new value to be the threshold for reporting?
> 
> What value should we pick (based on 1 sec, then 2 sec)?
> 
> I would be worried that it would self tune to the point where it
> does report something that it really didn't need to (e.g. as a result
> of a few consecutive very short excursions).

I'm guessing Boris is thinking of a simple IIR like avg filter.

	avg = avg + (sample-avg) / 4

And then only print when sample > 2*avg. If you initialize that with
some appropriately large value, it should settle down into what it
'normal' for that particular piece of hardware.

Still, I'm boggled by the whole idea that hitting critical hard throttle
is considered 'normal' at all.

> We also need to take into account the "typical sampling interval"
> for user space thermal control software.

Why is control of critical thermal crud in userspace? That seems like a
massive design fail.
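The IIR average filter Peter sketches can be modeled as below. The update rule and the "warn when sample > 2*avg" test follow his description; the seed value and integer arithmetic are illustrative assumptions:

```c
#include <stdbool.h>

/* IIR running-average filter: avg += (sample - avg) / 4.
 * Warn only when a sample exceeds twice the running average.
 * Seeded "appropriately large" (16 s here, an assumed value) so the
 * filter settles down to what is normal for this hardware. */
static bool iir_check(long *avg, long sample_ms)
{
	bool warn = sample_ms > 2 * *avg;

	*avg += (sample_ms - *avg) / 4;
	return warn;
}
```

Repeated short excursions (e.g. 4 s) pull the average down slowly without warning; a sudden long excursion well above the learned normal trips the 2*avg threshold.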
Srinivas Pandruvada Oct. 15, 2019, 1:31 p.m. UTC | #8
On Tue, 2019-10-15 at 10:48 +0200, Peter Zijlstra wrote:
> On Mon, Oct 14, 2019 at 02:21:00PM -0700, Srinivas Pandruvada wrote:
> > Some modern systems have very tight thermal tolerances. Because of
> > this
> > they may cross thermal thresholds when running normal workloads
> > (even
> > during boot). The CPU hardware will react by limiting
> > power/frequency
> > and using duty cycles to bring the temperature back into normal
> > range.
> > 
> > Thus users may see a "critical" message about the "temperature
> > above
> > threshold" which is soon followed by "temperature/speed normal".
> > These
> > messages are rate limited, but still may repeat every few minutes.
> > 
> > The solution here is to set a timeout when the temperature first
> > exceeds
> > the threshold.
> 
> Why can we even reach critical thresholds when the fans are working?
> I
> always thought it was BAD to ever reach the critical temps and have
> the
> hardware throttle.
The CPU temperature doesn't have to hit the max (TjMax) to get these
warnings. OEMs have the ability to program a threshold at which a
thermal interrupt can be generated. In some systems the offset is 20 C
or more (a read-only value).

In recent systems, there is another offset on top of that which can be
programmed by the OS, so that some agent can adjust power limits
dynamically. By default this is set low by the firmware, which I guess
was the prime motivation for Benjamin to submit his patch.

Thanks,
Srinivas
Srinivas Pandruvada Oct. 15, 2019, 1:43 p.m. UTC | #9
On Tue, 2019-10-15 at 10:52 +0200, Peter Zijlstra wrote:
> On Mon, Oct 14, 2019 at 03:27:35PM -0700, Luck, Tony wrote:
> > On Mon, Oct 14, 2019 at 11:36:18PM +0200, Borislav Petkov wrote:
> > > This description is already *begging* for this delay value to be
> > > automatically set by the kernel. Putting yet another knob in
> > > front of
> > > the user who doesn't have a clue most of the time shows one more
> > > time
> > > that we haven't done our job properly by asking her to know what
> > > we
> > > already do.
> > > 
> > > IOW, a simple history feedback mechanism which sets the timeout
> > > based on
> > > the last couple of values is much smarter. The thing would have a
> > > max
> > > value, of course, which, when exceeded should mean an anomaly,
> > > etc, but
> > > almost anything else is better than merely asking the user to
> > > make an
> > > educated guess.
> > 
> > You need a plausible start point for the "when to worry the user"
> > message.  Maybe that is your "max value"?
> > 
> > So if the system has a couple of excursions above temperature
> > lasting
> > 1 second and then 2 seconds ... would you like to see those ignored
> > (because they are below the initial max)? But now we have a couple
> > of data points pick some new value to be the threshold for
> > reporting?
> > 
> > What value should we pick (based on 1 sec, then 2 sec)?
> > 
> > I would be worried that it would self tune to the point where it
> > does report something that it really didn't need to (e.g. as a
> > result
> > of a few consecutive very short excursions).
> 
> I'm guessing Boris is thinking of a simple IIR like avg filter.
> 
> 	avg = avg + (sample-avg) / 4
> 
> And then only print when sample > 2*avg. If you initialize that with
> some appropriately large value, it should settle down into what it
> 'normal' for that particular piece of hardware.
I will take a shot with some IIR implementation.

> 
> Still, I'm boggled by the whole idea that hitting critical hard
> throttle
> is considered 'normal' at all.
As explained in my previous email, this is not the so-called TjMax, at
which the CPU will shut down. If this temperature is sustained for a
longer time, the cooling needs to be adjusted.

> 
> > We also need to take into account the "typical sampling interval"
> > for user space thermal control software.
> 
> Why is control of critical thermal crud in userspace? That seems like
> a
> massive design fail.
TjMax is taken care of by the embedded firmware or the kernel, depending
on how the OEM wants it to be controlled. User space mostly balances the
non-CPU parts, which are not urgent. For example, if you run the CPU at
a high temperature for a long duration, the skin will heat up, and it
takes much longer to cool than the CPU itself.

Thanks,
Srinivas
Srinivas Pandruvada Oct. 15, 2019, 2:01 p.m. UTC | #10
On Tue, 2019-10-15 at 10:46 +0200, Borislav Petkov wrote:
> On Mon, Oct 14, 2019 at 03:41:38PM -0700, Srinivas Pandruvada wrote:
> > So some users who had issues in their systems can try with this
> > patch.
> > We can get rid of this, till it becomes real issue.
> 
> We don't add command line parameters which we maybe can get rid of
> later.
I am saying the same thing.
We will not have the command line parameter until this is a problem.

Thanks,
Srinivas

> 
> > The temperature is function of load, time and heat dissipation
> > capacity
> > of the system. I have to think more about this to come up with some
> > heuristics where we still warning users about real thermal issues.
> > Since value is not persistent, then next boot again will start from
> > the
> > default.
> 
> Yes, and the fact that each machine's temperature is influenced by
> the
> specific *individual* environment and load the machine runs, shows
> that
> you need to adjust this timeout automatically and dynamically.
> 
> With the command line parameter you're basically putting the onus on
> the
> user to do that which is just silly. And then she'd need to do it
> during
> runtime too, if the ambient temperature or machine load, etc,
> changes.
> 
> The whole thing is crying "dynamic".
> 
> For a simple example, see mce_timer_fn() where we switch to polling
> during CMCI storms.
>
Peter Zijlstra Oct. 16, 2019, 8:14 a.m. UTC | #11
On Tue, Oct 15, 2019 at 06:31:46AM -0700, Srinivas Pandruvada wrote:
> On Tue, 2019-10-15 at 10:48 +0200, Peter Zijlstra wrote:
> > On Mon, Oct 14, 2019 at 02:21:00PM -0700, Srinivas Pandruvada wrote:
> > > Some modern systems have very tight thermal tolerances. Because of
> > > this
> > > they may cross thermal thresholds when running normal workloads
> > > (even
> > > during boot). The CPU hardware will react by limiting
> > > power/frequency
> > > and using duty cycles to bring the temperature back into normal
> > > range.
> > > 
> > > Thus users may see a "critical" message about the "temperature
> > > above
> > > threshold" which is soon followed by "temperature/speed normal".
> > > These
> > > messages are rate limited, but still may repeat every few minutes.
> > > 
> > > The solution here is to set a timeout when the temperature first
> > > exceeds
> > > the threshold.
> > 
> > Why can we even reach critical thresholds when the fans are working?
> > I
> > always thought it was BAD to ever reach the critical temps and have
> > the
> > hardware throttle.
> CPU temperature doesn't have to hit max(TjMax) to get these warnings.
> OEMs has an ability to program a threshold where a thermal interrupt
> can be generated. In some systems the offset is 20C+ (Read only value).
> 
> In recent systems, there is another offset on top of it which can be
> programmed by OS, once some agent can adjust power limits dynamically.
> By default this is set to low by the firmware, which I guess the prime
> motivation of Benjamin to submit the patch.

That all sounds like the printk should be downgraded too, it is not a
KERN_CRIT warning. It is more a notification that we're getting warm.
Borislav Petkov Oct. 16, 2019, 2 p.m. UTC | #12
On Wed, Oct 16, 2019 at 10:14:05AM +0200, Peter Zijlstra wrote:
> That all sounds like the printk should be downgraded too, it is not a
> KERN_CRIT warning. It is more a notification that we're getting warm.

Right, and I think we should take Benjamin's patch after all - perhaps
even tag it for stable if that message is annoying people too much - and
Srinivas can do the dynamic thing ontop.

Thx.
Tony Luck Oct. 17, 2019, 9:31 p.m. UTC | #13
>> That all sounds like the printk should be downgraded too, it is not a
>> KERN_CRIT warning. It is more a notification that we're getting warm.
>
> Right, and I think we should take Benjamin's patch after all - perhaps
> even tag it for stable if that message is annoying people too much - and
> Srinivas can do the dynamic thing ontop.

That sounds like the right short term action.

Depending on what we end up with from Srinivas ... we may want
to reconsider the severity.  The basic premise of Srinivas' patch
is to avoid printing anything for short excursions above temperature
threshold. But the effect of that is that when we find the core/package
staying above temperature for an extended period of time, we are
in a serious situation where some action may be needed. E.g.
move the laptop off the soft surface that is blocking the air vents.

-Tony
Borislav Petkov Oct. 17, 2019, 9:44 p.m. UTC | #14
On Thu, Oct 17, 2019 at 09:31:30PM +0000, Luck, Tony wrote:
> That sounds like the right short term action.
> 
> Depending on what we end up with from Srinivas ... we may want
> to reconsider the severity.  The basic premise of Srinivas' patch
> is to avoid printing anything for short excursions above temperature
> threshold. But the effect of that is that when we find the core/package
> staying above temperature for an extended period of time, we are
> in a serious situation where some action may be needed. E.g.
> move the laptop off the soft surface that is blocking the air vents.

I don't think having a critical severity message is nearly enough.
There are cases where the users simply won't see that message, no shell
opened, nothing scanning dmesg, nothing pops up on the desktop to show
KERN_CRIT messages, etc.

If we really wanna handle this case then we must be much more reliable:

* we throttle the machine from within the kernel - whatever that may mean
* if that doesn't help, we stop scheduling !root tasks
* if that doesn't help, we halt
* ...

These are purely hypothetical things to do but I'm pointing them out as
an example that in a high temperature situation we should be actively
doing something and not wait for the user to do that.

Come to think of it, one can apply the same type of logic here and split
the temp severity into action-required events and action-optional events
and then depending on the type, we do things.

Now what those things are, should be determined by the severity of the
events. Which would mean, we'd need to know how severe those events are.
And since this is left in the hands of the OEMs, good luck to us. ;-\
Tony Luck Oct. 17, 2019, 11:53 p.m. UTC | #15
> * we throttle the machine from within the kernel - whatever that may mean
> * if that doesn't help, we stop scheduling !root tasks
> * if that doesn't help, we halt

The silicon will do that "halt" step all by itself if the temperature
continues to rise and hits the highest of the temperature thresholds.

-Tony
Borislav Petkov Oct. 18, 2019, 6:46 a.m. UTC | #16
On Thu, Oct 17, 2019 at 11:53:18PM +0000, Luck, Tony wrote:
> > * we throttle the machine from within the kernel - whatever that may mean
> > * if that doesn't help, we stop scheduling !root tasks
> > * if that doesn't help, we halt
> 
> The silicon will do that "halt" step all by itself if the temperature
> continues to rise and hits the highest of the temperature thresholds.

Oh, I know that. But that is not of our concern - I believe we're
discussing right now what to do when the lower, softer limits are
reached and the thermal interrupt fires.

When the hw forces HLT, it means we didn't do a very good job earlier,
before it got so hot.

:-)
Peter Zijlstra Oct. 18, 2019, 7:17 a.m. UTC | #17
On Thu, Oct 17, 2019 at 11:44:45PM +0200, Borislav Petkov wrote:
> On Thu, Oct 17, 2019 at 09:31:30PM +0000, Luck, Tony wrote:
> > That sounds like the right short term action.
> > 
> > Depending on what we end up with from Srinivas ... we may want
> > to reconsider the severity.  The basic premise of Srinivas' patch
> > is to avoid printing anything for short excursions above temperature
> > threshold. But the effect of that is that when we find the core/package
> > staying above temperature for an extended period of time, we are
> > in a serious situation where some action may be needed. E.g.
> > move the laptop off the soft surface that is blocking the air vents.
> 
> I don't think having a critical severity message is nearly enough.
> There are cases where the users simply won't see that message, no shell
> opened, nothing scanning dmesg, nothing pops up on the desktop to show
> KERN_CRIT messages, etc.
> 
> If we really wanna handle this case then we must be much more reliable:
> 
> * we throttle the machine from within the kernel - whatever that may mean
> * if that doesn't help, we stop scheduling !root tasks
> * if that doesn't help, we halt
> * ...

We have forced idle injection; that should be able to reduce the system
to barely functional but non-cooker status.
Srinivas Pandruvada Oct. 18, 2019, 12:26 p.m. UTC | #18
On Thu, 2019-10-17 at 23:44 +0200, Borislav Petkov wrote:
> On Thu, Oct 17, 2019 at 09:31:30PM +0000, Luck, Tony wrote:
> > That sounds like the right short term action.
> > 
> > Depending on what we end up with from Srinivas ... we may want
> > to reconsider the severity.  The basic premise of Srinivas' patch
> > is to avoid printing anything for short excursions above
> > temperature
> > threshold. But the effect of that is that when we find the
> > core/package
> > staying above temperature for an extended period of time, we are
> > in a serious situation where some action may be needed. E.g.
> > move the laptop off the soft surface that is blocking the air
> > vents.
> 
> I don't think having a critical severity message is nearly enough.
> There are cases where the users simply won't see that message, no
> shell
> opened, nothing scanning dmesg, nothing pops up on the desktop to
> show
> KERN_CRIT messages, etc.
> 
> If we really wanna handle this case then we must be much more
> reliable:
> 
> * we throttle the machine from within the kernel - whatever that may
> mean
There are actions associated with high temperature in the ACPI thermal
subsystem. The problem with tying them directly to this warning is that
this threshold temperature is set too low at power-up in some recent
laptops.

Servers/desktops generally rely on the embedded controller for fan
control, over which the kernel has no control. For them this warning
helps to either bring in additional cooling or fix the existing cooling.

If something needs to force throttling from the kernel, then we should
use some offset from the max temperature (aka TjMax) instead of this
warning threshold. Then we can use idle injection or change the duty
cycle of the CPU clocks.

Thanks,
Srinivas

> * if that doesn't help, we stop scheduling !root tasks
> * if that doesn't help, we halt
> * ...
> 
> These are purely hypothetical things to do but I'm pointing them out
> as
> an example that in a high temperature situation we should be actively
> doing something and not wait for the user to do that.
> 
> Come to think of it, one can apply the same type of logic here and
> split
> the temp severity into action-required events and action-optional
> events
> and then depending on the type, we do things.
> 
> Now what those things are, should be determined by the severity of
> the
> events. Which would mean, we'd need to know how severe those events
> are.
> And since this is left in the hands of the OEMs, good luck to us. ;-\
>
Borislav Petkov Oct. 18, 2019, 1:23 p.m. UTC | #19
On Fri, Oct 18, 2019 at 05:26:36AM -0700, Srinivas Pandruvada wrote:
> Servers/desktops generally rely on the embedded controller for fan
> control, over which the kernel has no control. For them this warning
> helps to either bring in additional cooling or fix the existing cooling.

How exactly does this warning help? A detailed example please.

> If the kernel needs to force throttling, then we should use some
> offset from the max temperature (aka TJMax) instead of this warning
> threshold. Then we can use idle injection or change the duty cycle of
> CPU clocks.

Yes, as I said, all this needs to be properly defined first. That is,
*if* there's even need for reacting to thermal interrupts in the kernel.
Srinivas Pandruvada Oct. 18, 2019, 3:55 p.m. UTC | #20
On Fri, 2019-10-18 at 15:23 +0200, Borislav Petkov wrote:
> On Fri, Oct 18, 2019 at 05:26:36AM -0700, Srinivas Pandruvada wrote:
> > Servers/desktops generally rely on the embedded controller for fan
> > control, over which the kernel has no control. For them this warning
> > helps to either bring in additional cooling or fix the existing
> > cooling.
> 
> How exactly does this warning help? A detailed example please.
I assume that someone who is having performance issues or occasional
reboots will look at the logs. Is that a fair assumption? If not,
logging has no value.

In the current code, this logging is misleading. It is reporting all
normal throttling at PROCHOT.

But if a system is running at up to an 87.5% duty cycle on top of the
lowest possible frequency of around 800MHz, someone will notice.
If the logs are not the starting point, someone has to run tools like
turbostat to understand the cause of the performance issues. Then
perhaps someone cleans the air vents on a dusty desktop sitting under
the desk.

Anyway, we can provide better documentation for the sysfs counters this
code is dumping and how to interpret them, with or without logging
support. I can add something under the kernel documentation.

Thanks,
Srinivas




> 
> > If the kernel needs to force throttling, then we should use some
> > offset from the max temperature (aka TJMax) instead of this warning
> > threshold. Then we can use idle injection or change the duty cycle
> > of CPU clocks.
> 
> Yes, as I said, all this needs to be properly defined first. That is,
> *if* there's even need for reacting to thermal interrupts in the
> kernel.
>
Tony Luck Oct. 18, 2019, 6:02 p.m. UTC | #21
On Fri, Oct 18, 2019 at 03:23:09PM +0200, Borislav Petkov wrote:
> On Fri, Oct 18, 2019 at 05:26:36AM -0700, Srinivas Pandruvada wrote:
> > Servers/desktops generally rely on the embedded controller for fan
> > control, over which the kernel has no control. For them this warning
> > helps to either bring in additional cooling or fix the existing cooling.
> 
> How exactly does this warning help? A detailed example please.
> 
> > If the kernel needs to force throttling, then we should use some
> > offset from the max temperature (aka TJMax) instead of this warning
> > threshold. Then we can use idle injection or change the duty cycle
> > of CPU clocks.
> 
> Yes, as I said, all this needs to be properly defined first. That is,
> *if* there's even need for reacting to thermal interrupts in the kernel.

Recap:

We are starting from a place where the kernel prints a message.

Patch already in flight to reduce the severity of the message
(since users are seeing it, and find it annoying/unhelpful that
it has such a high severity).

Srinivas has asserted that in many cases we can eliminate the
message. But wants to keep the message if it seems that there
is something really wrong.

---

So what should we do next?  I don't think there is much by way
of actions that the kernel should take.  While we could stop
scheduling processes, the h/w and f/w have better tools to
reduce frequency, inject idle cycles, speed up fans, etc.
If you do have ideas ... then please share.

So this thread is now about properly defining what actions Linux
should take.

Proposal on the table is the algorithm embodied in Srinivas'
patch (which originated from Alan Cox).

I.e.
1) Ignore short excursions above this threshold.
2) Print a message for persistent problems.
3) Keep a record of total time spent above threshold.

If that's a reasonable approach, then we just need to come
up with a way to define "short excursion" (which might be
platform dependent). If someone has a brilliant idea on
how to do that, we can use it. If not, we #define a number.

If it isn't reasonable ... then propose something better.
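[Editor's sketch: the three steps above, expressed as a small userspace state machine. This is only an illustration of the proposed filtering, not the kernel patch itself; `MAX_THERM_TIME_MS` and the `therm_event()` helper are hypothetical stand-ins for the platform-dependent cutoff and the interrupt path, and the rate limiting present in the real patch is omitted. Timestamps are assumed to be nonzero.]

```c
#include <stdbool.h>

/* Hypothetical stand-in for the platform-dependent "short excursion"
 * cutoff being debated above. */
#define MAX_THERM_TIME_MS 8000ULL

struct therm_filter {
	unsigned long long start_ms;	/* 0 when not currently throttled */
	unsigned long count;		/* total throttle events seen */
	unsigned long long max_time_ms;	/* longest excursion on record */
};

/*
 * Feed one throttle-status sample (assumes now_ms > 0). Returns true
 * when a message should be logged: either the excursion has outlived
 * the cutoff (a persistent problem), or such a long excursion has just
 * ended. Short excursions come and go silently.
 */
static bool therm_event(struct therm_filter *f, bool throttled,
			unsigned long long now_ms)
{
	if (throttled) {
		f->count++;
		if (!f->start_ms)
			f->start_ms = now_ms;	/* excursion begins */
		return now_ms - f->start_ms >= MAX_THERM_TIME_MS;
	}
	if (f->start_ms) {			/* excursion ends */
		unsigned long long t = now_ms - f->start_ms;

		if (t > f->max_time_ms)
			f->max_time_ms = t;
		f->start_ms = 0;
		return t >= MAX_THERM_TIME_MS;	/* report only long ones */
	}
	return false;
}
```

With an 8 second cutoff, a 2 second excursion produces no output, while a 10 second one is reported both while it persists and when it ends, and its duration is recorded in max_time_ms (mirroring the core_throttle_max_time_ms sysfs attribute in the patch).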

-Tony
Borislav Petkov Oct. 18, 2019, 7:40 p.m. UTC | #22
On Fri, Oct 18, 2019 at 08:55:17AM -0700, Srinivas Pandruvada wrote:
> I assume that someone who is having performance issues or occasional
> reboots will look at the logs. Is that a fair assumption?

Yes, that is a valid use case IMO.

> But if a system is running at up to an 87.5% duty cycle on top of the
> lowest possible frequency of around 800MHz, someone will notice.

Yes, but that doesn't justify those printk statements being
KERN_CRIT. They're just fine as warnings.
Borislav Petkov Oct. 18, 2019, 7:45 p.m. UTC | #23
On Fri, Oct 18, 2019 at 11:02:57AM -0700, Luck, Tony wrote:
> So what should we do next?

I was simply keying off this statement of yours:

"Depending on what we end up with from Srinivas ... we may want to
reconsider the severity."

and I don't think that having KERN_CRIT severity for those messages
makes any sense. That's why I was hinting at us organizing and defining
our handling of thermal interrupt events properly so that we handle
those things correctly and not have people look at dmesg.

> I don't think there is much by way of actions that the kernel should
> take. While we could stop scheduling processes, the h/w and f/w have
> better tools to reduce frequency, inject idle cycles, speed up fans,
> etc. If you do have ideas ... then please share.

See above. All resulted from me stating that KERN_CRIT messages or any
type of messages in dmesg as a result of hitting thermal limits are
useless. If we wanna handle those properly, then we need to do something
else.

> Proposal on the table is the algorithm embodied in Srinivas'
> patch (which originated from Alan Cox).

I think we agree on doing the dynamic threshold determination, no?

If, as Srinivas points out in another mail, the purpose of those
messages is when one wants to examine what happened, then fine. If we
must do more, then see above.

Does that make more sense?
Tony Luck Oct. 18, 2019, 8:38 p.m. UTC | #24
On Fri, Oct 18, 2019 at 09:45:03PM +0200, Borislav Petkov wrote:
> On Fri, Oct 18, 2019 at 11:02:57AM -0700, Luck, Tony wrote:
> > So what should we do next?
> 
> I was simply keying off this statement of yours:
> 
> "Depending on what we end up with from Srinivas ... we may want to
> reconsider the severity."
> 
> and I don't think that having KERN_CRIT severity for those messages
> makes any sense. That's why I was hinting at us organizing and defining
> our handling of thermal interrupt events properly so that we handle
> those things correctly and not have people look at dmesg.

Sorry to have caused confusion. The thoughts behind that statement
are that we currently have an issue with too many noisy high severity
messages.  The interim solution we are going with is to downgrade
the severity.  But if we apply a time based filter to remove most of
the noise by not printing at all, maybe what we have left is a very
small number of high severity messages.

But that's completely up for debate.

> I think we agree on doing the dynamic threshold determination, no?

I agree it is a good thing to look at. I'm not so sure we will find
a good enough method that works all the way from tablet to server,
so we might end up with "#define MAX_THERM_TIME 8000" ... but some
study of options would either turn up a good heuristic, or provide
evidence for why that is either hard, or no better than a constant.

> Does that make more sense?

Yes. Thanks for the clarifications.

-Tony
Borislav Petkov Oct. 19, 2019, 8:10 a.m. UTC | #25
On Fri, Oct 18, 2019 at 01:38:32PM -0700, Luck, Tony wrote:
> Sorry to have caused confusion.

Ditto. But us causing confusion is fine - this way we can talk about
what we really wanna do!

:-)))

> The thoughts behind that statement are that we currently have an issue
> with too many noisy high severity messages. The interim solution we
> are going with is to downgrade the severity. But if we apply a time
> based filter to remove most of the noise by not printing at all, maybe
> what we have left is a very small number of high severity messages.
>
> But that's completely up for debate.

Well, I think those messages being pr_warn are fine if one wants to
inspect dmesg for signs of overheating and the platform is hitting some
thermal limits.

And if the time-based filter is not too accurate, that's fine too, I
guess, as long as we don't flood dmesg.

What I don't like is the command line parameter and putting the onus
on the user to decide, although we have all that info in the kernel
already and can make that decision automatically.

> I agree it is a good thing to look at. I'm not so sure we will find
> a good enough method that works all the way from tablet to server,
> so we might end up with "#define MAX_THERM_TIME 8000" ... but some
> study of options would either turn up a good heuristic, or provide
> evidence for why that is either hard, or no better than a constant.

Yeah, I still think a simple avg filter which starts from a sufficiently
high value and improves it over time should be good enough.

Hell, even the trivial formula we use in the CMCI interrupt for polling
might work, where we either double or halve the interval depending
on recent history.
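[Editor's sketch: the CMCI-style double-or-halve idea above could translate roughly as follows. This is an assumed adaptation, not existing code; the function name and the min/max bounds are made up for illustration, and whether the delay should shrink or grow on repeated events is itself a policy choice.]

```c
#include <stdbool.h>

/* Illustrative bounds; real values would need tuning per platform. */
#define THERM_DELAY_MIN_MS	1000U
#define THERM_DELAY_MAX_MS	16000U

/*
 * Halve the warning delay while throttling events keep arriving (react
 * sooner), double it during quiet periods (back off), clamped to a
 * fixed range -- the same double-or-halve shape as the CMCI poll
 * interval logic.
 */
static unsigned int adapt_delay(unsigned int delay_ms, bool throttled)
{
	if (throttled)
		delay_ms /= 2;
	else
		delay_ms *= 2;

	if (delay_ms < THERM_DELAY_MIN_MS)
		delay_ms = THERM_DELAY_MIN_MS;
	if (delay_ms > THERM_DELAY_MAX_MS)
		delay_ms = THERM_DELAY_MAX_MS;

	return delay_ms;
}
```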

Thx.
diff mbox series

Patch

diff --git a/arch/x86/kernel/cpu/mce/therm_throt.c b/arch/x86/kernel/cpu/mce/therm_throt.c
index 6e2becf547c5..b2e9d10bef44 100644
--- a/arch/x86/kernel/cpu/mce/therm_throt.c
+++ b/arch/x86/kernel/cpu/mce/therm_throt.c
@@ -47,8 +47,13 @@  struct _thermal_state {
 	bool			new_event;
 	int			event;
 	u64			next_check;
+	u64			last_interrupt_time;
+	struct timer_list	timer;
 	unsigned long		count;
-	unsigned long		last_count;
+	unsigned long		max_time_ms;
+	int			rate_control_active;
+	int			cpu;
+	int			level;
 };
 
 struct thermal_state {
@@ -121,8 +126,15 @@  define_therm_throt_device_one_ro(package_throttle_count);
 define_therm_throt_device_show_func(package_power_limit, count);
 define_therm_throt_device_one_ro(package_power_limit_count);
 
+define_therm_throt_device_show_func(core_throttle, max_time_ms);
+define_therm_throt_device_one_ro(core_throttle_max_time_ms);
+
+define_therm_throt_device_show_func(package_throttle, max_time_ms);
+define_therm_throt_device_one_ro(package_throttle_max_time_ms);
+
 static struct attribute *thermal_throttle_attrs[] = {
 	&dev_attr_core_throttle_count.attr,
+	&dev_attr_core_throttle_max_time_ms.attr,
 	NULL
 };
 
@@ -135,6 +147,19 @@  static const struct attribute_group thermal_attr_group = {
 #define CORE_LEVEL	0
 #define PACKAGE_LEVEL	1
 
+#define THERM_THROT_WARN_INTERVAL_MS	8000
+static unsigned int thermal_warn_interval = THERM_THROT_WARN_INTERVAL_MS;
+
+static void therm_throt_active_timer_fn(struct timer_list *t)
+{
+	struct _thermal_state *state = from_timer(state, t, timer);
+
+	pr_crit("CPU%d: %s temperature is above threshold, cpu clock is throttled for the last %u milliseconds (total events = %lu)\n",
+		state->cpu,
+		state->level == CORE_LEVEL ? "Core" : "Package",
+		thermal_warn_interval, state->count);
+}
+
 /***
  * therm_throt_process - Process thermal throttling event from interrupt
  * @curr: Whether the condition is current or not (boolean), since the
@@ -174,31 +199,42 @@  static void therm_throt_process(bool new_event, int event, int level)
 
 	old_event = state->new_event;
 	state->new_event = new_event;
+	state->level = level;
 
 	if (new_event)
 		state->count++;
 
 	if (time_before64(now, state->next_check) &&
-			state->count != state->last_count)
+			  state->rate_control_active)
 		return;
 
+	state->rate_control_active = 0;
+
 	state->next_check = now + CHECK_INTERVAL;
-	state->last_count = state->count;
 
-	/* if we just entered the thermal event */
-	if (new_event) {
-		if (event == THERMAL_THROTTLING_EVENT)
-			pr_crit("CPU%d: %s temperature above threshold, cpu clock throttled (total events = %lu)\n",
-				this_cpu,
-				level == CORE_LEVEL ? "Core" : "Package",
-				state->count);
-		return;
-	}
-	if (old_event) {
-		if (event == THERMAL_THROTTLING_EVENT)
-			pr_info("CPU%d: %s temperature/speed normal\n", this_cpu,
-				level == CORE_LEVEL ? "Core" : "Package");
-		return;
+	if (event == THERMAL_THROTTLING_EVENT) {
+		if (new_event && !state->last_interrupt_time) {
+			state->last_interrupt_time = now;
+			if (!timer_pending(&state->timer))
+				mod_timer(&state->timer,
+					  (now + msecs_to_jiffies(thermal_warn_interval)));
+		} else if (old_event && state->last_interrupt_time) {
+			unsigned long throttle_time;
+			int ret;
+
+			ret = del_timer(&state->timer);
+			throttle_time = jiffies_delta_to_msecs(now - state->last_interrupt_time);
+			if (!ret) {
+				pr_crit("CPU%d: %s temperature/speed normal (total events = %lu, throttled time: %lu milliseconds)\n",
+					state->cpu,
+					state->level == CORE_LEVEL ? "Core" : "Package",
+					state->count, throttle_time);
+				state->rate_control_active = 1;
+			}
+			if (throttle_time > state->max_time_ms)
+				state->max_time_ms = throttle_time;
+			state->last_interrupt_time = 0;
+		}
 	}
 }
 
@@ -252,6 +288,9 @@  static int thermal_throttle_add_dev(struct device *dev, unsigned int cpu)
 		err = sysfs_add_file_to_group(&dev->kobj,
 					      &dev_attr_package_throttle_count.attr,
 					      thermal_attr_group.name);
+		err = sysfs_add_file_to_group(&dev->kobj,
+					      &dev_attr_package_throttle_max_time_ms.attr,
+					      thermal_attr_group.name);
 		if (cpu_has(c, X86_FEATURE_PLN) && int_pln_enable)
 			err = sysfs_add_file_to_group(&dev->kobj,
 					&dev_attr_package_power_limit_count.attr,
@@ -269,15 +308,28 @@  static void thermal_throttle_remove_dev(struct device *dev)
 /* Get notified when a cpu comes on/off. Be hotplug friendly. */
 static int thermal_throttle_online(unsigned int cpu)
 {
+	struct thermal_state *state = &per_cpu(thermal_state, cpu);
 	struct device *dev = get_cpu_device(cpu);
 
+	state->package_throttle.cpu = cpu;
+	state->core_throttle.cpu = cpu;
+
+	timer_setup(&state->package_throttle.timer, therm_throt_active_timer_fn, 0);
+	timer_setup(&state->core_throttle.timer, therm_throt_active_timer_fn, 0);
+
 	return thermal_throttle_add_dev(dev, cpu);
 }
 
 static int thermal_throttle_offline(unsigned int cpu)
 {
+	struct thermal_state *state = &per_cpu(thermal_state, cpu);
 	struct device *dev = get_cpu_device(cpu);
 
+	del_timer(&state->package_throttle.timer);
+	del_timer(&state->core_throttle.timer);
+	state->package_throttle.last_interrupt_time = 0;
+	state->core_throttle.last_interrupt_time = 0;
+
 	thermal_throttle_remove_dev(dev);
 	return 0;
 }
@@ -522,3 +574,11 @@  void intel_init_thermal(struct cpuinfo_x86 *c)
 	/* enable thermal throttle processing */
 	atomic_set(&therm_throt_en, 1);
 }
+
+static int __init therm_warn_delay(char *str)
+{
+	get_option(&str, &thermal_warn_interval);
+
+	return 0;
+}
+early_param("x86_therm_warn_delay", therm_warn_delay);
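[Editor's note: for reference, the two knobs the patch adds would be used as below. This is a usage sketch based only on the names in the patch; the sysfs path assumes the usual per-CPU thermal_throttle location.]

```
# Kernel command line: lower the warning delay to 4 seconds
x86_therm_warn_delay=4000

# Longest observed throttle excursion, in milliseconds
cat /sys/devices/system/cpu/cpu0/thermal_throttle/core_throttle_max_time_ms
```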