Message ID | 20191014212101.25719-1-srinivas.pandruvada@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | [1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages |
On Mon, Oct 14, 2019 at 02:21:00PM -0700, Srinivas Pandruvada wrote:
> Some modern systems have very tight thermal tolerances. Because of this they may cross thermal thresholds when running normal workloads (even during boot). The CPU hardware will react by limiting power/frequency and using duty cycles to bring the temperature back into normal range.
>
> Thus users may see a "critical" message about the "temperature above threshold" which is soon followed by "temperature/speed normal". These messages are rate limited, but still may repeat every few minutes.
>
> The solution here is to set a timeout when the temperature first exceeds the threshold. If the CPU returns to normal before the timeout fires, we skip printing any messages. If we reach the timeout, then there may be a real thermal issue (e.g. inoperative or blocked fan) and we print the message (together with a count of how many thermal events have occurred). A rate control method is used to avoid printing repeatedly on these broken systems.
>
> Some experimentation with fans enabled showed that temperature returned to normal on a laptop in ~4 seconds. With fans disabled it took over 10 seconds. Default timeout is thus set to 8 seconds, but may be changed with kernel boot parameter: "x86_therm_warn_delay". This default interval is twice the typical sampling interval for cooling using the running average power limit from user space thermal control software.
>
> In addition a new sysfs attribute is added to show the maximum amount of time in milliseconds the system was in the throttled state. This will allow changing x86_therm_warn_delay, if required.

This description is already *begging* for this delay value to be automatically set by the kernel. Putting yet another knob in front of the user who doesn't have a clue most of the time shows one more time that we haven't done our job properly by asking her to know what we already do.

IOW, a simple history feedback mechanism which sets the timeout based on the last couple of values is much smarter. The thing would have a max value, of course, which, when exceeded should mean an anomaly, etc, but almost anything else is better than merely asking the user to make an educated guess.

> Suggested-by: Alan Cox <alan@linux.intel.com>
> Commit-comment-by: Tony Luck <tony.luck@intel.com>
  ^^^^^^^^^^^^^^^^^^

What's that?
On Mon, Oct 14, 2019 at 11:36:18PM +0200, Borislav Petkov wrote:
> This description is already *begging* for this delay value to be automatically set by the kernel. Putting yet another knob in front of the user who doesn't have a clue most of the time shows one more time that we haven't done our job properly by asking her to know what we already do.
>
> IOW, a simple history feedback mechanism which sets the timeout based on the last couple of values is much smarter. The thing would have a max value, of course, which, when exceeded should mean an anomaly, etc, but almost anything else is better than merely asking the user to make an educated guess.

You need a plausible start point for the "when to worry the user" message. Maybe that is your "max value"?

So if the system has a couple of excursions above temperature lasting 1 second and then 2 seconds ... would you like to see those ignored (because they are below the initial max)? But now we have a couple of data points, pick some new value to be the threshold for reporting? What value should we pick (based on 1 sec, then 2 sec)?

I would be worried that it would self tune to the point where it does report something that it really didn't need to (e.g. as a result of a few consecutive very short excursions).

We also need to take into account the "typical sampling interval" for user space thermal control software.

Srinivas: Maybe this needs to have some more detail on what user solutions are being taken into account here.

> > Suggested-by: Alan Cox <alan@linux.intel.com>
> > Commit-comment-by: Tony Luck <tony.luck@intel.com>
> ^^^^^^^^^^^^^^^^^^
>
> What's that?

My fault ... during the review process I pretty much re-wrote the whole commit message to follow the form of: "What is the problem?" "How are we fixing it?"

But I didn't want Srinivas to take the heat for any mistakes that were my fault. "Co-developed-by" really didn't explain what happened (since I didn't write any code, just made suggestions on things that needed to be changed/improved).

-Tony
On Mon, 2019-10-14 at 23:36 +0200, Borislav Petkov wrote:
> On Mon, Oct 14, 2019 at 02:21:00PM -0700, Srinivas Pandruvada wrote:
> > Some modern systems have very tight thermal tolerances. Because of this they may cross thermal thresholds when running normal workloads (even during boot). [...]
>
> This description is already *begging* for this delay value to be automatically set by the kernel. Putting yet another knob in front of the user who doesn't have a clue most of the time shows one more time that we haven't done our job properly by asking her to know what we already do.

I experimented on systems released since the Sandy Bridge era, but for someone running a 10-year-old system this is a fallback mechanism. I don't expect that users will have to tune away from the default, but saying that with certainty is difficult. The source of this PROCHOT signal can be anything on the board. So some users who had issues on their systems can try this patch. We can get rid of this until it becomes a real issue.

> IOW, a simple history feedback mechanism which sets the timeout based on the last couple of values is much smarter. The thing would have a max value, of course, which, when exceeded should mean an anomaly, etc, but almost anything else is better than merely asking the user to make an educated guess.

The temperature is a function of load, time and the heat dissipation capacity of the system. I have to think more about this to come up with some heuristic where we still warn users about real thermal issues. Since the value is not persistent, the next boot will again start from the default.

> > Suggested-by: Alan Cox <alan@linux.intel.com>
> > Commit-comment-by: Tony Luck <tony.luck@intel.com>
> > ^^^^^^^^^^^^^^^^^^
> > What's that?

Tony suggested this to indicate that he rewrote the commit description as he didn't like my description. Definitely checkpatch doesn't like this.

Thanks,
Srinivas
On Mon, Oct 14, 2019 at 03:27:35PM -0700, Luck, Tony wrote:
> You need a plausible start point for the "when to worry the user" message. Maybe that is your "max value"?

Yes, that would be a good start. You need that anyway because the experimentations you guys did to get your numbers have been done in some ambient temperature of X. I betcha when the ambient temperature differs considerably from yours, the numbers don't mean a whole lot. Which makes a dynamic adjustment even more important.

> So if the system has a couple of excursions above temperature lasting 1 second and then 2 seconds ... would you like to see those ignored (because they are below the initial max)? But now we have a couple of data points pick some new value to be the threshold for reporting?
>
> What value should we pick (based on 1 sec, then 2 sec)?
>
> I would be worried that it would self tune to the point where it does report something that it really didn't need to (e.g. as a result of a few consecutive very short excursions).

You select a history feedback formula with which sudden changes influence the timeout value relatively slowly and keep the current timeout value rather inert. They would take effect only when such spikes hold on for a longer time, i.e., take up a longer chunk of the sampling interval.

> We also need to take into account the "typical sampling interval" for user space thermal control software.

Yes to the sampling interval, not so sure about doing anything in luserspace. This should all be done in the kernel automatically.

> My fault ... during review process I pretty much re-wrote the whole commit message to follow the form of:
> "What is the problem?"
> "How are we fixing it"

Cool.

> But I didn't want Srinivas to take the heat for any mistakes that were my fault. "Co-developed-by" really didn't explain what happened (since I didn't write any code, just made suggestions on things that needed to be changed/improved).

Yeah, so stuff like that is usually added with free text at the end of the commit message where you have more than a couple of words in a tag to explain what happened.
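[The kind of slow, inert history feedback described above could look something like the sketch below - a standalone C illustration only, not code from the patch or the kernel; the start value, the cap and the 1/8 weight are made-up assumptions.]

#include <stdio.h>

#define TIMEOUT_START_MS 8000   /* plausible start point ("max value") - assumed */
#define TIMEOUT_CAP_MS   30000  /* beyond this we would call it an anomaly - assumed */

static unsigned int warn_timeout_ms = TIMEOUT_START_MS;

/* Feed the duration of each completed excursion above the threshold. */
static void update_warn_timeout(unsigned int excursion_ms)
{
        /* Inert update: each sample moves the timeout by only 1/8 of the error. */
        warn_timeout_ms += ((int)excursion_ms - (int)warn_timeout_ms) / 8;

        if (warn_timeout_ms > TIMEOUT_CAP_MS)
                warn_timeout_ms = TIMEOUT_CAP_MS;
}

int main(void)
{
        /* A few short excursions barely move the timeout; a long one shifts it slowly. */
        unsigned int samples[] = { 1000, 2000, 1500, 12000, 900 };

        for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
                update_warn_timeout(samples[i]);
                printf("excursion %5u ms -> warn timeout now %u ms\n",
                       samples[i], warn_timeout_ms);
        }
        return 0;
}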
On Mon, Oct 14, 2019 at 03:41:38PM -0700, Srinivas Pandruvada wrote:
> So some users who had issues on their systems can try this patch. We can get rid of this until it becomes a real issue.

We don't add command line parameters which we maybe can get rid of later.

> The temperature is a function of load, time and the heat dissipation capacity of the system. I have to think more about this to come up with some heuristic where we still warn users about real thermal issues. Since the value is not persistent, the next boot will again start from the default.

Yes, and the fact that each machine's temperature is influenced by the specific *individual* environment and load the machine runs, shows that you need to adjust this timeout automatically and dynamically.

With the command line parameter you're basically putting the onus on the user to do that which is just silly. And then she'd need to do it during runtime too, if the ambient temperature or machine load, etc, changes.

The whole thing is crying "dynamic".

For a simple example, see mce_timer_fn() where we switch to polling during CMCI storms.
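[For reference, the adaptive-interval pattern pointed at above boils down to polling more often while events keep coming and backing off when things are quiet. This is a simplified standalone sketch of that pattern with illustrative constants, not the actual arch/x86/kernel/cpu/mce/core.c code.]

#include <stdbool.h>
#include <stdio.h>

#define INTERVAL_MIN_MS   100     /* poll fast while events keep coming - assumed */
#define INTERVAL_MAX_MS   300000  /* relax when nothing has happened - assumed */

static unsigned long poll_interval_ms = INTERVAL_MAX_MS;

static unsigned long next_interval(bool event_seen)
{
        if (event_seen) {
                /* Something happened: halve the interval, poll more often. */
                poll_interval_ms /= 2;
                if (poll_interval_ms < INTERVAL_MIN_MS)
                        poll_interval_ms = INTERVAL_MIN_MS;
        } else {
                /* Quiet: double the interval, back off towards the maximum. */
                poll_interval_ms *= 2;
                if (poll_interval_ms > INTERVAL_MAX_MS)
                        poll_interval_ms = INTERVAL_MAX_MS;
        }
        return poll_interval_ms;
}

int main(void)
{
        bool history[] = { true, true, true, false, false, true, false };

        for (unsigned int i = 0; i < sizeof(history) / sizeof(history[0]); i++)
                printf("event=%d -> next poll in %lu ms\n",
                       history[i], next_interval(history[i]));
        return 0;
}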
On Mon, Oct 14, 2019 at 02:21:00PM -0700, Srinivas Pandruvada wrote:
> Some modern systems have very tight thermal tolerances. Because of this they may cross thermal thresholds when running normal workloads (even during boot). The CPU hardware will react by limiting power/frequency and using duty cycles to bring the temperature back into normal range.
>
> Thus users may see a "critical" message about the "temperature above threshold" which is soon followed by "temperature/speed normal". These messages are rate limited, but still may repeat every few minutes.
>
> The solution here is to set a timeout when the temperature first exceeds the threshold.

Why can we even reach critical thresholds when the fans are working? I always thought it was BAD to ever reach the critical temps and have the hardware throttle.
On Mon, Oct 14, 2019 at 03:27:35PM -0700, Luck, Tony wrote:
> On Mon, Oct 14, 2019 at 11:36:18PM +0200, Borislav Petkov wrote:
> > This description is already *begging* for this delay value to be automatically set by the kernel. [...]
> >
> > IOW, a simple history feedback mechanism which sets the timeout based on the last couple of values is much smarter. [...]
>
> You need a plausible start point for the "when to worry the user" message. Maybe that is your "max value"?
>
> So if the system has a couple of excursions above temperature lasting 1 second and then 2 seconds ... would you like to see those ignored (because they are below the initial max)? But now we have a couple of data points pick some new value to be the threshold for reporting?
>
> What value should we pick (based on 1 sec, then 2 sec)?
>
> I would be worried that it would self tune to the point where it does report something that it really didn't need to (e.g. as a result of a few consecutive very short excursions).

I'm guessing Boris is thinking of a simple IIR like avg filter.

	avg = avg + (sample-avg) / 4

And then only print when sample > 2*avg. If you initialize that with some appropriately large value, it should settle down into what is 'normal' for that particular piece of hardware.

Still, I'm boggled by the whole idea that hitting critical hard throttle is considered 'normal' at all.

> We also need to take into account the "typical sampling interval" for user space thermal control software.

Why is control of critical thermal crud in userspace? That seems like a massive design fail.
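[A standalone C sketch of the filter described above, using the avg = avg + (sample-avg)/4 update and the sample > 2*avg print condition; the initial value and the sample data are illustrative assumptions, not anything from the patch.]

#include <stdio.h>

static long avg = 8000; /* start from an "appropriately large value" (ms) - assumed */

static int should_print(long sample_ms)
{
        avg = avg + (sample_ms - avg) / 4;
        return sample_ms > 2 * avg;
}

int main(void)
{
        /* Throttle durations in ms: mostly short blips, one long excursion. */
        long samples[] = { 1200, 900, 1500, 1100, 14000, 1000 };

        for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
                int warn = should_print(samples[i]);

                printf("sample %5ld ms -> avg %5ld ms, %s\n",
                       samples[i], avg, warn ? "print warning" : "stay quiet");
        }
        return 0;
}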
On Tue, 2019-10-15 at 10:48 +0200, Peter Zijlstra wrote:
> On Mon, Oct 14, 2019 at 02:21:00PM -0700, Srinivas Pandruvada wrote:
> > Some modern systems have very tight thermal tolerances. Because of this they may cross thermal thresholds when running normal workloads (even during boot). [...]
>
> Why can we even reach critical thresholds when the fans are working? I always thought it was BAD to ever reach the critical temps and have the hardware throttle.

The CPU temperature doesn't have to hit the maximum (TjMax) to get these warnings. OEMs have the ability to program a threshold at which a thermal interrupt can be generated. In some systems the offset is 20C+ (a read-only value).

In recent systems, there is another offset on top of that which can be programmed by the OS, so that some agent can adjust power limits dynamically. By default this is set low by the firmware, which I guess is the prime motivation for Benjamin to submit his patch.

Thanks,
Srinivas
On Tue, 2019-10-15 at 10:52 +0200, Peter Zijlstra wrote:
> On Mon, Oct 14, 2019 at 03:27:35PM -0700, Luck, Tony wrote:
> > [...]
>
> I'm guessing Boris is thinking of a simple IIR like avg filter.
>
>	avg = avg + (sample-avg) / 4
>
> And then only print when sample > 2*avg. If you initialize that with some appropriately large value, it should settle down into what is 'normal' for that particular piece of hardware.

I will take a shot at an IIR implementation.

> Still, I'm boggled by the whole idea that hitting critical hard throttle is considered 'normal' at all.

As explained in my previous email, this is not the so-called TjMax, where it will shut down. If you keep this temperature for a longer time, the cooling needs to be adjusted.

> > We also need to take into account the "typical sampling interval" for user space thermal control software.
>
> Why is control of critical thermal crud in userspace? That seems like a massive design fail.

TjMax is taken care of by the embedded firmware or the kernel, depending on how the OEM wants it to be controlled. User space mostly balances the non-CPU parts, which are not urgent. For example, if you run the CPU at high temperature for a long duration, the skin will heat up, and that takes much longer to cool than the CPU itself.

Thanks,
Srinivas
On Tue, 2019-10-15 at 10:46 +0200, Borislav Petkov wrote:
> On Mon, Oct 14, 2019 at 03:41:38PM -0700, Srinivas Pandruvada wrote:
> > So some users who had issues on their systems can try this patch. We can get rid of this until it becomes a real issue.
>
> We don't add command line parameters which we maybe can get rid of later.

I am saying the same thing. We will not have a command line parameter until this is a problem.

Thanks,
Srinivas

> [...]
On Tue, Oct 15, 2019 at 06:31:46AM -0700, Srinivas Pandruvada wrote:
> On Tue, 2019-10-15 at 10:48 +0200, Peter Zijlstra wrote:
> > [...]
> >
> > Why can we even reach critical thresholds when the fans are working? I always thought it was BAD to ever reach the critical temps and have the hardware throttle.
>
> The CPU temperature doesn't have to hit the maximum (TjMax) to get these warnings. OEMs have the ability to program a threshold at which a thermal interrupt can be generated. In some systems the offset is 20C+ (a read-only value).
>
> In recent systems, there is another offset on top of that which can be programmed by the OS, so that some agent can adjust power limits dynamically. By default this is set low by the firmware, which I guess is the prime motivation for Benjamin to submit his patch.

That all sounds like the printk should be downgraded too, it is not a KERN_CRIT warning. It is more a notification that we're getting warm.
On Wed, Oct 16, 2019 at 10:14:05AM +0200, Peter Zijlstra wrote:
> That all sounds like the printk should be downgraded too, it is not a KERN_CRIT warning. It is more a notification that we're getting warm.

Right, and I think we should take Benjamin's patch after all - perhaps even tag it for stable if that message is annoying people too much - and Srinivas can do the dynamic thing ontop.

Thx.
>> That all sounds like the printk should be downgraded too, it is not a KERN_CRIT warning. It is more a notification that we're getting warm.
>
> Right, and I think we should take Benjamin's patch after all - perhaps even tag it for stable if that message is annoying people too much - and Srinivas can do the dynamic thing ontop.

That sounds like the right short term action.

Depending on what we end up with from Srinivas ... we may want to reconsider the severity. The basic premise of Srinivas' patch is to avoid printing anything for short excursions above temperature threshold. But the effect of that is that when we find the core/package staying above temperature for an extended period of time, we are in a serious situation where some action may be needed. E.g. move the laptop off the soft surface that is blocking the air vents.

-Tony
On Thu, Oct 17, 2019 at 09:31:30PM +0000, Luck, Tony wrote:
> That sounds like the right short term action.
>
> Depending on what we end up with from Srinivas ... we may want to reconsider the severity. The basic premise of Srinivas' patch is to avoid printing anything for short excursions above temperature threshold. But the effect of that is that when we find the core/package staying above temperature for an extended period of time, we are in a serious situation where some action may be needed. E.g. move the laptop off the soft surface that is blocking the air vents.

I don't think having a critical severity message is nearly enough. There are cases where the users simply won't see that message, no shell opened, nothing scanning dmesg, nothing pops up on the desktop to show KERN_CRIT messages, etc.

If we really wanna handle this case then we must be much more reliable:

* we throttle the machine from within the kernel - whatever that may mean
* if that doesn't help, we stop scheduling !root tasks
* if that doesn't help, we halt
* ...

These are purely hypothetical things to do but I'm pointing them out as an example that in a high temperature situation we should be actively doing something and not wait for the user to do that.

Come to think of it, one can apply the same type of logic here and split the temp severity into action-required events and action-optional events and then depending on the type, we do things.

Now what those things are, should be determined by the severity of the events. Which would mean, we'd need to know how severe those events are. And since this is left in the hands of the OEMs, good luck to us. ;-\
> * we throttle the machine from within the kernel - whatever that may mean
> * if that doesn't help, we stop scheduling !root tasks
> * if that doesn't help, we halt

The silicon will do that "halt" step all by itself if the temperature continues to rise and hits the highest of the temperature thresholds.

-Tony
On Thu, Oct 17, 2019 at 11:53:18PM +0000, Luck, Tony wrote:
> > * we throttle the machine from within the kernel - whatever that may mean
> > * if that doesn't help, we stop scheduling !root tasks
> > * if that doesn't help, we halt
>
> The silicon will do that "halt" step all by itself if the temperature continues to rise and hits the highest of the temperature thresholds.

Oh, I know that. But that is not of our concern - I believe we're discussing right now what to do when the lower, softer limits are reached and the thermal interrupt fires. When the hw forces HLT, it means we didn't do a very good job earlier, before it got so hot. :-)
On Thu, Oct 17, 2019 at 11:44:45PM +0200, Borislav Petkov wrote:
> [...]
>
> If we really wanna handle this case then we must be much more reliable:
>
> * we throttle the machine from within the kernel - whatever that may mean
> * if that doesn't help, we stop scheduling !root tasks
> * if that doesn't help, we halt
> * ...

We have forced idle injection, that should be able to reduce the system to barely functional but non-cooker status.
On Thu, 2019-10-17 at 23:44 +0200, Borislav Petkov wrote:
> On Thu, Oct 17, 2019 at 09:31:30PM +0000, Luck, Tony wrote:
> > [...]
>
> I don't think having a critical severity message is nearly enough. There are cases where the users simply won't see that message, no shell opened, nothing scanning dmesg, nothing pops up on the desktop to show KERN_CRIT messages, etc.
>
> If we really wanna handle this case then we must be much more reliable:
>
> * we throttle the machine from within the kernel - whatever that may mean

There are actions associated with high temperature via the ACPI thermal subsystem. The problem with tying them to this warning directly is that this threshold temperature is set too low in some recent laptops at power up.

Servers/desktops generally rely on the embedded controller for FAN control, over which the kernel has no control. For them this warning helps to either bring in additional cooling or fix the existing cooling.

If something needs to force throttling from the kernel, then we should use some offset from the max temperature (aka TjMax), instead of this warning threshold. Then we can use idle injection or change the duty cycle of the CPU clocks.

Thanks,
Srinivas

> * if that doesn't help, we stop scheduling !root tasks
> * if that doesn't help, we halt
> * ...
>
> [...]
On Fri, Oct 18, 2019 at 05:26:36AM -0700, Srinivas Pandruvada wrote:
> Servers/desktops generally rely on the embedded controller for FAN control, over which the kernel has no control. For them this warning helps to either bring in additional cooling or fix the existing cooling.

How exactly does this warning help? A detailed example please.

> If something needs to force throttling from the kernel, then we should use some offset from the max temperature (aka TjMax), instead of this warning threshold. Then we can use idle injection or change the duty cycle of the CPU clocks.

Yes, as I said, all this needs to be properly defined first. That is, *if* there's even need for reacting to thermal interrupts in the kernel.
On Fri, 2019-10-18 at 15:23 +0200, Borislav Petkov wrote:
> On Fri, Oct 18, 2019 at 05:26:36AM -0700, Srinivas Pandruvada wrote:
> > Servers/desktops generally rely on the embedded controller for FAN control, over which the kernel has no control. For them this warning helps to either bring in additional cooling or fix the existing cooling.
>
> How exactly does this warning help? A detailed example please.

I assume that someone who is having performance issues or occasional reboots looks at the logs. Is that a fair assumption? If not, logging has no value.

In the current code, this logging is misleading: it is reporting all normal throttling at PROCHOT. But if a system is running at up to an 87.5% duty cycle on top of the lowest possible frequency of around 800MHz, someone will notice. If the logs are not the starting point, someone has to run tools like turbostat and understand the cause of the performance issues. Then probably someone cleans the air vents on the dusty desktop sitting under the desk.

Anyway, we can provide better documentation for the sysfs counters this code is dumping and how to interpret them, with or without logging support. I can add some documentation under the kernel Documentation.

Thanks,
Srinivas

> > If something needs to force throttling from the kernel, then we should use some offset from the max temperature (aka TjMax), instead of this warning threshold. Then we can use idle injection or change the duty cycle of the CPU clocks.
>
> Yes, as I said, all this needs to be properly defined first. That is, *if* there's even need for reacting to thermal interrupts in the kernel.
On Fri, Oct 18, 2019 at 03:23:09PM +0200, Borislav Petkov wrote:
> On Fri, Oct 18, 2019 at 05:26:36AM -0700, Srinivas Pandruvada wrote:
> > Servers/desktops generally rely on the embedded controller for FAN control, over which the kernel has no control. For them this warning helps to either bring in additional cooling or fix the existing cooling.
>
> How exactly does this warning help? A detailed example please.
>
> > If something needs to force throttling from the kernel, then we should use some offset from the max temperature (aka TjMax), instead of this warning threshold. Then we can use idle injection or change the duty cycle of the CPU clocks.
>
> Yes, as I said, all this needs to be properly defined first. That is, *if* there's even need for reacting to thermal interrupts in the kernel.

Recap: We are starting from a place where the kernel prints a message. A patch is already in flight to reduce the severity of the message (since users are seeing it, and find it annoying/unhelpful that it has such a high severity). Srinivas has asserted that in many cases we can eliminate the message, but wants to keep the message if it seems that there is something really wrong.

---

So what should we do next?

I don't think there is much by way of actions that the kernel should take. While we could stop scheduling processes, the h/w and f/w have better tools to reduce frequency, inject idle cycles, speed up fans, etc. If you do have ideas ... then please share.

So this thread is now about doing the proper definition of what actions Linux should take. The proposal on the table is the algorithm embodied in Srinivas' patch (which originated from Alan Cox). I.e.

1) Ignore short excursions above this threshold.
2) Print a message for persistent problems.
3) Keep a record of total time spent above threshold.

If that's a reasonable approach, then we just need to come up with a way to define "short excursion" (which might be platform dependent). If someone has a brilliant idea on how to do that, we can use it. If not, we #define a number.

If it isn't reasonable ... then propose something better.

-Tony
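[A compact standalone sketch of the three points above - plain millisecond timestamps instead of kernel timers; only the 8000 ms default mirrors the posted patch, everything else is an illustrative assumption.]

#include <stdio.h>

#define WARN_AFTER_MS 8000

static unsigned long throttle_start;    /* 0 = not currently throttled */
static unsigned long max_throttle_ms;   /* worst excursion seen so far */

static void throttle_event(int throttled, unsigned long now_ms)
{
        unsigned long duration;

        if (throttled) {
                if (!throttle_start)
                        throttle_start = now_ms;        /* excursion begins */
                return;
        }

        if (!throttle_start)
                return;                         /* spurious "normal" event */

        duration = now_ms - throttle_start;
        throttle_start = 0;

        if (duration > max_throttle_ms)
                max_throttle_ms = duration;     /* 3) keep a record */

        if (duration >= WARN_AFTER_MS)          /* 2) persistent problem */
                printf("warning: throttled for %lu ms\n", duration);
        /* 1) otherwise: short excursion, stay silent */
}

int main(void)
{
        throttle_event(1, 1000);  throttle_event(0, 3000);   /* 2 s: ignored  */
        throttle_event(1, 10000); throttle_event(0, 22000);  /* 12 s: warning */
        printf("max time throttled: %lu ms\n", max_throttle_ms);
        return 0;
}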
On Fri, Oct 18, 2019 at 08:55:17AM -0700, Srinivas Pandruvada wrote:
> I assume that someone who is having performance issues or occasional reboots looks at the logs. Is that a fair assumption?

Yes, that is a valid use case IMO.

> But if a system is running at up to an 87.5% duty cycle on top of the lowest possible frequency of around 800MHz, someone will notice.

Yes, but that doesn't justify for those printk statements to be KERN_CRIT. They're just fine as warnings.
On Fri, Oct 18, 2019 at 11:02:57AM -0700, Luck, Tony wrote:
> So what should we do next?

I was simply keying off this statement of yours:

"Depending on what we end up with from Srinivas ... we may want to reconsider the severity."

and I don't think that having KERN_CRIT severity for those messages makes any sense. That's why I was hinting at us organizing and defining our handling of thermal interrupt events properly so that we handle those things correctly and not have people look at dmesg.

> I don't think there is much by way of actions that the kernel should take. While we could stop scheduling processes, the h/w and f/w have better tools to reduce frequency, inject idle cycles, speed up fans, etc. If you do have ideas ... then please share.

See above. All resulted from me stating that KERN_CRIT messages or any type of messages in dmesg as a result of hitting thermal limits are useless. If we wanna handle those properly, then we need to do something else.

> The proposal on the table is the algorithm embodied in Srinivas' patch (which originated from Alan Cox).

I think we agree on doing the dynamic threshold determination, no?

If, as Srinivas points out in another mail, the purpose of those messages is when one wants to examine what happened, then fine. If we must do more, then see above.

Does that make more sense?
On Fri, Oct 18, 2019 at 09:45:03PM +0200, Borislav Petkov wrote:
> On Fri, Oct 18, 2019 at 11:02:57AM -0700, Luck, Tony wrote:
> > So what should we do next?
>
> I was simply keying off this statement of yours:
>
> "Depending on what we end up with from Srinivas ... we may want to reconsider the severity."
>
> and I don't think that having KERN_CRIT severity for those messages makes any sense. That's why I was hinting at us organizing and defining our handling of thermal interrupt events properly so that we handle those things correctly and not have people look at dmesg.

Sorry to have caused confusion. The thoughts behind that statement are that we currently have an issue with too many noisy high severity messages. The interim solution we are going with is to downgrade the severity. But if we apply a time based filter to remove most of the noise by not printing at all, maybe what we have left is a very small number of high severity messages.

But that's completely up for debate.

> I think we agree on doing the dynamic threshold determination, no?

I agree it is a good thing to look at. I'm not so sure we will find a good enough method that works all the way from tablet to server, so we might end up with "#define MAX_THERM_TIME 8000" ... but some study of options would either turn up a good heuristic, or provide evidence for why that is either hard, or no better than a constant.

> Does that make more sense?

Yes. Thanks for the clarifications.

-Tony
On Fri, Oct 18, 2019 at 01:38:32PM -0700, Luck, Tony wrote:
> Sorry to have caused confusion.

Ditto. But us causing confusion is fine - this way we can talk about what we really wanna do! :-)))

> The thoughts behind that statement are that we currently have an issue with too many noisy high severity messages. The interim solution we are going with is to downgrade the severity. But if we apply a time based filter to remove most of the noise by not printing at all, maybe what we have left is a very small number of high severity messages.
>
> But that's completely up for debate.

Well, I think those messages being pr_warn are fine if one wants to inspect dmesg for signs of overheating and the platform is hitting some thermal limits. And if the time-based filter is not too accurate, that's fine too, I guess, as long as we don't flood dmesg.

What I don't like is the command line parameter and us putting the onus on the user to decide although we have all that info in the kernel already and we can do that decision automatically.

> I agree it is a good thing to look at. I'm not so sure we will find a good enough method that works all the way from tablet to server, so we might end up with "#define MAX_THERM_TIME 8000" ... but some study of options would either turn up a good heuristic, or provide evidence for why that is either hard, or no better than a constant.

Yeah, I still think a simple avg filter which starts from a sufficiently high value and improves it over time, should be good enough. Hell, even the trivial formula we use in the CMCI interrupt for polling might work, where we either double the interval or halve it, depending on recent history.

Thx.
diff --git a/arch/x86/kernel/cpu/mce/therm_throt.c b/arch/x86/kernel/cpu/mce/therm_throt.c
index 6e2becf547c5..b2e9d10bef44 100644
--- a/arch/x86/kernel/cpu/mce/therm_throt.c
+++ b/arch/x86/kernel/cpu/mce/therm_throt.c
@@ -47,8 +47,13 @@ struct _thermal_state {
 	bool	new_event;
 	int	event;
 	u64	next_check;
+	u64	last_interrupt_time;
+	struct timer_list timer;
 	unsigned long	count;
-	unsigned long	last_count;
+	unsigned long	max_time_ms;
+	int	rate_control_active;
+	int	cpu;
+	int	level;
 };
 
 struct thermal_state {
@@ -121,8 +126,15 @@ define_therm_throt_device_one_ro(package_throttle_count);
 define_therm_throt_device_show_func(package_power_limit, count);
 define_therm_throt_device_one_ro(package_power_limit_count);
 
+define_therm_throt_device_show_func(core_throttle, max_time_ms);
+define_therm_throt_device_one_ro(core_throttle_max_time_ms);
+
+define_therm_throt_device_show_func(package_throttle, max_time_ms);
+define_therm_throt_device_one_ro(package_throttle_max_time_ms);
+
 static struct attribute *thermal_throttle_attrs[] = {
 	&dev_attr_core_throttle_count.attr,
+	&dev_attr_core_throttle_max_time_ms.attr,
 	NULL
 };
 
@@ -135,6 +147,19 @@ static const struct attribute_group thermal_attr_group = {
 #define CORE_LEVEL	0
 #define PACKAGE_LEVEL	1
 
+#define THERM_THROT_WARN_INTERVAL_MS	8000
+static unsigned int thermal_warn_interval = THERM_THROT_WARN_INTERVAL_MS;
+
+static void therm_throt_active_timer_fn(struct timer_list *t)
+{
+	struct _thermal_state *state = from_timer(state, t, timer);
+
+	pr_crit("CPU%d: %s temperature is above threshold, cpu clock is throttled from last %d milli seconds (total events = %lu)\n",
+		state->cpu,
+		state->level == CORE_LEVEL ? "Core" : "Package",
+		thermal_warn_interval, state->count);
+}
+
 /***
  * therm_throt_process - Process thermal throttling event from interrupt
  * @curr: Whether the condition is current or not (boolean), since the
@@ -174,31 +199,42 @@ static void therm_throt_process(bool new_event, int event, int level)
 
 	old_event = state->new_event;
 	state->new_event = new_event;
+	state->level = level;
 
 	if (new_event)
 		state->count++;
 
 	if (time_before64(now, state->next_check) &&
-			state->count != state->last_count)
+	    state->rate_control_active)
 		return;
 
+	state->rate_control_active = 0;
+
 	state->next_check = now + CHECK_INTERVAL;
-	state->last_count = state->count;
 
-	/* if we just entered the thermal event */
-	if (new_event) {
-		if (event == THERMAL_THROTTLING_EVENT)
-			pr_crit("CPU%d: %s temperature above threshold, cpu clock throttled (total events = %lu)\n",
-				this_cpu,
-				level == CORE_LEVEL ? "Core" : "Package",
-				state->count);
-		return;
-	}
-	if (old_event) {
-		if (event == THERMAL_THROTTLING_EVENT)
-			pr_info("CPU%d: %s temperature/speed normal\n", this_cpu,
-				level == CORE_LEVEL ? "Core" : "Package");
-		return;
+	if (event == THERMAL_THROTTLING_EVENT) {
+		if (new_event && !state->last_interrupt_time) {
+			state->last_interrupt_time = now;
+			if (!timer_pending(&state->timer))
+				mod_timer(&state->timer,
+					  (now + msecs_to_jiffies(thermal_warn_interval)));
+		} else if (old_event && state->last_interrupt_time) {
+			unsigned long throttle_time;
+			int ret;
+
+			ret = del_timer(&state->timer);
+			throttle_time = jiffies_delta_to_msecs(now - state->last_interrupt_time);
+			if (!ret) {
+				pr_crit("CPU%d: %s temperature/speed normal (total events = %lu, throttled time: %lu milli seconds)\n",
+					state->cpu,
+					state->level == CORE_LEVEL ? "Core" : "Package",
+					state->count, throttle_time);
+				state->rate_control_active = 1;
+			}
+			if (throttle_time > state->max_time_ms)
+				state->max_time_ms = throttle_time;
+			state->last_interrupt_time = 0;
+		}
 	}
 }
 
@@ -252,6 +288,9 @@ static int thermal_throttle_add_dev(struct device *dev, unsigned int cpu)
 		err = sysfs_add_file_to_group(&dev->kobj,
 					      &dev_attr_package_throttle_count.attr,
 					      thermal_attr_group.name);
+		err = sysfs_add_file_to_group(&dev->kobj,
+					      &dev_attr_package_throttle_max_time_ms.attr,
+					      thermal_attr_group.name);
 		if (cpu_has(c, X86_FEATURE_PLN) && int_pln_enable)
 			err = sysfs_add_file_to_group(&dev->kobj,
 					&dev_attr_package_power_limit_count.attr,
@@ -269,15 +308,28 @@ static void thermal_throttle_remove_dev(struct device *dev)
 /* Get notified when a cpu comes on/off. Be hotplug friendly. */
 static int thermal_throttle_online(unsigned int cpu)
 {
+	struct thermal_state *state = &per_cpu(thermal_state, cpu);
 	struct device *dev = get_cpu_device(cpu);
 
+	state->package_throttle.cpu = cpu;
+	state->core_throttle.cpu = cpu;
+
+	timer_setup(&state->package_throttle.timer, therm_throt_active_timer_fn, 0);
+	timer_setup(&state->core_throttle.timer, therm_throt_active_timer_fn, 0);
+
 	return thermal_throttle_add_dev(dev, cpu);
 }
 
 static int thermal_throttle_offline(unsigned int cpu)
 {
+	struct thermal_state *state = &per_cpu(thermal_state, cpu);
 	struct device *dev = get_cpu_device(cpu);
 
+	del_timer(&state->package_throttle.timer);
+	del_timer(&state->core_throttle.timer);
+	state->package_throttle.last_interrupt_time = 0;
+	state->core_throttle.last_interrupt_time = 0;
+
 	thermal_throttle_remove_dev(dev);
 	return 0;
 }
@@ -522,3 +574,11 @@ void intel_init_thermal(struct cpuinfo_x86 *c)
 	/* enable thermal throttle processing */
 	atomic_set(&therm_throt_en, 1);
 }
+
+static int __init therm_warn_delay(char *str)
+{
+	get_option(&str, &thermal_warn_interval);
+
+	return 0;
+}
+early_param("x86_therm_warn_delay", therm_warn_delay);
Some modern systems have very tight thermal tolerances. Because of this they may cross thermal thresholds when running normal workloads (even during boot). The CPU hardware will react by limiting power/frequency and using duty cycles to bring the temperature back into normal range.

Thus users may see a "critical" message about the "temperature above threshold" which is soon followed by "temperature/speed normal". These messages are rate limited, but still may repeat every few minutes.

The solution here is to set a timeout when the temperature first exceeds the threshold. If the CPU returns to normal before the timeout fires, we skip printing any messages. If we reach the timeout, then there may be a real thermal issue (e.g. inoperative or blocked fan) and we print the message (together with a count of how many thermal events have occurred). A rate control method is used to avoid printing repeatedly on these broken systems.

Some experimentation with fans enabled showed that temperature returned to normal on a laptop in ~4 seconds. With fans disabled it took over 10 seconds. Default timeout is thus set to 8 seconds, but may be changed with kernel boot parameter: "x86_therm_warn_delay". This default interval is twice the typical sampling interval for cooling using the running average power limit from user space thermal control software.

In addition a new sysfs attribute is added to show the maximum amount of time in milliseconds the system was in the throttled state. This will allow changing x86_therm_warn_delay, if required.

Suggested-by: Alan Cox <alan@linux.intel.com>
Commit-comment-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
 arch/x86/kernel/cpu/mce/therm_throt.c | 94 ++++++++++++++++++++++-----
 1 file changed, 77 insertions(+), 17 deletions(-)