Message ID | 20191009155424.249277-1-bberg@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Series | x86/mce: Lower throttling MCE messages to warnings |
Hi,

On 09-10-2019 17:54, Benjamin Berg wrote:
> On modern CPUs it is quite normal that the temperature limits are
> reached and the CPU is throttled. In fact, often the thermal design is
> not sufficient to cool the CPU at full load and limits can quickly be
> reached when a burst in load happens. This will even happen with
> technologies like RAPL limitting the long term power consumption of
> the package.
>
> So these messages do not usually indicate a hardware issue (e.g.
> insufficient cooling). Log them as warnings to avoid confusion about
> their severity.
>
> Signed-off-by: Benjamin Berg <bberg@redhat.com>
> Tested-by: Christian Kellner <ckellner@redhat.com>

Ah, yes lets please lower the log-prio of these messages:

Reviewed-by: Hans de Goede <hdegoede@redhat.com>

Regards,

Hans

> ---
>  arch/x86/kernel/cpu/mce/therm_throt.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/mce/therm_throt.c b/arch/x86/kernel/cpu/mce/therm_throt.c
> index 6e2becf547c5..bc441d68d060 100644
> --- a/arch/x86/kernel/cpu/mce/therm_throt.c
> +++ b/arch/x86/kernel/cpu/mce/therm_throt.c
> @@ -188,7 +188,7 @@ static void therm_throt_process(bool new_event, int event, int level)
>          /* if we just entered the thermal event */
>          if (new_event) {
>                  if (event == THERMAL_THROTTLING_EVENT)
> -                        pr_crit("CPU%d: %s temperature above threshold, cpu clock throttled (total events = %lu)\n",
> +                        pr_warn("CPU%d: %s temperature above threshold, cpu clock throttled (total events = %lu)\n",
>                                  this_cpu,
>                                  level == CORE_LEVEL ? "Core" : "Package",
>                                  state->count);
>
On Wed, Oct 09, 2019 at 05:54:24PM +0200, Benjamin Berg wrote:
> On modern CPUs it is quite normal that the temperature limits are
> reached and the CPU is throttled. In fact, often the thermal design is
> not sufficient to cool the CPU at full load and limits can quickly be
> reached when a burst in load happens. This will even happen with
> technologies like RAPL limitting the long term power consumption of
> the package.
>
> So these messages do not usually indicate a hardware issue (e.g.
> insufficient cooling). Log them as warnings to avoid confusion about
> their severity.
>
> Signed-off-by: Benjamin Berg <bberg@redhat.com>
> Tested-by: Christian Kellner <ckellner@redhat.com>
> ---
>  arch/x86/kernel/cpu/mce/therm_throt.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/mce/therm_throt.c b/arch/x86/kernel/cpu/mce/therm_throt.c
> index 6e2becf547c5..bc441d68d060 100644
> --- a/arch/x86/kernel/cpu/mce/therm_throt.c
> +++ b/arch/x86/kernel/cpu/mce/therm_throt.c
> @@ -188,7 +188,7 @@ static void therm_throt_process(bool new_event, int event, int level)
>          /* if we just entered the thermal event */
>          if (new_event) {
>                  if (event == THERMAL_THROTTLING_EVENT)
> -                        pr_crit("CPU%d: %s temperature above threshold, cpu clock throttled (total events = %lu)\n",
> +                        pr_warn("CPU%d: %s temperature above threshold, cpu clock throttled (total events = %lu)\n",
>                                  this_cpu,
>                                  level == CORE_LEVEL ? "Core" : "Package",
>                                  state->count);
> --

This has carried over since its very first addition in

commit 3867eb75b9279c7b0f6840d2ad9f27694ba6c4e4
Author: Dave Jones <davej@suse.de>
Date:   Tue Apr 2 20:02:27 2002 -0800

    [PATCH] x86 bluesmoke update.

    o Make MCE compile time optional (Paul Gortmaker)
    o P4 thermal trip monitoring. (Zwane Mwaikambo)
    o Non-fatal MCE logging. (Me)

It used to be KERN_EMERG back then, though.

And yes, this issue has come up in the past already so I think I'll take
it. I'll just give Intel folks a couple of days to object should there
be anything to object to.

Thx.
On Wed, 2019-10-09 at 19:56 +0200, Borislav Petkov wrote:
> On Wed, Oct 09, 2019 at 05:54:24PM +0200, Benjamin Berg wrote:
> > On modern CPUs it is quite normal that the temperature limits are
> > reached and the CPU is throttled. In fact, often the thermal design is
> > not sufficient to cool the CPU at full load and limits can quickly be
> > reached when a burst in load happens. This will even happen with
> > technologies like RAPL limitting the long term power consumption of
> > the package.
> >
> > So these messages do not usually indicate a hardware issue (e.g.
> > insufficient cooling). Log them as warnings to avoid confusion about
> > their severity.
[]
> > diff --git a/arch/x86/kernel/cpu/mce/therm_throt.c b/arch/x86/kernel/cpu/mce/therm_throt.c
[]
> > @@ -188,7 +188,7 @@ static void therm_throt_process(bool new_event, int event, int level)
> >          /* if we just entered the thermal event */
> >          if (new_event) {
> >                  if (event == THERMAL_THROTTLING_EVENT)
> > -                        pr_crit("CPU%d: %s temperature above threshold, cpu clock throttled (total events = %lu)\n",
> > +                        pr_warn("CPU%d: %s temperature above threshold, cpu clock throttled (total events = %lu)\n",
> >                                  this_cpu,
> >                                  level == CORE_LEVEL ? "Core" : "Package",
> >                                  state->count);
> > --
>
> This has carried over since its very first addition in
>
> commit 3867eb75b9279c7b0f6840d2ad9f27694ba6c4e4
> Author: Dave Jones <davej@suse.de>
> Date:   Tue Apr 2 20:02:27 2002 -0800
>
>     [PATCH] x86 bluesmoke update.
>
>     o Make MCE compile time optional (Paul Gortmaker)
>     o P4 thermal trip monitoring. (Zwane Mwaikambo)
>     o Non-fatal MCE logging. (Me)
>
>
> It used to be KERN_EMERG back then, though.
>
> And yes, this issue has come up in the past already so I think I'll take
> it. I'll just give Intel folks a couple of days to object should there
> be anything to object to.

Perhaps this should be

        pr_warn_ratelimited(...)

as the temperature changes can be relatively quick.
On Wed, Oct 09, 2019 at 11:05:37AM -0700, Joe Perches wrote:
> Perhaps this should be
>
>         pr_warn_ratelimited(...)
>
> as the temperature changes can be relatively quick.

There's already ratelimiting machinery a bit above in the same function.
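For readers following along: the kind of time-based rate limiting referred to here can be illustrated with a minimal user-space sketch. It is not the code from therm_throt.c. The struct, its field names, and the function throttle_event_should_log() are hypothetical, and the 300-second interval is assumed from the figure mentioned elsewhere in this thread. The idea is that every event is counted, but a message is logged at most once per interval.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed minimum spacing between two log messages, in seconds. */
#define CHECK_INTERVAL_SEC 300

/* Hypothetical per-CPU bookkeeping; the names are illustrative only. */
struct throttle_log_state {
	uint64_t next_check;  /* earliest time at which we may log again */
	unsigned long count;  /* total throttling events seen so far */
};

/*
 * Account for one throttling event at time `now` (in seconds) and report
 * whether a message should be logged for it. Events arriving before
 * next_check are counted but not logged.
 */
static bool throttle_event_should_log(struct throttle_log_state *s,
				      uint64_t now)
{
	assert(s != NULL);

	s->count++;

	if (now < s->next_check)
		return false;

	s->next_check = now + CHECK_INTERVAL_SEC;
	return true;
}
```

With a zero-initialised state, the first event is logged and later events within the next 300 seconds only bump the counter, which is why an extra pr_warn_ratelimited() would add little on top of such a check.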
On Wed, 2019-10-09 at 20:22 +0200, Borislav Petkov wrote:
> On Wed, Oct 09, 2019 at 11:05:37AM -0700, Joe Perches wrote:
> > Perhaps this should be
> >
> >         pr_warn_ratelimited(...)
> >
> > as the temperature changes can be relatively quick.
>
> There's already ratelimiting machinery a bit above in the same function.

right, thanks, nevermind...
Hi Benjamin,

On Wed, 2019-10-09 at 19:56 +0200, Borislav Petkov wrote:
> On Wed, Oct 09, 2019 at 05:54:24PM +0200, Benjamin Berg wrote:
> > On modern CPUs it is quite normal that the temperature limits are
> > reached and the CPU is throttled. In fact, often the thermal design
> > is
> > not sufficient to cool the CPU at full load and limits can quickly
> > be
> > reached when a burst in load happens. This will even happen with
> > technologies like RAPL limitting the long term power consumption of
> > the package.
> >
> > So these messages do not usually indicate a hardware issue (e.g.
> > insufficient cooling). Log them as warnings to avoid confusion
> > about
> > their severity.
> >

I have a patch to address this. Instead of avoiding any critical
warnings or wait for 300 seconds for next one, the warning is based on
how long the system is working on throttled condition. If for example
the fan broke, then the throttling is extended for a long time. Then we
better warn.
I am waiting for internal review, and hope to post by tomorrow.

Thanks
Srinivas

> > Signed-off-by: Benjamin Berg <bberg@redhat.com>
> > Tested-by: Christian Kellner <ckellner@redhat.com>
> > ---
> >  arch/x86/kernel/cpu/mce/therm_throt.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kernel/cpu/mce/therm_throt.c
> > b/arch/x86/kernel/cpu/mce/therm_throt.c
> > index 6e2becf547c5..bc441d68d060 100644
> > --- a/arch/x86/kernel/cpu/mce/therm_throt.c
> > +++ b/arch/x86/kernel/cpu/mce/therm_throt.c
> > @@ -188,7 +188,7 @@ static void therm_throt_process(bool new_event,
> > int event, int level)
> >          /* if we just entered the thermal event */
> >          if (new_event) {
> >                  if (event == THERMAL_THROTTLING_EVENT)
> > -                        pr_crit("CPU%d: %s temperature above threshold,
> > cpu clock throttled (total events = %lu)\n",
> > +                        pr_warn("CPU%d: %s temperature above threshold,
> > cpu clock throttled (total events = %lu)\n",
> >                                  this_cpu,
> >                                  level == CORE_LEVEL ? "Core" :
> > "Package",
> >                                  state->count);
> > --
>
> This has carried over since its very first addition in
>
> commit 3867eb75b9279c7b0f6840d2ad9f27694ba6c4e4
> Author: Dave Jones <davej@suse.de>
> Date:   Tue Apr 2 20:02:27 2002 -0800
>
>     [PATCH] x86 bluesmoke update.
>
>     o Make MCE compile time optional (Paul Gortmaker)
>     o P4 thermal trip monitoring. (Zwane Mwaikambo)
>     o Non-fatal MCE logging. (Me)
>
>
> It used to be KERN_EMERG back then, though.
>
> And yes, this issue has come up in the past already so I think I'll
> take
> it. I'll just give Intel folks a couple of days to object should
> there
> be anything to object to.
>
> Thx.
>
Hi Srinivas,

On Thu, 2019-10-10 at 14:08 -0700, Srinivas Pandruvada wrote:
> I have a patch to address this. Instead of avoiding any critical
> warnings or wait for 300 seconds for next one, the warning is based on
> how long the system is working on throttled condition. If for example
> the fan broke, then the throttling is extended for a long time. Then we
> better warn.
> I am waiting for internal review, and hope to post by tomorrow.

Nice! I agree that a heuristic seems better than the very simple
approach taken in this patch.

Thanks,
Benjamin

> Thanks
> Srinivas
>
> > > Signed-off-by: Benjamin Berg <bberg@redhat.com>
> > > Tested-by: Christian Kellner <ckellner@redhat.com>
> > > ---
> > >  arch/x86/kernel/cpu/mce/therm_throt.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/arch/x86/kernel/cpu/mce/therm_throt.c
> > > b/arch/x86/kernel/cpu/mce/therm_throt.c
> > > index 6e2becf547c5..bc441d68d060 100644
> > > --- a/arch/x86/kernel/cpu/mce/therm_throt.c
> > > +++ b/arch/x86/kernel/cpu/mce/therm_throt.c
> > > @@ -188,7 +188,7 @@ static void therm_throt_process(bool
> > > new_event,
> > > int event, int level)
> > >          /* if we just entered the thermal event */
> > >          if (new_event) {
> > >                  if (event == THERMAL_THROTTLING_EVENT)
> > > -                        pr_crit("CPU%d: %s temperature above threshold,
> > > cpu clock throttled (total events = %lu)\n",
> > > +                        pr_warn("CPU%d: %s temperature above threshold,
> > > cpu clock throttled (total events = %lu)\n",
> > >                                  this_cpu,
> > >                                  level == CORE_LEVEL ? "Core" :
> > > "Package",
> > >                                  state->count);
> > > --
> >
> > This has carried over since its very first addition in
> >
> > commit 3867eb75b9279c7b0f6840d2ad9f27694ba6c4e4
> > Author: Dave Jones <davej@suse.de>
> > Date:   Tue Apr 2 20:02:27 2002 -0800
> >
> >     [PATCH] x86 bluesmoke update.
> >
> >     o Make MCE compile time optional (Paul Gortmaker)
> >     o P4 thermal trip monitoring. (Zwane Mwaikambo)
> >     o Non-fatal MCE logging. (Me)
> >
> >
> > It used to be KERN_EMERG back then, though.
> >
> > And yes, this issue has come up in the past already so I think I'll
> > take
> > it. I'll just give Intel folks a couple of days to object should
> > there
> > be anything to object to.
> >
> > Thx.
> >
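To make the duration-based idea discussed above concrete, here is a rough user-space sketch. It is not Srinivas's patch, which is not shown in this thread. The struct, the function throttle_sample(), and the 20-second threshold are all hypothetical and chosen only for illustration. The sketch warns once when throttling has persisted past the threshold, and stays quiet for short bursts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical threshold: how long throttling must persist before warning. */
#define SUSTAINED_THROTTLE_SEC 20

/* Hypothetical state for one core or package. */
struct throttle_duration_state {
	bool throttled;  /* were we throttled at the previous sample? */
	bool warned;     /* already warned during this throttling episode? */
	uint64_t since;  /* time (seconds) the current episode started */
};

/*
 * Feed one periodic sample. Returns true exactly once per episode, at the
 * first sample where throttling has lasted SUSTAINED_THROTTLE_SEC or more.
 */
static bool throttle_sample(struct throttle_duration_state *s,
			    bool throttled_now, uint64_t now)
{
	assert(s != NULL);

	if (!throttled_now) {
		/* Episode over: reset so the next episode can warn again. */
		s->throttled = false;
		s->warned = false;
		return false;
	}

	if (!s->throttled) {
		/* New episode: remember when it started, do not warn yet. */
		s->throttled = true;
		s->since = now;
		return false;
	}

	if (!s->warned && now - s->since >= SUSTAINED_THROTTLE_SEC) {
		s->warned = true;
		return true;
	}

	return false;
}
```

Under these assumptions a short load burst that throttles for a few seconds never produces a message, while a condition such as a broken fan, where throttling does not clear, produces one warning per episode.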
diff --git a/arch/x86/kernel/cpu/mce/therm_throt.c b/arch/x86/kernel/cpu/mce/therm_throt.c
index 6e2becf547c5..bc441d68d060 100644
--- a/arch/x86/kernel/cpu/mce/therm_throt.c
+++ b/arch/x86/kernel/cpu/mce/therm_throt.c
@@ -188,7 +188,7 @@ static void therm_throt_process(bool new_event, int event, int level)
         /* if we just entered the thermal event */
         if (new_event) {
                 if (event == THERMAL_THROTTLING_EVENT)
-                        pr_crit("CPU%d: %s temperature above threshold, cpu clock throttled (total events = %lu)\n",
+                        pr_warn("CPU%d: %s temperature above threshold, cpu clock throttled (total events = %lu)\n",
                                 this_cpu,
                                 level == CORE_LEVEL ? "Core" : "Package",
                                 state->count);
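As a note on what the one-line change means in practice, the following user-space sketch models how printk severities compare. The numeric values mirror the kernel's LOGLEVEL_* constants (lower is more severe). The helper shown_on_console() is hypothetical, and the rule it encodes, that a message reaches the console only when its level is numerically below the console loglevel, is stated here as an assumption about typical kernel behaviour rather than something taken from this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Severity values as used by printk; a lower number is more severe. */
enum loglevel {
	LOGLEVEL_EMERG   = 0,  /* KERN_EMERG: the level used back in 2002 */
	LOGLEVEL_ALERT   = 1,
	LOGLEVEL_CRIT    = 2,  /* pr_crit(): level before this patch */
	LOGLEVEL_ERR     = 3,
	LOGLEVEL_WARNING = 4,  /* pr_warn(): level after this patch */
	LOGLEVEL_NOTICE  = 5,
	LOGLEVEL_INFO    = 6,
	LOGLEVEL_DEBUG   = 7
};

/*
 * Hypothetical helper: assume a message is printed on the console only if
 * its level is numerically lower than the configured console loglevel.
 */
static bool shown_on_console(enum loglevel msg_level, int console_loglevel)
{
	return (int)msg_level < console_loglevel;
}
```

Assuming a console loglevel of 4, as is common on systems booted with "quiet", the old pr_crit() message would still appear on the console while the new pr_warn() message would only go to the kernel log.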