[tip: ras/core] x86/mce/therm_throt: Optimize notifications of thermal throttle

The following commit has been merged into the ras/core branch of tip:

Commit-ID:     f6656208f04e5b3804054008eba4bf7170f4c841
Gitweb:        https://git.kernel.org/tip/f6656208f04e5b3804054008eba4bf7170f4c841
Author:        Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
AuthorDate:    Mon, 11 Nov 2019 13:43:12 -08:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Tue, 12 Nov 2019 15:56:04 +01:00

x86/mce/therm_throt: Optimize notifications of thermal throttle

Some modern systems have very tight thermal tolerances. Because of this
they may cross thermal thresholds when running normal workloads (even
during boot). The CPU hardware will react by limiting power/frequency
and using duty cycles to bring the temperature back into normal range.

Thus users may see a "critical" message about the "temperature above
threshold" which is soon followed by "temperature/speed normal". These
messages are rate-limited, but still may repeat every few minutes.

This issue became worse starting with the Ivy Bridge generation of
CPUs because they include a TCC activation offset in the MSR
IA32_TEMPERATURE_TARGET. OEMs use this to provide alerts long before
critical temperatures are reached.

A test run on a laptop with Intel 8th Gen i5 core for two hours with a
workload resulted in 20K+ thermal interrupts per CPU for core level and
another 20K+ interrupts at package level. The kernel logs were full of
throttling messages.

The real value of these threshold interrupts is to debug problems with
the external cooling solutions and performance issues due to excessive
throttling.

So the solution here is the following:

  - In the current thermal_throttle folder, show:
    - the maximum time for one throttling event and,
    - the total amount of time the system was in throttling state.

  - Do not log short excursions.

  - Log only when, in spite of thermal throttling, the temperature is rising.
  On the high threshold interrupt, trigger a delayed workqueue that
  monitors the threshold violation log bit (THERM_STATUS_PROCHOT_LOG). When
  the log bit is set, this workqueue callback calculates a three-point
  moving average and logs a warning message when the temperature trend is
  rising.

  When this log bit is clear and the temperature is below the threshold
  temperature, the workqueue callback logs a "Normal" message. Once a
  high threshold event is logged, the logging is rate-limited.
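
The trend check described above can be illustrated with a small standalone
sketch. This is not the code from the patch: the struct, the function name
and the use of plain integer temperatures (higher means hotter) are
assumptions made for illustration, and the driver itself may sample and
compare differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SAMPLES 3	/* three-point moving average, as in the text */

/* Hypothetical per-CPU state; all names are illustrative only. */
struct trend_state {
	int samples[NR_SAMPLES];	/* most recent temperature samples */
	int count;			/* samples collected so far (max 3) */
	int index;			/* next slot to overwrite */
	int last_avg;			/* moving average of previous window */
	bool have_avg;			/* last_avg holds a valid value */
};

/*
 * Record one sample and report whether the three-point moving average
 * rose compared with the previous window. Returns false until two full
 * windows have been seen, so that short excursions are not reported.
 */
static bool trend_is_rising(struct trend_state *s, int temp)
{
	int i, sum = 0, avg;
	bool rising;

	assert(s != NULL);

	s->samples[s->index] = temp;
	s->index = (s->index + 1) % NR_SAMPLES;
	if (s->count < NR_SAMPLES)
		s->count++;
	if (s->count < NR_SAMPLES)
		return false;		/* window not full yet */

	for (i = 0; i < NR_SAMPLES; i++)
		sum += s->samples[i];
	avg = sum / NR_SAMPLES;		/* integer average is enough here */

	rising = s->have_avg && avg > s->last_avg;
	s->last_avg = avg;
	s->have_avg = true;

	return rising;
}
```

A periodic callback would call trend_is_rising() once per sample, starting
from a zero-initialized struct trend_state, and would log a warning only
when it returns true. Comparing averages rather than raw samples keeps a
single noisy reading from triggering a message.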

With this patch on the same test laptop, no warnings are printed in the
logs, because the longest time the processor needed to bring the
temperature under control was only 280 ms.

This implementation was done with input from Alan Cox and Tony Luck.

 [ bp: Touchups. ]

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: bberg@redhat.com
Cc: ckellner@redhat.com
Cc: hdegoede@redhat.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191111214312.81365-1-srinivas.pandruvada@linux.intel.com
---
 arch/x86/kernel/cpu/mce/therm_throt.c | 251 ++++++++++++++++++++++---
 1 file changed, 227 insertions(+), 24 deletions(-)

Message ID	157357085376.29376.9191302193256787154.tip-bot2@tip-bot2 (mailing list archive)