
x86, mce, therm_throt: Optimize notifications of thermal throttle

Message ID 20191111214312.81365-1-srinivas.pandruvada@linux.intel.com (mailing list archive)
State New, archived

Commit Message

Srinivas Pandruvada Nov. 11, 2019, 9:43 p.m. UTC
Some modern systems have very tight thermal tolerances. Because of this
they may cross thermal thresholds when running normal workloads (even
during boot). The CPU hardware will react by limiting power/frequency
and using duty cycles to bring the temperature back into normal range.

Thus users may see a "critical" message about the "temperature above
threshold" which is soon followed by "temperature/speed normal". These
messages are rate-limited, but still may repeat every few minutes.

This issue became worse starting with the Ivy Bridge generation of
CPUs because they include a TCC activation offset in the MSR
IA32_TEMPERATURE_TARGET. OEMs use this to provide alerts long before
critical temperatures are reached.

A two-hour test run of a workload on a laptop with an 8th Gen Intel Core
i5 CPU resulted in 20K+ thermal interrupts per CPU at core level and
another 20K+ interrupts at package level. The kernel logs were full of
throttling messages.

The real value of these threshold interrupts is in debugging problems with
the external cooling solution and performance issues due to excessive
throttling.

So the solution here is the following:
- In the existing thermal_throttle sysfs folder, show the maximum time for
a single throttling event and the total amount of time the system was in
the throttling state.
- Do not log short excursions.
- Log only when the temperature keeps rising in spite of thermal throttling.
On the high threshold interrupt, schedule delayed work that monitors the
threshold violation log bit (THERM_STATUS_PROCHOT_LOG). While the log bit
is set, the work callback calculates a three-point moving average and logs
a warning message when the temperature trend is rising. When the log bit
is clear and the temperature is below the threshold temperature, the
callback logs the "Normal" message. Once a high threshold event is
logged, the logging is rate-limited.
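
For illustration only, the trend check described above can be modelled in
plain userspace C. This is a minimal sketch, not the kernel code:
struct trend_state and trend_add_sample() are hypothetical names, and the
saturating sample counter is a choice made for the sketch. Readouts are
offsets below the maximum temperature, so a falling average means a rising
temperature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NSAMPLES 3	/* three-point moving average, as in the description */

/* Hypothetical userspace stand-in for the per-level state in the patch. */
struct trend_state {
	uint8_t samples[NSAMPLES];	/* ring buffer of readouts (degrees below max) */
	uint8_t index;			/* slot for the next sample */
	uint8_t count;			/* samples stored so far, saturates at NSAMPLES */
	uint8_t average;		/* previous moving average, 0 if none yet */
};

/*
 * Store one readout and return true when the moving average shows the
 * temperature rising. A smaller average means a hotter CPU, so "rising"
 * is "previous average greater than new average".
 */
static bool trend_add_sample(struct trend_state *s, uint8_t readout)
{
	unsigned int sum = 0;
	uint8_t avg;
	bool rising;
	size_t i;

	assert(s != NULL);

	s->samples[s->index] = readout;
	s->index = (s->index + 1) % NSAMPLES;
	if (s->count < NSAMPLES)
		s->count++;
	if (s->count < NSAMPLES)
		return false;	/* window not full yet, no trend available */

	for (i = 0; i < NSAMPLES; i++)
		sum += s->samples[i];
	avg = (uint8_t)(sum / NSAMPLES);

	rising = s->average > avg;
	s->average = avg;
	return rising;
}
```

With a zero-initialized state, the first full window never reports a rising
trend because the previous average is still 0; a warning can only follow
once a second window is hotter than the one before it.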

With this patch, on the same test laptop, no warnings are printed in the
logs, as the longest the processor took to bring the temperature under
control was only 280 ms.
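
The maximum and total throttle times mentioned above come from simple
per-episode bookkeeping. The following userspace sketch models that
bookkeeping under stated assumptions: struct throttle_stats,
throttle_begin() and throttle_end() are hypothetical names, and timestamps
are plain milliseconds rather than jiffies.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model of the per-level counters. */
struct throttle_stats {
	uint64_t start_ms;	/* time of the high threshold event, 0 when idle */
	unsigned long max_ms;	/* longest single throttling episode */
	unsigned long total_ms;	/* cumulative time spent throttled */
};

/* High threshold crossed: remember when the episode started. */
static void throttle_begin(struct throttle_stats *st, uint64_t now_ms)
{
	assert(now_ms != 0);	/* 0 is reserved to mean "no episode" */
	if (!st->start_ms)
		st->start_ms = now_ms;
}

/* Temperature back below the threshold: account for the episode. */
static void throttle_end(struct throttle_stats *st, uint64_t now_ms)
{
	unsigned long elapsed;

	if (!st->start_ms)
		return;		/* no episode in progress */

	elapsed = (unsigned long)(now_ms - st->start_ms);
	if (elapsed > st->max_ms)
		st->max_ms = elapsed;
	st->total_ms += elapsed;
	st->start_ms = 0;
}
```

An episode that starts at 1000 ms and ends at 1280 ms leaves both counters
at 280; a later, shorter episode only increases the total.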

This implementation is based on input from Alan Cox and Tony Luck.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---

Version History:
v1:
Changes compared to RFC PATCH:
	Addressed comments from Boris
	Rebased to tip/master
	Added kernel doc for struct
	Edits to the commit description
	Optimize storage for _thermal_state
	Ignore invalid sample for threshold high

 arch/x86/kernel/cpu/mce/therm_throt.c | 251 +++++++++++++++++++++++---
 1 file changed, 227 insertions(+), 24 deletions(-)

Comments

Chris Wilson Feb. 8, 2020, 10:24 p.m. UTC | #1
Quoting Srinivas Pandruvada (2019-11-11 21:43:12)
> +static void throttle_active_work(struct work_struct *work)
> +{
> +       struct _thermal_state *state = container_of(to_delayed_work(work),
> +                                               struct _thermal_state, therm_work);
> +       unsigned int i, avg, this_cpu = smp_processor_id();
> +       u64 now = get_jiffies_64();
> +       bool hot;
> +       u8 temp;

<6> [198.901895] [IGT] perf_pmu: starting subtest cpu-hotplug
<4> [199.088851] IRQ 24: no longer affine to CPU0
<4> [199.088871] IRQ 25: no longer affine to CPU0
<6> [199.091679] smpboot: CPU 0 is now offline
<6> [200.122204] smpboot: Booting Node 0 Processor 0 APIC 0x0
<6> [200.297267] smpboot: CPU 1 is now offline
<3> [201.218812] BUG: using smp_processor_id() in preemptible [00000000] code: kworker/1:0/17
<4> [201.218974] caller is throttle_active_work+0x12/0x280
<4> [201.218985] CPU: 0 PID: 17 Comm: kworker/1:0 Tainted: G     U            5.5.0-CI-CI_DRM_7867+ #1
<4> [201.218991] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS V1.12 02/15/2016
<4> [201.219001] Workqueue: events throttle_active_work
<4> [201.219009] Call Trace:
<4> [201.219021]  dump_stack+0x71/0x9b
<4> [201.219035]  debug_smp_processor_id+0xad/0xb0
<4> [201.219047]  throttle_active_work+0x12/0x280
<4> [201.219063]  process_one_work+0x26a/0x620
<4> [201.219087]  worker_thread+0x37/0x380
<4> [201.219103]  ? process_one_work+0x620/0x620
<4> [201.219110]  kthread+0x119/0x130
<4> [201.219119]  ? kthread_park+0x80/0x80
<4> [201.219134]  ret_from_fork+0x3a/0x50
<6> [201.315866] x86: Booting SMP configuration:
<6> [201.315880] smpboot: Booting Node 0 Processor 1 APIC 0x2
<4> [201.319814] ------------[ cut here ]------------
<3> [201.319832] ODEBUG: init active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x10
<4> [201.319971] WARNING: CPU: 1 PID: 14 at lib/debugobjects.c:484 debug_print_object+0x67/0x90
<4> [201.319977] Modules linked in: vgem snd_hda_codec_hdmi i915 mei_hdcp x86_pkg_temp_thermal coretemp snd_hda_codec_realtek crct10dif_pclmul snd_hda_codec_generic crc32_pclmul snd_hda_intel snd_intel_dspcfg snd_hda_codec ghash_clmulni_intel snd_hwdep snd_hda_core snd_pcm mei_me r8169 mei realtek lpc_ich prime_numbers
<4> [201.320023] CPU: 1 PID: 14 Comm: cpuhp/1 Tainted: G     U            5.5.0-CI-CI_DRM_7867+ #1
<4> [201.320029] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS V1.12 02/15/2016
<4> [201.320038] RIP: 0010:debug_print_object+0x67/0x90
<4> [201.320046] Code: 83 c2 01 8b 4b 14 4c 8b 45 00 89 15 17 f7 8b 02 8b 53 10 4c 89 e6 48 c7 c7 b0 ce 31 82 48 8b 14 d5 00 37 07 82 e8 89 7b b8 ff <0f> 0b 5b 83 05 33 fb 21 01 01 5d 41 5c c3 83 05 28 fb 21 01 01 c3
<4> [201.320053] RSP: 0000:ffffc900000dbd40 EFLAGS: 00010286
<4> [201.320060] RAX: 0000000000000000 RBX: ffff888408665d68 RCX: 0000000000000001
<4> [201.320066] RDX: 0000000080000001 RSI: ffff88840d6e30f8 RDI: 00000000ffffffff
<4> [201.320072] RBP: ffffffff826489e0 R08: ffff88840d6e30f8 R09: 0000000000000000
<4> [201.320078] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff822d7bd1
<4> [201.320084] R13: ffffffff826489e0 R14: ffff88840f898300 R15: 0000000000000202
<4> [201.320091] FS:  0000000000000000(0000) GS:ffff88840f880000(0000) knlGS:0000000000000000
<4> [201.320098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [201.320104] CR2: 0000000000000000 CR3: 0000000005610001 CR4: 00000000001606e0
<4> [201.320109] Call Trace:
<4> [201.320125]  __debug_object_init+0x359/0x510
<4> [201.320140]  ? _raw_spin_unlock_irqrestore+0x34/0x60
<4> [201.320156]  ? queue_work_node+0x70/0x70
<4> [201.320165]  init_timer_key+0x25/0x140
<4> [201.320180]  ? intel_thermal_supported+0x30/0x30
<4> [201.320191]  thermal_throttle_online+0xb4/0x260
<4> [201.320204]  ? unexpected_thermal_interrupt+0x20/0x20
<4> [201.320213]  cpuhp_invoke_callback+0x9b/0x9d0
<4> [201.320235]  cpuhp_thread_fun+0x1c8/0x220
<4> [201.320249]  ? smpboot_thread_fn+0x23/0x280
<4> [201.320259]  ? smpboot_thread_fn+0x6b/0x280
<4> [201.320271]  smpboot_thread_fn+0x1d3/0x280
<4> [201.320288]  ? sort_range+0x20/0x20
<4> [201.320295]  kthread+0x119/0x130
<4> [201.320303]  ? kthread_park+0x80/0x80
<4> [201.320317]  ret_from_fork+0x3a/0x50
<4> [201.320348] irq event stamp: 4846
<4> [201.320358] hardirqs last  enabled at (4845): [<ffffffff8112dcca>] console_unlock+0x4ba/0x5a0
<4> [201.320368] hardirqs last disabled at (4846): [<ffffffff81001ca0>] trace_hardirqs_off_thunk+0x1a/0x1c
<4> [201.320379] softirqs last  enabled at (4746): [<ffffffff81e00385>] __do_softirq+0x385/0x47f
<4> [201.320388] softirqs last disabled at (4739): [<ffffffff810ba15a>] irq_exit+0xba/0xc0
<4> [201.320394] ---[ end trace 06576bf31ad2ac2b ]---

Are we otherwise relying on current->nr_cpus_allowed == 1 here?
(As this section is not within a preempt_disable or local_irq_disable
region.)
-Chris
Srinivas Pandruvada Feb. 9, 2020, 6:09 a.m. UTC | #2
On Sat, 2020-02-08 at 22:24 +0000, Chris Wilson wrote:
> Quoting Srinivas Pandruvada (2019-11-11 21:43:12)
> > +static void throttle_active_work(struct work_struct *work)
> > +{
> > +       struct _thermal_state *state =
> > container_of(to_delayed_work(work),
> > +                                               struct
> > _thermal_state, therm_work);
> > +       unsigned int i, avg, this_cpu = smp_processor_id();
> > +       u64 now = get_jiffies_64();
> > +       bool hot;
> > +       u8 temp;
> 
> <6> [198.901895] [IGT] perf_pmu: starting subtest cpu-hotplug
> <4> [199.088851] IRQ 24: no longer affine to CPU0
> <4> [199.088871] IRQ 25: no longer affine to CPU0
> <6> [199.091679] smpboot: CPU 0 is now offline
> <6> [200.122204] smpboot: Booting Node 0 Processor 0 APIC 0x0
> <6> [200.297267] smpboot: CPU 1 is now offline
> <3> [201.218812] BUG: using smp_processor_id() in preemptible
> [00000000] code: kworker/1:0/17
> <4> [201.218974] caller is throttle_active_work+0x12/0x280
> <4> [201.218985] CPU: 0 PID: 17 Comm: kworker/1:0 Tainted:
> G     U            5.5.0-CI-CI_DRM_7867+ #1
> <4> [201.218991] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS
> V1.12 02/15/2016
> <4> [201.219001] Workqueue: events throttle_active_work
> <4> [201.219009] Call Trace:
> <4> [201.219021]  dump_stack+0x71/0x9b
> <4> [201.219035]  debug_smp_processor_id+0xad/0xb0
> <4> [201.219047]  throttle_active_work+0x12/0x280
> <4> [201.219063]  process_one_work+0x26a/0x620
> <4> [201.219087]  worker_thread+0x37/0x380
> <4> [201.219103]  ? process_one_work+0x620/0x620
> <4> [201.219110]  kthread+0x119/0x130
> <4> [201.219119]  ? kthread_park+0x80/0x80
> <4> [201.219134]  ret_from_fork+0x3a/0x50
> <6> [201.315866] x86: Booting SMP configuration:
> <6> [201.315880] smpboot: Booting Node 0 Processor 1 APIC 0x2
> <4> [201.319814] ------------[ cut here ]------------
> <3> [201.319832] ODEBUG: init active (active state 0) object type:
> timer_list hint: delayed_work_timer_fn+0x0/0x10
> <4> [201.319971] WARNING: CPU: 1 PID: 14 at lib/debugobjects.c:484
> debug_print_object+0x67/0x90
> <4> [201.319977] Modules linked in: vgem snd_hda_codec_hdmi i915
> mei_hdcp x86_pkg_temp_thermal coretemp snd_hda_codec_realtek
> crct10dif_pclmul snd_hda_codec_generic crc32_pclmul snd_hda_intel
> snd_intel_dspcfg snd_hda_codec ghash_clmulni_intel snd_hwdep
> snd_hda_core snd_pcm mei_me r8169 mei realtek lpc_ich prime_numbers
> <4> [201.320023] CPU: 1 PID: 14 Comm: cpuhp/1 Tainted:
> G     U            5.5.0-CI-CI_DRM_7867+ #1
> <4> [201.320029] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS
> V1.12 02/15/2016
> <4> [201.320038] RIP: 0010:debug_print_object+0x67/0x90
> <4> [201.320046] Code: 83 c2 01 8b 4b 14 4c 8b 45 00 89 15 17 f7 8b
> 02 8b 53 10 4c 89 e6 48 c7 c7 b0 ce 31 82 48 8b 14 d5 00 37 07 82 e8
> 89 7b b8 ff <0f> 0b 5b 83 05 33 fb 21 01 01 5d 41 5c c3 83 05 28 fb
> 21 01 01 c3
> <4> [201.320053] RSP: 0000:ffffc900000dbd40 EFLAGS: 00010286
> <4> [201.320060] RAX: 0000000000000000 RBX: ffff888408665d68 RCX:
> 0000000000000001
> <4> [201.320066] RDX: 0000000080000001 RSI: ffff88840d6e30f8 RDI:
> 00000000ffffffff
> <4> [201.320072] RBP: ffffffff826489e0 R08: ffff88840d6e30f8 R09:
> 0000000000000000
> <4> [201.320078] R10: 0000000000000000 R11: 0000000000000000 R12:
> ffffffff822d7bd1
> <4> [201.320084] R13: ffffffff826489e0 R14: ffff88840f898300 R15:
> 0000000000000202
> <4> [201.320091] FS:  0000000000000000(0000)
> GS:ffff88840f880000(0000) knlGS:0000000000000000
> <4> [201.320098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4> [201.320104] CR2: 0000000000000000 CR3: 0000000005610001 CR4:
> 00000000001606e0
> <4> [201.320109] Call Trace:
> <4> [201.320125]  __debug_object_init+0x359/0x510
> <4> [201.320140]  ? _raw_spin_unlock_irqrestore+0x34/0x60
> <4> [201.320156]  ? queue_work_node+0x70/0x70
> <4> [201.320165]  init_timer_key+0x25/0x140
> <4> [201.320180]  ? intel_thermal_supported+0x30/0x30
> <4> [201.320191]  thermal_throttle_online+0xb4/0x260
> <4> [201.320204]  ? unexpected_thermal_interrupt+0x20/0x20
> <4> [201.320213]  cpuhp_invoke_callback+0x9b/0x9d0
> <4> [201.320235]  cpuhp_thread_fun+0x1c8/0x220
> <4> [201.320249]  ? smpboot_thread_fn+0x23/0x280
> <4> [201.320259]  ? smpboot_thread_fn+0x6b/0x280
> <4> [201.320271]  smpboot_thread_fn+0x1d3/0x280
> <4> [201.320288]  ? sort_range+0x20/0x20
> <4> [201.320295]  kthread+0x119/0x130
> <4> [201.320303]  ? kthread_park+0x80/0x80
> <4> [201.320317]  ret_from_fork+0x3a/0x50
> <4> [201.320348] irq event stamp: 4846
> <4> [201.320358] hardirqs last  enabled at (4845):
> [<ffffffff8112dcca>] console_unlock+0x4ba/0x5a0
> <4> [201.320368] hardirqs last disabled at (4846):
> [<ffffffff81001ca0>] trace_hardirqs_off_thunk+0x1a/0x1c
> <4> [201.320379] softirqs last  enabled at (4746):
> [<ffffffff81e00385>] __do_softirq+0x385/0x47f
> <4> [201.320388] softirqs last disabled at (4739):
> [<ffffffff810ba15a>] irq_exit+0xba/0xc0
> <4> [201.320394] ---[ end trace 06576bf31ad2ac2b ]---
> 
> Are we otherwise relying on current->nr_cpus_allowed == 1 here?
No.
I am checking internally, if I can use raw_smp_processor_id() instead.

Thanks,
Srinivas

> (As this section is not within a preempt_disable or local_irq_disable
> region.)
> -Chris
Srinivas Pandruvada Feb. 10, 2020, 3:16 p.m. UTC | #3
On Sat, 2020-02-08 at 22:09 -0800, Srinivas Pandruvada wrote:
> On Sat, 2020-02-08 at 22:24 +0000, Chris Wilson wrote:
> > Quoting Srinivas Pandruvada (2019-11-11 21:43:12)
> > > +static void throttle_active_work(struct work_struct *work)
> > > +{
> > > +       struct _thermal_state *state =
> > > container_of(to_delayed_work(work),
> > > +                                               struct
> > > _thermal_state, therm_work);
> > > +       unsigned int i, avg, this_cpu = smp_processor_id();
> > > +       u64 now = get_jiffies_64();
> > > +       bool hot;
> > > +       u8 temp;
> > 
> > <6> [198.901895] [IGT] perf_pmu: starting subtest cpu-hotplug
> > <4> [199.088851] IRQ 24: no longer affine to CPU0
> > <4> [199.088871] IRQ 25: no longer affine to CPU0
> > <6> [199.091679] smpboot: CPU 0 is now offline
> > <6> [200.122204] smpboot: Booting Node 0 Processor 0 APIC 0x0
> > <6> [200.297267] smpboot: CPU 1 is now offline
> > <3> [201.218812] BUG: using smp_processor_id() in preemptible
> > [00000000] code: kworker/1:0/17
> > <4> [201.218974] caller is throttle_active_work+0x12/0x280
> > <4> [201.218985] CPU: 0 PID: 17 Comm: kworker/1:0 Tainted:
> > G     U            5.5.0-CI-CI_DRM_7867+ #1
> > <4> [201.218991] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS
> > V1.12 02/15/2016
> > <4> [201.219001] Workqueue: events throttle_active_work
> > <4> [201.219009] Call Trace:
> > <4> [201.219021]  dump_stack+0x71/0x9b
> > <4> [201.219035]  debug_smp_processor_id+0xad/0xb0
> > <4> [201.219047]  throttle_active_work+0x12/0x280
> > <4> [201.219063]  process_one_work+0x26a/0x620
> > <4> [201.219087]  worker_thread+0x37/0x380
> > <4> [201.219103]  ? process_one_work+0x620/0x620
> > <4> [201.219110]  kthread+0x119/0x130
> > <4> [201.219119]  ? kthread_park+0x80/0x80
> > <4> [201.219134]  ret_from_fork+0x3a/0x50
> > <6> [201.315866] x86: Booting SMP configuration:
> > <6> [201.315880] smpboot: Booting Node 0 Processor 1 APIC 0x2
> > <4> [201.319814] ------------[ cut here ]------------
> > <3> [201.319832] ODEBUG: init active (active state 0) object type:
> > timer_list hint: delayed_work_timer_fn+0x0/0x10
> > <4> [201.319971] WARNING: CPU: 1 PID: 14 at lib/debugobjects.c:484
> > debug_print_object+0x67/0x90
> > <4> [201.319977] Modules linked in: vgem snd_hda_codec_hdmi i915
> > mei_hdcp x86_pkg_temp_thermal coretemp snd_hda_codec_realtek
> > crct10dif_pclmul snd_hda_codec_generic crc32_pclmul snd_hda_intel
> > snd_intel_dspcfg snd_hda_codec ghash_clmulni_intel snd_hwdep
> > snd_hda_core snd_pcm mei_me r8169 mei realtek lpc_ich prime_numbers
> > <4> [201.320023] CPU: 1 PID: 14 Comm: cpuhp/1 Tainted:
> > G     U            5.5.0-CI-CI_DRM_7867+ #1
> > <4> [201.320029] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS
> > V1.12 02/15/2016
> > <4> [201.320038] RIP: 0010:debug_print_object+0x67/0x90
> > <4> [201.320046] Code: 83 c2 01 8b 4b 14 4c 8b 45 00 89 15 17 f7 8b
> > 02 8b 53 10 4c 89 e6 48 c7 c7 b0 ce 31 82 48 8b 14 d5 00 37 07 82
> > e8
> > 89 7b b8 ff <0f> 0b 5b 83 05 33 fb 21 01 01 5d 41 5c c3 83 05 28 fb
> > 21 01 01 c3
> > <4> [201.320053] RSP: 0000:ffffc900000dbd40 EFLAGS: 00010286
> > <4> [201.320060] RAX: 0000000000000000 RBX: ffff888408665d68 RCX:
> > 0000000000000001
> > <4> [201.320066] RDX: 0000000080000001 RSI: ffff88840d6e30f8 RDI:
> > 00000000ffffffff
> > <4> [201.320072] RBP: ffffffff826489e0 R08: ffff88840d6e30f8 R09:
> > 0000000000000000
> > <4> [201.320078] R10: 0000000000000000 R11: 0000000000000000 R12:
> > ffffffff822d7bd1
> > <4> [201.320084] R13: ffffffff826489e0 R14: ffff88840f898300 R15:
> > 0000000000000202
> > <4> [201.320091] FS:  0000000000000000(0000)
> > GS:ffff88840f880000(0000) knlGS:0000000000000000
> > <4> [201.320098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > <4> [201.320104] CR2: 0000000000000000 CR3: 0000000005610001 CR4:
> > 00000000001606e0
> > <4> [201.320109] Call Trace:
> > <4> [201.320125]  __debug_object_init+0x359/0x510
> > <4> [201.320140]  ? _raw_spin_unlock_irqrestore+0x34/0x60
> > <4> [201.320156]  ? queue_work_node+0x70/0x70
> > <4> [201.320165]  init_timer_key+0x25/0x140
> > <4> [201.320180]  ? intel_thermal_supported+0x30/0x30
> > <4> [201.320191]  thermal_throttle_online+0xb4/0x260
> > <4> [201.320204]  ? unexpected_thermal_interrupt+0x20/0x20
> > <4> [201.320213]  cpuhp_invoke_callback+0x9b/0x9d0
> > <4> [201.320235]  cpuhp_thread_fun+0x1c8/0x220
> > <4> [201.320249]  ? smpboot_thread_fn+0x23/0x280
> > <4> [201.320259]  ? smpboot_thread_fn+0x6b/0x280
> > <4> [201.320271]  smpboot_thread_fn+0x1d3/0x280
> > <4> [201.320288]  ? sort_range+0x20/0x20
> > <4> [201.320295]  kthread+0x119/0x130
> > <4> [201.320303]  ? kthread_park+0x80/0x80
> > <4> [201.320317]  ret_from_fork+0x3a/0x50
> > <4> [201.320348] irq event stamp: 4846
> > <4> [201.320358] hardirqs last  enabled at (4845):
> > [<ffffffff8112dcca>] console_unlock+0x4ba/0x5a0
> > <4> [201.320368] hardirqs last disabled at (4846):
> > [<ffffffff81001ca0>] trace_hardirqs_off_thunk+0x1a/0x1c
> > <4> [201.320379] softirqs last  enabled at (4746):
> > [<ffffffff81e00385>] __do_softirq+0x385/0x47f
> > <4> [201.320388] softirqs last disabled at (4739):
> > [<ffffffff810ba15a>] irq_exit+0xba/0xc0
> > <4> [201.320394] ---[ end trace 06576bf31ad2ac2b ]---
> > 
> > Are we otherwise relying on current->nr_cpus_allowed == 1 here?
> No.
> I am checking internally, if I can use raw_smp_processor_id()
> instead.
Let me correct my answer.
Here the call is from a workqueue callback that is scheduled to
execute on a specific CPU using schedule_delayed_work_on().
If that CPU goes offline or is dead in the meantime, I am not sure
whether the work can execute on another CPU.

Thanks,
Srinivas




> 
> Thanks,
> Srinivas
> 
> > (As this section is not within a preempt_disable or
> > local_irq_disable
> > region.)
> > -Chris

Patch

diff --git a/arch/x86/kernel/cpu/mce/therm_throt.c b/arch/x86/kernel/cpu/mce/therm_throt.c
index bc441d68d060..c102fcff23f3 100644
--- a/arch/x86/kernel/cpu/mce/therm_throt.c
+++ b/arch/x86/kernel/cpu/mce/therm_throt.c
@@ -40,15 +40,58 @@ 
 #define THERMAL_THROTTLING_EVENT	0
 #define POWER_LIMIT_EVENT		1
 
-/*
- * Current thermal event state:
+/**
+ * struct _thermal_state - Represent the current thermal event state
+ * @next_check:			Stores the next timestamp, when it is allowed
+ *				to log the next warning message.
+ * @last_interrupt_time:	Stores the timestamp for the last threshold
+ *				high event.
+ * @therm_work:			Delayed workqueue structure
+ * @count:			Stores the current running count for thermal
+ *				or power threshold interrupts.
+ * @last_count:			Stores the previous running count for thermal
+ *				or power threshold interrupts.
+ * @max_time_ms:		This shows the maximum amount of time CPU was
+ *				in throttled state for a single thermal
+ *				threshold high to low state.
+ * @total_time_ms:		This is a cumulative time during which CPU was
+ *				in the throttled state.
+ * @rate_control_active:	Flag to set, when throttling message is logged.
+ *				This is used for the purpose of rate-control.
+ * @new_event:			Stores the last high/low status of the
+ *				THERM_STATUS_PROCHOT or
+ *				THERM_STATUS_POWER_LIMIT.
+ * @level:			Stores whether this _thermal_state instance is
+ *				for a CORE level or for PACKAGE level.
+ * @sample_index:		Index for storage for the next sample in the
+ *				buffer temp_samples[].
+ * @sample_count:		Total number of samples collected in the buffer
+ *				temp_samples[].
+ * @average:			The last moving average of temperature samples
+ * @baseline_temp:		Temperature at which thermal threshold high
+ *				interrupt was generated.
+ * @temp_samples:		Storage for temperature samples to calculate
+ *				moving average.
+ *
+ * This structure is used to represent data related to thermal state for a CPU.
+ * There is a separate storage for core and package level for each CPU.
  */
 struct _thermal_state {
-	bool			new_event;
-	int			event;
 	u64			next_check;
+	u64			last_interrupt_time;
+	struct delayed_work	therm_work;
 	unsigned long		count;
 	unsigned long		last_count;
+	unsigned long		max_time_ms;
+	unsigned long		total_time_ms;
+	bool			rate_control_active;
+	bool			new_event;
+	u8			level;
+	u8			sample_index;
+	u8			sample_count;
+	u8			average;
+	u8			baseline_temp;
+	u8			temp_samples[3];
 };
 
 struct thermal_state {
@@ -121,8 +164,22 @@  define_therm_throt_device_one_ro(package_throttle_count);
 define_therm_throt_device_show_func(package_power_limit, count);
 define_therm_throt_device_one_ro(package_power_limit_count);
 
+define_therm_throt_device_show_func(core_throttle, max_time_ms);
+define_therm_throt_device_one_ro(core_throttle_max_time_ms);
+
+define_therm_throt_device_show_func(package_throttle, max_time_ms);
+define_therm_throt_device_one_ro(package_throttle_max_time_ms);
+
+define_therm_throt_device_show_func(core_throttle, total_time_ms);
+define_therm_throt_device_one_ro(core_throttle_total_time_ms);
+
+define_therm_throt_device_show_func(package_throttle, total_time_ms);
+define_therm_throt_device_one_ro(package_throttle_total_time_ms);
+
 static struct attribute *thermal_throttle_attrs[] = {
 	&dev_attr_core_throttle_count.attr,
+	&dev_attr_core_throttle_max_time_ms.attr,
+	&dev_attr_core_throttle_total_time_ms.attr,
 	NULL
 };
 
@@ -135,6 +192,105 @@  static const struct attribute_group thermal_attr_group = {
 #define CORE_LEVEL	0
 #define PACKAGE_LEVEL	1
 
+#define THERM_THROT_POLL_INTERVAL	HZ
+#define THERM_STATUS_PROCHOT_LOG	BIT(1)
+
+static void clear_therm_status_log(int level)
+{
+	int msr;
+	u64 msr_val;
+
+	if (level == CORE_LEVEL)
+		msr = MSR_IA32_THERM_STATUS;
+	else
+		msr = MSR_IA32_PACKAGE_THERM_STATUS;
+
+	rdmsrl(msr, msr_val);
+	wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG);
+}
+
+static void get_therm_status(int level, bool *proc_hot, u8 *temp)
+{
+	int msr;
+	u64 msr_val;
+
+	if (level == CORE_LEVEL)
+		msr = MSR_IA32_THERM_STATUS;
+	else
+		msr = MSR_IA32_PACKAGE_THERM_STATUS;
+
+	rdmsrl(msr, msr_val);
+	if (msr_val & THERM_STATUS_PROCHOT_LOG)
+		*proc_hot = true;
+	else
+		*proc_hot = false;
+
+	*temp = (msr_val >> 16) & 0x7F;
+}
+
+static void throttle_active_work(struct work_struct *work)
+{
+	struct _thermal_state *state = container_of(to_delayed_work(work),
+						struct _thermal_state, therm_work);
+	unsigned int i, avg, this_cpu = smp_processor_id();
+	u64 now = get_jiffies_64();
+	bool hot;
+	u8 temp;
+
+	get_therm_status(state->level, &hot, &temp);
+	/* temperature value is offset from the max so lesser means hotter */
+	if (!hot && temp > state->baseline_temp) {
+		if (state->rate_control_active)
+			pr_info("CPU%d: %s temperature/speed normal (total events = %lu)\n",
+				this_cpu,
+				state->level == CORE_LEVEL ? "Core" : "Package",
+				state->count);
+
+		state->rate_control_active = false;
+		return;
+	}
+
+	if (time_before64(now, state->next_check) &&
+			  state->rate_control_active)
+		goto re_arm;
+
+	state->next_check = now + CHECK_INTERVAL;
+
+	if (state->count != state->last_count) {
+		/* There was one new thermal interrupt */
+		state->last_count = state->count;
+		state->average = 0;
+		state->sample_count = 0;
+		state->sample_index = 0;
+	}
+
+	state->temp_samples[state->sample_index] = temp;
+	state->sample_count++;
+	state->sample_index = (state->sample_index + 1) % ARRAY_SIZE(state->temp_samples);
+	if (state->sample_count < ARRAY_SIZE(state->temp_samples))
+		goto re_arm;
+
+	avg = 0;
+	for (i = 0; i < ARRAY_SIZE(state->temp_samples); ++i)
+		avg += state->temp_samples[i];
+
+	avg /= ARRAY_SIZE(state->temp_samples);
+
+	if (state->average > avg) {
+		pr_warn("CPU%d: %s temperature is above threshold, cpu clock is throttled (total events = %lu)\n",
+			this_cpu,
+			state->level == CORE_LEVEL ? "Core" : "Package",
+			state->count);
+		state->rate_control_active = true;
+	}
+
+	state->average = avg;
+
+re_arm:
+	clear_therm_status_log(state->level);
+	schedule_delayed_work_on(this_cpu, &state->therm_work, THERM_THROT_POLL_INTERVAL);
+}
+
 /***
  * therm_throt_process - Process thermal throttling event from interrupt
  * @curr: Whether the condition is current or not (boolean), since the
@@ -178,27 +334,33 @@  static void therm_throt_process(bool new_event, int event, int level)
 	if (new_event)
 		state->count++;
 
-	if (time_before64(now, state->next_check) &&
-			state->count != state->last_count)
+	if (event != THERMAL_THROTTLING_EVENT)
 		return;
 
-	state->next_check = now + CHECK_INTERVAL;
-	state->last_count = state->count;
+	if (new_event && !state->last_interrupt_time) {
+		bool hot;
+		u8 temp;
+
+		get_therm_status(state->level, &hot, &temp);
+		/*
+		 * Ignore short temperature spike as the system is not close
+		 * to PROCHOT. 10C offset is large enough to ignore. It is
+		 * already dropped from the high threshold temperature.
+		 */
+		if (temp > 10)
+			return;
 
-	/* if we just entered the thermal event */
-	if (new_event) {
-		if (event == THERMAL_THROTTLING_EVENT)
-			pr_warn("CPU%d: %s temperature above threshold, cpu clock throttled (total events = %lu)\n",
-				this_cpu,
-				level == CORE_LEVEL ? "Core" : "Package",
-				state->count);
-		return;
-	}
-	if (old_event) {
-		if (event == THERMAL_THROTTLING_EVENT)
-			pr_info("CPU%d: %s temperature/speed normal\n", this_cpu,
-				level == CORE_LEVEL ? "Core" : "Package");
-		return;
+		state->baseline_temp = temp;
+		state->last_interrupt_time = now;
+		schedule_delayed_work_on(this_cpu, &state->therm_work, THERM_THROT_POLL_INTERVAL);
+	} else if (old_event && state->last_interrupt_time) {
+		unsigned long throttle_time;
+
+		throttle_time = jiffies_delta_to_msecs(now - state->last_interrupt_time);
+		if (throttle_time > state->max_time_ms)
+			state->max_time_ms = throttle_time;
+		state->total_time_ms += throttle_time;
+		state->last_interrupt_time = 0;
 	}
 }
 
@@ -244,20 +406,47 @@  static int thermal_throttle_add_dev(struct device *dev, unsigned int cpu)
 	if (err)
 		return err;
 
-	if (cpu_has(c, X86_FEATURE_PLN) && int_pln_enable)
+	if (cpu_has(c, X86_FEATURE_PLN) && int_pln_enable) {
 		err = sysfs_add_file_to_group(&dev->kobj,
 					      &dev_attr_core_power_limit_count.attr,
 					      thermal_attr_group.name);
+		if (err)
+			goto del_group;
+	}
+
 	if (cpu_has(c, X86_FEATURE_PTS)) {
 		err = sysfs_add_file_to_group(&dev->kobj,
 					      &dev_attr_package_throttle_count.attr,
 					      thermal_attr_group.name);
-		if (cpu_has(c, X86_FEATURE_PLN) && int_pln_enable)
+		if (err)
+			goto del_group;
+
+		err = sysfs_add_file_to_group(&dev->kobj,
+					      &dev_attr_package_throttle_max_time_ms.attr,
+					      thermal_attr_group.name);
+		if (err)
+			goto del_group;
+
+		err = sysfs_add_file_to_group(&dev->kobj,
+					      &dev_attr_package_throttle_total_time_ms.attr,
+					      thermal_attr_group.name);
+		if (err)
+			goto del_group;
+
+		if (cpu_has(c, X86_FEATURE_PLN) && int_pln_enable) {
 			err = sysfs_add_file_to_group(&dev->kobj,
 					&dev_attr_package_power_limit_count.attr,
 					thermal_attr_group.name);
+			if (err)
+				goto del_group;
+		}
 	}
 
+	return 0;
+
+del_group:
+	sysfs_remove_group(&dev->kobj, &thermal_attr_group);
+
 	return err;
 }
 
@@ -269,15 +458,29 @@  static void thermal_throttle_remove_dev(struct device *dev)
 /* Get notified when a cpu comes on/off. Be hotplug friendly. */
 static int thermal_throttle_online(unsigned int cpu)
 {
+	struct thermal_state *state = &per_cpu(thermal_state, cpu);
 	struct device *dev = get_cpu_device(cpu);
 
+	state->package_throttle.level = PACKAGE_LEVEL;
+	state->core_throttle.level = CORE_LEVEL;
+
+	INIT_DELAYED_WORK(&state->package_throttle.therm_work, throttle_active_work);
+	INIT_DELAYED_WORK(&state->core_throttle.therm_work, throttle_active_work);
+
 	return thermal_throttle_add_dev(dev, cpu);
 }
 
 static int thermal_throttle_offline(unsigned int cpu)
 {
+	struct thermal_state *state = &per_cpu(thermal_state, cpu);
 	struct device *dev = get_cpu_device(cpu);
 
+	cancel_delayed_work(&state->package_throttle.therm_work);
+	cancel_delayed_work(&state->core_throttle.therm_work);
+
+	state->package_throttle.rate_control_active = false;
+	state->core_throttle.rate_control_active = false;
+
 	thermal_throttle_remove_dev(dev);
 	return 0;
 }
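
As a reading aid for get_therm_status() and clear_therm_status_log() in
the patch above, the bit handling can be tried out in userspace on a plain
64-bit value. This is only a sketch: decode_therm_status(),
clear_prochot_log() and struct therm_sample are hypothetical names, no MSR
is read or written, and the bit positions are copied from the patch
(log bit 1, readout in bits 22:16).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PROCHOT_LOG_BIT	(UINT64_C(1) << 1)	/* mirrors THERM_STATUS_PROCHOT_LOG */
#define READOUT_SHIFT	16
#define READOUT_MASK	0x7F

/* Hypothetical decoded view of a thermal status value. */
struct therm_sample {
	bool prochot_logged;	/* sticky log bit was set */
	uint8_t readout;	/* degrees below the maximum temperature */
};

/* Extract the log bit and the 7-bit readout from a raw status value. */
static struct therm_sample decode_therm_status(uint64_t msr_val)
{
	struct therm_sample s;

	s.prochot_logged = (msr_val & PROCHOT_LOG_BIT) != 0;
	s.readout = (uint8_t)((msr_val >> READOUT_SHIFT) & READOUT_MASK);
	assert(s.readout <= READOUT_MASK);
	return s;
}

/* Value that would be written back to clear only the log bit. */
static uint64_t clear_prochot_log(uint64_t msr_val)
{
	return msr_val & ~PROCHOT_LOG_BIT;
}
```

Because the readout is an offset from the maximum, a smaller value means a
hotter CPU, which is why the patch treats "temp > 10" as a short spike that
is still far from PROCHOT.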