diff mbox series

x86/mce/therm_throt incorrect THERM_STATUS_CLEAR_CORE_MASK?

Message ID CAK8P3a1mkHEjRJgJPsRy+kuN=48=JEDJAeR2z9n+O71qbJ8hSA@mail.gmail.com (mailing list archive)
State RFC, archived
Headers show
Series x86/mce/therm_throt incorrect THERM_STATUS_CLEAR_CORE_MASK? | expand

Commit Message

Arnd Bergmann June 2, 2022, 9:19 a.m. UTC
I have a Xeon W-2265 (family 6, model 85, stepping 7) that started
constantly spewing messages from the therm_throt driver after one
core overheated:

May 31 13:57:54 kernel: [15512.209474] unchecked MSR access error:
WRMSR to 0x19c (tried to write 0x0000000000002a80) at rIP:
0xffffffff9f67f974 (native_write_msr+0x4/0x20)
May 31 13:57:54 kernel: [15512.209486] Call Trace:
May 31 13:57:54 kernel: [15512.209488]  <TASK>
May 31 13:57:54 kernel: [15512.209489]  ? throttle_active_work+0xea/0x1f0
May 31 13:57:54 kernel: [15512.209498]  process_one_work+0x21d/0x3c0
May 31 13:57:54 kernel: [15512.209502]  worker_thread+0x4d/0x3f0
May 31 13:57:54 kernel: [15512.209505]  ? process_one_work+0x3c0/0x3c0
May 31 13:57:54 kernel: [15512.209508]  kthread+0x127/0x150
May 31 13:57:54 kernel: [15512.209510]  ? set_kthread_struct+0x40/0x40
May 31 13:57:54 kernel: [15512.209513]  ret_from_fork+0x1f/0x30
...
May 31 13:57:59 kernel: [15517.333445] CPU11: Core temperature is
above threshold, cpu clock is throttled (total events = 3)

I could not find CPU model specific documentation for this register,
but I see that in [1], the bits 13 through 15 are marked as reserved
in some cases but not others. Manually writing the value 0xa80
instead of 0x2a80 from user space makes the warnings stop, so
my guess is that this CPU does not support the 0x2000 bit:

$ sudo  wrmsr -p 11 0x19c 0xa80 ; dmesg
[177764.874555] msr: Write to unrecognized MSR 0x19c by wrmsr (pid: 142969).
[177764.874560] msr: See
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/about for
details.
[177765.371180] CPU11: Core temperature/speed normal (total events = 42)
[177765.371180] CPU23: Core temperature/speed normal (total events = 42)

I have not tried the patch below, but I think this would address it on my
system, while likely breaking other machines. Any ideas what the
correct fix is?

      Arnd

BIT(7) | BIT(9) | BIT(11))

 static void clear_therm_status_log(int level)

[1] https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.pdf

Comments

srinivas pandruvada June 2, 2022, 3:52 p.m. UTC | #1
On Thu, 2022-06-02 at 11:19 +0200, Arnd Bergmann wrote:
> I have a Xeon W-2265 (family 6, model 85, stepping 7) that started
> constantly spewing messages from the therm_throt driver after one
> core overheated:
> 
I think this is a Cascade Lake system. Have you tried the latest micro-
code?

Thanks,
Srinivas


> May 31 13:57:54 kernel: [15512.209474] unchecked MSR access error:
> WRMSR to 0x19c (tried to write 0x0000000000002a80) at rIP:
> 0xffffffff9f67f974 (native_write_msr+0x4/0x20)
> May 31 13:57:54 kernel: [15512.209486] Call Trace:
> May 31 13:57:54 kernel: [15512.209488]  <TASK>
> May 31 13:57:54 kernel: [15512.209489]  ?
> throttle_active_work+0xea/0x1f0
> May 31 13:57:54 kernel: [15512.209498]  process_one_work+0x21d/0x3c0
> May 31 13:57:54 kernel: [15512.209502]  worker_thread+0x4d/0x3f0
> May 31 13:57:54 kernel: [15512.209505]  ?
> process_one_work+0x3c0/0x3c0
> May 31 13:57:54 kernel: [15512.209508]  kthread+0x127/0x150
> May 31 13:57:54 kernel: [15512.209510]  ?
> set_kthread_struct+0x40/0x40
> May 31 13:57:54 kernel: [15512.209513]  ret_from_fork+0x1f/0x30
> ...
> May 31 13:57:59 kernel: [15517.333445] CPU11: Core temperature is
> above threshold, cpu clock is throttled (total events = 3)
> 
> I could not find CPU model specific documentation for this register,
> but I see that in [1], the bits 13 through 15 are marked as reserved
> in some cases but not others. Manually writing the value 0xa80
> instead of 0x2a80 from user space makes the warnings stop, so
> my guess is that this CPU does not support the 0x2000 bit:
> 
> $ sudo  wrmsr -p 11 0x19c 0xa80 ; dmesg
> [177764.874555] msr: Write to unrecognized MSR 0x19c by wrmsr (pid:
> 142969).
> [177764.874560] msr: See
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/about for
> details.
> [177765.371180] CPU11: Core temperature/speed normal (total events =
> 42)
> [177765.371180] CPU23: Core temperature/speed normal (total events =
> 42)
> 
> I have not tried the patch below, but I think this would address it
> on my
> system, while likely breaking other machines. Any ideas what the
> correct fix is?
> 
>       Arnd
> 
> diff --git a/drivers/thermal/intel/therm_throt.c
> b/drivers/thermal/intel/therm_throt.c
> index 8352083b87c7..620d7f4c013e 100644
> --- a/drivers/thermal/intel/therm_throt.c
> +++ b/drivers/thermal/intel/therm_throt.c
> @@ -196,7 +196,7 @@ static const struct attribute_group
> thermal_attr_group = {
>  #define THERM_THROT_POLL_INTERVAL      HZ
>  #define THERM_STATUS_PROCHOT_LOG       BIT(1)
> 
> -#define THERM_STATUS_CLEAR_CORE_MASK (BIT(1) | BIT(3) | BIT(5) |
> BIT(7) | BIT(9) | BIT(11) | BIT(13) | BIT(15))
> +#define THERM_STATUS_CLEAR_CORE_MASK (BIT(1) | BIT(3) | BIT(5) |
> BIT(7) | BIT(9) | BIT(11))
>  #define THERM_STATUS_CLEAR_PKG_MASK  (BIT(1) | BIT(3) | BIT(5) |
> BIT(7) | BIT(9) | BIT(11))
> 
>  static void clear_therm_status_log(int level)
> 
> [1]
> https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.pdf
Arnd Bergmann June 2, 2022, 4:18 p.m. UTC | #2
On Thu, Jun 2, 2022 at 5:52 PM srinivas pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
>
> On Thu, 2022-06-02 at 11:19 +0200, Arnd Bergmann wrote:
> > I have a Xeon W-2265 (family 6, model 85, stepping 7) that started
> > constantly spewing messages from the therm_throt driver after one
> > core overheated:
> >
> I think this is a Cascade Lake system. Have you tried the latest micro-
> code?

Thanks for your quick reply. I have installed the latest microcode 0x5003302
now (manually, because the version provided by the distro was still using
version 0x5003102).

After that, I tried writing the value 0x2a80 from userspace, and
that did not cause a trap, so I assume that fixed it.

It's hard to be sure, as the system has only run into the broken
state twice during its life, and now it's fine. I'll reply here if it
ever comes back with the new microcode.

Thanks a lot!

       Arnd
srinivas pandruvada June 2, 2022, 4:25 p.m. UTC | #3
On Thu, 2022-06-02 at 18:18 +0200, Arnd Bergmann wrote:
> On Thu, Jun 2, 2022 at 5:52 PM srinivas pandruvada
> <srinivas.pandruvada@linux.intel.com> wrote:
> > 
> > On Thu, 2022-06-02 at 11:19 +0200, Arnd Bergmann wrote:
> > > I have a Xeon W-2265 (family 6, model 85, stepping 7) that
> > > started
> > > constantly spewing messages from the therm_throt driver after one
> > > core overheated:
> > > 
> > I think this is a Cascade Lake system. Have you tried the latest
> > micro-
> > code?
> 
> Thanks for your quick reply. I have installed the latest microcode
> 0x5003302
> now (manually, because the version provided by the distro was still
> using
> version 0x5003102).
> 
> After that, I tried writing the value 0x2a80 from userspace, and
> that did not cause a trap, so I assume that fixed it.
> 
Thanks for reporting.
I am aware of this issue and should be fixed by microcode update.

Thanks,
Srinivas

> It's hard to be sure, as the system has only run into the broken
> state twice during its life, and now it's fine. I'll reply here if it
> ever comes back with the new microcode.
> 
> Thanks a lot!
> 
>        Arnd
Arnd Bergmann June 2, 2022, 6:53 p.m. UTC | #4
On Thu, Jun 2, 2022 at 6:25 PM srinivas pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
> On Thu, 2022-06-02 at 18:18 +0200, Arnd Bergmann wrote:
> > On Thu, Jun 2, 2022 at 5:52 PM srinivas pandruvada
> > <srinivas.pandruvada@linux.intel.com> wrote:
> > >
> > > On Thu, 2022-06-02 at 11:19 +0200, Arnd Bergmann wrote:
> > > > I have a Xeon W-2265 (family 6, model 85, stepping 7) that
> > > > started
> > > > constantly spewing messages from the therm_throt driver after one
> > > > core overheated:
> > > >
> > > I think this is a Cascade Lake system. Have you tried the latest
> > > micro-
> > > code?
> >
> > Thanks for your quick reply. I have installed the latest microcode
> > 0x5003302
> > now (manually, because the version provided by the distro was still
> > using
> > version 0x5003102).
> >
> > After that, I tried writing the value 0x2a80 from userspace, and
> > that did not cause a trap, so I assume that fixed it.
> >
> Thanks for reporting.
> I am aware of this issue and should be fixed by microcode update.

I wonder how common this problem it is. Would it help to add a driver workaround
like this?

diff --git a/drivers/thermal/intel/therm_throt.c
b/drivers/thermal/intel/therm_throt.c
index 8352083b87c7..acb402e56796 100644
--- a/drivers/thermal/intel/therm_throt.c
+++ b/drivers/thermal/intel/therm_throt.c
@@ -214,7 +214,13 @@ static void clear_therm_status_log(int level)

        rdmsrl(msr, msr_val);
        msr_val &= mask;
-       wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG);
+       if (wrmsrl_safe(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG)) {
+               /* work around Cascade Lake SKZ57 erratum */
+               printk_once(KERN_WARNING "Failed to update IA32_THERM_STATUS, "
+                                       "please upgrade microcode\n");
+               wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG &
+                       ~BIT(13) & ~BIT(15));
+       }
 }

 static void get_therm_status(int level, bool *proc_hot, u8 *temp)

        Arnd
srinivas pandruvada June 2, 2022, 8:10 p.m. UTC | #5
On Thu, 2022-06-02 at 20:53 +0200, Arnd Bergmann wrote:
> On Thu, Jun 2, 2022 at 6:25 PM srinivas pandruvada
> <srinivas.pandruvada@linux.intel.com> wrote:
> > On Thu, 2022-06-02 at 18:18 +0200, Arnd Bergmann wrote:
> > > On Thu, Jun 2, 2022 at 5:52 PM srinivas pandruvada
> > > <srinivas.pandruvada@linux.intel.com> wrote:
> > > > 
> > > > On Thu, 2022-06-02 at 11:19 +0200, Arnd Bergmann wrote:
> > > > > I have a Xeon W-2265 (family 6, model 85, stepping 7) that
> > > > > started
> > > > > constantly spewing messages from the therm_throt driver after
> > > > > one
> > > > > core overheated:
> > > > > 
> > > > I think this is a Cascade Lake system. Have you tried the
> > > > latest
> > > > micro-
> > > > code?
> > > 
> > > Thanks for your quick reply. I have installed the latest
> > > microcode
> > > 0x5003302
> > > now (manually, because the version provided by the distro was
> > > still
> > > using
> > > version 0x5003102).
> > > 
> > > After that, I tried writing the value 0x2a80 from userspace, and
> > > that did not cause a trap, so I assume that fixed it.
> > > 
> > Thanks for reporting.
> > I am aware of this issue and should be fixed by microcode update.
> 
> I wonder how common this problem it is. Would it help to add a driver
> workaround
> like this?
This issue affects only certain skews. The others already working as
expected. These are important log bits for debug, we don't want to
clear in this path. Printing warning for CLX stepping is fine without
clearing unrelated bits 13 and 15. 
Read-modify-update should always work where we only update the bits of
interest. Writing 1s to this register should be NOP.

Thanks,
Srinivas

> 
> diff --git a/drivers/thermal/intel/therm_throt.c
> b/drivers/thermal/intel/therm_throt.c
> index 8352083b87c7..acb402e56796 100644
> --- a/drivers/thermal/intel/therm_throt.c
> +++ b/drivers/thermal/intel/therm_throt.c
> @@ -214,7 +214,13 @@ static void clear_therm_status_log(int level)
> 
>         rdmsrl(msr, msr_val);
>         msr_val &= mask;
> -       wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG);
> +       if (wrmsrl_safe(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG)) {
> +               /* work around Cascade Lake SKZ57 erratum */
> +               printk_once(KERN_WARNING "Failed to update
> IA32_THERM_STATUS, "
> +                                       "please upgrade
> microcode\n");
> +               wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG &
> +                       ~BIT(13) & ~BIT(15));
> +       }
>  }
> 
>  static void get_therm_status(int level, bool *proc_hot, u8 *temp)
> 
>         Arnd
Arnd Bergmann June 2, 2022, 8:42 p.m. UTC | #6
On Thu, Jun 2, 2022 at 10:10 PM srinivas pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
> On Thu, 2022-06-02 at 20:53 +0200, Arnd Bergmann wrote:
> >
> > I wonder how common this problem it is. Would it help to add a driver
> > workaround
> > like this?
> This issue affects only certain skews. The others already working as
> expected. These are important log bits for debug, we don't want to
> clear in this path. Printing warning for CLX stepping is fine without
> clearing unrelated bits 13 and 15.
> Read-modify-update should always work where we only update the bits of
> interest. Writing 1s to this register should be NOP.

The patch I suggested doesn't change the behavior unless the initial
write causes an exception. As long as only buggy microcode rejects the
write, the second write just serves to clear the state that causes the
repeated stack dumps.

       Arnd

> > @@ -214,7 +214,13 @@ static void clear_therm_status_log(int level)
> >
> >         rdmsrl(msr, msr_val);
> >         msr_val &= mask;
> > -       wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG);
> > +       if (wrmsrl_safe(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG)) {
> > +               /* work around Cascade Lake SKZ57 erratum */
> > +               printk_once(KERN_WARNING "Failed to update IA32_THERM_STATUS, "
> > +                                       "please upgrade microcode\n");
> > +               wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG &
> > +                       ~BIT(13) & ~BIT(15));
> > +       }
> >  }
> >
srinivas pandruvada June 2, 2022, 9:13 p.m. UTC | #7
On Thu, 2022-06-02 at 22:42 +0200, Arnd Bergmann wrote:
> On Thu, Jun 2, 2022 at 10:10 PM srinivas pandruvada
> <srinivas.pandruvada@linux.intel.com> wrote:
> > On Thu, 2022-06-02 at 20:53 +0200, Arnd Bergmann wrote:
> > > 
> > > I wonder how common this problem it is. Would it help to add a
> > > driver
> > > workaround
> > > like this?
> > This issue affects only certain skews. The others already working
> > as
> > expected. These are important log bits for debug, we don't want to
> > clear in this path. Printing warning for CLX stepping is fine
> > without
> > clearing unrelated bits 13 and 15.
> > Read-modify-update should always work where we only update the bits
> > of
> > interest. Writing 1s to this register should be NOP.
> 
> The patch I suggested doesn't change the behavior unless the initial
> write causes an exception. As long as only buggy microcode rejects
> the
> write, the second write just serves to clear the state that causes
> the
> repeated stack dumps.
But it will clear BIT 13 and 15 in this case. So atleast print the
current msr value in the warning message so that we don't loose the BIT
13 and BIT 15 values, in case we need them for debug.

Thanks,
Srinivas

> 
>        Arnd
> 
> > > @@ -214,7 +214,13 @@ static void clear_therm_status_log(int
> > > level)
> > > 
> > >         rdmsrl(msr, msr_val);
> > >         msr_val &= mask;
> > > -       wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG);
> > > +       if (wrmsrl_safe(msr, msr_val &
> > > ~THERM_STATUS_PROCHOT_LOG)) {
> > > +               /* work around Cascade Lake SKZ57 erratum */
> > > +               printk_once(KERN_WARNING "Failed to update
> > > IA32_THERM_STATUS, "
> > > +                                       "please upgrade
> > > microcode\n");
> > > +               wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG &
> > > +                       ~BIT(13) & ~BIT(15));
> > > +       }
> > >  }
> > >
diff mbox series

Patch

diff --git a/drivers/thermal/intel/therm_throt.c
b/drivers/thermal/intel/therm_throt.c
index 8352083b87c7..620d7f4c013e 100644
--- a/drivers/thermal/intel/therm_throt.c
+++ b/drivers/thermal/intel/therm_throt.c
@@ -196,7 +196,7 @@  static const struct attribute_group thermal_attr_group = {
 #define THERM_THROT_POLL_INTERVAL      HZ
 #define THERM_STATUS_PROCHOT_LOG       BIT(1)

-#define THERM_STATUS_CLEAR_CORE_MASK (BIT(1) | BIT(3) | BIT(5) |
BIT(7) | BIT(9) | BIT(11) | BIT(13) | BIT(15))
+#define THERM_STATUS_CLEAR_CORE_MASK (BIT(1) | BIT(3) | BIT(5) |
BIT(7) | BIT(9) | BIT(11))
 #define THERM_STATUS_CLEAR_PKG_MASK  (BIT(1) | BIT(3) | BIT(5) |