diff mbox series

[RFC] mce: prevent concurrent polling of MCE events

Message ID Y8WtE7BNJ0gTrqIS@cathedrallabs.org (mailing list archive)
State New, archived
Series [RFC] mce: prevent concurrent polling of MCE events

Commit Message

Aristeu Rozanski Jan. 16, 2023, 8:01 p.m. UTC
I've considered creating an array so there'd be one lock per package but
that'd add excessive complexity for something that happens by default every
5 minutes. Thoughts?

---------

Error injection on modern HP machines with CMCI disabled causes the
injected MCE to be found only by polling. Because these newer machines have a
large number of CPUs per package, it becomes much more likely that multiple
CPUs will poll the IMC registers (which are shared within the same package) at
the same time, causing the same MCE to be reported multiple times.

Signed-off-by: Aristeu Rozanski <aris@ruivo.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: linux-edac@vger.kernel.org

Comments

Luck, Tony Jan. 17, 2023, 5:42 p.m. UTC | #1
> +static DEFINE_RAW_SPINLOCK(timer_fn_lock);
>
>  static unsigned long mce_adjust_timer_default(unsigned long interval)
>  {
> @@ -1628,7 +1629,9 @@ static void mce_timer_fn(struct timer_list *t)
>       iv = __this_cpu_read(mce_next_interval);
>
>       if (mce_available(this_cpu_ptr(&cpu_info))) {
> +             raw_spin_lock(&timer_fn_lock);
>               machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
> +             raw_spin_unlock(&timer_fn_lock);
>
>               if (mce_intel_cmci_poll()) {
>                       iv = mce_adjust_timer(iv);

If the CMCI polling interrupts on that large number of CPUs are
staggered at different times, then this should be fine. But if a largish
number of CPUs get in lockstep with each other, then this could get
ugly. I.e. 200 CPUs all take the "CMCI poll" interrupt together, then
get stuck in a convoy here as they step through checking the machine
check banks one at a time.

I'm not sure if that is just a theoretical issue, or how bad it might
actually be if it happened.

One option to avoid this might be to change from a fixed five minute
polling interval to "five minute plus/minus rand(50) jiffies". Then even
if some CPUs did sync up, they'd quickly diverge again.

-Tony
Aristeu Rozanski Jan. 17, 2023, 6:44 p.m. UTC | #2
On Tue, Jan 17, 2023 at 05:42:36PM +0000, Luck, Tony wrote:
> >
> >               if (mce_intel_cmci_poll()) {
> >                       iv = mce_adjust_timer(iv);
> 
> If the CMCI polling interrupts on that large number of CPUs are
> staggered at different times, then this should be fine. But if a largish
> number of CPUs get in lockstep with each other, then this could get
> ugly. I.e. 200 CPUs all take the "CMCI poll" interrupt together, then
> get stuck in a convoy here as they step through checking the machine
> check banks one at a time.

Yes, but we could change the patch to include mce_available() under the
protection of the lock. It should be cleared once machine_check_poll() clears
the bank state, no?

> One option to avoid this might be to change from a fixed five minute
> polling interval to "five minute plus/minus rand(50) jiffies". Then even
> if some CPUs did sync up, they'd quickly diverge again.

That'd be fine for me as well. Perhaps initialize the polling interval using
CPU numbers to space them out, then use randomness for rescheduling?
Luck, Tony Jan. 17, 2023, 6:54 p.m. UTC | #3
> Yes, but we could change the patch to include mce_available() under the
> protection of the lock. It should be cleared once machine_check_poll() clears
> the bank state, no?

Which machines are showing this problem? Most modern systems support
CMCI, so I'm thinking that this case shows up because the sysadmin booted
with "mce=no_cmci". In that case I don't think the mce_available() check would
change anything.

-Tony
Aristeu Rozanski Jan. 17, 2023, 6:58 p.m. UTC | #4
On Tue, Jan 17, 2023 at 06:54:02PM +0000, Luck, Tony wrote:
> > Yes, but we could change the patch to include mce_available() under the
> > protection of the lock. It should be cleared once machine_check_poll() clears
> > the bank state, no?
> 
> Which machines are showing this problem? Most modern systems support
> CMCI, so I'm thinking that this case shows up because the sysadmin booted
> with "mce=no_cmci". In that case I don't think the mce_available() check would
> change anything.

That is correct, ignore what I said.
Aristeu Rozanski April 28, 2023, 3:51 p.m. UTC | #5
Hi Tony,

On Tue, Jan 17, 2023 at 05:42:36PM +0000, Luck, Tony wrote:
> > +static DEFINE_RAW_SPINLOCK(timer_fn_lock);
> >
> >  static unsigned long mce_adjust_timer_default(unsigned long interval)
> >  {
> > @@ -1628,7 +1629,9 @@ static void mce_timer_fn(struct timer_list *t)
> >       iv = __this_cpu_read(mce_next_interval);
> >
> >       if (mce_available(this_cpu_ptr(&cpu_info))) {
> > +             raw_spin_lock(&timer_fn_lock);
> >               machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
> > +             raw_spin_unlock(&timer_fn_lock);
> >
> >               if (mce_intel_cmci_poll()) {
> >                       iv = mce_adjust_timer(iv);
> 
> If the CMCI polling interrupts on that large number of CPUs are
> staggered at different times, then this should be fine. But if a largish
> number of CPUs get in lockstep with each other, then this could get
> ugly. I.e. 200 CPUs all take the "CMCI poll" interrupt together, then
> get stuck in a convoy here as they step through checking the machine
> check banks one at a time.
> 
> I'm not sure if that is just a theoretical issue, or how bad it might
> actually be if it happened.
> 
> One option to avoid this might be to change from a fixed five minute
> polling interval to "five minute plus/minus rand(50) jiffies". Then even
> if some CPUs did sync up, they'd quickly diverge again.

I did experiment with different ranges, including forcing a minimum difference
based on CPU number, but eventually I'd still see repeated events. Perhaps a
combination would be acceptable? Space the first run of each CPU based on its
CPU number, then use a spinlock to synchronize. That way we minimize the
chance of a big number of CPUs getting stuck on the same lock, but at the same
time guarantee we won't have duplicated events.
Luck, Tony April 28, 2023, 4:43 p.m. UTC | #6
> > One option to avoid this might be to change from a fixed five minute
> > polling interval to "five minute plus/minus rand(50) jiffies". Then even
> > if some CPUs did sync up, they'd quickly diverge again.
>
> I did experiment with different ranges, including forcing a minimum difference
> based on CPU number, but eventually I'd still see repeated events. Perhaps a
> combination would be acceptable? Space the first run of each CPU based on its
> CPU number, then use a spinlock to synchronize. That way we minimize the
> chance of a big number of CPUs getting stuck on the same lock, but at the same
> time guarantee we won't have duplicated events.

Aristeu,

Yes. A variable poll interval (to avoid CPUs all bunching together) with
the spinlock to serialize the cases where some CPUs do line up their
polling would work.

-Tony
Luck, Tony May 11, 2023, 11:38 p.m. UTC | #7
On Fri, Apr 28, 2023 at 04:43:46PM +0000, Luck, Tony wrote:
> > > One option to avoid this might be to change from a fixed five minute
> > > polling interval to "five minute plus/minus rand(50) jiffies". Then even
> > > if some CPUs did sync up, they'd quickly diverge again.
> >
> > I did experiment with different ranges, including forcing a minimum difference
> > based on CPU number, but eventually I'd still see repeated events. Perhaps a
> > combination would be acceptable? Space the first run of each CPU based on its
> > CPU number, then use a spinlock to synchronize. That way we minimize the
> > chance of a big number of CPUs getting stuck on the same lock, but at the same
> > time guarantee we won't have duplicated events.
> 
> Aristeu,
> 
> Yes. A variable poll interval (to avoid CPUs all bunching together) with
> the spinlock to serialize the cases where some CPUs do line up their
> polling would work.

I just ran a test on a system booted with mce=no_cmci and saw the
duplicate reporting.  Digging into the code I see that I'm not the
first to think that a little randomness in timers on different CPUs
would be a good idea. The MCE code starts/restarts timers with:

	mod_timer(t, round_jiffies(when));

Looking at the round_jiffies() function I saw that it is grabbing
the current CPU number.

unsigned long round_jiffies(unsigned long j)
{
        return round_jiffies_common(j, raw_smp_processor_id(), false);
}

And round_jiffies_common() explains with this comment:

        /*
         * We don't want all cpus firing their timers at once hitting the
         * same lock or cachelines, so we skew each extra cpu with an extra
         * 3 jiffies. This 3 jiffies came originally from the mm/ code which
         * already did this.
         * The skew is done by adding 3*cpunr, then round, then subtract this
         * extra offset again.
         */
        j += cpu * 3;

	rem = j % HZ;

So your original patch https://lore.kernel.org/all/Y8WtE7BNJ0gTrqIS@cathedrallabs.org/
that just added the spinlock should be fine!

-Tony
Aristeu Rozanski May 12, 2023, 5:32 p.m. UTC | #8
On Thu, May 11, 2023 at 04:38:57PM -0700, Tony Luck wrote:
> I just ran a test on a system booted with mce=no_cmci and saw the
> duplicate reporting.  Digging into the code I see that I'm not the
> first to think that a little randomness in timers on different CPUs
> would be a good idea. The MCE code starts/restarts timers with:
> 
> 	mod_timer(t, round_jiffies(when));
> 
> Looking at the round_jiffies() function I saw that it is grabbing
> the current CPU number.
> 
> unsigned long round_jiffies(unsigned long j)
> {
>         return round_jiffies_common(j, raw_smp_processor_id(), false);
> }
> 
> And round_jiffies_common() explains with this comment:
> 
>         /*
>          * We don't want all cpus firing their timers at once hitting the
>          * same lock or cachelines, so we skew each extra cpu with an extra
>          * 3 jiffies. This 3 jiffies came originally from the mm/ code which
>          * already did this.
>          * The skew is done by adding 3*cpunr, then round, then subtract this
>          * extra offset again.
>          */
>         j += cpu * 3;
> 
> 	rem = j % HZ;
> 
> So your original patch https://lore.kernel.org/all/Y8WtE7BNJ0gTrqIS@cathedrallabs.org/
> that just added the spinlock should be fine!

Thanks Tony. Do you want me to resend it, or is the original enough?
Luck, Tony May 12, 2023, 7:19 p.m. UTC | #9
> Thanks Tony. Do you want me to resend it, or is the original enough?

The original was marked RFC. So send a new version without the RFC tag.

You can include:

Reviewed-by: Tony Luck <tony.luck@intel.com>

-Tony

Patch

--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1597,6 +1597,7 @@  static unsigned long check_interval = INITIAL_CHECK_INTERVAL;
 
 static DEFINE_PER_CPU(unsigned long, mce_next_interval); /* in jiffies */
 static DEFINE_PER_CPU(struct timer_list, mce_timer);
+static DEFINE_RAW_SPINLOCK(timer_fn_lock);
 
 static unsigned long mce_adjust_timer_default(unsigned long interval)
 {
@@ -1628,7 +1629,9 @@  static void mce_timer_fn(struct timer_list *t)
 	iv = __this_cpu_read(mce_next_interval);
 
 	if (mce_available(this_cpu_ptr(&cpu_info))) {
+		raw_spin_lock(&timer_fn_lock);
 		machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
+		raw_spin_unlock(&timer_fn_lock);
 
 		if (mce_intel_cmci_poll()) {
 			iv = mce_adjust_timer(iv);